Performance Optimization

HORUS is already fast by default. This guide helps you squeeze out extra performance when needed.

Cross-Platform Philosophy

HORUS is designed for development on any OS with production deployment on Linux:

| Phase | Supported Platforms | Performance |
|-------------|-----------------------|---------------------------|
| Development | Windows, macOS, Linux | Good (standard IPC) |
| Testing | Windows, macOS, Linux | Good (standard IPC) |
| Production | Linux (recommended) | Best (sub-100ns with RT) |

All performance features use graceful degradation — your code runs everywhere, with maximum performance on Linux. Advanced features like .prefer_rt() (which enables SCHED_FIFO and mlockall on Linux) and SIMD acceleration automatically fall back to safe defaults on unsupported platforms.

Why HORUS is Fast

Shared Memory Architecture

Zero network overhead: Data written to shared memory, read directly by subscribers

HORUS automatically selects the optimal shared memory backend for your platform (Linux, macOS, Windows). No configuration needed.

Zero serialization: Fixed-size structs copied directly to shared memory

Zero-copy loan pattern: Publishers write directly to shared memory slots
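The loan idea can be sketched in plain Rust. The `SlotPool` and `loan` names below are illustrative, not the HORUS API: the point is that the publisher writes into a pre-allocated slot in place, rather than building a message and copying it in afterwards.

```rust
// Minimal loan-pattern sketch (plain Rust, no HORUS APIs).
struct SlotPool {
    slots: Vec<[f32; 4]>, // pre-allocated, fixed-size slots
    next: usize,
}

impl SlotPool {
    fn new(capacity: usize) -> Self {
        Self { slots: vec![[0.0; 4]; capacity], next: 0 }
    }

    /// Loan out a mutable slot: the caller writes directly into it.
    fn loan(&mut self) -> &mut [f32; 4] {
        let i = self.next;
        self.next = (self.next + 1) % self.slots.len();
        &mut self.slots[i]
    }
}

fn main() {
    let mut pool = SlotPool::new(8);
    // Publisher fills the loaned slot in place: one write, zero extra copies.
    let slot = pool.loan();
    slot[0] = 1.0;
    slot[1] = 2.0;
    assert_eq!(pool.slots[0][0], 1.0);
}
```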

Optimized Data Structures

HORUS uses carefully optimized memory layouts to minimize latency. The communication paths are designed for maximum throughput with predictable timing — same-thread paths achieve ~23ns, cross-thread single-producer paths achieve ~155ns, with only 12ns overhead over raw memcpy.

Benchmark Results

For detailed benchmark methodology, raw data, and comprehensive latency/throughput tables, see the dedicated Benchmarks page.

Headlines: 23ns same-thread latency, 50x faster than raw UDP, 230-575x faster than ROS2 DDS, 40M+ msg/sec throughput for small messages, O(1) topic scaling to 1000+ topics.

Throughput

HORUS can handle:

  • 40M+ messages/second for small messages (8B) with same-thread DirectChannel
  • 10M+ messages/second for small messages (8B) with cross-thread SPSC
  • 1M+ messages/second for medium messages (1KB)
  • 100K+ messages/second for large messages (100KB)

Build Optimization

Always Use Release Mode

Debug builds are 10-100x slower:

# SLOW: Debug build
horus run

# FAST: Release build
horus run --release

Why it matters:

  • Debug: 50µs per tick
  • Release: 500ns per tick
  • 100x difference for the same code

Enable LTO in your Cargo.toml for an additional 10-20% speedup:

# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1

Warning: Slower compilation, but faster execution.

Target CPU Features

CPU-Specific Optimizations:

HORUS compiles with standard Rust release-mode optimizations, and its hot paths are tuned for modern x86-64 and ARM64 processors. No manual CPU-specific configuration is required.

Gains: 5-15% from CPU-specific SIMD instructions (automatically enabled in release builds).

Hardware Acceleration

HORUS automatically uses hardware-accelerated memory operations when available (e.g., SIMD on x86_64). No configuration needed — your code runs on any platform, with extra performance on supported hardware.

For maximum performance, compile targeting your specific CPU:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Message Optimization

Use Fixed-Size Types

// FAST: Fixed-size array
pub struct LaserScan {
    pub ranges: [f32; 360],  // Stack-allocated
}

// SLOW: Dynamic vector
pub struct BadLaserScan {
    pub ranges: Vec<f32>,  // Heap-allocated
}

Impact: Fixed-size avoids heap allocations in hot path.

Choose Typed Messages Over Generic

// FAST: Small, fixed-size struct
let topic: Topic<Pose2D> = Topic::new("pose")?;
topic.send(Pose2D { x: 1.0, y: 2.0, theta: 0.5 });
// IPC latency: ~23-155ns depending on topology

// SLOWER: Larger struct with more data
let topic: Topic<SensorBundle> = Topic::new("sensors")?;
// Latency scales linearly with message size

Rule: Use the smallest struct that represents your data. Avoid padding and unused fields.
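Padding is easy to audit: `std::mem::size_of` reports the real in-memory size, and ordering large fields first can shrink it. A minimal sketch (struct names are ours, not HORUS types):

```rust
use std::mem::size_of;

// The compiler inserts padding so each field meets its alignment.
#[repr(C)] // fixed field order, like a wire-format struct
struct Padded {
    flag: u8,   // 1 byte + 7 bytes padding before `value`
    value: f64, // 8 bytes, 8-byte aligned
    id: u8,     // 1 byte + 7 bytes tail padding
}

#[repr(C)]
struct Packed {
    value: f64, // largest field first
    flag: u8,
    id: u8,     // only 6 bytes of tail padding now
}

fn main() {
    // Same three fields, 8 bytes saved by reordering.
    assert_eq!(size_of::<Padded>(), 24);
    assert_eq!(size_of::<Packed>(), 16);
}
```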

Choose Appropriate Precision

// f32 (single precision) - sufficient for most robotics
pub struct FastPose {
    pub x: f32,  // 4 bytes
    pub y: f32,  // 4 bytes
}

// f64 (double precision) - scientific applications
pub struct PrecisePose {
    pub x: f64,  // 8 bytes
    pub y: f64,  // 8 bytes
}

Rule: Use f32 unless you need scientific precision.
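The halving is easy to verify with `std::mem::size_of`, and since latency scales with message size it translates directly to IPC cost:

```rust
use std::mem::size_of;

// f32 halves the payload of a pure-coordinate struct relative to f64.
struct FastPose { x: f32, y: f32 }     // 8 bytes total
struct PrecisePose { x: f64, y: f64 }  // 16 bytes total

fn main() {
    assert_eq!(size_of::<FastPose>(), 8);
    assert_eq!(size_of::<PrecisePose>(), 16);
}
```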

Minimize Message Size

// GOOD: 8 bytes
struct CompactCmd {
    linear: f32,   // 4 bytes
    angular: f32,  // 4 bytes
}

// BAD: over 1KB
struct BloatedCmd {
    linear: f32,
    angular: f32,
    metadata: [u8; 256],    // Unused
    debug_info: [u8; 768],  // Unused
}

Every byte matters: Latency scales with message size.

Batch Small Messages

Instead of sending 100 separate f32 values:

// SLOW: 100 separate messages
for value in values {
    topic.send(value);  // 100 IPC operations
}

// FAST: One batched message
pub struct BatchedData {
    values: [f32; 100],
}
topic.send(batched);  // 1 IPC operation

Speedup: 50-100x for batched operations.

Node Optimization

Keep tick() Fast

Target: <1ms per tick for real-time control.

// GOOD: Fast tick
fn tick(&mut self) {
    let data = self.read_sensor();     // Quick read
    self.process_pub.send(data);  // ~500ns
}

// BAD: Slow tick
fn tick(&mut self) {
    let data = std::fs::read_to_string("config.yaml").unwrap();  // 1-10ms!
    // ...
}

File I/O, network calls, sleeps = slow. Do these in init() or separate threads.
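The separate-thread option can be sketched with plain `std::thread` and a channel; `load_config` here is a stand-in for any blocking call (file read, network request, device probe):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a blocking call such as std::fs::read_to_string(...).
fn load_config() -> String {
    "max_speed = 2.0".to_string()
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Slow, blocking work happens on a background thread -- never in tick().
    thread::spawn(move || {
        tx.send(load_config()).ok();
    });

    // Inside tick(): only a non-blocking try_recv(), which never stalls.
    let mut latest: Option<String> = None;
    while latest.is_none() {
        if let Ok(cfg) = rx.try_recv() {
            latest = Some(cfg);
        }
        // ... fast control work continues every iteration ...
    }
    assert_eq!(latest.as_deref(), Some("max_speed = 2.0"));
}
```

A real node would hold the `Receiver` in its struct and poll it each tick.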

Pre-Allocate in init()

fn init(&mut self) -> Result<()> {
    // Pre-allocate buffers
    self.buffer = vec![0.0; 10000];

    // Open connections
    self.device = Device::open()?;

    // Load configuration
    self.config = Config::from_file("config.yaml")?;

    Ok(())
}

fn tick(&mut self) {
    // Use pre-allocated resources - no allocations here!
    self.buffer[0] = self.device.read();
}

Allocations in tick() = slow. Move to init().

Avoid Unnecessary Cloning

// BAD: Unnecessary clone
fn tick(&mut self) {
    if let Some(data) = self.sub.recv() {
        let copy = data.clone();  // Unnecessary!
        self.process(copy);
    }
}

// GOOD: Direct use
fn tick(&mut self) {
    if let Some(data) = self.sub.recv() {
        self.process(data);  // Already cloned by recv()
    }
}

Topic::recv() already clones data. Don't clone again.

Minimize Logging

// BAD: Logging every tick
fn tick(&mut self) {
    hlog!(debug, "Tick #{}", self.counter);  // Slow!
    self.counter += 1;
}

// GOOD: Conditional logging
fn tick(&mut self) {
    if self.counter % 1000 == 0 {  // Log every 1000 ticks
        hlog!(info, "Reached tick #{}", self.counter);
    }
    self.counter += 1;
}

Logging is expensive. Log sparingly in hot paths.

Scheduler Optimization

Understanding Tick Rate

The default scheduler runs at 100 Hz (10ms per tick). Use .tick_rate() to change it:

// Default: 100 Hz
let scheduler = Scheduler::new();

// 10kHz for high-performance control loops
let scheduler = Scheduler::new().tick_rate(10000_u64.hz());

Key Point: Keep individual node tick() methods fast (ideally <1ms) to maintain the target tick rate.

Use Priority Levels

// Critical tasks run first (order 0 = highest)
scheduler.add(safety).order(0).build()?;

// Logging runs last (order 100 = lowest)
scheduler.add(logger).order(100).build()?;

Predictable execution order = better performance. Use lower numbers for higher priority tasks.

Minimize Node Count

// BAD: 50 small nodes
for i in 0..50 {
    scheduler.add(TinyNode::new(i)).order(50).build()?;
}

// GOOD: One aggregated node
scheduler.add(AggregatedNode::new()).order(50).build()?;

Fewer nodes = less scheduling overhead.

Ultra-Low-Latency Networking (Linux)

HORUS provides optional kernel bypass networking for sub-microsecond latency requirements.

Transport Options

| Transport | Latency (send+recv) | Throughput | Requirements |
|-----------------------------------|---------|-------------|----------------|
| Shared Memory (same thread) | ~23ns | 40M+ msg/s | Local only |
| Shared Memory (cross thread, 1:1) | ~155ns | 10M+ msg/s | Local only |
| io_uring | 2-3µs | 500K+ msg/s | Linux 5.1+ |
| Batch UDP | 3-5µs | 300K+ msg/s | Linux 3.0+ |
| Standard UDP | 5-10µs | 200K+ msg/s | Cross-platform |

Enable io_uring Transport

io_uring eliminates syscalls on the send path using kernel-side polling:

# Build with io_uring support (Cargo feature flag)
cargo build --release --features io-uring-net

Requirements:

  • Linux 5.1+ (5.6+ recommended for SQ polling)
  • CAP_SYS_NICE capability for SQ_POLL mode
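If you want SQ_POLL mode without running as root, the capability can be granted with the standard libcap tool; the binary path below is illustrative:

```shell
# Grant CAP_SYS_NICE to the release binary so SQ_POLL mode works without root
sudo setcap cap_sys_nice+ep target/release/my_robot_app

# Verify the capability was applied
getcap target/release/my_robot_app
```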

Enable Batch UDP (Linux)

Batch UDP uses sendmmsg/recvmmsg syscalls for efficient batched network I/O:

# Batch UDP is automatically enabled on Linux - no extra dependencies needed
cargo build --release

Requirements:

  • Linux 3.0+ (available on virtually all modern Linux systems)

Enable All Ultra-Low-Latency Features

# Build with all ultra-low-latency features (io_uring)
cargo build --release --features ultra-low-latency

Smart Transport Selection

For network topics, HORUS automatically selects the best transport based on available system features and kernel version. Configure network endpoints through topic configuration rather than the Topic::new() API (which creates local shared memory topics). See Network Backends for details.

Shared Memory Optimization

HORUS uses platform-native shared memory managed by horus_sys — you never need to manage paths manually.

Check Available Space (Linux)

horus doctor    # Includes shared memory space check

If shared memory space is insufficient, messages may be dropped. On Linux, the default tmpfs size is usually 50% of RAM. Consult your OS documentation to increase it if needed.
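On Linux, two standard commands cover the check and a temporary increase (the 2G size is just an example; the remount reverts on reboot):

```shell
# Check how much of /dev/shm (the tmpfs backing shared memory) is in use
df -h /dev/shm

# Temporarily grow it to 2 GiB; make it permanent via /etc/fstab if needed
sudo mount -o remount,size=2G /dev/shm
```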

Clean Up Stale Topics

HORUS automatically cleans up sessions after each run. Manual cleanup is rarely needed:

horus clean --shm

Topic Memory Usage

Topics use shared memory slots proportional to message size. Keep messages small to reduce memory footprint:

// Small messages use less shared memory
let cmd: Topic<CmdVel> = Topic::new("cmd_vel")?;       // 16B per slot

// Large messages use more shared memory
let cloud: Topic<PointCloud> = Topic::new("cloud")?;    // 120KB per slot

Balance: Message size directly affects shared memory consumption.

Profiling and Measurement

Built-In Metrics

HORUS automatically tracks node performance metrics. Use horus monitor to view real-time performance data including tick duration, messages sent, and CPU usage.

Available metrics (on NodeMetrics):

  • total_ticks: Total number of ticks
  • avg_tick_duration_ms: Average tick time in milliseconds
  • max_tick_duration_ms: Worst-case tick time in milliseconds
  • messages_sent: Messages published
  • messages_received: Messages received
  • errors_count: Total error count
  • uptime_seconds: Node uptime in seconds
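If you need comparable numbers inside your own code (for example in tests), a small hand-rolled tracker mirrors the avg/max fields. This is our own sketch, not the NodeMetrics API:

```rust
use std::time::{Duration, Instant};

// Running aggregates over tick durations, analogous to total_ticks,
// avg_tick_duration_ms, and max_tick_duration_ms.
#[derive(Default)]
struct TickStats {
    total_ticks: u64,
    total: Duration,
    max: Duration,
}

impl TickStats {
    fn record(&mut self, d: Duration) {
        self.total_ticks += 1;
        self.total += d;
        if d > self.max { self.max = d; }
    }

    fn avg_ms(&self) -> f64 {
        if self.total_ticks == 0 { return 0.0; }
        self.total.as_secs_f64() * 1000.0 / self.total_ticks as f64
    }
}

fn main() {
    let mut stats = TickStats::default();
    for _ in 0..3 {
        let start = Instant::now();
        // ... tick work ...
        stats.record(start.elapsed());
    }
    assert_eq!(stats.total_ticks, 3);
}
```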

IPC Latency Logging

HORUS automatically tracks IPC timing for each topic operation. The horus monitor web interface displays per-log-entry metrics:

Tick: 12μs | IPC: 296ns

Each log entry includes tick_us (node tick time in microseconds) and ipc_ns (IPC write time in nanoseconds).

Manual Profiling

use std::time::Instant;

fn tick(&mut self) {
    let start = Instant::now();

    self.expensive_operation();

    let duration = start.elapsed();
    println!("Operation took: {:?}", duration);
}

CPU Profiling

Use perf on Linux:

# Profile your application
perf record --call-graph dwarf horus run --release

# View results
perf report

Hotspots show where CPU time is spent.

Common Performance Pitfalls

Pitfall: Using Debug Builds

# SLOW: 50µs/tick
horus run

# FAST: 500ns/tick
horus run --release

Fix: Always use --release for benchmarks and production.

Pitfall: Allocations in tick()

// BAD
fn tick(&mut self) {
    let buffer = vec![0.0; 1000];  // Heap allocation every tick!
}

// GOOD
struct Node {
    buffer: Vec<f32>,  // Pre-allocated
}

fn init(&mut self) -> Result<()> {
    self.buffer = vec![0.0; 1000];  // Allocate once
    Ok(())
}

Fix: Pre-allocate in init().

Pitfall: Excessive Logging

// BAD: 60 logs per second
fn tick(&mut self) {
    hlog!(debug, "Tick");  // Every 16ms!
}

// GOOD: 1 log per second
fn tick(&mut self) {
    self.tick_count += 1;
    if self.tick_count % 60 == 0 {
        hlog!(info, "60 ticks completed");
    }
}

Fix: Log sparingly.

Pitfall: Large Message Types

// BAD: 1MB per message
pub struct HugeMessage {
    image: [u8; 1_000_000],
}

// GOOD: Compressed or separate channel
pub struct CompressedImage {
    data: Vec<u8>,  // JPEG compressed, ~50KB
}

Fix: Compress or split large data.

Pitfall: Synchronous I/O in tick()

// BAD: Blocking I/O
fn tick(&mut self) {
    let data = std::fs::read("data.txt").unwrap();  // Blocks!
}

// GOOD: Async or pre-loaded
fn init(&mut self) -> Result<()> {
    self.data = std::fs::read("data.txt")?;  // Load once
    Ok(())
}

Fix: Move I/O to init() or use async.

Performance Checklist

Before deployment, verify:

  • Build in release mode (--release)
  • Profile with perf or similar
  • tick() completes in <1ms
  • No allocations in tick()
  • Messages use fixed-size types
  • Logging is rate-limited
  • Shared memory has sufficient space
  • IPC latency is <10µs
  • Priority levels set correctly

Measuring Your Performance

Latency Measurement

use std::time::Instant;

struct BenchmarkNode {
    pub_topic: Topic<f32>,
    sub_topic: Topic<f32>,
    start_time: Option<Instant>,
}

impl Node for BenchmarkNode {
    fn tick(&mut self) {
        // Publish
        self.start_time = Some(Instant::now());
        self.pub_topic.send(42.0);

        // Receive
        if let Some(_data) = self.sub.recv() {
            if let Some(start) = self.start_time {
                let latency = start.elapsed();
                println!("Round-trip latency: {:?}", latency);
            }
        }
    }
}

Throughput Measurement

struct ThroughputTest {
    pub_topic: Topic<f32>,
    message_count: u64,
    start_time: Instant,
}

impl Node for ThroughputTest {
    fn tick(&mut self) {
        for _ in 0..1000 {
            self.pub_topic.send(42.0);
            self.message_count += 1;
        }

        if self.message_count % 100_000 == 0 {
            let elapsed = self.start_time.elapsed().as_secs_f64();
            let throughput = self.message_count as f64 / elapsed;
            println!("Throughput: {:.0} msg/s", throughput);
        }
    }
}

Real-Time Configuration

For hard real-time applications requiring bounded latency, use the Scheduler builder API:

use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .tick_rate(1000_u64.hz())
    .require_rt()               // Enables mlockall, SCHED_FIFO (Linux only)
    .cores(&[2, 3]);            // Pin to isolated CPU cores

scheduler.add(MotorController::new())
    .order(0)
    .rate(1000_u64.hz())        // Auto-derives budget + deadline, auto-enables RT
    .priority(80)               // SCHED_FIFO priority (1-99)
    .core(2)                    // Pin this node's thread to core 2
    .build()?;

Linux only: RT features (SCHED_FIFO, mlockall, CPU pinning) require a Linux kernel. .require_rt() fails at startup if they are unavailable; use .prefer_rt() instead when you want graceful degradation to best-effort scheduling on other platforms.

For detailed configuration options, see the Scheduler Configuration guide.

Next Steps