Performance Optimization

Why HORUS is Fast

Shared Memory Architecture

Zero network overhead: Data written once to /dev/shm, read directly by subscribers

Zero serialization: Fixed-size structs copied directly to shared memory

Zero-copy loan pattern: Publishers write directly to shared memory slots

Cache-Optimized Structures

64-byte alignment: Matches CPU cache line size

#[repr(align(64))]  // Cache-line aligned
pub struct Hub<T> {
    // Prevents false sharing between cores
}

Padding prevention: False sharing eliminated with explicit padding

Atomic operations: Lock-free operations with appropriate memory ordering

Lock-Free Operations

Compare-and-swap: Atomic slot claiming without locks

Per-consumer tracking: Each subscriber maintains independent position

Wait-free progress: Publishers always make progress

Benchmark Results

Measured Latency

Message Type | Size  | HORUS  | Traditional DDS | Speedup
-------------|-------|--------|-----------------|--------
CmdVel       | 16B   | 296ns  | 50-100µs        | 338x
IMU          | 304B  | 718ns  | 80-150µs        | 139x
LaserScan    | 1.5KB | 1.31µs | 150-300µs       | 191x
PointCloud   | 120KB | 2.8µs  | 500µs-1ms       | 200x

Key insight: Latency grows with message size, but sub-linearly: a 120KB PointCloud is ~7,500x larger than a 16B CmdVel yet only ~10x slower.

Throughput

HORUS can handle:

  • 10M+ messages/second for small messages (16B)
  • 1M+ messages/second for medium messages (1KB)
  • 100K+ messages/second for large messages (100KB)

Build Optimization

Always Use Release Mode

Debug builds are 10-100x slower:

# SLOW: Debug build
horus run

# FAST: Release build
horus run --release

Why it matters:

  • Debug: 50µs per tick
  • Release: 500ns per tick
  • 100x difference for the same code

Link-Time Optimization (LTO)

Enable fat LTO and a single codegen unit for an additional 10-20% speedup:

# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1

Warning: Slower compilation, but faster execution.

Target CPU Features

CPU-Specific Optimizations:

By default, release builds target a generic x86-64 or ARM64 baseline so binaries stay portable. Compiling for the exact host CPU (for example with RUSTFLAGS="-C target-cpu=native") lets the compiler use newer SIMD instruction sets the baseline excludes.

Gains: typically 5-15% from CPU-specific SIMD instructions.

Message Optimization

Use Fixed-Size Types

// FAST: Fixed-size array
pub struct LaserScan {
    pub ranges: [f32; 360],  // Stack-allocated
}

// SLOW: Dynamic vector
pub struct BadLaserScan {
    pub ranges: Vec<f32>,  // Heap-allocated
}

Impact: Fixed-size avoids heap allocations in hot path.

Choose Appropriate Precision

// f32 (single precision) - sufficient for most robotics
pub struct FastPose {
    pub x: f32,  // 4 bytes
    pub y: f32,  // 4 bytes
}

// f64 (double precision) - scientific applications
pub struct PrecisePose {
    pub x: f64,  // 8 bytes
    pub y: f64,  // 8 bytes
}

Rule: Use f32 unless you need scientific precision.

Minimize Message Size

// GOOD: 8 bytes
struct CompactCmd {
    linear: f32,   // 4 bytes
    angular: f32,  // 4 bytes
}

// BAD: over 1KB
struct BloatedCmd {
    linear: f32,
    angular: f32,
    metadata: [u8; 256],    // Unused
    debug_info: [u8; 768],  // Unused
}

Every byte matters: Latency scales with message size.

Batch Small Messages

Instead of sending 100 separate f32 values:

// SLOW: 100 separate messages
for value in values {
    hub.send(value, ctx).ok();  // 100 IPC operations
}

// FAST: One batched message
pub struct BatchedData {
    values: [f32; 100],
}
let batched = BatchedData { values };
hub.send(batched, ctx).ok();  // 1 IPC operation

Speedup: 50-100x for batched operations.

Node Optimization

Keep tick() Fast

Target: <1ms per tick for real-time control.

// GOOD: Fast tick
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    let data = self.read_sensor();     // Quick read
    self.process_pub.send(data, ctx).ok();  // ~500ns
}

// BAD: Slow tick
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    let data = std::fs::read_to_string("config.yaml").unwrap();  // 1-10ms!
    // ...
}

File I/O, network calls, sleeps = slow. Do these in init() or separate threads.

Pre-Allocate in init()

fn init(&mut self, ctx: &mut NodeInfo) -> Result<(), String> {
    // Pre-allocate buffers
    self.buffer = vec![0.0; 10000];

    // Open connections
    self.device = Device::open()?;

    // Load configuration
    self.config = Config::from_file("config.yaml")?;

    Ok(())
}

fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    // Use pre-allocated resources - no allocations here!
    self.buffer[0] = self.device.read();
}

Allocations in tick() = slow. Move to init().

Avoid Unnecessary Cloning

// BAD: Unnecessary clone
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if let Some(data) = self.sub.recv(ctx) {
        let copy = data.clone();  // Unnecessary!
        self.process(copy);
    }
}

// GOOD: Direct use
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if let Some(data) = self.sub.recv(ctx) {
        self.process(data);  // Already cloned by recv()
    }
}

Hub::recv() already clones data. Don't clone again.

Minimize Logging

// BAD: Logging every tick
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if let Some(ctx) = ctx {
        ctx.log_debug(&format!("Tick #{}", self.counter));  // Slow!
    }
    self.counter += 1;
}

// GOOD: Conditional logging
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if self.counter % 1000 == 0 {  // Log every 1000 ticks
        if let Some(ctx) = ctx {
            ctx.log_info(&format!("Reached tick #{}", self.counter));
        }
    }
    self.counter += 1;
}

Logging is expensive. Log sparingly in hot paths.

Scheduler Optimization

Understanding Tick Rate

The scheduler runs at a fixed rate of approximately 60 FPS (16ms per tick):

let scheduler = Scheduler::new();
// Runs at ~60 FPS (16ms per tick)

Note: The tick rate is currently hardcoded. If you need different timing for your application, ensure your nodes complete execution well within the 16ms window. Monitor node metrics to verify performance.

Key Point: Keep individual node tick() methods fast (ideally <1ms) to maintain the target frame rate.

Use Priority Levels

// Critical tasks run first
scheduler.register_with_priority(safety, NodePriority::Critical);

// Logging runs last
scheduler.register_with_priority(logger, NodePriority::Background);

Predictable execution order = better performance.

Minimize Node Count

// BAD: 50 small nodes
for i in 0..50 {
    scheduler.register(TinyNode::new(i));
}

// GOOD: One aggregated node
scheduler.register(AggregatedNode::new());

Fewer nodes = less scheduling overhead.

Shared Memory Optimization

Check Available Space

df -h /dev/shm

Insufficient space = message drops.

Increase /dev/shm Size

# Increase to 4GB
sudo mount -o remount,size=4G /dev/shm

More space = larger buffer capacity.

Clean Up Stale Topics

# Remove old shared memory
rm -rf /dev/shm/horus/

Stale topics waste space and cause confusion.

Choose Appropriate Capacity

// Small messages, high frequency
ShmTopic::new("cmd_vel", 100)?;  // 100 slots

// Large messages, low frequency
ShmTopic::new("point_cloud", 10)?;  // 10 slots

Balance: Memory usage vs message buffering.

Profiling and Measurement

Built-In Metrics

Every node tracks performance automatically:

fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if let Some(ctx) = ctx {
        if ctx.metrics.avg_tick_duration_ms > 1.0 {
            ctx.log_warning("Tick taking too long");
        }
    }
}

Available metrics:

  • total_ticks: Total number of ticks
  • avg_tick_duration_ms: Average tick time in milliseconds
  • max_tick_duration_ms: Worst-case tick time in milliseconds
  • messages_sent: Messages published
  • cpu_usage_percent: CPU utilization (f64)

IPC Latency Logging

HORUS automatically logs IPC timing:

[12:34:56.789] [IPC: 296ns | Tick: 12µs] PublisherNode --PUB--> 'cmd_vel' = 1.5

  • IPC: Time to write to shared memory
  • Tick: Total node execution time

Manual Profiling

use std::time::Instant;

fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    let start = Instant::now();

    self.expensive_operation();

    let duration = start.elapsed();
    println!("Operation took: {:?}", duration);
}

CPU Profiling

Use perf on Linux:

# Profile your application
perf record --call-graph dwarf horus run --release

# View results
perf report

Hotspots show where CPU time is spent.

Common Performance Pitfalls

Pitfall: Using Debug Builds

# SLOW: 50µs/tick
horus run

# FAST: 500ns/tick
horus run --release

Fix: Always use --release for benchmarks and production.

Pitfall: Allocations in tick()

// BAD
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    let buffer = vec![0.0; 1000];  // Heap allocation every tick!
}

// GOOD
struct Node {
    buffer: Vec<f32>,  // Pre-allocated
}

fn init(&mut self, ctx: &mut NodeInfo) -> Result<(), String> {
    self.buffer = vec![0.0; 1000];  // Allocate once
    Ok(())
}

Fix: Pre-allocate in init().

Pitfall: Excessive Logging

// BAD: 60 logs per second
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    if let Some(ctx) = ctx {
        ctx.log_debug("Tick");  // Every 16ms!
    }
}

// GOOD: 1 log per second
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    self.tick_count += 1;
    if self.tick_count % 60 == 0 {
        if let Some(ctx) = ctx {
            ctx.log_info("60 ticks completed");
        }
    }
}

Fix: Log sparingly.

Pitfall: Large Message Types

// BAD: 1MB per message
pub struct HugeMessage {
    image: [u8; 1_000_000],
}

// GOOD: Compressed or separate channel
pub struct CompressedImage {
    data: Vec<u8>,  // JPEG compressed, ~50KB
}

Fix: Compress or split large data.

Pitfall: Synchronous I/O in tick()

// BAD: Blocking I/O
fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
    let data = std::fs::read("data.txt").unwrap();  // Blocks!
}

// GOOD: Async or pre-loaded
fn init(&mut self, ctx: &mut NodeInfo) -> Result<(), String> {
    self.data = std::fs::read("data.txt")?;  // Load once
    Ok(())
}

Fix: Move I/O to init() or use async.

Performance Checklist

Before deployment, verify:

  • Build in release mode (--release)
  • Profile with perf or similar
  • tick() completes in <1ms
  • No allocations in tick()
  • Messages use fixed-size types
  • Logging is rate-limited
  • /dev/shm has sufficient space
  • IPC latency is <10µs
  • Priority levels set correctly

Measuring Your Performance

Latency Measurement

use std::time::Instant;

struct BenchmarkNode {
    pub_hub: Hub<f32>,
    sub_hub: Hub<f32>,
    start_time: Option<Instant>,
}

impl Node for BenchmarkNode {
    fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
        // Publish
        self.start_time = Some(Instant::now());
        self.pub_hub.send(42.0, ctx).ok();

        // Receive
        if let Some(_data) = self.sub_hub.recv(ctx) {
            if let Some(start) = self.start_time {
                let latency = start.elapsed();
                println!("Round-trip latency: {:?}", latency);
            }
        }
    }
}

Throughput Measurement

struct ThroughputTest {
    pub_hub: Hub<f32>,
    message_count: u64,
    start_time: Instant,
}

impl Node for ThroughputTest {
    fn tick(&mut self, ctx: Option<&mut NodeInfo>) {
        for _ in 0..1000 {
            self.pub_hub.send(42.0, ctx).ok();
            self.message_count += 1;
        }

        if self.message_count % 100_000 == 0 {
            let elapsed = self.start_time.elapsed().as_secs_f64();
            let throughput = self.message_count as f64 / elapsed;
            println!("Throughput: {:.0} msg/s", throughput);
        }
    }
}

Next Steps