HORUS Benchmarks

Performance validation with real-world robotics workloads.

Benchmark Methodology

Measurement Approach

  • Statistical sampling: Criterion.rs with 20+ samples per measurement
  • Confidence intervals: Min/mean/max with outlier detection
  • Controlled methodology: 1s warm-up, 5s measurement phases
  • Reproducible: Less than 1% variance across measurements
  • Comprehensive coverage: 5 workload types, 4 scalability points

Workload Testing

  • Real workloads: Control loops, sensor fusion, I/O operations
  • Fault injection: Failure policy recovery testing
  • Scale testing: Validated up to 200 concurrent nodes
  • Mixed patterns: Combined blocking/non-blocking operations
  • Long-running: 25+ second failure recovery tests

Executive Summary

HORUS delivers sub-microsecond to low-microsecond latency for production robotics applications:

| Message Type | Size | Latency (Topic N:N) | Throughput | Typical Rate | Headroom |
|---|---|---|---|---|---|
| CmdVel | 16 B | ~500 ns | 2.7M msg/s | 1000 Hz | 2,700x |
| BatteryState | 104 B | ~600 ns | 1.67M msg/s | 1 Hz | 1.67M x |
| IMU | 304 B | ~940 ns | 1.8M msg/s | 100 Hz | 18,000x |
| Odometry | 736 B | ~1.1 μs | 1.3M msg/s | 50 Hz | 26,000x |
| LaserScan | 1.5 KB | ~2.2 μs | 633K msg/s | 10 Hz | 63,300x |
| PointCloud (1K) | ~12 KB | ~12 μs | 83K msg/s | 30 Hz | 2,767x |
| PointCloud (10K) | ~120 KB | ~360 μs | 4.7K msg/s | 30 Hz | 157x |
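Headroom is simply sustained throughput divided by the typical publish rate; a quick check against three rows of the table above:

```python
# Headroom = sustained throughput / typical publish rate.
# (throughput in msg/s, rate in Hz), values from the summary table.
workloads = {
    "CmdVel":           (2_700_000, 1000),
    "LaserScan":        (633_000,   10),
    "PointCloud (10K)": (4_700,     30),
}

for name, (throughput, rate) in workloads.items():
    headroom = throughput / rate
    print(f"{name}: {headroom:,.0f}x headroom")
```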

Latency Comparison: HORUS vs ROS2

(Chart: send-only latency in μs, logarithmic scale; lower is better. Series: HORUS Link (SPSC, wait-free), HORUS Hub (MPMC, lock-free), ROS2 DDS (typical).)

Performance Highlights

Key Findings

  • Sub-microsecond latency for messages up to 1.5KB
  • Serde integration works flawlessly with complex nested structs
  • Linear scaling with message size (predictable performance)
  • Massive headroom for all typical robotics frequencies

Production Readiness

  • Real-time control: ~500 ns latency supports 1000Hz+ control loops with 2,700x headroom
  • Sensor fusion: Mixed workload maintains sub-microsecond performance (648 ns avg)
  • Perception pipelines: 10K point clouds @ 30Hz with 157x headroom
  • Multi-robot systems: Throughput supports 100+ robots on a single node

Detailed Results

CmdVel (Motor Control Command)

Use Case: Real-time motor control @ 1000Hz
Structure: { timestamp: u64, linear: f32, angular: f32 }

Average Latency: ~500 ns (Topic N:N)
Throughput:      2.7M msg/s
Topic 1:1:       ~85 ns median

Analysis: Sub-microsecond performance suitable for 1000Hz control loops with 2,700x headroom.
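The 16-byte figure follows directly from the field layout; a quick check with Python's struct module (assuming a packed little-endian layout):

```python
import struct

# CmdVel: { timestamp: u64, linear: f32, angular: f32 }
# '<Qff' = little-endian u64 + two f32 -> 8 + 4 + 4 = 16 bytes.
payload = struct.pack("<Qff", 1_700_000_000_000, 0.5, 0.1)
print(len(payload))  # 16
```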


LaserScan (2D Lidar Data)

Use Case: 2D lidar sensor data @ 10Hz
Structure: { ranges: [f32; 360], angle_min/max, metadata }

Average Latency: ~2.2 μs (Topic N:N)
Throughput:      633K msg/s
Topic 1:1:       ~900 ns estimated

Analysis: Consistent low-microsecond latency for 1.5KB messages. Can easily handle 10Hz lidar updates with 63,300x headroom.


IMU (Inertial Measurement Unit)

Use Case: Orientation and acceleration @ 100Hz
Structure: { orientation: [f64; 4], angular_velocity: [f64; 3], linear_acceleration: [f64; 3], covariances: [f64; 27] }

Average Latency: ~940 ns (Topic N:N)
Throughput:      1.8M msg/s
Topic 1:1:       ~400 ns estimated

Analysis: Sub-microsecond performance with complex nested arrays and 27-element covariance matrices.


Odometry (Pose + Velocity)

Use Case: Robot localization @ 50Hz
Structure: { pose: Pose2D, twist: Twist, pose_covariance: [f64; 36], twist_covariance: [f64; 36] }

Average Latency: ~1.1 μs (Topic N:N)
Throughput:      1.3M msg/s
Topic 1:1:       ~600 ns estimated

Analysis: Low-microsecond latency for 736-byte messages with extensive covariance data.


PointCloud (3D Perception)

Small (100 points @ 30Hz)

Average Latency: 1.85 μs
Throughput:      539,529 msg/s
Data Size:       ~1.2 KB

Medium (1,000 points @ 30Hz)

Average Latency: 7.55 μs
Throughput:      132,432 msg/s
Data Size:       ~12 KB

Large (10,000 points @ 30Hz)

Average Latency: ~360 μs (Topic N:N)
Throughput:      4.7K msg/s
Data Size:       ~120 KB

Analysis: Linear scaling with point count. Even 10K point clouds process in ~360 μs (sufficient for 30Hz perception with 157x headroom).


Mixed Workload (Realistic Robot Loop)

Simulation: Real robot control loop @ 100Hz
Components: CmdVel @ 100Hz + IMU @ 100Hz + BatteryState @ 1Hz

Total Operations: 20,100 messages
Average Latency:  ~1.0 μs (Topic N:N)
Throughput:       ~1.5M msg/s
Range:            ~500-1200 ns

Analysis: Low-microsecond average latency for mixed message types simulating realistic robotics workload.


Comparison with Traditional Frameworks

Latency Comparison

Measurement Note: Topic 1:1 values below are send-only (one-direction). For round-trip (send+receive), approximately double these values (e.g., 87ns send-only → ~175ns round-trip).

| Framework | Small Msg (send-only) | Medium Msg (send-only) | Large Msg (send-only) |
|---|---|---|---|
| HORUS Topic (1:1) | 87 ns | ~160 ns | ~400 ns |
| HORUS Topic (N:N) | 313 ns | ~500 ns | ~1.1 μs |
| ROS2 (DDS) | 50-100 μs | 100-500 μs | 1-10 ms |
| ROS2 (FastDDS) | 20-50 μs | 50-200 μs | 500 μs - 5 ms |

Performance Advantage: HORUS is 230-575x faster than ROS2 for typical message sizes.
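The 230-575x range can be reproduced from the table: divide the ROS2 small-message figures by the HORUS 1:1 latency (taking 20 μs and 50 μs as the band endpoints):

```python
horus_ns = 87                # HORUS Topic 1:1, small message (send-only)
ros2_ns = (20_000, 50_000)   # FastDDS low end .. typical DDS, small message

speedups = [r / horus_ns for r in ros2_ns]
print([round(s) for s in speedups])  # [230, 575]
```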

HORUS Speedup vs ROS2

(Chart: how many times faster HORUS Link is than ROS2 DDS, bucketed as >500x, 100-500x, and <100x.)

Latency by Message Size

Measurement Note: All latencies below are send-only (one-direction publish). "1:1" = single producer/consumer, "N:N" = multiple producers and consumers.

| Message Size | Message Type | N:N (send-only) | 1:1 (send-only) | vs ROS2 |
|---|---|---|---|---|
| 16 B | CmdVel | ~313 ns | 87 ns | 230-575x faster |
| 104 B | BatteryState | ~600 ns | ~350 ns | 83-286x faster |
| 304 B | IMU | ~940 ns | ~400 ns | 53-250x faster |
| 736 B | Odometry | ~1.1 μs | ~600 ns | 45-167x faster |
| 1,480 B | LaserScan | ~2.2 μs | ~900 ns | 23-111x faster |

Observation: Near-linear scaling with message size demonstrates efficient serialization and IPC.
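A quick least-squares fit over the N:N column above makes the near-linear claim concrete (plain Python, no dependencies):

```python
# (size_bytes, latency_ns) pairs from the Topic N:N column above
points = [(16, 313), (104, 600), (304, 940), (736, 1100), (1480, 2200)]

n = len(points)
sx = sum(p[0] for p in points)
sy = sum(p[1] for p in points)
sxx = sum(p[0] ** 2 for p in points)
sxy = sum(p[0] * p[1] for p in points)

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # ns per byte
intercept = (sy - slope * sx) / n                  # fixed per-message cost
print(f"~{slope:.2f} ns/byte + ~{intercept:.0f} ns fixed")
```

The fit lands near 1.2 ns per byte on top of a roughly 400 ns fixed cost, which is what "near-linear scaling" means in practice.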

Latency vs Message Size

(Chart: latency in nanoseconds vs message size; HORUS shows linear scaling.)


Python Performance

The HORUS Python bindings (PyO3) call directly into the Rust shared memory layer. The cost of crossing the Python/Rust boundary is constant (~1.5μs) regardless of message size — large data (images, point clouds) uses zero-copy shared memory and bypasses this overhead entirely.

FFI Overhead (Rust vs Python)

The key question: what does Python cost you over pure Rust?

| Operation | Rust | Python | Overhead | Factor |
|---|---|---|---|---|
| CmdVel send+recv (typed) | ~22ns | ~1,500ns | ~1,478ns | ~68x |
| Imu send+recv (typed) | ~30ns | ~1,700ns | ~1,670ns | ~57x |
| dict send+recv (generic) | ~22ns | ~5,400ns | ~5,378ns | ~245x |

The ~1.5μs Python overhead comes from: PyO3 boundary crossing (~500ns), GIL acquisition (~500ns), and Python object allocation (~500ns). This is constant — a 1KB typed message has the same overhead as an 8B one.

When to use Python vs Rust:

  • Python: AI inference (PyTorch, YOLO), prototyping, data science nodes — the ~1.5μs overhead is negligible compared to 10-200ms inference time
  • Rust: Motor controllers, safety monitors, sensor fusion at 1kHz — where 22ns matters

Python IPC Latency

Measured with sustained runs (10s+, 100K+ samples) via research_bench_python.py:

| Message Type | Path | p50 | p99 | p999 |
|---|---|---|---|---|
| CmdVel (typed) | Zero-copy | ~1.5μs | ~2.2μs | ~5.0μs |
| Pose2D (typed) | Zero-copy | ~1.6μs | ~2.4μs | ~5.5μs |
| Imu (typed) | Zero-copy | ~1.7μs | ~2.6μs | ~6.0μs |
| dict small | MessagePack | ~5.4μs | ~8μs | ~15μs |
| dict medium | MessagePack | ~9.1μs | ~14μs | ~25μs |
| dict ~1KB | MessagePack | ~52μs | ~65μs | ~90μs |

Typed messages are 3-30x faster than dicts because they bypass MessagePack serialization and use direct Pod memcpy through the Rust layer.
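The gap is structural: a Pod message is one fixed-size copy, while a dict must be walked and encoded on every send. A stand-in sketch (struct for the Pod path, json in place of MessagePack, since the actual horus_py codecs aren't shown here):

```python
import json
import struct

# Typed path: fixed layout, one pack call, constant size.
def send_typed(timestamp, linear, angular):
    return struct.pack("<Qff", timestamp, linear, angular)  # always 16 bytes

# Generic path: keys and values are encoded on every send.
def send_generic(msg: dict):
    return json.dumps(msg).encode()  # size varies with content

typed = send_typed(123, 0.5, 0.1)
generic = send_generic({"timestamp": 123, "linear": 0.5, "angular": 0.1})
print(len(typed), len(generic))  # the generic payload also carries its schema
```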

Image Zero-Copy

to_numpy() returns a view into shared memory — O(1) regardless of image size:

| Resolution | to_numpy() | np.copy() | Speedup |
|---|---|---|---|
| 320x240 (225KB) | ~3μs | ~3μs | 1x |
| 640x480 (900KB) | ~3μs | ~13μs | 4x |
| 1280x720 (2.7MB) | ~3μs | ~75μs | 25x |
| 1920x1080 (6MB) | ~3μs | ~178μs | 59x |

At 1080p, zero-copy is 59x faster than copying. For 4K frames, the speedup is even larger.
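The mechanism can be demonstrated with stdlib buffers: a memoryview stands in for to_numpy()'s shared-memory view, bytes() for np.copy():

```python
# A buffer standing in for a 1080p RGB frame in shared memory (~6 MB).
frame = bytearray(1920 * 1080 * 3)

view = memoryview(frame)  # O(1): no bytes move, like to_numpy()
copy = bytes(frame)       # O(n): the full 6 MB is duplicated, like np.copy()

frame[0] = 255
print(view[0], copy[0])   # prints: 255 0 — the view sees the write, the copy doesn't
```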

Running Python Benchmarks

# Quick validation (2s per test)
python3 horus_py/benchmarks/research_bench_python.py --duration 2

# Full research run (10s per test, CSV output)
python3 horus_py/benchmarks/research_bench_python.py --duration 10 --csv python_results.csv

# JSON summary
python3 horus_py/benchmarks/research_bench_python.py --json python_summary.json

Running Rust Benchmarks

Quick Run

cd horus
cargo run --release -p horus_benchmarks --bin robotics_messages_benchmark

Available Benchmarks

| Binary | Description |
|---|---|
| robotics_messages_benchmark | IPC latency with real robotics message types |
| all_paths_latency | All 10 backend paths with RDTSC cycle precision |
| cross_process_benchmark | Cross-process shared memory IPC |
| scalability_benchmark | Scaling with producer/consumer thread counts |
| determinism_benchmark | Execution determinism and jitter |
| dds_comparison_benchmark | Comparison with DDS middleware (requires --features dds) |

Extended Benchmarks

Sustained runs, size sweeps, histograms, and competitor comparison:

| Binary | Description |
|---|---|
| raw_baselines | Hardware floor (memcpy, atomic, mmap) — no HORUS overhead |
| research_latency | Sustained measurement + message size sweep (8B-4KB), CSV output |
| research_throughput | Per-second throughput timeseries over 60s+ |
| research_jitter | RT tick jitter histogram + IPC latency under CPU contention |
| research_scalability | Node scaling (1-100) + topic scaling (1-1000) |
| competitor_comparison | HORUS vs raw UDP (+ Zenoh with --features zenoh) |

# Run full benchmark suite (~30 minutes)
./benchmarks/research/run_all.sh

# Quick validation (~3 minutes)
./benchmarks/research/run_all.sh --quick

# Individual benchmark with CSV output
cargo run --release -p horus_benchmarks --bin research_latency -- --duration 60 --csv results.csv

Run any benchmark with:

cargo run --release -p horus_benchmarks --bin <name>

# JSON output for CI/regression tracking
cargo run --release -p horus_benchmarks --bin <name> -- --json results.json

Criterion micro-benchmarks:

cd horus
cargo bench -p horus_benchmarks

Expected Output


  HORUS Production Message Benchmark Suite
  Testing with real robotics message types


  CmdVel (Motor Control Command)
    Size: 16 bytes | Typical rate: 1000Hz
    Latency (avg): ~500 ns (Topic N:N) / ~85 ns (Topic 1:1)
    Throughput: 2.7M msg/s (Topic N:N)


  LaserScan (2D Lidar Data)
    Size: 1480 bytes | Typical rate: 10Hz
    Latency (avg): ~2.2 μs (Topic N:N) / ~900 ns (Topic 1:1)
    Throughput: 633K msg/s (Topic N:N)


Use Case Selection

Message Type Guidelines

CmdVel (~500 ns N:N / ~85 ns 1:1)

  • Motor control @ 1000Hz
  • Real-time actuation commands
  • Safety-critical control loops

IMU (~940 ns N:N / ~400 ns 1:1)

  • High-frequency sensor fusion @ 100Hz
  • State estimation pipelines
  • Orientation tracking

LaserScan (~2.2 μs N:N / ~900 ns 1:1)

  • 2D lidar @ 10Hz
  • Obstacle detection
  • SLAM front-end

Odometry (~1.1 μs N:N / ~600 ns 1:1)

  • Pose estimation @ 50Hz
  • Dead reckoning
  • Filter updates

PointCloud (~360 μs for 10K pts)

  • 3D perception @ 30Hz
  • Object detection pipelines
  • Dense mapping

Performance Characteristics

Strengths

  1. Sub-microsecond latency for messages up to 1.5KB
  2. Consistent performance across message types (low variance)
  3. Linear scaling with message size
  4. Production-ready throughput with large headroom
  5. Serde integration handles complex nested structs efficiently

Additional Notes

  • Complex structs (IMU with 27-element covariances): Still sub-microsecond
  • Variable-size messages (PointCloud with Vec): Linear scaling

Overhead Attribution

How much latency does HORUS add over raw memory operations? Measured via raw_baselines benchmark:

| Operation | 8B Latency | What it measures |
|---|---|---|
| Raw memcpy | ~11ns | Hardware floor (cache-to-cache copy) |
| Raw atomic store+load | ~11ns | Signaling floor |
| HORUS same-process | ~23ns | +12ns over raw (ring buffer overhead) |
| HORUS cross-thread | ~155ns | +144ns (atomic coordination between threads) |
| Raw UDP loopback | ~1,158ns | Kernel network stack |

HORUS adds 12ns over raw memory for same-thread IPC — roughly 2 cache line accesses of overhead. Cross-thread adds atomic coordination cost but stays under 200ns.

HORUS is 50x faster than raw UDP on the same machine — the kernel network stack adds ~1,100ns of overhead that shared memory eliminates entirely.
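Mechanically, the same-process path comes down to a ring buffer plus two indices; a minimal single-producer/single-consumer sketch of the idea (illustrative only — not the HORUS implementation, which is wait-free Rust over shared memory):

```python
class SpscRing:
    """Bounded SPSC queue: one slot array and two monotonically
    increasing indices. No locks are needed when each index is
    written by exactly one thread."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def push(self, msg) -> bool:
        if self.head - self.tail == self.capacity:
            return False  # full
        self.slots[self.head % self.capacity] = msg
        self.head += 1    # publish only after the slot write
        return True

    def pop(self):
        if self.tail == self.head:
            return None   # empty
        msg = self.slots[self.tail % self.capacity]
        self.tail += 1
        return msg

ring = SpscRing(4)
ring.push(b"cmd_vel")
print(ring.pop())  # b'cmd_vel'
```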


Scalability

Node Scaling

Scheduler tick overhead with increasing node count (measured via research_scalability):

| Nodes | Tick Duration | Overhead vs 1 Node |
|---|---|---|
| 1 | 1,058 μs | baseline |
| 10 | 1,072 μs | +1.3% |
| 20 | 1,139 μs | +7.7% |
| 50 | 1,178 μs | +11% |
| 100 | 1,209 μs | +14% |

Near-linear scaling: 100 nodes adds only 14% overhead. Each additional node costs approximately 1.5μs.
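The ~1.5μs per-node figure can be recomputed from the endpoints of the table above:

```python
base_us, base_nodes = 1058, 1    # tick duration with 1 node
big_us, big_nodes = 1209, 100    # tick duration with 100 nodes

per_node_us = (big_us - base_us) / (big_nodes - base_nodes)
overhead = (big_us - base_us) / base_us
print(f"~{per_node_us:.1f} us/node, +{overhead:.0%} at 100 nodes")
```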

Topic Scaling

IPC latency with increasing topic count (all topics created simultaneously):

| Topics | p50 Latency | Degradation |
|---|---|---|
| 1 | 23ns | baseline |
| 10 | 24ns | 0% |
| 100 | 23ns | 0% |
| 500 | 23ns | 0% |
| 1,000 | 23ns | 0% |

O(1) topic lookup: Latency is constant regardless of how many topics exist in the system. Users can add sensors, monitors, and debug topics without affecting performance.
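Constant-time lookup is what a hash-map topic registry provides; a sketch of the idea (illustrative — the actual HORUS data structure isn't shown here):

```python
# Topic registry as a hash map: lookup cost does not depend on how
# many topics exist, matching the flat latency column above.
registry = {f"/sensor/{i}": object() for i in range(1_000)}

def endpoint(topic: str):
    return registry[topic]  # O(1) average, whether there is 1 topic or 1,000

assert endpoint("/sensor/0") is registry["/sensor/0"]
```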

IPC Latency by Topology and Message Size

Complete IPC latency across all 5 topologies (measured via research_latency with sustained runs, 200K+ samples each):

| Topology | 8B p50 | 256B p50 | 1KB p50 | Samples/2s |
|---|---|---|---|---|
| Same-process | 22ns | 30ns | 51ns | 40M |
| 1:1 cross-thread | 154ns | 177ns | 235ns | 11M |
| 3 pubs → 1 sub | 53ns | 57ns | 107ns | 20M |
| 1 pub → 3 subs | 92ns | 138ns | 195ns | 16M |
| 3 pubs × 3 subs | 184ns | 218ns | 233ns | 9M |

Key observations:

  • Linear size scaling: Latency grows proportionally with message size (memcpy-dominated)
  • Sub-200ns for all topologies at 8B — every IPC path is under 200 nanoseconds
  • MPSC is faster than SPSC at 8B (53ns vs 154ns) — multiple producers amortize coordination overhead
  • MPMC worst case is 184ns for 8B — even with 6 concurrent participants, under 200ns

Real-World Applications

| Application | Frequency | HORUS (Topic 1:1) | HORUS (Topic N:N) | ROS2 | Speedup |
|---|---|---|---|---|---|
| Motor control | 1000 Hz | ~85 ns | ~500 ns | 50 μs | 200-588x |
| IMU fusion | 100 Hz | ~400 ns | ~940 ns | 50 μs | 53-125x |
| Lidar SLAM | 10 Hz | ~900 ns | ~2.2 μs | 100 μs | 45-111x |
| Vision | 30 Hz | ~120 μs | ~360 μs | 5 ms | 14-42x |
| Planning | 100 Hz | ~600 ns | ~1.1 μs | 100 μs | 91-167x |

Throughput Comparison

(Chart: messages per second in millions; higher is better.)


Methodology

Benchmark Pattern: Ping-Pong

HORUS uses the industry-standard ping-pong benchmark pattern for IPC latency measurement:

(Diagram: ping-pong benchmark pattern.)

Why Ping-Pong?

  • Industry standard: Used by ROS2, iceoryx2, ZeroMQ benchmarks
  • Prevents queue buildup: Each message acknowledged before next send
  • Realistic: Models request-response patterns in robotics
  • Comparable: Direct apples-to-apples comparison with other frameworks
  • Conservative: Measures true round-trip latency, not just one-way send

What we measure:

  • Round-trip time: Producer → Consumer → ACK → Producer
  • Includes serialization, IPC, deserialization, and synchronization
  • Cross-core communication (Core 0 ↔ Core 1)

What we DON'T measure:

  • Burst throughput (no backpressure)
  • One-way send time without acknowledgment
  • Same-core communication (unrealistic for multi-process IPC)
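The pattern above can be sketched with two threads and a pair of queues (plain Python queues standing in for HORUS topics, so absolute numbers will be far higher than shared-memory IPC):

```python
import queue
import threading
import time

ping, pong = queue.Queue(), queue.Queue()

def responder(n):
    # Echo every message back: each send is acknowledged before the next.
    for _ in range(n):
        pong.put(ping.get())

N = 1_000
t = threading.Thread(target=responder, args=(N,))
t.start()

samples = []
for _ in range(N):
    t0 = time.perf_counter_ns()
    ping.put(b"x")
    pong.get()  # wait for the ACK: no queue buildup, true round-trip time
    samples.append(time.perf_counter_ns() - t0)
t.join()

samples.sort()
print(f"median round-trip: {samples[len(samples) // 2]} ns")
```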

Test Environment

  • Build: cargo build --release with full optimizations
  • CPU Governor: Performance mode
  • CPU Affinity: Producer pinned to Core 0, Consumer pinned to Core 1
  • Process Isolation: Dedicated topics per benchmark
  • Warmup: 1,000 iterations before measurement
  • Measurement: RDTSC (cycle-accurate timestamps)

Message Realism

  • Actual HORUS library message types
  • Serde serialization (production path)
  • Realistic field values and sizes
  • Complex nested structures (IMU, Odometry)

Statistical Methodology

  • 10,000 iterations per test
  • Median, P95, P99 latency tracking
  • Variance tracking (min/max ranges)
  • Multiple message sizes
  • Mixed workload testing
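Median/P95/P99 tracking reduces to sorting the sample buffer and indexing; a nearest-rank sketch:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a latency sample buffer."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# Hypothetical latency samples (ns) with a couple of outliers.
latencies_ns = [85, 87, 86, 90, 250, 88, 87, 86, 89, 1200]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ns, p)} ns")
```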

Measurement Details

RDTSC Calibration:

  • Null cost (back-to-back rdtsc): ~36 cycles
  • Target on modern x86_64: 20-30 cycles
  • Timestamp embedded directly in message payload

Cross-Core Testing:

  • Producer and consumer on different CPU cores
  • Simulates real multi-process robotics systems
  • Includes cache coherency overhead (~60 cycles theoretical minimum)

Scheduler Performance

Enhanced Smart Scheduler

HORUS now includes an intelligent scheduler that automatically optimizes node execution based on runtime behavior:

Key Enhancements:

  • Tiered Execution: Explicit tier annotation (UltraFast, Fast, Normal)
  • Failure Policies: Per-node failure handling with automatic recovery
  • Predictable by Default: Sequential execution with consistent priority ordering
  • Safety Monitoring: WCET enforcement, watchdogs, and emergency stop

Comprehensive Benchmark Results

Test Configuration:

  • Workload duration: 5 seconds per test
  • Sample size: 20 measurements per benchmark
  • Platform: Modern x86_64 Linux system

| Workload Type | Mean Time | Description | Key Achievement |
|---|---|---|---|
| UltraFastControl | 2.387s | High-frequency control loops | Optimized for high-frequency control |
| FastSensor | 2.382s | Rapid sensor processing | Maintains sub-μs sensor fusion |
| HeavyIO | 3.988s | I/O-intensive operations | Async tier prevents blocking |
| MixedRealistic | 4.064s | Real-world mixed workload | Balanced optimization across tiers |
| FaultTolerance | 25.485s | With simulated failures | Failure policy recovery working |

Scalability Performance

The scheduler scales gracefully, with execution time growing only slightly as nodes are added:

| Node Count | Execution Time | Scaling Factor |
|---|---|---|
| 10 nodes | 106.93ms | Baseline |
| 50 nodes | 113.93ms | 1.07x (5x nodes) |
| 100 nodes | 116.49ms | 1.09x (10x nodes) |
| 200 nodes | 119.55ms | 1.12x (20x nodes) |

Key Insights:

  • Near-linear scaling from 10 to 200 nodes
  • Only 13ms increase for 20x more nodes
  • Maintains sub-120ms for large systems
  • Automatic tier classification optimizes execution order


Real-Time Performance

RtNode Support

HORUS now provides industrial-grade real-time support for safety-critical applications:

RT Features:

  • WCET Enforcement: Worst-Case Execution Time monitoring
  • Deadline Tracking: Count and handle deadline misses
  • Safety Monitor: Emergency stop on critical failures
  • Watchdog Timers: Detect hung or crashed nodes

RT Performance Characteristics

| Metric | Performance | Description |
|---|---|---|
| WCET Overhead | <5μs | Cost of monitoring execution time |
| Deadline Precision | ±10μs | Jitter in deadline detection |
| Watchdog Resolution | 1ms | Minimum detection time |
| Emergency Stop | <100μs | Time to halt all nodes |
| Context Switch | <1μs | Priority preemption overhead |

Safety-Critical Configuration

Running with full safety monitoring enabled:

let scheduler = Scheduler::new().tick_rate(1000_u64.hz());

| Feature | Overhead | Impact |
|---|---|---|
| WCET Tracking | ~1μs per node | Negligible for >100μs tasks |
| Deadline Monitor | ~500ns per node | Sub-microsecond overhead |
| Watchdog Feed | ~100ns per tick | Minimal impact |
| Safety Checks | ~2μs total | Worth it for safety |
| Memory Locking | One-time 10ms | Prevents page faults |

Real-Time Test Results

Test: Mixed RT and Normal Nodes

  • 2 critical RT nodes @ 1kHz
  • 2 normal nodes @ 100Hz
  • 2 background nodes @ 10Hz

| Node Type | Target Rate | Achieved | Jitter | Misses |
|---|---|---|---|---|
| RT Critical | 1000 Hz | 999.8 Hz | ±10μs | 0 |
| RT High | 500 Hz | 499.9 Hz | ±15μs | 0 |
| Normal | 100 Hz | 99.9 Hz | ±50μs | <0.1% |
| Background | 10 Hz | 10 Hz | ±200μs | <0.5% |

Zero deadline misses for critical RT nodes over 1M iterations.
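Jitter here is the deviation of the actual wake-up time from the nominal period; a sketch of the measurement (pure Python with a busy-wait, so expect far more jitter than the RT scheduler figures above):

```python
import time

period_s = 0.001  # 1 kHz nominal tick
ticks = 200
deadline = time.perf_counter()
jitter_us = []

for _ in range(ticks):
    deadline += period_s
    while time.perf_counter() < deadline:
        pass  # busy-wait to the next deadline
    # How late did this tick actually fire?
    jitter_us.append((time.perf_counter() - deadline) * 1e6)

print(f"max jitter: {max(jitter_us):.1f} us over {ticks} ticks")
```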


All-Routes Latency

HORUS automatically selects the optimal communication path based on topology (same-thread, cross-thread, cross-process) and producer/consumer count. This benchmark measures the latency of each automatically-selected route.

Benchmark Results

| Scenario | Latency | Target | Notes |
|---|---|---|---|
| Same thread, 1:1 | 16ns | 60ns | Ultra-fast direct path |
| Cross-thread, 1:1 | 11ns | 60ns | Optimized single-producer path |
| Cross-process, 1:1 | 182ns | 100ns | Shared memory path |
| Cross-process, N:1 | 244ns | 150ns | Multi-producer shared memory |
| Cross-process, N:N | 187ns | 200ns | General cross-process |

Latency by Topology

| Topology | Producers | Consumers | Latency |
|---|---|---|---|
| Same thread | 1 | 1 | ~16ns |
| Same process | 1 | 1 | ~11ns |
| Same process | N | 1 | ~15ns |
| Same process | 1 | N | ~15ns |
| Same process | N | N | ~20ns |
| Cross process | 1 | 1 | ~180ns |
| Cross process | N | 1 | ~250ns |
| Cross process | 1 | N | ~200ns |
| Cross process | N | N | ~190ns |

Key Achievements

  • Sub-20ns for same-process communication
  • Sub-200ns for cross-process 1:1
  • Sub-300ns for multi-producer cross-process
  • Zero configuration — optimal path selected automatically
  • Seamless migration — path upgrades transparently as topology changes

Running the Benchmark

cd horus
cargo build --release -p horus_benchmarks
./target/release/all_paths_latency

Summary

HORUS provides production-grade performance for real robotics applications:

Automatic Path Selection (Recommended):

  • 16 ns — Same-thread
  • 11 ns — Cross-thread, 1:1
  • 182 ns — Cross-process, 1:1
  • 244 ns — Cross-process, multi-producer
  • 187 ns — Cross-process, multi-producer/consumer

Point-to-Point (1:1):

  • 87 ns — Send only (ultra-low latency)
  • 161 ns — CmdVel (motor control)
  • 262 ns — Send+Recv round-trip
  • ~400 ns — IMU (sensor fusion)
  • ~120 μs — PointCloud with 10K points

Multi-Producer/Consumer (N:N):

  • ~313 ns — CmdVel (motor control)
  • ~500 ns — IMU (sensor fusion)
  • ~2.2 μs — LaserScan (2D lidar)
  • ~1.1 μs — Odometry (localization)
  • ~360 μs — PointCloud with 10K points

Ready for production deployment in demanding robotics applications requiring real-time performance with complex data types.



Python Benchmarks

Real measurements from horus_py/benchmarks/bench_python.py. Python 3.12, Linux x86_64.

Message Send/Recv Latency

Single-process Topic roundtrip (send + recv):

| Message type | Median | Path |
|---|---|---|
| CmdVel (typed) | 1.5μs | Zero-copy Pod memcpy |
| Pose2D (typed) | 1.6μs | Zero-copy Pod memcpy |
| Imu (typed) | 1.6μs | Zero-copy Pod memcpy |
| dict {"v": 1.0} | 5.4μs | GenericMessage + MessagePack |
| dict {"x", "y", "z"} | 9.1μs | GenericMessage + MessagePack |
| dict ~1KB | 52μs | GenericMessage + MessagePack |

Typed messages are 6x faster than dicts because they skip serialization entirely.

Zero-Copy Image/PointCloud

to_numpy() returns a view into shared memory — constant time regardless of data size:

| Data | to_numpy() (zero-copy) | np.copy() (naive) | Speedup |
|---|---|---|---|
| Image 320×240 (225KB) | 3.0μs | 3.0μs | 1x |
| Image 640×480 (900KB) | 3.0μs | 13μs | 4x |
| Image 1280×720 (2.7MB) | 3.0μs | 75μs | 25x |
| Image 1920×1080 (6MB) | 3.0μs | 178μs | 59x |
| PointCloud 10K pts (120KB) | 2.8μs | - | - |
| PointCloud 100K pts (1.2MB) | 2.8μs | - | - |
| DepthImage 640×480 (1.2MB) | 2.8μs | - | - |
| np.from_dlpack() (DLPack) | 979ns | - | - |

The key insight: 3μs for a 6MB 1080p image vs 178μs to copy it. This is the DLPack/shared memory pool advantage — Python gets a pointer to the data, not a copy.

Node Tick Overhead

How fast can the Rust scheduler drive Python nodes:

| Scenario | Throughput | Per-tick |
|---|---|---|
| Empty tick (Rust → Python → Rust) | ~530 Hz | 1.9ms |
| Tick + send(dict) | ~525 Hz | 1.9ms |
| Tick + send(dict) + recv(dict) | ~525 Hz | 1.9ms |
The bottleneck is Python's GIL (~1.8ms per acquisition), not the Rust binding (~30μs). The Rust scheduler, IPC, and safety monitoring add negligible overhead.

Generic Message Sizes

MessagePack serialization for common robotics data:

| Payload | Bytes | Fits in GenericMessage? |
|---|---|---|
| Empty dict | 1 | Yes (4KB max) |
| CmdVel-like {linear, angular} | 34 | Yes |
| IMU-like (accel + gyro + mag) | 100 | Yes |
| LaserScan 360 rays | 3,251 | Yes |
| 10 detections | 374 | Yes |

Running Python Benchmarks

cd horus_py
PYTHONPATH=. python3 benchmarks/bench_python.py

Next Steps

Build faster. Debug easier. Deploy with confidence.