HORUS Benchmarks
Performance validation with real-world robotics workloads.
Benchmark Methodology
Measurement Approach
- Statistical sampling: Criterion.rs with 20+ samples per measurement
- Confidence intervals: Min/mean/max with outlier detection
- Controlled methodology: 1s warm-up, 5s measurement phases
- Reproducible: Less than 1% variance across measurements
- Comprehensive coverage: 5 workload types, 4 scalability points
Workload Testing
- Real workloads: Control loops, sensor fusion, I/O operations
- Fault injection: Failure policy recovery testing
- Scale testing: Validated up to 200 concurrent nodes
- Mixed patterns: Combined blocking/non-blocking operations
- Long-running: 25+ second failure recovery tests
Executive Summary
HORUS delivers sub-microsecond to low-microsecond latency for production robotics applications:
| Message Type | Size | Latency (Topic N:N) | Throughput | Typical Rate | Headroom |
|---|---|---|---|---|---|
| CmdVel | 16 B | ~500 ns | 2.7M msg/s | 1000 Hz | 2,700x |
| BatteryState | 104 B | ~600 ns | 1.67M msg/s | 1 Hz | 1.67M x |
| IMU | 304 B | ~940 ns | 1.8M msg/s | 100 Hz | 18,000x |
| Odometry | 736 B | ~1.1 μs | 1.3M msg/s | 50 Hz | 26,000x |
| LaserScan | 1.5 KB | ~2.2 μs | 633K msg/s | 10 Hz | 63,300x |
| PointCloud (1K) | ~12 KB | ~12 μs | 83K msg/s | 30 Hz | 2,767x |
| PointCloud (10K) | ~120 KB | ~360 μs | 4.7K msg/s | 30 Hz | 157x |
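The Headroom column is simply measured throughput divided by the typical publish rate. A quick arithmetic check using the figures from the table above:

```python
# Headroom = sustained throughput (msg/s) / typical publish rate (Hz).
# All values are the measured figures from the table above.
rows = {
    "CmdVel":           (2_700_000, 1_000),
    "BatteryState":     (1_670_000, 1),
    "IMU":              (1_800_000, 100),
    "Odometry":         (1_300_000, 50),
    "LaserScan":        (633_000, 10),
    "PointCloud (1K)":  (83_000, 30),
    "PointCloud (10K)": (4_700, 30),
}

def headroom(throughput_msgs, rate_hz):
    return throughput_msgs / rate_hz

for name, (tput, rate) in rows.items():
    print(f"{name}: {headroom(tput, rate):,.0f}x headroom")
```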
Latency Comparison: HORUS vs ROS2
Lower is better. Logarithmic scale (send-only latency in μs)
Performance Highlights
Key Findings
- Sub-microsecond latency for messages up to 1.5KB
- Serde integration works flawlessly with complex nested structs
- Linear scaling with message size (predictable performance)
- Massive headroom for all typical robotics frequencies
Production Readiness
- Real-time control: ~500 ns latency supports 1000Hz+ control loops with 2,700x headroom
- Sensor fusion: Mixed workload maintains sub-microsecond performance (648 ns avg)
- Perception pipelines: 10K point clouds @ 30Hz with 157x headroom
- Multi-robot systems: Throughput supports 100+ robots on single node
Detailed Results
CmdVel (Motor Control Command)
Use Case: Real-time motor control @ 1000Hz
Structure: { timestamp: u64, linear: f32, angular: f32 }
Average Latency: ~500 ns (Topic N:N)
Throughput: 2.7M msg/s
Topic 1:1: ~85 ns median
Analysis: Sub-microsecond performance suitable for 1000Hz control loops with 2,700x headroom.
LaserScan (2D Lidar Data)
Use Case: 2D lidar sensor data @ 10Hz
Structure: { ranges: [f32; 360], angle_min/max, metadata }
Average Latency: ~2.2 μs (Topic N:N)
Throughput: 633K msg/s
Topic 1:1: ~900 ns estimated
Analysis: Consistent low-microsecond latency for 1.5KB messages. Can easily handle 10Hz lidar updates with 63,300x headroom.
IMU (Inertial Measurement Unit)
Use Case: Orientation and acceleration @ 100Hz
Structure: { orientation: [f64; 4], angular_velocity: [f64; 3], linear_acceleration: [f64; 3], covariances: [f64; 27] }
Average Latency: ~940 ns (Topic N:N)
Throughput: 1.8M msg/s
Topic 1:1: ~400 ns estimated
Analysis: Sub-microsecond performance with complex nested arrays and 27-element covariance matrices.
Odometry (Pose + Velocity)
Use Case: Robot localization @ 50Hz
Structure: { pose: Pose2D, twist: Twist, pose_covariance: [f64; 36], twist_covariance: [f64; 36] }
Average Latency: ~1.1 μs (Topic N:N)
Throughput: 1.3M msg/s
Topic 1:1: ~600 ns estimated
Analysis: Low-microsecond latency for 736-byte messages with extensive covariance data.
PointCloud (3D Perception)
Small (100 points @ 30Hz)
Average Latency: 1.85 μs
Throughput: 539,529 msg/s
Data Size: ~1.2 KB
Medium (1,000 points @ 30Hz)
Average Latency: 7.55 μs
Throughput: 132,432 msg/s
Data Size: ~12 KB
Large (10,000 points @ 30Hz)
Average Latency: ~360 μs (Topic N:N)
Throughput: 4.7K msg/s
Data Size: ~120 KB
Analysis: Linear scaling with point count. Even 10K point clouds process in ~360 μs (sufficient for 30Hz perception with 157x headroom).
Mixed Workload (Realistic Robot Loop)
Simulation: Real robot control loop @ 100Hz
Components: CmdVel @ 100Hz + IMU @ 100Hz + BatteryState @ 1Hz
Total Operations: 20,100 messages
Average Latency: ~1.0 μs (Topic N:N)
Throughput: ~1.5M msg/s
Range: ~500-1200 ns
Analysis: Low-microsecond average latency for mixed message types simulating realistic robotics workload.
Comparison with Traditional Frameworks
Latency Comparison
Measurement Note: Topic 1:1 values below are send-only (one-direction). For round-trip (send+receive), approximately double these values (e.g., 87ns send-only → ~175ns round-trip).
| Framework | Small Msg (send-only) | Medium Msg (send-only) | Large Msg (send-only) |
|---|---|---|---|
| HORUS Topic (1:1) | 87 ns | ~160 ns | ~400 ns |
| HORUS Topic (N:N) | 313 ns | ~500 ns | ~1.1 μs |
| ROS2 (DDS) | 50-100 μs | 100-500 μs | 1-10 ms |
| ROS2 (FastDDS) | 20-50 μs | 50-200 μs | 500 μs - 5 ms |
Performance Advantage: HORUS is 230-575x faster than ROS2 for typical message sizes.
HORUS Speedup vs ROS2
How many times faster HORUS Link is compared to ROS2 DDS
Latency by Message Size
Measurement Note: All latencies below are send-only (one-direction publish). "1:1" = single producer/consumer, "N:N" = multiple producers and consumers.
| Message Size | Message Type | N:N (send-only) | 1:1 (send-only) | vs ROS2 |
|---|---|---|---|---|
| 16 B | CmdVel | ~313 ns | 87 ns | 230-575x faster |
| 104 B | BatteryState | ~600 ns | ~350 ns | 83-286x faster |
| 304 B | IMU | ~940 ns | ~400 ns | 53-250x faster |
| 736 B | Odometry | ~1.1 μs | ~600 ns | 45-167x faster |
| 1,480 B | LaserScan | ~2.2 μs | ~900 ns | 23-111x faster |
Observation: Near-linear scaling with message size demonstrates efficient serialization and IPC.
Latency vs Message Size
HORUS shows linear scaling. Values in nanoseconds.
Python Performance
The HORUS Python bindings (PyO3) call directly into the Rust shared memory layer. The cost of crossing the Python/Rust boundary is constant (~1.5μs) regardless of message size — large data (images, point clouds) uses zero-copy shared memory and bypasses this overhead entirely.
FFI Overhead (Rust vs Python)
The key question: what does Python cost you over pure Rust?
| Operation | Rust | Python | Overhead | Factor |
|---|---|---|---|---|
| CmdVel send+recv (typed) | ~22ns | ~1,500ns | ~1,478ns | ~68x |
| Imu send+recv (typed) | ~30ns | ~1,700ns | ~1,670ns | ~57x |
| dict send+recv (generic) | ~22ns | ~5,400ns | ~5,378ns | ~245x |
The ~1.5μs Python overhead comes from: PyO3 boundary crossing (~500ns), GIL acquisition (~500ns), and Python object allocation (~500ns). This is constant — a 1KB typed message has the same overhead as an 8B one.
When to use Python vs Rust:
- Python: AI inference (PyTorch, YOLO), prototyping, data science nodes — the ~1.5μs overhead is negligible compared to 10-200ms inference time
- Rust: Motor controllers, safety monitors, sensor fusion at 1kHz — where 22ns matters
Python IPC Latency
Measured with sustained runs (10s+, 100K+ samples) via research_bench_python.py:
| Message Type | Path | p50 | p99 | p999 |
|---|---|---|---|---|
| CmdVel (typed) | Zero-copy | ~1.5μs | ~2.2μs | ~5.0μs |
| Pose2D (typed) | Zero-copy | ~1.6μs | ~2.4μs | ~5.5μs |
| Imu (typed) | Zero-copy | ~1.7μs | ~2.6μs | ~6.0μs |
| dict small | MessagePack | ~5.4μs | ~8μs | ~15μs |
| dict medium | MessagePack | ~9.1μs | ~14μs | ~25μs |
| dict ~1KB | MessagePack | ~52μs | ~65μs | ~90μs |
Typed messages are 3-30x faster than dicts because they bypass MessagePack serialization and use direct Pod memcpy through the Rust layer.
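The typed-vs-dict gap is essentially fixed-layout copy versus generic serialization. A rough stdlib analogy (struct standing in for the Pod path, json standing in for MessagePack; neither is the actual HORUS code path):

```python
import json
import struct
import timeit

# Pod-like CmdVel: fixed 16-byte layout, known ahead of time
pack_cmdvel = struct.Struct("<Qff").pack  # timestamp: u64, linear: f32, angular: f32

typed = timeit.timeit(lambda: pack_cmdvel(123, 0.5, 0.1), number=100_000)
generic = timeit.timeit(
    lambda: json.dumps({"timestamp": 123, "linear": 0.5, "angular": 0.1}),
    number=100_000,
)
print(f"fixed layout is ~{generic / typed:.0f}x faster than generic serialization")
```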
Image Zero-Copy
to_numpy() returns a view into shared memory — O(1) regardless of image size:
| Resolution | to_numpy() | np.copy() | Speedup |
|---|---|---|---|
| 320x240 (225KB) | ~3μs | ~3μs | 1x |
| 640x480 (900KB) | ~3μs | ~13μs | 4x |
| 1280x720 (2.7MB) | ~3μs | ~75μs | 25x |
| 1920x1080 (6MB) | ~3μs | ~178μs | 59x |
At 1080p, zero-copy is 59x faster than copying. For 4K frames, the speedup is even larger.
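The same zero-copy principle can be sketched with the standard library alone; this is an illustration using multiprocessing.shared_memory and NumPy, not the HORUS to_numpy() implementation:

```python
import numpy as np
from multiprocessing import shared_memory

# A 1080p RGB frame (~6 MB) living in OS shared memory
shm = shared_memory.SharedMemory(create=True, size=1920 * 1080 * 3)

# Zero-copy: the array is a *view* into shm.buf, O(1) no matter the size
frame = np.ndarray((1080, 1920, 3), dtype=np.uint8, buffer=shm.buf)
frame[0, 0, 0] = 42        # writes straight into shared memory

# Naive path: materialize a private copy, O(n) in the frame size
private = frame.copy()
private[0, 0, 0] = 7       # does NOT touch shared memory

assert shm.buf[0] == 42    # the view mutated the underlying buffer

del frame                  # drop the view before closing the segment
shm.close()
shm.unlink()
```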
Running Python Benchmarks
# Quick validation (2s per test)
python3 horus_py/benchmarks/research_bench_python.py --duration 2
# Full research run (10s per test, CSV output)
python3 horus_py/benchmarks/research_bench_python.py --duration 10 --csv python_results.csv
# JSON summary
python3 horus_py/benchmarks/research_bench_python.py --json python_summary.json
Running Rust Benchmarks
Quick Run
cd horus
cargo run --release -p horus_benchmarks --bin robotics_messages_benchmark
Available Benchmarks
| Binary | Description |
|---|---|
| robotics_messages_benchmark | IPC latency with real robotics message types |
| all_paths_latency | All 10 backend paths with RDTSC cycle precision |
| cross_process_benchmark | Cross-process shared memory IPC |
| scalability_benchmark | Scaling with producer/consumer thread counts |
| determinism_benchmark | Execution determinism and jitter |
| dds_comparison_benchmark | Comparison with DDS middleware (requires --features dds) |
Extended Benchmarks
Sustained runs, size sweeps, histograms, and competitor comparison:
| Binary | Description |
|---|---|
| raw_baselines | Hardware floor (memcpy, atomic, mmap) — no HORUS overhead |
| research_latency | Sustained measurement + message size sweep (8B-4KB), CSV output |
| research_throughput | Per-second throughput timeseries over 60s+ |
| research_jitter | RT tick jitter histogram + IPC latency under CPU contention |
| research_scalability | Node scaling (1-100) + topic scaling (1-1000) |
| competitor_comparison | HORUS vs raw UDP (+ Zenoh with --features zenoh) |
# Run full benchmark suite (~30 minutes)
./benchmarks/research/run_all.sh
# Quick validation (~3 minutes)
./benchmarks/research/run_all.sh --quick
# Individual benchmark with CSV output
cargo run --release -p horus_benchmarks --bin research_latency -- --duration 60 --csv results.csv
Run any benchmark with:
cargo run --release -p horus_benchmarks --bin <name>
# JSON output for CI/regression tracking
cargo run --release -p horus_benchmarks --bin <name> -- --json results.json
Criterion micro-benchmarks:
cd horus
cargo bench -p horus_benchmarks
Expected Output
HORUS Production Message Benchmark Suite
Testing with real robotics message types
CmdVel (Motor Control Command)
Size: 16 bytes | Typical rate: 1000Hz
Latency (avg): ~500 ns (Topic N:N) / ~85 ns (Topic 1:1)
Throughput: 2.7M msg/s (Topic N:N)
LaserScan (2D Lidar Data)
Size: 1480 bytes | Typical rate: 10Hz
Latency (avg): ~2.2 μs (Topic N:N) / ~900 ns (Topic 1:1)
Throughput: 633K msg/s (Topic N:N)
Use Case Selection
Message Type Guidelines
CmdVel (~500 ns N:N / ~85 ns 1:1)
- Motor control @ 1000Hz
- Real-time actuation commands
- Safety-critical control loops
IMU (~940 ns N:N / ~400 ns 1:1)
- High-frequency sensor fusion @ 100Hz
- State estimation pipelines
- Orientation tracking
LaserScan (~2.2 μs N:N / ~900 ns 1:1)
- 2D lidar @ 10Hz
- Obstacle detection
- SLAM front-end
Odometry (~1.1 μs N:N / ~600 ns 1:1)
- Pose estimation @ 50Hz
- Dead reckoning
- Filter updates
PointCloud (~360 μs for 10K pts)
- 3D perception @ 30Hz
- Object detection pipelines
- Dense mapping
Performance Characteristics
Strengths
- Sub-microsecond latency for messages up to 1.5KB
- Consistent performance across message types (low variance)
- Linear scaling with message size
- Production-ready throughput with large headroom
- Serde integration handles complex nested structs efficiently
Additional Notes
- Complex structs (IMU with 27-element covariances): Still sub-microsecond
- Variable-size messages (PointCloud with Vec): Linear scaling
Overhead Attribution
How much latency does HORUS add over raw memory operations? Measured via raw_baselines benchmark:
| Operation | 8B Latency | What it measures |
|---|---|---|
| Raw memcpy | ~11ns | Hardware floor (cache-to-cache copy) |
| Raw atomic store+load | ~11ns | Signaling floor |
| HORUS same-process | ~23ns | +12ns over raw (ring buffer overhead) |
| HORUS cross-thread | ~155ns | +144ns (atomic coordination between threads) |
| Raw UDP loopback | ~1,158ns | Kernel network stack |
HORUS adds 12ns over raw memory for same-thread IPC — roughly 2 cache line accesses of overhead. Cross-thread adds atomic coordination cost but stays under 200ns.
HORUS is 50x faster than raw UDP on the same machine — the kernel network stack adds ~1,100ns of overhead that shared memory eliminates entirely.
Scalability
Node Scaling
Scheduler tick overhead with increasing node count (measured via research_scalability):
| Nodes | Tick Duration | Overhead vs 1 Node |
|---|---|---|
| 1 | 1,058 μs | baseline |
| 10 | 1,072 μs | +1.3% |
| 20 | 1,139 μs | +7.7% |
| 50 | 1,178 μs | +11% |
| 100 | 1,209 μs | +14% |
Near-linear scaling: 100 nodes adds only 14% overhead. Each additional node costs approximately 1.5μs.
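The per-node figure falls out of the table directly:

```python
# Tick durations from the node-scaling table above (μs)
tick_1_node = 1_058
tick_100_nodes = 1_209

# 99 extra nodes account for the difference between the two ticks
per_node_cost = (tick_100_nodes - tick_1_node) / 99
overhead_pct = (tick_100_nodes / tick_1_node - 1) * 100
print(f"~{per_node_cost:.1f} μs per extra node, +{overhead_pct:.0f}% at 100 nodes")
```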
Topic Scaling
IPC latency with increasing topic count (all topics created simultaneously):
| Topics | p50 Latency | Degradation |
|---|---|---|
| 1 | 23ns | baseline |
| 10 | 24ns | 0% |
| 100 | 23ns | 0% |
| 500 | 23ns | 0% |
| 1,000 | 23ns | 0% |
O(1) topic lookup: Latency is constant regardless of how many topics exist in the system. Users can add sensors, monitors, and debug topics without affecting performance.
IPC Latency by Topology and Message Size
Complete IPC latency across all 5 topologies (measured via research_latency with sustained runs, 200K+ samples each):
| Topology | 8B p50 | 256B p50 | 1KB p50 | Samples/2s |
|---|---|---|---|---|
| Same-process | 22ns | 30ns | 51ns | 40M |
| 1:1 cross-thread | 154ns | 177ns | 235ns | 11M |
| 3 pubs → 1 sub | 53ns | 57ns | 107ns | 20M |
| 1 pub → 3 subs | 92ns | 138ns | 195ns | 16M |
| 3 pubs × 3 subs | 184ns | 218ns | 233ns | 9M |
Key observations:
- Linear size scaling: Latency grows proportionally with message size (memcpy-dominated)
- Sub-200ns for all topologies at 8B — every IPC path is under 200 nanoseconds
- MPSC is faster than SPSC at 8B (53ns vs 154ns) — multiple producers amortize coordination overhead
- MPMC worst case is 184ns for 8B — even with 6 concurrent participants, under 200ns
Real-World Applications
| Application | Frequency | HORUS (Topic 1:1) | HORUS (Topic N:N) | ROS2 | Speedup |
|---|---|---|---|---|---|
| Motor control | 1000 Hz | ~85 ns | ~500 ns | 50 μs | 200-588x |
| IMU fusion | 100 Hz | ~400 ns | ~940 ns | 50 μs | 53-125x |
| Lidar SLAM | 10 Hz | ~900 ns | ~2.2 μs | 100 μs | 45-111x |
| Vision | 30 Hz | ~120 μs | ~360 μs | 5 ms | 14-42x |
| Planning | 100 Hz | ~600 ns | ~1.1 μs | 100 μs | 91-167x |
Throughput Comparison
Messages per second (millions). Higher is better.
Methodology
Benchmark Pattern: Ping-Pong
HORUS uses the industry-standard ping-pong benchmark pattern for IPC latency measurement:
Why Ping-Pong?
- Industry standard: Used by ROS2, iceoryx2, ZeroMQ benchmarks
- Prevents queue buildup: Each message acknowledged before next send
- Realistic: Models request-response patterns in robotics
- Comparable: Direct apples-to-apples comparison with other frameworks
- Conservative: Measures true round-trip latency, not just one-way send
What we measure:
- Round-trip time: Producer → Consumer → ACK → Producer
- Includes serialization, IPC, deserialization, and synchronization
- Cross-core communication (Core 0 ↔ Core 1)
What we DON'T measure:
- Burst throughput (no backpressure)
- One-way send time without acknowledgment
- Same-core communication (unrealistic for multi-process IPC)
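A minimal version of the ping-pong pattern, here over a multiprocessing.Pipe purely to illustrate the measurement loop (pipe latency is orders of magnitude above HORUS's shared-memory path):

```python
import time
from multiprocessing import Pipe, Process

def echo(conn, n):
    # Consumer: acknowledge every message so the producer never outruns it
    for _ in range(n):
        conn.send_bytes(conn.recv_bytes())

def pingpong(n=2_000, payload=b"x" * 16):
    parent, child = Pipe()
    p = Process(target=echo, args=(child, n))
    p.start()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        parent.send_bytes(payload)
        parent.recv_bytes()          # wait for the ACK before the next send
        samples.append(time.perf_counter_ns() - t0)
    p.join()
    samples.sort()
    return samples[len(samples) // 2]  # median round-trip, ns

if __name__ == "__main__":
    print(f"median round-trip: {pingpong()} ns")
```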
Test Environment
- Build: cargo build --release with full optimizations
- CPU Governor: Performance mode
- CPU Affinity: Producer pinned to Core 0, Consumer pinned to Core 1
- Process Isolation: Dedicated topics per benchmark
- Warmup: 1,000 iterations before measurement
- Measurement: RDTSC (cycle-accurate timestamps)
Message Realism
- Actual HORUS library message types
- Serde serialization (production path)
- Realistic field values and sizes
- Complex nested structures (IMU, Odometry)
Statistical Methodology
- 10,000 iterations per test
- Median, P95, P99 latency tracking
- Variance tracking (min/max ranges)
- Multiple message sizes
- Mixed workload testing
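The median/P95/P99 figures can be reproduced from raw samples with a nearest-rank percentile, sketched here:

```python
import math

def percentile(samples, p):
    # Nearest-rank method: smallest value with at least p% of samples at or below it
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

samples = list(range(1, 10_001))     # stand-in for 10,000 latency samples
print(percentile(samples, 50),       # median
      percentile(samples, 95),
      percentile(samples, 99))       # → 5000 9500 9900
```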
Measurement Details
RDTSC Calibration:
- Null cost (back-to-back rdtsc): ~36 cycles
- Target on modern x86_64: 20-30 cycles
- Timestamp embedded directly in message payload
Cross-Core Testing:
- Producer and consumer on different CPU cores
- Simulates real multi-process robotics systems
- Includes cache coherency overhead (~60 cycles theoretical minimum)
Scheduler Performance
Enhanced Smart Scheduler
HORUS now includes an intelligent scheduler that automatically optimizes node execution based on runtime behavior:
Key Enhancements:
- Tiered Execution: Explicit tier annotation (UltraFast, Fast, Normal)
- Failure Policies: Per-node failure handling with automatic recovery
- Predictable by Default: Sequential execution with consistent priority ordering
- Safety Monitoring: WCET enforcement, watchdogs, and emergency stop
Comprehensive Benchmark Results
Test Configuration:
- Workload duration: 5 seconds per test
- Sample size: 20 measurements per benchmark
- Platform: Modern x86_64 Linux system
| Workload Type | Mean Time | Description | Key Achievement |
|---|---|---|---|
| UltraFastControl | 2.387s | High-frequency control loops | Optimized for high-frequency control |
| FastSensor | 2.382s | Rapid sensor processing | Maintains sub-μs sensor fusion |
| HeavyIO | 3.988s | I/O-intensive operations | Async tier prevents blocking |
| MixedRealistic | 4.064s | Real-world mixed workload | Balanced optimization across tiers |
| FaultTolerance | 25.485s | With simulated failures | Failure policy recovery working |
Scalability Performance
The scheduler demonstrates excellent sub-linear scaling; execution time stays nearly flat as node count grows:
| Node Count | Execution Time | Scaling Factor |
|---|---|---|
| 10 nodes | 106.93ms | Baseline |
| 50 nodes | 113.93ms | 1.07x (5x nodes) |
| 100 nodes | 116.49ms | 1.09x (10x nodes) |
| 200 nodes | 119.55ms | 1.12x (20x nodes) |
Key Insights:
- Near-constant execution time from 10 to 200 nodes
- Only 13ms increase for 20x more nodes
- Maintains sub-120ms for large systems
- Automatic tier classification optimizes execution order
Scheduler Scalability
Near-constant execution time regardless of node count
Real-Time Performance
RtNode Support
HORUS now provides industrial-grade real-time support for safety-critical applications:
RT Features:
- WCET Enforcement: Worst-Case Execution Time monitoring
- Deadline Tracking: Count and handle deadline misses
- Safety Monitor: Emergency stop on critical failures
- Watchdog Timers: Detect hung or crashed nodes
RT Performance Characteristics
| Metric | Performance | Description |
|---|---|---|
| WCET Overhead | <5μs | Cost of monitoring execution time |
| Deadline Precision | ±10μs | Jitter in deadline detection |
| Watchdog Resolution | 1ms | Minimum detection time |
| Emergency Stop | <100μs | Time to halt all nodes |
| Context Switch | <1μs | Priority preemption overhead |
Safety-Critical Configuration
Running with full safety monitoring enabled:
let scheduler = Scheduler::new().tick_rate(1000_u64.hz());
| Feature | Overhead | Impact |
|---|---|---|
| WCET Tracking | ~1μs per node | Negligible for >100μs tasks |
| Deadline Monitor | ~500ns per node | Sub-microsecond overhead |
| Watchdog Feed | ~100ns per tick | Minimal impact |
| Safety Checks | ~2μs total | Worth it for safety |
| Memory Locking | One-time 10ms | Prevents page faults |
Real-Time Test Results
Test: Mixed RT and Normal Nodes
- 2 critical RT nodes @ 1kHz
- 2 normal nodes @ 100Hz
- 2 background nodes @ 10Hz
| Node Type | Target Rate | Achieved | Jitter | Misses |
|---|---|---|---|---|
| RT Critical | 1000 Hz | 999.8 Hz | ±10μs | 0 |
| RT High | 500 Hz | 499.9 Hz | ±15μs | 0 |
| Normal | 100 Hz | 99.9 Hz | ±50μs | <0.1% |
| Background | 10 Hz | 10 Hz | ±200μs | <0.5% |
Zero deadline misses for critical RT nodes over 1M iterations.
Real-Time Node Performance
Target rate achievement and jitter measurements
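Jitter numbers like ±10μs are deviations of each observed tick period from its target; a sketch of that computation on synthetic timestamps:

```python
# Synthetic tick timestamps for a 1 kHz node (ns); real data would come
# from the scheduler's per-tick instrumentation.
target_period_ns = 1_000_000                      # 1 ms target period
ticks = [i * target_period_ns + ((-1) ** i) * 8_000 for i in range(1_000)]

periods = [b - a for a, b in zip(ticks, ticks[1:])]
jitter = [abs(p - target_period_ns) for p in periods]

max_jitter_us = max(jitter) / 1_000
achieved_hz = 1e9 * len(periods) / (ticks[-1] - ticks[0])
print(f"±{max_jitter_us:.0f} μs jitter, {achieved_hz:.1f} Hz achieved")
```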
All-Routes Latency
HORUS automatically selects the optimal communication path based on topology (same-thread, cross-thread, cross-process) and producer/consumer count. This benchmark measures the latency of each automatically-selected route.
Benchmark Results
| Scenario | Latency | Target | Notes |
|---|---|---|---|
| Same thread, 1:1 | 16ns | 60ns | Ultra-fast direct path |
| Cross-thread, 1:1 | 11ns | 60ns | Optimized single-producer path |
| Cross-process, 1:1 | 182ns | 100ns | Shared memory path |
| Cross-process, N:1 | 244ns | 150ns | Multi-producer shared memory |
| Cross-process, N:N | 187ns | 200ns | General cross-process |
Latency by Topology
| Topology | Producers | Consumers | Latency |
|---|---|---|---|
| Same thread | 1 | 1 | ~16ns |
| Same process | 1 | 1 | ~11ns |
| Same process | N | 1 | ~15ns |
| Same process | 1 | N | ~15ns |
| Same process | N | N | ~20ns |
| Cross process | 1 | 1 | ~180ns |
| Cross process | N | 1 | ~250ns |
| Cross process | 1 | N | ~200ns |
| Cross process | N | N | ~190ns |
Key Achievements
- Sub-20ns for same-process communication
- Sub-200ns for cross-process 1:1
- Sub-300ns for multi-producer cross-process
- Zero configuration — optimal path selected automatically
- Seamless migration — path upgrades transparently as topology changes
Running the Benchmark
cd horus
cargo build --release -p horus_benchmarks
./target/release/all_paths_latency
Summary
HORUS provides production-grade performance for real robotics applications:
Automatic Path Selection (Recommended):
- 16 ns — Same-thread
- 11 ns — Cross-thread, 1:1
- 182 ns — Cross-process, 1:1
- 244 ns — Cross-process, multi-producer
- 187 ns — Cross-process, multi-producer/consumer
Point-to-Point (1:1):
- 87 ns — Send only (ultra-low latency)
- 161 ns — CmdVel (motor control)
- 262 ns — Send+Recv round-trip
- ~400 ns — IMU (sensor fusion)
- ~120 μs — PointCloud with 10K points
Multi-Producer/Consumer (N:N):
- ~313 ns — CmdVel (motor control)
- ~500 ns — IMU (sensor fusion)
- ~2.2 μs — LaserScan (2D lidar)
- ~1.1 μs — Odometry (localization)
- ~360 μs — PointCloud with 10K points
Ready for production deployment in demanding robotics applications requiring real-time performance with complex data types.
Python Benchmarks
Real measurements from horus_py/benchmarks/bench_python.py. Python 3.12, Linux x86_64.
Message Send/Recv Latency
Single-process Topic roundtrip (send + recv):
| Message type | Median | Path |
|---|---|---|
| CmdVel (typed) | 1.5μs | Zero-copy Pod memcpy |
| Pose2D (typed) | 1.6μs | Zero-copy Pod memcpy |
| Imu (typed) | 1.6μs | Zero-copy Pod memcpy |
| dict {"v": 1.0} | 5.4μs | GenericMessage + MessagePack |
| dict {"x", "y", "z"} | 9.1μs | GenericMessage + MessagePack |
| dict ~1KB | 52μs | GenericMessage + MessagePack |
Typed messages are 3-30x faster than dicts because they skip serialization entirely.
Zero-Copy Image/PointCloud
to_numpy() returns a view into shared memory — constant time regardless of data size:
| Data | to_numpy() (zero-copy) | np.copy() (naive) | Speedup |
|---|---|---|---|
| Image 320×240 (225KB) | 3.0μs | 3.0μs | 1x |
| Image 640×480 (900KB) | 3.0μs | 13μs | 4x |
| Image 1280×720 (2.7MB) | 3.0μs | 75μs | 25x |
| Image 1920×1080 (6MB) | 3.0μs | 178μs | 59x |
| PointCloud 10K pts (120KB) | 2.8μs | — | — |
| PointCloud 100K pts (1.2MB) | 2.8μs | — | — |
| DepthImage 640×480 (1.2MB) | 2.8μs | — | — |
| np.from_dlpack() (DLPack) | 979ns | — | — |
The key insight: 3μs for a 6MB 1080p image vs 178μs to copy it. This is the DLPack/shared memory pool advantage — Python gets a pointer to the data, not a copy.
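The aliasing behind np.from_dlpack() is easy to verify: NumPy wraps the exporter's memory rather than copying it (shown here with a plain ndarray as the DLPack exporter, a stand-in for a HORUS image handle):

```python
import numpy as np

# np.from_dlpack consumes any object exposing __dlpack__ and wraps
# the SAME memory — a pointer handoff, no byte copy, O(1) at any size.
src = np.zeros((1080, 1920, 3), dtype=np.uint8)
view = np.from_dlpack(src)

src[0, 0, 0] = 255               # mutate the source buffer...
assert view[0, 0, 0] == 255      # ...and the DLPack view sees it

# Same base pointer on both sides
assert view.__array_interface__["data"][0] == src.__array_interface__["data"][0]
```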
Node Tick Overhead
How fast can the Rust scheduler drive Python nodes:
| Scenario | Throughput | Per-tick |
|---|---|---|
| Empty tick (Rust → Python → Rust) | ~530 Hz | 1.9ms |
| Tick + send(dict) | ~525 Hz | 1.9ms |
| Tick + send(dict) + recv(dict) | ~525 Hz | 1.9ms |
The bottleneck is Python's GIL (~1.8ms per acquisition), not the Rust binding (~30μs). The Rust scheduler, IPC, and safety monitoring add negligible overhead.
Generic Message Sizes
MessagePack serialization for common robotics data:
| Payload | Bytes | Fits in GenericMessage? |
|---|---|---|
| Empty dict | 1 | Yes (4KB max) |
| CmdVel-like {linear, angular} | 34 | Yes |
| IMU-like (accel + gyro + mag) | 100 | Yes |
| LaserScan 360 rays | 3,251 | Yes |
| 10 detections | 374 | Yes |
Running Python Benchmarks
cd horus_py
PYTHONPATH=. python3 benchmarks/bench_python.py
Next Steps
- Learn how to maximize performance: Performance Optimization
- Explore message types: Message Types
- See usage examples: Examples
- Get started: Quick Start
Build faster. Debug easier. Deploy with confidence.