Shared Memory
Your robot's camera publishes 30 frames/sec at 2MB each. Copying through kernel pipes would consume 60MB/s of bandwidth and add milliseconds of latency. HORUS puts the data in shared memory — zero copies, sub-200ns latency, no kernel involvement.
This page explains how it works underneath. You never interact with SHM directly — Topic::new() handles everything. But understanding the architecture helps you debug performance issues, choose the right message types, and work with multi-process systems.
SHM Directory Structure
When a HORUS application starts, horus_sys creates a namespace directory on the filesystem:
/dev/shm/horus_<namespace>/ # Linux (tmpfs — backed by RAM)
├── topics/ # Ring buffer files, one per topic
│ ├── horus_cmd_vel # SHM region for "cmd_vel" topic
│ ├── horus_cmd_vel.meta # Discovery metadata (JSON)
│ ├── horus_scan # SHM region for "scan" topic
│ └── horus_scan.meta
├── nodes/ # Node presence files (JSON)
│ ├── motor_controller.json
│ └── lidar_driver.json
├── tensors/ # TensorPool regions (for Image, PointCloud)
│ └── tensor_pool_a3f2c1d0
├── scheduler/ # Scheduler state
├── network/ # Network transport state
└── logs/ # Structured log files
Namespace generation (priority order):
- `HORUS_NAMESPACE` env var — set by `horus launch` for multi-robot deployments
- Auto-generated `sid{N}_uid{N}` from the session ID and user ID — isolates different users and terminal sessions
This means two horus run commands in different terminals get different namespaces and don't interfere. Use HORUS_NAMESPACE=shared to explicitly share topics across terminals.
Ring Buffer Architecture
Every topic is backed by a lock-free ring buffer in a single mmap'd SHM region:
┌─────────────────────────────────────────────────────────────────┐
│ SHM Region (mmap'd file) │
├─────────────┬──────────────────┬────────────────────────────────┤
│ TopicHeader │ Sequence Array │ Data Slots │
│ 640 bytes │ capacity × 8B │ capacity × slot_size │
│ (10 cache │ (per-slot write │ (message data) │
│ lines) │ completion flag) │ │
└─────────────┴──────────────────┴────────────────────────────────┘
TopicHeader is exactly 640 bytes (10 cache lines), #[repr(C, align(64))]. The critical optimization: producer and consumer state live on separate cache lines to prevent false sharing:
Cache Line 2 (bytes 64-127): head, capacity, capacity_mask, slot_size
↑ ONLY the producer writes here
Cache Line 3 (bytes 128-191): tail
↑ ONLY the consumer writes here
This separation is the single most important optimization for achieving sub-20ns intra-process latency. Without it, every send() would invalidate the consumer's cache line and vice versa.
The remaining cache lines hold: publisher/subscriber counts, participant tracking (up to 16 participants with PID, thread ID, role, and lease expiry), and the type name for runtime validation.
Capacity is always a power of two. Index calculation uses a bitmask (seq & (capacity - 1)) instead of modulo — one CPU cycle instead of ~20 cycles for division. Auto-sizing: PAGE_SIZE / sizeof(T), clamped to [16, 1024].
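The bitmask indexing and the auto-sizing rule can be sketched as follows. The function names are illustrative, not the actual HORUS internals, and rounding the clamped result up to a power of two is an assumption made so the bitmask stays valid:

```rust
/// Slot index via bitmask; valid only because capacity is a power of two.
fn slot_index(seq: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    seq & (capacity - 1) // single AND instead of `seq % capacity`
}

/// Auto-sizing as described: PAGE_SIZE / sizeof(T), clamped to [16, 1024].
/// The final power-of-two rounding is an assumption of this sketch.
fn auto_capacity(type_size: usize) -> usize {
    const PAGE_SIZE: usize = 4096;
    (PAGE_SIZE / type_size.max(1))
        .clamp(16, 1024)
        .next_power_of_two()
}

fn main() {
    assert_eq!(slot_index(1025, 1024), 1);
    assert_eq!(auto_capacity(64), 64);        // 4096 / 64 = 64
    assert_eq!(auto_capacity(8), 512);        // 4096 / 8 = 512
    assert_eq!(auto_capacity(1_000_000), 16); // clamped up to the minimum
}
```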
Slot sizing for POD types:
- Small types (size + 8 ≤ 64): co-located layout — sequence number and data share one 64-byte cache line. This puts the write-completion flag and data on the same cache line, so the reader needs only one memory load.
- Larger types: slot size equals `sizeof(T)`, with the sequence number in a separate array.
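A minimal sketch of the co-located layout for small types, assuming a plain `#[repr(C, align(64))]` slot struct (the real HORUS slot layout may differ in detail):

```rust
use std::mem::{align_of, size_of};

/// Illustrative co-located slot: the sequence number (write-completion
/// flag) and the payload share one 64-byte cache line, so the reader
/// needs a single cache-line load to see both.
#[repr(C, align(64))]
struct ColocatedSlot<T> {
    seq: u64, // written last by the producer, read first by the consumer
    data: T,  // eligible only when size_of::<T>() + 8 <= 64
}

#[repr(C)]
struct WheelSpeed {
    left: f32,
    right: f32,
    timestamp_ns: u64,
}

fn main() {
    // 16 bytes of payload plus the 8-byte seq fits in one cache line
    assert!(size_of::<WheelSpeed>() + 8 <= 64);
    assert_eq!(size_of::<ColocatedSlot<WheelSpeed>>(), 64);
    assert_eq!(align_of::<ColocatedSlot<WheelSpeed>>(), 64);
}
```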
Three Transport Paths
When you call topic.send(msg), one of three paths executes depending on the message type:
1. POD Zero-Copy (~3-150ns)
For types with only primitive fields and no heap pointers (CmdVel, Imu, Pose2D, any #[repr(C)] struct without String, Vec, or Box):
// Auto-detected — no user annotation needed
message! {
    #[fixed]
    WheelSpeed {
        left: f32,
        right: f32,
        timestamp_ns: u64,
    }
}
How it works: Raw memcpy from your struct directly into the ring buffer slot. No serialization, no allocation. For buffers ≥4KB, HORUS uses AVX2 non-temporal streaming stores (runtime-detected) to bypass the CPU cache and write directly to RAM.
POD auto-detection: !std::mem::needs_drop::<T>() && std::mem::size_of::<T>() > 1. No user annotation needed — HORUS checks at compile time via monomorphization.
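The detection predicate can be written directly in safe Rust; the wrapper function here is illustrative, not the HORUS API:

```rust
use std::mem::{needs_drop, size_of};

/// Sketch of the compile-time POD check described above: no destructor
/// (hence no owned heap pointers) and size greater than one byte.
fn is_pod<T>() -> bool {
    !needs_drop::<T>() && size_of::<T>() > 1
}

#[repr(C)]
struct CmdVel {
    linear: f64,
    angular: f64,
}

struct LogEntry {
    message: String,
}

fn main() {
    assert!(is_pod::<CmdVel>());    // plain fields: memcpy path
    assert!(!is_pod::<LogEntry>()); // String has Drop: serde path
    assert!(!is_pod::<u8>());       // 1-byte types excluded by the size check
}
```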
2. Serde Serialization (~167ns cross-process)
For types with heap-allocated fields (String, Vec<T>, HashMap):
message! {
    LogEntry {
        message: String,
        tags: Vec<String>,
        timestamp_ns: u64,
    }
}
How it works: Bincode serialization into the ring buffer slot (up to slot_size, default 8KB). Deserialization on the reader side. Slower than POD but works with any Serialize + Deserialize type.
Spill mechanism: Messages with serialized size above 4KB are "spilled" to a TensorPool region instead of inline storage. A 40-byte SpillDescriptor with a magic sentinel is written to the ring slot. The receiver detects the sentinel and reads from the pool.
3. Pool-Backed Descriptors (~50ns for descriptor)
For large data types (Image, PointCloud, DepthImage):
let topic: Topic<Image> = Topic::new("camera.rgb")?;
let mut img = Image::new(640, 480, "rgb8");
// Write pixels directly into pool-backed memory
img.as_mut_slice().copy_from_slice(&frame_data);
topic.send(img);
How it works: The actual pixel/point data lives in a TensorPool SHM region (separate from the ring buffer). Only a small descriptor (288-336 bytes, POD) flows through the ring buffer. The pool handles reference counting via atomics.
Performance comparison:
| Transport | Latency (same process) | Latency (cross process) | Suitable for |
|---|---|---|---|
| POD zero-copy | ~3-36ns | ~50-167ns | Fixed-size sensor data, commands |
| Serde | ~50ns | ~167ns+ | Variable-length strings, logs |
| Pool-backed | ~50ns (descriptor) | ~50ns (descriptor) | Images, point clouds, depth maps |
Backend Selection
HORUS automatically selects the optimal ring buffer backend based on the runtime topology. You never choose a backend — it's detected from how many publishers and subscribers exist.
| Backend | Latency | When Used |
|---|---|---|
| DirectChannel | ~3ns | Same thread, POD type |
| SpscIntra | ~18ns | Same process, 1 pub, 1 sub |
| SpmcIntra | ~24ns | Same process, 1 pub, N subs |
| MpscIntra | ~26ns | Same process, N pubs, 1 sub |
| MpmcIntra | ~36ns | Same process, N pubs, N subs |
| PodShm | ~50ns | Cross-process, POD type |
| MpscShm | ~65ns | Cross-process, N pubs, 1 sub |
| SpmcShm | ~70ns | Cross-process, 1 pub, N subs |
| SpscShm | ~85ns | Cross-process, 1 pub, 1 sub, non-POD |
| MpmcShm | ~167ns | Cross-process, N pubs, N subs |
Intra-process backends use heap memory (Box<[UnsafeCell<MaybeUninit<T>>]>) — no mmap overhead. The SHM file still exists for the control plane (header for cross-process discovery) but the data plane is heap-backed.
Live migration: When a second process joins a topic, the backend migrates from intra-process to cross-process automatically. The migration uses a CAS-based lock, drains in-flight messages, increments an epoch counter, and updates the backend mode atomically. Existing publishers and subscribers adapt within 32 messages (~1-10μs).
See Topics for the full backend selection details and communication patterns.
Cross-Process Discovery
Two processes sharing a topic need no configuration — they discover each other via the SHM filesystem.
How it works:
- Process A calls `Topic::new("imu")` → creates `/dev/shm/horus_<ns>/topics/horus_imu` (owner) + a `.meta` JSON file
- Process B calls `Topic::new("imu")` → opens the existing file (non-owner) → reads the header → validates type compatibility → registers as a participant
- Both processes mmap the same file — writes by A are immediately visible to B
Join protocol: The owner writes the header and sets a magic number last (with a memory fence). Joiners wait for the magic with exponential backoff (1ms base, 2s deadline). This handles the race where the file exists but the header isn't initialized yet.
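The joiner's side of this handshake can be sketched as below, assuming the magic lives in a single atomic word. The magic value, struct layout, and backoff cap are illustrative; the 1ms base and 2s deadline come from the text:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

const MAGIC: u64 = 0xC0FF_EE00; // illustrative sentinel, not the real value

/// Spin with exponential backoff until the owner publishes the magic.
/// The Acquire load pairs with the owner's Release store (the memory
/// fence in the text): seeing the magic guarantees all earlier header
/// writes are visible too.
fn wait_for_magic(slot: &AtomicU64, deadline: Duration) -> bool {
    let start = Instant::now();
    let mut backoff = Duration::from_millis(1); // 1ms base, as described
    while start.elapsed() < deadline {
        if slot.load(Ordering::Acquire) == MAGIC {
            return true;
        }
        std::thread::sleep(backoff);
        backoff = (backoff * 2).min(Duration::from_millis(100));
    }
    false
}

fn main() {
    let slot = AtomicU64::new(0);
    // Owner side: initialize the header fields, then store the magic *last*.
    slot.store(MAGIC, Ordering::Release);
    assert!(wait_for_magic(&slot, Duration::from_secs(2)));
}
```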
Type validation: When a joiner opens an existing topic, it checks type_size and type_name in the header against its own type. If they don't match (e.g., Process A uses CmdVel but Process B uses Twist), the join fails with an error. This prevents silent data corruption from type mismatches across processes.
Participant tracking: The header tracks up to 16 concurrent participants (publishers and subscribers). Each entry records the PID, thread ID hash, role (publisher/subscriber), and a lease expiry timestamp. Stale participants are detected by expired leases and reclaimed automatically.
Meta files: Each topic creates a .meta JSON file alongside the SHM file containing the topic name, size, creator PID, and creation timestamp. These are used by horus topic list and the discovery system on platforms where direct SHM file scanning isn't available (macOS, Windows).
Discovery tools:
- `horus topic list` scans the SHM directory and reads `.meta` files to show all active topics
- `horus node list` reads node presence files from `nodes/` to show running nodes
- Node presence files include PID, subscribed topics, published topics, and are validated via liveness checks
Platform Differences
| Feature | Linux | macOS | Windows | Fallback |
|---|---|---|---|---|
| SHM mechanism | /dev/shm tmpfs + mmap | shm_open() (Mach VM) | CreateFileMappingW | /tmp file + mmap |
| Base directory | /dev/shm/horus_<ns>/ | /tmp/horus_<ns>/ | %TEMP%\horus_<ns>\ | /tmp/horus_<ns>/ |
| Stale detection | flock(LOCK_EX|LOCK_NB) | PID check via .meta | PID check via .meta | flock |
| Topic name rule | Any valid filename chars | No slashes — shm_open limitation | Any valid chars | Any valid filename chars |
| Discovery | File scan + .meta | .meta files only | .meta files only | File scan + .meta |
macOS topic naming: shm_open() does not allow / characters in SHM object names (beyond the leading /). Use dots instead of slashes: sensor.imu not sensor/imu. This is enforced across all platforms for portability.
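A tiny validator for this rule might look like the following; the function is hypothetical, not part of the HORUS API:

```rust
/// Reject topic names containing '/', mirroring the documented
/// shm_open() portability rule, and suggest the dotted form.
fn validate_topic_name(name: &str) -> Result<(), String> {
    if name.contains('/') {
        Err(format!(
            "topic name '{}' contains '/'; use dots instead, e.g. '{}'",
            name,
            name.replace('/', ".")
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(validate_topic_name("sensor.imu").is_ok());
    assert!(validate_topic_name("sensor/imu").is_err());
}
```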
macOS initialization race: Between shm_open(O_CREAT) and ftruncate(), a non-owner process sees st_size == 0. HORUS handles this with wait_for_shm_init() — exponential backoff (1ms base, 10 retries). If the timeout expires (creator assumed dead), it calls shm_unlink and re-creates.
Windows SHM: Uses named memory-mapped files backed by the system pagefile (CreateFileMappingW). Automatically released when all handles close — no manual cleanup needed. Discovery relies on .meta files in the %TEMP% directory.
Cleanup
Automatic Cleanup (you rarely need to do anything)
HORUS has three layers of automatic cleanup — users almost never need manual intervention:
Layer 1: Drop-based (normal exit)
When a process exits normally (Ctrl+C, .stop(), scope exit), Drop for ShmRegion removes the SHM file and .meta if it's the owner. The flock(LOCK_SH) is automatically released by the kernel when the file descriptor closes — even on panic.
Layer 2: Startup namespace cleanup (every command)
Every horus CLI command and every Scheduler::new() call runs cleanup_stale_namespaces() automatically. This scans for horus_sid*_uid* directories from dead sessions and removes them. Cost: <1ms. You don't call this — it happens silently.
Layer 3: Pre-run stale topic cleanup
Before every horus run, stale topics (no live processes AND older than 5 minutes) are removed automatically.
Stale SHM Detection
When a process crashes (even via SIGKILL), the kernel closes its file descriptors and releases flock locks. HORUS detects stale SHM files by attempting an exclusive lock:
flock(fd, LOCK_EX | LOCK_NB)
├── EWOULDBLOCK → file is alive (another process holds a shared lock)
└── success → file is stale (no shared locks held)
This is more reliable than PID-based detection — PIDs can be reused, and kill(pid, 0) races with process creation.
Manual Cleanup (escape hatch)
horus clean --shm is a manual escape hatch for edge cases. You should almost never need it because the automatic cleanup layers handle normal crashes. It exists for:
- Debugging SHM issues when automatic cleanup hasn't triggered yet
- Forcing a clean slate before benchmarking
- After `kill -9` in rapid succession (before the next `horus` command auto-cleans)
# Preview what would be cleaned
horus clean --shm --dry-run
# Manual cleanup (rarely needed)
horus clean --shm
# Nuclear option: clean everything (SHM + build cache)
horus clean --all
Advanced: SIMD Optimization
For messages ≥4KB, HORUS uses AVX2 non-temporal streaming stores to bypass the CPU cache:
Standard memcpy: src → L1 cache → L2 cache → RAM → L2 cache → L1 cache → dst
Streaming stores: src → RAM → dst (bypasses cache hierarchy)
This matters for large messages (images, point clouds) where the data would evict useful cache entries. Runtime detection via is_x86_feature_detected!("avx2") with fallback to standard memcpy on non-x86 or older CPUs. The threshold is configurable but defaults to 4KB.
Advanced: Dispatch Optimization
The ring buffer hot path uses function-pointer dispatch (not enum matching) for send/recv. Each Topic caches a function pointer to the correct backend-specific send/recv implementation. This eliminates ~7ns per message compared to a match chain.
Epoch checking is amortized: a process-local AtomicU64 is checked every message (~1ns L1 read), but the SHM header epoch is only read every 32 messages (~20ns mmap read). This keeps the fast path fast while still detecting backend migrations promptly.
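The amortization scheme can be sketched as follows. The struct and field names are illustrative; only the every-32-messages interval comes from the text:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the amortized epoch check: the cheap local counter is
/// consulted on every message, while the shared epoch (which lives in
/// the SHM header in HORUS) is only read every 32nd message.
struct FastPath<'a> {
    shared_epoch: &'a AtomicU64,
    cached_epoch: u64,
    msg_count: u64,
}

impl<'a> FastPath<'a> {
    /// Returns true when a backend migration (epoch bump) is detected.
    fn check_epoch(&mut self) -> bool {
        self.msg_count += 1;
        if self.msg_count % 32 != 0 {
            return false; // fast path: no shared-memory read
        }
        let now = self.shared_epoch.load(Ordering::Acquire);
        let changed = now != self.cached_epoch;
        self.cached_epoch = now;
        changed
    }
}

fn main() {
    let epoch = AtomicU64::new(0);
    let mut fp = FastPath { shared_epoch: &epoch, cached_epoch: 0, msg_count: 0 };
    epoch.store(1, Ordering::Release); // simulate a migration
    // Detected within 32 messages, matching the documented bound.
    let detected_at = (1..=32).find(|_| fp.check_epoch());
    assert_eq!(detected_at, Some(32));
}
```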
Design Decisions
Why mmap, not pipes or sockets? Pipes and sockets require kernel transitions for every message — write() + read() system calls add 1-5μs of overhead. Mmap'd shared memory is just regular memory access — the CPU reads/writes at RAM speed without entering the kernel. The tradeoff is complexity: ring buffer synchronization via atomics is harder to get right than write()/read().
Why flock for stale detection, not PID files? PID files require writing a file, reading it, and calling kill(pid, 0) — all of which race with process creation/destruction. flock is kernel-managed: the lock is automatically released when the process exits (even via SIGKILL), and the exclusive lock test is atomic. No race conditions.
Why cache-line separation for head/tail? On x86, when two cores write to the same cache line, the MESI protocol bounces ownership between them (false sharing). Each bounce costs ~40-80ns. By placing the producer's head and consumer's tail on separate 64-byte cache lines, we eliminate this entirely. The header is 640 bytes (10 cache lines) specifically to accommodate this layout.
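A reduced sketch of this layout, assuming two 64-byte-aligned field groups (the real 640-byte header holds many more fields than shown):

```rust
use std::sync::atomic::AtomicU64;

/// Producer-written state: one full cache line, padded by alignment.
#[repr(C, align(64))]
struct ProducerLine {
    head: AtomicU64,
    capacity: u64,
    capacity_mask: u64,
    slot_size: u64,
}

/// Consumer-written state: its own cache line.
#[repr(C, align(64))]
struct ConsumerLine {
    tail: AtomicU64,
}

#[repr(C)]
struct MiniHeader {
    producer: ProducerLine, // only the producer writes here
    consumer: ConsumerLine, // only the consumer writes here
}

fn main() {
    // Each group occupies exactly one cache line, so a producer store
    // to `head` never invalidates the consumer's line holding `tail`.
    assert_eq!(std::mem::size_of::<ProducerLine>(), 64);
    assert_eq!(std::mem::size_of::<ConsumerLine>(), 64);
    let h = MiniHeader {
        producer: ProducerLine {
            head: AtomicU64::new(0),
            capacity: 1024,
            capacity_mask: 1023,
            slot_size: 64,
        },
        consumer: ConsumerLine { tail: AtomicU64::new(0) },
    };
    let p = &h.producer as *const _ as usize;
    let c = &h.consumer as *const _ as usize;
    assert!(c - p >= 64); // head and tail at least a cache line apart
}
```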
Why auto-detect POD instead of requiring user annotation? needs_drop::<T>() is a compile-time check that's always correct — if a type has no destructor, it has no heap pointers, so raw memcpy is safe. Requiring users to annotate types is error-prone (they forget, or annotate incorrectly). Auto-detection means zero-copy "just works" for the right types.
Why auto-capacity as a power of two? Ring buffer index calculation via bitmask (seq & (capacity - 1)) is a single AND instruction (~1 cycle). Integer modulo (seq % capacity) requires division (~20 cycles). For a hot path that runs millions of times per second, this 19-cycle saving is significant.
Trade-offs
| Choice | Benefit | Cost |
|---|---|---|
| SHM ring buffers | Zero-copy, sub-200ns, no kernel | Complex synchronization, platform-specific code |
| Automatic backend selection | User doesn't think about topology | Migration latency on topology change (~10μs) |
| flock-based stale detection | Kernel-reliable, no race conditions | Linux/fallback only — macOS uses PID-based |
| Cache-line-aligned header | Eliminates false sharing (~40-80ns saved) | 640 bytes per topic (mostly unused padding) |
| AVX2 streaming stores | Bypass cache for large messages | x86-only, runtime detection overhead |
| Auto POD detection | Zero-copy without user annotation | Types with Drop always use Serde, even if logically copyable |
Common Issues
"Topic not found" across processes
Two processes must use the same namespace to see each other's topics. If you run horus run in two terminals, they get different auto-generated namespaces by default. Fix:
# Terminal 1
HORUS_NAMESPACE=robot horus run src/sensor.rs
# Terminal 2
HORUS_NAMESPACE=robot horus run src/controller.rs
Or use horus launch which sets the namespace automatically for all nodes.
Stale topics after crash
If a process crashes (SIGKILL, power loss), its SHM files persist. Other processes see stale data or fail to create topics with the same name.
# Check for stale files
horus clean --shm --dry-run
# Remove them
horus clean --shm
Type mismatch across processes
If Process A publishes CmdVel on "cmd_vel" but Process B subscribes with a different struct (different field layout, different size), the header validation fails. Both processes must use the exact same message type definition. Use horus msg hash CmdVel to verify type compatibility.
Topic names with slashes fail on macOS
shm_open() on macOS does not allow / in SHM object names. Always use dots: sensor.imu not sensor/imu. This is enforced across all platforms for portability.
Ring buffer overflow (dropped messages)
If a subscriber is slower than a publisher, the ring buffer overwrites old messages. The subscriber sees this via dropped_count():
let scan = topic.try_recv();
if topic.dropped_count() > 0 {
    hlog!(warn, "Dropped {} messages — subscriber too slow", topic.dropped_count());
}
This is by design — HORUS prioritizes freshness over completeness. If you need guaranteed delivery, increase capacity or reduce the publisher rate.
Memory usage is high
Each topic allocates capacity × slot_size bytes. For a Topic<Image> with 1024 capacity and 2MB frames, that's 2GB of SHM. Use smaller capacity for large types:
let topic: Topic<Image> = Topic::with_capacity("camera.rgb", 4)?; // Only 4 slots
Or use pool-backed types (Image, PointCloud) which store data in a shared TensorPool with refcounting.
Inspecting SHM at Runtime
# List all active topics with backends and latency
horus topic list --verbose
# Watch a topic in real-time
horus topic echo sensor.imu
# Measure publishing rate
horus topic hz sensor.imu
# Measure bandwidth
horus topic bw camera.rgb
# See what nodes are running
horus node list
# On Linux, inspect SHM files directly
ls -la /dev/shm/horus_*/topics/
The horus monitor web UI and TUI also show real-time topic topology, message rates, and backend types for every topic.
See Also
- Topics — Backend selection, communication patterns, topic API
- Custom Messages & Performance — POD types, `#[fixed]` attribute, zero-copy
- Multi-Process — Running nodes across processes, cross-process topics
- Topic API — `Topic::new()`, `send()`, `recv()`, `try_recv()`
- Performance Optimization — Benchmarks, tuning, profiling
- CLI Reference — `horus clean --shm`, `horus topic list`