Shared Memory

Your robot's camera publishes 30 frames/sec at 2MB each. Copying through kernel pipes would consume 60MB/s of bandwidth and add milliseconds of latency. HORUS puts the data in shared memory — zero copies, sub-200ns latency, no kernel involvement.

This page explains how it works underneath. You never interact with SHM directly — Topic::new() handles everything. But understanding the architecture helps you debug performance issues, choose the right message types, and work with multi-process systems.


SHM Directory Structure

When a HORUS application starts, horus_sys creates a namespace directory on the filesystem:

/dev/shm/horus_<namespace>/        # Linux (tmpfs — backed by RAM)
├── topics/                        # Ring buffer files, one per topic
│   ├── horus_cmd_vel              # SHM region for "cmd_vel" topic
│   ├── horus_cmd_vel.meta         # Discovery metadata (JSON)
│   ├── horus_scan                 # SHM region for "scan" topic
│   └── horus_scan.meta
├── nodes/                         # Node presence files (JSON)
│   ├── motor_controller.json
│   └── lidar_driver.json
├── tensors/                       # TensorPool regions (for Image, PointCloud)
│   └── tensor_pool_a3f2c1d0
├── scheduler/                     # Scheduler state
├── network/                       # Network transport state
└── logs/                          # Structured log files

Namespace generation (priority order):

  1. HORUS_NAMESPACE env var — set by horus launch for multi-robot deployments
  2. Auto-generated sid{N}_uid{N} from the session ID and user ID — isolates different users and terminal sessions

This means two horus run commands in different terminals get different namespaces and don't interfere. Use HORUS_NAMESPACE=shared to explicitly share topics across terminals.


Ring Buffer Architecture

Every topic is backed by a lock-free ring buffer in a single mmap'd SHM region:

┌─────────────────────────────────────────────────────────────────┐
│                   SHM Region (mmap'd file)                      │
├─────────────┬──────────────────┬────────────────────────────────┤
│ TopicHeader │  Sequence Array  │          Data Slots            │
│  640 bytes  │ capacity × 8B    │    capacity × slot_size        │
│  (10 cache  │ (per-slot write  │    (message data)              │
│   lines)    │  completion flag)│                                │
└─────────────┴──────────────────┴────────────────────────────────┘

TopicHeader is exactly 640 bytes (10 cache lines), #[repr(C, align(64))]. The critical optimization: producer and consumer state live on separate cache lines to prevent false sharing:

Cache Line 2 (bytes 64-127):  head, capacity, capacity_mask, slot_size
                               ↑ ONLY the producer writes here

Cache Line 3 (bytes 128-191): tail
                               ↑ ONLY the consumer writes here

This separation is the single most important optimization for achieving sub-20ns intra-process latency. Without it, every send() would invalidate the consumer's cache line and vice versa.
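A minimal Rust sketch of this layout idea, using explicit padding and `#[repr(C, align(64))]`. The field names and the contents of the first cache line are illustrative, not HORUS's actual header definition — only the head/tail offsets (64 and 128) follow the text:

```rust
use std::mem::{align_of, offset_of, size_of};
use std::sync::atomic::AtomicU64;

// Illustrative header: producer and consumer state pinned to separate
// 64-byte cache lines via explicit padding. Field names are ours.
#[repr(C, align(64))]
#[allow(dead_code)]
struct Header {
    // cache line 1 (bytes 0-63): identity
    magic: u64,
    _pad0: [u8; 56],
    // cache line 2 (bytes 64-127): ONLY the producer writes here
    head: AtomicU64,
    capacity: u64,
    capacity_mask: u64,
    slot_size: u64,
    _pad1: [u8; 32],
    // cache line 3 (bytes 128-191): ONLY the consumer writes here
    tail: AtomicU64,
    _pad2: [u8; 56],
}

fn main() {
    assert_eq!(align_of::<Header>(), 64);
    assert_eq!(offset_of!(Header, head), 64);  // producer's cache line
    assert_eq!(offset_of!(Header, tail), 128); // consumer's cache line
    println!("header spans {} cache lines", size_of::<Header>() / 64);
}
```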

The remaining cache lines hold: publisher/subscriber counts, participant tracking (up to 16 participants with PID, thread ID, role, and lease expiry), and the type name for runtime validation.

Capacity is always a power of two. Index calculation uses a bitmask (seq & (capacity - 1)) instead of modulo — one CPU cycle instead of ~20 cycles for division. Auto-sizing: PAGE_SIZE / sizeof(T), clamped to [16, 1024].
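The index math and the auto-sizing rule can be sketched as plain functions. The constant names and the power-of-two rounding choice are ours; the bitmask, the PAGE_SIZE division, and the [16, 1024] clamp follow the text:

```rust
// Sketch of power-of-two index math and capacity auto-sizing.
const PAGE_SIZE: usize = 4096;

fn slot_index(seq: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    seq & (capacity - 1) // single AND instruction, no division
}

fn auto_capacity(elem_size: usize) -> usize {
    // PAGE_SIZE / sizeof(T), clamped to [16, 1024]; we round to a
    // power of two so the bitmask trick above stays valid.
    let raw = (PAGE_SIZE / elem_size.max(1)).clamp(16, 1024);
    raw.next_power_of_two().min(1024)
}

fn main() {
    assert_eq!(slot_index(1025, 1024), 1);    // wraps around the ring
    assert_eq!(auto_capacity(8), 512);        // 4096 / 8 = 512
    assert_eq!(auto_capacity(2_000_000), 16); // huge type → clamped low
}
```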

Slot sizing for POD types:

  • Small types (size + 8 ≤ 64): co-located layout — sequence number and data share one 64-byte cache line. This puts the write-completion flag and data on the same cache line, so the reader needs only one memory load.
  • Larger types: slot size equals sizeof(T), with the sequence number in a separate array.
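The slot-sizing rule above can be sketched as a small decision function. The enum and names are ours; the 8-byte sequence word and 64-byte cache line follow the text:

```rust
// Sketch of the slot-sizing rule: small types share a cache line with
// their sequence word; larger types get a bare slot plus an out-of-line
// sequence array.
const SEQ_BYTES: usize = 8;
const CACHE_LINE: usize = 64;

#[derive(Debug, PartialEq)]
enum Layout {
    CoLocated { slot: usize }, // seq + data in one cache line
    Separate { slot: usize },  // seq lives in the sequence array
}

fn slot_layout(type_size: usize) -> Layout {
    if type_size + SEQ_BYTES <= CACHE_LINE {
        Layout::CoLocated { slot: CACHE_LINE }
    } else {
        Layout::Separate { slot: type_size }
    }
}

fn main() {
    // A 16-byte WheelSpeed-style struct: 16 + 8 ≤ 64 → co-located,
    // so the reader needs only one memory load.
    assert_eq!(slot_layout(16), Layout::CoLocated { slot: 64 });
    // A 4KB payload gets its own slot and a separate sequence entry.
    assert_eq!(slot_layout(4096), Layout::Separate { slot: 4096 });
}
```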

Three Transport Paths

When you call topic.send(msg), one of three paths executes depending on the message type:

1. POD Zero-Copy (~3-150ns)

For types with only primitive fields and no heap pointers (CmdVel, Imu, Pose2D, any #[repr(C)] struct without String, Vec, or Box):

// Auto-detected — no user annotation needed
message! {
    #[fixed]
    WheelSpeed {
        left: f32,
        right: f32,
        timestamp_ns: u64,
    }
}

How it works: Raw memcpy from your struct directly into the ring buffer slot. No serialization, no allocation. For buffers ≥4KB, HORUS uses AVX2 non-temporal streaming stores (runtime-detected) to bypass the CPU cache and write directly to RAM.

POD auto-detection: !std::mem::needs_drop::<T>() && std::mem::size_of::<T>() > 1. No user annotation needed — HORUS checks at compile time via monomorphization.
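The detection predicate from the text, written out as a standalone function — a type with no Drop glue owns no heap pointers, so a raw memcpy is safe:

```rust
use std::mem::needs_drop;

fn is_pod<T>() -> bool {
    !needs_drop::<T>() && std::mem::size_of::<T>() > 1
}

#[repr(C)]
#[allow(dead_code)]
struct WheelSpeed { left: f32, right: f32, timestamp_ns: u64 }

#[allow(dead_code)]
struct LogEntry { message: String, tags: Vec<String>, timestamp_ns: u64 }

fn main() {
    assert!(is_pod::<WheelSpeed>()); // plain fields → memcpy path
    assert!(!is_pod::<LogEntry>());  // String/Vec have Drop → Serde path
}
```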

2. Serde Serialization (~167ns cross-process)

For types with heap-allocated fields (String, Vec<T>, HashMap):

message! {
    LogEntry {
        message: String,
        tags: Vec<String>,
        timestamp_ns: u64,
    }
}

How it works: Bincode serialization into the ring buffer slot (up to slot_size, default 8KB). Deserialization on the reader side. Slower than POD but works with any Serialize + Deserialize type.

Spill mechanism: Messages with serialized size above 4KB are "spilled" to a TensorPool region instead of inline storage. A 40-byte SpillDescriptor with a magic sentinel is written to the ring slot. The receiver detects the sentinel and reads from the pool.
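A hedged sketch of the spill decision. The magic value, descriptor fields, and helper names are illustrative — only the 40-byte descriptor size, the sentinel idea, and the 4KB threshold come from the text:

```rust
// Payloads over the inline threshold go to a pool; a small descriptor
// with a magic sentinel is written to the ring slot instead.
const SPILL_MAGIC: u64 = 0x5350_494C_4C45_4421; // illustrative sentinel
const INLINE_MAX: usize = 4096;

#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct SpillDescriptor {
    magic: u64,
    pool_offset: u64,
    len: u64,
    generation: u64,
    checksum: u64, // 5 × u64 = 40 bytes total
}

enum SlotPayload<'a> {
    Inline(&'a [u8]),
    Spilled(SpillDescriptor),
}

fn encode(payload: &[u8], pool_offset: u64) -> SlotPayload<'_> {
    if payload.len() <= INLINE_MAX {
        SlotPayload::Inline(payload)
    } else {
        SlotPayload::Spilled(SpillDescriptor {
            magic: SPILL_MAGIC, // receiver detects this and reads the pool
            pool_offset,
            len: payload.len() as u64,
            generation: 0,
            checksum: 0,
        })
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<SpillDescriptor>(), 40);
    assert!(matches!(encode(&[0u8; 8192], 128), SlotPayload::Spilled(_)));
    assert!(matches!(encode(&[0u8; 16], 0), SlotPayload::Inline(_)));
}
```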

3. Pool-Backed Descriptors (~50ns for descriptor)

For large data types (Image, PointCloud, DepthImage):

let topic: Topic<Image> = Topic::new("camera.rgb")?;
let mut img = Image::new(640, 480, "rgb8");
// Write pixels directly into pool-backed memory
img.as_mut_slice().copy_from_slice(&frame_data);
topic.send(img);

How it works: The actual pixel/point data lives in a TensorPool SHM region (separate from the ring buffer). Only a small descriptor (288-336 bytes, POD) flows through the ring buffer. The pool handles reference counting via atomics.

Performance comparison:

Transport       Latency (same process)   Latency (cross process)   Suitable for
POD zero-copy   ~3-36ns                  ~50-167ns                 Fixed-size sensor data, commands
Serde           ~50ns                    ~167ns+                   Variable-length strings, logs
Pool-backed     ~50ns (descriptor)       ~50ns (descriptor)        Images, point clouds, depth maps

Backend Selection

HORUS automatically selects the optimal ring buffer backend based on the runtime topology. You never choose a backend — it's detected from how many publishers and subscribers exist.

Backend        Latency   When Used
DirectChannel  ~3ns      Same thread, POD type
SpscIntra      ~18ns     Same process, 1 pub, 1 sub
SpmcIntra      ~24ns     Same process, 1 pub, N subs
MpscIntra      ~26ns     Same process, N pubs, 1 sub
MpmcIntra      ~36ns     Same process, N pubs, N subs
PodShm         ~50ns     Cross-process, POD type
MpscShm        ~65ns     Cross-process, N pubs, 1 sub
SpmcShm        ~70ns     Cross-process, 1 pub, N subs
SpscShm        ~85ns     Cross-process, 1 pub, 1 sub, non-POD
MpmcShm        ~167ns    Cross-process, N pubs, N subs

Intra-process backends use heap memory (Box<[UnsafeCell<MaybeUninit<T>>]>) — no mmap overhead. The SHM file still exists for the control plane (header for cross-process discovery) but the data plane is heap-backed.

Live migration: When a second process joins a topic, the backend migrates from intra-process to cross-process automatically. The migration uses a CAS-based lock, drains in-flight messages, increments an epoch counter, and updates the backend mode atomically. Existing publishers and subscribers adapt within 32 messages (~1-10μs).

See Topics for the full backend selection details and communication patterns.


Cross-Process Discovery

Two processes sharing a topic need no configuration — they discover each other via the SHM filesystem.

How it works:

  1. Process A calls Topic::new("imu") → creates /dev/shm/horus_<ns>/topics/horus_imu (owner) + .meta JSON file
  2. Process B calls Topic::new("imu") → opens the existing file (non-owner) → reads header → validates type compatibility → registers as participant
  3. Both processes mmap the same file — writes by A are immediately visible to B

Join protocol: The owner writes the header and sets a magic number last (with a memory fence). Joiners wait for the magic with exponential backoff (1ms base, 2s deadline). This handles the race where the file exists but the header isn't initialized yet.
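The joiner's side of this protocol can be simulated in-process with an atomic standing in for the SHM header. Constants and names are ours; the 1ms backoff base and 2s deadline follow the text:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

const MAGIC: u64 = 0x484F_5255_5321; // illustrative magic number

// Spin on the magic word with exponential backoff until the owner
// publishes it, or give up at the deadline.
fn wait_for_magic(hdr: &AtomicU64, deadline: Duration) -> bool {
    let start = Instant::now();
    let mut backoff = Duration::from_millis(1); // 1ms base
    while start.elapsed() < deadline {
        // Acquire pairs with the owner's Release store: once we see the
        // magic, all of the owner's earlier header writes are visible too.
        if hdr.load(Ordering::Acquire) == MAGIC {
            return true;
        }
        thread::sleep(backoff);
        backoff = (backoff * 2).min(Duration::from_millis(100));
    }
    false
}

fn main() {
    let hdr = Arc::new(AtomicU64::new(0));
    let owner = Arc::clone(&hdr);
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(5)); // owner still initializing
        owner.store(MAGIC, Ordering::Release);   // header ready: magic last
    });
    assert!(wait_for_magic(&hdr, Duration::from_secs(2)));
}
```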

Type validation: When a joiner opens an existing topic, it checks type_size and type_name in the header against its own type. If they don't match (e.g., Process A uses CmdVel but Process B uses Twist), the join fails with an error. This prevents silent data corruption from type mismatches across processes.

Participant tracking: The header tracks up to 16 concurrent participants (publishers and subscribers). Each entry records the PID, thread ID hash, role (publisher/subscriber), and a lease expiry timestamp. Stale participants are detected by expired leases and reclaimed automatically.

Meta files: Each topic creates a .meta JSON file alongside the SHM file containing the topic name, size, creator PID, and creation timestamp. These are used by horus topic list and the discovery system on platforms where direct SHM file scanning isn't available (macOS, Windows).

Discovery tools:

  • horus topic list scans the SHM directory and reads .meta files to show all active topics
  • horus node list reads node presence files from nodes/ to show running nodes
  • Node presence files include PID, subscribed topics, published topics, and are validated via liveness checks

Platform Differences

Feature           Linux                      macOS                              Windows               Fallback
SHM mechanism     /dev/shm tmpfs + mmap      shm_open() (Mach VM)               CreateFileMappingW    /tmp file + mmap
Base directory    /dev/shm/horus_<ns>/       /tmp/horus_<ns>/                   %TEMP%\horus_<ns>\    /tmp/horus_<ns>/
Stale detection   flock(LOCK_EX|LOCK_NB)     PID check via .meta                PID check via .meta   flock
Topic name rule   Any valid filename chars   No slashes (shm_open limitation)   Any valid chars       Any valid filename chars
Discovery         File scan + .meta          .meta files only                   .meta files only      File scan + .meta

macOS topic naming: shm_open() does not allow / characters in SHM object names (beyond the leading /). Use dots instead of slashes: sensor.imu not sensor/imu. This is enforced across all platforms for portability.

macOS initialization race: Between shm_open(O_CREAT) and ftruncate(), a non-owner process sees st_size == 0. HORUS handles this with wait_for_shm_init() — exponential backoff (1ms base, 10 retries). If the timeout expires (creator assumed dead), it calls shm_unlink and re-creates.

Windows SHM: Uses named memory-mapped files backed by the system pagefile (CreateFileMappingW). Automatically released when all handles close — no manual cleanup needed. Discovery relies on .meta files in the %TEMP% directory.


Cleanup

Automatic Cleanup (you rarely need to do anything)

HORUS has three layers of automatic cleanup — users almost never need manual intervention:

Layer 1: Drop-based (normal exit)
When a process exits normally (Ctrl+C, .stop(), scope exit), Drop for ShmRegion removes the SHM file and .meta if it's the owner. The flock(LOCK_SH) is automatically released by the kernel when the file descriptor closes — even on panic.

Layer 2: Startup namespace cleanup (every command)
Every horus CLI command and every Scheduler::new() call runs cleanup_stale_namespaces() automatically. This scans for horus_sid*_uid* directories from dead sessions and removes them. Cost: <1ms. You don't call this — it happens silently.

Layer 3: Pre-run stale topic cleanup
Before every horus run, stale topics (no live processes AND older than 5 minutes) are removed automatically.

Stale SHM Detection

When a process crashes (even via SIGKILL), the kernel closes its file descriptors and releases flock locks. HORUS detects stale SHM files by attempting an exclusive lock:

flock(fd, LOCK_EX | LOCK_NB)
├── EWOULDBLOCK → file is alive (another process holds a shared lock)
└── success     → file is stale (no shared locks held)

This is more reliable than PID-based detection — PIDs can be reused, and kill(pid, 0) races with process creation.

Manual Cleanup (escape hatch)

horus clean --shm is a manual escape hatch for edge cases. You should almost never need it because the automatic cleanup layers handle normal crashes. It exists for:

  • Debugging SHM issues when automatic cleanup hasn't triggered yet
  • Forcing a clean slate before benchmarking
  • After kill -9 in rapid succession (before the next horus command auto-cleans)

# Preview what would be cleaned
horus clean --shm --dry-run

# Manual cleanup (rarely needed)
horus clean --shm

# Nuclear option: clean everything (SHM + build cache)
horus clean --all

Advanced: SIMD Optimization

For messages ≥4KB, HORUS uses AVX2 non-temporal streaming stores to bypass the CPU cache:

Standard memcpy:      src → L1 cache → L2 cache → RAM → L2 cache → L1 cache → dst
Streaming stores:     src → RAM → dst  (bypasses cache hierarchy)

This matters for large messages (images, point clouds) where the data would evict useful cache entries. Runtime detection via is_x86_feature_detected!("avx2") with fallback to standard memcpy on non-x86 or older CPUs. The threshold is configurable but defaults to 4KB.
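The size/feature dispatch can be sketched as a pure function (the enum and names are ours); on x86 the boolean would come from is_x86_feature_detected!("avx2"):

```rust
// Below the threshold a plain memcpy wins; at or above it, the
// streaming-store path is chosen when AVX2 is present at runtime.
#[derive(Debug, PartialEq)]
enum CopyStrategy {
    Memcpy,
    StreamingStores, // non-temporal stores bypass the cache hierarchy
}

const STREAM_THRESHOLD: usize = 4096; // 4KB default from the text

fn pick_strategy(len: usize, avx2: bool) -> CopyStrategy {
    if len >= STREAM_THRESHOLD && avx2 {
        CopyStrategy::StreamingStores
    } else {
        CopyStrategy::Memcpy // fallback on non-x86 / older CPUs
    }
}

fn main() {
    assert_eq!(pick_strategy(64, true), CopyStrategy::Memcpy);
    assert_eq!(pick_strategy(2 * 1024 * 1024, true), CopyStrategy::StreamingStores);
    assert_eq!(pick_strategy(2 * 1024 * 1024, false), CopyStrategy::Memcpy);
}
```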


Advanced: Dispatch Optimization

The ring buffer hot path uses function-pointer dispatch (not enum matching) for send/recv. Each Topic caches a function pointer to the correct backend-specific send/recv implementation. This eliminates ~7ns per message compared to a match chain.

Epoch checking is amortized: a process-local AtomicU64 is checked every message (~1ns L1 read), but the SHM header epoch is only read every 32 messages (~20ns mmap read). This keeps the fast path fast while still detecting backend migrations promptly.
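A sketch of the amortization (names are ours; the interval of 32 follows the text):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A cheap process-local counter gates the more expensive SHM-header
// epoch read to every 32nd message.
const EPOCH_CHECK_INTERVAL: u64 = 32;

struct FastPath {
    local_count: u64,
    cached_epoch: u64,
}

impl FastPath {
    // Returns true when a migration (epoch bump) was observed.
    fn maybe_check_epoch(&mut self, shm_epoch: &AtomicU64) -> bool {
        self.local_count += 1;
        if self.local_count % EPOCH_CHECK_INTERVAL != 0 {
            return false; // fast path: no header read at all
        }
        let seen = shm_epoch.load(Ordering::Acquire); // the ~20ns mmap read
        let changed = seen != self.cached_epoch;
        self.cached_epoch = seen;
        changed
    }
}

fn main() {
    let shm_epoch = AtomicU64::new(0);
    let mut fp = FastPath { local_count: 0, cached_epoch: 0 };
    shm_epoch.store(1, Ordering::Release); // simulate a backend migration
    let mut detected_at = 0;
    for i in 1..=64 {
        if fp.maybe_check_epoch(&shm_epoch) {
            detected_at = i;
            break;
        }
    }
    assert_eq!(detected_at, 32); // seen within 32 messages
}
```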


Design Decisions

Why mmap, not pipes or sockets? Pipes and sockets require kernel transitions for every message — write() + read() system calls add 1-5μs of overhead. Mmap'd shared memory is just regular memory access — the CPU reads/writes at RAM speed without entering the kernel. The tradeoff is complexity: ring buffer synchronization via atomics is harder to get right than write()/read().

Why flock for stale detection, not PID files? PID files require writing a file, reading it, and calling kill(pid, 0) — all of which race with process creation/destruction. flock is kernel-managed: the lock is automatically released when the process exits (even via SIGKILL), and the exclusive lock test is atomic. No race conditions.

Why cache-line separation for head/tail? On x86, when two cores write to the same cache line, the MESI protocol bounces ownership between them (false sharing). Each bounce costs ~40-80ns. By placing the producer's head and consumer's tail on separate 64-byte cache lines, we eliminate this entirely. The header is 640 bytes (10 cache lines) specifically to accommodate this layout.

Why auto-detect POD instead of requiring user annotation? needs_drop::<T>() is a compile-time check that's always correct — if a type has no destructor, it has no heap pointers, so raw memcpy is safe. Requiring users to annotate types is error-prone (they forget, or annotate incorrectly). Auto-detection means zero-copy "just works" for the right types.

Why auto-capacity as a power of two? Ring buffer index calculation via bitmask (seq & (capacity - 1)) is a single AND instruction (~1 cycle). Integer modulo (seq % capacity) requires division (~20 cycles). For a hot path that runs millions of times per second, this 19-cycle saving is significant.


Trade-offs

Choice                        Benefit                                   Cost
SHM ring buffers              Zero-copy, sub-200ns, no kernel           Complex synchronization, platform-specific code
Automatic backend selection   User doesn't think about topology         Migration latency on topology change (~10μs)
flock-based stale detection   Kernel-reliable, no race conditions       Linux/fallback only — macOS uses PID-based
Cache-line-aligned header     Eliminates false sharing (~40-80ns saved) 640 bytes per topic (mostly unused padding)
AVX2 streaming stores         Bypass cache for large messages           x86-only, runtime detection overhead
Auto POD detection            Zero-copy without user annotation         Types with Drop always use Serde, even if logically copyable

Common Issues

"Topic not found" across processes

Two processes must use the same namespace to see each other's topics. If you run horus run in two terminals, they get different auto-generated namespaces by default. Fix:

# Terminal 1
HORUS_NAMESPACE=robot horus run src/sensor.rs

# Terminal 2
HORUS_NAMESPACE=robot horus run src/controller.rs

Or use horus launch which sets the namespace automatically for all nodes.

Stale topics after crash

If a process crashes (SIGKILL, power loss), its SHM files persist. Other processes see stale data or fail to create topics with the same name.

# Check for stale files
horus clean --shm --dry-run

# Remove them
horus clean --shm

Type mismatch across processes

If Process A publishes CmdVel on "cmd_vel" but Process B subscribes with a different struct (different field layout, different size), the header validation fails. Both processes must use the exact same message type definition. Use horus msg hash CmdVel to verify type compatibility.

Topic names with slashes fail on macOS

shm_open() on macOS does not allow / in SHM object names. Always use dots: sensor.imu not sensor/imu. This is enforced across all platforms for portability.

Ring buffer overflow (dropped messages)

If a subscriber is slower than a publisher, the ring buffer overwrites old messages. The subscriber sees this via dropped_count():

let scan = topic.try_recv();
if topic.dropped_count() > 0 {
    hlog!(warn, "Dropped {} messages — subscriber too slow", topic.dropped_count());
}

This is by design — HORUS prioritizes freshness over completeness. If you need guaranteed delivery, increase capacity or reduce the publisher rate.

Memory usage is high

Each topic allocates capacity × slot_size bytes. For a Topic<Image> with 1024 capacity and 2MB frames, that's 2GB of SHM. Use smaller capacity for large types:

let topic: Topic<Image> = Topic::with_capacity("camera.rgb", 4)?;  // Only 4 slots

Or use pool-backed types (Image, PointCloud) which store data in a shared TensorPool with refcounting.


Inspecting SHM at Runtime

# List all active topics with backends and latency
horus topic list --verbose

# Watch a topic in real-time
horus topic echo sensor.imu

# Measure publishing rate
horus topic hz sensor.imu

# Measure bandwidth
horus topic bw camera.rgb

# See what nodes are running
horus node list

# On Linux, inspect SHM files directly
ls -la /dev/shm/horus_*/topics/

The horus monitor web UI and TUI also show real-time topic topology, message rates, and backend types for every topic.


See Also