Tutorial: Debug with Record & Replay

Looking for the Python version? The recording workflow below covers both Rust and Python. Programmatic replay (replay_from, add_replay, overrides) is Rust-only; for Python projects, use horus record replay from the CLI.

Every robotics engineer has hit the bug that only happens in the field. The robot drifts, the arm overshoots, the planner freezes, but only under specific sensor conditions you cannot reproduce at your desk. In ROS, you would reach for rosbag, an external tool that records DDS messages to a file. HORUS takes a different approach: recording is built into the scheduler, which captures every node's inputs and outputs at tick granularity using zero-copy reads. This tutorial walks you through a real debugging workflow: record a bug, replay it deterministically, isolate the cause, and verify your fix, all without touching the robot again.

Prerequisites

What You'll Build

A 3-node robot system with a deliberate controller bug (integral windup), then:

  1. Record the buggy execution
  2. Replay the recording to reproduce the bug deterministically
  3. Use horus record diff to compare runs
  4. Fix the bug and verify the fix with mixed replay

Time estimate: ~20 minutes

Step 1: Create the Project

horus new replay-debug -r
cd replay-debug

Step 2: Build a Buggy Robot

Replace src/main.rs with a 3-node system: a simulated sensor, a controller with an integral windup bug, and a monitor. The controller's integral term accumulates drift each time the sensor spikes, a realistic bug that only manifests under certain input patterns.

use horus::prelude::*;

// ── Messages ──────────────────────────────────────────────────

#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize, LogSummary)]
#[repr(C)]
struct SensorReading {
    value: f32,
    timestamp_ns: u64,
}

#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize, LogSummary)]
#[repr(C)]
struct ControlOutput {
    command: f32,
    error: f32,
    integral: f32,
}

// ── Sensor Node ───────────────────────────────────────────────

struct Sensor {
    pub_reading: Topic<SensorReading>,
    tick_count: u64,
}

impl Sensor {
    fn new() -> Result<Self> {
        Ok(Self {
            pub_reading: Topic::new("sensor.reading")?,
            tick_count: 0,
        })
    }
}

impl Node for Sensor {
    fn name(&self) -> &str { "sensor" }

    fn tick(&mut self) {
        let t = self.tick_count as f32 * 0.01;
        // Sine wave with occasional spikes (the trigger condition)
        let spike = if self.tick_count % 73 == 0 { 5.0 } else { 0.0 };
        let value = (t * 2.0).sin() * 10.0 + spike;

        let _ = self.pub_reading.send(SensorReading {
            value,
            timestamp_ns: horus::now().as_nanos() as u64,
        });
        self.tick_count += 1;
    }
}

// ── Controller Node (with bug) ────────────────────────────────

struct Controller {
    sub_reading: Topic<SensorReading>,
    pub_output: Topic<ControlOutput>,
    integral: f32,
    last_error: f32,
    setpoint: f32,
}

impl Controller {
    fn new() -> Result<Self> {
        Ok(Self {
            sub_reading: Topic::new("sensor.reading")?,
            pub_output: Topic::new("ctrl.output")?,
            integral: 0.0,
            last_error: 0.0,
            setpoint: 0.0,
        })
    }
}

impl Node for Controller {
    fn name(&self) -> &str { "controller" }

    fn tick(&mut self) {
        let reading = match self.sub_reading.recv() {
            Some(r) => r,
            None => return,
        };

        let error = self.setpoint - reading.value;
        let dt = horus::dt().as_secs_f32();

        // BUG: No anti-windup on integral term.
        // When the sensor spikes, the integral accumulates without bound,
        // causing the controller to drift after the spike passes.
        self.integral += error * dt;

        let derivative = if dt > 0.0 { (error - self.last_error) / dt } else { 0.0 };
        self.last_error = error;

        let command = 0.5 * error + 0.1 * self.integral + 0.01 * derivative;

        let _ = self.pub_output.send(ControlOutput {
            command,
            error,
            integral: self.integral,
        });
    }
}

// ── Monitor Node ──────────────────────────────────────────────

struct Monitor {
    sub_output: Topic<ControlOutput>,
    tick_count: u64,
}

impl Monitor {
    fn new() -> Result<Self> {
        Ok(Self {
            sub_output: Topic::new("ctrl.output")?,
            tick_count: 0,
        })
    }
}

impl Node for Monitor {
    fn name(&self) -> &str { "monitor" }

    fn tick(&mut self) {
        if let Some(output) = self.sub_output.recv() {
            // Print every 100 ticks
            if self.tick_count % 100 == 0 {
                hlog!(info, "cmd={:.2}  err={:.2}  integral={:.2}",
                    output.command, output.error, output.integral);
            }
        }
        self.tick_count += 1;
    }
}

// ── Main ──────────────────────────────────────────────────────

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(100_u64.hz());

    scheduler.add(Sensor::new()?).order(0).build()?;
    scheduler.add(Controller::new()?).order(1).build()?;
    scheduler.add(Monitor::new()?).order(2).build()?;

    scheduler.run()
}

Run it briefly and observe the integral drifting:

horus run
# Watch for a few seconds, then Ctrl+C
# You'll see the integral climbing after spike events
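
To see why the spikes matter, the sensor and integrator arithmetic can be reproduced without HORUS. The following is a minimal std-only sketch, assuming a fixed dt of 0.01 s (the 100 Hz tick rate) in place of horus::dt() and a setpoint of 0.0. The helper names integral_trace and min_in are illustrative and not part of the HORUS API. Under these assumptions, each spike pushes the integral 5.0 × 0.01 = 0.05 further negative, so every trough of the integral sits slightly lower than the previous one.

```rust
/// Sensor model from Step 2: 10*sin(2t) plus a +5.0 spike every 73rd tick.
fn sensor_value(tick: u64) -> f32 {
    let t = tick as f32 * 0.01;
    let spike = if tick % 73 == 0 { 5.0 } else { 0.0 };
    (t * 2.0).sin() * 10.0 + spike
}

/// Runs the unclamped integrator for `ticks` ticks and returns the integral
/// after every tick. Assumes a fixed dt of 0.01 s and a setpoint of 0.0.
fn integral_trace(ticks: u64) -> Vec<f32> {
    let dt = 0.01_f32;
    let mut integral = 0.0_f32;
    let mut trace = Vec::with_capacity(ticks as usize);
    for tick in 0..ticks {
        let error = 0.0 - sensor_value(tick);
        integral += error * dt; // same update as the buggy controller
        trace.push(integral);
    }
    trace
}

/// Most negative integral value seen in the given tick range.
fn min_in(trace: &[f32], range: std::ops::Range<usize>) -> f32 {
    trace[range].iter().copied().fold(f32::INFINITY, f32::min)
}
```

One period of the sine wave is roughly 314 ticks at this rate, so comparing min_in(&trace, 0..314) with min_in(&trace, 628..942) on integral_trace(1000) compares the first trough with the third.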

Step 3: Record the Bug

Now run the same system with recording enabled. This captures every node's inputs and outputs at every tick:

// Change the scheduler setup in main():
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .with_recording();  // <-- Enable recording

Or record without changing code using the CLI:

horus run --record buggy_run
# Run for 10 seconds, then Ctrl+C

Check the recording:

horus record list --long
# Output:
#   buggy_run    3 nodes    1000 ticks    245 KB    2026-03-28 14:32

Inspect what was captured:

horus record info buggy_run
# Shows per-node tick ranges, topics recorded, file sizes

Step 4: Replay and Reproduce

Replay the recording. The bug reproduces identically every time — same inputs, same timing, same outputs:

horus record replay buggy_run

Slow it down to watch the spike events:

horus record replay buggy_run --speed 0.25

Jump directly to the problematic region (if you noticed the drift starting around tick 400):

horus record replay buggy_run --start-tick 350 --stop-tick 500
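
As a rough guide to how long a slowed-down replay takes, the pacing arithmetic can be sketched as follows. This assumes that --speed scales wall-clock pacing linearly (so 0.25 means four times slower) and that the window contains stop minus start ticks; neither detail is confirmed here, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Wall-clock time per tick when a recording made at `rate_hz` is replayed
/// at `speed` (1.0 = real time). Assumes speed scales pacing linearly.
fn replay_tick_period(rate_hz: f64, speed: f64) -> Duration {
    Duration::from_secs_f64(1.0 / (rate_hz * speed))
}

/// Wall-clock time needed to replay `tick_count` ticks.
fn replay_duration(tick_count: u32, rate_hz: f64, speed: f64) -> Duration {
    replay_tick_period(rate_hz, speed) * tick_count
}
```

Under those assumptions, a 100 Hz recording replayed at 0.25 advances one tick every 40 ms, and the 150-tick window from 350 to 500 takes about 6 seconds of wall-clock time.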
ℹ️ Why is replay deterministic?

Unlike rosbag, which replays messages at approximate timestamps, HORUS replay injects recorded data at the exact tick at which it was originally produced. Combined with deterministic mode (SimClock + tick-seeded RNG), every replay produces bit-identical results. This works because recording happens inside the scheduler: it captures the causal ordering, not just the timestamps.
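
The idea can be illustrated with a small std-only model. This is a sketch of the principle, not HORUS internals: PidState, step, and replay are hypothetical names, and the fixed dt argument stands in for a simulated clock. Because input i is always injected at tick i and dt never depends on the wall clock, two replays of the same recording agree bit for bit, while a different dt (as wall-clock jitter would produce) changes the output.

```rust
/// Controller state, mirroring the fields used by the PID arithmetic in Step 2.
#[derive(Clone, Copy, Default)]
struct PidState {
    integral: f32,
    last_error: f32,
}

/// One controller step with an explicit dt and a setpoint of 0.0.
fn step(state: &mut PidState, reading: f32, dt: f32) -> f32 {
    let error = 0.0 - reading;
    state.integral += error * dt;
    let derivative = (error - state.last_error) / dt;
    state.last_error = error;
    0.5 * error + 0.1 * state.integral + 0.01 * derivative
}

/// Replays a tick-indexed recording: element `i` is injected at tick `i`.
/// Returns the raw bit pattern of each command so runs can be compared exactly.
fn replay(recorded: &[f32], dt: f32) -> Vec<u32> {
    let mut state = PidState::default();
    recorded
        .iter()
        .map(|&reading| step(&mut state, reading, dt).to_bits())
        .collect()
}
```

Comparing replay(&recorded, 0.01) against a second call with the same arguments yields equal vectors; changing dt even slightly does not.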

Step 5: Export and Analyze

Export the recording for offline analysis:

horus record export buggy_run --output buggy.json --format json
horus record export buggy_run --output buggy.csv --format csv

The JSON export lets you script analysis — grep for the integral crossing a threshold, plot the spike correlation, etc.
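
As one example of such a script, the sketch below scans CSV text for the first tick at which the integral leaves a given band. The column names "tick" and "integral" are assumptions made for illustration; the actual header row written by horus record export may differ, so check the file and adjust the names.

```rust
/// Returns the first tick whose |integral| exceeds `limit`, if any.
/// The column names "tick" and "integral" are hypothetical; adjust them to
/// match the header row of the exported file.
fn first_tick_over(csv: &str, limit: f32) -> Option<u64> {
    let mut lines = csv.lines();
    let header: Vec<&str> = lines.next()?.split(',').map(str::trim).collect();
    let tick_col = header.iter().position(|h| *h == "tick")?;
    let integral_col = header.iter().position(|h| *h == "integral")?;

    for line in lines {
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        // Skip rows that are short or fail to parse rather than aborting.
        let tick = fields.get(tick_col).and_then(|f| f.parse::<u64>().ok());
        let integral = fields.get(integral_col).and_then(|f| f.parse::<f32>().ok());
        if let (Some(tick), Some(integral)) = (tick, integral) {
            if integral.abs() > limit {
                return Some(tick);
            }
        }
    }
    None
}
```

To run it against the export, read the file with std::fs::read_to_string("buggy.csv") and pass the resulting string in.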

Step 6: Fix the Bug

The fix is simple — add integral anti-windup clamping. Update the controller:

impl Node for Controller {
    fn name(&self) -> &str { "controller" }

    fn tick(&mut self) {
        let reading = match self.sub_reading.recv() {
            Some(r) => r,
            None => return,
        };

        let error = self.setpoint - reading.value;
        let dt = horus::dt().as_secs_f32();

        // FIX: Clamp integral to prevent windup
        self.integral = (self.integral + error * dt).clamp(-10.0, 10.0);

        let derivative = if dt > 0.0 { (error - self.last_error) / dt } else { 0.0 };
        self.last_error = error;

        let command = 0.5 * error + 0.1 * self.integral + 0.01 * derivative;

        let _ = self.pub_output.send(ControlOutput {
            command,
            error,
            integral: self.integral,
        });
    }
}
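
The effect of the clamp is easiest to see in isolation. The sketch below is std-only and uses hypothetical helper names (integrate_clamped, run); it applies a constant error for a number of ticks with a fixed dt, with or without a limit. After a sustained disturbance, the unclamped integral has wound far past the limit and needs a long stretch of opposite error to unwind, while the clamped one starts recovering from the limit.

```rust
/// Integrator update with clamping anti-windup, as in the fixed controller.
/// `limit` must be non-negative.
fn integrate_clamped(integral: f32, error: f32, dt: f32, limit: f32) -> f32 {
    (integral + error * dt).clamp(-limit, limit)
}

/// Applies a constant `error` for `ticks` ticks starting from `start` and
/// returns the final integral. `limit: None` reproduces the buggy update.
fn run(start: f32, error: f32, ticks: u32, dt: f32, limit: Option<f32>) -> f32 {
    let mut integral = start;
    for _ in 0..ticks {
        integral = match limit {
            Some(l) => integrate_clamped(integral, error, dt, l),
            None => integral + error * dt,
        };
    }
    integral
}
```

For instance, 100 ticks of error -20.0 at dt 0.01 drive the unclamped integral to about -20 but hold the clamped one at -10; 50 ticks of error +20.0 then bring the clamped integral back to about 0, while the unclamped one is still near -10.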

Step 7: Verify the Fix with Mixed Replay

This is the key step that rosbag cannot do. Use mixed replay to feed the exact same sensor data from the buggy recording into your fixed controller:

# Inject the recorded sensor node, run the fixed controller live
horus record inject buggy_run --nodes sensor

Or programmatically in Rust:

use horus::prelude::*;
use std::path::PathBuf;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(100_u64.hz())
        .with_recording();  // Record the fixed run too

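    // Replay the recorded sensor (same spikes, same timing).
    // PathBuf does not expand "~", so build the path from $HOME.
    let home = std::env::var_os("HOME").map(PathBuf::from).unwrap_or_default();
    scheduler.add_replay(
        home.join(".local/share/horus/recordings/buggy_run/sensor@001.horus"),
        0,
    )?;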

    // Run the FIXED controller live against recorded sensor data
    scheduler.add(Controller::new()?).order(1).build()?;
    scheduler.add(Monitor::new()?).order(2).build()?;

    scheduler.run()
}

Now record this fixed run and compare:

horus run --record fixed_run
# Let it run the same duration, Ctrl+C

horus record diff buggy_run fixed_run
# Shows tick-by-tick differences in ctrl.output
# The integral no longer drifts past +/-10.0
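
Conceptually, a tick-by-tick diff pairs up the values two runs produced at the same tick and reports where they disagree. The following std-only sketch shows that idea on plain slices of integral values; it is not how horus record diff is implemented, and TickDiff and diff_traces are hypothetical names.

```rust
/// One tick at which two runs disagree on a value.
#[derive(Debug, PartialEq)]
struct TickDiff {
    tick: usize,
    before: f32,
    after: f32,
}

/// Compares two per-tick traces and reports ticks that differ by more than `eps`.
/// Only the overlapping prefix of the two traces is compared.
fn diff_traces(before: &[f32], after: &[f32], eps: f32) -> Vec<TickDiff> {
    before
        .iter()
        .zip(after)
        .enumerate()
        .filter(|(_, (b, a))| (**b - **a).abs() > eps)
        .map(|(tick, (&before, &after))| TickDiff { tick, before, after })
        .collect()
}
```

Because both runs consumed identical sensor input, any tick reported here is attributable to the code change rather than to the environment.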

Step 8: Clean Up

# Keep the fixed run, delete the buggy one
horus record delete buggy_run

# Or clean all recordings older than 7 days
horus record clean --max-age-days 7

Key Takeaways

  • Recording is zero-copy: It reads directly from shared memory slots, with no extra serialization and no external process
  • Replay is deterministic: Same inputs at the same tick produce identical outputs every time
  • Mixed replay is the killer feature: Replay recorded sensors while running new code live — impossible with external bag tools
  • horus record diff lets you prove your fix works by comparing buggy vs fixed runs on identical inputs
  • No hardware needed: Once recorded, you debug entirely at your desk

Challenges

(a) Add a regression test: Write a script that runs horus record inject buggy_run --nodes sensor, checks the integral stays within bounds, and returns exit code 0/1. Add it to your CI.

(b) Override a sensor value: Use --override sensor.reading ... during replay to test what happens if the spike amplitude doubles. Does your fix still hold?

(c) Compare algorithm versions: Record a session, then modify the PID gains. Use mixed replay + diff to find the gains that minimize overshoot on the recorded input data.
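
For challenge (a), one possible starting point is a checker that reads the Monitor's log lines and turns them into an exit code. The sketch below relies only on the "integral=" field that the Monitor prints in Step 2; any prefix the logger adds is ignored. The function names are hypothetical, and how the log lines are captured from horus record inject is left to your CI setup.

```rust
/// Extracts the value after "integral=" from a Monitor log line, if present.
fn parse_integral(line: &str) -> Option<f32> {
    let rest = line.split("integral=").nth(1)?;
    rest.split_whitespace().next()?.parse().ok()
}

/// Exit code for a CI check: 0 if at least one integral sample was seen and
/// all samples stay within +/-limit, 1 otherwise (NaN counts as a failure).
fn bounds_exit_code<'a>(lines: impl Iterator<Item = &'a str>, limit: f32) -> i32 {
    let mut seen = false;
    for value in lines.filter_map(parse_integral) {
        seen = true;
        if !(value.abs() <= limit) {
            return 1;
        }
    }
    if seen { 0 } else { 1 }
}
```

A main function could feed it the lines of standard input and pass the result to std::process::exit. Treating "no samples seen" as a failure guards against a run that silently produced no output.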

Common Errors

| Symptom | Cause | Fix |
|---|---|---|
| horus record list shows nothing | Recording was not enabled | Add .with_recording() or use --record flag |
| Replay produces different output | Code changed between record and replay | Expected for mixed replay; use replay_from for exact reproduction |
| inject node not receiving data | Topic name mismatch | Topic names are case-sensitive and must match exactly |
| Recording files are large | Long session or high-frequency large messages | Use horus record clean or set max_snapshots to bound size |

See Also