Tutorial: Debug with Record & Replay
Programmatic replay (`replay_from`, `add_replay`, overrides) is Rust-only; for Python projects, use `horus record replay` from the CLI.

Every robotics engineer has hit the bug that only happens in the field. The robot drifts, the arm overshoots, the planner freezes — but only under specific sensor conditions you cannot reproduce at your desk. In ROS, you would reach for rosbag, an external tool that records DDS messages to a file. HORUS takes a different approach: recording is built into the scheduler, capturing every node's inputs and outputs at tick granularity via zero-copy reads. This tutorial walks you through a real debugging workflow: record a timing bug, replay it deterministically, isolate the cause, and verify your fix, all without touching the robot again.
Prerequisites
- Quick Start completed
- Familiarity with Nodes and Topics
- Understanding of Record & Replay concepts
What You'll Build
A 3-node robot system with a deliberate timing bug, then:
- Record the buggy execution
- Replay the recording to reproduce the bug deterministically
- Use `horus record diff` to compare runs
- Fix the bug and verify the fix with mixed replay
Time estimate: ~20 minutes
Step 1: Create the Project
horus new replay-debug -r
cd replay-debug
Step 2: Build a Buggy Robot
Replace src/main.rs with a 3-node system: a simulated sensor, a controller with a timing bug, and a monitor. The controller accumulates drift when the sensor value crosses a threshold — a realistic bug that only manifests under certain input patterns.
```rust
use horus::prelude::*;

// ── Messages ──────────────────────────────────────────────────
#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize, LogSummary)]
#[repr(C)]
struct SensorReading {
    value: f32,
    timestamp_ns: u64,
}

#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize, LogSummary)]
#[repr(C)]
struct ControlOutput {
    command: f32,
    error: f32,
    integral: f32,
}

// ── Sensor Node ───────────────────────────────────────────────
struct Sensor {
    pub_reading: Topic<SensorReading>,
    tick_count: u64,
}

impl Sensor {
    fn new() -> Result<Self> {
        Ok(Self {
            pub_reading: Topic::new("sensor.reading")?,
            tick_count: 0,
        })
    }
}

impl Node for Sensor {
    fn name(&self) -> &str { "sensor" }

    fn tick(&mut self) {
        let t = self.tick_count as f32 * 0.01;
        // Sine wave with occasional spikes (the trigger condition)
        let spike = if self.tick_count % 73 == 0 { 5.0 } else { 0.0 };
        let value = (t * 2.0).sin() * 10.0 + spike;
        let _ = self.pub_reading.send(SensorReading {
            value,
            timestamp_ns: horus::now().as_nanos() as u64,
        });
        self.tick_count += 1;
    }
}

// ── Controller Node (with bug) ────────────────────────────────
struct Controller {
    sub_reading: Topic<SensorReading>,
    pub_output: Topic<ControlOutput>,
    integral: f32,
    last_error: f32,
    setpoint: f32,
}

impl Controller {
    fn new() -> Result<Self> {
        Ok(Self {
            sub_reading: Topic::new("sensor.reading")?,
            pub_output: Topic::new("ctrl.output")?,
            integral: 0.0,
            last_error: 0.0,
            setpoint: 0.0,
        })
    }
}

impl Node for Controller {
    fn name(&self) -> &str { "controller" }

    fn tick(&mut self) {
        let reading = match self.sub_reading.recv() {
            Some(r) => r,
            None => return,
        };
        let error = self.setpoint - reading.value;
        let dt = horus::dt().as_secs_f32();
        // BUG: No anti-windup on integral term.
        // When the sensor spikes, the integral accumulates without bound,
        // causing the controller to drift after the spike passes.
        self.integral += error * dt;
        let derivative = if dt > 0.0 { (error - self.last_error) / dt } else { 0.0 };
        self.last_error = error;
        let command = 0.5 * error + 0.1 * self.integral + 0.01 * derivative;
        let _ = self.pub_output.send(ControlOutput {
            command,
            error,
            integral: self.integral,
        });
    }
}

// ── Monitor Node ──────────────────────────────────────────────
struct Monitor {
    sub_output: Topic<ControlOutput>,
    tick_count: u64,
}

impl Monitor {
    fn new() -> Result<Self> {
        Ok(Self {
            sub_output: Topic::new("ctrl.output")?,
            tick_count: 0,
        })
    }
}

impl Node for Monitor {
    fn name(&self) -> &str { "monitor" }

    fn tick(&mut self) {
        if let Some(output) = self.sub_output.recv() {
            // Print every 100 ticks
            if self.tick_count % 100 == 0 {
                hlog!(info, "cmd={:.2} err={:.2} integral={:.2}",
                    output.command, output.error, output.integral);
            }
        }
        self.tick_count += 1;
    }
}

// ── Main ──────────────────────────────────────────────────────
fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(100_u64.hz());

    scheduler.add(Sensor::new()?).order(0).build()?;
    scheduler.add(Controller::new()?).order(1).build()?;
    scheduler.add(Monitor::new()?).order(2).build()?;

    scheduler.run()
}
```
Run it briefly and observe the integral drifting:
horus run
# Watch for a few seconds, then Ctrl+C
# You'll see the integral climbing after spike events
Step 3: Record the Bug
Now run the same system with recording enabled. This captures every node's inputs and outputs at every tick:
```rust
// Change the scheduler setup in main():
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .with_recording(); // <-- Enable recording
```
Or record without changing code using the CLI:
horus run --record buggy_run
# Run for 10 seconds, then Ctrl+C
Check the recording:
horus record list --long
# Output:
# buggy_run 3 nodes 1000 ticks 245 KB 2026-03-28 14:32
Inspect what was captured:
horus record info buggy_run
# Shows per-node tick ranges, topics recorded, file sizes
Step 4: Replay and Reproduce
Replay the recording. The bug reproduces identically every time — same inputs, same timing, same outputs:
horus record replay buggy_run
Slow it down to watch the spike events:
horus record replay buggy_run --speed 0.25
Jump directly to the problematic region (if you noticed the drift starting around tick 400):
horus record replay buggy_run --start-tick 350 --stop-tick 500
Unlike rosbag, which replays messages at approximate timestamps, HORUS replay injects recorded data at the exact tick at which it was originally produced. Combined with deterministic mode (SimClock + tick-seeded RNG), every replay produces bit-identical results. This is possible because recording happens inside the scheduler: it captures the causal ordering, not just the timestamps.
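The tick-seeded RNG idea is worth seeing in isolation. The sketch below is not HORUS's implementation — it just shows why deriving per-tick randomness from the tick index (here with a SplitMix64-style hash, an assumption of this example) makes a replayed tick see exactly the same "random" value as the live run did:

```rust
// Sketch: a tick-seeded PRNG. Not the HORUS implementation; it illustrates
// why seeding randomness from the tick index makes replay bit-identical:
// the noise at tick N is a pure function of N, live or replayed.
fn tick_noise(tick: u64) -> f32 {
    // SplitMix64-style hash of the tick index (fully deterministic).
    let mut z = tick.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^= z >> 31;
    // Map the top 24 bits to [0, 1).
    (z >> 40) as f32 / (1u64 << 24) as f32
}

fn main() {
    // A "live" run and a "replay" of tick 42 see identical noise.
    assert_eq!(tick_noise(42), tick_noise(42));
    // Different ticks see different noise.
    assert!(tick_noise(42) != tick_noise(43));
    println!("tick 42 noise = {}", tick_noise(42));
}
```

A wall-clock-seeded RNG would break this property: the same tick replayed later would draw a different value, and the runs would diverge.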
Step 5: Export and Analyze
Export the recording for offline analysis:
horus record export buggy_run --output buggy.json --format json
horus record export buggy_run --output buggy.csv --format csv
The JSON export lets you script analysis — grep for the integral crossing a threshold, plot the spike correlation, etc.
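The export schema isn't shown here, but once you have parsed the integral trace into memory, the analysis itself is a few lines. A sketch, where the `(tick, integral)` pairs are a placeholder shape rather than the actual export format:

```rust
// Sketch: scan an exported trace for the first tick where the integral
// term escapes a bound. The (tick, integral) pairs would come from the
// JSON/CSV export; this shape is assumed for illustration.
fn first_windup_tick(trace: &[(u64, f32)], bound: f32) -> Option<u64> {
    trace
        .iter()
        .find(|(_, integral)| integral.abs() > bound)
        .map(|(tick, _)| *tick)
}

fn main() {
    // Toy trace: the integral drifts past 10.0 at tick 412.
    let trace = vec![(400_u64, 8.9_f32), (405, 9.6), (412, 10.4), (420, 12.1)];
    assert_eq!(first_windup_tick(&trace, 10.0), Some(412));
    assert_eq!(first_windup_tick(&trace, 20.0), None);
    println!("windup starts at tick {:?}", first_windup_tick(&trace, 10.0));
}
```

Cross-referencing the tick this returns against the sensor topic in the same export shows whether windup onset correlates with the spike events.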
Step 6: Fix the Bug
The fix is simple — add integral anti-windup clamping. Update the controller:
```rust
impl Node for Controller {
    fn name(&self) -> &str { "controller" }

    fn tick(&mut self) {
        let reading = match self.sub_reading.recv() {
            Some(r) => r,
            None => return,
        };
        let error = self.setpoint - reading.value;
        let dt = horus::dt().as_secs_f32();
        // FIX: Clamp integral to prevent windup
        self.integral = (self.integral + error * dt).clamp(-10.0, 10.0);
        let derivative = if dt > 0.0 { (error - self.last_error) / dt } else { 0.0 };
        self.last_error = error;
        let command = 0.5 * error + 0.1 * self.integral + 0.01 * derivative;
        let _ = self.pub_output.send(ControlOutput {
            command,
            error,
            integral: self.integral,
        });
    }
}
```
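To see why the clamp works, simulate the integral term on its own. This standalone sketch (plain Rust, no HORUS types) feeds a sustained error spike into both variants:

```rust
// Sketch: integral accumulation with and without anti-windup clamping,
// under a sustained error. Standalone illustration, not HORUS code.
fn accumulate(errors: &[f32], dt: f32, clamp_to: Option<f32>) -> f32 {
    let mut integral = 0.0_f32;
    for &e in errors {
        integral += e * dt;
        if let Some(limit) = clamp_to {
            integral = integral.clamp(-limit, limit);
        }
    }
    integral
}

fn main() {
    // 500 ticks of a large error (a long spike), dt = 10 ms.
    let spike = vec![-5.0_f32; 500];
    let unclamped = accumulate(&spike, 0.01, None);
    let clamped = accumulate(&spike, 0.01, Some(10.0));
    // Unclamped winds up toward -25; clamped is pinned at -10.
    assert!((unclamped + 25.0).abs() < 0.01);
    assert_eq!(clamped, -10.0);
    println!("unclamped = {unclamped}, clamped = {clamped}");
}
```

The unclamped integral keeps growing for as long as the error persists, so the `0.1 * integral` term dominates the command long after the spike ends; the clamped version recovers as soon as the error changes sign.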
Step 7: Verify the Fix with Mixed Replay
This is the key step that rosbag cannot do. Use mixed replay to feed the exact same sensor data from the buggy recording into your fixed controller:
# Inject the recorded sensor node, run the fixed controller live
horus record inject buggy_run --nodes sensor
Or programmatically in Rust:
```rust
use horus::prelude::*;
use std::path::PathBuf;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(100_u64.hz())
        .with_recording(); // Record the fixed run too

    // Replay the recorded sensor (same spikes, same timing)
    scheduler.add_replay(
        PathBuf::from("~/.local/share/horus/recordings/buggy_run/sensor@001.horus"),
        0,
    )?;

    // Run the FIXED controller live against recorded sensor data
    scheduler.add(Controller::new()?).order(1).build()?;
    scheduler.add(Monitor::new()?).order(2).build()?;

    scheduler.run()
}
```
Now record this fixed run and compare:
horus run --record fixed_run
# Let it run the same duration, Ctrl+C
horus record diff buggy_run fixed_run
# Shows tick-by-tick differences in ctrl.output
# The integral no longer drifts past +/-10.0
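Conceptually, the diff's job is a tick-by-tick comparison of aligned traces. A sketch of that idea (not the tool's actual implementation; the tolerance parameter is an assumption of this example):

```rust
// Sketch: tick-by-tick comparison of two aligned command traces,
// reporting the first tick where they diverge beyond a tolerance.
// This mirrors the idea behind `horus record diff`, not its implementation.
fn first_divergence(a: &[f32], b: &[f32], tol: f32) -> Option<usize> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| (x - y).abs() > tol)
}

fn main() {
    let buggy = [0.0_f32, 0.1, 0.2, 1.5, 3.0]; // drifts after the spike
    let fixed = [0.0_f32, 0.1, 0.2, 0.3, 0.4];
    assert_eq!(first_divergence(&buggy, &fixed, 0.5), Some(3));
    assert_eq!(first_divergence(&fixed, &fixed, 0.5), None);
    println!("runs diverge at tick {:?}", first_divergence(&buggy, &fixed, 0.5));
}
```

Because both runs consumed the identical recorded sensor stream, the first divergent tick points directly at where the code change altered behavior, with no noise from differing inputs.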
Step 8: Clean Up
# Keep the fixed run, delete the buggy one
horus record delete buggy_run
# Or clean all recordings older than 7 days
horus record clean --max-age-days 7
Key Takeaways
- Recording is zero-overhead: It reads directly from shared memory slots — no extra serialization, no external process
- Replay is deterministic: Same inputs at the same tick produce identical outputs every time
- Mixed replay is the killer feature: Replay recorded sensors while running new code live — impossible with external bag tools
- `horus record diff` lets you prove your fix works by comparing buggy vs fixed runs on identical inputs
- No hardware needed: Once recorded, you debug entirely at your desk
Challenges
(a) Add a regression test: Write a script that runs `horus record inject buggy_run --nodes sensor`, checks the integral stays within bounds, and returns exit code 0/1. Add it to your CI.
(b) Override a sensor value: Use `--override sensor.reading ...` during replay to test what happens if the spike amplitude doubles. Does your fix still hold?
(c) Compare algorithm versions: Record a session, then modify the PID gains. Use mixed replay + diff to find the gains that minimize overshoot on the recorded input data.
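For challenge (c), the comparison metric can be as simple as peak overshoot past the setpoint, computed per gain set from the exported traces. A standalone sketch with made-up trace data:

```rust
// Sketch: peak overshoot of a response trace relative to a setpoint.
// Run mixed replay once per gain set, export each trace, then rank the
// gain sets by this metric. The trace data here is made up for illustration.
fn peak_overshoot(trace: &[f32], setpoint: f32) -> f32 {
    trace
        .iter()
        .map(|v| v - setpoint)
        .fold(0.0_f32, f32::max)
}

fn main() {
    let aggressive = [0.0_f32, 0.8, 1.3, 1.1, 1.0]; // overshoots the setpoint
    let tuned = [0.0_f32, 0.6, 0.95, 1.0, 1.0];     // settles cleanly
    assert!(peak_overshoot(&aggressive, 1.0) > peak_overshoot(&tuned, 1.0));
    assert_eq!(peak_overshoot(&tuned, 1.0), 0.0);
    println!("aggressive overshoot = {}", peak_overshoot(&aggressive, 1.0));
}
```

Because every candidate runs against the identical recorded input, differences in this metric are attributable to the gains alone.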
Common Errors
| Symptom | Cause | Fix |
|---|---|---|
| `horus record list` shows nothing | Recording was not enabled | Add `.with_recording()` or use the `--record` flag |
| Replay produces different output | Code changed between record and replay | Expected for mixed replay; use `replay_from` for exact reproduction |
| `inject` node not receiving data | Topic name mismatch | Topic names are case-sensitive and must match exactly |
| Recording files are large | Long session or high-frequency large messages | Use `horus record clean` or set `max_snapshots` to bound size |
See Also
- Record & Replay Reference — Full API documentation
- BlackBox Flight Recorder — Lightweight crash forensics (always-on, bounded storage)
- Deterministic Mode — Required for bit-identical replay
- PID Controller Recipe — Production PID with anti-windup