# Fault Tolerance
Every node in a robot can fail — a sensor disconnects, a planning algorithm panics, a network request times out. HORUS failure policies control what happens next: stop the robot, restart the node, skip it temporarily, or ignore the error entirely.
## The Problem
Without failure policies, one crashing node kills the entire system:
```
Tick 100: sensor_driver panics (USB disconnected)
Tick 100: scheduler stops
Tick 100: motor_controller stops receiving commands
Result: robot stops moving in the middle of a task
```
With failure policies, the system adapts:
```
Tick 100: sensor_driver panics (USB disconnected)
Tick 100: FailurePolicy::Restart → re-init sensor_driver (10ms backoff)
Tick 101: sensor_driver panics again → restart (20ms backoff)
Tick 102: USB reconnects → sensor_driver.init() succeeds → normal operation
Result: robot paused briefly, then resumed automatically
```
## The Four Policies
### Fatal — Stop Everything

```rust
scheduler.add(motor_controller)
    .order(0)
    .rate(1000_u64.hz())
    .failure_policy(FailurePolicy::Fatal)
    .build()?;
```
First failure stops the scheduler immediately. Use for nodes where continued operation after failure is unsafe:
- Motor controllers (stale commands = uncontrolled motion)
- Safety monitors (can't monitor safety if the monitor is broken)
- Emergency stop handlers
When it triggers: `node.tick()` panics or returns an error. The scheduler calls `stop()` and shuts down all nodes cleanly.
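Both failure modes (a panic and an error return) can be funneled into a single failure signal before the policy is applied. A minimal sketch of that conversion, not HORUS internals — the `Node` trait and `run_tick` helper here are hypothetical:

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical minimal node trait, for illustration only.
trait Node {
    fn tick(&mut self) -> Result<(), String>;
}

struct PanickyDriver;

impl Node for PanickyDriver {
    fn tick(&mut self) -> Result<(), String> {
        panic!("USB disconnected");
    }
}

// Convert both panics and Err returns into one failure result,
// which a scheduler can then route through the node's failure policy.
fn run_tick(node: &mut dyn Node) -> Result<(), String> {
    panic::catch_unwind(AssertUnwindSafe(|| node.tick()))
        .unwrap_or_else(|_| Err("tick panicked".to_string()))
}

fn main() {
    let mut node = PanickyDriver;
    // The panic is caught and surfaces as an ordinary error.
    assert!(run_tick(&mut node).is_err());
    println!("failure captured");
}
```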
### Restart — Re-Initialize with Backoff

```rust
scheduler.add(lidar_driver)
    .order(1)
    .rate(100_u64.hz())
    .failure_policy(FailurePolicy::restart(3, 50_u64.ms()))
    .build()?;
```
Re-initializes the node with exponential backoff. Once `max_restarts` is exhausted, the failure escalates to a fatal stop.
```
failure 1 → restart, wait 50ms
failure 2 → restart, wait 100ms (2x backoff)
failure 3 → restart, wait 200ms (2x backoff)
failure 4 → max_restarts exceeded → fatal stop
```
After a successful tick, the backoff timer clears. Use for nodes that can recover from transient failures:
- Sensor drivers (hardware reconnection)
- Network clients (server temporarily unavailable)
- Camera nodes (USB reset)
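The doubling schedule shown above is simple to compute. A minimal sketch (the function name and signature are ours, not a HORUS API):

```rust
/// Exponential backoff: the base delay doubles after each consecutive failure.
/// failure 1 -> base, failure 2 -> 2x base, failure 3 -> 4x base, ...
/// The shift is capped so the delay cannot overflow for large failure counts.
fn restart_backoff_ms(base_ms: u64, consecutive_failures: u32) -> u64 {
    let doublings = consecutive_failures.saturating_sub(1).min(16);
    base_ms.saturating_mul(1u64 << doublings)
}

fn main() {
    // Matches the 50ms example schedule above.
    assert_eq!(restart_backoff_ms(50, 1), 50);
    assert_eq!(restart_backoff_ms(50, 2), 100);
    assert_eq!(restart_backoff_ms(50, 3), 200);
    println!("backoff schedule verified");
}
```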
### Skip — Tolerate with Cooldown

```rust
scheduler.add(telemetry_uploader)
    .order(200)
    .async_io()
    .failure_policy(FailurePolicy::skip(5, 1_u64.secs()))
    .build()?;
```
After `max_failures` consecutive failures, the node is suppressed for the cooldown period. After cooldown, the node is allowed again and the failure counter resets.
```
failure 1 → continue
failure 2 → continue
failure 3 → continue
failure 4 → continue
failure 5 → node suppressed for 1 second
... 1 second passes ...
node allowed again, failure count = 0
```
Use for nodes whose absence doesn't affect core robot operation:
- Logging and telemetry upload
- Diagnostics reporting
- Cloud sync
- Non-critical monitoring
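The counter-and-cooldown behavior described above can be modeled as a small state machine. This is a sketch of the semantics, measured in ticks for simplicity; it is not HORUS's implementation and all names are ours:

```rust
struct SkipState {
    max_failures: u32,
    cooldown_ticks: u64,
    consecutive_failures: u32,
    suppressed_until: Option<u64>, // tick at which the node is allowed again
}

impl SkipState {
    fn new(max_failures: u32, cooldown_ticks: u64) -> Self {
        Self { max_failures, cooldown_ticks, consecutive_failures: 0, suppressed_until: None }
    }

    /// Returns true if the node should run on this tick.
    fn should_run(&mut self, now: u64) -> bool {
        match self.suppressed_until {
            Some(until) if now < until => false,
            Some(_) => {
                // Cooldown elapsed: allow the node again and reset the counter.
                self.suppressed_until = None;
                self.consecutive_failures = 0;
                true
            }
            None => true,
        }
    }

    fn record_failure(&mut self, now: u64) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            self.suppressed_until = Some(now + self.cooldown_ticks);
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }
}

fn main() {
    let mut s = SkipState::new(5, 100);
    for tick in 0..5 {
        assert!(s.should_run(tick)); // still allowed through failure 4
        s.record_failure(tick);
    }
    assert!(!s.should_run(5));   // suppressed after the 5th failure
    assert!(s.should_run(105));  // allowed again once the cooldown elapses
    assert_eq!(s.consecutive_failures, 0); // counter reset
    s.record_success();
}
```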
### Ignore — Swallow Failures

```rust
scheduler.add(stats_collector)
    .order(100)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;
```
Failures are completely ignored. The node keeps ticking every cycle regardless of errors. Use only when partial results are acceptable:
- Statistics collectors (missing one sample is fine)
- Best-effort visualization
- Debug output nodes
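Taken together, the four policies have the shape of a small enum. The constructor names below match this page, but the variant fields and representation are assumptions, not HORUS's actual type:

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum FailurePolicy {
    /// First failure stops the scheduler.
    Fatal,
    /// Re-initialize with exponential backoff, up to max_restarts.
    Restart { max_restarts: u32, backoff: Duration },
    /// Suppress the node for a cooldown after max_failures consecutive failures.
    Skip { max_failures: u32, cooldown: Duration },
    /// Keep ticking regardless of errors.
    Ignore,
}

impl FailurePolicy {
    fn restart(max_restarts: u32, backoff: Duration) -> Self {
        FailurePolicy::Restart { max_restarts, backoff }
    }
    fn skip(max_failures: u32, cooldown: Duration) -> Self {
        FailurePolicy::Skip { max_failures, cooldown }
    }
}

fn main() {
    let p = FailurePolicy::restart(3, Duration::from_millis(50));
    assert_eq!(
        p,
        FailurePolicy::Restart { max_restarts: 3, backoff: Duration::from_millis(50) }
    );
}
```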
## Severity-Aware Handling
HORUS errors carry severity levels that can override the configured policy:
| Severity | Effect |
|---|---|
| Fatal (e.g., shared memory corruption) | Always stops the scheduler, even with Ignore policy |
| Transient (e.g., topic full, network timeout) | De-escalates Fatal policy to Restart (transient errors are recoverable) |
| Permanent (e.g., invalid configuration) | Follows the configured policy |
This means a safety-critical node with Fatal policy won't kill the system on a transient network glitch — it'll restart instead. But a shared-memory corruption always stops, even on an Ignore node.
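The override rules in the table reduce to a small resolution function. A sketch of that logic under the semantics described above (all type and function names here are ours, not HORUS APIs):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy { Fatal, Restart, Skip, Ignore }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity { Fatal, Transient, Permanent }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action { StopScheduler, RestartNode, FollowPolicy(Policy) }

/// Resolve the effective action from the configured policy and the
/// error's severity, per the table above.
fn resolve(policy: Policy, severity: Severity) -> Action {
    match severity {
        // Fatal errors always stop the scheduler, even under Ignore.
        Severity::Fatal => Action::StopScheduler,
        // Transient errors de-escalate a Fatal policy to a restart.
        Severity::Transient if policy == Policy::Fatal => Action::RestartNode,
        // Everything else follows the configured policy.
        _ => Action::FollowPolicy(policy),
    }
}

fn main() {
    assert_eq!(resolve(Policy::Ignore, Severity::Fatal), Action::StopScheduler);
    assert_eq!(resolve(Policy::Fatal, Severity::Transient), Action::RestartNode);
    assert_eq!(resolve(Policy::Skip, Severity::Permanent), Action::FollowPolicy(Policy::Skip));
}
```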
## Complete Robot Example

```rust
use horus::prelude::*;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(500_u64.hz())
        .prefer_rt()
        .watchdog(500_u64.ms());

    // CRITICAL: Motor controller — stop if it fails
    scheduler.add(MotorController::new())
        .order(0)
        .rate(500_u64.hz())
        .on_miss(Miss::SafeMode)
        .failure_policy(FailurePolicy::Fatal)
        .build()?;

    // RECOVERABLE: Lidar driver — restart on USB disconnect
    scheduler.add(LidarDriver::new())
        .order(1)
        .rate(100_u64.hz())
        .failure_policy(FailurePolicy::restart(5, 100_u64.ms()))
        .build()?;

    // RECOVERABLE: Camera — restart up to 3 times
    scheduler.add(CameraNode::new())
        .order(2)
        .rate(30_u64.hz())
        .failure_policy(FailurePolicy::restart(3, 200_u64.ms()))
        .build()?;

    // NON-CRITICAL: Path planner — skip if it fails repeatedly
    scheduler.add(PathPlanner::new())
        .order(5)
        .compute()
        .failure_policy(FailurePolicy::skip(3, 2_u64.secs()))
        .build()?;

    // BEST-EFFORT: Telemetry — ignore failures
    scheduler.add(TelemetryUploader::new())
        .order(200)
        .async_io()
        .rate(1_u64.hz())
        .failure_policy(FailurePolicy::Ignore)
        .build()?;

    scheduler.run()
}
```
## Choosing the Right Policy
| Node Type | Policy | Why |
|---|---|---|
| Motor control, safety | Fatal | Unsafe to continue without these |
| Sensor drivers | Restart(3-5, 50-200ms) | Hardware reconnects are common |
| Perception pipelines | Restart(3, 100ms) or Skip(5, 2s) | Can recover or degrade gracefully |
| Logging, telemetry | Skip(5, 1s) or Ignore | Non-critical, absence is tolerable |
| Debug/visualization | Ignore | Partial results are fine |
## Monitoring Failures
Failure events are recorded in the BlackBox flight recorder:
```bash
# View failure events from the blackbox
horus blackbox show --filter errors

# Monitor live
horus log -f --level error
```
In code:
```rust
// Inspect anomalies via CLI: horus blackbox --anomalies
if let Some(bb) = scheduler.get_blackbox() {
    for record in bb.lock().expect("blackbox lock").anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}
```
## See Also
- Safety Monitor — Watchdog and graduated degradation
- BlackBox Recorder — Crash forensics
- Scheduler Configuration — Per-node builder API