Fault Tolerance

Every node in a robot can fail — a sensor disconnects, a planning algorithm panics, a network request times out. HORUS failure policies control what happens next: stop the robot, restart the node, skip it temporarily, or ignore the error entirely.

The Problem

Without failure policies, one crashing node kills the entire system:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: scheduler stops
Tick 100: motor_controller stops receiving commands
Result: robot stops moving in the middle of a task

With failure policies, the system adapts:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: FailurePolicy::Restart → re-init sensor_driver (10ms backoff)
Tick 101: sensor_driver panics again → restart (20ms backoff)
Tick 102: USB reconnects → sensor_driver.init() succeeds → normal operation
Result: robot paused briefly, then resumed automatically

The Four Policies

Fatal — Stop Everything

scheduler.add(motor_controller)
    .order(0)
    .rate(1000_u64.hz())
    .failure_policy(FailurePolicy::Fatal)
    .build()?;

First failure stops the scheduler immediately. Use for nodes where continued operation after failure is unsafe:

  • Motor controllers (stale commands = uncontrolled motion)
  • Safety monitors (can't monitor safety if the monitor is broken)
  • Emergency stop handlers

When it triggers: node.tick() panics or returns an error. The scheduler calls stop() and shuts down all nodes cleanly.
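
Since a failure can surface either as a returned error or as a panic, both must funnel into one handling path before any policy applies. Here is a minimal std-only sketch of that idea (not the HORUS internals; tick_outcome is an illustrative name):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run one tick, converting a panic into an ordinary error so a single
/// failure-handling path can apply the configured policy. Under a Fatal
/// policy, either outcome would stop the scheduler.
fn tick_outcome<F: FnMut() -> Result<(), String>>(mut tick: F) -> Result<(), String> {
    match catch_unwind(AssertUnwindSafe(&mut tick)) {
        Ok(result) => result,                       // node returned Ok or Err
        Err(_) => Err("node panicked".to_string()), // panic caught as a failure
    }
}

fn main() {
    // A healthy tick and a failing tick flow through the same path.
    assert!(tick_outcome(|| Ok(())).is_ok());
    assert!(tick_outcome(|| Err("USB disconnected".to_string())).is_err());
    println!("both outcomes handled");
}
```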

Restart — Re-Initialize with Backoff

scheduler.add(lidar_driver)
    .order(1)
    .rate(100_u64.hz())
    .failure_policy(FailurePolicy::restart(3, 50_u64.ms()))
    .build()?;

Re-initializes the node with exponential backoff. Once max_restarts is exhausted, the failure escalates to a fatal stop.

failure 1 → restart, wait 50ms
failure 2 → restart, wait 100ms (2x backoff)
failure 3 → restart, wait 200ms (2x backoff)
failure 4 → max_restarts exceeded → fatal stop

After a successful tick, the backoff timer clears. Use for nodes that can recover from transient failures:

  • Sensor drivers (hardware reconnection)
  • Network clients (server temporarily unavailable)
  • Camera nodes (USB reset)
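
The doubling schedule traced above can be sketched in a few lines (an illustration of the backoff arithmetic, not the HORUS implementation; restart_backoff is an assumed name):

```rust
use std::time::Duration;

/// Delay before restart attempt `n` (1-based): the base delay doubles
/// after each successive failure, matching 50ms → 100ms → 200ms.
fn restart_backoff(base: Duration, attempt: u32) -> Duration {
    base * 2u32.pow(attempt.saturating_sub(1))
}

fn main() {
    let base = Duration::from_millis(50);
    for attempt in 1..=3 {
        println!("failure {attempt} -> restart, wait {:?}", restart_backoff(base, attempt));
    }
    // A 4th failure with max_restarts = 3 would escalate to a fatal stop.
}
```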

Skip — Tolerate with Cooldown

scheduler.add(telemetry_uploader)
    .order(200)
    .async_io()
    .failure_policy(FailurePolicy::skip(5, 1_u64.secs()))
    .build()?;

After max_failures consecutive failures, the node is suppressed for the cooldown period. After cooldown, the node is allowed again and the failure counter resets.

failure 1 → continue
failure 2 → continue
failure 3 → continue
failure 4 → continue
failure 5 → node suppressed for 1 second
... 1 second passes ...
node allowed again, failure count = 0

Use for nodes whose absence doesn't affect core robot operation:

  • Logging and telemetry upload
  • Diagnostics reporting
  • Cloud sync
  • Non-critical monitoring
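
The suppress-and-cooldown bookkeeping traced above can be sketched as a small state machine (illustrative only; SkipState and its fields are assumed names, not the HORUS API):

```rust
use std::time::{Duration, Instant};

/// Count consecutive failures, suppress the node for a cooldown once
/// `max_failures` is reached, then re-admit it with a reset counter.
struct SkipState {
    max_failures: u32,
    cooldown: Duration,
    consecutive: u32,
    suppressed_until: Option<Instant>,
}

impl SkipState {
    fn new(max_failures: u32, cooldown: Duration) -> Self {
        Self { max_failures, cooldown, consecutive: 0, suppressed_until: None }
    }

    /// Should the node tick this cycle?
    fn allowed(&mut self, now: Instant) -> bool {
        match self.suppressed_until {
            Some(until) if now < until => false, // still in cooldown
            Some(_) => {
                // Cooldown elapsed: re-admit the node with a clean slate.
                self.suppressed_until = None;
                self.consecutive = 0;
                true
            }
            None => true,
        }
    }

    fn record_failure(&mut self, now: Instant) {
        self.consecutive += 1;
        if self.consecutive >= self.max_failures {
            self.suppressed_until = Some(now + self.cooldown);
        }
    }

    fn record_success(&mut self) {
        self.consecutive = 0; // only *consecutive* failures count
    }
}

fn main() {
    let mut state = SkipState::new(5, Duration::from_secs(1));
    let now = Instant::now();
    for _ in 0..5 {
        state.record_failure(now);
    }
    assert!(!state.allowed(now)); // suppressed during the cooldown
    state.record_success();
    println!("node suppressed for {:?}", state.cooldown);
}
```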

Ignore — Swallow Failures

scheduler.add(stats_collector)
    .order(100)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;

Failures are completely ignored. The node keeps ticking every cycle regardless of errors. Use only when partial results are acceptable:

  • Statistics collectors (missing one sample is fine)
  • Best-effort visualization
  • Debug output nodes

Severity-Aware Handling

HORUS errors carry severity levels that can override the configured policy:

  • Fatal (e.g., shared memory corruption): always stops the scheduler, even with an Ignore policy
  • Transient (e.g., topic full, network timeout): de-escalates a Fatal policy to Restart (transient errors are recoverable)
  • Permanent (e.g., invalid configuration): follows the configured policy

This means a safety-critical node with a Fatal policy won't kill the system on a transient network glitch; it restarts instead. Shared-memory corruption, however, always stops the scheduler, even on a node with an Ignore policy.
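
The override rules above can be captured in a single match (a sketch of the resolution logic, not the HORUS API; the Policy and Severity enums here are assumptions):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Policy { Fatal, Restart, Skip, Ignore }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Severity { Fatal, Transient, Permanent }

/// Effective policy after severity-aware overrides.
fn effective_policy(configured: Policy, severity: Severity) -> Policy {
    match (severity, configured) {
        // Fatal errors always stop the scheduler, even under Ignore.
        (Severity::Fatal, _) => Policy::Fatal,
        // Transient errors de-escalate Fatal to Restart.
        (Severity::Transient, Policy::Fatal) => Policy::Restart,
        // Everything else follows the configured policy.
        (_, p) => p,
    }
}

fn main() {
    // A transient glitch on a Fatal-policy node restarts instead of stopping.
    assert_eq!(effective_policy(Policy::Fatal, Severity::Transient), Policy::Restart);
    // Shared-memory corruption stops the scheduler even under Ignore.
    assert_eq!(effective_policy(Policy::Ignore, Severity::Fatal), Policy::Fatal);
    println!("severity overrides applied");
}
```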

Complete Robot Example

use horus::prelude::*;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(500_u64.hz())
        .prefer_rt()
        .watchdog(500_u64.ms());

    // CRITICAL: Motor controller — stop if it fails
    scheduler.add(MotorController::new())
        .order(0)
        .rate(500_u64.hz())
        .on_miss(Miss::SafeMode)
        .failure_policy(FailurePolicy::Fatal)
        .build()?;

    // RECOVERABLE: Lidar driver — restart on USB disconnect
    scheduler.add(LidarDriver::new())
        .order(1)
        .rate(100_u64.hz())
        .failure_policy(FailurePolicy::restart(5, 100_u64.ms()))
        .build()?;

    // RECOVERABLE: Camera — restart up to 3 times
    scheduler.add(CameraNode::new())
        .order(2)
        .rate(30_u64.hz())
        .failure_policy(FailurePolicy::restart(3, 200_u64.ms()))
        .build()?;

    // NON-CRITICAL: Path planner — skip if it fails repeatedly
    scheduler.add(PathPlanner::new())
        .order(5)
        .compute()
        .failure_policy(FailurePolicy::skip(3, 2_u64.secs()))
        .build()?;

    // BEST-EFFORT: Telemetry — ignore failures
    scheduler.add(TelemetryUploader::new())
        .order(200)
        .async_io()
        .rate(1_u64.hz())
        .failure_policy(FailurePolicy::Ignore)
        .build()?;

    scheduler.run()
}

Choosing the Right Policy

  • Motor control, safety → Fatal: unsafe to continue without these
  • Sensor drivers → Restart(3-5, 50-200ms): hardware reconnects are common
  • Perception pipelines → Restart(3, 100ms) or Skip(5, 2s): can recover or degrade gracefully
  • Logging, telemetry → Skip(5, 1s) or Ignore: non-critical, absence is tolerable
  • Debug/visualization → Ignore: partial results are fine

Monitoring Failures

Failure events are recorded in the BlackBox flight recorder:

# View failure events from the blackbox
horus blackbox show --filter errors

# Monitor live
horus log -f --level error

In code:

// Iterate anomaly records captured by the blackbox
if let Some(bb) = scheduler.get_blackbox() {
    for record in bb.lock().expect("blackbox lock").anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}

See Also