Telemetry Export

You need to send HORUS runtime metrics (node tick durations, message counts, deadline misses) to an external monitoring system like Grafana, Prometheus, or a custom dashboard. Here is how to enable telemetry export with a single builder method.

When To Use This

  • Sending scheduler metrics to Grafana, InfluxDB, or Prometheus via an HTTP receiver
  • Logging metrics to a file for offline analysis
  • Streaming metrics over UDP to a LAN monitoring tool
  • Debugging performance with stdout metric output

Use Monitor instead if you want a built-in web/TUI dashboard without external tooling.

Use Logging instead if you need text-based diagnostic messages, not numeric metrics.

Prerequisites

  • A HORUS Rust project with use horus::prelude::*;
  • The telemetry feature enabled on horus_core (enabled by default)
  • For HTTP export: a running HTTP endpoint that accepts JSON POST requests

Quick Start

Enable telemetry with a single builder method:

// simplified
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .telemetry("http://localhost:9090/metrics");  // HTTP POST endpoint

scheduler.add(MyNode::new()?).order(0).rate(100_u64.hz()).build()?;
scheduler.run()?;

The scheduler exports a JSON snapshot every 1 second (default interval).
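
To confirm locally that snapshots are being produced before wiring up an HTTP receiver, you can point the exporter at stdout (one of the endpoint strings listed under Endpoint Types below):

// simplified -- print each snapshot to the terminal instead of POSTing it
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .telemetry("stdout");  // or "local"; switch to the HTTP URL once a receiver is running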


Architecture

Telemetry in HORUS is designed around one constraint: the scheduler tick loop must never block on I/O. The architecture achieves this with a three-stage pipeline that decouples metric collection from serialization and delivery.

┌─────────────────────────────────────────────────────────────┐
│  Scheduler Tick Loop  (RT thread)                           │
│                                                             │
│   for each node:                                            │
│     tick() → measure duration → record in Profiler          │
│                                                             │
│   if export_interval elapsed:                               │
│     Profiler.node_stats → TelemetryManager.gauge/counter    │
│     TelemetryManager.build_snapshot()                       │
│     TelemetryManager.export()                               │
│       ├─ HTTP  → try_send(snapshot) on bounded channel ──┐  │
│       ├─ UDP   → sendto() (single syscall, fire-forget)  │  │
│       ├─ File  → write_all() (buffered, local disk)      │  │
│       └─ Stdout→ println() (debugging only)              │  │
└──────────────────────────────────────────────────────┬───┘  │
                                                       │      │
                 ┌─────────────────────────────────────┘      │
                 │  Bounded Channel (capacity: 4)             │
                 │  try_send() — non-blocking                 │
                 │  If full → snapshot silently dropped        │
                 └──────────────┬──────────────────────────────┘
                                │
                 ┌──────────────▼──────────────────────────────┐
                 │  Background Thread  ("horus-telemetry-http") │
                 │                                              │
                 │  loop:                                       │
                 │    snapshot = rx.recv()                      │
                 │    TcpStream::connect(url)                   │
                 │    HTTP POST (JSON body)                     │
                 │    read status line (200/201/204 = ok)       │
                 │                                              │
                 │  On Drop: sender dropped → recv returns Err  │
                 │           → thread exits, handle.join()      │
                 └──────────────────────────────────────────────┘

Data flow step by step

  1. Collect -- The scheduler tick loop executes each node's tick() and records timing data in the internal Profiler. This is just a HashMap update per node, with no allocation on the hot path.

  2. Package -- At the configured export interval (default: 1 second), TelemetryManager reads the profiler stats and calls gauge_with_labels() / counter_with_labels() for each node. It then calls build_snapshot(), which clones the accumulated HashMap<String, Metric> into a TelemetrySnapshot struct.

  3. Deliver -- export() dispatches the snapshot to the configured endpoint:

    • HTTP: The snapshot is sent via try_send() on a bounded mpsc::sync_channel(4). If the background thread is busy with a slow POST and the channel is full, the snapshot is silently dropped. The scheduler thread returns immediately in all cases; a sketch of this hand-off follows the list.
    • UDP: A single sendto() syscall. The socket is pre-bound at startup and cached.
    • File: File::create() + write_all() with pretty-printed JSON.
    • Stdout: Direct println!() for debugging.
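
The HTTP hand-off can be pictured with a minimal sketch like the one below. This is not the HORUS source: the names (HttpExporter, Snapshot) are made up, and the request assembly is deliberately naive. It only illustrates the bounded-channel-plus-background-thread pattern described above.

// Illustrative sketch of the HTTP hand-off; not the HORUS source.
use std::io::Write;
use std::net::TcpStream;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

struct Snapshot(String); // a serialized JSON snapshot

struct HttpExporter {
    tx: SyncSender<Snapshot>, // the real manager also keeps the JoinHandle (see Graceful shutdown)
}

impl HttpExporter {
    fn new(addr: String, path: String) -> Self {
        let (tx, rx) = sync_channel::<Snapshot>(4); // bounded: at most 4 queued snapshots
        thread::spawn(move || {
            // Background thread: blocks on recv() and does the slow network work off the RT thread.
            while let Ok(Snapshot(body)) = rx.recv() {
                if let Ok(mut stream) = TcpStream::connect(&addr) {
                    let request = format!(
                        "POST {path} HTTP/1.1\r\nHost: {addr}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
                        body.len()
                    );
                    let _ = stream.write_all(request.as_bytes()); // best-effort delivery
                }
            }
            // All senders dropped -> recv() returns Err -> the thread exits.
        });
        Self { tx }
    }

    /// Called from the scheduler thread: queues the snapshot or drops it, never blocks.
    fn export(&self, snapshot: Snapshot) {
        // Ok(()) means queued; Err(TrySendError::Full(_)) means the receiver is behind and
        // this snapshot is dropped. Both paths return immediately.
        let _ = self.tx.try_send(snapshot);
    }
}

The key property is that export() completes in effectively constant time regardless of network state; everything slow happens on the background thread.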

Why the background thread only applies to HTTP

UDP, file, and stdout exports are inherently fast and bounded -- a UDP sendto() completes in microseconds, file writes go to the kernel page cache, and stdout is local. Only HTTP involves potentially unbounded network latency (DNS, TCP handshake, slow receivers), which is why it gets its own dedicated thread.

Graceful shutdown

When the TelemetryManager is dropped (scheduler exit), it drops the channel sender. The background thread's rx.recv() returns Err, breaking the loop. The Drop implementation then calls handle.join() to drain any queued snapshots before the process exits, ensuring no data is lost on clean shutdown.
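
This is the standard drop-the-sender / join-the-thread pattern. Shown standalone (an illustrative sketch, not the HORUS source):

// Illustrative: sender dropped -> worker drains the queue -> join returns.
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    let (tx, rx) = sync_channel::<String>(4);
    let worker = thread::spawn(move || {
        // recv() keeps returning queued items after the sender is dropped,
        // and only returns Err once the queue is empty and no sender remains.
        while let Ok(snapshot) = rx.recv() {
            println!("delivering {snapshot}");
        }
        println!("channel closed, worker exiting");
    });

    tx.send("snapshot-1".to_string()).unwrap();
    drop(tx);               // what the manager's Drop does with its sender
    worker.join().unwrap(); // drains anything still queued, then returns
}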


Endpoint Types

The .telemetry(endpoint) method accepts a string that determines the export backend:

String Format                    | Endpoint     | Behavior
"http://host:port/path"          | HTTP POST    | Non-blocking; a background thread handles network I/O
"https://host:port/path"         | HTTPS POST   | Same as HTTP, with TLS
"udp://host:port"                | UDP datagram | Compact JSON, single packet per snapshot
"file:///path/to/metrics.json"   | Local file   | Pretty-printed JSON, overwritten on each export
"/path/to/metrics.json"          | Local file   | Same as the file:// prefix
"stdout" or "local"              | Stdout       | Pretty-printed to the terminal (debugging)
"disabled" or ""                 | Disabled     | No export (default)

HTTP Endpoint (Non-Blocking)

// simplified
let mut scheduler = Scheduler::new()
    .telemetry("http://localhost:9090/metrics");

HTTP export is fully non-blocking for the scheduler:

  1. Scheduler calls export() on its cycle — posts a snapshot to a bounded channel (capacity 4)
  2. A dedicated background thread reads from the channel and performs the HTTP POST
  3. If the channel is full (receiver slow), the snapshot is silently dropped — the scheduler never blocks
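
The prerequisite on this path is simply an HTTP endpoint that accepts JSON POSTs. If you just need something listening while developing, a throwaway sink can be as small as the sketch below (std only; it naively assumes each request arrives in a single read and is not meant for production):

// http_sink.rs -- throwaway local receiver for testing (illustrative; not part of HORUS)
use std::io::{Read, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9090")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = vec![0u8; 64 * 1024];
        let n = stream.read(&mut buf)?; // naive: assumes the whole request fits in one read
        let request = String::from_utf8_lossy(&buf[..n]);
        if let Some(body) = request.split("\r\n\r\n").nth(1) {
            println!("{body}"); // the JSON snapshot
        }
        stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")?;
    }
    Ok(())
}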

UDP Endpoint (Low Overhead)

// simplified
let mut scheduler = Scheduler::new()
    .telemetry("udp://192.168.1.100:9999");

Sends compact single-line JSON per snapshot. Good for LAN monitoring where packet loss is acceptable.
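
On the receiving side, any UDP listener works. A minimal sketch (assuming, as described above, that each compact JSON snapshot fits in a single datagram):

// udp_sink.rs -- minimal listener for the UDP endpoint (illustrative; not part of HORUS)
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:9999")?; // matches the port in the example above
    let mut buf = vec![0u8; 64 * 1024];            // one snapshot per datagram
    loop {
        let (n, from) = socket.recv_from(&mut buf)?;
        println!("{from}: {}", String::from_utf8_lossy(&buf[..n]));
    }
}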

File Endpoint (Debugging & Logging)

// simplified
let mut scheduler = Scheduler::new()
    .telemetry("/tmp/horus-metrics.json");

Overwrites the file on each export cycle. Useful for debugging or feeding into log aggregation pipelines.
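
Because the payload is plain JSON, a consumer can stay very small. A sketch, assuming serde_json is available in the reading tool:

// read_metrics.rs -- print the latest snapshot from the file endpoint (illustrative)
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = fs::read_to_string("/tmp/horus-metrics.json")?;
    let snapshot: serde_json::Value = serde_json::from_str(&text)?;
    for metric in snapshot["metrics"].as_array().into_iter().flatten() {
        println!("{} = {}", metric["name"], metric["value"]);
    }
    Ok(())
}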


JSON Payload Format

Every export produces a TelemetrySnapshot:

{
  "timestamp_secs": 1710547200,
  "scheduler_name": "motor_control",
  "uptime_secs": 42.5,
  "metrics": [
    {
      "name": "node.tick_duration_us",
      "value": { "Gauge": 145.2 },
      "labels": { "node": "MotorCtrl" },
      "timestamp_secs": 1710547200
    },
    {
      "name": "node.total_ticks",
      "value": { "Counter": 4250 },
      "labels": { "node": "MotorCtrl" },
      "timestamp_secs": 1710547200
    },
    {
      "name": "scheduler.deadline_misses",
      "value": { "Counter": 0 },
      "labels": {},
      "timestamp_secs": 1710547200
    }
  ]
}
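
If you consume snapshots from Rust, the payload maps onto serde types along these lines. The field layout is inferred from the JSON above, not copied from horus_core, so treat it as a sketch (it assumes serde with the derive feature and serde_json in the consumer's dependencies):

// Illustrative deserialization types; shapes inferred from the JSON payload above.
use std::collections::HashMap;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TelemetrySnapshot {
    timestamp_secs: u64,
    scheduler_name: String,
    uptime_secs: f64,
    metrics: Vec<Metric>,
}

#[derive(Debug, Deserialize)]
struct Metric {
    name: String,
    value: MetricValue, // externally tagged: { "Gauge": 145.2 }, { "Counter": 4250 }, ...
    labels: HashMap<String, String>,
    timestamp_secs: u64,
}

#[derive(Debug, Deserialize)]
enum MetricValue {
    Counter(u64),
    Gauge(f64),
    Histogram(Vec<f64>),
    Text(String),
}

With these in place, serde_json::from_str::<TelemetrySnapshot>(&body) gives typed access to every metric in a snapshot.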

Metric Value Types

Type      | JSON                              | Description
Counter   | { "Counter": 42 }                 | Monotonically increasing (total ticks, messages sent)
Gauge     | { "Gauge": 3.14 }                 | Current value (tick duration, CPU usage)
Histogram | { "Histogram": [0.1, 0.2, 0.15] } | Distribution of values
Text      | { "Text": "Healthy" }             | String status

Auto-Collected Metrics

When telemetry is enabled, the scheduler automatically exports:

Metric Name               | Type    | Labels | Description
node.total_ticks          | Counter | node   | Total ticks executed
node.tick_duration_us     | Gauge   | node   | Last tick duration in microseconds
node.errors               | Counter | node   | Total tick errors
scheduler.deadline_misses | Counter | (none) | Total deadline misses across all nodes
scheduler.uptime_secs     | Gauge   | (none) | Scheduler uptime in seconds

Feature Flag

Telemetry requires the telemetry feature on horus_core, which is enabled by default.

To disable at compile time (saves binary size):

[dependencies]
horus = { version = "0.1", default-features = false, features = ["macros", "blackbox"] }

Integration with External Tools

Grafana + Custom Receiver

Write a small HTTP server that receives the JSON POST and forwards metrics to Prometheus/InfluxDB:

# receiver.py — minimal Flask example
from flask import Flask, request
app = Flask(__name__)

@app.route("/metrics", methods=["POST"])
def metrics():
    snapshot = request.json
    for m in snapshot["metrics"]:
        print(f"{m['name']} = {m['value']}")
    return "ok", 200

app.run(port=9090)

horus monitor (Alternative)

For local debugging, horus monitor provides a built-in TUI dashboard — no external setup needed. See Monitor for details.

Prometheus Integration

Prometheus scrapes metrics over HTTP in its own exposition format, not JSON. HORUS exports JSON, so you need a small bridge that converts the JSON payload into Prometheus-compatible text. There are two approaches.
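
For reference, the JSON snapshot shown earlier corresponds to exposition-format lines roughly like the following (this sample uses the file-based script's convention of prefixing horus_ and replacing dots with underscores; exact names depend on which approach you pick):

horus_node_tick_duration_us{node="MotorCtrl"} 145.2
horus_node_total_ticks{node="MotorCtrl"} 4250
horus_scheduler_deadline_misses 0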

Approach A: Python bridge (recommended for quick setup)

This script receives HORUS JSON POSTs and serves a /metrics endpoint that Prometheus scrapes:

# prometheus_bridge.py
from flask import Flask, request
from prometheus_client import Gauge, Counter, generate_latest, REGISTRY

app = Flask(__name__)

# Dynamic metric storage
gauges = {}
counters = {}
last_counter_values = {}  # last seen HORUS counter totals, keyed by (name, labels)

def get_or_create_gauge(name, labels):
    key = name
    if key not in gauges:
        label_keys = list(labels.keys()) if labels else []
        gauges[key] = Gauge(
            name.replace(".", "_"),
            f"HORUS metric: {name}",
            label_keys,
        )
    return gauges[key]

def get_or_create_counter(name, labels):
    key = name
    if key not in counters:
        label_keys = list(labels.keys()) if labels else []
        counters[key] = Counter(
            name.replace(".", "_"),
            f"HORUS metric: {name}",
            label_keys,
        )
    return counters[key]

@app.route("/ingest", methods=["POST"])
def ingest():
    """Receives HORUS telemetry JSON and updates Prometheus metrics."""
    snapshot = request.json
    for m in snapshot["metrics"]:
        labels = m.get("labels", {})
        value_dict = m["value"]

        if "Gauge" in value_dict:
            g = get_or_create_gauge(m["name"], labels)
            if labels:
                g.labels(**labels).set(value_dict["Gauge"])
            else:
                g.set(value_dict["Gauge"])
        elif "Counter" in value_dict:
            c = get_or_create_counter(m["name"], labels)
            # Prometheus counters only increment; set to the HORUS total
            current = c._value.get() if not labels else 0
            delta = value_dict["Counter"] - current
            if delta > 0:
                if labels:
                    c.labels(**labels).inc(delta)
                else:
                    c.inc(delta)
    return "ok", 200

@app.route("/metrics")
def metrics():
    """Prometheus scrape endpoint."""
    return generate_latest(REGISTRY), 200, {"Content-Type": "text/plain; charset=utf-8"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9090)

Configure HORUS to push to the bridge:

// simplified
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .telemetry("http://localhost:9090/ingest");  // POST to the bridge

Add the bridge to your prometheus.yml:

scrape_configs:
  - job_name: "horus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]

Approach B: File-based with node_exporter textfile collector

For environments where you cannot run an extra HTTP service, write telemetry to a file that Prometheus node_exporter picks up:

// simplified
let mut scheduler = Scheduler::new()
    .tick_rate(100_u64.hz())
    .telemetry("/var/lib/node_exporter/textfile/horus.json");

Then use a cron job or sidecar script to convert the JSON file to Prometheus exposition format:

#!/usr/bin/env python3
# json_to_prom.py — convert HORUS JSON to Prometheus textfile format
import json, sys

with open(sys.argv[1]) as f:
    snapshot = json.load(f)

output = sys.argv[2] if len(sys.argv) > 2 else "/var/lib/node_exporter/textfile/horus.prom"

lines = []
for m in snapshot["metrics"]:
    name = m["name"].replace(".", "_")
    labels = m.get("labels", {})
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    label_part = "{" + label_str + "}" if label_str else ""

    value_dict = m["value"]
    for vtype in ("Gauge", "Counter"):
        if vtype in value_dict:
            lines.append(f"horus_{name}{label_part} {value_dict[vtype]}")

with open(output, "w") as f:
    f.write("\n".join(lines) + "\n")

Run it periodically: watch -n 1 python3 json_to_prom.py /var/lib/node_exporter/textfile/horus.json


Performance Overhead

All telemetry export happens outside the scheduler's critical timing path. The overhead per export cycle depends on the endpoint:

  • HTTP: ~50us on the scheduler thread. The scheduler only pays for build_snapshot() plus try_send() into the bounded channel; a background thread performs the actual POST, so network I/O stays entirely off the RT thread.
  • UDP: ~5us on the scheduler thread. A single fire-and-forget sendto() syscall; the socket is pre-bound at startup, so there is no per-export connection overhead.
  • File: ~10us on the scheduler thread. File::create() + write_all() goes to the kernel page cache, not to disk; the actual disk flush happens asynchronously.
  • Stdout: negligible. println!() to the terminal; useful only for debugging and not recommended in production.
  • Disabled: zero. should_export() returns false immediately and export is a no-op.

What "overhead" means in practice

At a 100 Hz tick rate, the scheduler has a 10 ms budget per tick. A 50us telemetry export (the worst case for HTTP) consumes 0.5% of that budget, and it only runs once per second (every 100th tick), not every tick. The amortized per-tick cost is effectively 0.005%.

For UDP and file endpoints, the overhead is even smaller. The should_export() check (a single Instant::elapsed() comparison) runs every tick and costs <1us.
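
The interval check is nothing more than an elapsed-time comparison of this shape (an illustrative sketch, not the actual should_export() source):

// Illustrative: a per-tick elapsed-time check like this costs well under a microsecond.
use std::time::{Duration, Instant};

struct ExportTimer {
    last_export: Instant,
    interval: Duration,
}

impl ExportTimer {
    fn new(interval: Duration) -> Self {
        Self { last_export: Instant::now(), interval }
    }

    fn should_export(&mut self) -> bool {
        if self.last_export.elapsed() >= self.interval {
            self.last_export = Instant::now();
            true
        } else {
            false
        }
    }
}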

Memory overhead

TelemetryManager maintains a HashMap<String, Metric> that grows with the number of unique metric names. For a typical system with 10 nodes, this is approximately 2-4 KB. The bounded channel for HTTP holds at most 4 snapshots, each roughly 1-2 KB serialized.


Complete Example

// simplified
use horus::prelude::*;

struct SensorNode {
    pub_topic: Topic<Imu>,
}

impl SensorNode {
    fn new() -> Result<Self> {
        Ok(Self { pub_topic: Topic::new("imu.raw")? })
    }
}

impl Node for SensorNode {
    fn name(&self) -> &str { "Sensor" }
    fn tick(&mut self) {
        self.pub_topic.send(Imu {
            orientation: [1.0, 0.0, 0.0, 0.0],
            angular_velocity: [0.0, 0.0, 0.0],
            linear_acceleration: [0.0, 0.0, 9.81],
        });
    }
}

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(100_u64.hz())
        .telemetry("http://localhost:9090/metrics");  // export every 1s

    scheduler.add(SensorNode::new()?)
        .order(0)
        .rate(100_u64.hz())
        .build()?;

    scheduler.run()
}

Common Errors

  • No metrics exported
    Cause: telemetry endpoint not configured.
    Fix: add .telemetry("http://localhost:9090/metrics") to the Scheduler::new() builder chain.

  • HTTP metrics silently dropped
    Cause: the receiver is slow or unreachable and the bounded channel (capacity 4) is full.
    Fix: check that the receiver is running and can keep up; the scheduler never blocks either way.

  • Compile error: telemetry method not found
    Cause: the telemetry feature is disabled.
    Fix: ensure the horus dependency has the telemetry feature (enabled by default).

  • File endpoint not updating
    Cause: the file is overwritten each cycle but the viewer caches it.
    Fix: re-read the file, or use watch cat /tmp/horus-metrics.json.

  • UDP metrics lost
    Cause: network packet loss on the LAN.
    Fix: expected behavior for UDP; use HTTP for reliable delivery.

Design Decisions

Understanding why telemetry works this way helps you make the right integration choices for your system.

Why non-blocking (RT safety)

HORUS is a real-time robotics framework. The scheduler tick loop has hard timing budgets -- a motor control node running at 1 kHz has exactly 1 ms per tick. If the telemetry export blocked on a slow HTTP server or a full TCP buffer, it would cause deadline misses and potentially unsafe behavior. The non-blocking design ensures that telemetry is always best-effort and never degrades control loop timing.

The bounded channel with try_send() is the key mechanism. On the scheduler thread, export() either succeeds instantly (channel has space) or drops the snapshot instantly (channel full). Both paths complete in nanoseconds. The actual network I/O happens on a separate thread that the scheduler never waits on.

Why bounded channel with drop (not backpressure)

Backpressure means "slow the producer when the consumer is slow." In telemetry for a real-time system, this is exactly wrong -- you never want your 1 kHz motor controller to slow down because Grafana's ingest endpoint is overloaded.

The bounded channel (capacity 4) provides a small buffer for transient slowness (e.g., a brief network hiccup) while hard-capping memory usage. When the buffer is full, the incoming snapshot is dropped rather than queued; for telemetry that is the correct behavior, because a newer snapshot with fresher data will be produced within a second once the receiver catches up.

The capacity of 4 is deliberate: at the default 1-second export interval, it absorbs up to 4 seconds of receiver downtime before dropping begins. This handles common transient issues (GC pauses, brief network congestion) without unbounded memory growth.

Why JSON (not Protocol Buffers or MessagePack)

Three reasons:

  1. Debuggability -- You can pipe HORUS telemetry to jq, read it in a browser, or cat the file endpoint directly. Binary formats require dedicated tools to inspect.

  2. No build dependency -- Protocol Buffers require protoc and .proto files. MessagePack requires additional codec libraries. JSON serialization via serde_json is already a transitive dependency of HORUS and adds zero extra build complexity.

  3. Adequate performance -- Serializing a typical telemetry snapshot (~10 metrics) to JSON takes <10us. For a 1 Hz export, this is negligible. If HORUS ever needs sub-millisecond export rates (unlikely for telemetry), a binary format would be reconsidered.

Why feature-gated

The telemetry feature on horus_core is enabled by default, but it can be disabled for two reasons:

  • Binary size -- Disabling telemetry removes serde_json, the TelemetryManager, and the HTTP background thread infrastructure. For deeply embedded targets with tight flash constraints, this matters.
  • Attack surface -- In locked-down production deployments, some teams prefer to compile out all network-facing code paths. Disabling the feature guarantees at the type level that no telemetry endpoint is reachable.

When disabled, all telemetry builder methods are either absent (compile error if used) or no-op, depending on the feature gate configuration. There is no runtime overhead.
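
If your application defines its own feature that forwards to horus's telemetry feature, the builder call can also be gated in code. The pattern below is an illustration of that idea; the feature forwarding is an assumption, not a HORUS-provided mechanism:

// simplified -- assumes your crate has a "telemetry" feature that enables horus/telemetry
let mut scheduler = Scheduler::new().tick_rate(100_u64.hz());

#[cfg(feature = "telemetry")]
{
    scheduler = scheduler.telemetry("http://localhost:9090/metrics");
}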


Trade-offs

  • Non-blocking try_send() for HTTP
    Benefit: the scheduler never stalls on network I/O.
    Cost: snapshots may be silently dropped if the receiver is slow.

  • Bounded channel (capacity 4)
    Benefit: predictable memory usage; absorbs transient slowness.
    Cost: only 4 seconds of buffering before drops begin.

  • Silent drop (no error log on full channel)
    Benefit: no log spam when the receiver is down for extended periods.
    Cost: harder to notice when telemetry delivery is degraded (check receiver-side metrics).

  • JSON serialization
    Benefit: human-readable, no build dependencies, easy debugging.
    Cost: roughly 5x larger payload than protobuf and about 3x slower serialization (still <10us per snapshot).

  • Single background thread (HTTP only)
    Benefit: minimal resource usage, simple shutdown semantics.
    Cost: HTTP throughput is limited to one in-flight POST at a time; pipelining is not supported.

  • File endpoint overwrites (not appends)
    Benefit: simple, bounded disk usage, always shows the latest state.
    Cost: no history; use HTTP plus a time-series database for historical data.

  • UDP fire-and-forget
    Benefit: lowest overhead (~5us), good for LAN monitoring.
    Cost: no delivery guarantees; packet loss on congested networks.

  • 1-second default interval
    Benefit: low overhead, sufficient for most dashboards.
    Cost: not suitable for sub-second alerting (use horus monitor for real-time views).

  • Feature-gated (compile-time opt-out)
    Benefit: zero overhead when disabled, smaller binary.
    Cost: must recompile to enable or disable; no runtime toggle.

  • Per-metric HashMap accumulation
    Benefit: deduplicates metrics by name; the latest value always wins.
    Cost: a small allocation per unique metric name; not lock-free (acceptable since only the scheduler thread writes).

See Also

  • Monitor — Built-in web and TUI dashboards (no external setup needed)
  • Logging — Structured text logging with hlog! macros
  • Scheduler API — Scheduler builder methods including .telemetry()
  • Debugging Workflows — Using telemetry data to diagnose performance issues
  • Operations — Production monitoring and deployment patterns