# Production Deployment (Python)
Your Python HORUS nodes work on your laptop. Now they need to run 24/7 on a robot with no keyboard, no monitor, and nobody watching. This page covers virtual environments, dependency pinning, systemd services, logging, monitoring, garbage collection tuning, memory profiling, and the decision of what stays in Python versus what gets rewritten.
## Virtual Environment Setup

Always isolate HORUS Python nodes in a virtual environment. System Python packages drift between OS updates and break silently.

```bash
# Create a dedicated venv for your project
python3 -m venv /opt/myrobot/venv

# Activate and install horus
source /opt/myrobot/venv/bin/activate
pip install maturin
cd /path/to/horus/horus_py
maturin develop --release

# Install your project dependencies
pip install -r requirements.txt
```
### venv in horus.toml Projects

If you are using `horus.toml` for project management, HORUS generates a `.horus/pyproject.toml` from your manifest. The venv still works -- install the generated project after activation:

```bash
source /opt/myrobot/venv/bin/activate
cd /path/to/your/project
horus build          # generates .horus/pyproject.toml from horus.toml
pip install -e .horus/
```
## Dependency Pinning

Pin every dependency version. An unpinned numpy upgrade at 3 AM will crash your robot at 3:01 AM.

### requirements.txt

```text
numpy==1.26.4
opencv-python-headless==4.9.0.80
torch==2.2.1+cpu
onnxruntime==1.17.1
scipy==1.12.0
```

Generate from your working environment:

```bash
pip freeze > requirements.txt
```
### horus.toml

For HORUS-managed projects, pin in the manifest:

```toml
[dependencies]
numpy = { version = "1.26.4", source = "pypi" }
opencv-python-headless = { version = "4.9.0.80", source = "pypi" }
torch = { version = "2.2.1+cpu", source = "pypi" }
```

HORUS generates `horus.lock` (lockfile v3) with exact resolved versions for reproducible installs across machines.
### CPU-Only PyTorch

Production robots rarely have datacenter GPUs. Use the CPU-only torch build to save 2 GB of disk and avoid CUDA driver version mismatches:

```bash
pip install torch==2.2.1+cpu --index-url https://download.pytorch.org/whl/cpu
```
## systemd Service Files

Run HORUS Python nodes as systemd services for automatic restart, logging, and boot-time startup.
### Basic Service

```ini
# /etc/systemd/system/horus-myrobot.service
[Unit]
Description=HORUS MyRobot Nodes
After=network.target

[Service]
Type=simple
User=robot
Group=robot
WorkingDirectory=/opt/myrobot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u main.py
Restart=on-failure
RestartSec=3
StandardOutput=journal
StandardError=journal

# Shared memory access
SupplementaryGroups=

# Real-time scheduling (optional)
LimitMEMLOCK=infinity
LimitRTPRIO=99

[Install]
WantedBy=multi-user.target
```
### Key Settings

| Setting | Value | Why |
|---|---|---|
| `Type=simple` | Required | HORUS blocks on `horus.run()` |
| `User=robot` | Dedicated user | Never run as root in production |
| `-u` flag on Python | Required | Unbuffered output so journald gets logs immediately |
| `Restart=on-failure` | Auto-restart | systemd restarts if the process exits non-zero |
| `RestartSec=3` | 3-second delay | Prevents restart loops from burning CPU |
| `LimitMEMLOCK=infinity` | For RT nodes | Allows memory locking to prevent page faults |
| `LimitRTPRIO=99` | For RT nodes | Allows real-time scheduling priority |
### Enable and Start

```bash
sudo systemctl daemon-reload
sudo systemctl enable horus-myrobot.service
sudo systemctl start horus-myrobot.service

# Check status
sudo systemctl status horus-myrobot.service

# View logs
journalctl -u horus-myrobot.service -f
```
### Multi-Node Service with Separate Processes

For process isolation, run each node as its own service:

```ini
# /etc/systemd/system/horus-camera.service
[Unit]
Description=HORUS Camera Node
After=network.target

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/camera_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target
```

```ini
# /etc/systemd/system/horus-planner.service
[Unit]
Description=HORUS Planner Node
After=horus-camera.service

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/planner_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target
```
Use `After=` to express startup order between nodes. Use a shared `.target` to start/stop the entire robot stack as one unit.
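A shared target for the services above might look like this (the unit name matches the `WantedBy=horus-myrobot.target` lines in the service files; treat this as a sketch, not a canonical HORUS file):

```ini
# /etc/systemd/system/horus-myrobot.target
[Unit]
Description=HORUS MyRobot Stack

[Install]
WantedBy=multi-user.target
```

After enabling the individual services, `sudo systemctl start horus-myrobot.target` brings up the whole stack, and `sudo systemctl stop horus-myrobot.target` takes it down as one unit.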
## Log Collection

HORUS nodes produce two streams of logs: structured logs from `node.log_*()` calls and standard output from `print()`.
### horus logs

The `horus logs` CLI command reads the structured log stream:

```bash
# Follow logs from all running nodes
horus logs -f

# Filter by node name
horus logs -f --node camera

# Filter by level
horus logs -f --level warning
```
### node.log_* Output

Inside your tick function, use the structured logging methods:

```python
def tick(node):
    node.log_info("Frame processed")
    node.log_warning("Latency spike: 12ms")
    node.log_error("Motor timeout")
    node.log_debug("Raw encoder: 4821")
```

These go through the scheduler's logging pipeline, tagged with the node name and timestamp. They appear in `horus logs` and, when running under systemd, in the journal.
### journald Integration

When running as a systemd service, all output (structured logs and print statements) goes to the journal:

```bash
# Live follow
journalctl -u horus-myrobot.service -f

# Last 100 lines
journalctl -u horus-myrobot.service -n 100

# Since last boot
journalctl -u horus-myrobot.service -b

# Export for analysis
journalctl -u horus-myrobot.service --output=json > logs.json
```
### Log Rotation

journald handles rotation automatically. For long-running deployments, configure retention:

```ini
# /etc/systemd/journald.conf.d/horus.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7d
```
## Performance Tuning

### What Stays in Python

Python is the right choice for nodes that are I/O-bound, compute-heavy-but-batchable, or change frequently:

| Node type | Why Python works | Typical rate |
|---|---|---|
| ML inference | PyTorch/ONNX ecosystem, GPU offload | 10-30 Hz |
| Data logging | I/O-bound (disk, database, network) | 1-10 Hz |
| Path planning | scipy/numpy, `compute=True` offloads to thread pool | 1-10 Hz |
| Visualization | matplotlib, OpenCV display | 1-30 Hz |
| HTTP/API integration | aiohttp, async nodes handle I/O naturally | 0.1-10 Hz |
| Prototyping | Fast iteration, no compile step | Any |
### When to Rewrite a Python Node in Another Language

Rewrite a node when Python becomes the bottleneck, not before. Profile first:

| Signal | What it means | Action |
|---|---|---|
| `tick()` consistently exceeds budget | CPU-bound work is too slow | Profile, optimize, then rewrite the hot path |
| Deadline misses under load | GIL contention or GC pauses | Try `gc.disable()`, then rewrite if still missing |
| Memory growing unbounded | Python object overhead | Profile with `tracemalloc`, rewrite if unfixable |
| Latency jitter >1 ms at >100 Hz | Python overhead is inherent | Rewrite -- Python cannot do sub-ms deterministic ticks |

The practical threshold: if your node needs deterministic ticks above 100 Hz, or sub-millisecond jitter, rewrite it. Below that, Python is fine.
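Profiling that decision can start with something as small as measuring the wake-up jitter of a plain Python loop at your target rate. A minimal sketch (pure stdlib, not the HORUS scheduler; the function name and rates are illustrative):

```python
import statistics
import time

def measure_tick_jitter(rate_hz=200, n_ticks=100):
    """Measure how far a fixed-rate sleep loop drifts from its deadlines."""
    period = 1.0 / rate_hz
    deadline = time.perf_counter() + period
    errors = []
    for _ in range(n_ticks):
        remaining = deadline - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        # Distance from the deadline (late or early) is the per-tick jitter
        errors.append(abs(time.perf_counter() - deadline))
        deadline += period
    return {
        "mean_jitter_us": statistics.mean(errors) * 1e6,
        "max_jitter_us": max(errors) * 1e6,
    }

stats = measure_tick_jitter()
print(stats)
```

If the max jitter already exceeds your budget with an empty loop body, no amount of optimizing the tick body will help.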
## Monitoring Python Nodes

### horus monitor

The `horus monitor` command shows a live dashboard of all running nodes:

```bash
horus monitor
```

This shows per-node tick rate, budget usage, deadline misses, error counts, and health state.
### Programmatic Monitoring

Use the Scheduler API to query stats from within your code:

```python
import horus

sched = horus.Scheduler(tick_rate=1000, rt=True)
sched.add(sensor_node)
sched.add(planner_node)

# Start in background or check after running
stats = sched.get_node_stats("sensor")
print(f"Total ticks: {stats['total_ticks']}")
print(f"Errors: {stats['errors_count']}")
```
### safety_stats()

For safety-critical deployments, query the safety monitor:

```python
safety = sched.safety_stats()
if safety:
    print(f"Watchdog: {safety}")
    # Returns a dict with watchdog stats, deadline misses, health states
```
### Health Checks

Build a health-check endpoint for external monitoring (Prometheus, Grafana, fleet manager):

```python
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

import horus

sched = None  # set to the running Scheduler at startup

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stats = {}
        for name in ["camera", "planner", "motor"]:
            stats[name] = sched.get_node_stats(name)
        healthy = all(s.get("errors_count", 0) == 0 for s in stats.values())
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(stats).encode())

    def log_message(self, format, *args):
        pass  # Suppress access logs

def start_health_server():
    server = HTTPServer(("0.0.0.0", 8080), HealthHandler)
    server.serve_forever()
```

Run the health server in a background thread or as a separate async node.
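Since `serve_forever()` blocks, the background-thread variant keeps the main thread free for the scheduler. A self-contained sketch (`PingHandler` is a stand-in for a real health handler, and port 0 lets the OS pick a free port for the demo):

```python
import json
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())

    def log_message(self, format, *args):
        pass  # Suppress access logs

server = HTTPServer(("127.0.0.1", 0), PingHandler)
# daemon=True: the thread dies with the main process instead of blocking shutdown
threading.Thread(target=server.serve_forever, daemon=True).start()

# The main thread is now free for the scheduler (horus.run() in production)
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    print(resp.status)  # 200
server.shutdown()
```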
## Garbage Collection Tuning

Python's garbage collector introduces non-deterministic pauses. For nodes with timing constraints, tune or disable it.

### Disable GC for Real-Time Nodes

If your node has a tight budget (sub-10ms) and allocates few objects per tick, disable GC entirely:

```python
import gc
import horus

def init(node):
    gc.disable()
    node.log_info("GC disabled for RT node")

def tick(node):
    # Pre-allocated buffers only -- no new objects per tick
    cmd = node.recv("cmd_vel")
    if cmd:
        apply_command(cmd)

motor = horus.Node(
    name="motor",
    tick=tick,
    init=init,
    rate=1000,
    subs=["cmd_vel"],
    failure_policy="fatal",
)
```
**Requirement:** When GC is disabled, you must not create circular references. Use pre-allocated buffers, avoid closures that capture `self`, and avoid building data structures in `tick()`. If you leak memory with GC disabled, it is never reclaimed.
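The difference between the two patterns fits in a few lines -- a sketch contrasting an in-place buffer with a reference cycle that refcounting alone can never free (names are illustrative):

```python
import gc

gc.disable()  # as in the RT-node init above

# Safe with GC off: allocate once, overwrite in place every tick
buffer = bytearray(1024)

def write_tick(data: bytes) -> int:
    n = min(len(data), len(buffer))
    buffer[:n] = data[:n]  # no new objects created
    return n

print(write_tick(b"hello"))  # 5

# Unsafe with GC off: a reference cycle is invisible to refcounting
class Link:
    def __init__(self):
        self.peer = None

a, b = Link(), Link()
a.peer, b.peer = b, a
del a, b  # with GC disabled, these objects would leak forever

collected = gc.collect()  # a one-off manual collect still works
print(collected >= 2)  # the cycle was found and freed
gc.enable()
```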
### Tune GC Thresholds for Other Nodes

For nodes that allocate objects (ML inference, data processing), tune the thresholds instead of disabling:

```python
import gc

def init(node):
    # Default thresholds: (700, 10, 10)
    # Raise the gen0 threshold to reduce collection frequency
    gc.set_threshold(1500, 15, 15)
    node.log_info(f"GC thresholds: {gc.get_threshold()}")
```

Higher thresholds mean fewer GC pauses but higher peak memory usage. Measure both latency and memory for your workload.
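The latency half of that measurement is easy to automate with `gc.callbacks`, which the interpreter invokes at the start and stop of every collection. A sketch:

```python
import gc
import time

pauses_ms = []
_start = [0.0]

def _gc_timer(phase, info):
    # phase is "start" or "stop"; info includes the generation collected
    if phase == "start":
        _start[0] = time.perf_counter()
    else:
        pauses_ms.append((time.perf_counter() - _start[0]) * 1e3)

gc.callbacks.append(_gc_timer)

keep = []
for i in range(50_000):
    keep.append({"value": [i]})  # net allocations trigger gen0 collections

gc.collect()  # guarantee at least one timed collection
gc.callbacks.remove(_gc_timer)

print(f"{len(pauses_ms)} collections, worst pause {max(pauses_ms):.3f} ms")
```

Re-run with different thresholds and compare the collection count and worst pause against your tick budget.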
### Manual GC Between Ticks

For the best control, disable automatic GC and trigger collection manually during idle periods:

```python
import gc
import horus

gc.disable()

def tick(node):
    if not node.has_msg("camera.rgb"):
        # No frame to process -- a good time to collect
        gc.collect(generation=0)  # Only gen0, fast (~100 us)
        return
    frame = node.recv("camera.rgb")
    detect(frame)
```
## Memory Profiling

### tracemalloc for Leak Detection

Python nodes running for days can leak memory through accumulating references. Use `tracemalloc` to find the source:

```python
import tracemalloc
import horus

tracemalloc.start(10)  # Keep 10 frames of traceback

tick_count = 0
baseline = None

def tick(node):
    global tick_count, baseline
    tick_count += 1

    # Normal work
    process_data(node)

    # Snapshot every 10000 ticks
    if tick_count % 10000 == 0:
        snapshot = tracemalloc.take_snapshot()
        if baseline is None:
            baseline = snapshot
        else:
            stats = snapshot.compare_to(baseline, "lineno")
            for stat in stats[:5]:
                node.log_warning(f"Memory growth: {stat}")
```
### What to Look For

| Pattern | Likely cause | Fix |
|---|---|---|
| Steady growth in one file/line | List or dict accumulating entries | Cap size or use `collections.deque(maxlen=N)` |
| Growth in `node.recv()` calls | Holding references to old messages | Process and discard, do not store |
| Growth in `json.loads()` | String interning or dict caching | Use msgpack or typed messages instead |
| Growth in a third-party library | Library-internal caching | Check library docs for cache control |
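For the first pattern, the `collections.deque(maxlen=N)` fix looks like this (a cap of 100 chosen for illustration):

```python
from collections import deque

# A bounded history: the oldest entry is evicted automatically at the cap,
# so memory stays flat no matter how many ticks run
history = deque(maxlen=100)

for i in range(1_000):
    history.append(i)

print(len(history))             # 100
print(history[0], history[-1])  # 900 999
```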
### Resource Monitoring

Monitor system resources from within a node:

```python
import os
import resource

import horus

def monitor_tick(node):
    # RSS (Resident Set Size) in MB
    usage = resource.getrusage(resource.RUSAGE_SELF)
    rss_mb = usage.ru_maxrss / 1024  # Linux reports ru_maxrss in KB

    node.send("diagnostics.memory", {
        "rss_mb": rss_mb,
        "pid": os.getpid(),
    })

    if rss_mb > 500:
        node.log_warning(f"High memory: {rss_mb:.0f} MB")

monitor = horus.Node(
    name="resource_monitor",
    tick=monitor_tick,
    rate=1,
    pubs=["diagnostics.memory"],
)
```
## Mixed Deployments

The most effective production architectures combine Python and other HORUS-supported languages. Each language handles what it does best, communicating through zero-copy shared memory topics.

### Typical Architecture

```text
Camera Driver (high-freq, safety)  ──→ camera.rgb topic
ML Inference (Python, PyTorch)     ←── camera.rgb topic
                                   ──→ detections topic
Path Planner (Python, scipy)       ←── detections topic
                                   ──→ path topic
Motor Controller (high-freq, RT)   ←── path topic
                                   ──→ motor.status topic
Safety Monitor (high-freq, RT)     ←── motor.status topic
```
Safety-critical nodes (camera driver, motor controller, safety monitor) benefit from compiled languages. Python handles ML inference and path planning where ecosystem libraries matter more than tick latency.
### Running Together

Each process runs independently. They communicate through HORUS topics over shared memory:

```bash
# Terminal 1: compiled safety-critical nodes
horus run safety_stack

# Terminal 2: Python ML nodes
source /opt/myrobot/venv/bin/activate
python ml_nodes.py

# Terminal 3: Python planner
source /opt/myrobot/venv/bin/activate
python planner.py
```

Or use systemd to manage all processes:

```ini
# /etc/systemd/system/horus-safety.service
[Service]
ExecStart=/usr/local/bin/horus run safety_stack

# /etc/systemd/system/horus-ml.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u ml_nodes.py

# /etc/systemd/system/horus-planner.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u planner.py
```
### The Handoff Pattern

When a Python prototype node gets promoted to production, the topic interface stays the same. Only the implementation changes:

```python
# Python prototype (runs at 30 Hz, good enough for testing)
def planner_tick(node):
    scan = node.recv("lidar.scan")
    if scan:
        path = compute_path(scan)  # scipy A*
        node.send("path", path)
```
The compiled replacement subscribes to the same topics and publishes the same messages. No other node needs to change. This is the key benefit of topic-based IPC: language boundaries are invisible to the rest of the system.
## Pre-Deployment Checklist

Before shipping Python nodes to production:

- All dependencies pinned in `requirements.txt` or `horus.toml`
- Virtual environment created and tested on target hardware
- systemd service file with `Restart=on-failure`
- `failure_policy` set on every node (not relying on defaults)
- `node.log_*()` used instead of `print()` for operational messages
- GC tuned or disabled for nodes with timing constraints
- Memory profiled under sustained load (run for hours, check RSS)
- `horus monitor` shows all nodes healthy under load
- Health-check endpoint accessible for external monitoring
- Shared memory cleaned before first deploy (`horus clean --shm`)
## Design Decisions

**Why venv instead of containers?** Containers add overhead (cgroup management, overlay filesystem, network namespacing) that hurts real-time performance. Shared memory IPC between containers requires explicit `--ipc=host` flags that defeat isolation. A virtual environment gives dependency isolation without the performance or IPC penalty. Use containers for CI/CD and development, not for production robots.

**Why systemd instead of a HORUS-native process manager?** systemd is battle-tested, ships with every Linux distribution, integrates with journald for logging, and supports cgroup resource limits. Building a custom process manager would duplicate all of this poorly. The HORUS scheduler manages node execution within a process; systemd manages processes within the system. Each tool does what it does best.

**Why not auto-detect which nodes need GC tuning?** Garbage collection impact depends on allocation patterns, object lifetimes, and timing requirements -- all application-specific. A node publishing pre-allocated IMU structs at 1000 Hz needs GC disabled. A node building detection lists at 10 Hz needs GC enabled. There is no heuristic that works for both. Explicit tuning by the developer is the only reliable approach.
## Trade-offs

**Python for ML vs compiled inference:** Python gives you the full PyTorch/ONNX/HuggingFace ecosystem. Compiled inference (ONNX Runtime C++, TensorRT) gives lower latency and no GIL. For most robotics workloads, Python inference at 10-30 Hz is fast enough. Rewrite when profiling shows that Python overhead (not model inference) is the bottleneck.

**Single process vs multi-process:** Running all Python nodes in one process (one `horus.run()` call) shares the GIL. Running each node as a separate process (separate systemd services) avoids GIL contention but uses more memory and loses in-process topic shortcuts. A single process is simpler to deploy. Multi-process scales better when you have CPU-bound Python nodes competing for the GIL.

**`gc.disable()` vs `gc.set_threshold()`:** Disabling GC eliminates pauses completely but risks memory leaks if you create circular references. Tuning thresholds reduces pause frequency without eliminating pauses entirely. For nodes with pre-allocated buffers and no circular references, disable. For nodes that build temporary data structures, tune thresholds. When in doubt, tune rather than disable -- a slow leak is easier to debug than a mysterious OOM after 48 hours.

**Pinned versions vs version ranges:** Pinned versions (`numpy==1.26.4`) guarantee reproducibility but require manual updates for security patches. Version ranges (`numpy>=1.26,<1.27`) allow patch updates but risk behavior changes. For production robots, pin everything. Run `pip install --upgrade` in CI, run your test suite, and pin the new versions explicitly.
## See Also
- Error Handling -- Exception types and failure policies
- Python Bindings -- Core Python API reference
- Async Nodes -- Async I/O nodes for HTTP and database
- ML Guide -- ML inference optimization