# Production Deployment (Python)
Your Python HORUS nodes work on your laptop. Now they need to run 24/7 on a robot with no keyboard, no monitor, and nobody watching. This page covers virtual environments, dependency pinning, systemd services, logging, monitoring, garbage collection tuning, memory profiling, and the decision of what stays in Python versus what gets rewritten.
## Virtual Environment Setup

Always isolate HORUS Python nodes in a virtual environment. System Python packages drift between OS updates and break silently.

```bash
# Create a dedicated venv for your project
python3 -m venv /opt/myrobot/venv

# Activate and install horus
source /opt/myrobot/venv/bin/activate
pip install maturin
cd /path/to/horus/horus_py
maturin develop --release

# Install your project dependencies
pip install -r requirements.txt
```
### venv in horus.toml Projects

If you are using `horus.toml` for project management, HORUS generates a `.horus/pyproject.toml` from your manifest. The venv still works -- install the generated project after activation:

```bash
source /opt/myrobot/venv/bin/activate
cd /path/to/your/project
horus build          # generates .horus/pyproject.toml from horus.toml
pip install -e .horus/
```
## Dependency Pinning

Pin every dependency version. An unpinned numpy upgrade at 3 AM will crash your robot at 3:01 AM.

### requirements.txt

```text
numpy==1.26.4
opencv-python-headless==4.9.0.80
torch==2.2.1+cpu
onnxruntime==1.17.1
scipy==1.12.0
```

Generate from your working environment:

```bash
pip freeze > requirements.txt
```
### horus.toml

For HORUS-managed projects, pin in the manifest:

```toml
[dependencies]
numpy = { version = "1.26.4", source = "pypi" }
opencv-python-headless = { version = "4.9.0.80", source = "pypi" }
torch = { version = "2.2.1+cpu", source = "pypi" }
```

HORUS generates `horus.lock` (lockfile v3) with exact resolved versions for reproducible installs across machines.
### CPU-Only PyTorch

Production robots rarely have datacenter GPUs. Use the CPU-only torch build to save 2 GB of disk and avoid CUDA driver version mismatches:

```bash
pip install torch==2.2.1+cpu --index-url https://download.pytorch.org/whl/cpu
```
## systemd Service Files

Run HORUS Python nodes as systemd services for automatic restart, logging, and boot-time startup.
### Basic Service

```ini
# /etc/systemd/system/horus-myrobot.service
[Unit]
Description=HORUS MyRobot Nodes
After=network.target

[Service]
Type=simple
User=robot
Group=robot
WorkingDirectory=/opt/myrobot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u main.py
Restart=on-failure
RestartSec=3
StandardOutput=journal
StandardError=journal

# Shared memory access
SupplementaryGroups=

# Real-time scheduling (optional)
LimitMEMLOCK=infinity
LimitRTPRIO=99

[Install]
WantedBy=multi-user.target
```
### Key Settings

| Setting | Value | Why |
|---|---|---|
| `Type=simple` | Required | HORUS blocks on `horus.run()` |
| `User=robot` | Dedicated user | Never run as root in production |
| `-u` flag on Python | Required | Unbuffered output so journald gets logs immediately |
| `Restart=on-failure` | Auto-restart | systemd restarts if the process exits non-zero |
| `RestartSec=3` | 3-second delay | Prevents restart loops from burning CPU |
| `LimitMEMLOCK=infinity` | For RT nodes | Allows memory locking to prevent page faults |
| `LimitRTPRIO=99` | For RT nodes | Allows real-time scheduling priority |
### Enable and Start

```bash
sudo systemctl daemon-reload
sudo systemctl enable horus-myrobot.service
sudo systemctl start horus-myrobot.service

# Check status
sudo systemctl status horus-myrobot.service

# View logs
journalctl -u horus-myrobot.service -f
```
### Multi-Node Service with Separate Processes

For process isolation, run each node as its own service:

```ini
# /etc/systemd/system/horus-camera.service
[Unit]
Description=HORUS Camera Node
After=network.target

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/camera_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target
```

```ini
# /etc/systemd/system/horus-planner.service
[Unit]
Description=HORUS Planner Node
After=horus-camera.service

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/planner_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target
```
Use `After=` to express startup order between nodes. Use a shared `.target` to start/stop the entire robot stack as one unit.
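A shared target for the services above might look like this (the unit name matches the `WantedBy=horus-myrobot.target` lines in the service files; treat this as a sketch, not a canonical HORUS file):

```ini
# /etc/systemd/system/horus-myrobot.target
[Unit]
Description=HORUS MyRobot Stack

[Install]
WantedBy=multi-user.target
```

After enabling the individual services, `sudo systemctl start horus-myrobot.target` brings up the whole stack, and `sudo systemctl stop horus-myrobot.target` takes it down as one unit.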
## Log Collection

HORUS nodes produce two streams of logs: structured logs from `node.log_*()` calls and standard output from `print()`.
### horus logs

The `horus logs` CLI command reads the structured log stream:

```bash
# Follow logs from all running nodes
horus logs -f

# Filter by node name
horus logs -f --node camera

# Filter by level
horus logs -f --level warning
```
### node.log_* Output

Inside your tick function, use the structured logging methods:

```python
def tick(node):
    node.log_info("Frame processed")
    node.log_warning("Latency spike: 12ms")
    node.log_error("Motor timeout")
    node.log_debug("Raw encoder: 4821")
```

These go through the scheduler's logging pipeline, tagged with the node name and timestamp. They appear in `horus logs` and, when running under systemd, in the journal.
### journald Integration

When running as a systemd service, all output (structured logs and print statements) goes to the journal:

```bash
# Live follow
journalctl -u horus-myrobot.service -f

# Last 100 lines
journalctl -u horus-myrobot.service -n 100

# Since last boot
journalctl -u horus-myrobot.service -b

# Export for analysis
journalctl -u horus-myrobot.service --output=json > logs.json
```
### Log Rotation

journald handles rotation automatically. For long-running deployments, configure retention:

```ini
# /etc/systemd/journald.conf.d/horus.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7d
```
## Performance Tuning

### What Stays in Python

Python is the right choice for nodes that are I/O-bound, compute-heavy-but-batchable, or change frequently:

| Node type | Why Python works | Typical rate |
|---|---|---|
| ML inference | PyTorch/ONNX ecosystem, GPU offload | 10-30 Hz |
| Data logging | I/O-bound (disk, database, network) | 1-10 Hz |
| Path planning | scipy/numpy, `compute=True` offloads to thread pool | 1-10 Hz |
| Visualization | matplotlib, OpenCV display | 1-30 Hz |
| HTTP/API integration | aiohttp, async nodes handle I/O naturally | 0.1-10 Hz |
| Prototyping | Fast iteration, no compile step | Any |
### When to Rewrite a Python Node in Another Language

Rewrite a node when Python becomes the bottleneck, not before. Profile first:

| Signal | What it means | Action |
|---|---|---|
| `tick()` consistently exceeds budget | CPU-bound work is too slow | Profile, optimize, then rewrite the hot path |
| Deadline misses under load | GIL contention or GC pauses | Try `gc.disable()`, then rewrite if still missing |
| Memory growing unbounded | Python object overhead | Profile with `tracemalloc`, rewrite if unfixable |
| Latency jitter >1 ms at >100 Hz | Python overhead is inherent | Rewrite -- Python cannot do sub-ms deterministic ticks |

The practical threshold: if your node needs deterministic ticks above 100 Hz, or sub-millisecond jitter, rewrite it. Below that, Python is fine.
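Profiling that decision can start with something as small as measuring the wake-up jitter of a plain Python loop at your target rate. A minimal sketch (pure stdlib, not the HORUS scheduler; the function name and rates are illustrative):

```python
import statistics
import time

def measure_tick_jitter(rate_hz=200, n_ticks=100):
    """Measure how far a fixed-rate sleep loop drifts from its deadlines."""
    period = 1.0 / rate_hz
    deadline = time.perf_counter() + period
    errors = []
    for _ in range(n_ticks):
        remaining = deadline - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        # Distance from the deadline (late or early) is the per-tick jitter
        errors.append(abs(time.perf_counter() - deadline))
        deadline += period
    return {
        "mean_jitter_us": statistics.mean(errors) * 1e6,
        "max_jitter_us": max(errors) * 1e6,
    }

stats = measure_tick_jitter()
print(stats)
```

If the max jitter already exceeds your budget with an empty loop body, no amount of optimizing the tick body will help.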
## Monitoring Python Nodes

### horus monitor

The `horus monitor` command shows a live dashboard of all running nodes:

```bash
horus monitor
```

This shows per-node tick rate, budget usage, deadline misses, error counts, and health state.
### Programmatic Monitoring

Use the Scheduler API to query stats from within your code:

```python
import horus

sched = horus.Scheduler(tick_rate=1000, rt=True)
sched.add(sensor_node)
sched.add(planner_node)

# Start in background or check after running
stats = sched.get_node_stats("sensor")
print(f"Total ticks: {stats['total_ticks']}")
print(f"Errors: {stats['errors_count']}")
```
### safety_stats()

For safety-critical deployments, query the safety monitor:

```python
safety = sched.safety_stats()
if safety:
    print(f"Watchdog: {safety}")
    # Returns a dict with watchdog stats, deadline misses, health states
```
### Health Checks

Build a health-check endpoint for external monitoring (Prometheus, Grafana, fleet manager):

```python
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

import horus

sched = None  # set to the running Scheduler at startup

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stats = {}
        for name in ["camera", "planner", "motor"]:
            stats[name] = sched.get_node_stats(name)
        healthy = all(s.get("errors_count", 0) == 0 for s in stats.values())
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(stats).encode())

    def log_message(self, format, *args):
        pass  # Suppress access logs

def start_health_server():
    server = HTTPServer(("0.0.0.0", 8080), HealthHandler)
    server.serve_forever()
```

Run the health server in a background thread or as a separate async node.
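Since `serve_forever()` blocks, the background-thread variant keeps the main thread free for the scheduler. A self-contained sketch (`PingHandler` is a stand-in for a real health handler, and port 0 lets the OS pick a free port for the demo):

```python
import json
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())

    def log_message(self, format, *args):
        pass  # Suppress access logs

server = HTTPServer(("127.0.0.1", 0), PingHandler)
# daemon=True: the thread dies with the main process instead of blocking shutdown
threading.Thread(target=server.serve_forever, daemon=True).start()

# The main thread is now free for the scheduler (horus.run() in production)
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    print(resp.status)  # 200
server.shutdown()
```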
## Garbage Collection Tuning

Python's garbage collector introduces non-deterministic pauses. For nodes with timing constraints, tune or disable it.

### Disable GC for Real-Time Nodes

If your node has a tight budget (sub-10ms) and allocates few objects per tick, disable GC entirely:

```python
import gc
import horus

def init(node):
    gc.disable()
    node.log_info("GC disabled for RT node")

def tick(node):
    # Pre-allocated buffers only -- no new objects per tick
    cmd = node.recv("cmd_vel")
    if cmd:
        apply_command(cmd)

motor = horus.Node(
    name="motor",
    tick=tick,
    init=init,
    rate=1000,
    subs=["cmd_vel"],
    failure_policy="fatal",
)
```
**Requirement:** When GC is disabled, you must not create circular references. Use pre-allocated buffers, avoid closures that capture `self`, and avoid building data structures in `tick()`. If you leak memory with GC disabled, it is never reclaimed.
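The difference between the two patterns fits in a few lines -- a sketch contrasting an in-place buffer with a reference cycle that refcounting alone can never free (names are illustrative):

```python
import gc

gc.disable()  # as in the RT-node init above

# Safe with GC off: allocate once, overwrite in place every tick
buffer = bytearray(1024)

def write_tick(data: bytes) -> int:
    n = min(len(data), len(buffer))
    buffer[:n] = data[:n]  # no new objects created
    return n

print(write_tick(b"hello"))  # 5

# Unsafe with GC off: a reference cycle is invisible to refcounting
class Link:
    def __init__(self):
        self.peer = None

a, b = Link(), Link()
a.peer, b.peer = b, a
del a, b  # with GC disabled, these objects would leak forever

collected = gc.collect()  # a one-off manual collect still works
print(collected >= 2)  # the cycle was found and freed
gc.enable()
```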
### Tune GC Thresholds for Other Nodes

For nodes that allocate objects (ML inference, data processing), tune the thresholds instead of disabling:

```python
import gc

def init(node):
    # Default thresholds: (700, 10, 10)
    # Raise the gen0 threshold to reduce collection frequency
    gc.set_threshold(1500, 15, 15)
    node.log_info(f"GC thresholds: {gc.get_threshold()}")
```

Higher thresholds mean fewer GC pauses but higher peak memory usage. Measure both latency and memory for your workload.
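The latency half of that measurement is easy to automate with `gc.callbacks`, which the interpreter invokes at the start and stop of every collection. A sketch:

```python
import gc
import time

pauses_ms = []
_start = [0.0]

def _gc_timer(phase, info):
    # phase is "start" or "stop"; info includes the generation collected
    if phase == "start":
        _start[0] = time.perf_counter()
    else:
        pauses_ms.append((time.perf_counter() - _start[0]) * 1e3)

gc.callbacks.append(_gc_timer)

keep = []
for i in range(50_000):
    keep.append({"value": [i]})  # net allocations trigger gen0 collections

gc.collect()  # guarantee at least one timed collection
gc.callbacks.remove(_gc_timer)

print(f"{len(pauses_ms)} collections, worst pause {max(pauses_ms):.3f} ms")
```

Re-run with different thresholds and compare the collection count and worst pause against your tick budget.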
### Manual GC Between Ticks

For the best control, disable automatic GC and trigger collection manually during idle periods:

```python
import gc
import horus

gc.disable()

def tick(node):
    if not node.has_msg("camera.rgb"):
        # No frame to process -- a good time to collect
        gc.collect(generation=0)  # Only gen0, fast (~100 us)
        return
    frame = node.recv("camera.rgb")
    detect(frame)
```
## Memory Profiling

### tracemalloc for Leak Detection

Python nodes running for days can leak memory through accumulating references. Use `tracemalloc` to find the source:

```python
import tracemalloc
import horus

tracemalloc.start(10)  # Keep 10 frames of traceback

tick_count = 0
baseline = None

def tick(node):
    global tick_count, baseline
    tick_count += 1

    # Normal work
    process_data(node)

    # Snapshot every 10000 ticks
    if tick_count % 10000 == 0:
        snapshot = tracemalloc.take_snapshot()
        if baseline is None:
            baseline = snapshot
        else:
            stats = snapshot.compare_to(baseline, "lineno")
            for stat in stats[:5]:
                node.log_warning(f"Memory growth: {stat}")
```
### What to Look For

| Pattern | Likely cause | Fix |
|---|---|---|
| Steady growth in one file/line | List or dict accumulating entries | Cap size or use `collections.deque(maxlen=N)` |
| Growth in `node.recv()` calls | Holding references to old messages | Process and discard, do not store |
| Growth in `json.loads()` | String interning or dict caching | Use msgpack or typed messages instead |
| Growth in a third-party library | Library-internal caching | Check library docs for cache control |
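For the first pattern, the `collections.deque(maxlen=N)` fix looks like this (a cap of 100 chosen for illustration):

```python
from collections import deque

# A bounded history: the oldest entry is evicted automatically at the cap,
# so memory stays flat no matter how many ticks run
history = deque(maxlen=100)

for i in range(1_000):
    history.append(i)

print(len(history))             # 100
print(history[0], history[-1])  # 900 999
```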
### Resource Monitoring

Monitor system resources from within a node:

```python
import os
import resource

import horus

def monitor_tick(node):
    # RSS (Resident Set Size) in MB
    usage = resource.getrusage(resource.RUSAGE_SELF)
    rss_mb = usage.ru_maxrss / 1024  # Linux reports ru_maxrss in KB

    node.send("diagnostics.memory", {
        "rss_mb": rss_mb,
        "pid": os.getpid(),
    })

    if rss_mb > 500:
        node.log_warning(f"High memory: {rss_mb:.0f} MB")

monitor = horus.Node(
    name="resource_monitor",
    tick=monitor_tick,
    rate=1,
    pubs=["diagnostics.memory"],
)
```
## Mixed Deployments

The most effective production architectures combine Python and other HORUS-supported languages. Each language handles what it does best, communicating through zero-copy shared memory topics.

### Typical Architecture

```text
Camera Driver (high-freq, safety)  ──→ camera.rgb topic
ML Inference (Python, PyTorch)     ←── camera.rgb topic
                                   ──→ detections topic
Path Planner (Python, scipy)       ←── detections topic
                                   ──→ path topic
Motor Controller (high-freq, RT)   ←── path topic
                                   ──→ motor.status topic
Safety Monitor (high-freq, RT)     ←── motor.status topic
```
Safety-critical nodes (camera driver, motor controller, safety monitor) benefit from compiled languages. Python handles ML inference and path planning where ecosystem libraries matter more than tick latency.
### Running Together

Each process runs independently. They communicate through HORUS topics over shared memory:

```bash
# Terminal 1: compiled safety-critical nodes
horus run safety_stack

# Terminal 2: Python ML nodes
source /opt/myrobot/venv/bin/activate
python ml_nodes.py

# Terminal 3: Python planner
source /opt/myrobot/venv/bin/activate
python planner.py
```

Or use systemd to manage all processes:

```ini
# /etc/systemd/system/horus-safety.service
[Service]
ExecStart=/usr/local/bin/horus run safety_stack

# /etc/systemd/system/horus-ml.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u ml_nodes.py

# /etc/systemd/system/horus-planner.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u planner.py
```
### The Handoff Pattern

When a Python prototype node gets promoted to production, the topic interface stays the same. Only the implementation changes:

```python
# Python prototype (runs at 30 Hz, good enough for testing)
def planner_tick(node):
    scan = node.recv("lidar.scan")
    if scan:
        path = compute_path(scan)  # scipy A*
        node.send("path", path)
```
The compiled replacement subscribes to the same topics and publishes the same messages. No other node needs to change. This is the key benefit of topic-based IPC: language boundaries are invisible to the rest of the system.
## Pre-Deployment Checklist

Before shipping Python nodes to production:

- All dependencies pinned in `requirements.txt` or `horus.toml`
- Virtual environment created and tested on target hardware
- systemd service file with `Restart=on-failure`
- `failure_policy` set on every node (not relying on defaults)
- `node.log_*()` used instead of `print()` for operational messages
- GC tuned or disabled for nodes with timing constraints
- Memory profiled under sustained load (run for hours, check RSS)
- `horus monitor` shows all nodes healthy under load
- Health-check endpoint accessible for external monitoring
- Shared memory cleaned before first deploy (`horus clean --shm`)
## Design Decisions

**Why venv instead of containers?** Containers add overhead (cgroup management, overlay filesystem, network namespacing) that hurts real-time performance. Shared memory IPC between containers requires explicit `--ipc=host` flags that defeat isolation. A virtual environment gives dependency isolation without the performance or IPC penalty. Use containers for CI/CD and development, not for production robots.

**Why systemd instead of a HORUS-native process manager?** systemd is battle-tested, ships with every Linux distribution, integrates with journald for logging, and supports cgroup resource limits. Building a custom process manager would duplicate all of this poorly. The HORUS scheduler manages node execution within a process; systemd manages processes within the system. Each tool does what it does best.

**Why not auto-detect which nodes need GC tuning?** Garbage collection impact depends on allocation patterns, object lifetimes, and timing requirements -- all application-specific. A node publishing pre-allocated IMU structs at 1000 Hz needs GC disabled. A node building detection lists at 10 Hz needs GC enabled. There is no heuristic that works for both. Explicit tuning by the developer is the only reliable approach.
## Trade-offs

**Python for ML vs compiled inference:** Python gives you the full PyTorch/ONNX/HuggingFace ecosystem. Compiled inference (ONNX Runtime C++, TensorRT) gives lower latency and no GIL. For most robotics workloads, Python inference at 10-30 Hz is fast enough. Rewrite when profiling shows that Python overhead (not model inference) is the bottleneck.

**Single process vs multi-process:** Running all Python nodes in one process (one `horus.run()` call) shares the GIL. Running each node as a separate process (separate systemd services) avoids GIL contention but uses more memory and loses in-process topic shortcuts. A single process is simpler to deploy. Multi-process scales better when you have CPU-bound Python nodes competing for the GIL.

**`gc.disable()` vs `gc.set_threshold()`:** Disabling GC eliminates pauses completely but risks memory leaks if you create circular references. Tuning thresholds reduces pause frequency without eliminating pauses entirely. For nodes with pre-allocated buffers and no circular references, disable. For nodes that build temporary data structures, tune thresholds. When in doubt, tune rather than disable -- a slow leak is easier to debug than a mysterious OOM after 48 hours.

**Pinned versions vs version ranges:** Pinned versions (`numpy==1.26.4`) guarantee reproducibility but require manual updates for security patches. Version ranges (`numpy>=1.26,<1.27`) allow patch updates but risk behavior changes. For production robots, pin everything. Run `pip install --upgrade` in CI, run your test suite, and pin the new versions explicitly.
## See Also
- Error Handling -- Exception types and failure policies
- Python Bindings -- Core Python API reference
- Async Nodes -- Async I/O nodes for HTTP and database
- ML Guide -- ML inference optimization