19 Architecture Decisions: Self-Hosting Multi-Agent Infrastructure

April 20, 2025 · 9 min read

Most multi-agent AI demos run in notebooks. Hive runs on a Beelink SER5 mini PC in my apartment, orchestrating specialized AI agents 24/7 with Docker sandboxing, zero-trust networking, encrypted storage, and a projected monthly cost of $10-29.

This post covers the architectural decisions that turned a weekend experiment into a production system — and why I wrote 19 formal Architecture Decision Records to keep it that way.

The Hardware

A Beelink SER5 with an AMD Ryzen 5 5500U (6 cores, 12 threads), 28GB usable RAM, and a 500GB NVMe SSD. Total cost: roughly $300. It runs Ubuntu Server 24.04 LTS headless, provisioned entirely through cloud-init so the setup is reproducible.

The SER5 has one notable limitation: no "Restore on AC Power Loss" option in its locked-down OEM BIOS. If the power goes out, someone has to physically press the button. A UPS with NUT monitoring handles graceful shutdown on battery low, and an RTC wake alarm at 3:00 AM provides a recovery window after brief outages.
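The shutdown-plus-wake sequence can be sketched with the standard util-linux `rtcwake` tool. This is a minimal sketch, assuming a GNU userland; how it hooks into NUT's shutdown path is an assumption, not the author's exact script:

```shell
#!/bin/sh
# Sketch: arm an RTC wake alarm for 3:00 AM before a UPS-triggered shutdown.
# Assumes GNU date and util-linux rtcwake; run as root.

# Next 3:00 AM as an epoch timestamp (naive: assumes shutdown happens
# before midnight; a real script would pick whichever 3:00 AM is next).
WAKE_AT=$(date -d 'tomorrow 03:00' +%s)

# Program the alarm without suspending now (-m no). The RTC alarm survives
# poweroff, so the box restarts itself if wall power has returned by then.
rtcwake -m no -t "$WAKE_AT"

# Then let the NUT-initiated graceful shutdown proceed.
shutdown -h now "UPS battery low"
```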

Agent Orchestration: Depth-2 Nesting

The system uses a modular domain team architecture with three layers:

Layer 0 — Orchestrator (main): Routes all Discord messages, spawns domain teams, handles cross-domain synthesis. Runs Gemini 3.1 Pro for complex reasoning. This is the only agent with elevated host exec capability (gated behind approval).

Layer 1 — Domain Leads: Specialized agents for research, wine analysis, startup evaluation (incrementally added). Each lead owns a Discord channel and can spawn workers. The research-lead was the first pilot, exercising most subsystems: Gemini API, Google Search grounding, worker spawning, and file-based artifact patterns.

Layer 2 — Workers: Ephemeral sub-agents spawned on-demand by team leads. They run the cheapest model (Gemini 3 Flash), have minimal tool access (no browser, no cron, no session management), and are archived after task completion.

This depth-2 pattern (orchestrator → leads → workers) allows new domains to be added without architectural changes. The expansion protocol is five steps: add agent, set permissions, write bootstrap docs, update orchestrator's allowed agent list, bind a Discord channel.

Six Layers of Security

Self-hosting AI agents that can execute code and browse the web requires defense in depth. Here's the security model:

Layer 1: Network Invisibility

The gateway binds to loopback only (127.0.0.1:18789). Zero ports are exposed to the public internet. All access flows through a Tailscale mesh network — encrypted WireGuard tunnels with MagicDNS hostname routing. External nmap scans confirm zero visible ports. The iptables INPUT policy is DROP on everything except loopback and the tailscale0 interface.
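The firewall posture above reduces to a handful of rules. A minimal sketch (run as root; the conntrack rule for established flows is an assumption, though any usable DROP policy needs something like it):

```shell
# Default-deny inbound; accept only loopback and the Tailscale interface.
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -i tailscale0 -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Verify from an outside host: no ports should be visible.
#   nmap -Pn -p- <public-ip>
```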

Layer 2: Secrets at Rest

No plaintext API keys on disk. Ever. All credentials (Gemini, Anthropic, Brave, ElevenLabs) live in a 1Password vault. At startup, unlock.sh runs op signin, extracts secrets, and writes them to /run/openclaw-credentials/.env — a tmpfs mount (RAM-only, mode 600). The systemd unit loads this environment file. When the machine reboots, the tmpfs is gone.
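The unlock flow can be sketched with the 1Password CLI. The vault and item names below are illustrative assumptions, not the real secret references:

```shell
#!/bin/sh
# Sketch of unlock.sh: resolve secrets from 1Password and write them to a
# tmpfs-backed env file. `op` is the 1Password CLI; paths are illustrative.
set -eu

CRED_DIR=/run/openclaw-credentials   # /run is tmpfs: RAM-only, gone on reboot
install -d -m 700 "$CRED_DIR"

# Establish a CLI session for the subsequent `op read` calls.
eval "$(op signin)"

# Write the env file with owner-only permissions (umask 177 -> mode 600).
umask 177
{
  printf 'GEMINI_API_KEY=%s\n'    "$(op read 'op://hive/gemini/credential')"
  printf 'ANTHROPIC_API_KEY=%s\n' "$(op read 'op://hive/anthropic/credential')"
} > "$CRED_DIR/.env"
```

The systemd unit can then point `EnvironmentFile=` at that path, so the gateway never sees the 1Password session itself.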

Layer 3: Full-Disk Encryption

LUKS full-disk encryption with TPM2 auto-unseal via Clevis. The TPM is bound to PCR 7 (Secure Boot policy), which remains stable across kernel updates since both the stock 6.8 and HWE 6.17 kernels are signed by the same Canonical UEFI key. This provides zero-touch boots — the disk unlocks automatically without console access — while keeping the data unreadable if the NVMe is physically removed. A manually entered passphrase, stored in 1Password, serves as a permanent fallback.
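The binding step is a single Clevis command. A sketch (the device path is illustrative; run as root):

```shell
# Bind an existing LUKS2 key slot to the TPM, sealed against PCR 7 only.
clevis luks bind -d /dev/nvme0n1p3 tpm2 '{"pcr_bank":"sha256","pcr_ids":"7"}'

# Confirm the binding, then rebuild the initramfs so boot-time auto-unlock
# picks it up.
clevis luks list -d /dev/nvme0n1p3
update-initramfs -u
```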

Layer 4: Secrets Injection via Exec Provider

Beyond environment variables, the gateway supports source: "exec" for credential resolution — running 1Password CLI commands at startup to resolve individual secrets. This is eager validation: if the 1Password session is unavailable, the gateway fails fast rather than degrading silently.

Layer 5: Per-Agent Docker Sandboxing

Every non-main agent runs in a Docker container with --cap-drop=ALL and --network=none. No outbound network access from inside the sandbox — web tools (browser, search) execute on the gateway host process, not inside the container. Each agent gets an isolated workspace mount at /workspace-<agentId> (read-write).

An important subtlety: a web tool appearing in an agent's allowed tool list doesn't mean the agent's container has network access. The gateway executes those tools on the host and returns results to the sandboxed agent. The container itself is fully air-gapped.
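The per-agent launch can be sketched as a `docker run` invocation. The image name, host path, and agent id below are illustrative; the flags mirror the isolation described above:

```shell
# Sketch of a per-agent sandbox launch: no Linux capabilities, no network
# interface at all, and exactly one read-write workspace mount.
AGENT_ID=research-lead

docker run -d --name "agent-$AGENT_ID" \
  --cap-drop=ALL \
  --network=none \
  -v "/srv/hive/workspace-$AGENT_ID:/workspace-$AGENT_ID:rw" \
  openclaw/agent:latest
```

With `--network=none` the container gets only a loopback interface, so even a fully compromised agent process has nowhere to send traffic.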

Layer 6: Webhook Hardening

The only public surface is a Tailscale Funnel endpoint for Gmail Pub/Sub webhooks. Defense is three-tiered: strong bearer token authentication at the gateway, iptables hashlimit (1 request/second per source IP, burst 10) on the Tailscale interface, and the receiving agent (ops) runs sandboxed with no network and no exec tools.
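The middle tier of that defense is one iptables rule using the hashlimit match. A sketch (the port is illustrative; run as root):

```shell
# Drop webhook traffic above 1 request/second per source IP, with an
# initial burst allowance of 10, on the Tailscale interface.
iptables -A INPUT -i tailscale0 -p tcp --dport 443 \
  -m hashlimit --hashlimit-name webhook \
  --hashlimit-above 1/second --hashlimit-burst 10 \
  --hashlimit-mode srcip -j DROP
```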

Cost Governance with LiteLLM

AI API costs can escalate quickly with multiple agents running 24/7. The system uses LiteLLM as a proxy between the gateway and model providers, with a clear separation of concerns:

  • OpenClaw (gateway) handles agent-level model selection and per-model timeouts
  • LiteLLM (proxy) handles cross-provider failover chains, semantic caching, and hard budget caps

The failover chain is critical for reliability. If Gemini returns a 429 rate limit error, LiteLLM automatically cascades to the next provider:

gemini-3.1-pro → gemini-2.5-pro → claude-sonnet-4-6
gemini-3-flash → gemini-2.5-flash → claude-sonnet-4-6

Budget enforcement uses LiteLLM virtual keys with monthly caps. At 80% spend, the orchestrator's observability cron job fires an alert. At 100%, the proxy returns 429 to the agent, triggering the fallback chain. Per-agent cost tracking is visible in the LiteLLM dashboard.
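Creating such a capped key goes through the LiteLLM proxy's admin API. A sketch, assuming the proxy's default port and illustrative budget amounts:

```shell
# Generate a per-agent virtual key with a hard monthly spending cap.
# Host, master key, alias, and budget are illustrative.
curl -s http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "research-lead",
        "max_budget": 15.0,
        "budget_duration": "30d"
      }'
```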

During Google Cloud's free trial ($300 credits over 90 days), the system runs at ~$4-13/month. After credits expire, projected cost is $10-29/month — dominated by Gemini 3.1 Pro token usage for the orchestrator and research lead.

Hybrid Memory: BM25 + Vector Search

Each agent maintains its own memory index using QMD — a hybrid search backend that combines BM25 text matching (weight 0.3) with vector semantic search (weight 0.7). Results are re-ranked using Maximal Marginal Relevance to eliminate near-duplicates, and a 30-day temporal decay half-life ensures recent context is weighted appropriately.

The embedding model runs locally as a ~600MB GGUF file via node-llama-cpp. Zero API cost for memory operations. The index updates every 5 minutes with debouncing.

Each agent's memory has a clear hierarchy:

  • MEMORY.md — Durable lessons learned (never temporally decayed)
  • memory/YYYY-MM-DD.md — Daily pre-compaction flush (ephemeral context)
  • memory/reviews/YYYY-WXX.md — Weekly self-assessment summaries
  • memory/worker-results/ — File-based worker output artifacts

The file-based worker output pattern deserves special mention. Early testing revealed that Gemini 3.1 Pro drops earlier worker context when later workers finish — a context accumulation bug. The fix: workers write results to files, and the team lead reads all files before synthesizing. Simple, reliable, and no framework dependency.
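The pattern is almost embarrassingly simple, which is the point. A sketch with illustrative paths and contents:

```shell
#!/bin/sh
# Sketch of the file-based worker output pattern: each worker writes one
# artifact file, and the lead reads them all in a single pass afterwards,
# so no worker's output has to survive in the model's context window.
set -eu
RESULTS=memory/worker-results
mkdir -p "$RESULTS"

# Each worker writes exactly one result file, then is archived.
echo "worker-1: findings on topic A" > "$RESULTS/worker-1.md"
echo "worker-2: findings on topic B" > "$RESULTS/worker-2.md"

# The team lead synthesizes from disk, not from accumulated chat context.
cat "$RESULTS"/*.md
```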

Why 19 ADRs?

Architecture Decision Records aren't overhead — they're the difference between a system you can maintain and one you can't. Each ADR documents the decision, the alternatives considered, the trade-offs accepted, and the verification criteria.

Some highlights:

ADR-002 (QMD over Qdrant): Self-hosted hybrid search with zero API cost vs. a managed vector database. QMD's BM25+vector combination with MMR re-ranking and temporal decay met all requirements without adding an external dependency.

ADR-012 (Standard Docker, not rootless): For a single-user, single-purpose machine, rootless Docker adds complexity without meaningful security benefit. The hive user already has access to everything critical. --cap-drop=ALL on containers is the right isolation boundary.

ADR-015 (API Keys over Subscription): Multi-provider pay-per-token (Gemini + Anthropic fallback) is cheaper than a flat-rate Claude Pro subscription for this usage pattern. LiteLLM makes provider switching transparent.

ADR-016 (Adaptive Self-Improvement): Five integrated patterns, all using native primitives — no custom code. Reflection prompting (agents write lessons when corrected), suggestion surfacing (orchestrator scans for tool gaps), weekly self-assessment (cron-triggered reviews), config self-modification with trust tiers (autonomous for low-impact changes, ask-first for structural ones), and cross-domain knowledge sharing via sessions_send.

Automation

Four recurring cron jobs keep the system self-maintaining:

Job                      Schedule           Purpose
Morning Briefing         Daily 7:02 AM      Summary posted to Discord #general
Daily Ops Log            Daily 9:00 PM      Structured log of API spend, changes, memory health
Weekly Self-Assessment   Sunday 10:00 AM    Reviews own performance, surfaces gaps
System Health Check      Every 6 hours      Reads host monitoring report, alerts on anomalies

The health check flow is intentionally decoupled: a host-level shell script (hive-monitor.sh) runs every 30 minutes via system cron, writing structured metrics to a file. The AI agent reads that file every 6 hours via the gateway's read tool. The agent never gets elevated host access for monitoring — it reads a pre-generated report.
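The host-side half of that flow can be sketched as a few lines of shell. The field names and report path are assumptions, not the real hive-monitor.sh:

```shell
#!/bin/sh
# Sketch: write structured host metrics to a report file that the sandboxed
# agent later reads with its file tool. Assumes a Linux host.
set -eu
REPORT=./hive-monitor-report.txt   # real setup would use a fixed host path

{
  echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "disk_used_pct=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')"
  echo "load_1m=$(awk '{print $1}' /proc/loadavg)"
} > "$REPORT"
```

Installed via system cron (`*/30 * * * * /usr/local/bin/hive-monitor.sh`), the script needs no cooperation from the agent at all; the agent's only privilege is reading one file.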

What Makes It Production

The difference between a demo and a production system is usually boring: restart resilience, secret rotation, cost alerts, log retention, runbooks. Hive has:

  • Automated startup verification via a Makefile target that checks Tailscale, LUKS, firewall, disk, UPS, Docker, LiteLLM, gateway, agents, and secrets in sequence
  • IaC provisioning via cloud-init — the entire OS configuration is version-controlled and reproducible
  • Boot runbook in docs/BOOT.md covering the 7-step post-reboot verification sequence
  • 90+ completed tasks across 7 implementation phases with explicit verification gates
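The verification target can be imagined as a sequence of fail-fast checks. This is a hedged sketch, not the actual Makefile: the gateway `/health` endpoint, UPS name, and container name are assumptions:

```shell
#!/bin/sh
# Sketch of a startup verification pass a `make verify` target might run.
set -e
tailscale status > /dev/null                   && echo "ok: tailscale"
lsblk -o NAME,TYPE | grep -q crypt             && echo "ok: luks mapping"
iptables -S INPUT | grep -q -- '-P INPUT DROP' && echo "ok: firewall"
upsc ups@localhost ups.status > /dev/null      && echo "ok: ups"
docker ps --format '{{.Names}}' | grep -q litellm && echo "ok: litellm"
curl -fsS http://127.0.0.1:18789/health > /dev/null && echo "ok: gateway"
```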

Tech Stack

  • Hardware: Beelink SER5 (AMD Ryzen 5 5500U, 28GB RAM, 500GB NVMe)
  • OS: Ubuntu Server 24.04 LTS (headless, cloud-init provisioned)
  • Agent Framework: OpenClaw (gateway + agents + cron + memory + sandboxing)
  • Models: Gemini 3.1 Pro (orchestrator), Gemini 3 Flash (workers), Claude Sonnet 4.6 (fallback)
  • Proxy: LiteLLM + Redis + PostgreSQL (Docker Compose)
  • Networking: Tailscale (WireGuard mesh, MagicDNS, Funnel for webhooks)
  • Encryption: LUKS + Clevis-TPM2 (PCR 7 binding)
  • Secrets: 1Password CLI → tmpfs injection
  • Memory: QMD (BM25 + vector, local GGUF embeddings, MMR re-ranking)
  • Monitoring: NUT (UPS), lm-sensors, custom hive-monitor.sh
  • Interface: Discord (channel bindings per agent)