19 Architecture Decisions: Self-Hosting Multi-Agent Infrastructure

April 20, 2025 · 9 min read

Most multi-agent AI demos run in notebooks. Hive runs on a Beelink SER5 mini PC in my apartment, orchestrating specialized AI agents 24/7 with Docker sandboxing, zero-trust networking, encrypted storage, and a projected monthly cost of $10-29.

This post covers the architectural decisions that turned a weekend experiment into a production system — and why I wrote 19 formal Architecture Decision Records to keep it that way.

The Hardware

A Beelink SER5 with an AMD Ryzen 5 5500U (6 cores, 12 threads), 28GB usable RAM, and a 500GB NVMe SSD. Total cost: roughly $300. It runs Ubuntu Server 24.04 LTS headless, provisioned entirely through cloud-init so the setup is reproducible.

The SER5 has one notable limitation: no "Restore on AC Power Loss" option in its locked-down OEM BIOS. If the power goes out, someone has to physically press the button. A UPS with NUT monitoring handles graceful shutdown on battery low, and an RTC wake alarm at 3:00 AM provides a recovery window after brief outages.
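The shutdown-plus-wake sequence can be sketched with the standard util-linux `rtcwake` tool. This is a minimal sketch, assuming a GNU userland; how it hooks into NUT's shutdown path is an assumption, not the author's exact script:

```shell
#!/bin/sh
# Sketch: arm an RTC wake alarm for 3:00 AM before a UPS-triggered shutdown.
# Assumes GNU date and util-linux rtcwake; run as root.

# Next 3:00 AM as an epoch timestamp (naive: assumes shutdown happens
# before midnight; a real script would pick whichever 3:00 AM is next).
WAKE_AT=$(date -d 'tomorrow 03:00' +%s)

# Program the alarm without suspending now (-m no). The RTC alarm survives
# poweroff, so the box restarts itself if wall power has returned by then.
rtcwake -m no -t "$WAKE_AT"

# Then let the NUT-initiated graceful shutdown proceed.
shutdown -h now "UPS battery low"
```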

Agent Orchestration: Depth-2 Nesting

The system uses a modular domain team architecture with three layers:

Layer 0 — Orchestrator (main): Routes all Discord messages, spawns domain teams, handles cross-domain synthesis. Runs Gemini 3.1 Pro for complex reasoning. This is the only agent with elevated host exec capability (gated behind approval).

Layer 1 — Domain Leads: Specialized agents for research, wine analysis, startup evaluation (incrementally added). Each lead owns a Discord channel and can spawn workers. The research-lead was the first pilot, exercising most subsystems: Gemini API, Google Search grounding, worker spawning, and file-based artifact patterns.

Layer 2 — Workers: Ephemeral sub-agents spawned on-demand by team leads. They run the cheapest model (Gemini 3 Flash), have minimal tool access (no browser, no cron, no session management), and are archived after task completion.

This depth-2 pattern (orchestrator → leads → workers) allows new domains to be added without architectural changes. The expansion protocol is five steps: add agent, set permissions, write bootstrap docs, update orchestrator's allowed agent list, bind a Discord channel.

Six Layers of Security

Self-hosting AI agents that can execute code and browse the web requires defense in depth. Here's the security model:

Layer 1: Network Invisibility

The gateway binds to loopback only (127.0.0.1:18789). Zero ports are exposed to the public internet. All access flows through a Tailscale mesh network — encrypted WireGuard tunnels with MagicDNS hostname routing. External nmap scans confirm zero visible ports. The iptables INPUT policy is DROP on everything except loopback and the tailscale0 interface.
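The firewall posture above reduces to a handful of rules. A minimal sketch (run as root; the conntrack rule for established flows is an assumption, though any usable DROP policy needs something like it):

```shell
# Default-deny inbound; accept only loopback and the Tailscale interface.
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -i tailscale0 -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Verify from an outside host: no ports should be visible.
#   nmap -Pn -p- <public-ip>
```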

Layer 2: Secrets at Rest

No plaintext API keys on disk. Ever. All credentials (Gemini, Anthropic, Brave, ElevenLabs) live in a 1Password vault. At startup, unlock.sh runs op signin, extracts secrets, and writes them to /run/openclaw-credentials/.env — a tmpfs mount (RAM-only, mode 600). The systemd unit loads this environment file. When the machine reboots, the tmpfs is gone.
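The unlock flow can be sketched with the 1Password CLI. The vault and item names below are illustrative assumptions, not the real secret references:

```shell
#!/bin/sh
# Sketch of unlock.sh: resolve secrets from 1Password and write them to a
# tmpfs-backed env file. `op` is the 1Password CLI; paths are illustrative.
set -eu

CRED_DIR=/run/openclaw-credentials   # /run is tmpfs: RAM-only, gone on reboot
install -d -m 700 "$CRED_DIR"

# Establish a CLI session for the subsequent `op read` calls.
eval "$(op signin)"

# Write the env file with owner-only permissions (umask 177 -> mode 600).
umask 177
{
  printf 'GEMINI_API_KEY=%s\n'    "$(op read 'op://hive/gemini/credential')"
  printf 'ANTHROPIC_API_KEY=%s\n' "$(op read 'op://hive/anthropic/credential')"
} > "$CRED_DIR/.env"
```

The systemd unit can then point `EnvironmentFile=` at that path, so the gateway never sees the 1Password session itself.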

Layer 3: Full-Disk Encryption

LUKS full-disk encryption with TPM2 auto-unseal via Clevis. The TPM is bound to PCR 7 (Secure Boot policy), which remains stable across kernel updates since both the stock 6.8 and HWE 6.17 kernels are signed by the same Canonical UEFI key. This provides zero-touch boots — the disk unlocks automatically without console access — while keeping the data unreadable if the NVMe is physically removed. A manually entered passphrase, stored in 1Password, serves as a permanent fallback.
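The binding step is a single Clevis command. A sketch (the device path is illustrative; run as root):

```shell
# Bind an existing LUKS2 key slot to the TPM, sealed against PCR 7 only.
clevis luks bind -d /dev/nvme0n1p3 tpm2 '{"pcr_bank":"sha256","pcr_ids":"7"}'

# Confirm the binding, then rebuild the initramfs so boot-time auto-unlock
# picks it up.
clevis luks list -d /dev/nvme0n1p3
update-initramfs -u
```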

Layer 4: Secrets Injection via Exec Provider

Beyond environment variables, the gateway supports source: "exec" for credential resolution — running 1Password CLI commands at startup to resolve individual secrets. This is eager validation: if the 1Password session is unavailable, the gateway fails fast rather than degrading silently.

Layer 5: Per-Agent Docker Sandboxing

Every non-main agent runs in a Docker container with --cap-drop=ALL and --network=none. No outbound network access from inside the sandbox — web tools (browser, search) execute on the gateway host process, not inside the container. Each agent gets an isolated workspace mount at /workspace-<agentId> (read-write).

An important subtlety: a web tool appearing in an agent's allowed tool list doesn't mean the agent's container has network access. The gateway executes those tools on the host and returns results to the sandboxed agent. The container itself is fully air-gapped.
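The per-agent launch can be sketched as a `docker run` invocation. The image name, host path, and agent id below are illustrative; the flags mirror the isolation described above:

```shell
# Sketch of a per-agent sandbox launch: no Linux capabilities, no network
# interface at all, and exactly one read-write workspace mount.
AGENT_ID=research-lead

docker run -d --name "agent-$AGENT_ID" \
  --cap-drop=ALL \
  --network=none \
  -v "/srv/hive/workspace-$AGENT_ID:/workspace-$AGENT_ID:rw" \
  openclaw/agent:latest
```

With `--network=none` the container gets only a loopback interface, so even a fully compromised agent process has nowhere to send traffic.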

Layer 6: Webhook Hardening

The only public surface is a Tailscale Funnel endpoint for Gmail Pub/Sub webhooks. Defense is three-tiered: strong bearer token authentication at the gateway, iptables hashlimit (1 request/second per source IP, burst 10) on the Tailscale interface, and the receiving agent (ops) runs sandboxed with no network and no exec tools.
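The middle tier of that defense is one iptables rule using the hashlimit match. A sketch (the port is illustrative; run as root):

```shell
# Drop webhook traffic above 1 request/second per source IP, with an
# initial burst allowance of 10, on the Tailscale interface.
iptables -A INPUT -i tailscale0 -p tcp --dport 443 \
  -m hashlimit --hashlimit-name webhook \
  --hashlimit-above 1/second --hashlimit-burst 10 \
  --hashlimit-mode srcip -j DROP
```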

Cost Governance with LiteLLM

AI API costs can escalate quickly with multiple agents running 24/7. The system uses LiteLLM as a proxy between the gateway and model providers, with a clear separation of concerns:

  • OpenClaw (gateway) handles agent-level model selection and per-model timeouts
  • LiteLLM (proxy) handles cross-provider failover chains, semantic caching, and hard budget caps

The failover chain is critical for reliability. If Gemini returns a 429 rate limit error, LiteLLM automatically cascades to the next provider:

gemini-3.1-pro → gemini-2.5-pro → claude-sonnet-4-6
gemini-3-flash → gemini-2.5-flash → claude-sonnet-4-6

Budget enforcement uses LiteLLM virtual keys with monthly caps. At 80% spend, the orchestrator's observability cron job fires an alert. At 100%, the proxy returns 429 to the agent, triggering the fallback chain. Per-agent cost tracking is visible in the LiteLLM dashboard.
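Creating such a capped key goes through the LiteLLM proxy's admin API. A sketch, assuming the proxy's default port and illustrative budget amounts:

```shell
# Generate a per-agent virtual key with a hard monthly spending cap.
# Host, master key, alias, and budget are illustrative.
curl -s http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "research-lead",
        "max_budget": 15.0,
        "budget_duration": "30d"
      }'
```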

During Google Cloud's free trial ($300 credits over 90 days), the system runs at ~$4-13/month. After credits expire, projected cost is $10-29/month — dominated by Gemini 3.1 Pro token usage for the orchestrator and research lead.

Hybrid Memory: BM25 + Vector Search

Each agent maintains its own memory index using QMD — a hybrid search backend that combines BM25 text matching (weight 0.3) with vector semantic search (weight 0.7). Results are re-ranked using Maximal Marginal Relevance to eliminate near-duplicates, and a 30-day temporal decay half-life ensures recent context is weighted appropriately.

The embedding model runs locally as a ~600MB GGUF file via node-llama-cpp. Zero API cost for memory operations. The index updates every 5 minutes with debouncing.

Each agent's memory has a clear hierarchy:

  • MEMORY.md — Durable lessons learned (never temporally decayed)
  • memory/YYYY-MM-DD.md — Daily pre-compaction flush (ephemeral context)
  • memory/reviews/YYYY-WXX.md — Weekly self-assessment summaries
  • memory/worker-results/ — File-based worker output artifacts

The file-based worker output pattern deserves special mention. Early testing revealed that Gemini 3.1 Pro drops earlier worker context when later workers finish — a context accumulation bug. The fix: workers write results to files, and the team lead reads all files before synthesizing. Simple, reliable, and no framework dependency.
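The pattern is almost embarrassingly simple, which is the point. A sketch with illustrative paths and contents:

```shell
#!/bin/sh
# Sketch of the file-based worker output pattern: each worker writes one
# artifact file, and the lead reads them all in a single pass afterwards,
# so no worker's output has to survive in the model's context window.
set -eu
RESULTS=memory/worker-results
mkdir -p "$RESULTS"

# Each worker writes exactly one result file, then is archived.
echo "worker-1: findings on topic A" > "$RESULTS/worker-1.md"
echo "worker-2: findings on topic B" > "$RESULTS/worker-2.md"

# The team lead synthesizes from disk, not from accumulated chat context.
cat "$RESULTS"/*.md
```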

Why 19 ADRs?

Architecture Decision Records aren't overhead — they're the difference between a system you can maintain and one you can't. Each ADR documents the decision, the alternatives considered, the trade-offs accepted, and the verification criteria.

Some highlights:

ADR-002 (QMD over Qdrant): Self-hosted hybrid search with zero API cost vs. a managed vector database. QMD's BM25+vector combination with MMR re-ranking and temporal decay met all requirements without adding an external dependency.

ADR-012 (Standard Docker, not rootless): For a single-user, single-purpose machine, rootless Docker adds complexity without meaningful security benefit. The hive user already has access to everything critical. --cap-drop=ALL on containers is the right isolation boundary.

ADR-015 (API Keys over Subscription): Multi-provider pay-per-token (Gemini + Anthropic fallback) is cheaper than a flat-rate Claude Pro subscription for this usage pattern. LiteLLM makes provider switching transparent.

ADR-016 (Adaptive Self-Improvement): Five integrated patterns, all using native primitives — no custom code. Reflection prompting (agents write lessons when corrected), suggestion surfacing (orchestrator scans for tool gaps), weekly self-assessment (cron-triggered reviews), config self-modification with trust tiers (autonomous for low-impact changes, ask-first for structural ones), and cross-domain knowledge sharing via sessions_send.

Automation

Four recurring cron jobs keep the system self-maintaining:

Job                      Schedule           Purpose
Morning Briefing         Daily 7:02 AM      Summary posted to Discord #general
Daily Ops Log            Daily 9:00 PM      Structured log of API spend, changes, memory health
Weekly Self-Assessment   Sunday 10:00 AM    Reviews own performance, surfaces gaps
System Health Check      Every 6 hours      Reads host monitoring report, alerts on anomalies

The health check flow is intentionally decoupled: a host-level shell script (hive-monitor.sh) runs every 30 minutes via system cron, writing structured metrics to a file. The AI agent reads that file every 6 hours via the gateway's read tool. The agent never gets elevated host access for monitoring — it reads a pre-generated report.
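The host-side half of that flow can be sketched as a few lines of shell. The field names and report path are assumptions, not the real hive-monitor.sh:

```shell
#!/bin/sh
# Sketch: write structured host metrics to a report file that the sandboxed
# agent later reads with its file tool. Assumes a Linux host.
set -eu
REPORT=./hive-monitor-report.txt   # real setup would use a fixed host path

{
  echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "disk_used_pct=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')"
  echo "load_1m=$(awk '{print $1}' /proc/loadavg)"
} > "$REPORT"
```

Installed via system cron (`*/30 * * * * /usr/local/bin/hive-monitor.sh`), the script needs no cooperation from the agent at all; the agent's only privilege is reading one file.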

What Makes It Production

The difference between a demo and a production system is usually boring: restart resilience, secret rotation, cost alerts, log retention, runbooks. Hive has:

  • Automated startup verification via a Makefile target that checks Tailscale, LUKS, firewall, disk, UPS, Docker, LiteLLM, gateway, agents, and secrets in sequence
  • IaC provisioning via cloud-init — the entire OS configuration is version-controlled and reproducible
  • Boot runbook in docs/BOOT.md covering the 7-step post-reboot verification sequence
  • 90+ completed tasks across 7 implementation phases with explicit verification gates
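The verification target can be imagined as a sequence of fail-fast checks. This is a hedged sketch, not the actual Makefile: the gateway `/health` endpoint, UPS name, and container name are assumptions:

```shell
#!/bin/sh
# Sketch of a startup verification pass a `make verify` target might run.
set -e
tailscale status > /dev/null                   && echo "ok: tailscale"
lsblk -o NAME,TYPE | grep -q crypt             && echo "ok: luks mapping"
iptables -S INPUT | grep -q -- '-P INPUT DROP' && echo "ok: firewall"
upsc ups@localhost ups.status > /dev/null      && echo "ok: ups"
docker ps --format '{{.Names}}' | grep -q litellm && echo "ok: litellm"
curl -fsS http://127.0.0.1:18789/health > /dev/null && echo "ok: gateway"
```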

Tech Stack

  • Hardware: Beelink SER5 (AMD Ryzen 5 5500U, 28GB RAM, 500GB NVMe)
  • OS: Ubuntu Server 24.04 LTS (headless, cloud-init provisioned)
  • Agent Framework: OpenClaw (gateway + agents + cron + memory + sandboxing)
  • Models: Gemini 3.1 Pro (orchestrator), Gemini 3 Flash (workers), Claude Sonnet 4.6 (fallback)
  • Proxy: LiteLLM + Redis + PostgreSQL (Docker Compose)
  • Networking: Tailscale (WireGuard mesh, MagicDNS, Funnel for webhooks)
  • Encryption: LUKS + Clevis-TPM2 (PCR 7 binding)
  • Secrets: 1Password CLI → tmpfs injection
  • Memory: QMD (BM25 + vector, local GGUF embeddings, MMR re-ranking)
  • Monitoring: NUT (UPS), lm-sensors, custom hive-monitor.sh
  • Interface: Discord (channel bindings per agent)