Building a Real-Time Voice AI Sales Coach

Real-time voice AI has a fundamental tension: natural conversation demands sub-second responses, but rubric-based coaching takes 5-15 seconds per transcript. You can't have both in a single pipeline. So I built two.

The Problem

At Right at Home, sales representatives needed practice handling complex buyer objections: affordability concerns, care urgency, competitive comparisons. Traditional role-play training doesn't scale across 350+ franchise locations. I set out to build an AI coaching agent that could simulate realistic buyer personas and deliver structured performance feedback.

The core constraint is a latency budget that grading cannot fit inside. A monolithic pipeline that handles conversation and evaluation in-line can't hold sub-second latency, because each audio round-trip through STT, LLM, and TTS adds 500ms+ of pipeline overhead before the model even starts reasoning.

Two-Brain Architecture

The solution separates real-time interaction from deep evaluation through an asynchronous air gap:

Brain 1 (The Actor) handles live conversation using Azure OpenAI's gpt-realtime model over a LiveKit Agents SDK v1.5.9 WebRTC transport, keeping glass-to-glass latency under 1000ms. In the original design this was audio-to-audio streaming: the model received raw audio and produced raw audio with no intermediate speech-to-text or text-to-speech steps, which cut the traditional pipeline latency in half. (The Half-Cascade Evolution section below covers why Brain 1 later switched to text-only output with a dedicated TTS engine.)

Brain 2 (The Grader) runs post-session. When a coaching call ends, Brain 1 uploads the full transcript to Azure Blob Storage. An Event Grid trigger fires an Azure Function that evaluates the transcript against rubric-based criteria using o4-mini with Structured Outputs. The grader produces deterministic JSON scores, coaching summaries, and specific evidence citations, then delivers results via email.

The "air gap" between them is just Azure Blob Storage. No shared state, no bidirectional communication, no latency coupling.

flowchart LR
  accTitle: Two-brain voice coaching architecture
  accDescr: Brain 1 runs Azure OpenAI gpt-realtime over WebRTC for sub-second conversation. The transcript is written to Azure Blob Storage as an air gap, which triggers an Azure Function that grades it asynchronously with o4-mini Structured Outputs against the ISR or OSR rubric, then stores results in Cosmos DB and emails a coaching report.
  rep(["Sales rep"])
  subgraph brain1["Brain 1: real-time, under 1000ms"]
    b1["gpt-realtime over WebRTC"]
  end
  rep <-->|"WebRTC audio"| b1
  b1 -->|"transcript JSON"| blob[("Azure Blob Storage (air gap)")]
  blob -->|"Event Grid trigger"| fn["Azure Function"]
  subgraph brain2["Brain 2: async grading"]
    fn --> grader["o4-mini Structured Outputs"]
    grader --> rubric{"ISR 90-pt / OSR 100-pt"}
  end
  rubric --> cosmos[("Cosmos DB")]
  rubric --> email["Emailed coaching report"]

Dynamic Buyer Personas

Each training scenario is defined by a JSON persona configuration that controls both brains.

The persona specifies the buyer's voice, personality, opening line, and, critically, their call direction. Call direction matters because large language models are conditioned by RLHF to behave as "helpful assistants." When you need the AI to play someone who received a phone call, that default behavior fights you.

I solved this with three-layer identity anchoring:

Programmatic preamble, a call-direction statement prepended to the system prompt ("You RECEIVED this phone call")
ChatContext priming, for caller personas, a synthetic greeting message exploits the model's RLHF training to produce caller-style speech patterns
Lifecycle hook, the opening line delivered via on_enter() with per-response identity reinforcement

This three-layer approach overcomes the assistant-mode default reliably across hundreds of test sessions.

Dual-Rubric Grading

Right at Home evaluates inside sales (incoming calls) and outside sales (outbound calls) on different criteria. Inside sales uses a 90-point rubric focused on responsiveness, empathy, and qualifying needs. Outside sales uses a 100-point rubric across preparation, discovery, value proposition, objection handling, and closing.

Rather than forcing both into a single schema (which would require messy allOf workarounds in JSON Schema strict mode), I implemented separate o4-mini prompts and Structured Output schemas per call type. The persona's call_type field flows through the transcript into Brain 2, which selects the appropriate rubric.

One constraint surfaced during calibration: the inside-sales rubric was written for human mystery shoppers who can observe things like "phone answered within 3 rings." An AI grading a transcript can't evaluate that. The fix was to auto-grant those unobservable points and score the model only on what a transcript reveals: a professional greeting, using the caller's name, demonstrating active listening.

All scores normalize to 0-100 for cross-rubric analytics, so franchise operators can compare performance regardless of call type.

Latency Budget

Real-time voice has zero margin for error. Here's how the 1000ms budget breaks down:

Component	Budget
WebRTC transport (UDP via LiveKit)	50-150ms
Voice Activity Detection (server-side)	200-300ms
LLM inference (audio-to-audio)	300-500ms
Agent logic (pure asyncio)	<50ms

The agent loop is entirely non-blocking. Any synchronous I/O would blow the budget. Even persona loading uses functools.lru_cache to avoid disk reads after the first session.

Everything runs in Azure East US 2. Co-locating the OpenAI endpoints, Blob Storage, Cosmos DB, and the agent runtime in one region keeps inter-service latency low on the real-time path.

What I Learned

Audio-to-audio matters. Bypassing the STT/TTS pipeline isn't just faster; the model processes prosody, hesitation, and emphasis directly from the audio signal, which makes the conversation more natural. (See the Half-Cascade Evolution section below for the catch.)

Structured Outputs make grading deterministic. Without "strict": true on the JSON Schema, o4-mini would occasionally invent new rubric categories or return scores outside valid ranges. Strict mode eliminated that entirely.

The air gap is the architecture. Separating real-time interaction from deep analysis isn't a compromise; it's the only design that satisfies both constraints at once. Each brain operates at its natural timescale without degrading the other.

Identity anchoring requires defense in depth. No single technique reliably overrides RLHF conditioning. The three-layer approach (preamble + context priming + lifecycle hooks) was the minimum that held persona fidelity in testing.

Half-Cascade Evolution (May 2026)

The two-brain split worked for latency and grading. It did not solve a problem we only caught later: voice gender drift.

Listeners reported the assistant's voice wobbling between gender presentations mid-conversation, even though each persona was pinned to a single voice ID (shimmer for Sarah, for example). Telemetry confirmed the configured voice_id stayed fixed throughout each session. The drift was originating inside gpt-realtime's audio synthesis layer, downstream of every config knob the application controlled. No prompt-side fix could touch it, because the model itself decides the final waveform.

LiveKit documents a half-cascade pattern for exactly this class of problem. The realtime model emits text only (modalities=["text"]) and a separate TTS engine synthesizes the audio. With a deterministic voice ID, voice gender cannot drift by construction.

Decision (ADR-020):

Switch RealtimeModel.with_azure(...) to modalities=["text"] so gpt-realtime emits text only.
Wire livekit-plugins-azure azure.TTS into the agent session with per-persona DragonHD voice IDs (en-US-Ava:DragonHDLatestNeural, en-US-Andrew:DragonHDLatestNeural).
Auth Azure Speech via Microsoft Entra managed identity using the documented aad#{resourceId}#{aadToken} wrapper, so no plaintext Speech key lives anywhere.
Retire the volume-boost helpers (about 412 lines including tests). DragonHD emits at standard amplitude, and there is no gpt-realtime audio path left to scale.

Measured outcomes pre-merge on staging:

Q1/Q9 invariants held across 6/6 persona sessions. Voice gender held for every session.
Q8 TTS TTFB direct measurement: p50 ~237ms, p95 ~316ms. Well under the 1000ms glass-to-glass HARD gate.
Per-session cost dropped roughly 98% (Azure TTS at $15/1M chars versus gpt-realtime audio output at $0.20/1K tokens).

Trade-offs the half-cascade introduces:

The TTS hop adds back about 150-300ms. Steady-state headroom is still 150-450ms; first-turn variance is tighter because the plugin has no pre-connect pooling.
Azure Speech becomes a runtime dependency (deployed in the same region and managed-identity scope as the agent).
Non-Omni DragonHD voices support only the temperature knob; full <mstts:express-as> style support is reserved for Omni voices.

The deeper lesson sits one level beneath the latency-vs-intelligence split: when a single model owns both intent and synthesis, voice traits can drift in ways no config knob fixes. Half-cascade puts each concern in its own component, so a model upgrade can't regress voice and a voice upgrade can't regress turn handling.

Tech Stack

Real-time reasoning: Azure OpenAI gpt-realtime-1.5 (text-only modality after ADR-020), Python 3.11+
Voice synthesis: Azure Speech DragonHD TTS (per-persona deterministic voice IDs)
Auth: Microsoft Entra managed identity (aad# wrapper, no plaintext Speech key)
Agent framework: LiveKit Agents SDK v1.5.9 with livekit-plugins-azure
Async grading: Azure Functions, o4-mini Structured Outputs, Cosmos DB
Transport: WebRTC via LiveKit Cloud (UDP, not HTTP)
Storage: Azure Blob Storage (transcript air gap), Azure Cosmos DB (grading results)
Email: Azure Communication Services (transactional coaching reports)
Frontend: Next.js with TypeScript, LiveKit BarVisualizer + useVoiceAssistant hook

The system spans 20 Architecture Decision Records and 199+ pytest tests with rubric calibration validation, from the two-brain split through the half-cascade migration.