
Building a Real-Time Voice AI Sales Coach

June 1, 2025 · 5 min read

Real-time voice AI has a fundamental tension: natural conversation demands sub-second responses, but meaningful coaching requires deep analysis that takes 5-15 seconds. You can't have both in a single pipeline. So I built two.

The Problem

At Right at Home, sales representatives needed practice handling complex buyer objections — affordability concerns, care urgency, competitive comparisons. Traditional role-play training doesn't scale across 700+ franchise locations. I set out to build an AI-powered coaching agent that could simulate realistic buyer personas and deliver structured performance feedback.

The core constraint: a monolithic architecture, where one model handles both conversation and evaluation, can't meet acceptable latency. Audio round-trips through STT → LLM → TTS add 500ms+ of pipeline overhead before the model even starts reasoning.

Two-Brain Architecture

The solution separates real-time interaction from deep evaluation through an asynchronous air gap:

Brain 1 (The Actor) handles live conversation using Azure OpenAI's gpt-realtime model. This is audio-to-audio streaming — no intermediate speech-to-text or text-to-speech steps. The model receives raw audio and produces raw audio, cutting the traditional pipeline latency in half. LiveKit Agents SDK v1.3 provides the WebRTC transport layer, keeping glass-to-glass latency under 1000ms.

Brain 2 (The Grader) runs post-session. When a coaching call ends, Brain 1 uploads the full transcript to Azure Blob Storage. An Event Grid trigger fires an Azure Function that evaluates the transcript against rubric-based criteria using GPT-4o with Structured Outputs. The grader produces deterministic JSON scores, coaching summaries, and specific evidence citations — then delivers results via email.
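The request shape for that grading call can be sketched as follows. This is a hedged illustration, not the production schema: the rubric fields (`total_score`, `coaching_summary`, `evidence`) and the schema name are assumptions, but the `"strict": True` / `additionalProperties: False` requirements are what Structured Outputs strict mode actually demands.

```python
# Illustrative sketch of the Structured Outputs request body Brain 2
# sends to GPT-4o. Field names are hypothetical, not the real rubric.
grading_schema = {
    "type": "object",
    "properties": {
        "total_score": {"type": "integer", "minimum": 0, "maximum": 100},
        "coaching_summary": {"type": "string"},
        "evidence": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["total_score", "coaching_summary", "evidence"],
    # Strict mode requires closing the schema to extra keys.
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "coaching_grade",
        "strict": True,  # forces output to match the schema exactly
        "schema": grading_schema,
    },
}
```

With strict mode on, the model can't invent categories or drift outside the declared ranges, which is what makes the downstream Cosmos DB writes and email reports safe to automate.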

The "air gap" between them is just Azure Blob Storage. No shared state, no bidirectional communication, no latency coupling.

Dynamic Buyer Personas

Each training scenario is defined by a JSON persona configuration that controls both brains:
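The post doesn't reproduce the exact schema, so here's a minimal, hypothetical persona along those lines — field names and values are illustrative, chosen to match the concepts described below (voice, personality, opening line, call direction, call type):

```python
import json

# Hypothetical persona config — field names are illustrative.
persona_json = """
{
  "name": "budget_conscious_daughter",
  "voice": "alloy",
  "personality": "Warm but guarded; worried about her mother's savings.",
  "opening_line": "Hi, I'm calling to ask about care options for my mom.",
  "call_direction": "caller",
  "call_type": "isr"
}
"""

persona = json.loads(persona_json)
```

A persona with `call_direction: "caller"` pairs naturally with `call_type: "isr"` — the buyer places the call, so the rep is graded on the inside-sales rubric.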

The persona specifies the buyer's voice, personality, opening line, and — critically — their call direction. This matters because large language models have deep RLHF conditioning to behave as "helpful assistants." When you need the AI to play someone who received a phone call, that default behavior fights you.

I solved this with three-layer identity anchoring:

  1. Programmatic preamble — a call-direction statement prepended to the system prompt ("You RECEIVED this phone call")
  2. ChatContext priming — for caller personas, a synthetic greeting message exploits the model's RLHF training to produce caller-style speech patterns
  3. Lifecycle hook — the opening line delivered via on_enter() with per-response identity reinforcement

Across hundreds of test sessions, this three-layer approach reliably overrode the assistant-mode default.
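The three layers can be compressed into a sketch. The helper names and message shapes here are hypothetical (the real code goes through the LiveKit Agents API), but the mechanics match the list above: a direction preamble, a synthetic assistant-turn greeting for caller personas, and an `on_enter()` hook for the opening line.

```python
# Hypothetical sketch of the three identity-anchoring layers.
# Helper names and message dicts are illustrative, not the production API.

def build_system_prompt(persona: dict) -> str:
    # Layer 1: programmatic preamble — call direction stated up front.
    direction = ("You RECEIVED this phone call."
                 if persona["call_direction"] == "receiver"
                 else "You PLACED this phone call.")
    return f"{direction}\n\n{persona['personality']}"

def build_initial_context(persona: dict) -> list[dict]:
    # Layer 2: ChatContext priming — for caller personas, a synthetic
    # greeting puts the model in the caller's seat from turn one.
    if persona["call_direction"] == "caller":
        return [{"role": "assistant", "content": persona["opening_line"]}]
    return []

class CoachingAgent:
    def __init__(self, persona: dict):
        self.persona = persona

    async def on_enter(self):
        # Layer 3: lifecycle hook — the opening line is delivered when
        # the session starts (stubbed here; the real agent speaks it).
        return self.persona["opening_line"]
```

Each layer attacks the same failure mode from a different direction, which is why removing any one of them lets assistant-mode behavior leak back in.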

Dual-Rubric Grading

Right at Home uses fundamentally different evaluation criteria for inside sales (incoming calls) versus outside sales (outbound calls). Inside sales uses a 90-point rubric focused on responsiveness, empathy, and qualifying needs. Outside sales uses a 100-point rubric across preparation, discovery, value proposition, objection handling, and closing.

Rather than forcing both into a single schema (which would require messy allOf workarounds in JSON Schema strict mode), I implemented separate GPT-4o prompts and Structured Output schemas per call type. The persona's call_type field flows through the transcript into Brain 2, which selects the appropriate rubric.
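The routing itself is trivial once the rubrics are kept separate. A minimal sketch, assuming `isr`/`osr` as the `call_type` values and with placeholder prompt text (the real prompts and schemas are per-rubric, as described above):

```python
# Hedged sketch of per-call-type rubric routing in Brain 2.
# Max points mirror the article; prompt strings are placeholders.
RUBRICS = {
    "isr": {"max_points": 90,  "prompt": "Grade this inside-sales call..."},
    "osr": {"max_points": 100, "prompt": "Grade this outside-sales call..."},
}

def select_rubric(call_type: str) -> dict:
    # call_type flows from the persona config through the transcript.
    if call_type not in RUBRICS:
        raise ValueError(f"Unknown call_type: {call_type}")
    return RUBRICS[call_type]
```

Keeping two flat schemas instead of one unioned schema also keeps each Structured Outputs definition simple enough for strict mode to validate cleanly.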

One interesting challenge: the ISR rubric was designed for human mystery shoppers who can observe things like "phone answered within 3 rings." An AI grading a transcript can't evaluate that. The solution was auto-granting those points while evaluating observable behaviors — professional greeting, using the caller's name, demonstrating active listening — for the remaining criteria.

All scores normalize to 0-100 for cross-rubric analytics, so franchise operators can compare performance regardless of call type.
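The normalization is a plain proportion; a minimal version:

```python
def normalize_score(raw_points: float, max_points: int) -> float:
    # Map any rubric's raw score onto the shared 0-100 scale.
    return round(raw_points / max_points * 100, 1)
```

So 81/90 on the ISR rubric and 90/100 on the OSR rubric both report as 90.0, which is what lets operators compare across call types.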

Latency Budget

Real-time voice has zero margin for error. Here's how the 1000ms budget breaks down:

Component                                  Budget
WebRTC transport (UDP via LiveKit)         50-150ms
Voice Activity Detection (server-side)     200-300ms
LLM inference (audio-to-audio)             300-500ms
Agent logic (pure asyncio)                 <50ms

The agent loop is entirely non-blocking. Any synchronous I/O would blow the budget. Even persona loading uses functools.lru_cache to avoid disk reads after the first session.
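The persona-caching piece is a one-liner with `functools.lru_cache`. A sketch under assumptions — `PERSONA_DIR` is a placeholder path, and the loader returns raw JSON text so the cached value stays immutable:

```python
import functools
from pathlib import Path

# Placeholder location for persona config files.
PERSONA_DIR = Path("personas")

@functools.lru_cache(maxsize=None)
def load_persona(name: str) -> str:
    # First call per persona reads disk; every later call is a dict hit,
    # keeping the hot path free of synchronous file I/O.
    return (PERSONA_DIR / f"{name}.json").read_text()
```

Callers parse the cached text per session (`json.loads`) so no session can mutate another's persona.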

Everything runs in Azure East US 2 — the only region with gpt-realtime GA availability. Co-locating all services (OpenAI endpoints, Blob Storage, Cosmos DB, the agent runtime) in the same region shaves 20-40ms off each inter-service call.

What I Learned

Audio-to-audio is transformative. Bypassing the STT/TTS pipeline isn't just faster — it produces more natural conversation because the model processes prosody, hesitation, and emphasis directly from the audio signal.

Structured Outputs make grading deterministic. Without "strict": true on the JSON Schema, GPT-4o would occasionally invent new rubric categories or return scores outside valid ranges. Strict mode eliminated that entirely.

The air gap is the architecture. Separating real-time interaction from deep analysis isn't a compromise — it's the only design that satisfies both constraints simultaneously. Each brain operates at its natural timescale without degrading the other.

Identity anchoring requires defense in depth. No single technique reliably overrides RLHF conditioning. The three-layer approach (preamble + context priming + lifecycle hooks) is the minimum viable solution for persona fidelity.

Tech Stack

  • Real-time voice: Azure OpenAI gpt-realtime, LiveKit Agents SDK v1.3, Python 3.11+
  • Async grading: Azure Functions, GPT-4o Structured Outputs, Cosmos DB
  • Transport: WebRTC via LiveKit Cloud (UDP, not HTTP)
  • Storage: Azure Blob Storage (transcript air gap), Azure Cosmos DB (grading results)
  • Email: Azure Communication Services (transactional coaching reports)
  • Frontend: Next.js with TypeScript, LiveKit BarVisualizer + useVoiceAssistant hook

The full system represents about 80 hours of engineering, with Phase 4 (the grading pipeline) consuming 36 hours alone — mostly on multi-rubric restructuring and email delivery resilience.