Introduction #
In the rapidly evolving landscape of conversational AI, it is easy to get lost in the specifics of individual applications, whether that is a voice agent for healthcare or an appointment scheduler. For this case study, however, I chose to pivot from the application to the abstraction.
The voice interface is not the product — the orchestration pipeline behind it is. Understanding this architecture is what separates a prototype from a production-grade voice agent.
Rather than detailing a single use case, I am focusing on the fundamental architecture that powers the entire ecosystem. Why? Because beneath the surface, nearly every modern voice agent platform relies on the same core structural patterns. The distinction between a 'therapist bot' and a 'sales bot' often comes down to just three variable components: the System Prompt, the Tool Execution Layer, and the SaaS Wrapper.
Streaming Voicebots vs. Turn-Based Chatbots #
While Voicebots and Chatbots share the same "brain" (the LLM), their nervous systems are entirely different. To the end-user, the difference is just the medium. To an engineer, it is the difference between a discrete Request-Response cycle and a continuous Full-Duplex stream.
| Feature | Chatbot (Discrete) | Voicebot (Streaming) |
|---|---|---|
| Interaction | Turn-Based | Full-Duplex (Simultaneous) |
| Protocol | HTTP / REST | WebSockets / gRPC / SIP |
| Latency | 1-3s (Acceptable) | < 500ms (Critical) |
| Data Flow | User types → Wait → Reply | Continuous audio stream |
| State | Stateless (per request) | Persistent & Stateful |
For a voice agent to feel natural, the time to first byte of audio (TTFB) must be under 500ms. Human conversations have a natural turn-taking gap of roughly 200-300ms. Anything above 800ms feels like "thinking"; above 1.5s the conversation feels broken.
Core Components #
Every real-time voice agent is built from six core modules. Think of them as LEGO blocks — interchangeable, independently upgradable, but all essential to the final product.
1. Speech-to-Text (STT)
The Ears. The STT engine (e.g., Deepgram, Whisper, AssemblyAI) is responsible for converting raw audio streams into text. In a real-time architecture, batch transcription is insufficient. The engine must support streaming transcription via WebSockets, returning "interim results" (partial text) milliseconds after the user speaks. Because transcription runs while the user is still talking, the final transcript is ready almost as soon as they stop speaking.
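In code, the consuming side of a streaming STT feed looks roughly like the sketch below. The `stt_stream` object and its `is_final` / `text` fields are hypothetical stand-ins; real SDKs such as Deepgram or AssemblyAI expose the same interim-versus-final distinction under different names.

```python
# Minimal sketch of consuming streaming STT results.
# `stt_stream` is a hypothetical async iterator of partial transcripts.
async def consume_transcripts(stt_stream, on_final):
    partial_text = ""
    async for result in stt_stream:
        if result.is_final:
            # A complete utterance segment: hand it to the LLM stage
            await on_final(result.text)
            partial_text = ""
        else:
            # Interim result: useful for live captions and early turn-taking cues
            partial_text = result.text
```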
2. Large Language Model (LLM)
The Brain. The LLM receives the STT transcript and generates a response. For voice agents, the LLM must support streaming token output — emitting tokens one at a time rather than waiting for the full response. This is what enables the TTS to start speaking before the LLM has finished "thinking."
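As a concrete example, here is a minimal sketch of consuming a streaming LLM response using the openai Python SDK's v1-style streaming interface; the model name is just a placeholder, and the same pattern applies to any provider that streams tokens.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Yield tokens as they arrive so downstream TTS can start immediately.
async def stream_reply(history: list[dict]):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=history,
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```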
3. Text-to-Speech (TTS)
The Mouth. TTS engines like ElevenLabs, Deepgram, or Sarvam AI convert text into natural-sounding audio. Modern TTS systems accept streaming text input and produce streaming audio output — meaning they can start speaking the first sentence while the LLM is still generating the second.
4. Voice Activity Detection (VAD)
The Traffic Cop. VAD is the unsung hero of voice interfaces. Its sole job is to distinguish human speech from silence, background noise, or non-speech vocalizations (like heavy breathing). A sophisticated VAD algorithm doesn't just detect sound; it manages the turn-taking logic. It determines if a pause is a comma (keep listening) or a full stop (start processing), preventing the agent from constantly interrupting the user.
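As a rough sketch of that turn-taking logic: track how long the frame-level speech probability has stayed below a threshold, and only declare the turn finished after a configurable silence window. The numbers below are illustrative defaults, not tuned recommendations.

```python
# Illustrative end-of-turn detection on top of a frame-level VAD.
# `speech_prob` is the VAD's per-frame speech probability (e.g. from Silero VAD).
class TurnDetector:
    def __init__(self, silence_ms: int = 700, frame_ms: int = 20, threshold: float = 0.5):
        self.silence_frames_needed = silence_ms // frame_ms
        self.threshold = threshold
        self.silent_frames = 0

    def on_frame(self, speech_prob: float) -> bool:
        """Return True when the user's turn is considered finished."""
        if speech_prob >= self.threshold:
            self.silent_frames = 0   # still speaking: treat the pause as a "comma"
            return False
        self.silent_frames += 1
        # Enough continuous silence: treat it as a "full stop" and start processing
        return self.silent_frames >= self.silence_frames_needed
```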
5. The Orchestrator
The Conductor. The Orchestrator is the central nervous system that binds the components together. It is usually a custom backend service (Node.js/Python/Go) that manages the state of the conversation.
- Pipeline Management: It pipes the output of the STT directly into the LLM and the output of the LLM into the TTS.
- State Management: It handles conversation history, executes external tool calls (APIs), and manages authentication.
- Interrupt Handling: It listens for "Barge-In" events and instantly kills the audio stream if the user speaks.
```python
# Simplified orchestrator flow
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        # 1. Feed raw audio to the streaming STT engine
        transcript = await stt.transcribe(audio_chunk)
        if transcript.is_final:
            # 2. Stream the final transcript through the LLM
            async for token in llm.stream(transcript.text):
                # 3. Relay output to TTS (production systems buffer tokens
                #    into sentences first; see "Streaming Tokens" below)
                audio = await tts.synthesize(token)
                await websocket.send(audio)
```
6. RAG (Retrieval-Augmented Generation)
The Memory. While the LLM provides reasoning and language capabilities, it lacks specific knowledge about your business or user data. RAG bridges this gap by retrieving relevant context from a Vector Database (Pinecone, Qdrant, etc.) in real-time. Before the LLM generates a response, the system searches the knowledge base for semantic matches to the user's query and injects that information into the prompt, ensuring accuracy and reducing hallucinations.
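In pipeline terms, the RAG step is a retrieval call wedged between the final transcript and the LLM prompt. The sketch below assumes hypothetical `embed` and `vector_db` helpers; swap in your embedding model and vector store client (Pinecone, Qdrant, etc.).

```python
# Hedged sketch of the RAG step: retrieve context, then build the prompt.
# `embed` and `vector_db` are placeholders for your embedding model
# and vector store client.
async def build_prompt(user_query: str, system_prompt: str) -> list[dict]:
    query_vector = await embed(user_query)
    hits = await vector_db.search(query_vector, top_k=3)
    context = "\n".join(hit.text for hit in hits)
    return [
        {"role": "system", "content": f"{system_prompt}\n\nRelevant context:\n{context}"},
        {"role": "user", "content": user_query},
    ]
```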
The "Barge-In" Problem #
Barge-in is what happens when the user interrupts the agent mid-sentence. In a human conversation, this is natural — we do it constantly. But for a voice agent, it's an engineering nightmare.
When the user barges in, the system must simultaneously: stop TTS playback immediately, flush any queued audio from the buffer, capture the new user utterance via STT, and route the new transcript to the LLM with the correct context. All of this must happen within ~100ms, or the user will hear overlapping audio.
```python
# Barge-in handling
async def on_user_speech_detected():
    # 1. Immediately stop TTS playback and flush queued audio
    tts.stop()
    audio_buffer.flush()
    # 2. Cancel the in-flight LLM generation
    llm.cancel_current_stream()
    # 3. Record how much of the response was actually spoken,
    #    so the conversation history stays accurate
    partial = tts.get_spoken_text_so_far()
    conversation.add_partial_response(partial)
    # 4. Resume listening for the new utterance
    await stt.resume()
```
Echo cancellation adds another layer of complexity. The user's microphone picks up the agent's own voice coming from the speaker, creating a feedback loop. Without acoustic echo cancellation (AEC), the STT engine would transcribe the agent's own output as user speech, sending the conversation into an endless loop of the agent talking to itself.
Most modern browsers ship built-in AEC via WebRTC, which automatically filters out audio coming from the speakers. For server-side processing, reference-signal-based cancellation can be used to subtract the TTS output from the incoming microphone stream.
The Secret Sauce (Optimizations) #
Latency is the single most important metric for a voice agent. Every millisecond counts. Here are the key optimization techniques used in production systems:
1. Streaming Tokens #
Instead of waiting for the LLM to generate a complete response, tokens are relayed to TTS in real-time. To preserve natural intonation, tokens are typically buffered into complete sentences before being sent to the TTS engine, which lets it begin synthesizing audio for the first sentence while the LLM is still generating the next. This alone can cut TTFB by 60-70%, because audio generation happens in parallel with text generation.
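A common way to implement this is to buffer tokens until a sentence boundary appears, then flush that sentence to the TTS engine. The sketch below uses naive punctuation matching; production systems often use smarter clause detection.

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

# Buffer LLM tokens into sentences before handing them to TTS.
async def tokens_to_sentences(token_stream):
    buffer = ""
    async for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():   # flush whatever is left at the end of the stream
        yield buffer.strip()
```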
2. The Latency Budget #
To achieve a conversational feel, we must adhere to a strict latency budget. Human conversation typically accepts a gap of ~200-500ms. Anything above 800ms feels like "thinking," and above 1.5s feels disjointed. Here is the breakdown:
| Component | Target Latency |
|---|---|
| VAD | 50-100ms |
| STT | 100-200ms (interim) |
| LLM TTFB | 150-300ms |
| TTS TTFB | 80-150ms |
| Network (2-way) | 20-80ms (Variable) |
| TOTAL TTFB | 400-830ms |
3. Latency Masking #
The cleverest optimization: use filler audio ("Let me check that for you...", "Hmm...") or strategic pauses to mask backend processing time. The user perceives engagement rather than silence. Some systems even use the LLM to generate contextually appropriate filler phrases based on the user's intent, buying valuable seconds for the main model to compute.
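One way to wire this up: start the LLM call, and if the first token has not arrived within a small deadline, play a pre-recorded filler clip while generation continues in the background. The timings and helper names here are illustrative.

```python
import asyncio

# Illustrative latency masking: if the first LLM token takes too long,
# play a pre-recorded filler clip so the user never hears dead air.
async def respond_with_masking(get_first_token, play_audio, filler_clip, deadline_s=0.4):
    task = asyncio.create_task(get_first_token())
    done, _ = await asyncio.wait({task}, timeout=deadline_s)
    if not done:
        # Backend is still thinking: mask the silence with a filler phrase
        await play_audio(filler_clip)   # e.g. "Let me check that for you..."
    return await task
```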
4. Modular Architecture #
A monolithic approach fails at scale. Instead, we use a modular design where no specific model is hardcoded. By standardizing a common input and output schema for each component (STT, LLM, TTS), the pipeline ensures consistent operation regardless of the underlying engine. This abstraction grants users the flexibility to choose any model—swapping GPT-4 for Llama-3 or Deepgram for Whisper—making the system highly scalable and future-proof without rebuilding the entire architecture.
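In Python, that common schema can be expressed as small interfaces that every provider adapter must satisfy; the protocol names and method signatures below are one possible shape, not a prescribed standard.

```python
from typing import AsyncIterator, Protocol

# Common schemas: any provider adapter that satisfies these protocols
# can be dropped into the pipeline without touching the orchestrator.
class STTEngine(Protocol):
    def transcribe_stream(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]: ...

class LLMEngine(Protocol):
    def stream(self, prompt: str) -> AsyncIterator[str]: ...

class TTSEngine(Protocol):
    def synthesize_stream(self, text: AsyncIterator[str]) -> AsyncIterator[bytes]: ...

# The orchestrator depends only on the interfaces, not on vendors.
def build_pipeline(stt: STTEngine, llm: LLMEngine, tts: TTSEngine):
    ...
```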
5. Smart Caching (Semantic Router) #
The fastest inference is the one you don't run. In customer support, roughly 40% of queries are identical ("Reset my password", "Where is my order?"). By implementing a Semantic Cache, we embed the incoming transcript and search a vector database for similar past queries. If a match is found with high confidence (>0.95), we skip the LLM entirely and serve a pre-computed response, cutting latency from ~800ms to ~50ms.
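Stripped down, the cache check sits in front of the LLM call and falls through on a miss. `embed`, `cache_index`, and `llm` are placeholders for your embedding model, vector index, and model client.

```python
# Hedged sketch of a semantic cache lookup before invoking the LLM.
# 0.95 is the cosine-similarity confidence threshold from the text above.
async def answer(query: str) -> str:
    vector = await embed(query)
    match = await cache_index.nearest(vector)
    if match and match.score > 0.95:
        return match.cached_response         # ~50ms path, no LLM call
    response = await llm.complete(query)     # ~800ms path
    await cache_index.insert(vector, response)
    return response
```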
Use Cases #
From an architectural perspective, a Voice Agent is just a generic "audio-in, audio-out" pipeline. The core infrastructure—WebSockets, VAD, STT, LLM, TTS—does not care if it is a doctor, a dungeon master, or a flight attendant.
Once you have built the Core Engine, you can pivot to entirely new industries by changing just three components (sketched in code after the list below):
1. The System Prompt
The "personality and rules" injected into the LLM context.
2. The Tools
The specific APIs the agent is allowed to call (e.g., calendar_api).
3. The SaaS Wrapper
The UI/UX layer that surrounds the voice interaction.
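As a concrete (and hypothetical) illustration, the per-vertical configuration can be as small as a single structure, while the engine code stays untouched:

```python
# Hypothetical per-vertical configuration: only these fields change
# between a support bot, a language tutor, or a field copilot.
AGENT_CONFIG = {
    "system_prompt": "You are a customer support bot. Acknowledge frustration...",
    "tools": ["get_order_status", "initiate_refund", "schedule_technician"],
    "wrapper": {"ui": "web_widget", "show_progress_bar": True},
}
```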
By decoupling the Engine from the Implementation, we can build highly specific, high-value tools without rewriting the backend code. Here are three architectural configurations for use cases beyond simple scheduling.
Use Case A: The Customer Support Automator
Vertical: SaaS / Telco
Scenario: "Imagine a frustrated user whose delivery is late and they want to know where it is. They are angry, impatient, and want an update immediately."
Persona: Empathetic, decisive, efficient.
System Prompt Config
Instruction: "You are a customer support bot. Acknowledge frustration immediately. Pivot to providing the user with the update they need. If confidence is low, escalate to human."
Style: "Short sentences. Confirm understanding after every step."
Tools Provided
- ➜ get_order_status(order_id)
- ➜ initiate_refund(order_id)
- ➜ schedule_technician()
The Wrapper
A web widget that shows a real-time progress bar of the diagnostic tests while the voice speaks ("I'm checking on your order now...").
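For reference, a tool like get_order_status from the list above could be exposed to the LLM as an OpenAI-style function-calling schema; the exact field descriptions are illustrative.

```python
# Illustrative OpenAI-style function-calling schema for one support tool.
GET_ORDER_STATUS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status and ETA of an order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The customer's order ID."},
            },
            "required": ["order_id"],
        },
    },
}
```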
Use Case B: The Language Roleplay Partner
Vertical: EdTech
Scenario: "Imagine a user learning Spanish who wants to practice ordering food in a busy cafe without the embarrassment of failing in front of a real person."
Persona: Immersive, educational, patient.
System Prompt Config
Instruction: "You are a grumpy waiter at a cafe in Madrid. Speak only in Spanish. If the user makes a grammar mistake, ignore it unless it changes the meaning of the order."
Style: "Speak at 0.9x speed. Use simple vocabulary."
Tools Provided
- ➜ grammar_check(user_input)
- ➜ hint_generator()
The Wrapper
A gamified UI showing a "Confidence Meter" and live subtitles that blur out the words until the user hears them.
Use Case C: The Field Technician’s Copilot
Vertical: Industrial / Blue Collar
Scenario: "Imagine a wind turbine technician 300 feet in the air, wearing heavy gloves. They cannot type on a tablet, but they need technical data immediately."
Persona: Ultra-efficient, terse, safety-critical.
System Prompt Config
Instruction: "You are a senior technical field assistant. Do not use pleasantries. Be extremely concise. If a value is out of safety range, alert the user immediately."
Style: "Output structured data first, then explanations."
Tools Provided
- ➜ query_technical_manuals(component_id)
- ➜ log_incident_report(severity, description)
The Wrapper
A "Push-to-Talk" (Walkie-Talkie style) interface to prevent accidental triggering by loud machinery.
Want to see this in action?
I have built Velox AI, a comprehensive platform that implements this exact architecture—featuring low-latency voice capabilities, modular tools, and real-time streaming as discussed in this post.
Check out Velox AI