Voice AI sounds simple: transcribe the speech, send it to an LLM, synthesise a response. In practice, building voice agents that feel natural to talk to is full of hard problems that don't exist in text interfaces.
Here's what we learned building Switchboard's voice capabilities.
The Latency Budget
Humans notice delays of more than roughly 200-300 milliseconds in conversation. That's your total budget: from when the user stops speaking to when they hear a response.
Break that down:
- Detecting speech end: 50-100ms
- Transcription: 100-200ms
- LLM inference: 200-500ms (or more)
- Text-to-speech: 100-200ms
- Network round trips: Variable
Add those up and you're already over budget. This is why streaming everything is non-negotiable - you can't wait for complete responses before starting the next step.
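The arithmetic is worth making concrete. Summing the midpoints of the ranges above (illustrative figures, not measurements of any particular stack):

```python
# Illustrative per-stage latencies in milliseconds, taken as midpoints
# of the ranges above. Real numbers vary by model, hardware, and network.
stages = {
    "speech_end_detection": 75,
    "transcription": 150,
    "llm_inference": 350,
    "text_to_speech": 150,
}

# Running the stages strictly one after another:
sequential_total = sum(stages.values())
print(sequential_total)  # 725 ms - far beyond a 200-300 ms budget
```

Even before adding network round trips, a sequential pipeline is two to three times over budget, which is what forces the streaming design below.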
Our Approach
We pipeline everything:
- Audio streams to transcription while the user is still speaking
- Partial transcripts stream to the LLM before the user finishes
- LLM responses stream to TTS as tokens generate
- Audio streams back while the response is still generating
The user hears the beginning of the response while the LLM is still generating the end. This feels responsive even when total processing time is high.
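The pipelined flow can be sketched with async generators, where each stage lazily consumes the previous one and yields partial results as soon as they exist. The stage functions here are hypothetical stand-ins, not a real transcription or TTS API:

```python
import asyncio

# Each stage is an async generator: it yields partial output while its
# input is still arriving. These are toy stand-ins for real services.

async def transcribe(audio_chunks):
    async for chunk in audio_chunks:
        yield f"partial transcript of {chunk}"   # streams while user speaks

async def generate(transcripts):
    async for text in transcripts:
        yield f"token for {text}"                # tokens before input completes

async def synthesise(tokens):
    async for token in tokens:
        yield f"audio for {token}"               # TTS before generation finishes

async def pipeline(audio_chunks):
    # Stages are composed lazily, so the first audio frame is ready
    # long before the last LLM token is generated.
    async for frame in synthesise(generate(transcribe(audio_chunks))):
        yield frame

async def main():
    async def mic():
        for i in range(3):
            yield f"chunk{i}"
    return [frame async for frame in pipeline(mic())]

frames = asyncio.run(main())
print(frames[0])  # audio for token for partial transcript of chunk0
```

The composition `synthesise(generate(transcribe(...)))` mirrors the four bullets above: nothing waits for a complete upstream result.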
The Interruption Problem
Humans interrupt each other. It's natural conversational behaviour. A voice agent that can't handle interruptions feels robotic and frustrating.
But interruption handling is complex:
- Detection: Is this an interruption or just background noise?
- Stopping: Halt TTS playback immediately
- Context: What was the user trying to say?
- Recovery: Continue naturally from the interruption
Barge-In Detection
We continuously monitor for user speech while the agent is talking. When detected:
- Stop TTS playback within milliseconds
- Cancel any pending audio chunks
- Capture what the user is saying
- Update the conversation context
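The four steps above can be sketched as a single locked operation. The player and chunk-queue interfaces here are hypothetical stand-ins, not Switchboard's real API:

```python
import threading

class FakePlayer:
    """Minimal stand-in for an audio playback interface."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class BargeInHandler:
    def __init__(self, player, pending_chunks, history):
        self.player = player
        self.pending = pending_chunks
        self.history = history
        self.lock = threading.Lock()  # barge-in races against playback

    def on_user_speech(self, partial_transcript):
        with self.lock:
            self.player.stop()        # halt TTS playback immediately
            self.pending.clear()      # cancel any pending audio chunks
            self.history.append({     # update the conversation context
                "role": "user",
                "text": partial_transcript,
                "interrupted_agent": True,
            })

player = FakePlayer()
pending = ["chunk-3", "chunk-4"]
history = []
BargeInHandler(player, pending, history).on_user_speech("actually, hold on")
```

Doing all four steps under one lock matters: if playback stops but stale chunks stay queued, the agent resumes mid-sentence after the interruption.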
The challenge is distinguishing intentional interruption from:
- Background noise
- The user's "uh-huh" acknowledgments
- Audio feedback loops (the user's mic picking up the agent)
We use a combination of voice activity detection, volume thresholds, and timing heuristics. It's not perfect, but it handles the common cases well.
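A minimal sketch of that heuristic combination, with thresholds that are illustrative assumptions rather than production values:

```python
# Short backchannel acknowledgments should not stop the agent.
BACKCHANNEL_WORDS = {"uh-huh", "mm-hmm", "yeah", "right", "ok"}

def is_intentional_interruption(vad_speech: bool,
                                volume_db: float,
                                duration_ms: int,
                                transcript: str) -> bool:
    if not vad_speech:
        return False          # VAD says this isn't speech at all
    if volume_db < -35.0:
        return False          # too quiet: likely background noise
    if duration_ms < 250:
        return False          # too short: likely a click or pop
    if transcript.strip().lower() in BACKCHANNEL_WORDS:
        return False          # acknowledgment, not an interruption
    return True

print(is_intentional_interruption(True, -20.0, 600, "no, stop"))  # True
print(is_intentional_interruption(True, -20.0, 600, "uh-huh"))    # False
```

Each check is cheap enough to run on every audio frame, which is what lets barge-in detection keep up with live playback.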
Turn-Taking
Conversation has natural turn-taking patterns. The agent needs to know:
- When the user has finished speaking (not just paused)
- When to respond immediately vs. wait for more input
- When the user is thinking vs. waiting for the agent
End-of-Turn Detection
Simple silence detection doesn't work:
- Some users pause while thinking
- Some sentences have natural mid-sentence pauses
- Different topics warrant different pause tolerances
We combine:
- Acoustic signals: Falling intonation, silence duration
- Linguistic signals: Complete sentences, question marks
- Context signals: Whether a response is expected
This is still an active area of improvement. Getting turn-taking perfect is harder than getting the LLM responses right.
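One way to combine those signal families is a weighted score against a threshold. The weights and threshold here are illustrative assumptions, not tuned values:

```python
def end_of_turn_score(silence_ms: int,
                      falling_intonation: bool,
                      sentence_complete: bool,
                      response_expected: bool) -> float:
    score = 0.0
    score += min(silence_ms / 1000.0, 1.0) * 0.4   # acoustic: silence duration
    score += 0.2 if falling_intonation else 0.0    # acoustic: pitch contour
    score += 0.25 if sentence_complete else 0.0    # linguistic: utterance looks done
    score += 0.15 if response_expected else 0.0    # context: agent asked a question
    return score

def user_turn_ended(score: float, threshold: float = 0.6) -> bool:
    return score >= threshold

# A mid-sentence pause scores low even after a full second of silence:
print(user_turn_ended(end_of_turn_score(1000, False, False, False)))  # False
# A complete sentence with falling pitch ends the turn much sooner:
print(user_turn_ended(end_of_turn_score(600, True, True, True)))      # True
```

The useful property is that no single signal decides: silence alone can never cross the threshold, which is exactly what protects thinking pauses.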
Audio Quality Issues
Real-world audio is messy:
- Background noise (offices, streets, cars)
- Poor microphone quality
- Network packet loss
- Echo and feedback
- Accents and speech patterns
Audio Preprocessing
Before transcription, we:
- Noise reduction: Filter consistent background noise
- Normalisation: Consistent volume levels
- Echo cancellation: Remove agent audio picked up by user's mic
- Packet loss handling: Interpolate missing audio samples
Better input quality means better transcription means better responses.
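The preprocessing steps compose into a simple chain. The DSP bodies below are trivial placeholders; the real versions would use spectral subtraction, adaptive echo filters, and so on:

```python
def reduce_noise(samples):
    # placeholder: real version filters consistent background noise
    return list(samples)

def cancel_echo(samples, agent_output):
    # placeholder: real version runs an adaptive filter against agent audio
    return samples

def interpolate_gaps(samples):
    # placeholder: real version fills lost packets from neighbouring frames
    return samples

def normalise(samples, target_peak=0.9):
    # bring the loudest sample to a consistent level
    peak = max(abs(s) for s in samples) or 1.0
    return [s * target_peak / peak for s in samples]

def preprocess(samples, agent_output):
    for step in (reduce_noise,
                 lambda s: cancel_echo(s, agent_output),
                 interpolate_gaps,
                 normalise):
        samples = step(samples)
    return samples

cleaned = preprocess([0.1, -0.5, 0.25], agent_output=[])
print(max(abs(s) for s in cleaned))  # 0.9
```

Order matters in the real chain too: normalising before noise reduction would amplify the noise floor along with the speech.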
Graceful Degradation
When audio quality is poor, we:
- Request clarification naturally ("Sorry, I didn't catch that - could you repeat?")
- Ask specific questions rather than open-ended ones
- Confirm important information before acting
The goal is handling bad audio gracefully rather than producing garbage responses.
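A sketch of how that degradation can be driven by transcription confidence, assuming the transcriber reports a per-utterance score (thresholds are illustrative):

```python
def choose_strategy(transcript: str, confidence: float) -> str:
    """Pick a response strategy based on how trustworthy the transcript is."""
    if confidence < 0.4:
        return "clarify"   # "Sorry, I didn't catch that - could you repeat?"
    if confidence < 0.7:
        return "confirm"   # read important details back before acting
    return "answer"        # proceed normally

print(choose_strategy("transfer fifty pounds", 0.35))  # clarify
print(choose_strategy("transfer fifty pounds", 0.65))  # confirm
print(choose_strategy("transfer fifty pounds", 0.95))  # answer
```

The middle band is the important one: acting on a half-heard instruction is worse than one extra confirmation turn.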
The Ordering Problem
With everything streaming and async, ordering becomes complex:
- Transcription chunks might arrive out of order
- LLM tokens stream while earlier tokens are still being spoken
- Interruptions create race conditions between stopping and starting
Sequence Numbers
Every chunk has a sequence number. The audio player maintains a buffer and reorders as needed. If chunks arrive too late, they're dropped rather than played out of order.
This is fiddly to get right. Off-by-one errors result in garbled audio or unnatural pauses.
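A minimal sketch of such a reorder buffer, assuming each chunk carries a sequence number and anything older than the playback position is dropped:

```python
import heapq

class ReorderBuffer:
    def __init__(self):
        self.heap = []       # min-heap of (seq, chunk), earliest first
        self.next_seq = 0    # sequence number the player needs next

    def push(self, seq: int, chunk: bytes):
        if seq < self.next_seq:
            return           # arrived too late: drop, don't garble playback
        heapq.heappush(self.heap, (seq, chunk))

    def pop_ready(self):
        """Return chunks that are next in sequence, in order."""
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            _, chunk = heapq.heappop(self.heap)
            out.append(chunk)
            self.next_seq += 1
        return out

buf = ReorderBuffer()
buf.push(1, b"B")            # early: held until chunk 0 arrives
early = buf.pop_ready()      # [] - nothing playable yet
buf.push(0, b"A")
in_order = buf.pop_ready()   # [b'A', b'B'] - gap filled, both released
buf.push(0, b"stale")        # too late: dropped, never played
late = buf.pop_ready()       # []
```

The off-by-one traps live in `next_seq`: advance it one step too far and you silently drop a valid chunk; one step too few and playback stalls on a chunk that already played.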
State Management
A voice conversation is stateful:
- What has been said (conversation history)
- What is currently being said (in-progress transcription)
- What the agent is saying (current response)
- What's queued to be said (pending audio)
- What was interrupted (cancelled responses)
All of this state changes rapidly and concurrently. Race conditions are everywhere.
Message Sessions
We wrap conversation state in message session objects that handle:
- Atomic updates to conversation history
- Proper cleanup on interruption
- Context preservation across turns
- Graceful handling of concurrent updates
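A sketch of what such a session object can look like, assuming one lock guards the rapidly changing state and each response carries an id so stale token streams can be ignored (the interfaces are illustrative, not Switchboard's real ones):

```python
import threading

class MessageSession:
    def __init__(self):
        self._lock = threading.Lock()
        self.history = []        # what has been said
        self.in_progress = None  # the agent's current response, if any

    def start_response(self, response_id: str):
        with self._lock:
            self.in_progress = {"id": response_id, "text": ""}

    def append_tokens(self, response_id: str, text: str):
        with self._lock:
            # drop tokens from a response that was cancelled or replaced
            if self.in_progress and self.in_progress["id"] == response_id:
                self.in_progress["text"] += text

    def interrupt(self):
        with self._lock:
            if self.in_progress:
                # preserve what was actually spoken, marked as cut off
                self.history.append({"role": "agent",
                                     "text": self.in_progress["text"],
                                     "interrupted": True})
                self.in_progress = None

session = MessageSession()
session.start_response("r1")
session.append_tokens("r1", "The weather today is ")
session.interrupt()
session.append_tokens("r1", "sunny.")  # stale tokens after barge-in: dropped
```

The id check in `append_tokens` is what closes the race between an interruption and LLM tokens still in flight: without it, a cancelled response keeps mutating state after the user has moved on.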
Testing Voice Systems
How do you test a voice interface?
- Unit tests: Test individual components (transcription, TTS, LLM integration)
- Integration tests: Test the full pipeline with recorded audio
- Scenario tests: Test specific conversation flows
- Chaos testing: Inject latency, packet loss, interruptions
We have dedicated test suites for:
- Barge-in behaviour
- Interruption recovery
- Audio ordering
- Race conditions under load
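Chaos-style tests can be built by wrapping a clean recording with injected faults. A sketch, assuming the pipeline consumes a sequence of audio chunks (the fault rates are illustrative):

```python
import random

def chaos_stream(chunks, drop_rate=0.1, swap_rate=0.1, seed=42):
    """Return the chunk sequence with injected loss and reordering."""
    rng = random.Random(seed)  # seeded so a failing run reproduces exactly
    # simulate packet loss
    out = [c for c in chunks if rng.random() >= drop_rate]
    # simulate reordering by occasionally swapping neighbours
    i = 0
    while i + 1 < len(out):
        if rng.random() < swap_rate:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

clean = list(range(20))
noisy = chaos_stream(clean)
```

Seeding the fault injection is the design choice that matters: a race condition found under a random seed is only useful if the same seed reproduces it in the debugger.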
Automated testing catches regressions, but human evaluation remains essential. A technically correct response can still feel unnatural.
Lessons Learned
1. Streaming Is Table Stakes
You cannot build good voice AI without streaming end-to-end. Batch processing creates unacceptable latency.
2. The Hard Part Isn't the LLM
LLM integration is straightforward. The hard problems are in audio handling, turn-taking, and state management. Don't underestimate the "glue" work.
3. Perfect Is Impossible
Real-world audio and human behaviour are too variable for perfect handling. Design for graceful degradation - handle the common cases well and fail gracefully on edge cases.
4. Human Evaluation Matters
Metrics can tell you latency and error rates. They can't tell you whether a conversation felt natural. Regular human evaluation is essential.
What's Next
We're continuing to improve:
- Better turn-taking: ML-based end-of-turn detection
- Emotion awareness: Detecting user frustration and adapting
- Multi-party support: Handling multiple speakers
- Lower latency: Squeezing milliseconds out of every step
Voice AI is genuinely hard. But when it works - when a user has a natural conversation with an AI agent - it's transformative.
Want to solve hard problems?
Switchboard's voice capabilities are built by engineers who enjoy thorny technical challenges. If this sounds interesting:

