Voice AI sounds simple: transcribe the speech, send it to an LLM, synthesise a response. In practice, building voice agents that feel natural to talk to is full of hard problems that don't exist in text interfaces.
Here's what we learned building Switchboard's voice capabilities.
The Latency Budget
Humans notice delays of more than roughly 200-300 milliseconds in conversation. That's your total budget: from when the user stops speaking to when they hear a response.
Break that down:
- Detecting speech end: 50-100ms
- Transcription: 100-200ms
- LLM inference: 200-500ms (or more)
- Text-to-speech: 100-200ms
- Network round trips: Variable
Add those up and you're already over budget. This is why streaming everything is non-negotiable - you can't wait for complete responses before starting the next step.
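The arithmetic is worth making concrete. Summing the midpoints of the ranges above (illustrative figures, not measurements of any particular stack):

```python
# Illustrative per-stage latencies in milliseconds, taken as midpoints
# of the ranges above. Real numbers vary by model, hardware, and network.
stages = {
    "speech_end_detection": 75,
    "transcription": 150,
    "llm_inference": 350,
    "text_to_speech": 150,
}

# Running the stages strictly one after another:
sequential_total = sum(stages.values())
print(sequential_total)  # 725 ms - far beyond a 200-300 ms budget
```

Even before adding network round trips, a sequential pipeline is two to three times over budget, which is what forces the streaming design below.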
Our Approach
We pipeline everything:
- Audio streams to transcription while the user is still speaking
- Partial transcripts stream to the LLM before the user finishes
- LLM responses stream to TTS as tokens generate
- Audio streams back while the response is still generating
The user hears the beginning of the response while the LLM is still generating the end. This feels responsive even when total processing time is high.
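The pipelined flow can be sketched with async generators, where each stage lazily consumes the previous one and yields partial results as soon as they exist. The stage functions here are hypothetical stand-ins, not a real transcription or TTS API:

```python
import asyncio

# Each stage is an async generator: it yields partial output while its
# input is still arriving. These are toy stand-ins for real services.

async def transcribe(audio_chunks):
    async for chunk in audio_chunks:
        yield f"partial transcript of {chunk}"   # streams while user speaks

async def generate(transcripts):
    async for text in transcripts:
        yield f"token for {text}"                # tokens before input completes

async def synthesise(tokens):
    async for token in tokens:
        yield f"audio for {token}"               # TTS before generation finishes

async def pipeline(audio_chunks):
    # Stages are composed lazily, so the first audio frame is ready
    # long before the last LLM token is generated.
    async for frame in synthesise(generate(transcribe(audio_chunks))):
        yield frame

async def main():
    async def mic():
        for i in range(3):
            yield f"chunk{i}"
    return [frame async for frame in pipeline(mic())]

frames = asyncio.run(main())
print(frames[0])  # audio for token for partial transcript of chunk0
```

The composition `synthesise(generate(transcribe(...)))` mirrors the four bullets above: nothing waits for a complete upstream result.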
The Interruption Problem
Humans interrupt each other. It's natural conversational behaviour. A voice agent that can't handle interruptions feels robotic and frustrating.
But interruption handling is complex:
- Detection: Is this an interruption or just background noise?
- Stopping: Halt TTS playback immediately
- Context: What was the user trying to say?
- Recovery: Continue naturally from the interruption
Barge-In Detection
We continuously monitor for user speech while the agent is talking. When detected:
- Stop TTS playback within milliseconds
- Cancel any pending audio chunks
- Capture what the user is saying
- Update the conversation context
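The four steps above can be sketched as a single locked operation. The player and chunk-queue interfaces here are hypothetical stand-ins, not Switchboard's real API:

```python
import threading

class FakePlayer:
    """Minimal stand-in for an audio playback interface."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class BargeInHandler:
    def __init__(self, player, pending_chunks, history):
        self.player = player
        self.pending = pending_chunks
        self.history = history
        self.lock = threading.Lock()  # barge-in races against playback

    def on_user_speech(self, partial_transcript):
        with self.lock:
            self.player.stop()        # halt TTS playback immediately
            self.pending.clear()      # cancel any pending audio chunks
            self.history.append({     # update the conversation context
                "role": "user",
                "text": partial_transcript,
                "interrupted_agent": True,
            })

player = FakePlayer()
pending = ["chunk-3", "chunk-4"]
history = []
BargeInHandler(player, pending, history).on_user_speech("actually, hold on")
```

Doing all four steps under one lock matters: if playback stops but stale chunks stay queued, the agent resumes mid-sentence after the interruption.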
The challenge is distinguishing intentional interruption from:
- Background noise
- The user's "uh-huh" acknowledgments
- Audio feedback loops (the user's mic picking up the agent)
We use a combination of voice activity detection, volume thresholds, and timing heuristics. It's not perfect, but it handles the common cases well.
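A minimal sketch of that heuristic combination, with thresholds that are illustrative assumptions rather than production values:

```python
# Short backchannel acknowledgments should not stop the agent.
BACKCHANNEL_WORDS = {"uh-huh", "mm-hmm", "yeah", "right", "ok"}

def is_intentional_interruption(vad_speech: bool,
                                volume_db: float,
                                duration_ms: int,
                                transcript: str) -> bool:
    if not vad_speech:
        return False          # VAD says this isn't speech at all
    if volume_db < -35.0:
        return False          # too quiet: likely background noise
    if duration_ms < 250:
        return False          # too short: likely a click or pop
    if transcript.strip().lower() in BACKCHANNEL_WORDS:
        return False          # acknowledgment, not an interruption
    return True

print(is_intentional_interruption(True, -20.0, 600, "no, stop"))  # True
print(is_intentional_interruption(True, -20.0, 600, "uh-huh"))    # False
```

Each check is cheap enough to run on every audio frame, which is what lets barge-in detection keep up with live playback.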
Turn-Taking
Conversation has natural turn-taking patterns. The agent needs to know:
- When the user has finished speaking (not just paused)
- When to respond immediately vs. wait for more input
- When the user is thinking vs. waiting for the agent
End-of-Turn Detection
Simple silence detection doesn't work:
- Some users pause while thinking
- Some sentences have natural mid-sentence pauses
- Different topics warrant different pause tolerances
We combine:
- Acoustic signals: Falling intonation, silence duration
- Linguistic signals: Complete sentences, question marks
- Context signals: Whether a response is expected
This is still an active area of improvement. Getting turn-taking perfect is harder than getting the LLM responses right.
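One way to combine those signal families is a weighted score against a threshold. The weights and threshold here are illustrative assumptions, not tuned values:

```python
def end_of_turn_score(silence_ms: int,
                      falling_intonation: bool,
                      sentence_complete: bool,
                      response_expected: bool) -> float:
    score = 0.0
    score += min(silence_ms / 1000.0, 1.0) * 0.4   # acoustic: silence duration
    score += 0.2 if falling_intonation else 0.0    # acoustic: pitch contour
    score += 0.25 if sentence_complete else 0.0    # linguistic: utterance looks done
    score += 0.15 if response_expected else 0.0    # context: agent asked a question
    return score

def user_turn_ended(score: float, threshold: float = 0.6) -> bool:
    return score >= threshold

# A mid-sentence pause scores low even after a full second of silence:
print(user_turn_ended(end_of_turn_score(1000, False, False, False)))  # False
# A complete sentence with falling pitch ends the turn much sooner:
print(user_turn_ended(end_of_turn_score(600, True, True, True)))      # True
```

The useful property is that no single signal decides: silence alone can never cross the threshold, which is exactly what protects thinking pauses.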
Audio Quality Issues
Real-world audio is messy:
- Background noise (offices, streets, cars)
- Poor microphone quality
- Network packet loss
- Echo and feedback
- Accents and speech patterns
Audio Preprocessing
Before transcription, we:
- Noise reduction: Filter consistent background noise
- Normalisation: Consistent volume levels
- Echo cancellation: Remove agent audio picked up by user's mic
- Packet loss handling: Interpolate missing audio samples
Better input quality means better transcription means better responses.
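The preprocessing steps compose into a simple chain. The DSP bodies below are trivial placeholders; the real versions would use spectral subtraction, adaptive echo filters, and so on:

```python
def reduce_noise(samples):
    # placeholder: real version filters consistent background noise
    return list(samples)

def cancel_echo(samples, agent_output):
    # placeholder: real version runs an adaptive filter against agent audio
    return samples

def interpolate_gaps(samples):
    # placeholder: real version fills lost packets from neighbouring frames
    return samples

def normalise(samples, target_peak=0.9):
    # bring the loudest sample to a consistent level
    peak = max(abs(s) for s in samples) or 1.0
    return [s * target_peak / peak for s in samples]

def preprocess(samples, agent_output):
    for step in (reduce_noise,
                 lambda s: cancel_echo(s, agent_output),
                 interpolate_gaps,
                 normalise):
        samples = step(samples)
    return samples

cleaned = preprocess([0.1, -0.5, 0.25], agent_output=[])
print(max(abs(s) for s in cleaned))  # 0.9
```

Order matters in the real chain too: normalising before noise reduction would amplify the noise floor along with the speech.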
Graceful Degradation
When audio quality is poor, we:
- Request clarification naturally ("Sorry, I didn't catch that - could you repeat?")
- Ask specific questions rather than open-ended ones
- Confirm important information before acting
The goal is handling bad audio gracefully rather than producing garbage responses.
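A sketch of how that degradation can be driven by transcription confidence, assuming the transcriber reports a per-utterance score (thresholds are illustrative):

```python
def choose_strategy(transcript: str, confidence: float) -> str:
    """Pick a response strategy based on how trustworthy the transcript is."""
    if confidence < 0.4:
        return "clarify"   # "Sorry, I didn't catch that - could you repeat?"
    if confidence < 0.7:
        return "confirm"   # read important details back before acting
    return "answer"        # proceed normally

print(choose_strategy("transfer fifty pounds", 0.35))  # clarify
print(choose_strategy("transfer fifty pounds", 0.65))  # confirm
print(choose_strategy("transfer fifty pounds", 0.95))  # answer
```

The middle band is the important one: acting on a half-heard instruction is worse than one extra confirmation turn.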
The Ordering Problem
With everything streaming and async, ordering becomes complex:
- Transcription chunks might arrive out of order
- LLM tokens stream while earlier tokens are still being spoken
- Interruptions create race conditions between stopping and starting
Sequence Numbers
Every chunk has a sequence number. The audio player maintains a buffer and reorders as needed. If chunks arrive too late, they're dropped rather than played out of order.
This is fiddly to get right. Off-by-one errors result in garbled audio or unnatural pauses.
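A minimal sketch of such a reorder buffer, assuming each chunk carries a sequence number and anything older than the playback position is dropped:

```python
import heapq

class ReorderBuffer:
    def __init__(self):
        self.heap = []       # min-heap of (seq, chunk), earliest first
        self.next_seq = 0    # sequence number the player needs next

    def push(self, seq: int, chunk: bytes):
        if seq < self.next_seq:
            return           # arrived too late: drop, don't garble playback
        heapq.heappush(self.heap, (seq, chunk))

    def pop_ready(self):
        """Return chunks that are next in sequence, in order."""
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            _, chunk = heapq.heappop(self.heap)
            out.append(chunk)
            self.next_seq += 1
        return out

buf = ReorderBuffer()
buf.push(1, b"B")            # early: held until chunk 0 arrives
early = buf.pop_ready()      # [] - nothing playable yet
buf.push(0, b"A")
in_order = buf.pop_ready()   # [b'A', b'B'] - gap filled, both released
buf.push(0, b"stale")        # too late: dropped, never played
late = buf.pop_ready()       # []
```

The off-by-one traps live in `next_seq`: advance it one step too far and you silently drop a valid chunk; one step too few and playback stalls on a chunk that already played.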
State Management
A voice conversation is stateful:
- What has been said (conversation history)
- What is currently being said (in-progress transcription)
- What the agent is saying (current response)
- What's queued to be said (pending audio)
- What was interrupted (cancelled responses)
All of this state changes rapidly and concurrently. Race conditions are everywhere.
Message Sessions
We wrap conversation state in message session objects that handle:
- Atomic updates to conversation history
- Proper cleanup on interruption
- Context preservation across turns
- Graceful handling of concurrent updates
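A sketch of what such a session object can look like, assuming one lock guards the rapidly changing state and each response carries an id so stale token streams can be ignored (the interfaces are illustrative, not Switchboard's real ones):

```python
import threading

class MessageSession:
    def __init__(self):
        self._lock = threading.Lock()
        self.history = []        # what has been said
        self.in_progress = None  # the agent's current response, if any

    def start_response(self, response_id: str):
        with self._lock:
            self.in_progress = {"id": response_id, "text": ""}

    def append_tokens(self, response_id: str, text: str):
        with self._lock:
            # drop tokens from a response that was cancelled or replaced
            if self.in_progress and self.in_progress["id"] == response_id:
                self.in_progress["text"] += text

    def interrupt(self):
        with self._lock:
            if self.in_progress:
                # preserve what was actually spoken, marked as cut off
                self.history.append({"role": "agent",
                                     "text": self.in_progress["text"],
                                     "interrupted": True})
                self.in_progress = None

session = MessageSession()
session.start_response("r1")
session.append_tokens("r1", "The weather today is ")
session.interrupt()
session.append_tokens("r1", "sunny.")  # stale tokens after barge-in: dropped
```

The id check in `append_tokens` is what closes the race between an interruption and LLM tokens still in flight: without it, a cancelled response keeps mutating state after the user has moved on.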
Testing Voice Systems
How do you test a voice interface?
- Unit tests: Test individual components (transcription, TTS, LLM integration)
- Integration tests: Test the full pipeline with recorded audio
- Scenario tests: Test specific conversation flows
- Chaos testing: Inject latency, packet loss, interruptions
We have dedicated test suites for:
- Barge-in behaviour
- Interruption recovery
- Audio ordering
- Race conditions under load
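Chaos-style tests can be built by wrapping a clean recording with injected faults. A sketch, assuming the pipeline consumes a sequence of audio chunks (the fault rates are illustrative):

```python
import random

def chaos_stream(chunks, drop_rate=0.1, swap_rate=0.1, seed=42):
    """Return the chunk sequence with injected loss and reordering."""
    rng = random.Random(seed)  # seeded so a failing run reproduces exactly
    # simulate packet loss
    out = [c for c in chunks if rng.random() >= drop_rate]
    # simulate reordering by occasionally swapping neighbours
    i = 0
    while i + 1 < len(out):
        if rng.random() < swap_rate:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

clean = list(range(20))
noisy = chaos_stream(clean)
```

Seeding the fault injection is the design choice that matters: a race condition found under a random seed is only useful if the same seed reproduces it in the debugger.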
Automated testing catches regressions, but human evaluation remains essential. A technically correct response can still feel unnatural.
Lessons Learned
1. Streaming Is Table Stakes
You cannot build good voice AI without streaming end-to-end. Batch processing creates unacceptable latency.
2. The Hard Part Isn't the LLM
LLM integration is straightforward. The hard problems are in audio handling, turn-taking, and state management. Don't underestimate the "glue" work.
3. Perfect Is Impossible
Real-world audio and human behaviour are too variable for perfect handling. Design for graceful degradation - handle the common cases well and fail gracefully on edge cases.
4. Human Evaluation Matters
Metrics can tell you latency and error rates. They can't tell you whether a conversation felt natural. Regular human evaluation is essential.
What's Next
We're continuing to improve:
- Better turn-taking: ML-based end-of-turn detection
- Emotion awareness: Detecting user frustration and adapting
- Multi-party support: Handling multiple speakers
- Lower latency: Squeezing milliseconds out of every step
Voice AI is genuinely hard. But when it works - when a user has a natural conversation with an AI agent - it's transformative.
Want to solve hard problems?
Switchboard's voice capabilities are built by engineers who enjoy thorny technical challenges. If this sounds interesting:

