Engineering

The Hard Problems in Real-Time Voice AI

Building voice AI agents that feel natural is harder than it looks. Here are the technical challenges we tackled building Switchboard - from interruption handling to turn-taking.

Dr. Adam Sykes

Founder & CEO

January 6, 2026
7 min read
Contents
  • The Latency Budget
  • Our Approach
  • The Interruption Problem
  • Barge-In Detection
  • Turn-Taking
  • End-of-Turn Detection
  • Audio Quality Issues
  • Audio Preprocessing
  • Graceful Degradation
  • The Ordering Problem
  • Sequence Numbers
  • State Management
  • Message Sessions
  • Testing Voice Systems
  • Lessons Learned
  • 1. Streaming Is Table Stakes
  • 2. The Hard Part Isn't the LLM
  • 3. Perfect Is Impossible
  • 4. Human Evaluation Matters
  • What's Next
  • Want to solve hard problems?

Voice AI sounds simple: transcribe the speech, send it to an LLM, synthesise a response. In practice, building voice agents that feel natural to talk to is full of hard problems that don't exist in text interfaces.

Here's what we learned building Switchboard's voice capabilities.

The Latency Budget

Humans notice delays over 200-300 milliseconds in conversation. That's your total budget - from when the user stops speaking to when they hear a response.

Break that down:

  • Detecting speech end: 50-100ms
  • Transcription: 100-200ms
  • LLM inference: 200-500ms (or more)
  • Text-to-speech: 100-200ms
  • Network round trips: Variable

Add those up and you're already over budget. This is why streaming everything is non-negotiable - you can't wait for complete responses before starting the next step.
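To make the arithmetic concrete, here is the sequential total using midpoints of the ranges above (the figures are illustrative, not Switchboard's measured numbers):

```python
# Rough latency budget for one conversational turn, summed sequentially.
# Stage values are midpoints of the ranges above; network is assumed.
STAGES_MS = {
    "speech_end_detection": 75,   # 50-100ms
    "transcription": 150,         # 100-200ms
    "llm_inference": 350,         # 200-500ms (or more)
    "text_to_speech": 150,        # 100-200ms
    "network": 100,               # variable; assume 100ms
}

BUDGET_MS = 300  # upper end of what people tolerate in conversation

total = sum(STAGES_MS.values())
print(f"sequential total: {total}ms "
      f"(budget: {BUDGET_MS}ms, over by {total - BUDGET_MS}ms)")
```

Run sequentially, the pipeline is several hundred milliseconds over budget before any real-world slowness is added, which is why the stages have to overlap.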

Our Approach

We pipeline everything:

  1. Audio streams to transcription while the user is still speaking
  2. Partial transcripts stream to the LLM before the user finishes
  3. LLM responses stream to TTS as tokens are generated
  4. Audio streams back while the response is still generating

The user hears the beginning of the response while the LLM is still generating the end. This feels responsive even when total processing time is high.
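The pipelining idea can be sketched with Python generators - each stage yields output as soon as it has any, so downstream stages start before upstream ones finish (the stage bodies here are stand-ins, not our actual components):

```python
# Each stage is a generator: it consumes input incrementally and yields
# output immediately, so later stages start before earlier ones finish.

def transcribe(audio_chunks):
    # Emit a partial transcript for each audio chunk as it arrives.
    for chunk in audio_chunks:
        yield f"[text for {chunk}]"

def generate(transcript_parts):
    # Stream LLM "tokens" as transcript parts come in.
    for part in transcript_parts:
        yield f"response-to({part})"

def synthesise(tokens):
    # Convert each token to an audio frame straight away.
    for token in tokens:
        yield f"audio({token})"

audio = iter(["chunk1", "chunk2", "chunk3"])
# Pulling the first frame consumes only the first audio chunk:
first_frame = next(synthesise(generate(transcribe(audio))))
print(first_frame)  # audio(response-to([text for chunk1]))
```

Because generators are lazy, the first audio frame is produced after consuming only the first input chunk - the rest of the input can still be arriving.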

The Interruption Problem

Humans interrupt each other. It's natural conversational behaviour. A voice agent that can't handle interruptions feels robotic and frustrating.

But interruption handling is complex:

  1. Detection: Is this an interruption or just background noise?
  2. Stopping: Halt TTS playback immediately
  3. Context: What was the user trying to say?
  4. Recovery: Continue naturally from the interruption

Barge-In Detection

We continuously monitor for user speech while the agent is talking. When detected:

  1. Stop TTS playback within milliseconds
  2. Cancel any pending audio chunks
  3. Capture what the user is saying
  4. Update the conversation context

The challenge is distinguishing intentional interruption from:

  • Background noise
  • The user's "uh-huh" acknowledgements
  • Audio feedback loops (the user's mic picking up the agent)

We use a combination of voice activity detection, volume thresholds, and timing heuristics. It's not perfect, but it handles the common cases well.
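That combination can be sketched as a toy detector: an energy check doubles as the voice-activity and volume test, and a minimum-duration heuristic filters out brief noises and backchannels (all thresholds are invented for illustration):

```python
# Toy barge-in detector over per-frame energy levels (0.0-1.0) captured
# while the agent is speaking. Requires sustained energy above a
# threshold before treating speech as a real interruption.

def detect_barge_in(frames, energy_threshold=0.3, min_frames=3):
    """Return True if the frames look like an intentional interruption."""
    consecutive = 0
    for energy in frames:
        if energy >= energy_threshold:      # VAD + volume check
            consecutive += 1
            if consecutive >= min_frames:   # timing heuristic
                return True
        else:
            consecutive = 0                 # reset on a dip to silence
    return False

# A single loud frame (door slam) is ignored; sustained speech triggers.
print(detect_barge_in([0.9, 0.1, 0.1, 0.1]))        # False
print(detect_barge_in([0.5, 0.6, 0.7, 0.6, 0.2]))   # True
```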

Turn-Taking

Conversation has natural turn-taking patterns. The agent needs to know:

  • When the user has finished speaking (not just paused)
  • When to respond immediately vs. wait for more input
  • When the user is thinking vs. waiting for the agent

End-of-Turn Detection

Simple silence detection doesn't work:

  • Some users pause while thinking
  • Some utterances have natural mid-sentence pauses
  • Different topics warrant different pause tolerances

We combine:

  • Acoustic signals: Falling intonation, silence duration
  • Linguistic signals: Complete sentences, question marks
  • Context signals: Whether a response is expected

This is still an active area of improvement. Getting turn-taking perfect is harder than getting the LLM responses right.
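One way to picture combining those signals is a weighted score (the features and weights here are invented for illustration; a production system would tune or learn them):

```python
# Toy end-of-turn score combining acoustic, linguistic, and context
# signals. Above some threshold, the agent takes the turn.

def end_of_turn_score(silence_ms, falling_intonation, text, expects_reply):
    score = 0.0
    score += min(silence_ms / 1000.0, 1.0) * 0.4      # acoustic: silence
    score += 0.2 if falling_intonation else 0.0       # acoustic: pitch
    if text.rstrip().endswith((".", "?", "!")):       # linguistic
        score += 0.25
    score += 0.15 if expects_reply else 0.0           # context
    return score

# A completed question after 700ms of silence scores high:
done = end_of_turn_score(700, True, "What time do you open?", True)
# A mid-sentence thinking pause scores low:
pause = end_of_turn_score(400, False, "I was wondering if", False)
print(done > 0.7, pause < 0.4)  # True True
```

A fixed linear score like this is exactly the kind of heuristic that ML-based end-of-turn detection (mentioned under "What's Next") would replace.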

Audio Quality Issues

Real-world audio is messy:

  • Background noise (offices, streets, cars)
  • Poor microphone quality
  • Network packet loss
  • Echo and feedback
  • Accents and speech patterns

Audio Preprocessing

Before transcription, we apply:

  1. Noise reduction: Filter consistent background noise
  2. Normalisation: Consistent volume levels
  3. Echo cancellation: Remove agent audio picked up by user's mic
  4. Packet loss handling: Interpolate missing audio samples

Better input quality means better transcription means better responses.
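As an example of one of those steps, volume normalisation can be sketched as simple peak normalisation (real pipelines normalise loudness over streaming buffers, not whole clips):

```python
# Peak normalisation: scale samples so the loudest one hits the target
# peak, giving consistent volume levels into the transcriber.

def normalise(samples, target_peak=0.9):
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)          # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

quiet = [0.01, -0.02, 0.015]
loud = normalise(quiet)
print(max(abs(s) for s in loud))      # close to 0.9
```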

Graceful Degradation

When audio quality is poor, we:

  • Request clarification naturally ("Sorry, I didn't catch that - could you repeat?")
  • Ask specific questions rather than open-ended ones
  • Confirm important information before acting

The goal is handling bad audio gracefully rather than producing garbage responses.

The Ordering Problem

With everything streaming and async, ordering becomes complex:

  • Transcription chunks might arrive out of order
  • LLM tokens stream while earlier tokens are still being spoken
  • Interruptions create race conditions between stopping and starting

Sequence Numbers

Every chunk has a sequence number. The audio player maintains a buffer and reorders as needed. If chunks arrive too late, they're dropped rather than played out of order.

This is fiddly to get right. Off-by-one errors result in garbled audio or unnatural pauses.
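The buffer logic can be sketched like this (a simplified model, not Switchboard's implementation):

```python
# Reordering buffer: out-of-order chunks wait in `pending`; anything
# arriving after its slot has been passed is dropped, never replayed.

class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def push(self, seq, chunk):
        """Return the chunks now safe to play, in order."""
        if seq < self.next_seq:
            return []                      # too late: drop it
        self.pending[seq] = chunk
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

buf = ReorderBuffer()
print(buf.push(1, "b"))   # [] - waiting for chunk 0
print(buf.push(0, "a"))   # ['a', 'b'] - both now playable
print(buf.push(0, "a"))   # [] - late duplicate dropped
```

The off-by-one risk mentioned above lives in `next_seq`: advance it one step too far and a valid chunk gets dropped as "late", producing exactly the garbled audio or unnatural pause described.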

State Management

A voice conversation is stateful:

  • What has been said (conversation history)
  • What is currently being said (in-progress transcription)
  • What the agent is saying (current response)
  • What's queued to be said (pending audio)
  • What was interrupted (cancelled responses)

All of this state changes rapidly and concurrently. Race conditions are everywhere.

Message Sessions

We wrap conversation state in message session objects that handle:

  • Atomic updates to conversation history
  • Proper cleanup on interruption
  • Context preservation across turns
  • Graceful handling of concurrent updates
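A minimal sketch of the idea: conversation state guarded by a lock so concurrent events update the history atomically (field and method names here are illustrative, not Switchboard's API):

```python
import threading

# Message session: all mutations go through the lock, so a barge-in
# arriving mid-response can't corrupt the conversation history.

class MessageSession:
    def __init__(self):
        self._lock = threading.Lock()
        self.history = []            # committed turns
        self.current_response = None  # what the agent is saying now

    def start_response(self, text):
        with self._lock:
            self.current_response = text

    def complete_response(self):
        with self._lock:
            if self.current_response is not None:
                self.history.append(("agent", self.current_response))
                self.current_response = None

    def interrupt(self, partial_heard):
        """On barge-in, commit only what the user actually heard."""
        with self._lock:
            if self.current_response is not None:
                self.history.append(("agent", partial_heard + "..."))
                self.current_response = None

session = MessageSession()
session.start_response("Your appointment is on Tuesday at ten")
session.interrupt("Your appointment is on Tue")
print(session.history)  # [('agent', 'Your appointment is on Tue...')]
```

Recording the truncated utterance, rather than the full planned response, is what keeps the context accurate after an interruption.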

Testing Voice Systems

How do you test a voice interface?

  • Unit tests: Test individual components (transcription, TTS, LLM integration)
  • Integration tests: Test the full pipeline with recorded audio
  • Scenario tests: Test specific conversation flows
  • Chaos testing: Inject latency, packet loss, interruptions

We have dedicated test suites for:

  • Barge-in behaviour
  • Interruption recovery
  • Audio ordering
  • Race conditions under load

Automated testing catches regressions, but human evaluation remains essential. A technically correct response can still feel unnatural.
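For a flavour of what a chaos-style test looks like, here is a sketch that injects packet loss into a fake pipeline and asserts the agent degrades gracefully instead of emitting garbage (every component here is a stand-in):

```python
import random

# Fake transcription under packet loss: each word survives with
# probability (1 - loss_rate).
def transcribe_with_loss(words, loss_rate, rng):
    return [w for w in words if rng.random() >= loss_rate]

# Graceful degradation: if too much was lost, ask to repeat rather
# than respond to a mangled transcript.
def respond(words, expected_len):
    if len(words) < expected_len * 0.6:
        return "Sorry, I didn't catch that - could you repeat?"
    return "OK: " + " ".join(words)

rng = random.Random(42)  # seeded, so the test is reproducible
utterance = "I need to change my booking to next Friday".split()
heard = transcribe_with_loss(utterance, loss_rate=0.5, rng=rng)
print(respond(heard, expected_len=len(utterance)))
```

Seeding the random source is the important trick: the same "chaos" replays identically on every run, so a failure is debuggable rather than a flake.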

Lessons Learned

1. Streaming Is Table Stakes

You cannot build good voice AI without streaming end-to-end. Batch processing creates unacceptable latency.

2. The Hard Part Isn't the LLM

LLM integration is straightforward. The hard problems are in audio handling, turn-taking, and state management. Don't underestimate the "glue" work.

3. Perfect Is Impossible

Real-world audio and human behaviour are too variable for perfect handling. Design for graceful degradation - handle the common cases well and fail gracefully on edge cases.

4. Human Evaluation Matters

Metrics can tell you latency and error rates. They can't tell you whether a conversation felt natural. Regular human evaluation is essential.

What's Next

We're continuing to improve:

  • Better turn-taking: ML-based end-of-turn detection
  • Emotion awareness: Detecting user frustration and adapting
  • Multi-party support: Handling multiple speakers
  • Lower latency: Squeezing milliseconds out of every step

Voice AI is genuinely hard. But when it works - when a user has a natural conversation with an AI agent - it's transformative.


Want to solve hard problems?

Switchboard's voice capabilities are built by engineers who enjoy thorny technical challenges. If this sounds interesting:

View engineering roles | Learn about Switchboard Voice
