You've integrated a real-time transcription API. Your tests pass. The demo works perfectly. Then you deploy to production and users start complaining: "The system takes forever to respond after I stop speaking."
You check your logs. The AI is responding quickly once it receives the transcript. The text-to-speech is snappy. So where's the delay coming from?
After hours of debugging, you discover the culprit: your speech recognition events aren't behaving the way you thought they were.
The Problem: Phantom Latency
We recently encountered this exact issue in our voice AI platform. Users were experiencing 2-3 second delays between finishing their sentence and receiving a response. For spelled-out content like postcodes or reference numbers, the delay was even worse.
Our architecture looked straightforward:
1. Audio streams in via WebSocket
2. Speech-to-text service transcribes in real-time
3. When the user stops speaking, we send the complete utterance to the AI
4. AI responds, text-to-speech plays back
The problem was step 3. We were waiting for an "utterance end" signal that wasn't arriving when we expected it.
Understanding Speech Recognition Events
Most real-time transcription services emit several types of events during a conversation:
Interim results arrive continuously as the user speaks. These are preliminary transcriptions that may change as more audio context becomes available. "I need to book a" might become "I need to book an appointment" as the service hears more.
Final results are committed transcriptions. The service has high confidence in these words and won't revise them. They typically arrive at natural phrase boundaries or after brief pauses.
Utterance end signals that the user has finished speaking. This is the event you wait for before processing the complete input.
The naive approach is to listen for "utterance end" as a distinct event type, separate from the transcript stream. You accumulate text in one handler and wait for the end signal in another. This looks reasonable. It's also wrong for many APIs.
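As a sketch, the naive two-handler setup looks like this (the event names, message fields, and dispatch mechanism are hypothetical; real provider SDKs differ):

```python
# Naive approach: register a separate handler for an "utterance_end" event.
# Event names and message shapes here are hypothetical.

buffer = []
processed = []

def on_transcript(message):
    # Accumulate committed (final) segments from the transcript stream.
    if message.get("is_final"):
        buffer.append(message["text"])

def on_utterance_end(_message):
    # We expect this to fire when the user stops speaking...
    processed.append(" ".join(buffer))
    buffer.clear()

# Simulate what many services actually send: utterance end arrives as a
# message type *inside* the transcript stream, so the standalone handler
# registered below is never invoked.
handlers = {"transcript": on_transcript, "utterance_end": on_utterance_end}
for msg in [
    {"type": "transcript", "text": "I need to book", "is_final": True},
    {"type": "transcript", "text": "an appointment", "is_final": True},
    {"type": "transcript", "subtype": "utterance_end"},  # routed as a transcript
]:
    handlers[msg["type"]](msg)

print(processed)  # [] -- the buffer fills, but nothing is ever processed
```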
The Event Routing Trap
Here's what we discovered: many speech-to-text services don't emit utterance end as a separate event. Instead, they send it as a message type within the transcript event channel.
The correct approach is to check each transcript message for its type. If the type indicates utterance end, you process the accumulated buffer. Otherwise, you continue accumulating final transcript segments.
The difference is subtle but critical. In the naive approach, you're listening for an event that never fires. In the correct approach, you're checking the message type within the transcript stream.
Our original implementation had both handlers, but the standalone utterance end listener was receiving nothing. We were falling back to timeout-based detection, which added 2-3 seconds of unnecessary delay.
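A minimal sketch of the type-checking approach, with everything handled inside the transcript handler (field names like `subtype` and `is_final` are illustrative; consult your provider's schema):

```python
# Correct approach: utterance end is a message variant *within* the
# transcript stream, so we check each message's type before accumulating.

buffer = []
responses = []

def on_transcript(message):
    if message.get("subtype") == "utterance_end":
        # The user has finished speaking: flush the accumulated buffer.
        if buffer:
            responses.append(" ".join(buffer))
            buffer.clear()
    elif message.get("is_final"):
        # A committed segment: append it, but keep waiting for the end signal.
        buffer.append(message["text"])

for msg in [
    {"text": "I need to book", "is_final": True},
    {"text": "an appointment", "is_final": True},
    {"subtype": "utterance_end"},
]:
    on_transcript(msg)

print(responses)  # ['I need to book an appointment']
```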
The Flags That Matter
Real-time transcription involves multiple confidence signals. Understanding these prevents both premature and delayed processing:
Final segment flags indicate the service won't revise this particular segment. You can safely append it to your buffer. But a final segment doesn't mean the user has stopped speaking. They might just be pausing between words.
Speech final flags indicate the end of a continuous speech segment. The user has paused long enough that the service considers this a complete phrase. But they might continue speaking after a brief pause.
Utterance end signals (or equivalent) indicate the user has definitively stopped speaking. This is your signal to process the input.
A robust implementation checks all three: accumulating final segments, noting speech boundaries as potential processing points, and waiting for utterance end before committing to a response.
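Sketched as a small accumulator class, the three signals look like this (the flag names `is_final` and `speech_final` follow one common provider's schema; yours may use different field names):

```python
class TranscriptBuffer:
    """Accumulates final segments and flushes on a definitive utterance end."""

    def __init__(self):
        self.segments = []
        self.speech_boundary = False  # a candidate processing point, not a commitment

    def handle(self, message):
        """Return the complete utterance when it's time to process, else None."""
        if message.get("type") == "UtteranceEnd":
            # Definitive end of speech: commit whatever we have.
            return self.flush()
        if message.get("is_final"):
            # Committed segment: safe to append; the user may still continue.
            self.segments.append(message["text"])
            if message.get("speech_final"):
                # End of a continuous phrase: note it, but don't commit yet.
                self.speech_boundary = True
        return None

    def flush(self):
        utterance = " ".join(self.segments)
        self.segments.clear()
        self.speech_boundary = False
        return utterance or None

buf = TranscriptBuffer()
buf.handle({"is_final": True, "text": "my postcode is"})
buf.handle({"is_final": True, "speech_final": True, "text": "S W 1 A"})
result = buf.handle({"type": "UtteranceEnd"})
print(result)  # my postcode is S W 1 A
```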
Adaptive Timeouts for Different Speech Patterns
Even with correct event handling, you'll need timeout-based fallbacks. Network issues, API quirks, or unusual speech patterns can delay or prevent utterance end signals.
But a single timeout value doesn't work well across all scenarios. Consider the difference between:
- "Yes" (single word, should process immediately)
- "My postcode is S-W-1-A-2-A-A" (spelled content with pauses between characters)
- "I need to reschedule my appointment for next Tuesday at 3pm if that's available" (long continuous speech)
Spelled content is particularly challenging. Users naturally pause between characters, and each pause can look like an utterance end. A short timeout processes "S-W-1" before the user finishes "S-W-1-A-2-A-A".
We implemented adaptive timeouts based on speech patterns. The system analyses recent transcript content to detect patterns like spelled characters or single letters. When it detects spelled content, it extends the timeout to allow for natural pauses between characters. For phrases that end with clear punctuation markers, it uses a shorter timeout. Everything else falls in between.
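As an illustration, a heuristic along these lines can pick a fallback timeout from recent content (the regex, thresholds, and timeout values below are illustrative, not our production settings):

```python
import re

# Runs of single alphanumeric characters separated by spaces or hyphens,
# e.g. "S W 1 A" or "S-W-1-A-2-A-A" -- a rough proxy for spelled content.
SPELLED = re.compile(r"\b([A-Za-z0-9][\s\-]){2,}[A-Za-z0-9]\b")

def fallback_timeout(recent_text: str) -> float:
    """Choose a fallback silence timeout (seconds) based on recent transcript."""
    text = recent_text.strip()
    if SPELLED.search(text):
        return 3.0   # spelled content: allow natural pauses between characters
    if text.endswith((".", "?", "!")):
        return 0.8   # clear sentence boundary: process quickly
    return 1.5       # everything else: a middle ground

print(fallback_timeout("my postcode is S W 1 A"))  # 3.0
print(fallback_timeout("yes."))                    # 0.8
print(fallback_timeout("I need to reschedule"))    # 1.5
```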
This reduced our average response latency by approximately one second for typical speech, while still handling spelled content correctly.
Testing Real-Time Speech Systems
Unit testing real-time audio systems is notoriously difficult. Here's what actually works:
Record and replay real conversations. Capture actual WebSocket messages from production (with appropriate anonymisation) and replay them in tests. Synthetic audio rarely captures the quirks of real speech patterns.
Test the event accumulation logic separately. Your transcript buffer and flush logic can be tested independently of the audio pipeline. Feed it sequences of mock events and verify the output.
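For example, flush logic like the simplified accumulator below can be exercised with plain dicts standing in for provider messages (field names are illustrative):

```python
def run_events(events):
    """Feed mock transcript messages through accumulate/flush logic and
    return the list of flushed utterances."""
    buffer, flushed = [], []
    for msg in events:
        if msg.get("utterance_end"):
            if buffer:
                flushed.append(" ".join(buffer))
                buffer.clear()
        elif msg.get("is_final"):
            buffer.append(msg["text"])
    return flushed

# Interim results (no is_final) must not leak into the output, two
# utterances must flush separately, and a duplicate end signal must
# not produce an empty flush.
events = [
    {"text": "I need to", "is_final": False},   # interim: ignored
    {"text": "I need to book", "is_final": True},
    {"utterance_end": True},
    {"text": "for Tuesday", "is_final": True},
    {"utterance_end": True},
    {"utterance_end": True},                    # duplicate: no empty flush
]
assert run_events(events) == ["I need to book", "for Tuesday"]
print("accumulation tests passed")
```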
Measure latency in production. Add timestamps at each stage: audio received, transcript received, utterance end detected, AI processing started, response sent. You can't optimise what you can't measure.
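A minimal sketch of stage timing using monotonic timestamps (the stage names are illustrative):

```python
import time

class StageTimer:
    """Records a monotonic timestamp per pipeline stage and reports the gaps."""

    def __init__(self):
        self.marks = []

    def mark(self, stage: str):
        self.marks.append((stage, time.monotonic()))

    def report(self):
        # Gap between each consecutive pair of stages, in milliseconds.
        return {
            f"{a} -> {b}": round((tb - ta) * 1000, 1)
            for (a, ta), (b, tb) in zip(self.marks, self.marks[1:])
        }

timer = StageTimer()
timer.mark("audio_received")
timer.mark("transcript_received")
timer.mark("utterance_end_detected")
timer.mark("response_sent")
print(timer.report())
```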
Test with spelled content explicitly. Create test cases for postcodes, reference numbers, and email addresses. These are common in business applications and frequently expose timing issues.
The Debugging Checklist
When you encounter unexplained latency in a voice application, work through this checklist:
- Verify event routing. Log every event type you receive. Are utterance end signals arriving as separate events or within the transcript stream?
- Check your flags. Are you correctly distinguishing between final segments, speech boundaries, and utterance end? Each has a different meaning.
- Examine your timeouts. What's your fallback timeout value? Is it appropriate for the types of content your users speak?
- Measure at each stage. Add timing logs between audio receipt, transcript receipt, utterance detection, and response generation. The bottleneck is often not where you expect.
- Test with real speech patterns. Synthetic test audio doesn't capture hesitations, corrections, and natural pauses. Use recorded conversations.
Lessons Learned
Building real-time voice applications exposes assumptions that text-based systems never challenge. Users don't type in clean, complete sentences. They speak in fragments, pause to think, spell things out, and correct themselves mid-sentence.
The documentation for speech-to-text APIs often describes the happy path. Real integration requires understanding the edge cases: how events are routed, what each flag actually means, and how to handle the messy reality of human speech.
Our latency issue came down to a single misunderstanding about event routing. The fix was minimal. Finding it took considerably longer.
If you're building voice applications and hitting similar issues, start by questioning your assumptions about how events flow through your system. The answer is often hiding in plain sight, just not where the documentation told you to look.
Building voice-enabled workflows?
SwiftCase combines workflow automation with AI-powered voice capabilities, helping operations teams handle routine calls while maintaining full audit trails and process compliance.
