Your AI calls a customer. Someone picks up. But who?
A human says "Hello?" and waits for a response. A voicemail system plays a greeting and waits for a beep. An IVR announces menu options. Each requires a completely different interaction strategy.
Getting this wrong wastes time and creates awkward experiences. Your AI launches into a conversation with a voicemail greeting. Or it sits silently while a human waits for someone to speak. Or it tries to talk over an IVR menu.
Two complementary technologies solve this problem: Answering Machine Detection (AMD) identifies what answered the phone, while Voice Activity Detection (VAD) determines when someone is speaking in real time.
Answering Machine Detection (AMD)
AMD analyses the audio characteristics of whoever answers a call to classify them as human or machine. Telephony providers like Twilio run this detection automatically when you enable it on outbound calls.
The classification categories are more nuanced than just "human" or "machine":
- human: A person answered and is waiting to speak
- machine_start: A machine answered but the greeting is still playing
- machine_end_beep: A machine answered and the beep just occurred
- machine_end_silence: A machine answered and went silent (no beep)
- fax: A fax machine answered
- unknown: Detection was inconclusive
The distinction between machine_start and machine_end_beep matters. If you want to leave a voicemail message, you need to wait for the beep. Detecting machine_start tells you a voicemail answered, but you shouldn't start talking yet.
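The classification-to-action mapping can be sketched as a small dispatch function. The category names match the list above (Twilio reports them in the AnsweredBy field); the action labels are illustrative, not a real API:

```python
def amd_next_action(answered_by: str) -> str:
    """Map an AMD classification to the agent's next step."""
    if answered_by == "human":
        return "converse"                # person waiting to speak
    if answered_by == "machine_start":
        return "wait_for_beep"           # greeting still playing
    if answered_by in ("machine_end_beep", "machine_end_silence"):
        return "leave_message"           # safe to speak now
    if answered_by == "fax":
        return "hangup"                  # nothing useful to say
    return "converse"                    # unknown: assume human
```

Treating "unknown" as human is the safer default: a silent AI in front of a real person is a worse failure than talking briefly to a machine.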
Synchronous vs Asynchronous AMD
AMD can run in two modes, each with tradeoffs.
Synchronous AMD delays the call connection until detection completes. The caller hears ringing while analysis happens. When the call connects, you already know what answered. The downside is added latency, typically 3-5 seconds, which some humans notice as an awkward pause after they say "hello".
Asynchronous AMD connects the call immediately while detection runs in the background. Results arrive via webhook once analysis completes. This eliminates the awkward pause but means your AI starts talking before knowing what answered. If a voicemail is detected mid-conversation, you need to handle the transition gracefully.
Most production systems use asynchronous AMD to prioritise natural human interactions, then handle machine detection as a background event.
Handling AMD Results
When an answering machine is detected, you have three options:
Hang up. If the call isn't worth leaving a message, end it cleanly. This is appropriate for time-sensitive calls where voicemail follow-up isn't valuable.
Leave a message. Interrupt any in-progress speech, wait for the beep if needed, deliver a pre-configured voicemail message, then hang up. The message can include dynamic placeholders populated from call metadata.
Continue. Let the AI keep talking to the voicemail as if it were a human. This sometimes makes sense for simple informational calls where the full message is short enough to fit in a voicemail greeting.
The choice depends on the call's purpose. A payment reminder might benefit from leaving a message. A complex support callback probably shouldn't.
The Voicemail Message Flow
Leaving a voicemail requires careful sequencing:
1. Detect answering machine - AMD reports machine_start or machine_end_*
2. Interrupt current speech - Stop any TTS in progress, clear audio buffers
3. Wait for beep (if needed) - If machine_start was detected, wait for machine_end_beep
4. Deliver message - Generate and play the voicemail audio
5. Wait for playback - Ensure the full message is heard before hanging up
6. End call - Cleanly terminate the connection
The interruption step is critical. If your AI was mid-sentence when AMD completed, that partial speech shouldn't continue playing over your voicemail message. A coordinated interrupt cancels in-flight TTS requests, clears audio buffers, and sends a clear command to the telephony provider.
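The full sequence can be expressed as one coroutine. Here `call` is a hypothetical wrapper around a telephony session; its method names are assumptions for illustration, not a real provider API:

```python
import asyncio

async def leave_voicemail(call, message_audio: bytes, amd_state: str):
    """Deliver a voicemail in the order described above.
    `call` is a hypothetical telephony-session wrapper."""
    await call.cancel_tts()         # steps 1-2: interrupt in-flight
    await call.clear_audio()        # speech and flush buffered audio
    if amd_state == "machine_start":
        await call.wait_for_beep()  # step 3: block until the beep
    await call.play(message_audio)  # step 4: deliver the message
    await call.wait_for_playback()  # step 5: ensure it was fully heard
    await call.hangup()             # step 6: clean termination
```

Making the sequence a single coroutine guarantees the ordering: the message can never start playing before the interrupt and beep-wait have completed.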
Voice Activity Detection (VAD)
While AMD determines what answered the phone, VAD determines when that entity is speaking. VAD operates continuously throughout the call, analysing audio energy levels to detect speech in real time.
VAD serves several purposes in voice AI:
- Barge-in detection: Recognising when a caller starts speaking while the AI is talking
- End-of-utterance detection: Determining when a caller has finished speaking
- Audio quality monitoring: Tracking signal-to-noise ratio and call quality
- Pre-filtering for speech recognition: Avoiding sending silence to the transcription service
Cloud speech-to-text services like Deepgram provide their own utterance detection, but local VAD offers faster response times. A local energy check can detect speech start within 40-60 milliseconds, while waiting for cloud confirmation takes 200-500 milliseconds.
Energy-Based Speech Detection
The simplest VAD approach uses audio energy thresholds. Speech produces higher energy levels than silence or background noise.
Audio energy is typically measured in dBFS (decibels relative to full scale). 0 dBFS is maximum amplitude; -96 dBFS is effectively silence for 16-bit audio. Normal speech falls somewhere between -35 dBFS and -15 dBFS, depending on the caller's volume and microphone distance.
A basic VAD checks whether the current audio frame exceeds a speech threshold:
- Below -50 dBFS: Silence
- Between -50 and -35 dBFS: Possible background noise
- Above -35 dBFS: Likely speech
- Above -25 dBFS: Definite speech (potential interrupt)
These thresholds vary by environment. A quiet office has a lower noise floor than a busy street. Adaptive thresholds that adjust based on observed noise levels improve accuracy.
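A minimal implementation of this check computes RMS energy per frame and maps it to the bands above. This is a sketch using the illustrative thresholds from the list, assuming 16-bit PCM samples:

```python
import math

def frame_dbfs(samples: list[int], full_scale: int = 32768) -> float:
    """RMS energy of a 16-bit PCM frame, in dBFS."""
    if not samples:
        return -96.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < 1:
        return -96.0  # effectively silence for 16-bit audio
    return 20 * math.log10(rms / full_scale)

def classify_frame(dbfs: float) -> str:
    """Map an energy reading to the threshold bands above."""
    if dbfs >= -25:
        return "definite_speech"   # potential interrupt
    if dbfs >= -35:
        return "likely_speech"
    if dbfs >= -50:
        return "background_noise"
    return "silence"
```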
State Machine for Speech Detection
Raw energy readings are noisy. A single high-energy frame doesn't necessarily mean speech started, and a single low-energy frame doesn't mean speech ended. A state machine with confirmation requirements reduces false positives.
The approach tracks consecutive frames above or below thresholds:
To confirm speech start:
- Require N consecutive frames above the speech threshold (typically 3 frames, about 60ms)
- Only then transition to "speaking" state and emit a speech-start event
To confirm speech end:
- Require M consecutive frames below the threshold (typically 10 frames, about 200ms)
- Also apply a "hangover" period after the last high-energy frame
- Only then transition to "silent" state and emit a speech-end event
The hangover period prevents premature end-of-speech detection during natural pauses within an utterance. People don't speak in continuous streams; they pause for breath, emphasis, and thought.
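The confirmation logic above fits naturally into a small state machine. This sketch assumes one energy reading per 20 ms frame; the thresholds and frame counts are the illustrative defaults from the text, and the end-frame requirement doubles as the hangover window:

```python
class SpeechStateMachine:
    """Debounced speech start/end detection over per-frame dBFS readings."""

    def __init__(self, threshold=-35.0, start_frames=3, end_frames=10):
        self.threshold = threshold
        self.start_frames = start_frames   # ~60 ms to confirm start
        self.end_frames = end_frames       # ~200 ms hangover before end
        self.speaking = False
        self._above = 0
        self._below = 0

    def update(self, dbfs):
        """Feed one frame's energy; return 'start', 'end', or None."""
        if dbfs >= self.threshold:
            self._above += 1
            self._below = 0                # any speech resets the hangover
            if not self.speaking and self._above >= self.start_frames:
                self.speaking = True
                return "start"
        else:
            self._below += 1
            self._above = 0
            if self.speaking and self._below >= self.end_frames:
                self.speaking = False
                return "end"
        return None
```

Because `_below` resets to zero on every high-energy frame, a brief pause for breath never reaches the end-frame count, which is exactly the hangover behaviour described above.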
Barge-In Detection
Barge-in occurs when a caller starts speaking while the AI is still talking. Handling this well is crucial for natural conversation flow.
The challenge is distinguishing caller speech from echo of the AI's own voice. When your AI speaks, some of that audio comes back through the caller's microphone. Without echo cancellation, this could trigger false barge-in detection.
Several strategies help:
Higher interrupt threshold. Require stronger energy levels to trigger barge-in than to detect normal speech. If the speech threshold is -35 dBFS, the interrupt threshold might be -25 dBFS.
Confirmation frames. Require multiple consecutive high-energy frames before triggering an interrupt. This filters out brief noise spikes.
Cooldown period. After triggering a barge-in, ignore subsequent high-energy frames for a short period (500ms). This prevents rapid-fire interrupts from sustained loud audio.
Context awareness. Only check for barge-in while the AI is actually speaking. During caller turns, high energy is expected and shouldn't trigger interrupts.
When barge-in is detected, the system should:
- Stop TTS generation immediately
- Clear any buffered audio not yet sent
- Send a clear command to the telephony provider
- Reset the speech recognition context for the new utterance
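The four strategies combine into a per-frame check. This sketch uses the illustrative values from the text (-25 dBFS interrupt threshold, 3 confirmation frames, 500 ms cooldown at an assumed 20 ms frame size):

```python
class BargeInDetector:
    """Interrupt detection that only fires while the AI is speaking."""

    def __init__(self, interrupt_db=-25.0, confirm_frames=3,
                 cooldown_frames=25):      # 25 frames ~ 500 ms
        self.interrupt_db = interrupt_db
        self.confirm_frames = confirm_frames
        self.cooldown_frames = cooldown_frames
        self._high = 0
        self._cooldown = 0

    def update(self, dbfs, ai_speaking):
        """Return True when a barge-in should fire for this frame."""
        if self._cooldown > 0:
            self._cooldown -= 1            # ignore frames during cooldown
            return False
        if not ai_speaking:
            self._high = 0                 # context awareness: caller turns
            return False                   # are expected to be loud
        if dbfs >= self.interrupt_db:
            self._high += 1                # confirmation frames filter
            if self._high >= self.confirm_frames:   # out noise spikes
                self._high = 0
                self._cooldown = self.cooldown_frames
                return True
        else:
            self._high = 0
        return False
```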
AMD and VAD Working Together
AMD and VAD complement each other in the call lifecycle:
Call answered → AMD determines if human, machine, or IVR answered
If human:
- VAD monitors for speech start/end
- Barge-in detection enables natural interruption
- Normal conversation proceeds
If answering machine:
- AMD action executes (hangup, leave_message, or continue)
- If leaving message, VAD monitors for beep completion
- After message delivery, call ends
If IVR:
- AMD often classifies IVRs as machine_start initially
- AI switches to menu navigation mode
- VAD helps detect menu option announcements
- DTMF tones navigate the menu
The classification isn't always clean. Some voicemail greetings sound like humans. Some humans answer with long pauses that look like machines. AMD confidence levels help, but edge cases require graceful fallback behaviour.
Adaptive Thresholds
Fixed energy thresholds work poorly across different call environments. A threshold tuned for quiet offices fails on mobile calls from noisy locations.
Adaptive thresholding adjusts detection parameters based on observed audio characteristics:
Noise floor estimation. Track the 10th percentile of energy levels over a rolling window. This represents the background noise level when no one is speaking.
Dynamic speech threshold. Set the speech threshold 15-20 dB above the estimated noise floor. As the noise floor rises, the threshold rises with it.
SNR monitoring. Track the ratio between speech energy (90th percentile) and noise floor. If SNR drops below 10 dB, speech detection becomes unreliable. The system can warn about poor call quality or adjust its behaviour accordingly.
Threshold adjustment should be gradual. Sudden changes cause erratic detection behaviour. A smoothing factor that limits how quickly thresholds can move prevents overreaction to transient noise.
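These three pieces can be combined into one tracker. Percentiles and the 15-20 dB margin follow the figures above; the window size and smoothing factor are illustrative choices:

```python
from collections import deque

class AdaptiveThreshold:
    """Rolling noise-floor estimate with smoothed threshold updates."""

    def __init__(self, window=250, margin_db=18.0, smoothing=0.1):
        self.history = deque(maxlen=window)  # ~5 s at 20 ms frames
        self.margin_db = margin_db           # threshold sits 15-20 dB
        self.smoothing = smoothing           # above the noise floor
        self.threshold = -35.0               # starting default

    def update(self, dbfs):
        """Feed one frame's energy; return the smoothed threshold."""
        self.history.append(dbfs)
        ordered = sorted(self.history)
        noise_floor = ordered[len(ordered) // 10]       # 10th percentile
        target = noise_floor + self.margin_db
        # move gradually toward the target to avoid erratic jumps
        self.threshold += self.smoothing * (target - self.threshold)
        return self.threshold

    def snr_db(self):
        """Speech (90th pct) minus noise floor (10th pct), in dB."""
        ordered = sorted(self.history)
        return ordered[(len(ordered) * 9) // 10] - ordered[len(ordered) // 10]
```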
Audio Quality Metrics
Beyond speech detection, continuous audio analysis provides valuable call quality metrics:
| Metric | Description | Healthy Range |
|---|---|---|
| Average dBFS | Mean energy level | -30 to -20 dBFS |
| Peak dBFS | Maximum amplitude | Below -3 dBFS |
| Speech percentage | Frames containing speech | 20-60% |
| Silence percentage | Frames with minimal energy | 30-70% |
| Clipping percentage | Frames near max amplitude | Below 1% |
| Estimated SNR | Signal-to-noise ratio | Above 15 dB |
High clipping percentages indicate distortion. Low SNR suggests the caller is in a noisy environment. Excessive silence might mean the caller is confused or the connection has issues.
These metrics help diagnose call quality problems and can trigger adaptive behaviours like speaking more slowly, repeating key information, or suggesting the caller move to a quieter location.
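Computing the table's metrics from a call's per-frame dBFS readings might look like this. The band boundaries are the illustrative thresholds used throughout, and the near-full-scale cutoff for clipping is an assumption:

```python
def audio_metrics(frames: list[float]) -> dict:
    """Summarise per-frame dBFS readings into call quality metrics."""
    n = len(frames)
    speech = sum(1 for d in frames if d >= -35)    # likely-speech band
    silence = sum(1 for d in frames if d < -50)    # below noise band
    clipping = sum(1 for d in frames if d >= -1)   # near full scale
    ordered = sorted(frames)
    return {
        "avg_dbfs": sum(frames) / n,
        "peak_dbfs": max(frames),
        "speech_pct": 100 * speech / n,
        "silence_pct": 100 * silence / n,
        "clipping_pct": 100 * clipping / n,
        # 90th-percentile energy minus 10th-percentile noise floor
        "est_snr_db": ordered[(n * 9) // 10] - ordered[n // 10],
    }
```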
Configuration Per Agent
Different use cases warrant different detection parameters. A customer service agent handling inbound calls has different needs than an outbound appointment reminder.
Configurable options include:
AMD settings:
- Enabled/disabled for the agent
- Action to take (continue, hangup, leave_message)
- Voicemail message template with placeholder support
VAD settings:
- Speech threshold (dBFS)
- Interrupt threshold for barge-in
- Minimum speech frames for confirmation
- Minimum silence frames for end detection
- Hangover duration
An outbound collections agent might use aggressive AMD (hang up on machines) and sensitive barge-in detection (let customers interrupt). An inbound support agent doesn't need AMD at all but benefits from careful end-of-utterance detection to avoid cutting off customers mid-sentence.
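A per-agent configuration object might collect these options as follows. The field names, defaults, and the two example profiles are illustrative, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class AgentAudioConfig:
    """Per-agent AMD and VAD settings (illustrative field names)."""
    amd_enabled: bool = True
    amd_action: str = "leave_message"     # continue | hangup | leave_message
    voicemail_template: str = "Hello {name}, calling about {topic}."
    speech_threshold_db: float = -35.0
    interrupt_threshold_db: float = -25.0
    min_speech_frames: int = 3
    min_silence_frames: int = 10
    hangover_ms: int = 200

# Outbound collections: drop machines, let customers interrupt easily
collections = AgentAudioConfig(amd_action="hangup",
                               interrupt_threshold_db=-30.0)

# Inbound support: no AMD, longer silence window before end-of-utterance
support = AgentAudioConfig(amd_enabled=False,
                           min_silence_frames=15)
```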
Practical Considerations
AMD adds latency. Even asynchronous AMD takes time to classify. The AI might already be several sentences into a conversation before learning it's talking to a voicemail. Design your opening to work for both scenarios.
False positives happen. Some humans answer with long pauses or scripted greetings that sound machine-like. Your voicemail message should make sense even if delivered to a confused human.
Network conditions vary. Mobile calls have more packet loss and jitter than landlines. Audio energy calculations on degraded audio are less reliable. Build in tolerance for noisy conditions.
Echo cancellation helps. If your telephony setup includes echo cancellation, barge-in detection becomes more reliable. Without it, you're fighting the AI's own voice coming back through the caller's microphone.
Test with real phones. Simulated audio doesn't capture the full variability of real telephone networks. Test with actual mobile and landline calls in various acoustic environments.
Building voice AI that handles real-world calls?
SwiftCase integrates intelligent call handling with workflow automation. Our voice platform includes AMD, VAD, and adaptive audio processing so your AI agents respond appropriately whether they reach a human, voicemail, or IVR.
