Your AI calls a customer. Someone picks up. But who?
A human says "Hello?" and waits for a response. A voicemail system plays a greeting and waits for a beep. An IVR announces menu options. Each requires a completely different interaction strategy.
Getting this wrong wastes time and creates awkward experiences. Your AI launches into a conversation with a voicemail greeting. Or it sits silently while a human waits for someone to speak. Or it tries to talk over an IVR menu.
Two complementary technologies solve this problem: Answering Machine Detection (AMD) identifies what answered the phone, while Voice Activity Detection (VAD) determines when someone is speaking in real time.
Answering Machine Detection (AMD)
AMD analyses the audio characteristics of whoever answers a call to classify them as human or machine. Telephony providers like Twilio run this detection automatically when you enable it on outbound calls.
The classification categories are more nuanced than just "human" or "machine":
- human: A person answered and is waiting to speak
- machine_start: A machine answered but the greeting is still playing
- machine_end_beep: A machine answered and the beep just occurred
- machine_end_silence: A machine answered and went silent (no beep)
- fax: A fax machine answered
- unknown: Detection was inconclusive
The distinction between machine_start and machine_end_beep matters. If you want to leave a voicemail message, you need to wait for the beep. Detecting machine_start tells you a voicemail answered, but you shouldn't start talking yet.
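The classification-to-action mapping can be sketched as a small dispatch function. The category names match the list above (Twilio reports them in the AnsweredBy field); the action labels are illustrative, not a real API:

```python
def amd_next_action(answered_by: str) -> str:
    """Map an AMD classification to the agent's next step."""
    if answered_by == "human":
        return "converse"                # person waiting to speak
    if answered_by == "machine_start":
        return "wait_for_beep"           # greeting still playing
    if answered_by in ("machine_end_beep", "machine_end_silence"):
        return "leave_message"           # safe to speak now
    if answered_by == "fax":
        return "hangup"                  # nothing useful to say
    return "converse"                    # unknown: assume human
```

Treating "unknown" as human is the safer default: a silent AI in front of a real person is a worse failure than talking briefly to a machine.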
Synchronous vs Asynchronous AMD
AMD can run in two modes, each with tradeoffs.
Synchronous AMD delays the call connection until detection completes. The caller hears ringing while analysis happens. When the call connects, you already know what answered. The downside is added latency, typically 3-5 seconds, which some humans notice as an awkward pause after they say "hello".
Asynchronous AMD connects the call immediately while detection runs in the background. Results arrive via webhook once analysis completes. This eliminates the awkward pause but means your AI starts talking before knowing what answered. If a voicemail is detected mid-conversation, you need to handle the transition gracefully.
Most production systems use asynchronous AMD to prioritise natural human interactions, then handle machine detection as a background event.
Handling AMD Results
When an answering machine is detected, you have three options:
Hang up. If the call isn't worth leaving a message, end it cleanly. This is appropriate for time-sensitive calls where voicemail follow-up isn't valuable.
Leave a message. Interrupt any in-progress speech, wait for the beep if needed, deliver a pre-configured voicemail message, then hang up. The message can include dynamic placeholders populated from call metadata.
Continue. Let the AI keep talking to the voicemail as if it were a human. This sometimes makes sense for simple informational calls where the full message is short enough to fit in a voicemail greeting.
The choice depends on the call's purpose. A payment reminder might benefit from leaving a message. A complex support callback probably shouldn't.
The Voicemail Message Flow
Leaving a voicemail requires careful sequencing:
1. Detect answering machine - AMD reports machine_start or machine_end_*
2. Interrupt current speech - Stop any TTS in progress, clear audio buffers
3. Wait for beep (if needed) - If machine_start was detected, wait for machine_end_beep
4. Deliver message - Generate and play the voicemail audio
5. Wait for playback - Ensure the full message is heard before hanging up
6. End call - Cleanly terminate the connection
The interruption step is critical. If your AI was mid-sentence when AMD completed, that partial speech shouldn't continue playing over your voicemail message. A coordinated interrupt cancels in-flight TTS requests, clears audio buffers, and sends a clear command to the telephony provider.
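The full sequence can be expressed as one coroutine. Here `call` is a hypothetical wrapper around a telephony session; its method names are assumptions for illustration, not a real provider API:

```python
import asyncio

async def leave_voicemail(call, message_audio: bytes, amd_state: str):
    """Deliver a voicemail in the order described above.
    `call` is a hypothetical telephony-session wrapper."""
    await call.cancel_tts()         # steps 1-2: interrupt in-flight
    await call.clear_audio()        # speech and flush buffered audio
    if amd_state == "machine_start":
        await call.wait_for_beep()  # step 3: block until the beep
    await call.play(message_audio)  # step 4: deliver the message
    await call.wait_for_playback()  # step 5: ensure it was fully heard
    await call.hangup()             # step 6: clean termination
```

Making the sequence a single coroutine guarantees the ordering: the message can never start playing before the interrupt and beep-wait have completed.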
Voice Activity Detection (VAD)
While AMD determines what answered the phone, VAD determines when that entity is speaking. VAD operates continuously throughout the call, analysing audio energy levels to detect speech in real time.
VAD serves several purposes in voice AI:
- Barge-in detection: Recognising when a caller starts speaking while the AI is talking
- End-of-utterance detection: Determining when a caller has finished speaking
- Audio quality monitoring: Tracking signal-to-noise ratio and call quality
- Pre-filtering for speech recognition: Avoiding sending silence to the transcription service
Cloud speech-to-text services like Deepgram provide their own utterance detection, but local VAD offers faster response times. A local energy check can detect speech start within 40-60 milliseconds, while waiting for cloud confirmation takes 200-500 milliseconds.
Energy-Based Speech Detection
The simplest VAD approach uses audio energy thresholds. Speech produces higher energy levels than silence or background noise.
Audio energy is typically measured in dBFS (decibels relative to full scale). 0 dBFS is maximum amplitude; -96 dBFS is effectively silence for 16-bit audio. Normal speech falls somewhere between -35 dBFS and -15 dBFS, depending on the caller's volume and microphone distance.
A basic VAD checks whether the current audio frame exceeds a speech threshold:
- Below -50 dBFS: Silence
- Between -50 and -35 dBFS: Possible background noise
- Above -35 dBFS: Likely speech
- Above -25 dBFS: Definite speech (potential interrupt)
These thresholds vary by environment. A quiet office has a lower noise floor than a busy street. Adaptive thresholds that adjust based on observed noise levels improve accuracy.
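A minimal implementation of this check computes RMS energy per frame and maps it to the bands above. This is a sketch using the illustrative thresholds from the list, assuming 16-bit PCM samples:

```python
import math

def frame_dbfs(samples: list[int], full_scale: int = 32768) -> float:
    """RMS energy of a 16-bit PCM frame, in dBFS."""
    if not samples:
        return -96.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < 1:
        return -96.0  # effectively silence for 16-bit audio
    return 20 * math.log10(rms / full_scale)

def classify_frame(dbfs: float) -> str:
    """Map an energy reading to the threshold bands above."""
    if dbfs >= -25:
        return "definite_speech"   # potential interrupt
    if dbfs >= -35:
        return "likely_speech"
    if dbfs >= -50:
        return "background_noise"
    return "silence"
```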
State Machine for Speech Detection
Raw energy readings are noisy. A single high-energy frame doesn't necessarily mean speech started, and a single low-energy frame doesn't mean speech ended. A state machine with confirmation requirements reduces false positives.
The approach tracks consecutive frames above or below thresholds:
To confirm speech start:
- Require N consecutive frames above the speech threshold (typically 3 frames, about 60ms)
- Only then transition to "speaking" state and emit a speech-start event
To confirm speech end:
- Require M consecutive frames below the threshold (typically 10 frames, about 200ms)
- Also apply a "hangover" period after the last high-energy frame
- Only then transition to "silent" state and emit a speech-end event
The hangover period prevents premature end-of-speech detection during natural pauses within an utterance. People don't speak in continuous streams; they pause for breath, emphasis, and thought.
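The confirmation logic above fits naturally into a small state machine. This sketch assumes one energy reading per 20 ms frame; the thresholds and frame counts are the illustrative defaults from the text, and the end-frame requirement doubles as the hangover window:

```python
class SpeechStateMachine:
    """Debounced speech start/end detection over per-frame dBFS readings."""

    def __init__(self, threshold=-35.0, start_frames=3, end_frames=10):
        self.threshold = threshold
        self.start_frames = start_frames   # ~60 ms to confirm start
        self.end_frames = end_frames       # ~200 ms hangover before end
        self.speaking = False
        self._above = 0
        self._below = 0

    def update(self, dbfs):
        """Feed one frame's energy; return 'start', 'end', or None."""
        if dbfs >= self.threshold:
            self._above += 1
            self._below = 0                # any speech resets the hangover
            if not self.speaking and self._above >= self.start_frames:
                self.speaking = True
                return "start"
        else:
            self._below += 1
            self._above = 0
            if self.speaking and self._below >= self.end_frames:
                self.speaking = False
                return "end"
        return None
```

Because `_below` resets to zero on every high-energy frame, a brief pause for breath never reaches the end-frame count, which is exactly the hangover behaviour described above.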
Barge-In Detection
Barge-in occurs when a caller starts speaking while the AI is still talking. Handling this well is crucial for natural conversation flow.
The challenge is distinguishing caller speech from echo of the AI's own voice. When your AI speaks, some of that audio comes back through the caller's microphone. Without echo cancellation, this could trigger false barge-in detection.
Several strategies help:
Higher interrupt threshold. Require stronger energy levels to trigger barge-in than to detect normal speech. If the speech threshold is -35 dBFS, the interrupt threshold might be -25 dBFS.
Confirmation frames. Require multiple consecutive high-energy frames before triggering an interrupt. This filters out brief noise spikes.
Cooldown period. After triggering a barge-in, ignore subsequent high-energy frames for a short period (500ms). This prevents rapid-fire interrupts from sustained loud audio.
Context awareness. Only check for barge-in while the AI is actually speaking. During caller turns, high energy is expected and shouldn't trigger interrupts.
When barge-in is detected, the system should:
- Stop TTS generation immediately
- Clear any buffered audio not yet sent
- Send a clear command to the telephony provider
- Reset the speech recognition context for the new utterance
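The four strategies combine into a per-frame check. This sketch uses the illustrative values from the text (-25 dBFS interrupt threshold, 3 confirmation frames, 500 ms cooldown at an assumed 20 ms frame size):

```python
class BargeInDetector:
    """Interrupt detection that only fires while the AI is speaking."""

    def __init__(self, interrupt_db=-25.0, confirm_frames=3,
                 cooldown_frames=25):      # 25 frames ~ 500 ms
        self.interrupt_db = interrupt_db
        self.confirm_frames = confirm_frames
        self.cooldown_frames = cooldown_frames
        self._high = 0
        self._cooldown = 0

    def update(self, dbfs, ai_speaking):
        """Return True when a barge-in should fire for this frame."""
        if self._cooldown > 0:
            self._cooldown -= 1            # ignore frames during cooldown
            return False
        if not ai_speaking:
            self._high = 0                 # context awareness: caller turns
            return False                   # are expected to be loud
        if dbfs >= self.interrupt_db:
            self._high += 1                # confirmation frames filter
            if self._high >= self.confirm_frames:   # out noise spikes
                self._high = 0
                self._cooldown = self.cooldown_frames
                return True
        else:
            self._high = 0
        return False
```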
AMD and VAD Working Together
AMD and VAD complement each other in the call lifecycle:
Call answered → AMD determines if human, machine, or IVR answered
If human:
- VAD monitors for speech start/end
- Barge-in detection enables natural interruption
- Normal conversation proceeds
If answering machine:
- AMD action executes (hangup, leave_message, or continue)
- If leaving message, VAD monitors for beep completion
- After message delivery, call ends
If IVR:
- AMD often classifies IVRs as machine_start initially
- AI switches to menu navigation mode
- VAD helps detect menu option announcements
- DTMF tones navigate the menu
The classification isn't always clean. Some voicemail greetings sound like humans. Some humans answer with long pauses that look like machines. AMD confidence levels help, but edge cases require graceful fallback behaviour.
Adaptive Thresholds
Fixed energy thresholds work poorly across different call environments. A threshold tuned for quiet offices fails on mobile calls from noisy locations.
Adaptive thresholding adjusts detection parameters based on observed audio characteristics:
Noise floor estimation. Track the 10th percentile of energy levels over a rolling window. This represents the background noise level when no one is speaking.
Dynamic speech threshold. Set the speech threshold 15-20 dB above the estimated noise floor. As the noise floor rises, the threshold rises with it.
SNR monitoring. Track the ratio between speech energy (90th percentile) and noise floor. If SNR drops below 10 dB, speech detection becomes unreliable. The system can warn about poor call quality or adjust its behaviour accordingly.
Threshold adjustment should be gradual. Sudden changes cause erratic detection behaviour. A smoothing factor that limits how quickly thresholds can move prevents overreaction to transient noise.
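These three pieces can be combined into one tracker. Percentiles and the 15-20 dB margin follow the figures above; the window size and smoothing factor are illustrative choices:

```python
from collections import deque

class AdaptiveThreshold:
    """Rolling noise-floor estimate with smoothed threshold updates."""

    def __init__(self, window=250, margin_db=18.0, smoothing=0.1):
        self.history = deque(maxlen=window)  # ~5 s at 20 ms frames
        self.margin_db = margin_db           # threshold sits 15-20 dB
        self.smoothing = smoothing           # above the noise floor
        self.threshold = -35.0               # starting default

    def update(self, dbfs):
        """Feed one frame's energy; return the smoothed threshold."""
        self.history.append(dbfs)
        ordered = sorted(self.history)
        noise_floor = ordered[len(ordered) // 10]       # 10th percentile
        target = noise_floor + self.margin_db
        # move gradually toward the target to avoid erratic jumps
        self.threshold += self.smoothing * (target - self.threshold)
        return self.threshold

    def snr_db(self):
        """Speech (90th pct) minus noise floor (10th pct), in dB."""
        ordered = sorted(self.history)
        return ordered[(len(ordered) * 9) // 10] - ordered[len(ordered) // 10]
```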
Audio Quality Metrics
Beyond speech detection, continuous audio analysis provides valuable call quality metrics:
| Metric | Description | Healthy Range |
|---|---|---|
| Average dBFS | Mean energy level | -30 to -20 dBFS |
| Peak dBFS | Maximum amplitude | Below -3 dBFS |
| Speech percentage | Frames containing speech | 20-60% |
| Silence percentage | Frames with minimal energy | 30-70% |
| Clipping percentage | Frames near max amplitude | Below 1% |
| Estimated SNR | Signal-to-noise ratio | Above 15 dB |
High clipping percentages indicate distortion. Low SNR suggests the caller is in a noisy environment. Excessive silence might mean the caller is confused or the connection has issues.
These metrics help diagnose call quality problems and can trigger adaptive behaviours like speaking more slowly, repeating key information, or suggesting the caller move to a quieter location.
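Computing the table's metrics from a call's per-frame dBFS readings might look like this. The band boundaries are the illustrative thresholds used throughout, and the near-full-scale cutoff for clipping is an assumption:

```python
def audio_metrics(frames: list[float]) -> dict:
    """Summarise per-frame dBFS readings into call quality metrics."""
    n = len(frames)
    speech = sum(1 for d in frames if d >= -35)    # likely-speech band
    silence = sum(1 for d in frames if d < -50)    # below noise band
    clipping = sum(1 for d in frames if d >= -1)   # near full scale
    ordered = sorted(frames)
    return {
        "avg_dbfs": sum(frames) / n,
        "peak_dbfs": max(frames),
        "speech_pct": 100 * speech / n,
        "silence_pct": 100 * silence / n,
        "clipping_pct": 100 * clipping / n,
        # 90th-percentile energy minus 10th-percentile noise floor
        "est_snr_db": ordered[(n * 9) // 10] - ordered[n // 10],
    }
```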
Configuration Per Agent
Different use cases warrant different detection parameters. A customer service agent handling inbound calls has different needs than an outbound appointment reminder.
Configurable options include:
AMD settings:
- Enabled/disabled for the agent
- Action to take (continue, hangup, leave_message)
- Voicemail message template with placeholder support
VAD settings:
- Speech threshold (dBFS)
- Interrupt threshold for barge-in
- Minimum speech frames for confirmation
- Minimum silence frames for end detection
- Hangover duration
An outbound collections agent might use aggressive AMD (hang up on machines) and sensitive barge-in detection (let customers interrupt). An inbound support agent doesn't need AMD at all but benefits from careful end-of-utterance detection to avoid cutting off customers mid-sentence.
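A per-agent configuration object might collect these options as follows. The field names, defaults, and the two example profiles are illustrative, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class AgentAudioConfig:
    """Per-agent AMD and VAD settings (illustrative field names)."""
    amd_enabled: bool = True
    amd_action: str = "leave_message"     # continue | hangup | leave_message
    voicemail_template: str = "Hello {name}, calling about {topic}."
    speech_threshold_db: float = -35.0
    interrupt_threshold_db: float = -25.0
    min_speech_frames: int = 3
    min_silence_frames: int = 10
    hangover_ms: int = 200

# Outbound collections: drop machines, let customers interrupt easily
collections = AgentAudioConfig(amd_action="hangup",
                               interrupt_threshold_db=-30.0)

# Inbound support: no AMD, longer silence window before end-of-utterance
support = AgentAudioConfig(amd_enabled=False,
                           min_silence_frames=15)
```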
Practical Considerations
AMD adds latency. Even asynchronous AMD takes time to classify. The AI might already be several sentences into a conversation before learning it's talking to a voicemail. Design your opening to work for both scenarios.
False positives happen. Some humans answer with long pauses or scripted greetings that sound machine-like. Your voicemail message should make sense even if delivered to a confused human.
Network conditions vary. Mobile calls have more packet loss and jitter than landlines. Audio energy calculations on degraded audio are less reliable. Build in tolerance for noisy conditions.
Echo cancellation helps. If your telephony setup includes echo cancellation, barge-in detection becomes more reliable. Without it, you're fighting the AI's own voice coming back through the caller's microphone.
Test with real phones. Simulated audio doesn't capture the full variability of real telephone networks. Test with actual mobile and landline calls in various acoustic environments.
Building voice AI that handles real-world calls?
SwiftCase integrates intelligent call handling with workflow automation. Our voice platform includes AMD, VAD, and adaptive audio processing so your AI agents respond appropriately whether they reach a human, voicemail, or IVR.
