Engineering

Detecting Humans vs Machines in Voice AI: AMD and VAD Explained

How voice AI systems detect whether a human, voicemail, or IVR answered the phone, and how to handle each scenario appropriately.

SwiftCase Engineering
January 21, 2026
8 min read
Contents
  • Answering Machine Detection (AMD)
  • Synchronous vs Asynchronous AMD
  • Handling AMD Results
  • The Voicemail Message Flow
  • Voice Activity Detection (VAD)
  • Energy-Based Speech Detection
  • State Machine for Speech Detection
  • Barge-In Detection
  • AMD and VAD Working Together
  • Adaptive Thresholds
  • Audio Quality Metrics
  • Configuration Per Agent
  • Practical Considerations
  • Building voice AI that handles real-world calls?

Your AI calls a customer. Someone picks up. But who?

A human says "Hello?" and waits for a response. A voicemail system plays a greeting and waits for a beep. An IVR announces menu options. Each requires a completely different interaction strategy.

Getting this wrong wastes time and creates awkward experiences. Your AI launches into a conversation with a voicemail greeting. Or it sits silently while a human waits for someone to speak. Or it tries to talk over an IVR menu.

Two complementary technologies solve this problem: Answering Machine Detection (AMD) identifies what answered the phone, while Voice Activity Detection (VAD) determines when someone is speaking in real-time.

Answering Machine Detection (AMD)

AMD analyses the audio characteristics of whoever answers a call to classify them as human or machine. Telephony providers like Twilio run this detection automatically when you enable it on outbound calls.

The classification categories are more nuanced than just "human" or "machine":

  • human: A person answered and is waiting to speak
  • machine_start: A machine answered but the greeting is still playing
  • machine_end_beep: A machine answered and the beep just occurred
  • machine_end_silence: A machine answered and went silent (no beep)
  • fax: A fax machine answered
  • unknown: Detection was inconclusive

The distinction between machine_start and machine_end_beep matters. If you want to leave a voicemail message, you need to wait for the beep. Detecting machine_start tells you a voicemail answered, but you shouldn't start talking yet.
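A minimal routing sketch over these categories (the category strings follow the list above; the action names are hypothetical, not a provider API):

```python
def route_amd_result(answered_by: str) -> str:
    """Decide the next action based on an AMD classification."""
    if answered_by == "human":
        return "start_conversation"
    if answered_by == "machine_start":
        # Voicemail greeting still playing: hold the message until the beep.
        return "wait_for_beep"
    if answered_by in ("machine_end_beep", "machine_end_silence"):
        # Greeting finished: safe to deliver the voicemail message now.
        return "leave_message"
    if answered_by == "fax":
        return "hang_up"
    # "unknown" or anything unexpected: fall back to treating it as human.
    return "start_conversation"
```

Note the deliberate fallback: an inconclusive classification is treated as a human, because staying silent on a live caller is worse than talking briefly at a machine.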

Synchronous vs Asynchronous AMD

AMD can run in two modes, each with tradeoffs.

Synchronous AMD delays the call connection until detection completes. The caller hears ringing while analysis happens. When the call connects, you already know what answered. The downside is added latency, typically 3-5 seconds, which some humans notice as an awkward pause after they say "hello".

Asynchronous AMD connects the call immediately while detection runs in the background. Results arrive via webhook once analysis completes. This eliminates the awkward pause but means your AI starts talking before knowing what answered. If a voicemail is detected mid-conversation, you need to handle the transition gracefully.

Most production systems use asynchronous AMD to prioritise natural human interactions, then handle machine detection as a background event.

Handling AMD Results

When an answering machine is detected, you have three options:

Hang up. If the call isn't worth leaving a message, end it cleanly. This is appropriate for time-sensitive calls where voicemail follow-up isn't valuable.

Leave a message. Interrupt any in-progress speech, wait for the beep if needed, deliver a pre-configured voicemail message, then hang up. The message can include dynamic placeholders populated from call metadata.

Continue. Let the AI keep talking to the voicemail as if it were a human. This sometimes makes sense for simple informational calls where the full message is short enough to be captured by the voicemail recording.

The choice depends on the call's purpose. A payment reminder might benefit from leaving a message. A complex support callback probably shouldn't.

The Voicemail Message Flow

Leaving a voicemail requires careful sequencing:

  1. Detect answering machine - AMD reports machine_start or machine_end_*
  2. Interrupt current speech - Stop any TTS in progress, clear audio buffers
  3. Wait for beep (if needed) - If machine_start was detected, wait for machine_end_beep
  4. Deliver message - Generate and play the voicemail audio
  5. Wait for playback - Ensure the full message is heard before hanging up
  6. End call - Cleanly terminate the connection

The interruption step is critical. If your AI was mid-sentence when AMD completed, that partial speech shouldn't continue playing over your voicemail message. A coordinated interrupt cancels in-flight TTS requests, clears audio buffers, and sends a clear command to the telephony provider.
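The six steps can be sketched as one async sequence. The `session` object and its methods (`stop_tts`, `clear_audio`, `wait_for_beep`, `play`, `hang_up`) are hypothetical stand-ins for your TTS and telephony integrations, not a real API:

```python
import asyncio

async def leave_voicemail(session, amd_result: str, message_audio: bytes):
    """Deliver a voicemail message in the order described above."""
    # 2. Interrupt: cancel in-flight TTS and flush buffered audio so
    #    partial speech doesn't play over the voicemail message.
    await session.stop_tts()
    await session.clear_audio()

    # 3. If the greeting is still playing, wait for the beep event.
    if amd_result == "machine_start":
        await session.wait_for_beep()

    # 4-5. Play the message and wait until playback completes.
    await session.play(message_audio)

    # 6. Terminate the connection cleanly.
    await session.hang_up()
```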

Voice Activity Detection (VAD)

While AMD determines what answered the phone, VAD determines when that entity is speaking. VAD operates continuously throughout the call, analysing audio energy levels to detect speech in real-time.

VAD serves several purposes in voice AI:

  • Barge-in detection: Recognising when a caller starts speaking while the AI is talking
  • End-of-utterance detection: Determining when a caller has finished speaking
  • Audio quality monitoring: Tracking signal-to-noise ratio and call quality
  • Pre-filtering for speech recognition: Avoiding sending silence to the transcription service

Cloud speech-to-text services like Deepgram provide their own utterance detection, but local VAD offers faster response times. A local energy check can detect speech start within 40-60 milliseconds, while waiting for cloud confirmation takes 200-500 milliseconds.

Energy-Based Speech Detection

The simplest VAD approach uses audio energy thresholds. Speech produces higher energy levels than silence or background noise.

Audio energy is typically measured in dBFS (decibels relative to full scale). 0 dBFS is maximum amplitude; -96 dBFS is effectively silence for 16-bit audio. Normal speech falls somewhere between -35 dBFS and -15 dBFS, depending on the caller's volume and microphone distance.
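Computing dBFS from raw samples is a one-liner over the frame's RMS energy. A minimal sketch for signed 16-bit PCM:

```python
import math

def frame_dbfs(samples: list) -> float:
    """RMS energy of a frame of signed 16-bit PCM samples, in dBFS.

    0 dBFS corresponds to a full-scale signal; a full-scale sine wave
    measures around -3 dBFS. Returns -96.0 (the 16-bit floor) for an
    empty or all-zero frame.
    """
    if not samples:
        return -96.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < 1.0:
        return -96.0  # effectively silence for 16-bit audio
    return 20.0 * math.log10(rms / 32768.0)
```

At 8 kHz telephony sample rates, a 20 ms frame is 160 samples, which is a typical unit for these calculations.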

A basic VAD checks whether the current audio frame exceeds a speech threshold:

  • Below -50 dBFS: Silence
  • Between -50 and -35 dBFS: Possible background noise
  • Above -35 dBFS: Likely speech
  • Above -25 dBFS: Definite speech (potential interrupt)

These thresholds vary by environment. A quiet office has a lower noise floor than a busy street. Adaptive thresholds that adjust based on observed noise levels improve accuracy.
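A basic classifier over those buckets might look like this; the default thresholds mirror the illustrative values above and would be tuned (or adapted) per environment:

```python
def classify_frame(dbfs: float,
                   silence_floor: float = -50.0,
                   speech_threshold: float = -35.0,
                   interrupt_threshold: float = -25.0) -> str:
    """Bucket a single frame's energy using dBFS thresholds."""
    if dbfs >= interrupt_threshold:
        return "interrupt"   # definite speech, strong enough to barge in
    if dbfs >= speech_threshold:
        return "speech"      # likely speech
    if dbfs >= silence_floor:
        return "noise"       # possible background noise
    return "silence"
```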

State Machine for Speech Detection

Raw energy readings are noisy. A single high-energy frame doesn't necessarily mean speech started, and a single low-energy frame doesn't mean speech ended. A state machine with confirmation requirements reduces false positives.

The approach tracks consecutive frames above or below thresholds:

To confirm speech start:

  • Require N consecutive frames above the speech threshold (typically 3 frames, about 60ms)
  • Only then transition to "speaking" state and emit a speech-start event

To confirm speech end:

  • Require M consecutive frames below the threshold (typically 10 frames, about 200ms)
  • Also apply a "hangover" period after the last high-energy frame
  • Only then transition to "silent" state and emit a speech-end event

The hangover period prevents premature end-of-speech detection during natural pauses within an utterance. People don't speak in continuous streams; they pause for breath, emphasis, and thought.
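The confirmation logic above can be sketched as a small state machine. Frame-count defaults assume 20 ms frames: 3 frames (about 60ms) to confirm start, 10 frames (about 200ms) to confirm end; any loud frame resets the end counter, which acts as the hangover:

```python
class VadStateMachine:
    """Debounced speech start/end detection over per-frame energy."""

    def __init__(self, speech_threshold=-35.0,
                 start_frames=3, end_frames=10):
        self.speech_threshold = speech_threshold
        self.start_frames = start_frames
        self.end_frames = end_frames
        self.speaking = False
        self._above = 0
        self._below = 0

    def process(self, dbfs):
        """Feed one frame; return 'speech_start', 'speech_end', or None."""
        if dbfs >= self.speech_threshold:
            self._above += 1
            self._below = 0  # hangover: loud frames reset the end counter
            if not self.speaking and self._above >= self.start_frames:
                self.speaking = True
                return "speech_start"
        else:
            self._below += 1
            self._above = 0
            if self.speaking and self._below >= self.end_frames:
                self.speaking = False
                return "speech_end"
        return None
```

With these defaults, a brief pause for breath (under 200ms of quiet frames) never emits a spurious speech-end event.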

Barge-In Detection

Barge-in occurs when a caller starts speaking while the AI is still talking. Handling this well is crucial for natural conversation flow.

The challenge is distinguishing caller speech from echo of the AI's own voice. When your AI speaks, some of that audio comes back through the caller's microphone. Without echo cancellation, this could trigger false barge-in detection.

Several strategies help:

Higher interrupt threshold. Require stronger energy levels to trigger barge-in than to detect normal speech. If the speech threshold is -35 dBFS, the interrupt threshold might be -25 dBFS.

Confirmation frames. Require multiple consecutive high-energy frames before triggering an interrupt. This filters out brief noise spikes.

Cooldown period. After triggering a barge-in, ignore subsequent high-energy frames for a short period (500ms). This prevents rapid-fire interrupts from sustained loud audio.

Context awareness. Only check for barge-in while the AI is actually speaking. During caller turns, high energy is expected and shouldn't trigger interrupts.

When barge-in is detected, the system should:

  1. Stop TTS generation immediately
  2. Clear any buffered audio not yet sent
  3. Send a clear command to the telephony provider
  4. Reset the speech recognition context for the new utterance
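The detection side of this, combining the raised threshold, confirmation frames, cooldown, and context awareness described above, might be sketched as (frame counts assume 20 ms frames, so 25 frames is roughly the 500ms cooldown):

```python
class BargeInDetector:
    """Barge-in detection during AI speech."""

    def __init__(self, interrupt_threshold=-25.0,
                 confirm_frames=3, cooldown_frames=25):
        self.interrupt_threshold = interrupt_threshold
        self.confirm_frames = confirm_frames
        self.cooldown_frames = cooldown_frames
        self._loud = 0
        self._cooldown = 0

    def process(self, dbfs, ai_speaking):
        """Feed one frame; return True when a barge-in should fire."""
        if self._cooldown > 0:
            self._cooldown -= 1   # cooldown: ignore sustained loud audio
            return False
        if not ai_speaking:
            # Context awareness: only watch for barge-in during AI turns.
            self._loud = 0
            return False
        if dbfs >= self.interrupt_threshold:
            self._loud += 1       # confirmation frames filter noise spikes
            if self._loud >= self.confirm_frames:
                self._loud = 0
                self._cooldown = self.cooldown_frames
                return True
        else:
            self._loud = 0
        return False
```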

AMD and VAD Working Together

AMD and VAD complement each other in the call lifecycle:

Call answered → AMD determines if human, machine, or IVR answered

If human:

  • VAD monitors for speech start/end
  • Barge-in detection enables natural interruption
  • Normal conversation proceeds

If answering machine:

  • AMD action executes (hangup, leave_message, or continue)
  • If leaving message, VAD monitors for beep completion
  • After message delivery, call ends

If IVR:

  • AMD often classifies IVRs as machine_start initially
  • AI switches to menu navigation mode
  • VAD helps detect menu option announcements
  • DTMF tones navigate the menu

The classification isn't always clean. Some voicemail greetings sound like humans. Some humans answer with long pauses that look like machines. AMD confidence levels help, but edge cases require graceful fallback behaviour.

Adaptive Thresholds

Fixed energy thresholds work poorly across different call environments. A threshold tuned for quiet offices fails on mobile calls from noisy locations.

Adaptive thresholding adjusts detection parameters based on observed audio characteristics:

Noise floor estimation. Track the 10th percentile of energy levels over a rolling window. This represents the background noise level when no one is speaking.

Dynamic speech threshold. Set the speech threshold 15-20 dB above the estimated noise floor. As the noise floor rises, the threshold rises with it.

SNR monitoring. Track the ratio between speech energy (90th percentile) and noise floor. If SNR drops below 10 dB, speech detection becomes unreliable. The system can warn about poor call quality or adjust its behaviour accordingly.

Threshold adjustment should be gradual. Sudden changes cause erratic detection behaviour. A smoothing factor that limits how quickly thresholds can move prevents overreaction to transient noise.
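Putting those three pieces together, a minimal sketch of adaptive thresholding (the 10th-percentile floor, +15 dB margin, and smoothing factor are the illustrative values from the text, not tuned constants):

```python
from collections import deque

class AdaptiveThreshold:
    """Noise-floor tracking with a smoothed dynamic speech threshold."""

    def __init__(self, window=250, margin_db=15.0, smoothing=0.05):
        self.energies = deque(maxlen=window)  # rolling energy history
        self.margin_db = margin_db
        self.smoothing = smoothing
        self.speech_threshold = -35.0         # starting default

    def update(self, dbfs):
        """Feed one frame's energy; return the current speech threshold."""
        self.energies.append(dbfs)
        ordered = sorted(self.energies)
        noise_floor = ordered[len(ordered) // 10]  # ~10th percentile
        target = noise_floor + self.margin_db
        # Move gradually toward the target to avoid erratic jumps.
        self.speech_threshold += self.smoothing * (target - self.speech_threshold)
        return self.speech_threshold
```

The smoothing factor caps how far the threshold can move per frame, so a passing lorry shifts it slightly rather than instantly.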

Audio Quality Metrics

Beyond speech detection, continuous audio analysis provides valuable call quality metrics:

| Metric              | Description                 | Healthy Range   |
|---------------------|-----------------------------|-----------------|
| Average dBFS        | Mean energy level           | -30 to -20 dBFS |
| Peak dBFS           | Maximum amplitude           | Below -3 dBFS   |
| Speech percentage   | Frames containing speech    | 20-60%          |
| Silence percentage  | Frames with minimal energy  | 30-70%          |
| Clipping percentage | Frames near max amplitude   | Below 1%        |
| Estimated SNR       | Signal-to-noise ratio       | Above 15 dB     |

High clipping percentages indicate distortion. Low SNR suggests the caller is in a noisy environment. Excessive silence might mean the caller is confused or the connection has issues.

These metrics help diagnose call quality problems and can trigger adaptive behaviours like speaking more slowly, repeating key information, or suggesting the caller move to a quieter location.
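A sketch of computing these metrics from a call's per-frame dBFS readings; the bucket thresholds (-35 dBFS for speech, -50 for silence, -1 for clipping) are illustrative assumptions:

```python
def call_quality_metrics(frame_energies):
    """Summarise per-frame dBFS readings into call quality metrics."""
    n = len(frame_energies)
    if n == 0:
        return {}
    ordered = sorted(frame_energies)
    speech = sum(1 for e in frame_energies if e >= -35.0)
    silence = sum(1 for e in frame_energies if e < -50.0)
    clipped = sum(1 for e in frame_energies if e >= -1.0)
    noise_floor = ordered[n // 10]          # ~10th percentile
    speech_level = ordered[(9 * n) // 10]   # ~90th percentile
    return {
        "average_dbfs": sum(frame_energies) / n,
        "peak_dbfs": ordered[-1],
        "speech_pct": 100.0 * speech / n,
        "silence_pct": 100.0 * silence / n,
        "clipping_pct": 100.0 * clipped / n,
        "estimated_snr_db": speech_level - noise_floor,
    }
```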

Configuration Per Agent

Different use cases warrant different detection parameters. A customer service agent handling inbound calls has different needs than an outbound appointment reminder.

Configurable options include:

AMD settings:

  • Enabled/disabled for the agent
  • Action to take (continue, hangup, leave_message)
  • Voicemail message template with placeholder support

VAD settings:

  • Speech threshold (dBFS)
  • Interrupt threshold for barge-in
  • Minimum speech frames for confirmation
  • Minimum silence frames for end detection
  • Hangover duration

An outbound collections agent might use aggressive AMD (hang up on machines) and sensitive barge-in detection (let customers interrupt). An inbound support agent doesn't need AMD at all but benefits from careful end-of-utterance detection to avoid cutting off customers mid-sentence.
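These options might be grouped into a per-agent config like the following; the field names, defaults, and example values are hypothetical, not a real SwiftCase schema:

```python
from dataclasses import dataclass, field

@dataclass
class AmdConfig:
    enabled: bool = True
    action: str = "leave_message"  # "continue" | "hangup" | "leave_message"
    voicemail_template: str = "Hi {name}, we called about your account."

@dataclass
class VadConfig:
    speech_threshold_dbfs: float = -35.0
    interrupt_threshold_dbfs: float = -25.0
    min_speech_frames: int = 3
    min_silence_frames: int = 10
    hangover_ms: int = 200

@dataclass
class AgentDetectionConfig:
    amd: AmdConfig = field(default_factory=AmdConfig)
    vad: VadConfig = field(default_factory=VadConfig)

# Outbound collections: aggressive AMD, sensitive barge-in
collections = AgentDetectionConfig(
    amd=AmdConfig(action="hangup"),
    vad=VadConfig(interrupt_threshold_dbfs=-30.0),
)

# Inbound support: no AMD, patient end-of-utterance detection
support = AgentDetectionConfig(
    amd=AmdConfig(enabled=False),
    vad=VadConfig(min_silence_frames=15),
)
```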

Practical Considerations

AMD adds latency. Even asynchronous AMD takes time to classify. The AI might already be several sentences into a conversation before learning it's talking to a voicemail. Design your opening to work for both scenarios.

False positives happen. Some humans answer with long pauses or scripted greetings that sound machine-like. Your voicemail message should make sense even if delivered to a confused human.

Network conditions vary. Mobile calls have more packet loss and jitter than landlines. Audio energy calculations on degraded audio are less reliable. Build in tolerance for noisy conditions.

Echo cancellation helps. If your telephony setup includes echo cancellation, barge-in detection becomes more reliable. Without it, you're fighting the AI's own voice coming back through the caller's microphone.

Test with real phones. Simulated audio doesn't capture the full variability of real telephone networks. Test with actual mobile and landline calls in various acoustic environments.


Building voice AI that handles real-world calls?

SwiftCase integrates intelligent call handling with workflow automation. Our voice platform includes AMD, VAD, and adaptive audio processing so your AI agents respond appropriately whether they reach a human, voicemail, or IVR.

Book a demo | Explore the platform | View pricing
