Engineering

AI Navigating IVR Menus: How Voice Agents Automate Phone System Interactions

How voice AI can autonomously navigate phone menus using DTMF tones while preserving conversation state across connection interruptions.

SwiftCase Engineering
January 17, 2026
8 min read
Contents
  • The DTMF Challenge
  • Preserving State Across Reconnections
  • The Reconnection Flow
  • Giving the AI the Right Tool
  • Silent Mode During Navigation
  • Handling Complex Menu Trees
  • Validation and Error Handling
  • Detecting Humans vs Machines
  • The Complete Flow
  • Practical Considerations
  • When to Use IVR Navigation
  • Building voice agents for your workflows?

Your AI agent calls a supplier's support line. An automated voice answers: "Press 1 for billing, press 2 for technical support, press 3 for order status."

What happens next determines whether voice AI is genuinely useful or just a novelty.

A naive implementation plays the menu audio to your team, who then press the buttons themselves. A sophisticated implementation has the AI listen to the menu, decide which option to select, press the digit, navigate through subsequent menus, and eventually reach a human representative, at which point it explains why it's calling.

This is IVR navigation, and it's one of the more challenging capabilities to implement in voice AI systems. The technical hurdles aren't obvious until you start building.

The DTMF Challenge

DTMF (Dual-Tone Multi-Frequency) is the technical name for the tones produced when you press phone buttons. Each digit generates a specific combination of two audio frequencies that phone systems recognise.

Sending DTMF from software is straightforward. Telephony providers like Twilio offer APIs to inject tones into an active call. The challenge is what happens to your voice AI's audio stream while this occurs.

Most telephony platforms require you to interrupt the media stream to send DTMF tones. You can't simultaneously be listening to audio and injecting tones. This means your WebSocket connection to the call, the one carrying real-time audio for speech recognition, must be closed.

After the tones are sent, a new connection opens. But now you have a problem: your AI has amnesia. The conversation history, the context about why you're calling, the metadata about the customer, all of it lived in the previous connection's memory. The new connection starts fresh.
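As a concrete illustration, here is a minimal sketch of the call-update payload this might involve, assuming Twilio-style TwiML where `<Play digits>` injects the tones and `<Connect><Stream>` re-opens the media stream. The stream URL, parameter names, and the `dtmf_twiml` helper are illustrative assumptions, not a fixed contract:

```python
# Sketch (assumes Twilio-style TwiML): build the instructions that play
# DTMF tones on the live call and then reconnect the media stream.
# The custom <Parameter> names here are illustrative choices.

def dtmf_twiml(digits: str, stream_url: str, call_sid: str, agent_id: str) -> str:
    """TwiML that presses the given digits, then re-opens the audio
    stream with parameters marking it as a continuation."""
    return (
        "<Response>"
        f'<Play digits="{digits}"/>'
        "<Connect>"
        f'<Stream url="{stream_url}">'
        f'<Parameter name="callSid" value="{call_sid}"/>'
        '<Parameter name="reconnect" value="true"/>'
        f'<Parameter name="agentId" value="{agent_id}"/>'
        "</Stream>"
        "</Connect>"
        "</Response>"
    )

# Updating the active call with this TwiML is what interrupts the
# current stream, e.g. (not executed here):
# client.calls(call_sid).update(twiml=dtmf_twiml("1w2", url, sid, agent))
```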

Preserving State Across Reconnections

The solution is to persist conversation state externally, then restore it when the connection re-establishes.

The approach involves maintaining a call data store keyed by the telephony provider's call identifier. Before sending DTMF, you snapshot the current conversation state, including the complete message history: system prompt, user utterances, AI responses, and tool call results. This is the context the language model needs to continue coherently.

When the new connection opens after DTMF transmission, you check for existing state associated with that call identifier. If found, the AI service initialises with the restored conversation history rather than starting fresh. The AI picks up exactly where it left off. From its perspective, the connection interruption never happened.

One subtlety: if you're restoring a conversation, you should skip any welcome message that would normally play at the start of a call. The greeting was already delivered before the DTMF interruption.

The Reconnection Flow

Sending DTMF typically means updating the call with new instructions that play the tones and then reconnect the media stream. The reconnection request should include parameters that identify this as a continuation rather than a new call:

  • The call identifier, so the server can look up existing state
  • A flag indicating this is a reconnection after DTMF
  • The agent identifier, so the correct configuration loads

The server-side handler checks these parameters and, when it sees the reconnection flag, retrieves the stored conversation history instead of initialising a blank session.
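That decision can be sketched as follows; the parameter names and the plain-dict store are assumptions for illustration, independent of any particular telephony provider:

```python
# Sketch: server-side stream handler choosing between a fresh session
# and a restored one, based on the reconnection parameters.

def init_session(params: dict, store: dict) -> dict:
    """params: key/value pairs sent with the stream connection.
    store: saved state keyed by call identifier."""
    call_sid = params["callSid"]
    is_reconnect = params.get("reconnect") == "true"
    saved = store.get(call_sid) if is_reconnect else None
    if saved is not None:
        # Continuation: restore history and suppress the welcome message,
        # which was already delivered before the DTMF interruption.
        return {"messages": saved["messages"], "play_welcome": False}
    # Genuinely new call: blank history, normal greeting.
    return {"messages": [], "play_welcome": True}
```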

Giving the AI the Right Tool

The AI needs to know when and how to press phone buttons. This requires a well-designed tool definition with a clear description explaining that the tool exists specifically for automated phone systems. Without this guidance, the AI might try to "speak" the digit aloud instead of sending the tone.

The tool should accept a string of digits (0-9, *, #) and support pause characters for IVR systems that require delays between inputs. Some systems need a brief pause between digits, so sending "1w2w3" would press 1, wait half a second, press 2, wait, then press 3.

Input validation matters. The AI occasionally tries creative inputs like "one" or "billing" instead of "1". Your implementation should catch these before they reach the telephony provider.
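Such a tool definition might look like the following, written in the OpenAI-style function-calling schema; the tool name and description wording are illustrative assumptions:

```python
# Illustrative function-calling tool definition. The description steers
# the model to send tones only to automated systems, never to "speak"
# a digit aloud.

SEND_DTMF_TOOL = {
    "type": "function",
    "function": {
        "name": "send_dtmf",
        "description": (
            "Press buttons on an automated phone menu (IVR). Use ONLY when "
            "an automated system asks you to press a key; never say the "
            "digit aloud. 'w' inserts a half-second pause, e.g. '1w2'."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "digits": {
                    "type": "string",
                    "pattern": "^[0-9*#w]+$",
                    "description": "Digits to press: 0-9, *, #, and w for pauses.",
                }
            },
            "required": ["digits"],
        },
    },
}
```

The `pattern` constraint helps, but server-side validation is still needed, since models do not always respect JSON Schema patterns.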

Silent Mode During Navigation

One subtle issue: your AI's normal conversational behaviour becomes problematic during IVR navigation.

In regular conversation, playing acknowledgement audio ("Okay...", "One moment...") immediately after the user speaks reduces perceived latency while the AI processes. But during IVR navigation, these acknowledgements are disruptive. The AI isn't responding to a human; it's responding to a recorded menu.

The solution is a "silent mode" that suppresses acknowledgements during IVR interactions. This can be enabled through agent configuration for outbound calls where IVR navigation is expected, or the AI can toggle it dynamically when it detects it's speaking to an automated system rather than a human.

Handling Complex Menu Trees

Real-world IVR systems are rarely single-level. You might encounter:

  1. Main menu: "Press 1 for sales, 2 for support"
  2. After pressing 2: "For technical issues press 1, for billing press 2"
  3. After pressing 1: "Please enter your account number followed by hash"
  4. Queue hold music with periodic "Your call is important to us" messages

The AI needs to handle each of these stages. The conversation history preservation becomes critical here because the AI must remember:

  • Why it's calling (the original task)
  • What menu path it has navigated so far
  • What information it needs to provide when it reaches a human

The system prompt should provide clear instructions for IVR navigation. For an outbound support call agent, this might include guidance to listen to all options before selecting, choose the option most relevant to the issue category, provide account details when prompted, and explain the reason for calling when reaching a human representative. The prompt should also instruct the agent to navigate menus silently and only speak when connected to a human.

Placeholders in the prompt can be populated from call metadata, giving the AI full context about its mission: the company being called, the client on whose behalf the call is made, the ticket number, account details, and issue summary.
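Such a template might be sketched as follows; the placeholder names and prompt wording are illustrative assumptions populated from call metadata at dial time:

```python
# Illustrative system-prompt template for an outbound IVR-navigating
# agent. Placeholder names are assumptions, filled from call metadata.

IVR_PROMPT = """You are calling {company} on behalf of {client}.
Ticket: {ticket_number}. Issue: {issue_summary}.

When you hear an automated menu:
- Listen to ALL options before choosing.
- Use the send_dtmf tool to press the most relevant option.
- Provide the account number {account_number} if prompted.
- Stay silent while navigating menus; speak only to a human.
When a human answers, explain the reason for the call."""

def build_prompt(meta: dict) -> str:
    """Fill the template from call metadata."""
    return IVR_PROMPT.format(**meta)
```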

Validation and Error Handling

DTMF digits are constrained to 0-9, *, #, and w (wait). Anything else will fail or produce unexpected behaviour. Your validation should reject invalid inputs before they reach the telephony API.
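A minimal validator for this constraint might look like the following (the function name is an illustrative choice):

```python
import re

# Minimal DTMF input validator: accepts only 0-9, *, #, and w (wait),
# rejecting creative model outputs like "one" or "billing" before they
# reach the telephony API.

VALID_DTMF = re.compile(r"^[0-9*#w]+$")

def validate_digits(digits: str) -> str:
    """Return the digits unchanged if valid, else raise ValueError."""
    if not VALID_DTMF.fullmatch(digits):
        raise ValueError(f"Invalid DTMF string: {digits!r}")
    return digits
```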

Connection failures during DTMF are recoverable because the call itself remains active even if the media stream drops. The telephony provider continues holding the call while you sort out the reconnection. However, if the call ends entirely (the other party hangs up), no amount of state preservation helps.

Detecting Humans vs Machines

Closely related to IVR navigation is Answering Machine Detection (AMD). When your AI makes an outbound call, it needs to know whether a human or voicemail answered.

AMD typically runs asynchronously, with the call connecting immediately while detection happens in the background. Results arrive via webhook and indicate whether the call was answered by a human, a machine (with the greeting still playing, after a beep, or after silence), a fax machine, or if detection was inconclusive.

If the call goes to voicemail, the AI can either hang up, leave a message, or wait for the beep and leave a callback number. If a human answers, normal conversation proceeds. If an IVR answers (which AMD often classifies as a machine initially), the AI switches to menu navigation mode.
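The branching above can be sketched as a routing function over Twilio-style `AnsweredBy` webhook values; the routing labels on the right are our own illustrative choices:

```python
# Sketch: route the call based on the asynchronous AMD result delivered
# via webhook (Twilio-style AnsweredBy values).

def route_amd(answered_by: str) -> str:
    if answered_by == "human":
        return "converse"            # normal conversation proceeds
    if answered_by == "machine_end_beep":
        return "leave_voicemail"     # beep heard: leave a message
    if answered_by == "fax":
        return "hang_up"
    # machine_start, machine_end_silence, machine_end_other, unknown:
    # often an IVR (or inconclusive), so try menu navigation.
    return "navigate_ivr"
```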

The Complete Flow

Putting it all together, here's what happens when your AI calls a support line:

  1. Call initiated: The telephony provider connects the call, WebSocket media stream opens
  2. AMD runs: Detection happens in background, call connects immediately
  3. IVR detected: AI hears menu options, enters silent mode
  4. DTMF sent: AI presses appropriate digit, media stream closes
  5. Reconnection: New WebSocket opens, conversation history restored
  6. Navigation continues: AI handles subsequent menu levels
  7. Human reached: AI exits silent mode, explains reason for calling
  8. Conversation proceeds: Normal voice interaction with support representative
  9. Call ends: AI summarises outcome, updates your systems

Each step has potential failure points. The robustness of your implementation depends on handling all of them gracefully.

Practical Considerations

Timing matters. Some IVR systems have tight windows for input. If your AI takes too long to decide which option to press, the system might repeat the menu or disconnect. Most systems are forgiving, but a few require responses within 3-5 seconds.

Menu detection isn't perfect. The AI interprets audio through speech-to-text, which occasionally mishears menu options. "Press 1 for sales" might transcribe as "Press won for sales". The AI needs to handle these variations.

Some systems require human escalation. Certain IVRs detect automated callers and refuse to proceed. These are rare, but when encountered, your system should gracefully hand off to a human operator.

Recording consent varies by jurisdiction. If you're recording calls for quality or training purposes, be aware that IVR navigation may traverse multiple legal jurisdictions. The call might start in the UK and route to a support centre elsewhere.

When to Use IVR Navigation

Not every voice AI application needs this capability. Inbound call handling, where your AI answers your phones, rarely encounters IVR systems.

IVR navigation shines for outbound use cases:

  • Supplier follow-ups: Calling vendors to check order status
  • Appointment confirmation: Navigating scheduling systems
  • Support escalation: Reaching third-party support on behalf of customers
  • Information gathering: Collecting data from automated information lines

The common thread is that these calls would otherwise consume human time navigating menus and waiting on hold. The AI handles the tedious parts so your team can focus on conversations that require human judgement.


Building voice agents for your workflows?

SwiftCase integrates AI voice capabilities with workflow automation, enabling your systems to make and receive calls without human intervention for routine interactions. Our platform handles the telephony complexity so you can focus on your business processes.

Book a demo | Explore the platform | View pricing
