Your AI agent calls a supplier's support line. An automated voice answers: "Press 1 for billing, press 2 for technical support, press 3 for order status."
What happens next determines whether voice AI is genuinely useful or just a novelty.
A naive implementation plays the menu audio to your team, who then press the buttons themselves. A sophisticated implementation has the AI listen to the menu, decide which option to select, press the digit, navigate through subsequent menus, and eventually reach a human representative, at which point it explains why it's calling.
This is IVR navigation, and it's one of the more challenging capabilities to implement in voice AI systems. The technical hurdles aren't obvious until you start building.
The DTMF Challenge
DTMF (Dual-Tone Multi-Frequency) is the technical name for the tones produced when you press phone buttons. Each digit generates a specific combination of two audio frequencies that phone systems recognise.
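The standard keypad grid can be captured in a few lines. This sketch builds the digit-to-frequency mapping from the four low-group and three high-group frequencies used by the 12-key keypad (the fourth high-group column, for the rarely used A-D keys, is omitted here):

```python
# DTMF maps each key to one low-group and one high-group frequency (Hz).
# Rows share a low frequency; columns share a high frequency.
LOW = [697, 770, 852, 941]
HIGH = [1209, 1336, 1477]
KEYPAD = [
    ["1", "2", "3"],
    ["4", "5", "6"],
    ["7", "8", "9"],
    ["*", "0", "#"],
]

DTMF_FREQS = {
    key: (LOW[r], HIGH[c])
    for r, row in enumerate(KEYPAD)
    for c, key in enumerate(row)
}
```

Pressing "1", for example, plays 697 Hz and 1209 Hz simultaneously, which is why the tones are hard to fake with speech and easy for switches to detect.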
Sending DTMF from software is straightforward. Telephony providers like Twilio offer APIs to inject tones into an active call. The challenge is what happens to your voice AI's audio stream while this occurs.
Most telephony platforms require you to interrupt the media stream to send DTMF tones. You can't simultaneously be listening to audio and injecting tones. This means your WebSocket connection to the call, the one carrying real-time audio for speech recognition, must be closed.
After the tones are sent, a new connection opens. But now you have a problem: your AI has amnesia. The conversation history, the context about why you're calling, the metadata about the customer, all of it lived in the previous connection's memory. The new connection starts fresh.
Preserving State Across Reconnections
The solution is to persist conversation state externally, then restore it when the connection re-establishes.
The approach involves maintaining a call data store keyed by the telephony provider's call identifier. Before sending DTMF, you snapshot the current conversation state, including the complete message history: system prompt, user utterances, AI responses, and tool call results. This is the context the language model needs to continue coherently.
When the new connection opens after DTMF transmission, you check for existing state associated with that call identifier. If found, the AI service initialises with the restored conversation history rather than starting fresh. The AI picks up exactly where it left off. From its perspective, the connection interruption never happened.
One subtlety: if you're restoring a conversation, you should skip any welcome message that would normally play at the start of a call. The greeting was already delivered before the DTMF interruption.
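A minimal sketch of the snapshot-and-restore pattern, assuming an in-memory store keyed by the provider's call identifier (in production this would be Redis or a database; all names here are illustrative):

```python
import json

# In-memory call data store keyed by the telephony provider's call ID.
call_store: dict[str, str] = {}

def snapshot_state(call_sid: str, messages: list) -> None:
    """Persist the full conversation history before sending DTMF."""
    call_store[call_sid] = json.dumps(messages)

def restore_state(call_sid: str):
    """Return saved history on reconnect, or None for a genuinely new call."""
    raw = call_store.pop(call_sid, None)
    return json.loads(raw) if raw is not None else None
```

The `restore_state` check doubles as the "skip the welcome message" signal: a non-None result means the greeting was already delivered before the interruption.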
The Reconnection Flow
Sending DTMF typically means updating the live call with new instructions that play the tones and then reconnect the media stream. The reconnection request should include parameters that identify this as a continuation rather than a new call:
- The call identifier, so the server can look up existing state
- A flag indicating this is a reconnection after DTMF
- The agent identifier, so the correct configuration loads
The server-side handler checks these parameters and, when it sees the reconnection flag, retrieves the stored conversation history instead of initialising a blank session.
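With Twilio, for example, this can be done by updating the live call with TwiML that plays the digits via `<Play digits>` and then reopens the stream via `<Connect><Stream>`. A sketch, with the server URL and agent identifier purely illustrative:

```python
def build_dtmf_twiml(digits: str, call_sid: str, agent_id: str) -> str:
    """TwiML that plays DTMF tones, then reconnects the media stream
    with query parameters marking this as a continuation."""
    stream_url = (
        f"wss://example.com/media?call_sid={call_sid}"
        f"&amp;reconnect=true&amp;agent_id={agent_id}"
    )
    return (
        f'<Response>'
        f'<Play digits="{digits}"/>'
        f'<Connect><Stream url="{stream_url}"/></Connect>'
        f'</Response>'
    )

# Applied to the live call, e.g.:
# client.calls(call_sid).update(twiml=build_dtmf_twiml("1w2", call_sid, "support-agent"))
```

The server-side WebSocket handler then inspects `reconnect` and `call_sid` to decide between restoring state and starting a blank session.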
Giving the AI the Right Tool
The AI needs to know when and how to press phone buttons. This requires a well-designed tool definition with a clear description explaining that the tool exists specifically for automated phone systems. Without this guidance, the AI might try to "speak" the digit aloud instead of sending the tone.
The tool should accept a string of digits (0-9, *, #) and support pause characters for IVR systems that require delays between inputs. Some systems need a brief pause between digits, so sending "1w2w3" would press 1, wait half a second, press 2, wait, then press 3.
Input validation matters. The AI occasionally tries creative inputs like "one" or "billing" instead of "1". Your implementation should catch these before they reach the telephony provider.
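A tool definition along these lines might look as follows, written in the OpenAI function-calling style; the name `send_dtmf` and the exact wording are illustrative, not a fixed API:

```python
# Hypothetical tool definition; the description steers the model away from
# "speaking" the digit and towards sending a tone.
SEND_DTMF_TOOL = {
    "type": "function",
    "function": {
        "name": "send_dtmf",
        "description": (
            "Press buttons on an automated phone menu (IVR). Use this "
            "instead of speaking when a recorded menu asks you to press "
            "a digit. Digits are sent as touch tones, not spoken aloud."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "digits": {
                    "type": "string",
                    "description": (
                        "Digits to press: 0-9, * and #. Use 'w' for a "
                        "half-second pause, e.g. '1w2w3'."
                    ),
                }
            },
            "required": ["digits"],
        },
    },
}
```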
Silent Mode During Navigation
One subtle issue: your AI's normal conversational behaviour becomes problematic during IVR navigation.
In regular conversation, playing acknowledgement audio ("Okay...", "One moment...") immediately after the user speaks reduces perceived latency while the AI processes. But during IVR navigation, these acknowledgements are disruptive. The AI isn't responding to a human; it's responding to a recorded menu.
The solution is a "silent mode" that suppresses acknowledgements during IVR interactions. This can be enabled through agent configuration for outbound calls where IVR navigation is expected, or the AI can toggle it dynamically when it detects it's speaking to an automated system rather than a human.
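The gate itself is simple; a minimal sketch, with the class and method names purely illustrative:

```python
class AcknowledgementGate:
    """Suppresses filler audio ('Okay...', 'One moment...') while the
    agent is navigating an IVR rather than talking to a human."""

    def __init__(self, silent: bool = False) -> None:
        # Set from agent config for outbound IVR calls, or toggled
        # dynamically when the AI detects an automated system.
        self.silent = silent

    def maybe_acknowledge(self, play_audio) -> bool:
        """Play a filler only outside silent mode; return whether it played."""
        if self.silent:
            return False
        play_audio("One moment...")
        return True
```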
Handling Complex Menu Trees
Real-world IVR systems are rarely single-level. You might encounter:
- Main menu: "Press 1 for sales, 2 for support"
- After pressing 2: "For technical issues press 1, for billing press 2"
- After pressing 1: "Please enter your account number followed by hash"
- Queue hold music with periodic "Your call is important to us" messages
The AI needs to handle each of these stages. The conversation history preservation becomes critical here because the AI must remember:
- Why it's calling (the original task)
- What menu path it has navigated so far
- What information it needs to provide when it reaches a human
The system prompt should provide clear instructions for IVR navigation. For an outbound support call agent, this might include guidance to listen to all options before selecting, choose the option most relevant to the issue category, provide account details when prompted, and explain the reason for calling when reaching a human representative. The prompt should also instruct the agent to navigate menus silently and only speak when connected to a human.
Placeholders in the prompt can be populated from call metadata, giving the AI full context about its mission: the company being called, the client on whose behalf the call is made, the ticket number, account details, and issue summary.
Validation and Error Handling
DTMF digits are constrained to 0-9, *, #, and w (wait). Anything else will fail or produce unexpected behaviour. Your validation should reject invalid inputs before they reach the telephony API.
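A validation check for this is a one-line regular expression; a minimal sketch:

```python
import re

# Only 0-9, *, # and 'w' (half-second wait) are valid DTMF input.
VALID_DTMF = re.compile(r"[0-9*#w]+")

def validate_digits(digits: str) -> str:
    """Reject anything outside the DTMF alphabet before it hits the
    telephony API (the model occasionally sends 'one' or 'billing')."""
    if not VALID_DTMF.fullmatch(digits):
        raise ValueError(f"Invalid DTMF string: {digits!r}")
    return digits
```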
Connection failures during DTMF are recoverable because the call itself remains active even if the media stream drops. The telephony provider continues holding the call while you sort out the reconnection. However, if the call ends entirely (the other party hangs up), no amount of state preservation helps.
Detecting Humans vs Machines
Closely related to IVR navigation is Answering Machine Detection (AMD). When your AI makes an outbound call, it needs to know whether a human or voicemail answered.
AMD typically runs asynchronously, with the call connecting immediately while detection happens in the background. Results arrive via webhook and indicate whether the call was answered by a human, a machine (with the greeting still playing, after a beep, or after silence), a fax machine, or if detection was inconclusive.
If the call goes to voicemail, the AI can either hang up, leave a message, or wait for the beep and leave a callback number. If a human answers, normal conversation proceeds. If an IVR answers (which AMD often classifies as a machine initially), the AI switches to menu navigation mode.
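With Twilio's AMD, the webhook delivers an `AnsweredBy` value such as `human`, `machine_start`, `machine_end_beep`, `machine_end_silence`, `fax` or `unknown`. A sketch of mapping those to an agent action (the action names are illustrative):

```python
def handle_amd_result(answered_by: str) -> str:
    """Map an AMD webhook result to the agent's next move."""
    if answered_by == "human":
        return "converse"                 # normal conversation
    if answered_by == "machine_end_beep":
        return "leave_voicemail"          # the beep has already played
    if answered_by.startswith("machine"):
        return "wait_for_beep_or_hangup"  # greeting still playing, or silence
    if answered_by == "fax":
        return "hang_up"
    return "proceed_cautiously"           # 'unknown': treat as possibly human
```

Note that an IVR often comes back as a machine result, so "machine" branches should stay open to switching into menu-navigation mode rather than hanging up immediately.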
The Complete Flow
Putting it all together, here's what happens when your AI calls a support line:
- Call initiated: The telephony provider connects the call, WebSocket media stream opens
- AMD runs: Detection happens in background, call connects immediately
- IVR detected: AI hears menu options, enters silent mode
- DTMF sent: AI presses appropriate digit, media stream closes
- Reconnection: New WebSocket opens, conversation history restored
- Navigation continues: AI handles subsequent menu levels
- Human reached: AI exits silent mode, explains reason for calling
- Conversation proceeds: Normal voice interaction with support representative
- Call ends: AI summarises outcome, updates your systems
Each step has potential failure points. The robustness of your implementation depends on handling all of them gracefully.
Practical Considerations
Timing matters. Some IVR systems have tight windows for input. If your AI takes too long to decide which option to press, the system might repeat the menu or disconnect. Most systems are forgiving, but a few require responses within 3-5 seconds.
Menu detection isn't perfect. The AI interprets audio through speech-to-text, which occasionally mishears menu options. "Press 1 for sales" might transcribe as "Press won for sales". The AI needs to handle these variations.
Some systems require human escalation. Certain IVRs detect automated callers and refuse to proceed. These are rare, but when encountered, your system should gracefully hand off to a human operator.
Recording consent varies by jurisdiction. If you're recording calls for quality or training purposes, be aware that IVR navigation may traverse multiple legal jurisdictions. The call might start in the UK and route to a support centre elsewhere.
When to Use IVR Navigation
Not every voice AI application needs this capability. Inbound call handling, where your AI answers your phones, rarely encounters IVR systems.
IVR navigation shines for outbound use cases:
- Supplier follow-ups: Calling vendors to check order status
- Appointment confirmation: Navigating scheduling systems
- Support escalation: Reaching third-party support on behalf of customers
- Information gathering: Collecting data from automated information lines
The common thread is that these calls would otherwise consume human time navigating menus and waiting on hold. The AI handles the tedious parts so your team can focus on conversations that require human judgement.
Building voice agents for your workflows?
SwiftCase integrates AI voice capabilities with workflow automation, enabling your systems to make and receive calls without human intervention for routine interactions. Our platform handles the telephony complexity so you can focus on your business processes.
