Your AI voice agent reads out a customer's phone number: "Your number is zero seven billion, one hundred twenty-three million..."
The customer hangs up.
Text-to-speech engines are remarkably good at converting written language to natural-sounding audio. But they're trained on prose, not data. Phone numbers, postcodes, reference codes, dates, currency amounts: these are the building blocks of business conversations, and TTS engines routinely mangle them.
Text normalisation solves this by preprocessing text before it reaches the TTS engine, converting data formats into speakable phrases the engine can pronounce correctly.
The Problem with Raw Data
Consider what happens when you send these strings directly to a TTS engine:
| Input | TTS Output (Unprocessed) |
|---|---|
| 07123456789 | "Zero seven billion, one hundred twenty-three million..." |
| SW1A 1AA | "Swuh-wun-ah one double-a" or similar gibberish |
| £1,234.56 | "Pound sign one comma two three four point five six" |
| 25/12/2024 | "Twenty-five slash twelve slash two thousand twenty-four" |
| REF123456 | "Ref one hundred twenty-three thousand..." |
None of these are how a human would read the information aloud. A human says "oh seven one two three, four five six, seven eight nine" for the phone number. They say "S W one A, one A A" for the postcode. The TTS engine doesn't know these conventions unless you teach it.
UK Phone Numbers
Phone numbers should be read as individual digits with natural grouping. UK mobile numbers follow a predictable pattern: five digits, then three, then three.
The normalisation approach strips non-digit characters, then groups the digits according to the phone number type. For mobiles starting with 07, the grouping is 5-3-3. For landlines, it's typically 4-3-4. International formats (+44) are converted to national format first.
Crucially, commas are inserted between groups to create natural pauses. "0 7 1 2 3, 4 5 6, 7 8 9" sounds like a phone number. "0 7 1 2 3 4 5 6 7 8 9" sounds like a robot counting.
| Input | Output |
|---|---|
| 07123456789 | 0 7 1 2 3, 4 5 6, 7 8 9 |
| +447123456789 | 0 7 1 2 3, 4 5 6, 7 8 9 |
| 0121 234 5678 | 0 1 2 1, 2 3 4, 5 6 7 8 |
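A minimal sketch of the grouping logic described above. It assumes 11-digit national numbers and uses only the two groupings mentioned in this section; `normalise_uk_phone` is an illustrative name, and anything the function does not recognise is returned unchanged.

```python
import re


def normalise_uk_phone(raw: str) -> str:
    """Return a digit-by-digit reading with comma pauses between groups."""
    digits = re.sub(r"\D", "", raw)          # strip spaces, '+', hyphens, brackets
    if digits.startswith("44"):              # international prefix -> national format
        digits = "0" + digits[2:]
    if len(digits) != 11 or not digits.startswith("0"):
        return raw                           # unrecognised shape: leave unchanged
    sizes = (5, 3, 3) if digits.startswith("07") else (4, 3, 4)
    groups, pos = [], 0
    for size in sizes:
        groups.append(" ".join(digits[pos:pos + size]))
        pos += size
    return ", ".join(groups)


print(normalise_uk_phone("+447123456789"))   # 0 7 1 2 3, 4 5 6, 7 8 9
```

Real UK area codes vary in length, so a production version would likely need a lookup of area-code lengths rather than a single landline grouping.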
UK Postcodes
Postcodes combine letters and numbers in specific patterns. The outward code (first part) identifies the postal district; the inward code (last three characters) identifies the specific address group.
The normalisation approach spaces out each character individually, with a comma pause between the outward and inward codes. If the postcode arrives without a space (like "SW1A1AA"), the normaliser splits it before the last three characters.
| Input | Output |
|---|---|
| SW1A 1AA | S W 1 A, 1 A A |
| M1 1AA | M 1, 1 A A |
| B338TH | B 3 3, 8 T H |
The comma between outward and inward codes creates the same pause a human uses when dictating a postcode.
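The splitting rule can be sketched as follows. The regular expression is a deliberately loose shape check (an assumption of this example, not the full Royal Mail validation rules), and `normalise_postcode` is an illustrative name.

```python
import re

# Outward code: 1-2 letters, a digit, then an optional letter or digit.
# Inward code: a digit followed by two letters. The space is optional.
_POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$")


def normalise_postcode(raw: str) -> str:
    match = _POSTCODE.match(raw.strip().upper())
    if match is None:
        return raw                      # partial or malformed: leave unchanged
    outward, inward = match.groups()
    return f"{' '.join(outward)}, {' '.join(inward)}"


print(normalise_postcode("B338TH"))     # B 3 3, 8 T H
```

Because the inward code is always three characters, the pattern finds the split point even when the input has no space.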
Currency
Currency requires context-aware formatting. "£1,234.56" should become "1,234 pounds and 56 pence", not "pound sign one comma two three four point five six".
The approach involves:
- Identifying the currency symbol and mapping it to spoken names (pound/pounds, pence for GBP; dollar/dollars, cents for USD)
- Separating whole and decimal parts
- Using singular or plural forms based on the amount
- Handling special cases like amounts under £1 ("99 pence" not "0 pounds and 99 pence")
- Recognising multiplier suffixes like "k", "m", "bn" for thousands, millions, billions
| Input | Output |
|---|---|
| £1,234.56 | 1,234 pounds and 56 pence |
| £0.99 | 99 pence |
| £1.5m | 1.5 million pounds |
| $100 | 100 dollars |
| €1 | 1 euro |
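The steps listed above could be sketched like this. The unit table, the use of "penny" as the singular of "pence", and the choice to leave amounts with more than two decimal places untouched are all assumptions of this example rather than rules from a particular library.

```python
import re

# (singular, plural) names for the major and minor unit of each symbol.
UNITS = {
    "£": (("pound", "pounds"), ("penny", "pence")),
    "$": (("dollar", "dollars"), ("cent", "cents")),
    "€": (("euro", "euros"), ("cent", "cents")),
}
MULTIPLIERS = {"k": "thousand", "m": "million", "bn": "billion"}
_AMOUNT = re.compile(r"^([£$€])(\d[\d,]*)(?:\.(\d+))?(bn|k|m)?$", re.IGNORECASE)


def normalise_currency(raw: str) -> str:
    match = _AMOUNT.match(raw.strip())
    if match is None:
        return raw
    symbol, whole, fraction, suffix = match.groups()
    (major_one, major_many), (minor_one, minor_many) = UNITS[symbol]
    if suffix:
        # "£1.5m" -> "1.5 million pounds": keep the number as written
        number = f"{whole}.{fraction}" if fraction else whole
        return f"{number} {MULTIPLIERS[suffix.lower()]} {major_many}"
    if fraction and len(fraction) > 2:
        return raw                                   # unclear precision: leave unchanged
    whole_value = int(whole.replace(",", ""))
    minor_value = int(fraction.ljust(2, "0")) if fraction else 0   # ".5" means 50
    parts = []
    if whole_value or not minor_value:               # skip "0 pounds" when there are pence
        parts.append(f"{whole} {major_one if whole_value == 1 else major_many}")
    if minor_value:
        parts.append(f"{minor_value} {minor_one if minor_value == 1 else minor_many}")
    return " and ".join(parts)


print(normalise_currency("£0.99"))   # 99 pence
```

The whole part is passed through as digits ("1,234") because most TTS engines read plain cardinal numbers correctly; only the symbol and the decimal point need rewriting.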
Dates and Times
Dates in business contexts usually appear as DD/MM/YYYY (UK format) or YYYY-MM-DD (ISO format). Neither reads naturally without transformation.
The normalisation converts numeric dates to spoken form with ordinal day numbers and full month names: "25/12/2024" becomes "the 25th of December 2024". The ordinal suffix (st, nd, rd, th) is determined by the day number.
Times require similar treatment. The approach converts 24-hour format to 12-hour with am/pm, handles special cases like noon and midnight, and formats minutes naturally ("2 oh 5 pm" for 2:05pm rather than "2 zero 5 pm").
| Input | Output |
|---|---|
| 25/12/2024 | the 25th of December 2024 |
| 2024-01-15 | the 15th of January 2024 |
| 14:30 | 2 30 pm |
| 2:05pm | 2 oh 5 pm |
| 12:00 | noon |
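A sketch of both conversions, assuming the input is a single date or time token. The function names are illustrative, and impossible values (such as 31/02/2024 or 25:00) are returned unchanged rather than guessed at.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
_UK_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")
_ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
_TIME = re.compile(r"^(\d{1,2}):(\d{2})\s*(am|pm)?$", re.IGNORECASE)


def ordinal(day: int) -> str:
    # 11th, 12th and 13th are the exceptions to the st/nd/rd rule
    suffix = "th" if 11 <= day % 100 <= 13 else {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def normalise_date(raw: str) -> str:
    text = raw.strip()
    if (m := _UK_DATE.match(text)):
        day, month, year = map(int, m.groups())
    elif (m := _ISO_DATE.match(text)):
        year, month, day = map(int, m.groups())
    else:
        return raw
    try:
        date(year, month, day)            # rejects impossible dates
    except ValueError:
        return raw
    return f"the {ordinal(day)} of {MONTHS[month - 1]} {year}"


def normalise_time(raw: str) -> str:
    m = _TIME.match(raw.strip())
    if m is None:
        return raw
    hour, minute = int(m.group(1)), int(m.group(2))
    meridiem = (m.group(3) or "").lower()
    if minute > 59 or hour > (12 if meridiem else 23) or (meridiem and hour == 0):
        return raw
    if meridiem:                          # bring 12-hour input onto the 24-hour clock
        hour = hour % 12 + (12 if meridiem == "pm" else 0)
    if minute == 0 and hour in (0, 12):
        return "midnight" if hour == 0 else "noon"
    spoken_hour = hour % 12 or 12
    suffix = "am" if hour < 12 else "pm"
    if minute == 0:
        return f"{spoken_hour} {suffix}"
    spoken_minute = f"oh {minute}" if minute < 10 else str(minute)
    return f"{spoken_hour} {spoken_minute} {suffix}"
```

For example, `normalise_date("25/12/2024")` returns "the 25th of December 2024" and `normalise_time("2:05pm")` returns "2 oh 5 pm".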
Email Addresses
Email addresses contain symbols that TTS engines pronounce literally. "@" becomes "at sign" and "." becomes "dot" or "period" depending on the engine.
The normalisation replaces "@" with " at " and "." with " dot ", producing output that sounds natural when spoken.
| Input | Output |
|---|---|
| john.smith@example.com | john dot smith at example dot com |
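A small sketch of the replacement, applied only inside text that looks like an email address so that ordinary full stops are left alone. The address pattern is a simplified assumption, not a full RFC 5322 matcher.

```python
import re

# Simplified address shape: local part, "@", then a dotted domain.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def _speak_email(match: re.Match) -> str:
    return match.group(0).replace("@", " at ").replace(".", " dot ")


def normalise_emails(text: str) -> str:
    return _EMAIL.sub(_speak_email, text)


print(normalise_emails("Email john.smith@example.com."))
# Email john dot smith at example dot com.
```

The sentence-ending full stop survives because the domain part of the pattern only consumes a dot when more domain characters follow it.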
Car Registrations
UK vehicle registrations follow several formats depending on age. Modern registrations (post-2001) use the pattern AA00 AAA. The letters and numbers should be read separately with a pause between groups.
The approach normalises the registration to uppercase, then groups characters by type (letters vs digits), inserting comma pauses when switching between letter and digit groups.
| Input | Output |
|---|---|
| AB12 CDE | A B, 1 2, C D E |
| A123 BCD | A, 1 2 3, B C D |
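Grouping by character type can be sketched with `itertools.groupby`. The example assumes the input has already been identified as a registration; `normalise_registration` is an illustrative name.

```python
import re
from itertools import groupby


def normalise_registration(raw: str) -> str:
    compact = re.sub(r"\s+", "", raw).upper()
    if not compact.isalnum():
        return raw                                   # unexpected characters: leave unchanged
    # groupby starts a new run each time the text switches between letters and digits
    runs = [" ".join(chars) for _, chars in groupby(compact, key=str.isdigit)]
    return ", ".join(runs)


print(normalise_registration("ab12 cde"))            # A B, 1 2, C D E
```

Because the pauses follow letter/digit boundaries rather than the printed space, the same function covers both table rows above.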
Business Names and Acronyms
Business data contains abbreviations that should be spelled out rather than pronounced as words. "AX Motors" should be "A X Motors", not "Axe Motors". But common abbreviations like "UK" or "TV" should remain as-is because TTS engines handle them correctly.
The approach maintains a list of common words and abbreviations that TTS handles well (UK, US, TV, LTD, PLC, CEO, etc.) and leaves these unchanged. Other uppercase sequences of 2-4 letters are spaced out for spelling.
Additional business-specific patterns include:
- "T/A" becomes "trading as"
- Ampersands become "and"
- Letter-number-letter patterns like "B2B" become "B 2 B"
| Input | Output |
|---|---|
| AX Motors | A X Motors |
| B&M Retail | B and M Retail |
| ABC Ltd T/A XYZ | A B C Ltd trading as X Y Z |
| UK Business | UK Business (unchanged) |
| B2B Services | B 2 B Services |
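One way to sketch these rules is as a short sequence of regular-expression substitutions. The `KEEP` set below contains only the examples named in this section, and the function name is illustrative; a real keep-list would be longer and tuned to the TTS engine in use.

```python
import re

# Abbreviations the TTS engine is assumed to pronounce correctly on its own.
KEEP = {"UK", "US", "TV", "LTD", "PLC", "CEO"}


def _spell_out(match: re.Match) -> str:
    word = match.group(0)
    return word if word in KEEP else " ".join(word)


def normalise_business_name(text: str) -> str:
    text = re.sub(r"\bT/A\b", "trading as", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*&\s*", " and ", text)
    text = re.sub(r"\b([A-Z])(\d)([A-Z])\b", r"\1 \2 \3", text)    # B2B -> B 2 B
    text = re.sub(r"\b[A-Z]{2,4}\b", _spell_out, text)             # AX -> A X
    return text


print(normalise_business_name("ABC Ltd T/A XYZ"))
# A B C Ltd trading as X Y Z
```

"T/A" is handled before the acronym rule so that the expansion is never mistaken for letters to spell out, and mixed-case words such as "Ltd" never match the uppercase-only pattern.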
Stripping Internal Syntax
Sometimes language models output internal syntax as text instead of using proper function calls. Tool call syntax, JSON fragments, or other machine-readable content should never be read aloud.
The normalisation identifies and removes these patterns, leaving only the human-readable content.
| Input | Output |
|---|---|
| Thank you. [action: end] Goodbye. | Thank you. Goodbye. |
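A sketch of the stripping step. Both marker shapes below are hypothetical: the bracketed `[key: value]` form comes from the table above, and the flat JSON object is an assumed second case. The patterns would need adjusting to whatever a given model actually leaks.

```python
import re

# Hypothetical marker shapes, not a complete list.
_BRACKET_DIRECTIVE = re.compile(r"\[\s*\w+\s*:[^\]]*\]")    # e.g. [action: end]
_FLAT_JSON_OBJECT = re.compile(r"\{\s*\"[^{}]*\}")          # e.g. {"tool": "hangup"}


def strip_internal_syntax(text: str) -> str:
    for pattern in (_BRACKET_DIRECTIVE, _FLAT_JSON_OBJECT):
        text = pattern.sub(" ", text)
    # Removing a marker leaves doubled whitespace behind; collapse it
    return re.sub(r"\s{2,}", " ", text).strip()


print(strip_internal_syntax("Thank you. [action: end] Goodbye."))
# Thank you. Goodbye.
```

Requiring a colon inside the brackets keeps ordinary bracketed text, such as "[1]", out of reach of the first pattern.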
Performance Optimisation
Text normalisation runs on every response before TTS generation. For high-volume systems, avoiding unnecessary processing matters.
The optimisation approach uses a quick pre-check to determine whether text contains any patterns that need normalisation. Simple conversational responses like "How can I help you today?" skip the full normalisation pipeline entirely. Only text containing phone numbers, postcodes, currency symbols, dates, or other data patterns gets processed.
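A minimal sketch of such a pre-check, assuming a single compiled pattern that looks for any digit, any of the symbols handled in earlier sections, or a short uppercase run. The exact trigger set is an assumption, and `pipeline` stands in for whatever full normaliser is in use.

```python
import re
from typing import Callable

# Cheap trigger test: one scan of the text, no per-pattern work.
_NEEDS_NORMALISING = re.compile(r"[\d£$€@&\[\]{}/]|\b[A-Z]{2,4}\b")


def needs_normalisation(text: str) -> bool:
    return _NEEDS_NORMALISING.search(text) is not None


def normalise_for_tts(text: str, pipeline: Callable[[str], str]) -> str:
    if not needs_normalisation(text):
        return text                  # plain prose: skip the pipeline entirely
    return pipeline(text)


print(needs_normalisation("How can I help you today?"))     # False
print(needs_normalisation("Your total is £1,234.56"))       # True
```

The pre-check is allowed to produce false positives, since the full pipeline still decides what to change; it should never produce false negatives.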
Pattern Ordering Matters
The order in which you apply transformations affects results. Consider the text "at 14:30 on AB12 CDE".
If you process acronyms before car registrations, part of "AB12 CDE" might get spaced out before the car registration pattern sees it, and the registration no longer matches. If you process car registrations before postcodes, "AB12 CDE" correctly matches as a vehicle registration rather than being misidentified as a malformed postcode.
A robust implementation processes patterns in order of specificity:
- Email addresses (before other dot processing affects them)
- Car registrations (specific alphanumeric patterns)
- Times with "at" prefix (before acronym processing)
- Business name patterns (T/A, ampersands, acronyms)
- Dates (UK and ISO formats)
- Times (12-hour format)
- Currency (with and without multipliers)
- Postcodes
- Phone numbers (most specific patterns first)
- Reference numbers
Each pattern is designed to avoid false positives on the others.
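The effect of ordering can be demonstrated with just two rules. This sketch uses simplified registration and acronym patterns (assumptions for illustration, with no keep-list) and applies them in both orders to the example text.

```python
import re
from itertools import groupby


def _speak_registration(match: re.Match) -> str:
    compact = match.group(0).replace(" ", "")
    return ", ".join(" ".join(run) for _, run in groupby(compact, key=str.isdigit))


def _speak_acronym(match: re.Match) -> str:
    return " ".join(match.group(0))


# Each step is a (pattern, handler) pair; the list order is the processing order.
REGISTRATION = (re.compile(r"\b[A-Z]{2}\d{2} ?[A-Z]{3}\b"), _speak_registration)
ACRONYM = (re.compile(r"\b[A-Z]{2,4}\b"), _speak_acronym)


def apply_in_order(text: str, steps) -> str:
    for pattern, handler in steps:
        text = pattern.sub(handler, text)
    return text


text = "at 14:30 on AB12 CDE"
print(apply_in_order(text, [REGISTRATION, ACRONYM]))   # at 14:30 on A B, 1 2, C D E
print(apply_in_order(text, [ACRONYM, REGISTRATION]))   # at 14:30 on AB12 C D E
```

In the second ordering the acronym rule spaces out "CDE" first, so the registration pattern no longer finds three adjacent letters and the plate is left half-processed.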
Testing Edge Cases
Real-world data produces surprising edge cases:
- "Call me at 2" - Is "2" a time or just the number two?
- "Unit 2A" - Should "2A" be spaced out?
- "SW1A" without the inward code - Partial postcode or abbreviation?
- "REF: TBC" - Reference number or "to be confirmed"?
The solution is specificity. "at 2pm" triggers time formatting; "at 2" alone doesn't. Reference patterns require at least 4 alphanumeric characters after the prefix. Postcodes require both outward and inward codes to match.
When in doubt, leave text unchanged. It's better for TTS to mispronounce an edge case than for normalisation to corrupt valid prose.
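Two of these rules can be sketched as deliberately narrow patterns. The requirement that a reference contain at least one digit is an extra assumption added here so that ordinary words beginning with "REF" are not spelled out; the function names are illustrative.

```python
import re

# A time needs an explicit am/pm marker, so "at 2" on its own never matches.
_TIME_AFTER_AT = re.compile(r"\bat (\d{1,2})\s?(am|pm)\b", re.IGNORECASE)
# A reference needs 4+ alphanumerics after the prefix, at least one a digit.
_REFERENCE = re.compile(r"\bREF[:\s]*((?=[A-Z0-9]*\d)[A-Z0-9]{4,})\b")


def speak_times(text: str) -> str:
    return _TIME_AFTER_AT.sub(
        lambda m: f"at {int(m.group(1))} {m.group(2).lower()}", text
    )


def speak_references(text: str) -> str:
    return _REFERENCE.sub(lambda m: "REF " + " ".join(m.group(1)), text)


print(speak_times("Call me at 2"))            # Call me at 2
print(speak_times("Call me at 2pm"))          # Call me at 2 pm
print(speak_references("REF: TBC"))           # REF: TBC
print(speak_references("Your reference is REF123456"))
# Your reference is REF 1 2 3 4 5 6
```

Each ambiguous input falls through untouched, which matches the "when in doubt, leave text unchanged" rule.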
The Complete Pipeline
Text normalisation slots into the voice AI pipeline between the language model and TTS:
- User speaks → Speech-to-text transcription
- LLM processes → Generates response text
- Normalisation → Converts data formats to speakable phrases
- TTS generates → Converts normalised text to audio
- Audio plays → User hears natural pronunciation
The normalisation step is invisible to both the language model and the user. The LLM can output "Your reference is REF123456" naturally, and the user hears "Your reference is REF 1 2 3 4 5 6" without either party knowing about the transformation.
Regional Considerations
The patterns described here focus on UK conventions: UK phone formats, UK postcodes, UK date ordering (DD/MM/YYYY), pounds sterling. Adapting for other regions requires:
- Different phone number patterns and groupings
- Different postal code formats (US ZIP codes, German PLZ, etc.)
- Different date ordering (MM/DD/YYYY for US)
- Different currency handling
A production system serving multiple markets needs either locale detection or explicit configuration to apply the correct regional patterns.
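Explicit configuration might look like the sketch below: a small table of per-locale rules that the normalisers consult. The `LocaleRules` fields, the locale keys, and the groupings are illustrative assumptions, shown here for slash-separated dates only.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LocaleRules:
    date_order: str                      # "DMY" or "MDY"
    currency_symbol: str
    mobile_grouping: Tuple[int, ...]     # digits per spoken group


# Hypothetical configuration table; extend per market.
LOCALES = {
    "en-GB": LocaleRules("DMY", "£", (5, 3, 3)),
    "en-US": LocaleRules("MDY", "$", (3, 3, 4)),
}

_SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def parse_slash_date(text: str, locale: str) -> Optional[Tuple[int, int, int]]:
    """Return (day, month, year), reading the first two fields per locale."""
    match = _SLASH_DATE.match(text.strip())
    if match is None:
        return None
    first, second, year = map(int, match.groups())
    if LOCALES[locale].date_order == "DMY":
        return first, second, year
    return second, first, year


print(parse_slash_date("03/04/2024", "en-GB"))   # (3, 4, 2024)
print(parse_slash_date("03/04/2024", "en-US"))   # (4, 3, 2024)
```

The same string yields the 3rd of April in one locale and the 4th of March in the other, which is why the locale has to be known before normalisation rather than inferred from the text alone.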
Building voice experiences with business data?
SwiftCase integrates voice AI with workflow automation, handling the complexity of natural speech synthesis so your AI agents can read customer data, reference numbers, and business information clearly. Our platform manages the technical details so you can focus on your processes.
