When we built Switchboard, our AI-powered communication platform, we made two deliberate architectural choices: build custom models for domain-specific tasks, and abstract external providers so no single vendor owns our system. Here's why, and how we implemented it.
The Vendor Lock-In Problem
AI is moving fast. The best model today might not be the best model tomorrow. If your architecture is tightly coupled to one provider:
- You can't easily switch when something better emerges
- Provider outages become your outages
- Pricing changes hold you hostage
- You're stuck with one provider's strengths and weaknesses
We've seen this pattern before. Companies that bet everything on one vendor inevitably regret it when that vendor changes pricing, deprecates APIs, or falls behind competitors.
The Provider Abstraction Layer
Switchboard's AI module abstracts all provider-specific details behind clean interfaces. The core engine doesn't know or care whether it's talking to OpenAI or Anthropic - it just requests completions and gets responses.
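As a rough illustration of what such an interface can look like - this is a minimal sketch, not Switchboard's actual code, and all names are hypothetical - the core engine depends only on an abstract provider:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Completion:
    text: str
    provider: str  # which backend actually served the request

class LLMProvider(ABC):
    """Everything the core engine needs from a language-model provider."""

    @abstractmethod
    def complete(self, messages: list[dict]) -> Completion: ...

    @abstractmethod
    def stream(self, messages: list[dict]) -> Iterator[str]: ...

class AnthropicProvider(LLMProvider):
    """Adapter for one vendor; a real version would call the vendor API."""

    def complete(self, messages):
        return Completion(text="...", provider="anthropic")

    def stream(self, messages):
        yield "..."
```

Application code holds an `LLMProvider` reference and never imports a vendor SDK directly, so swapping vendors is a one-line configuration change.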
Custom Models
For domain-specific tasks, we build and train our own models:
- Intent Classification: Custom models trained on our specific use cases outperform general-purpose LLMs for routing decisions
- Entity Extraction: Purpose-built extractors for the data types our customers care about
- Quality Scoring: Internal models that evaluate conversation quality in real-time
Custom models give us control over latency, cost, and behaviour that external providers can't match for specialised tasks.
External LLM Providers
For general language understanding and generation, we support multiple providers:
- OpenAI: GPT-4 and newer models
- Anthropic: Claude models
Each provider has different strengths. Some excel at reasoning, others at following instructions. Our abstraction lets us route requests to the optimal provider for each task - or switch entirely if one provider experiences issues.
Speech Processing
Voice AI requires more than just LLMs:
- Speech-to-Text (STT): Converting user speech to text
- Text-to-Speech (TTS): Converting responses to natural speech
We use Deepgram for STT (excellent accuracy, low latency) and ElevenLabs for TTS (natural-sounding voices). But these are behind abstractions too - we can swap providers without changing application code.
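The speech abstractions follow the same shape. A sketch using structural typing (again hypothetical names, with the vendor call stubbed out):

```python
from typing import Iterator, Protocol

class SpeechToText(Protocol):
    def transcribe_stream(self, audio_chunks: Iterator[bytes]) -> Iterator[str]: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class DeepgramSTT:
    """Adapter around a streaming STT vendor (stubbed for illustration)."""

    def transcribe_stream(self, audio_chunks):
        for chunk in audio_chunks:
            yield f"<transcript of {len(chunk)} bytes>"

def transcribe(stt: SpeechToText, audio: Iterator[bytes]) -> list[str]:
    # Application code depends only on the protocol, never on Deepgram itself.
    return list(stt.transcribe_stream(audio))
```

A replacement vendor only needs to provide the same `transcribe_stream` method; nothing upstream changes.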
How the Abstraction Works
The architecture follows a consistent pattern:
Application Code → Provider Manager → Provider Implementation → External API
The Provider Manager handles:
- Provider Selection: Choosing which provider to use for a request
- Failover: Switching to backup providers on errors
- Configuration: Provider-specific settings without leaking into app code
- Streaming: Consistent streaming interface regardless of provider
Application code simply requests what it needs:
- "Complete this conversation"
- "Transcribe this audio"
- "Generate speech for this text"
The provider layer handles the rest.
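The selection-plus-failover behaviour described above can be sketched in a few lines (a simplified illustration, not the production implementation):

```python
class ProviderError(Exception):
    pass

class ProviderManager:
    """Tries providers in preference order, failing over on errors."""

    def __init__(self, providers):
        self.providers = providers  # ordered: primary first, backups after

    def complete(self, messages):
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(messages)
            except ProviderError as exc:
                errors.append((type(provider).__name__, exc))
        raise ProviderError(f"all providers failed: {errors}")

# Toy providers to demonstrate failover:
class Flaky:
    def complete(self, messages):
        raise ProviderError("outage")

class Stable:
    def complete(self, messages):
        return "response"
```

With `ProviderManager([Flaky(), Stable()])`, a request transparently falls through to the backup when the primary raises.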
Real-World Benefits
Graceful Degradation
When OpenAI had a major outage last year, services built tightly on their API went down. Our systems degraded gracefully - routing to alternative providers while OpenAI recovered.
Cost Optimisation
Different providers have different pricing models. By abstracting providers, we can route simpler requests to cheaper models and reserve expensive models for complex tasks. This isn't possible when you're locked to one provider.
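In its simplest form, cost-aware routing is just a classifier in front of the provider manager. A deliberately naive sketch (real routing would use better complexity signals than token count; model names are placeholders):

```python
def pick_model(prompt: str, max_cheap_tokens: int = 200) -> str:
    """Route short, simple requests to a cheap model and reserve the
    expensive model for larger ones. Threshold is illustrative."""
    rough_tokens = len(prompt.split())  # crude whitespace token estimate
    if rough_tokens <= max_cheap_tokens:
        return "cheap-small-model"
    return "expensive-large-model"
```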
A/B Testing Models
When a new model is released, we can test it on a subset of traffic without risking production. The abstraction makes this trivial - just add a new provider implementation and configure routing rules.
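Such routing rules can be as simple as a weighted traffic split (a sketch with made-up names; the `rand` parameter is injectable for testing):

```python
import random

# Illustrative traffic split - weights sum to 1.0.
ROUTING_RULES = {
    "candidate-new-model": 0.05,  # 5% of traffic trials the new model
    "current-default": 0.95,
}

def choose_provider(rules: dict[str, float], rand=random.random) -> str:
    """Weighted random choice over routing rules."""
    r = rand()
    cumulative = 0.0
    for name, weight in rules.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against float rounding at the upper edge
```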
Best Tool for Each Job
Transcription, language understanding, and speech synthesis are different problems. The best solution for each might come from different providers. Our architecture lets us optimise each component independently.
Implementation Details
Streaming Is Non-Negotiable
Voice AI needs real-time responses. Users expect sub-second latency - they're having a conversation, not submitting a form. Everything is streaming:
- Audio streams in from the user
- Transcription streams to the LLM
- LLM responses stream to TTS
- Audio streams back to the user
Traditional request-response patterns add unacceptable latency. Streaming everything means users hear responses while they're still being generated.
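The whole voice pipeline composes as chained streams. A toy sketch with each stage stubbed (the stages stand in for real STT, LLM, and TTS calls) - the key point is that every stage yields output as soon as input arrives, rather than waiting for its predecessor to finish:

```python
from typing import Iterator

def transcribe(audio: Iterator[bytes]) -> Iterator[str]:
    # Stub STT: emit text per audio chunk as it arrives.
    for chunk in audio:
        yield chunk.decode()

def llm_respond(words: Iterator[str]) -> Iterator[str]:
    # Stub LLM: start responding before the full transcript exists.
    for word in words:
        yield word.upper()

def synthesize(tokens: Iterator[str]) -> Iterator[bytes]:
    # Stub TTS: audio frames stream out as tokens stream in.
    for token in tokens:
        yield token.encode()

def pipeline(audio: Iterator[bytes]) -> Iterator[bytes]:
    return synthesize(llm_respond(transcribe(audio)))
```

Because each stage is a generator, the first audio frame of the reply can leave the pipeline while later input is still arriving.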
WebSockets for Real-Time
HTTP polling doesn't work for voice. We use WebSockets for bidirectional, real-time communication between clients and servers. This adds complexity but is essential for voice applications.
Circuit Breakers
When a provider starts failing, we need to fail fast rather than queue up timeouts. Circuit breakers monitor provider health and short-circuit requests to failing providers, automatically routing to alternatives.
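A minimal circuit breaker can be sketched like this (simplified; thresholds and the injectable clock are for illustration):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; after `reset_after`
    seconds it permits a trial request (the half-open state)."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow normally
        if self.clock() - self.opened_at >= self.reset_after:
            return True  # half-open: let one trial request through
        return False  # open: short-circuit, route elsewhere

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The provider manager checks `allow()` before dispatching; a denied request goes straight to the next provider instead of waiting out a timeout.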
Trade-Offs
This architecture isn't free:
- Complexity: Abstractions add code and indirection
- Lowest Common Denominator: Provider-specific features are harder to use
- Testing: Each provider needs integration tests
- Subtle Differences: Providers behave slightly differently
We accept these trade-offs because the benefits - resilience, flexibility, cost optimisation - outweigh the costs for our use case. If you're building a simple prototype, this is probably overkill. For production systems handling real customer conversations, it's essential.
What We Learned
1. Build Custom Where It Matters
General-purpose LLMs are impressive, but custom models win on latency, cost, and domain-specific accuracy. We build custom for high-volume, well-defined tasks and use external providers for complex reasoning.
2. Design for Change
The AI landscape changes constantly. Architectures that assume stability become liabilities. Design assuming your providers will change - because they will.
3. Abstractions Should Be Thin
Heavy abstractions that try to normalise everything become their own problem. Our abstractions are thin - just enough to swap providers, not so much that they obscure what's happening.
4. Monitoring Per Model
When something goes wrong, you need to know which model is the issue - custom or external. Separate monitoring and logging for each makes debugging tractable.
5. Provider-Specific Optimisations Are OK
The abstraction doesn't mean treating all providers identically. We tune prompts per provider, use provider-specific features when valuable, and route based on strengths. The abstraction is for swappability, not uniformity.
The Future
We're continuing to evolve this architecture:
- More providers: As new capable models emerge
- Smarter routing: ML-based provider selection
- Cost prediction: Estimating costs before making requests
- Quality monitoring: Automated quality scoring of provider responses
The core pattern - abstraction with clean interfaces - will remain. It's proven its value in production.
Interested in AI infrastructure?
Switchboard is built by a small team solving hard problems in real-time AI. If this kind of architecture interests you:

