When we built Switchboard, our AI-powered communication platform, we made two deliberate architectural choices: build custom models for domain-specific tasks, and abstract external providers so no single vendor owns our system. Here's why, and how we implemented it.
The Vendor Lock-In Problem
AI is moving fast. The best model today might not be the best model tomorrow. If your architecture is tightly coupled to one provider:
- You can't easily switch when something better emerges
- Provider outages become your outages
- Pricing changes hold you hostage
- You're stuck with one provider's strengths and weaknesses
We've seen this pattern before. Companies that bet everything on one vendor inevitably regret it when that vendor changes pricing, deprecates APIs, or falls behind competitors.
The Provider Abstraction Layer
Switchboard's AI module abstracts all provider-specific details behind clean interfaces. The core engine doesn't know or care whether it's talking to OpenAI or Anthropic - it just requests completions and gets responses.
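As a rough illustration of what such an interface can look like - this is a minimal sketch, not Switchboard's actual code, and all names are hypothetical - the core engine depends only on an abstract provider:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Completion:
    text: str
    provider: str  # which backend actually served the request

class LLMProvider(ABC):
    """Everything the core engine needs from a language-model provider."""

    @abstractmethod
    def complete(self, messages: list[dict]) -> Completion: ...

    @abstractmethod
    def stream(self, messages: list[dict]) -> Iterator[str]: ...

class AnthropicProvider(LLMProvider):
    """Adapter for one vendor; a real version would call the vendor API."""

    def complete(self, messages):
        return Completion(text="...", provider="anthropic")

    def stream(self, messages):
        yield "..."
```

Application code holds an `LLMProvider` reference and never imports a vendor SDK directly, so swapping vendors is a one-line configuration change.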
Custom Models
For domain-specific tasks, we build and train our own models:
- Intent Classification: Custom models trained on our specific use cases outperform general-purpose LLMs for routing decisions
- Entity Extraction: Purpose-built extractors for the data types our customers care about
- Quality Scoring: Internal models that evaluate conversation quality in real-time
Custom models give us control over latency, cost, and behaviour that external providers can't match for specialised tasks.
External LLM Providers
For general language understanding and generation, we support multiple providers:
- OpenAI: GPT-4 and newer models
- Anthropic: Claude models
Each provider has different strengths. Some excel at reasoning, others at following instructions. Our abstraction lets us route requests to the optimal provider for each task - or switch entirely if one provider experiences issues.
Speech Processing
Voice AI requires more than just LLMs:
- Speech-to-Text (STT): Converting user speech to text
- Text-to-Speech (TTS): Converting responses to natural speech
We use Deepgram for STT (excellent accuracy, low latency) and ElevenLabs for TTS (natural-sounding voices). But these are behind abstractions too - we can swap providers without changing application code.
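The speech abstractions follow the same shape. A sketch using structural typing (again hypothetical names, with the vendor call stubbed out):

```python
from typing import Iterator, Protocol

class SpeechToText(Protocol):
    def transcribe_stream(self, audio_chunks: Iterator[bytes]) -> Iterator[str]: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class DeepgramSTT:
    """Adapter around a streaming STT vendor (stubbed for illustration)."""

    def transcribe_stream(self, audio_chunks):
        for chunk in audio_chunks:
            yield f"<transcript of {len(chunk)} bytes>"

def transcribe(stt: SpeechToText, audio: Iterator[bytes]) -> list[str]:
    # Application code depends only on the protocol, never on Deepgram itself.
    return list(stt.transcribe_stream(audio))
```

A replacement vendor only needs to provide the same `transcribe_stream` method; nothing upstream changes.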
How the Abstraction Works
The architecture follows a consistent pattern:
Application Code → Provider Manager → Provider Implementation → External API
The Provider Manager handles:
- Provider Selection: Choosing which provider to use for a request
- Failover: Switching to backup providers on errors
- Configuration: Provider-specific settings without leaking into app code
- Streaming: Consistent streaming interface regardless of provider
Application code simply requests what it needs:
- "Complete this conversation"
- "Transcribe this audio"
- "Generate speech for this text"
The provider layer handles the rest.
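The selection-plus-failover behaviour described above can be sketched in a few lines (a simplified illustration, not the production implementation):

```python
class ProviderError(Exception):
    pass

class ProviderManager:
    """Tries providers in preference order, failing over on errors."""

    def __init__(self, providers):
        self.providers = providers  # ordered: primary first, backups after

    def complete(self, messages):
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(messages)
            except ProviderError as exc:
                errors.append((type(provider).__name__, exc))
        raise ProviderError(f"all providers failed: {errors}")

# Toy providers to demonstrate failover:
class Flaky:
    def complete(self, messages):
        raise ProviderError("outage")

class Stable:
    def complete(self, messages):
        return "response"
```

With `ProviderManager([Flaky(), Stable()])`, a request transparently falls through to the backup when the primary raises.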
Real-World Benefits
Graceful Degradation
When OpenAI had a major outage last year, services built tightly on their API went down. Our systems degraded gracefully - routing to alternative providers while OpenAI recovered.
Cost Optimisation
Different providers have different pricing models. By abstracting providers, we can route simpler requests to cheaper models and reserve expensive models for complex tasks. This isn't possible when you're locked to one provider.
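In its simplest form, cost-aware routing is just a classifier in front of the provider manager. A deliberately naive sketch (real routing would use better complexity signals than token count; model names are placeholders):

```python
def pick_model(prompt: str, max_cheap_tokens: int = 200) -> str:
    """Route short, simple requests to a cheap model and reserve the
    expensive model for larger ones. Threshold is illustrative."""
    rough_tokens = len(prompt.split())  # crude whitespace token estimate
    if rough_tokens <= max_cheap_tokens:
        return "cheap-small-model"
    return "expensive-large-model"
```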
A/B Testing Models
When a new model is released, we can test it on a subset of traffic without risking production. The abstraction makes this trivial - just add a new provider implementation and configure routing rules.
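Such routing rules can be as simple as a weighted traffic split (a sketch with made-up names; the `rand` parameter is injectable for testing):

```python
import random

# Illustrative traffic split - weights sum to 1.0.
ROUTING_RULES = {
    "candidate-new-model": 0.05,  # 5% of traffic trials the new model
    "current-default": 0.95,
}

def choose_provider(rules: dict[str, float], rand=random.random) -> str:
    """Weighted random choice over routing rules."""
    r = rand()
    cumulative = 0.0
    for name, weight in rules.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against float rounding at the upper edge
```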
Best Tool for Each Job
Transcription, language understanding, and speech synthesis are different problems. The best solution for each might come from different providers. Our architecture lets us optimise each component independently.
Implementation Details
Streaming Is Non-Negotiable
Voice AI needs real-time responses. Users expect sub-second latency - they're having a conversation, not submitting a form. Everything is streaming:
- Audio streams in from the user
- Transcription streams to the LLM
- LLM responses stream to TTS
- Audio streams back to the user
Traditional request-response patterns add unacceptable latency. Streaming everything means users hear responses while they're still being generated.
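The whole voice pipeline composes as chained streams. A toy sketch with each stage stubbed (the stages stand in for real STT, LLM, and TTS calls) - the key point is that every stage yields output as soon as input arrives, rather than waiting for its predecessor to finish:

```python
from typing import Iterator

def transcribe(audio: Iterator[bytes]) -> Iterator[str]:
    # Stub STT: emit text per audio chunk as it arrives.
    for chunk in audio:
        yield chunk.decode()

def llm_respond(words: Iterator[str]) -> Iterator[str]:
    # Stub LLM: start responding before the full transcript exists.
    for word in words:
        yield word.upper()

def synthesize(tokens: Iterator[str]) -> Iterator[bytes]:
    # Stub TTS: audio frames stream out as tokens stream in.
    for token in tokens:
        yield token.encode()

def pipeline(audio: Iterator[bytes]) -> Iterator[bytes]:
    return synthesize(llm_respond(transcribe(audio)))
```

Because each stage is a generator, the first audio frame of the reply can leave the pipeline while later input is still arriving.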
WebSockets for Real-Time
HTTP polling doesn't work for voice. We use WebSockets for bidirectional, real-time communication between clients and servers. This adds complexity but is essential for voice applications.
Circuit Breakers
When a provider starts failing, we need to fail fast rather than queue up timeouts. Circuit breakers monitor provider health and short-circuit requests to failing providers, automatically routing to alternatives.
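A minimal circuit breaker can be sketched like this (simplified; thresholds and the injectable clock are for illustration):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; after `reset_after`
    seconds it permits a trial request (the half-open state)."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow normally
        if self.clock() - self.opened_at >= self.reset_after:
            return True  # half-open: let one trial request through
        return False  # open: short-circuit, route elsewhere

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The provider manager checks `allow()` before dispatching; a denied request goes straight to the next provider instead of waiting out a timeout.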
Trade-Offs
This architecture isn't free:
- Complexity: Abstractions add code and indirection
- Lowest Common Denominator: Provider-specific features are harder to use
- Testing: Each provider needs integration tests
- Subtle Differences: Providers behave slightly differently
We accept these trade-offs because the benefits - resilience, flexibility, cost optimisation - outweigh the costs for our use case. If you're building a simple prototype, this is probably overkill. For production systems handling real customer conversations, it's essential.
What We Learned
1. Build Custom Where It Matters
General-purpose LLMs are impressive, but custom models win on latency, cost, and domain-specific accuracy. We build custom for high-volume, well-defined tasks and use external providers for complex reasoning.
2. Design for Change
The AI landscape changes constantly. Architectures that assume stability become liabilities. Design assuming your providers will change - because they will.
3. Abstractions Should Be Thin
Heavy abstractions that try to normalise everything become their own problem. Our abstractions are thin - just enough to swap providers, not so much that they obscure what's happening.
4. Monitoring Per Model
When something goes wrong, you need to know which model is the issue - custom or external. Separate monitoring and logging for each makes debugging tractable.
5. Provider-Specific Optimisations Are OK
The abstraction doesn't mean treating all providers identically. We tune prompts per provider, use provider-specific features when valuable, and route based on strengths. The abstraction is for swappability, not uniformity.
The Future
We're continuing to evolve this architecture:
- More providers: As new capable models emerge
- Smarter routing: ML-based provider selection
- Cost prediction: Estimating costs before making requests
- Quality monitoring: Automated quality scoring of provider responses
The core pattern - abstraction with clean interfaces - will remain. It's proven its value in production.
Interested in AI infrastructure?
Switchboard is built by a small team solving hard problems in real-time AI. If this kind of architecture interests you:

