The state of AI voice
Voice AI is moving fast. In the span of about a year, we've gone from clunky, robotic phone bots to voice agents that can hold genuine conversations, and from cloud-only TTS services to open-source models as small as 82 million parameters that sound shockingly good running on a laptop. If you're building anything voice-related right now, the landscape looks completely different than it did even twelve months ago. This post breaks down two sides of that shift: the commercial voice agent platforms competing to own the conversational AI stack, and the wave of lightweight open-source TTS models that are making local, private, high-quality speech synthesis a real option for the first time.
The voice agent platforms
The commercial voice AI space has consolidated around a handful of platforms, each with a different philosophy. Some want to own the entire stack. Others want to be the glue layer. Understanding the distinction matters, because it shapes what you can build and how much control you have.
ElevenLabs
ElevenLabs started as a text-to-speech company and has expanded into a full voice agent platform with ElevenAgents. The pitch is an end-to-end stack: TTS, voice cloning, conversational AI, telephony, and analytics all under one roof. If you want consistent production performance without stitching together multiple providers, ElevenLabs is the most integrated option available.

The voice quality is widely regarded as best-in-class among commercial offerings. ElevenLabs supports real-time audio streaming, offers over a thousand voices, and provides instant voice cloning. Their conversational AI platform handles both voice and text interactions, with deployment across telephony systems, web, and mobile. Pricing starts at $0.08 per minute on annual business plans, with a free tier offering 10,000 credits per month.

The tradeoff is that you're locked into their ecosystem. If you want to swap out the TTS engine or use a different LLM routing strategy, you're working against the grain of the platform rather than with it.
Vapi
Vapi takes the opposite approach. It's a developer-first orchestration layer that lets you choose your own STT, LLM, and TTS providers for each call. Think of it as the middleware for voice agents: you bring the models, Vapi handles the infrastructure, latency optimization, and telephony integration.

The platform supports over 100 languages, offers a no-code Flow Studio for building conversation flows, and provides API access to everything. Vapi also includes a unified telephony system and analytics dashboards for tracking call metrics. Base pricing starts at $0.05 per minute for orchestration, but the real cost depends on which models you plug in. A typical production setup runs between $0.07 and $0.25 per minute once you factor in STT, TTS, and LLM costs.

The flexibility is the selling point, but it also means you're responsible for managing a more complex stack and tracking costs across multiple providers.
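To see how bring-your-own-model pricing composes, here's a toy per-minute cost calculation. The component prices below are illustrative assumptions for a hypothetical stack, not quotes from any provider:

```python
# Toy per-minute cost model for a bring-your-own-models voice stack.
# All component prices are illustrative assumptions, not vendor quotes.

def stack_cost_per_minute(orchestration: float, stt: float, llm: float, tts: float) -> float:
    """Total per-minute cost of a modular voice agent stack, in USD."""
    return round(orchestration + stt + llm + tts, 4)

# A hypothetical budget stack: cheap STT, small LLM, lightweight TTS.
budget = stack_cost_per_minute(orchestration=0.05, stt=0.01, llm=0.01, tts=0.02)

# A hypothetical premium stack: accurate STT, frontier LLM, best-in-class TTS.
premium = stack_cost_per_minute(orchestration=0.05, stt=0.02, llm=0.08, tts=0.10)

print(budget)   # 0.09
print(premium)  # 0.25
```

The two hypothetical stacks roughly bracket the $0.07 to $0.25 range above: the orchestration fee is fixed, and model choice drives everything else.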
Retell AI
Retell sits somewhere between ElevenLabs and Vapi. It's purpose-built for businesses that need to automate phone calls at scale, with a focus on customer support and outbound sales. Retell offers a visual flow builder, native telephony integration, and strong analytics including post-call sentiment analysis and success rate tracking. One of Retell's strengths is its modular pricing. The base platform cost starts at $0.07 per minute, and you can mix and match TTS providers (including ElevenLabs, Cartesia, OpenAI, and their own platform voices), LLMs, and telephony options. A typical setup with ElevenLabs voice and Claude runs around $0.14 per minute. Retell emphasizes low latency and handles both inbound and outbound calls. For teams that want a middle ground between full-stack simplicity and developer flexibility, it's a solid choice.
Bland AI
Bland AI targets enterprise teams that need granular control over conversation logic. Its standout feature is Conversational Pathways, which lets you mix scripted branches with LLM-generated replies while setting guardrails to prevent hallucinations. You can define conditions, fallbacks, and sentiment-based routing. The platform is built for scale, handling hundreds of thousands of concurrent calls with self-hosted options for teams that need latency control or compliance in regulated industries. Native integrations are limited and often require custom development, so it's best suited for engineering teams running high-volume call operations. Pricing sits around $0.09 per minute plus subscription and outbound fees.
Platform comparison
| | ElevenLabs | Vapi | Retell AI | Bland AI |
|---|---|---|---|---|
| Approach | Full-stack, end-to-end | Modular orchestration layer | Mid-range, modular pricing | Enterprise, granular control |
| Base pricing | ~$0.08/min | ~$0.05/min + model costs | ~$0.07/min + model costs | ~$0.09/min + fees |
| Typical cost | $0.08/min | $0.07 - $0.25/min | ~$0.14/min | $0.09/min+ |
| BYO models | No | Yes (STT, LLM, TTS) | Partial (mix TTS/LLM) | Partial |
| Best for | Small teams wanting simplicity | Developers wanting flexibility | Mid-size teams, phone automation | Enterprise, regulated industries |
| Self-hosted | No | No | No | Yes |
| Standout feature | Best-in-class voice quality | 100+ languages, full provider choice | Post-call sentiment analysis | Conversational Pathways with guardrails |
How to think about choosing
The choice between these platforms comes down to a few questions:
- Do you want a single vendor or a modular stack? ElevenLabs gives you everything in one place. Vapi gives you maximum flexibility. Retell and Bland fall in between.
- How much engineering capacity do you have? Retell and ElevenLabs are friendlier to smaller teams. Vapi and Bland reward teams with strong engineering resources.
- What's your call volume? At high volumes, the per-minute cost differences add up. Bland and Retell offer enterprise pricing that can bring costs down significantly.
- Do you need compliance controls? Bland's self-hosted options and detailed logging make it the strongest choice for regulated industries.
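To make the volume question concrete, here's a back-of-the-envelope monthly spend comparison using the typical per-minute figures discussed above. The 50,000-minute monthly volume is a hypothetical example, and the Vapi figure is just the midpoint of its cost range:

```python
# Rough monthly spend at a given call volume, using the typical per-minute
# figures from this post. The 50,000-minute volume is hypothetical.

RATES = {                              # typical cost per minute (USD)
    "ElevenLabs": 0.08,
    "Vapi (mid-range stack)": 0.15,    # midpoint of the $0.07-$0.25 range
    "Retell AI": 0.14,
    "Bland AI": 0.09,                  # excludes subscription and outbound fees
}

minutes_per_month = 50_000

for platform, rate in RATES.items():
    print(f"{platform}: ${rate * minutes_per_month:,.0f}/month")
```

At this scale a few cents per minute separates platforms by thousands of dollars a month, which is why enterprise pricing negotiations matter well before you hit six-figure call volumes.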
The open-source TTS revolution
While the commercial platforms have been competing on features and pricing, something arguably more important has been happening in the open-source world. A new generation of TTS models has emerged that are small enough to run locally, good enough to compete with commercial offerings, and permissively licensed enough to use in production.

What makes this wave remarkable is the parameter count. These aren't billion-parameter behemoths that require expensive GPU clusters. They're models in the tens to hundreds of millions of parameters that run on consumer hardware, sometimes even on CPUs.
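A quick way to see why these models fit on consumer hardware is to estimate weight memory from the parameter count. This back-of-the-envelope sketch deliberately ignores activations, caches, and framework overhead:

```python
# Approximate memory needed just to hold model weights, by precision.
# Ignores activations, any KV cache, and framework overhead.

def weight_memory_mb(params: int, bytes_per_param: int) -> float:
    """Weight footprint in megabytes (1 MB = 1e6 bytes)."""
    return params * bytes_per_param / 1e6

# Kokoro-82M in fp32 and fp16:
print(weight_memory_mb(82_000_000, 4))  # 328.0 MB
print(weight_memory_mb(82_000_000, 2))  # 164.0 MB

# A 3B-parameter model like Orpheus in fp16 -- why it wants a GPU:
print(weight_memory_mb(3_000_000_000, 2))  # 6000.0 MB
```

An 82M-parameter model fits comfortably in a few hundred megabytes even at full precision, which is why laptop CPUs and integrated GPUs are suddenly viable TTS hardware.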
Kokoro-82M
Kokoro is the model that proved small TTS could sound great. With just 82 million parameters, it delivers voice quality that competes with models many times its size. Built on the StyleTTS 2 architecture with an ISTFTNet decoder, Kokoro uses a decoder-only design with no diffusion or encoder components, which keeps it fast and lightweight.

Before its public release on Christmas Day 2024, Kokoro was the top-ranked model on the TTS Spaces Arena on Hugging Face, beating models trained on far more data with far more parameters. It was trained on less than 100 hours of audio, which is remarkably little by modern standards.

Kokoro ships with 10 voicepacks covering American and British English accents, with community-driven expansion to French, Korean, Japanese, and Mandarin. It runs comfortably on a standard NVIDIA GPU or Apple Silicon via MPS acceleration, and an ONNX version is available for optimized deployment. The Apache 2.0 license means you can use it commercially without restrictions. For developers who need decent English TTS without cloud dependencies, Kokoro set a new bar for what's possible at this scale.
Orpheus TTS
If Kokoro proved that small models could sound good, Orpheus proved that LLM-based architectures could make speech sound human. Built on the Llama 3B backbone by Canopy Labs, Orpheus is a 3-billion-parameter model that treats speech synthesis as a language modeling problem.
The results are striking. Orpheus produces speech with natural intonation, emotion, and rhythm that rivals closed-source commercial models. It supports zero-shot voice cloning, guided emotion and intonation control through simple tags like [cheerful] or [whisper], and streaming latency as low as 200ms (reducible to around 100ms with input streaming).
The architecture uses 7 tokens per audio frame decoded as a single flattened sequence rather than using multiple language model heads, combined with a CNN-based tokenizer. This design keeps inference efficient enough for real-time applications. Orpheus was trained on over 100,000 hours of English speech data and billions of text tokens.
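The flattened-sequence design implies a simple real-time budget: the LLM has to emit 7 audio tokens for every codec frame, so its decode speed must exceed 7 times the frame rate. The 12 Hz frame rate below is a hypothetical value chosen purely for illustration, not a figure from the Orpheus documentation:

```python
# Real-time budget for a flattened audio-token sequence: the model must
# generate tokens_per_frame * frame_rate tokens per second of audio.
# The 12 Hz frame rate used below is a hypothetical illustration.

TOKENS_PER_FRAME = 7  # from the Orpheus design described above

def required_tokens_per_second(frame_rate_hz: float) -> float:
    """Audio tokens the LLM must emit per second of generated speech."""
    return TOKENS_PER_FRAME * frame_rate_hz

def is_realtime(decode_tokens_per_second: float, frame_rate_hz: float) -> bool:
    """True if the LLM's decode speed keeps up with real-time playback."""
    return decode_tokens_per_second >= required_tokens_per_second(frame_rate_hz)

print(required_tokens_per_second(12.0))        # 84.0 tokens per second of audio
print(is_realtime(100.0, frame_rate_hz=12.0))  # True: 100 tok/s clears the budget
```

This is why a 3B model works for streaming speech: a budget of tens of tokens per second is easy for a small LLM on modern inference stacks, leaving headroom for the sub-200ms latencies mentioned above.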
Orpheus is available on platforms like Groq and Together AI for cloud inference, and can be self-hosted. It's licensed under Apache 2.0, making it viable for commercial use.
Chatterbox
Chatterbox, developed by Resemble AI, is a family of open-source TTS models that has gained serious traction, surpassing a million downloads on Hugging Face. The latest model in the lineup, Chatterbox-Turbo, uses a streamlined 350-million-parameter architecture that reduces compute and VRAM requirements while maintaining high-fidelity output.

What sets Chatterbox apart is its distilled one-step decoder. Where earlier versions required 10 diffusion steps to generate audio, Turbo does it in a single step. This makes it one of the fastest open-source TTS options available.

Chatterbox also supports emotion exaggeration control, a feature that's rare in open-source models, and zero-shot voice cloning from roughly 5 seconds of audio. The model has been benchmarked favorably against ElevenLabs in blind evaluations, which is notable for a free, MIT-licensed model. Chatterbox Multilingual extends support to 23 languages, making it one of the more versatile open-source options.
Other notable models
The open-source TTS space is moving quickly, and several other models deserve attention:
- Pocket TTS is a 100-million-parameter model built on a novel architecture called Continuous Audio Language Models (CALM). It runs faster than real-time on a laptop CPU, making it one of the most efficient options for truly resource-constrained environments.
- Fish Speech 1.5 excels at multilingual synthesis and code-switching (mixing languages within a single utterance), handling scenarios like Spanglish better than most paid APIs.
- CosyVoice2-0.5B from FunAudioLLM uses a dual autoregressive architecture and is consistently ranked among the top open-source models for accuracy across multiple languages.
- NeuTTS Air by Neuphonic is built on a 0.5B-parameter LLM backbone and is designed specifically for on-device deployment, running on everything from laptops to Raspberry Pis.
- Qwen3-TTS introduces "instructable TTS," where you can control voice characteristics through natural language prompts like "speak with a sarcastic, skeptical tone, accelerating slightly at the end."
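With this many options, requirements-first filtering beats reading model cards one by one. Here's a toy shortlist helper whose boolean attributes are simplified from the comparisons in this post (they're a snapshot for illustration, not a live catalog):

```python
# Toy requirements filter over the open-source models discussed in this post.
# Attribute values are simplified from this article's comparisons.

MODELS = [
    {"name": "Kokoro-82M",       "cpu": True,  "cloning": False, "multilingual": False},
    {"name": "Orpheus TTS",      "cpu": False, "cloning": True,  "multilingual": False},
    {"name": "Chatterbox-Turbo", "cpu": False, "cloning": True,  "multilingual": True},
    {"name": "Pocket TTS",       "cpu": True,  "cloning": True,  "multilingual": False},
    {"name": "Fish Speech 1.5",  "cpu": False, "cloning": True,  "multilingual": True},
]

def shortlist(cpu=None, cloning=None, multilingual=None):
    """Return model names matching every requirement that is not None."""
    wanted = {k: v for k, v in
              {"cpu": cpu, "cloning": cloning, "multilingual": multilingual}.items()
              if v is not None}
    return [m["name"] for m in MODELS
            if all(m[k] == v for k, v in wanted.items())]

print(shortlist(cpu=True))                         # ['Kokoro-82M', 'Pocket TTS']
print(shortlist(cloning=True, multilingual=True))  # ['Chatterbox-Turbo', 'Fish Speech 1.5']
```

The point is less the code than the habit: decide whether CPU-only inference, voice cloning, or multilingual output is a hard requirement, and the field usually narrows to one or two candidates.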
TTS model comparison
| | Kokoro-82M | Orpheus TTS | Chatterbox-Turbo | Pocket TTS | Fish Speech 1.5 |
|---|---|---|---|---|---|
| Parameters | 82M | 3B | 350M | 100M | N/A |
| Architecture | StyleTTS 2 + ISTFTNet | Llama 3B backbone | Distilled one-step decoder | CALM | N/A |
| Voice cloning | No | Yes (zero-shot) | Yes (~5s audio) | Yes | Yes |
| Emotion control | No | Yes (tags) | Yes (exaggeration control) | No | No |
| Languages | English + community (FR, KO, JA, ZH) | English | 23 (with Multilingual) | English | Multilingual + code-switching |
| Runs on CPU | Via ONNX | No | No | Yes (faster than real-time) | No |
| License | Apache 2.0 | Apache 2.0 | MIT | Open-source | Open-source |
| Streaming latency | Low | ~100 - 200ms | Very low (single step) | Very low | N/A |
What this means
The convergence of these two trends, capable commercial platforms and increasingly competitive open-source models, is reshaping what's possible with voice AI.

For product teams building voice agents, the commercial platforms have matured to the point where deploying a production voice agent is a weeks-long project rather than a months-long one. The main challenge is no longer the technology itself but designing conversations that actually help users and integrating agents into workflows that create business value.

For developers who need TTS, the open-source options mean you no longer have to choose between quality and cost. A model like Kokoro or Chatterbox-Turbo can run locally, costs nothing per character, and sounds good enough for most production use cases. The days when high-quality TTS required an expensive API subscription are ending.

For privacy-conscious applications, local TTS models eliminate the need to send user data to external services. This opens up use cases in healthcare, legal, and finance that were previously impractical with cloud-only solutions.

The most interesting space to watch is where these trends intersect: voice agent platforms that let you plug in your own open-source TTS models. Vapi's modular architecture already supports this pattern, and as open-source models continue to improve, the economic case for self-hosted TTS within commercial orchestration platforms will only get stronger.

Voice AI is no longer an expensive experiment. It's becoming infrastructure.
References
- ElevenLabs Conversational AI Platform, https://elevenlabs.io/conversational-ai
- Vapi Developer Platform, https://docs.vapi.ai/quickstart/introduction
- Retell AI Voice Agent Platform, https://www.retellai.com
- Bland AI Platform, https://www.bland.ai
- Kokoro-82M on Hugging Face, https://huggingface.co/hexgrad/Kokoro-82M
- Orpheus TTS on GitHub, https://github.com/canopyai/Orpheus-TTS
- Chatterbox TTS by Resemble AI, https://github.com/resemble-ai/chatterbox
- Orpheus TTS on Hugging Face, https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
- "The Best Open-Source Text-to-Speech Models in 2026," BentoML, https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models
- "The Best Small Text-to-Speech Models in 2026," SiliconFlow, https://www.siliconflow.com/articles/en/best-small-text-to-speech-models-2025
- "Pocket TTS: 100M TTS and Voice Cloning model for CPUs," Medium, https://medium.com/data-science-in-your-pocket/pocket-tts-100m-tts-and-voice-cloning-model-for-cpus-dfe3185fa0a8
- "Top Open-Source Text to Speech Models," Modal, https://modal.com/blog/open-source-tts
- "Best AI Voice Agent Platforms (2025 Review & Comparison)," Synthflow, https://synthflow.ai/blog/8-best-ai-voice-agents-for-business-in-2026