AI Voice & Speech Market 2026 — ElevenLabs, Deepgram, OpenAI & real-time agents

Key findings

ElevenLabs Q1 ARR: $200M. Up from $15M in 2023; fastest-growing voice synthesis API.
Real-time turn target: <200ms. Current best is ~280ms; whoever hits sub-200 owns voice agents.
TTS price decline: -50%. $0.30 → $0.15 per 1k chars (2024 → 2026).
Deepgram price floor: $0.008/min. Owns the price-performance frontier on real-time STT.
Audiobook AI spend: $80M+. Industry estimate for AI voice in 2026.

Market overview

The voice AI category split in two during 2025:

Speech generation (TTS / voice cloning) — ElevenLabs, OpenAI Voice, Cartesia, Resemble. ~$700M total addressable revenue.
Speech recognition (STT / transcription) — Deepgram, AssemblyAI, Speechmatics, Rev. ~$400M.

A third emerging segment — conversational voice agents — combines both with an LLM in the middle. Real-time conversational voice is the highest-stakes contest in the category right now.

ElevenLabs — the synthesis winner

ElevenLabs hit ~$200M ARR by Q1 2026 from a standing start in 2022. Three growth vectors:

Voice cloning at scale — 30-second sample → indistinguishable clone. The product makes voice talent licensable as software.
Multilingual — same voice, 32 languages, no quality regression
Audiobooks — Spotify and Audible competitors have all integrated it. The book industry is estimated to spend $80M+ on AI voice in 2026.
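
The ARR figures quoted for ElevenLabs ($15M in 2023, ~$200M by Q1 2026) imply a growth multiple that is easy to check. The sketch below is only arithmetic on those two numbers. The elapsed time (`ASSUMED_YEARS`) is an assumption, since the report gives a year and a quarter rather than exact dates, and the function names are illustrative.

```python
# Growth arithmetic for the ARR figures quoted in this report.
# The elapsed time is an ASSUMPTION: the report gives only "2023" and "Q1 2026".

START_ARR_M = 15.0    # $M, 2023 (from the report)
END_ARR_M = 200.0     # $M, Q1 2026 (from the report)
ASSUMED_YEARS = 2.25  # assumption: end of 2023 to end of Q1 2026


def growth_multiple(start: float, end: float) -> float:
    """How many times larger the end value is than the start value."""
    return end / start


def annualized_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate as a fraction (1.0 means +100% per year)."""
    return (end / start) ** (1.0 / years) - 1.0


multiple = growth_multiple(START_ARR_M, END_ARR_M)
cagr = annualized_growth(START_ARR_M, END_ARR_M, ASSUMED_YEARS)

print(f"Growth multiple: {multiple:.1f}x")
print(f"Implied annualized growth: {cagr:.0%}")
```

The multiple (about 13x) depends only on the two reported figures. The annualized rate shifts considerably with the choice of `ASSUMED_YEARS`, so it is best read as an order-of-magnitude indicator.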

ElevenLabs Conversational AI (released late 2025) is the bid for the voice-agent stack. Latency is around 350ms today, with a target of sub-200ms by Q4 2026.

OpenAI Voice — the platform play

GPT-4o's voice mode and the Realtime API ship voice as a primitive of the OpenAI platform. The advantage: it is already integrated with the model that millions of developers use. The disadvantage: commodity quality, with emotive range and voice variety still trailing ElevenLabs.

Cartesia — the latency play

Cartesia's Sonic-2 model targets sub-100ms first-byte latency. Real-time conversational voice is the entire pitch. The company is funded by Index, Lightspeed, and operators from OpenAI's Realtime team.

Deepgram + AssemblyAI — the STT incumbents

Two distinct strategies:

Deepgram — fast, cheap, accurate. Pitched at developers and call-center automation. ~$100M ARR. Owns the price-performance frontier on real-time transcription.
AssemblyAI — accurate, batch-oriented, with a rich intelligence layer (sentiment, topic, summarization). $80M+ ARR. Pitched at enterprise insights and post-call analytics.

Speechmatics (UK) is the technical leader in multilingual and low-resource languages. Rev.ai represents the pivot from legacy human transcription to AI.

The conversational voice contest

Three companies are competing in real-time voice agents:

ElevenLabs Conversational AI — best voice quality, growing function-calling
OpenAI Realtime API — tightest LLM coupling, weakest voice variety
Cartesia Sonic-2 — best first-byte latency, smallest catalog

All three claim sub-300ms targets. None has reliably hit that on a real network with full LLM reasoning between turns. The first to combine sub-200ms latency, emotive voice, and reliable function calling will own the voice agent stack.
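
One way to see why these targets are hard is to treat a conversational turn as a latency budget: each stage between the end of the user's speech and the first audio of the reply adds to the total. The sketch below sums such a budget and compares it with the 300ms and 200ms targets. Every stage name and per-stage number is a hypothetical placeholder chosen for illustration, not a measurement of any vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def turn_latency_ms(stages: list[Stage]) -> float:
    """Total turn latency, assuming the stages run strictly one after another."""
    return sum(stage.latency_ms for stage in stages)


def meets_target(stages: list[Stage], target_ms: float) -> bool:
    """True if the summed latency is strictly under the target."""
    return turn_latency_ms(stages) < target_ms


# HYPOTHETICAL per-stage figures, for illustration only.
PIPELINE = [
    Stage("network round trip", 40.0),
    Stage("STT endpointing and final transcript", 80.0),
    Stage("LLM first token", 120.0),
    Stage("TTS first byte", 90.0),
]

total = turn_latency_ms(PIPELINE)
print(f"Turn latency: {total:.0f}ms")
for target in (300.0, 200.0):
    status = "meets" if meets_target(PIPELINE, target) else "misses"
    overshoot = max(0.0, total - target)
    print(f"  {status} sub-{target:.0f}ms target (over by {overshoot:.0f}ms)")
```

With these placeholder numbers, even a sub-100ms TTS first byte leaves the turn above both targets, because the speech recognition and LLM stages dominate. Real systems overlap stages through streaming, which this strictly sequential model ignores.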

Pricing dynamics

Surface	2024 typical	2026 typical
TTS per 1k chars	$0.30	$0.10–$0.18
STT per minute	$0.024	$0.008–$0.012
Real-time voice per minute	n/a	$0.04–$0.08

Prices roughly halved on STT and TTS, but the premium on real-time voice (~5× transcription) reflects the LLM cost in the loop.
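
The arithmetic behind those two claims can be reproduced from the table. The sketch below uses the table's figures as given. Taking the midpoint of each 2026 range is an assumption made here for illustration, and the helper names are hypothetical.

```python
# Price figures from the table above (USD).
TTS_2024 = 0.30                # per 1k chars
TTS_2026 = (0.10, 0.18)        # per 1k chars, low/high
STT_2024 = 0.024               # per minute
STT_2026 = (0.008, 0.012)      # per minute, low/high
REALTIME_2026 = (0.04, 0.08)   # per minute, low/high


def midpoint(price_range: tuple[float, float]) -> float:
    """ASSUMPTION: represent a quoted range by its midpoint."""
    low, high = price_range
    return (low + high) / 2.0


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a price drop)."""
    return (new - old) / old * 100.0


tts_change = pct_change(TTS_2024, midpoint(TTS_2026))
stt_change = pct_change(STT_2024, midpoint(STT_2026))

# Real-time premium over plain transcription, low end vs low end and high vs high.
premium_low = REALTIME_2026[0] / STT_2026[0]
premium_high = REALTIME_2026[1] / STT_2026[1]

print(f"TTS change 2024 to 2026 (midpoint): {tts_change:.0f}%")
print(f"STT change 2024 to 2026 (midpoint): {stt_change:.0f}%")
print(f"Real-time premium over STT: {premium_low:.1f}x to {premium_high:.1f}x")
```

On midpoints, both TTS and STT fall by somewhat more than half, and the real-time premium runs from about 5x at the low end of both ranges to nearly 7x at the high end.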

Verdict

ElevenLabs is the structural winner of synthesis through 2027. OpenAI Voice is the structural winner of the platform-default position. Deepgram owns STT pricing. The real-time voice agent fight is undecided, and that's where the next $1B ARR business in this category will be built.

[Chart: AI Voice Vendors, 2026 ARR Estimates ($M)]

[Chart: TTS Pricing per 1k Characters ($)]