Audio & Speech

Voice is one of the most natural interfaces there is, and it’s now practical to build. Audio work has three building blocks — transcription, synthesis, and understanding — which combine into voice agents.

Speech-to-text (STT)

Speech-to-text — also called ASR (automatic speech recognition) or just transcription — converts spoken audio into written text. Modern STT models are accurate, multilingual, and cheap.

Two modes:

Batch — transcribe a complete recording. For meeting notes, call analytics, captioning archives.
Streaming — transcribe as the person speaks, emitting partial results. Required for live captions and conversational agents.

Accuracy is not uniform. It degrades with background noise, strong accents, overlapping speakers, and domain-specific vocabulary (product names, jargon, codes). Diarization — labeling who spoke — is a separate, harder problem. Plan for imperfect transcripts.
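
To plan for imperfect transcripts it helps to measure how imperfect they are on your own audio. One common metric is word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below is a minimal illustration, not a replacement for an evaluation toolkit; the function name and the sample sentences are made up for this example, and it only lowercases text, with no other normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A domain term split into two wrong words: one substitution plus one insertion.
wer = word_error_rate(
    "refill my metformin prescription",
    "refill my met foreman prescription",
)  # 2 edits / 4 reference words = 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.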

Text-to-speech (TTS)

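Text-to-speech synthesizes spoken audio from text. Modern TTS sounds close to natural, supports many voices and languages, and can stream audio as it’s generated, which is essential for keeping perceived latency low. Some systems can also clone a specific person’s voice; doing so needs that person’s consent.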

The dimensions that matter: naturalness, latency (especially time to the first audio chunk), and voice selection.
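
Time to the first audio chunk can be measured separately from total synthesis time. The sketch below assumes the TTS client exposes its output as an iterable of byte chunks; `fake_tts_stream` is a stand-in invented for this example, not a real API, and the `clock` parameter exists only so the measurement can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_chunk_latency(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds until the first chunk, seconds until the stream ends)."""
    start = clock()
    first = None
    for _ in chunks:
        if first is None:
            first = clock() - start  # the moment playback could begin
    total = clock() - start
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # pretend synthesis work
        yield b"\x00" * 320   # placeholder audio bytes


first_s, total_s = first_chunk_latency(fake_tts_stream())
```

With a streaming voice, the first value is what the listener perceives as the delay; the second mostly matters for throughput and cost.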

Audio understanding

Beyond transcription, newer models reason about audio directly — tone and emotion, non-speech events (a siren, applause), music, who is speaking. This keeps information that a plain transcript throws away. Useful for call-quality analysis, accessibility, and richer voice agents.

Voice agents

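A voice agent lets a user talk to an LLM-powered system. The classic design is a pipeline: speech-to-text transcribes the user’s audio, the LLM generates a reply from that text, and text-to-speech speaks the reply (STT → LLM → TTS).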
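
One turn of that pipeline (speech-to-text, then the LLM, then text-to-speech) can be sketched as follows. This is a structural illustration only: `VoiceAgent`, `transcribe`, `generate_reply`, and `synthesize` are hypothetical names, the three stages are injected as plain callables, and the stubs at the bottom stand in for real models. It is also deliberately non-streaming, so each stage waits for the previous one to finish.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VoiceAgent:
    transcribe: Callable[[bytes], str]          # STT: audio in, text out
    generate_reply: Callable[[List[str]], str]  # LLM: conversation so far in, reply out
    synthesize: Callable[[str], bytes]          # TTS: text in, audio out
    history: List[str] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one user utterance through STT, the LLM, and TTS."""
        user_text = self.transcribe(audio)
        self.history.append(f"user: {user_text}")
        reply = self.generate_reply(self.history)
        self.history.append(f"agent: {reply}")
        return self.synthesize(reply)


# Stub stages so the sketch runs without any model or network access.
agent = VoiceAgent(
    transcribe=lambda audio: audio.decode("utf-8"),
    generate_reply=lambda history: "Echo: " + history[-1].split(": ", 1)[1],
    synthesize=lambda text: text.encode("utf-8"),
)
reply_audio = agent.handle_turn(b"hello")
```

Keeping the text transcript and reply in `history` is what makes this design easy to inspect and log.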

The latency budget

The hard part of a voice agent is latency. A natural conversation needs a response within a few hundred milliseconds, and every stage spends some of that budget: capturing audio, STT, the LLM (its time to first token), TTS, and playback. They add up fast.
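
A quick way to see how fast the stages add up is to write the budget down. The numbers below are hypothetical placeholders chosen for illustration, not measurements of any system, and the 500 ms target is an assumed reading of "a few hundred milliseconds"; substitute your own figures.

```python
from typing import Dict, Tuple

# Hypothetical per-stage latencies in milliseconds (illustrative only).
STAGE_MS: Dict[str, int] = {
    "audio capture": 100,
    "STT": 150,
    "LLM (time to first token)": 250,
    "TTS (time to first audio chunk)": 120,
    "playback": 40,
}
TARGET_MS = 500  # assumed target for a natural-feeling response


def budget_report(stages: Dict[str, int], target_ms: int) -> Tuple[int, int, str]:
    """Return (total latency, remaining slack, slowest stage)."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return total, target_ms - total, slowest


total_ms, slack_ms, slowest_stage = budget_report(STAGE_MS, TARGET_MS)
# total_ms == 660, slack_ms == -160: no single stage looks slow,
# yet the sum overshoots the assumed target by 160 ms.
```

Negative slack means something has to overlap or shrink, which is why the stages are streamed rather than run back to back.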

Tactics: stream every stage (don’t wait for a full transcript before starting the LLM; don’t wait for the full response before starting TTS); handle barge-in so a user can interrupt; and get endpointing right — detecting when the user has actually finished speaking.
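
Two of these tactics can be sketched in a few lines. `sentence_chunks` shows one way to start TTS before the LLM has finished: it buffers streamed tokens and releases each sentence as soon as it is complete. `find_endpoint` shows a simple form of endpointing: declare the turn over after a run of silent frames following speech. Both are minimal sketches under stated assumptions. The function names are invented here, the per-frame speech flags are assumed to come from some voice activity detector, and the punctuation rule is naive (it would split after an abbreviation such as "Dr.").

```python
import re
from typing import Iterable, Iterator, Optional

# Sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()  # hand this sentence to TTS now
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


def find_endpoint(frame_is_speech: Iterable[bool], min_silence_frames: int) -> Optional[int]:
    """Return the frame index where the turn is judged finished, or None."""
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0          # a short pause was not the end of the turn
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return index
    return None


llm_tokens = ["Sure", ".", " Your", " order", " shipped", " today", "!", " Any", "thing", " else", "?"]
spoken = list(sentence_chunks(llm_tokens))
# spoken == ["Sure.", "Your order shipped today!", "Anything else?"]

frames = [False, True, True, False, True, False, False, False]
end_index = find_endpoint(frames, min_silence_frames=3)  # 7
```

The silence threshold is the trade-off in endpointing: too short and the agent cuts in during a pause, too long and every reply feels slow.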

Speech-to-speech models

Newer speech-to-speech models take audio in and produce audio out directly, skipping the text round-trip. They cut latency and preserve tone and emotion that transcription discards — at the cost of less visibility and control (no transcript to inspect, log, or guardrail mid-pipeline). The pipeline approach remains easier to debug and govern.

Failure modes

Errors compound: an STT mistake becomes wrong input to the LLM, which answers confidently about the wrong thing. Voice also removes the chance to proofread: the user never sees a misheard word, so they cannot correct it before the system acts on it. And voice recordings are sensitive personal data; handle them as described in Data & Privacy.

Key takeaways

Audio work has three blocks: speech-to-text (batch or streaming; accuracy varies with noise and accents), text-to-speech (natural and streamable; voice cloning needs consent), and audio understanding.
A voice agent chains STT → LLM → TTS. Its central challenge is the latency budget: stream every stage and handle interruptions.
Speech-to-speech models cut latency and keep tone but sacrifice control.
STT errors compound downstream, so design for imperfect transcripts.