
name: voice-agents
description: Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems.
risk: safe
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
Voice Agents
Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.
This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Humans expect a response within roughly 500ms, so every millisecond of end-to-end delay matters.
84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.
Principles
- Latency is the constraint - target <800ms end-to-end
- Jitter (variance) matters as much as absolute latency
- VAD quality determines conversation flow
- Interruption handling makes or breaks the experience
- Start with focused MVP, iterate based on real conversations
- Combine best-in-class components (Deepgram STT + ElevenLabs TTS)
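The first two principles can be made concrete with a small latency-budget check. The sketch below is illustrative only: the stage names, the sample values, and the helper names (`percentile`, `summarize`, `withinBudget`) are assumptions, not measurements or part of any library. It treats jitter as the gap between the median and the 95th percentile of each stage.

```typescript
// Hypothetical per-stage latency samples in milliseconds (made-up numbers).
type StageSamples = Record<string, number[]>;

interface StageReport {
  stage: string;
  p50: number;
  p95: number;
  jitter: number; // p95 - p50: how much worse the tail is than the median
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(stages: StageSamples): StageReport[] {
  return Object.entries(stages).map(([stage, samples]) => {
    const p50 = percentile(samples, 50);
    const p95 = percentile(samples, 95);
    return { stage, p50, p95, jitter: p95 - p50 };
  });
}

// Conservative check: the sum of per-stage p95 values must fit the budget.
function withinBudget(reports: StageReport[], budgetMs: number): boolean {
  const totalP95 = reports.reduce((sum, r) => sum + r.p95, 0);
  return totalP95 <= budgetMs;
}

const samples: StageSamples = {
  stt: [150, 160, 170, 180, 400],
  llm: [250, 300, 320, 350, 900],
  tts: [75, 80, 90, 100, 110],
};

const reports = summarize(samples);
const medianTotal = reports.reduce((sum, r) => sum + r.p50, 0); // 580
const ok = withinBudget(reports, 800); // false: tail latencies add up to 1410
```

With these made-up samples the medians add up to 580ms, comfortably under the 800ms target, yet the tails add up to 1410ms. Summing per-stage p95 values overstates the true end-to-end p95, so treat it as an upper bound rather than an exact figure.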
Capabilities
- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces
Scope
- phone-system-integration → backend
- audio-processing-dsp → audio-specialist
- music-generation → audio-specialist
- accessibility-compliance → accessibility-specialist
Tooling
Speech-to-speech
- OpenAI Realtime API - When: Lowest latency, most natural conversation. Note: gpt-4o-realtime-preview, native voice, sub-500ms.
- Pipecat - When: Open-source voice orchestration. Note: Daily-backed, enterprise-grade, modular.
Speech-to-text
- OpenAI Whisper - When: Highest accuracy, multilingual. Note: gpt-4o-transcribe for best results.
- Deepgram Nova-3 - When: Production workloads, 54% lower WER. Note: 150-184ms TTFT, 90%+ accuracy on noisy audio.
- AssemblyAI - When: Real-time streaming, speaker diarization. Note: Good accuracy-latency balance.
Text-to-speech
- ElevenLabs - When: Most natural voice, emotional control. Note: Flash model 75ms latency, V3 for expression.
- OpenAI TTS - When: Integrated with OpenAI stack. Note: gpt-4o-mini-tts, 13 voices, streaming.
- Deepgram Aura-2 - When: Cost-effective production TTS. Note: 40% cheaper than ElevenLabs, 184ms TTFB.
Frameworks
- Pipecat - When: Open-source voice agent orchestration. Note: Silero VAD, SmartTurn, interruption handling.
- Vapi - When: Managed voice agent platform. Note: No infrastructure management.
- Retell AI - When: Low-latency voice agents. Note: Best context preservation on interruption.
Patterns
Speech-to-Speech Architecture
Direct audio-to-audio processing for lowest latency
When to use: Maximum naturalness, emotional preservation, real-time conversation
SPEECH-TO-SPEECH ARCHITECTURE:
"""
[User Audio] → [S2S Model] → [Agent Audio]

Advantages:
- Lowest latency (sub-500ms)
- Preserves emotion, emphasis, accents
- Most natural conversation flow

Disadvantages:
- Less control over responses
- Harder to debug/audit
- Can't easily modify what's said
"""
OpenAI Realtime API
"""
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });

// Configure for voice conversation
client.updateSession({
  modalities: ['text', 'audio'],
  voice: 'alloy',
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  instructions: `You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.`,
  turn_detection: {
    type: 'server_vad', // or 'semantic_vad'
    threshold: 0.5,           // speech probability needed to count as speech
    prefix_padding_ms: 300,   // audio kept from before speech was detected
    silence_duration_ms: 500, // trailing silence that ends the user's turn
  },
});

// Handle audio streams.
// Event names are kept as given here; check them against the client version you install.
client.on('conversation.item.input_audio_transcription', (event) => {
  console.log('User said:', event.transcript);
});

client.on('response.audio.delta', (event) => {
  // Stream audio to speaker (audioPlayer is a placeholder for your playback sink)
  audioPlayer.write(Buffer.from(event.delta, 'base64'));
});

// Send user audio (audioBuffer is a placeholder for captured PCM16 audio)
client.appendInputAudio(audioBuffer);
"""
Use Cases:
- Real-time customer support
- Voice assistants
- Interactive voice response (IVR)
- Live language translation
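To build intuition for the `turn_detection` settings in the session configuration above, here is a minimal sketch of how a threshold, a prefix padding, and a silence duration could interact to mark turn boundaries. It is not how OpenAI's server VAD is implemented: the `detectTurns` function, the per-frame speech probabilities, and the 100ms frame size are all assumptions made for illustration.

```typescript
interface VadConfig {
  threshold: number;         // speech probability at or above which a frame counts as speech
  prefixPaddingMs: number;   // audio kept from before the first speech frame
  silenceDurationMs: number; // trailing silence that ends the turn
}

interface Turn {
  startMs: number;
  endMs: number;
}

// frames: hypothetical per-frame speech probabilities (0..1), each covering frameMs of audio.
function detectTurns(frames: number[], frameMs: number, cfg: VadConfig): Turn[] {
  const turns: Turn[] = [];
  let speechStart: number | null = null; // start of the open turn, padding included
  let lastSpeechEnd = 0;                 // end time of the most recent speech frame

  for (let i = 0; i < frames.length; i++) {
    const frameStart = i * frameMs;
    const frameEnd = frameStart + frameMs;

    if (frames[i] >= cfg.threshold) {
      if (speechStart === null) {
        speechStart = Math.max(0, frameStart - cfg.prefixPaddingMs);
      }
      lastSpeechEnd = frameEnd;
    } else if (speechStart !== null && frameEnd - lastSpeechEnd >= cfg.silenceDurationMs) {
      // Enough continuous silence: close the turn at the last speech frame.
      turns.push({ startMs: speechStart, endMs: lastSpeechEnd });
      speechStart = null;
    }
  }

  // Audio ended while the user was still mid-turn.
  if (speechStart !== null) {
    turns.push({ startMs: speechStart, endMs: lastSpeechEnd });
  }
  return turns;
}

const config: VadConfig = { threshold: 0.5, prefixPaddingMs: 300, silenceDurationMs: 500 };

// 100ms frames: silence, speech, a 200ms pause, one more speech frame, then long silence.
const frames = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.2, 0.2, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1];
const turns = detectTurns(frames, 100, config);
// turns is [{ startMs: 100, endMs: 1000 }]: the 200ms pause does not split the turn.
```

A shorter silence duration ends turns sooner and lowers perceived latency, but it also makes the agent more likely to cut in during a natural pause.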
Pipeline Architecture
Separate STT → LLM → TTS for maximum control
When to use: Need to know/control exactly what's said, debugging, compliance
PIPELINE ARCHITECTURE:
"""
[Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]

Advantages:
- Full control at each step
- Can log/audit all text
- Easier to debug
- Mix best-in-class components

Disadvantages:
- Higher latency (700-1200ms typical)
- Loses some emotion/nuance
- More components to manage
"""
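The control and auditability described above come from text being available between every stage. The following is a provider-agnostic sketch of that shape, with per-stage timing so latency can be attributed. The `Stage` type, `runPipeline`, and the three fake stages are hypothetical stand-ins and do not correspond to any real SDK; a real pipeline would also stream between stages rather than wait for each to finish.

```typescript
// A stage is any async transformation; real ones would call an STT, LLM, or TTS provider.
type Stage<I, O> = (input: I) => Promise<O>;

interface PipelineResult {
  transcript: string; // loggable text from the STT stage
  reply: string;      // loggable text from the LLM stage
  audio: Uint8Array;  // synthesized speech
  timingsMs: Record<string, number>;
}

// Runs one stage and records how long it took under the given name.
async function timed<I, O>(
  name: string,
  stage: Stage<I, O>,
  input: I,
  timings: Record<string, number>,
): Promise<O> {
  const start = Date.now();
  const output = await stage(input);
  timings[name] = Date.now() - start;
  return output;
}

async function runPipeline(
  audioIn: Uint8Array,
  stt: Stage<Uint8Array, string>,
  llm: Stage<string, string>,
  tts: Stage<string, Uint8Array>,
): Promise<PipelineResult> {
  const timingsMs: Record<string, number> = {};
  const transcript = await timed("stt", stt, audioIn, timingsMs);
  const reply = await timed("llm", llm, transcript, timingsMs);
  const audio = await timed("tts", tts, reply, timingsMs);
  return { transcript, reply, audio, timingsMs };
}

// Fake stages for illustration only.
const fakeStt: Stage<Uint8Array, string> = async (audio) => `heard ${audio.length} bytes`;
const fakeLlm: Stage<string, string> = async (text) => `You said: ${text}`;
// Placeholder "audio": one zero byte per character of the reply.
const fakeTts: Stage<string, Uint8Array> = async (text) => new Uint8Array(text.length);

const resultPromise = runPipeline(new Uint8Array(4), fakeStt, fakeLlm, fakeTts);
```

Because `transcript` and `reply` are plain strings, they can be logged, filtered, or rewritten before synthesis, which is the main reason to accept the extra latency of this architecture.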
Production Pipeline Example
""" import { Deepgram } from '@deepgram