
About
Expert in building voice AI applications - from real-time voice
name: voice-ai-development description: Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. risk: unknown source: vibeship-spawner-skills (Apache 2.0) date_added: 2026-02-27
Voice AI Development
Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences.
Role: Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
Expertise
- Real-time audio streaming
- Voice agent architecture
- Provider selection
- Latency optimization
- Audio quality tuning
Capabilities
- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization
Prerequisites
- 0: Async programming
- 1: WebSocket basics
- 2: Audio concepts (sample rate, codec)
- Required skills: Python or Node.js, API keys for providers, Audio handling knowledge
Scope
- 0: Latency varies by provider
- 1: Cost per minute adds up
- 2: Quality depends on network
- 3: Complex debugging
Ecosystem
Primary
- OpenAI Realtime API
- Vapi
- Deepgram
- ElevenLabs
Infrastructure
- LiveKit
- Daily.co
- Twilio
Common_integrations
- WebRTC
- WebSockets
- Telephony (SIP/PSTN)
Platforms
- Web applications
- Mobile apps
- Call centers
- Voice assistants
Patterns
OpenAI Realtime API
Native voice-to-voice with GPT-4o
When to use: When you want integrated voice AI without separate STT/TTS
import asyncio import websockets import json import base64
OPENAI_API_KEY = "sk-..."
async def voice_session(): url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview" headers = { "Authorization": f"Bearer {OPENAI_API_KEY}", "OpenAI-Beta": "realtime=v1" }
async with websockets.connect(url, extra_headers=headers) as ws:
# Configure session
await ws.send(json.dumps({
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"voice": "alloy", # alloy, echo, fable, onyx, nova, shimmer
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"input_audio_transcription": {
"model": "whisper-1"
},
"turn_detection": {
"type": "server_vad", # Voice activity detection
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
"tools": [
{
"type": "function",
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
}
]
}
}))
# Send audio (PCM16, 24kHz, mono)
async def send_audio(audio_bytes):
await ws.send(json.dumps({
"type": "input_audio_buffer.append",
"audio": base64.b64encode(audio_bytes).decode()
}))
# Receive events
async for message in ws:
event = json.loads(message)
if event["type"] == "response.audio.delta":
# Play audio chunk
audio = base64.b64decode(event["delta"])
play_audio(audio)
elif event["type"] == "response.audio_transcript.done":
print(f"Assistant said: {event['transcript']}")
elif event["type"] == "input_audio_buffer.speech_started":
print("User started speaking")
elif event["type"] == "response.function_call_arguments.done":
# Handle tool call
name = event["name"]
args = json.loads(event["arguments"])
result = call_function(name, args)
await ws.send(json.dumps({
"type": "conversation.item.create",
"item": {