Voice Agents

Low Risk

by @sickn33 (Verified Source)

4.6407 installs, v1.0.0, Updated May 25, 2026

How to Use

Run in Claude Code terminal

Step 1: Add Marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

Step 2: Install Plugin

/plugin install voice-agents@antigravity-awesome-skills

About


name: voice-agents
description: Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems.
risk: safe
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27

Voice Agents

Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.

This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Humans expect a response within roughly 500ms, so every millisecond matters.

84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.

Principles

Latency is the constraint - target <800ms end-to-end
Jitter (variance) matters as much as absolute latency
VAD quality determines conversation flow
Interruption handling makes or breaks the experience
Start with focused MVP, iterate based on real conversations
Combine best-in-class components (Deepgram STT + ElevenLabs TTS)
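
The first two principles can be made concrete by measuring per-turn latency and its spread. The following is a minimal sketch, not part of the original skill: `summarizeLatency`, `LatencySummary`, and the sample values are hypothetical, and the 800 ms default simply mirrors the target stated above. Jitter is taken here as the population standard deviation, which is one reasonable choice among several.

```typescript
// Summarize per-turn response latencies (ms) against a latency budget.
// All names are illustrative; the 800 ms default mirrors the target above.

interface LatencySummary {
  mean: number;       // average latency in ms
  p95: number;        // 95th percentile (nearest-rank) in ms
  jitter: number;     // population standard deviation in ms
  overBudget: number; // number of turns that exceeded the budget
}

function summarizeLatency(samplesMs: number[], budgetMs = 800): LatencySummary {
  if (samplesMs.length === 0) {
    throw new Error('summarizeLatency needs at least one sample');
  }
  const n = samplesMs.length;
  const mean = samplesMs.reduce((sum, x) => sum + x, 0) / n;
  const variance = samplesMs.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
  const p95 = sorted[Math.max(0, Math.ceil(0.95 * n) - 1)];
  const overBudget = samplesMs.filter((x) => x > budgetMs).length;
  return { mean, p95, jitter: Math.sqrt(variance), overBudget };
}

// Two hypothetical agents with the same mean latency but very different jitter.
const steady = summarizeLatency([600, 620, 580, 610, 590]);
const spiky = summarizeLatency([300, 350, 1200, 320, 830]);
```

Both sample sets average 600 ms, yet only the second one breaks the budget on some turns, which is why the mean alone is a poor health signal for a voice agent.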

Capabilities

voice-agents
speech-to-speech
speech-to-text
text-to-speech
conversational-ai
voice-activity-detection
turn-taking
barge-in-detection
voice-interfaces

Scope

phone-system-integration → backend
audio-processing-dsp → audio-specialist
music-generation → audio-specialist
accessibility-compliance → accessibility-specialist

Tooling

Speech-to-speech

OpenAI Realtime API - When: Lowest latency, most natural conversation. Note: gpt-4o-realtime-preview, native voice, sub-500ms.
Pipecat - When: Open-source voice orchestration. Note: Daily-backed, enterprise-grade, modular.

Speech-to-text

OpenAI Whisper - When: Highest accuracy, multilingual. Note: gpt-4o-transcribe for best results.
Deepgram Nova-3 - When: Production workloads, 54% lower WER. Note: 150-184ms TTFT, 90%+ accuracy on noisy audio.
AssemblyAI - When: Real-time streaming, speaker diarization. Note: Good accuracy-latency balance.

Text-to-speech

ElevenLabs - When: Most natural voice, emotional control. Note: Flash model 75ms latency, V3 for expression.
OpenAI TTS - When: Integrated with OpenAI stack. Note: gpt-4o-mini-tts, 13 voices, streaming.
Deepgram Aura-2 - When: Cost-effective production TTS. Note: 40% cheaper than ElevenLabs, 184ms TTFB.

Frameworks

Pipecat - When: Open-source voice agent orchestration. Note: Silero VAD, SmartTurn, interruption handling.
Vapi - When: Managed voice agent platform. Note: No infrastructure management.
Retell AI - When: Low-latency voice agents. Note: Best context preservation on interruption.
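
Interruption handling and context preservation, which several of these frameworks advertise, come down to two steps: stop playback as soon as the user starts talking, and record only what the user actually heard. The sketch below illustrates that idea under stated assumptions. It is not how Pipecat, Vapi, or Retell AI implement it; `AudioOutput`, `BargeInController`, and the `[interrupted]` marker are hypothetical names introduced for illustration.

```typescript
// Minimal barge-in sketch. AudioOutput is a hypothetical interface standing in
// for whatever playback layer the application uses.

interface AudioOutput {
  stop(): void;          // halt playback and drop queued audio
  playedChars(): number; // characters of the current reply already spoken
}

type Turn = { role: 'user' | 'agent'; text: string };

class BargeInController {
  readonly history: Turn[] = [];
  private speaking = false;
  private currentReply = '';
  private readonly output: AudioOutput;

  constructor(output: AudioOutput) {
    this.output = output;
  }

  // Call when the agent starts speaking a reply.
  startReply(text: string): void {
    this.speaking = true;
    this.currentReply = text;
  }

  // Call when playback finishes without interruption.
  finishReply(): void {
    if (!this.speaking) return;
    this.history.push({ role: 'agent', text: this.currentReply });
    this.speaking = false;
  }

  // Call when VAD reports that the user started talking.
  // Returns true if an agent reply was cut off.
  onUserSpeechStart(): boolean {
    if (!this.speaking) return false;
    this.output.stop();
    // Keep only what was actually heard so the LLM context matches reality.
    const heard = this.currentReply.slice(0, this.output.playedChars());
    this.history.push({ role: 'agent', text: heard + ' [interrupted]' });
    this.speaking = false;
    return true;
  }
}

// Usage with a fake output that reports 10 characters as already played.
const fakeOutput = {
  stopped: false,
  stop() { this.stopped = true; },
  playedChars: () => 10,
};
const controller = new BargeInController(fakeOutput);
controller.startReply('Your order ships on Monday and arrives Friday.');
const interrupted = controller.onUserSpeechStart();
```

Truncating the stored reply matters because the next LLM call would otherwise assume the user heard the whole sentence.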

Patterns

Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

When to use: Maximum naturalness, emotional preservation, real-time conversation

SPEECH-TO-SPEECH ARCHITECTURE:

"""
[User Audio] → [S2S Model] → [Agent Audio]

Advantages:

Lowest latency (sub-500ms)
Preserves emotion, emphasis, accents
Most natural conversation flow

Disadvantages:

Less control over responses
Harder to debug/audit
Can't easily modify what's said
"""

OpenAI Realtime API

"""
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
});

// Configure for voice conversation
client.updateSession({
  modalities: ['text', 'audio'],
  voice: 'alloy',
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  instructions: `You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.`,
  turn_detection: {
    type: 'server_vad', // or 'semantic_vad'
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
  },
});

// Handle audio streams.
// Event names are kept as given in this skill; verify them against the client version you install.
client.on('conversation.item.input_audio_transcription', (event) => {
  console.log('User said:', event.transcript);
});

client.on('response.audio.delta', (event) => {
  // Stream audio to speaker (audioPlayer is assumed to be provided by the application)
  audioPlayer.write(Buffer.from(event.delta, 'base64'));
});

// Open the connection before sending audio
await client.connect();

// Send user audio (audioBuffer is assumed to hold captured PCM16 audio)
client.appendInputAudio(audioBuffer);
"""
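
To build intuition for the `turn_detection` settings above, the sketch below shows how a speech threshold and a trailing-silence duration combine to decide that a turn has ended. It is an energy-based illustration only and is not the algorithm behind `server_vad`. The RMS threshold used here is a stand-in and should not be read as the same quantity as the API's `threshold`; `EndpointConfig`, `rms`, and `findEndOfTurn` are hypothetical names.

```typescript
// Illustrative energy-based endpointing, NOT the algorithm behind server_vad.
// It only shows how a threshold and a silence duration interact.

interface EndpointConfig {
  threshold: number;         // RMS level (0..1) at or above which a frame counts as speech
  silenceDurationMs: number; // trailing silence required to end the turn
  frameMs: number;           // duration of each audio frame
}

// Root-mean-square level of a PCM16 frame, normalized to 0..1.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const x = frame[i] / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Returns the index of the frame at which the turn ends, or -1 if it has not ended yet.
function findEndOfTurn(frames: Int16Array[], cfg: EndpointConfig): number {
  const framesNeeded = Math.ceil(cfg.silenceDurationMs / cfg.frameMs);
  let heardSpeech = false;
  let silentRun = 0;
  for (let i = 0; i < frames.length; i++) {
    if (rms(frames[i]) >= cfg.threshold) {
      heardSpeech = true;
      silentRun = 0; // any speech resets the silence counter
    } else if (heardSpeech) {
      silentRun += 1;
      if (silentRun >= framesNeeded) return i;
    }
  }
  return -1;
}

// Usage: one silent lead-in frame, two speech frames, then five silent frames.
const loud = new Int16Array(160).fill(16384); // RMS of 0.5
const quiet = new Int16Array(160);            // all zeros
const cfg: EndpointConfig = { threshold: 0.1, silenceDurationMs: 500, frameMs: 100 };
const endIndex = findEndOfTurn([quiet, loud, loud, quiet, quiet, quiet, quiet, quiet], cfg);
```

A longer silence duration makes the agent less likely to cut the user off mid-pause but adds that same amount to every response, which is the trade-off the latency principle warns about.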

Use Cases:

Real-time customer support
Voice assistants
Interactive voice response (IVR)
Live language translation

Pipeline Architecture

Separate STT → LLM → TTS for maximum control

When to use: Need to know/control exactly what's said, debugging, compliance

PIPELINE ARCHITECTURE:

"""
[Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]

Advantages:

Full control at each step
Can log/audit all text
Easier to debug
Mix best-in-class components

Disadvantages:

Higher latency (700-1200ms typical)
Loses some emotion/nuance
More components to manage
"""
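
The flow above can be sketched as three injected stages with per-stage timing, which makes both the control points and the latency cost visible. This is a simplified illustration under stated assumptions: the `Stage` signatures, `runTurn`, and the stub stages are hypothetical, and real STT, LLM, and TTS SDKs will differ and usually stream partial results instead of returning whole ones.

```typescript
// Pipeline sketch with injected stages. The stage signatures are hypothetical;
// real SDKs differ and typically stream rather than return complete results.

type Stage<I, O> = (input: I) => Promise<O>;

interface PipelineStages {
  stt: Stage<Uint8Array, string>; // audio in, transcript out
  llm: Stage<string, string>;     // transcript in, reply text out
  tts: Stage<string, Uint8Array>; // reply text in, audio out
}

interface TurnResult {
  transcript: string;
  reply: string;
  audio: Uint8Array;
  timingsMs: { stt: number; llm: number; tts: number; total: number };
}

// Run an async step and report how long it took in milliseconds.
async function timed<T>(run: () => Promise<T>): Promise<[T, number]> {
  const start = Date.now();
  const value = await run();
  return [value, Date.now() - start];
}

// Execute one conversational turn sequentially: STT, then LLM, then TTS.
async function runTurn(stages: PipelineStages, audioIn: Uint8Array): Promise<TurnResult> {
  const [transcript, stt] = await timed(() => stages.stt(audioIn));
  const [reply, llm] = await timed(() => stages.llm(transcript));
  const [audio, tts] = await timed(() => stages.tts(reply));
  return { transcript, reply, audio, timingsMs: { stt, llm, tts, total: stt + llm + tts } };
}

// Stub stages for local testing; swap in real clients behind the same signatures.
const stubStages: PipelineStages = {
  stt: async () => 'what are your opening hours',
  llm: async (text) => `You asked: ${text}`,
  tts: async (text) => new Uint8Array(text.length),
};
```

Because every intermediate value is text, the transcript and reply can be logged or filtered before synthesis. The sequential awaits also show where the extra latency comes from: each stage waits for the previous one to finish.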

Production Pipeline Example

""" import { Deepgram } from '@deepgram

Compatible Tools

Claude Code, Cursor
