
关于
使用双向 WebSocket 通信构建实时语音 AI 应用。
name: azure-ai-voicelive-py description: "构建具有双向 WebSocket 通信的实时语音 AI 应用。" risk: unknown source: community date_added: "2026-02-27"
Azure AI Voice Live SDK
构建具有双向 WebSocket 通信的实时语音 AI 应用。
安装
pip install azure-ai-voicelive aiohttp azure-identity
环境变量
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
认证
DefaultAzureCredential(推荐):
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
...
API 密钥:
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
model="gpt-4o-realtime-preview"
) as conn:
...
快速开始
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
# Update session with instructions
await conn.session.update(session={
"instructions": "You are a helpful assistant.",
"modalities": ["text", "audio"],
"voice": "alloy"
})
# Listen for events
async for event in conn:
print(f"Event: {event.type}")
if event.type == "response.audio_transcript.done":
print(f"Transcript: {event.transcript}")
elif event.type == "response.done":
break
asyncio.run(main())
核心架构
连接资源
VoiceLiveConnection 暴露以下资源:
| 资源 | 用途 | 关键方法 |
|------|------|----------|
| conn.session | 会话配置 | update(session=...) |
| conn.response | 模型响应 | create(), cancel() |
| conn.input_audio_buffer | 音频输入 | append(), commit(), clear() |
| conn.output_audio_buffer | 音频输出 | clear() |
| conn.conversation | 对话状态 | item.create(), item.delete(), item.truncate() |
| conn.transcription_session | 转录配置 | update(session=...) |
会话配置
from azure.ai.voicelive.models import RequestSession, FunctionTool
await conn.session.update(session=RequestSession(
instructions="You are a helpful voice assistant.",
modalities=["text", "audio"],
voice="alloy", # or "echo", "shimmer", "sage", etc.
input_audio_format="pcm16",
output_audio_format="pcm16",
turn_detection={
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
tools=[
FunctionTool(
type="function",
name="get_weather",
description="Get current weather",
parameters={
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
)
]
))
音频流
发送音频(Base64 PCM16)
import base64
# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
接收音频
async for event in conn:
if event.type == "response.audio.delta":
audio_bytes = base64.b64decode(event.delta)
await play_audio(audio_bytes)
elif event.type == "response.audio.done":
print("Audio complete")
事件处理
async for event in conn:
match event.type:
# Session events
case "session.created":
print(f"Session: {event.session}")
case "session.updated":
print("Session updated")
# Audio input events
case "input_audio_buffer.speech_started":
print(f"Speech started at {event.audio_start_ms}ms")
case "input_audio_buffer.speech_stopped":
print(f"Speech stopped at {event.audio_end_ms}ms")
# Transcription events
case "conversation.item.input_audio_transcription.completed":
print(f"User said: {event.transcript}")
兼容工具
Claude CodeCursor
标签
AI与机器学习