System Architecture Document

A bidirectional streaming voice-conversation system supporting two modes: pure LLM chat and Agent tool calling.


Architecture Overview

```mermaid
graph LR
    FE["Frontend\nBrowser"]
    SRV["Server\nFastAPI"]
    ASR["ASR\nWhisper / Alibaba Cloud"]
    LLM["LLM\nOpenAI Compatible"]
    MCP["MCP Servers\nExternal tools (SSE / stdio)"]
    TTS["TTS Cluster\nGPT-SoVITS × N"]
    FE <-->|"WebSocket\n(Opus / JSON)"| SRV
    SRV -->|"PCM"| ASR
    ASR -->|"text"| SRV
    SRV <-->|"stream"| LLM
    SRV <-->|"SSE / stdio"| MCP
    SRV -->|"HTTP"| TTS
    TTS -->|"WAV"| SRV
```

Core Data Structures

```mermaid
classDiagram
    class LLMChunk {
        +str text  # TTS-facing text (voice/intent tags stripped)
        +str raw_text  # raw output (with voice/no_tool/need_tool tags)
        +bool voice  # has a voice tag (ChatHandler decides on TTS via use_voice)
        +ToolCallInfo tool_call  # tool call (optional)
        +bool round_end  # last chunk of this LLM round
    }
    class ToolCallInfo {
        +str name
        +dict arguments
        +str result
        +bool approval_required
    }
    class TTSResult {
        +bool success
        +str text
        +str raw_text
        +bytes audio_data
        +bool is_end
        +bool is_text_only
        +dict tool_call  # tool-call info (optional)
        +bool is_round_end  # round-end signal
    }
    LLMChunk "1" --> "0..1" ToolCallInfo
    TTSResult ..> LLMChunk : converted by ChatHandler
```
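The structures above can be sketched as Python dataclasses. Field names and types come from the diagram; the defaults and the `chunk_to_result` helper are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallInfo:
    name: str
    arguments: dict
    result: str = ""
    approval_required: bool = False

@dataclass
class LLMChunk:
    text: str = ""            # TTS-facing text, tags stripped
    raw_text: str = ""        # raw output incl. <voice>/<no_tool>/<need_tool> tags
    voice: bool = False       # whether the chunk carried a <voice> tag
    tool_call: Optional[ToolCallInfo] = None
    round_end: bool = False   # last chunk of this LLM round

@dataclass
class TTSResult:
    success: bool = True
    text: str = ""
    raw_text: str = ""
    audio_data: bytes = b""
    is_end: bool = False
    is_text_only: bool = False
    tool_call: Optional[dict] = None
    is_round_end: bool = False

# A ChatHandler-style conversion from LLMChunk to TTSResult (hypothetical helper)
def chunk_to_result(chunk: LLMChunk, audio: bytes = b"") -> TTSResult:
    return TTSResult(
        text=chunk.text,
        raw_text=chunk.raw_text,
        audio_data=audio,
        is_text_only=not chunk.voice,
        is_round_end=chunk.round_end,
    )
```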

Sequence Diagrams

1. Voice Input → AI Voice Reply

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant ASR
    participant CH as ChatHandler
    participant LLM as LLMClient
    participant TTS as TTSClient
    User->>Browser: speaks (triggers VAD)
    Browser->>Browser: MediaRecorder records Opus
    Browser->>Browser: silence detected → stop()
    Browser->>WS: voice_input {session_id, audio_base64, format:opus}
    WS->>ASR: opus_to_pcm() + recognize()
    ASR-->>WS: text (recognition result)
    WS->>Browser: {type:asr_result, text}
    Browser->>Browser: display user message
    WS->>CH: handle_voice_chat(messages, session_id)
    loop Two-stage pipeline
        CH->>LLM: chat_stream(messages)
        LLM-->>CH: LLMChunk(text, voice=true)
        CH->>TTS: submit(task_id, text) → Future
        Note over CH: concurrent: LLM produces sentences while TTS synthesizes
        TTS-->>CH: TTSResult(audio_data)
        CH->>WS: push in task_id order
        WS->>Browser: {type:audio, audio:base64, format:opus}
        Browser->>Browser: enqueue to audioQueue → play in order
        Browser->>User: play audio
    end
    CH->>TTS: submit(is_end=True)
    TTS-->>CH: TTSResult(is_end=True)
    WS->>Browser: {type:end}
    Browser->>Browser: checkAndSaveHistory()
    Browser->>Browser: return to listening state
```
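The heart of the pipeline is that TTS futures are created as soon as each sentence arrives, while delivery stays strictly in `task_id` order. A minimal asyncio sketch, with `fake_tts` standing in for the real synthesizer (names and timings are assumptions):

```python
import asyncio

async def fake_tts(task_id: int, text: str) -> tuple[int, bytes]:
    # Stand-in for a real TTS call; later tasks may finish first
    await asyncio.sleep(0.01 * (3 - task_id))
    return task_id, f"audio:{text}".encode()

async def pipeline(sentences: list[str]) -> list[bytes]:
    # Submit every sentence immediately (synthesis overlaps with LLM output)
    futures = [asyncio.ensure_future(fake_tts(i, s)) for i, s in enumerate(sentences)]
    buffered: dict[int, bytes] = {}
    pushed: list[bytes] = []
    next_id = 0
    for fut in asyncio.as_completed(futures):
        task_id, audio = await fut
        buffered[task_id] = audio
        # Push strictly in task_id order, buffering out-of-order completions
        while next_id in buffered:
            pushed.append(buffered.pop(next_id))
            next_id += 1
    return pushed

audio = asyncio.run(pipeline(["A", "B", "C"]))
```

Even though sentence C finishes synthesis first here, the buffer holds it back until A and B have been pushed.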

2. Agent Tool-Call Flow

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant CH as ChatHandler
    participant AM as ApprovalManager
    participant AGC as AgentClient
    participant MCP as MCPClient
    participant LLM as LLM API
    User->>Browser: types a request
    Browser->>WS: chat {use_agent:true, messages}
    WS->>CH: handle_text_chat(use_agent=True)
    Note over AGC,LLM: ── Round 0: planning round (no tools passed, intent detection) ──
    CH->>AGC: chat_stream(messages)
    AGC->>AGC: full_messages.append(planning_prompt)
    AGC->>LLM: completions (no tools, no tool descriptions)
    alt LLM outputs <no_tool> (small talk / Q&A)
        LLM-->>AGC: streams "<no_tool><voice>Hi there, how can I..."
        AGC->>AGC: detects <no_tool>; tag kept in raw_text
        AGC-->>CH: LLMChunk(voice=true, text="Hi there, how can I...")
        CH->>WS: {type:audio} voice reply (when use_voice=true)
        AGC->>AGC: removes planning_prompt, skips the Agent loop
        WS->>Browser: {type:end}
    else LLM outputs <need_tool> (tools required)
        LLM-->>AGC: streams "<need_tool><voice>OK, let me check..."
        AGC-->>CH: LLMChunk(voice=true, text="OK, let me check...")
        CH->>WS: {type:audio} speaks the plan
        AGC-->>CH: LLMChunk(round_end=true)
        CH->>Browser: {type:round_end}
        Note over Browser: historySegments inserts a separator
        AGC->>AGC: removes planning_prompt, strips intent tags, appends assistant + [continue]
        loop Agent loop (at most max_agent_steps iterations)
            AGC->>LLM: completions (with tools definitions)
            LLM-->>AGC: streamed text + tool_calls
            AGC-->>CH: LLMChunk(tool_call: start)
            CH->>Browser: {type:tool_call, state:start}
            AGC-->>CH: LLMChunk(tool_call: approval_required)
            CH->>AM: register(session_id, task_id)
            CH->>Browser: {type:tool_call, state:approval_required}
            CH->>CH: create_task(approval_manager.wait(key))
            Note over CH: non-blocking; main loop keeps collecting TTS results
            alt user approves / auto-approve
                Browser->>WS: POST /api/tool-approve {approved:true}
                WS->>AM: respond(key, True) → Event.set()
            else user rejects
                Browser->>WS: POST /api/tool-approve {approved:false}
                WS->>AM: cancel() → Event.set()
                CH->>Browser: {type:tool_rejected}
                CH->>Browser: {type:interrupted}
                Note over CH: session aborted
            end
            AGC->>MCP: call_tool(name, arguments)
            MCP-->>AGC: result
            AGC->>LLM: appends tool result, continues the loop
            LLM-->>AGC: streams final reply text
            AGC-->>CH: LLMChunk(voice=true, "The result is...")
            CH->>WS: {type:audio} speaks the result
            AGC-->>CH: LLMChunk(round_end=true)
            CH->>Browser: {type:round_end}
        end
        WS->>Browser: {type:end}
    end
```
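The planning round's branch hinges on reading the leading `<no_tool>`/`<need_tool>` tag off the stream while keeping it in `raw_text` and producing clean TTS text. A minimal sketch, assuming the tags are simple prefixes and `<voice>` markers are stripped for TTS (the function name and regex are hypothetical):

```python
import re

INTENT_TAGS = ("<no_tool>", "<need_tool>")

def parse_planning_output(raw: str) -> tuple[str, str]:
    """Split a planning-round output into (intent, tts_text)."""
    intent = "no_tool"  # assumed default when no tag is present
    text = raw
    for tag in INTENT_TAGS:
        if text.startswith(tag):
            intent = tag.strip("<>")
            text = text[len(tag):]
            break
    # Remove <voice>...</voice> style markers for the TTS-facing text
    text = re.sub(r"</?voice>", "", text)
    return intent, text
```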

3. User Interrupt Flow

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant SM as SessionManager
    participant AM as ApprovalManager
    participant CH as ChatHandler
    participant LLM as LLMClient / AgentClient
    participant TTS as TTSClient Worker
    Note over Browser,TTS: conversation in progress (LLM streaming + TTS synthesizing)
    User->>Browser: clicks the "Interrupt" button
    Browser->>Browser: stops audio playback immediately (audioPlayer.pause)
    Browser->>Browser: clears audioQueue, resets approvalPendingTaskId
    Browser->>Browser: saves already-played content to conversationHistory
    Browser->>WS: POST /api/interrupt {session_id}
    WS->>SM: cancel(session_id)
    SM->>SM: sessions[id].cancelled = True
    SM->>AM: on_cancel callback → cancel_session(session_id)
    AM->>AM: wakes all pending Events (approved=False)
    Note over CH,LLM: each layer detects the interrupt on its next poll
    LLM->>SM: is_cancelled(session_id) → True
    LLM-->>LLM: exits the streaming loop (breaks out of async for chunk)
    Note over CH: approval_wait_task.done() returns False → interrupt
    CH->>SM: is_cancelled(session_id) → True
    CH-->>CH: exits the main loop
    TTS->>SM: is_cancelled(session_id) → True
    TTS-->>TTS: worker skips remaining TTS tasks
    WS->>Browser: {type:interrupted}
    Browser->>Browser: resets state → back to listening/input
```
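The key ApprovalManager behavior in this flow is that a session cancel wakes every pending approval waiter with `approved=False`. A minimal sketch using `asyncio.Event`; class and method names follow the diagrams, but the internals are assumptions:

```python
import asyncio

class ApprovalManager:
    def __init__(self):
        # key → (wake-up event, single-slot result box)
        self._pending: dict[tuple[str, int], tuple[asyncio.Event, list]] = {}

    def register(self, session_id: str, task_id: int) -> tuple[str, int]:
        key = (session_id, task_id)
        self._pending[key] = (asyncio.Event(), [False])
        return key

    async def wait(self, key) -> bool:
        event, result = self._pending[key]
        await event.wait()
        self._pending.pop(key, None)
        return result[0]

    def respond(self, key, approved: bool) -> None:
        event, result = self._pending[key]
        result[0] = approved
        event.set()

    def cancel_session(self, session_id: str) -> None:
        # Wake every pending waiter for this session with approved=False
        for (sid, _), (event, result) in self._pending.items():
            if sid == session_id:
                result[0] = False
                event.set()

async def demo() -> bool:
    am = ApprovalManager()
    key = am.register("s1", 0)
    waiter = asyncio.ensure_future(am.wait(key))
    am.cancel_session("s1")  # interrupt: all pending approvals denied
    return await waiter

approved = asyncio.run(demo())
```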

LLM-layer interrupt checkpoints (polling-based, not real-time):

| Checkpoint | Component | Notes |
| --- | --- | --- |
| Planning-round streaming loop | AgentClient | checked before each stream chunk |
| Agent loop entry | AgentClient | checked before each round starts |
| Agent streaming loop | AgentClient | checked before each stream chunk |
| Before tool execution | AgentClient | checked before each tool call |
| After approval wait | AgentClient | checks for cancellation after the approval returns |
| LLM streaming loop | LLMClient | checked before each stream chunk (non-Agent mode) |
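Each checkpoint above is a cooperative poll of the session's cancelled flag before consuming the next chunk. A sketch with a hypothetical SessionManager and a fake stream; the interrupt here is simulated in-loop rather than arriving over HTTP:

```python
import asyncio

class SessionManager:
    def __init__(self):
        self._cancelled: set[str] = set()

    def cancel(self, session_id: str) -> None:
        self._cancelled.add(session_id)

    def is_cancelled(self, session_id: str) -> bool:
        return session_id in self._cancelled

async def fake_stream(n: int):
    for i in range(n):
        await asyncio.sleep(0)  # yield control, like a network read
        yield f"chunk-{i}"

async def stream_with_cancel(sm: SessionManager, session_id: str, cancel_at: int) -> list[str]:
    received: list[str] = []
    async for chunk in fake_stream(10):
        # Checkpoint: poll before handling each stream chunk
        if sm.is_cancelled(session_id):
            break
        received.append(chunk)
        if len(received) == cancel_at:
            sm.cancel(session_id)  # simulate /api/interrupt arriving mid-stream
    return received

out = asyncio.run(stream_with_cancel(SessionManager(), "s1", 3))
```

Because the check is a poll, at most one extra chunk is consumed after cancellation, which is why the table calls the mechanism "not real-time".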

4. TTS Load Balancing + Concurrency Control

```mermaid
sequenceDiagram
    participant CH as ChatHandler
    participant TC as TTSClient
    participant SEM as Semaphore(3)
    participant Q as GlobalQueue
    participant W1 as Worker-1
    participant W2 as Worker-2
    participant LB as LoadBalancer:9880
    participant B1 as TTS backend:9881
    participant B2 as TTS backend:9882
    CH->>TC: submit(task_id=0, text="sentence A")
    TC->>Q: put {task_id:0, future_A}
    CH->>TC: submit(task_id=1, text="sentence B")
    TC->>Q: put {task_id:1, future_B}
    CH->>TC: submit(task_id=2, text="sentence C")
    TC->>Q: put {task_id:2, future_C}
    W1->>Q: get → task_id=0
    W2->>Q: get → task_id=1
    W1->>SEM: acquire (0 held)
    W2->>SEM: acquire (1 held)
    W1->>LB: GET /tts?text=sentence A
    LB->>B1: round-robin → :9881
    W2->>LB: GET /tts?text=sentence B
    LB->>B2: round-robin → :9882
    B1-->>W1: audio_data
    W1->>SEM: release
    W1-->>CH: future_A.set_result(TTSResult)
    B2-->>W2: audio_data
    W2->>SEM: release
    W2-->>CH: future_B.set_result(TTSResult)
    Note over CH: next_id=0 → push A first, then B (order preserved)
```
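A rough sketch of the worker side: a shared queue, a semaphore capping in-flight requests, and round-robin backend selection. The backend URLs and the in-line synthesis stub are placeholders for the real `GET /tts` call:

```python
import asyncio
import itertools

BACKENDS = ["http://127.0.0.1:9881", "http://127.0.0.1:9882"]

async def run_tts(tasks: list[str], workers: int = 2, max_concurrent: int = 3) -> dict[int, str]:
    queue: asyncio.Queue = asyncio.Queue()
    sem = asyncio.Semaphore(max_concurrent)    # caps in-flight TTS requests
    backend_cycle = itertools.cycle(BACKENDS)  # round-robin load balancing
    results: dict[int, str] = {}

    for task_id, text in enumerate(tasks):
        queue.put_nowait((task_id, text))

    async def worker():
        while True:
            try:
                task_id, text = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left, worker exits
            async with sem:
                backend = next(backend_cycle)
                # Stand-in for: GET {backend}/tts?text=...
                await asyncio.sleep(0)
                results[task_id] = f"{backend}/{text}"
            queue.task_done()

    await asyncio.gather(*[worker() for _ in range(workers)])
    return results

out = asyncio.run(run_tts(["A", "B", "C"]))
```

In the real system each worker resolves the submitted Future instead of writing into a dict, and ChatHandler reorders completions by `task_id` as shown in the diagram's final note.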

5. Multimodal (Image Attachments)

Users can upload images through the frontend attachment panel (click / drag-and-drop / paste); images are sent inline in the message as base64 data URIs.

Message Format

The `content` field supports two formats:

```json
// Plain text (traditional format)
{ "role": "user", "content": "Hello" }

// Multimodal (image + text)
{ "role": "user", "content": [
    { "type": "text", "text": "What is in this image?" },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
]}
```
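Building the multimodal form from raw image bytes is just base64 encoding plus the content-list shape above; a small sketch (the function name is hypothetical):

```python
import base64

def build_image_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    # Inline the image as a base64 data URI in the OpenAI-style content list
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

msg = build_image_message("What is in this image?", b"\x89PNG...", "image/png")
```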

Processing Flow

```mermaid
sequenceDiagram
    participant FE as Frontend
    participant WS as WebSocket Server
    participant LC as LLMClient / AgentClient
    participant API as LLM API
    FE->>WS: content: [text, image_url]
    WS->>WS: validate_attachments()<br/>(MIME whitelist + 10 MB limit)
    alt validation fails
        WS-->>FE: {type:error, error:"unsupported image format"}
    else validation passes
        WS->>LC: messages (with multimodal content)
        LC->>LC: _build_messages()
        LC->>LC: sanitize_messages()<br/>(handles list content)
        alt multimodal = True
            LC->>API: sent as-is (image_url preserved)
        else multimodal = False
            LC->>LC: strip_images()<br/>(removes image_url, degrades to string)
            LC->>API: sends text-only messages
        end
        API-->>LC: streamed response
        LC-->>WS: LLMChunk
        WS-->>FE: text / audio
    end
```
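The `strip_images()` step for non-multimodal providers can be sketched as: keep the text parts of a content list and collapse them back into a plain string. This is a minimal sketch, not the actual implementation:

```python
def strip_images(messages: list[dict]) -> list[dict]:
    """Degrade multimodal content lists to plain strings for text-only providers."""
    out = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Keep only the text parts; drop image_url entries silently
            text = "".join(p.get("text", "") for p in content if p.get("type") == "text")
            msg = {**msg, "content": text}
        out.append(msg)
    return out
```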

Provider Multimodal Capability

Each provider in llm/providers.py is configured with `"multimodal": True/False`:

| Model | multimodal | Notes |
| --- | --- | --- |
| claude-sonnet-4-6, claude-haiku-4-5, kimi-k2.5 | True | native image support |
| glm-5, qwen3.5-plus, deepseek-v3.2 | False | unsupported; images are silently filtered out |