# System Architecture

A bidirectional streaming voice dialogue system supporting two modes: pure LLM conversation and Agent tool calling.
## Architecture Overview

```mermaid
graph LR
    FE["Frontend\nBrowser"]
    SRV["Server\nFastAPI"]
    ASR["ASR\nWhisper / Alibaba Cloud"]
    LLM["LLM\nOpenAI-compatible"]
    MCP["MCP Servers\nExternal tools (SSE / stdio)"]
    TTS["TTS Cluster\nGPT-SoVITS × N"]
    FE <-->|"WebSocket\n(Opus / JSON)"| SRV
    SRV -->|"PCM"| ASR
    ASR -->|"text"| SRV
    SRV <-->|"stream"| LLM
    SRV <-->|"SSE / stdio"| MCP
    SRV -->|"HTTP"| TTS
    TTS -->|"WAV"| SRV
```
## Core Data Structures

```mermaid
classDiagram
    class LLMChunk {
        +str text  # text for TTS, voice and intent tags stripped
        +str raw_text  # raw output incl. tags: voice / no_tool / need_tool
        +bool voice  # whether tagged voice; ChatHandler decides on TTS via use_voice
        +ToolCallInfo tool_call  # tool call, optional
        +bool round_end  # last chunk of this LLM round
    }
    class ToolCallInfo {
        +str name
        +dict arguments
        +str result
        +bool approval_required
    }
    class TTSResult {
        +bool success
        +str text
        +str raw_text
        +bytes audio_data
        +bool is_end
        +bool is_text_only
        +dict tool_call  # tool-call info, optional
        +bool is_round_end  # round-end signal
    }
    LLMChunk "1" --> "0..1" ToolCallInfo
    TTSResult ..> LLMChunk : converted by ChatHandler
```
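The class diagram above can be sketched as Python dataclasses. Field names and types come from the diagram; the dataclass form itself is an assumption about the implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallInfo:
    name: str
    arguments: dict
    result: Optional[str] = None
    approval_required: bool = False

@dataclass
class LLMChunk:
    text: str = ""                      # TTS-ready text, tags stripped
    raw_text: str = ""                  # raw output incl. <voice>/<no_tool>/<need_tool>
    voice: bool = False                 # carried a <voice> tag
    tool_call: Optional[ToolCallInfo] = None
    round_end: bool = False             # last chunk of this LLM round

@dataclass
class TTSResult:
    success: bool = True
    text: str = ""
    raw_text: str = ""
    audio_data: bytes = b""
    is_end: bool = False
    is_text_only: bool = False
    tool_call: Optional[dict] = None    # tool-call info passthrough
    is_round_end: bool = False          # round-end signal

chunk = LLMChunk(text="Hello", raw_text="<voice>Hello", voice=True)
print(chunk.voice)  # True
```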
## Sequence Diagrams

### 1. Voice input → AI voice reply

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant ASR
    participant CH as ChatHandler
    participant LLM as LLMClient
    participant TTS as TTSClient
    User->>Browser: speaks (VAD triggered)
    Browser->>Browser: MediaRecorder records Opus
    Browser->>Browser: silence detected → stop()
    Browser->>WS: voice_input {session_id, audio_base64, format:opus}
    WS->>ASR: opus_to_pcm() + recognize()
    ASR-->>WS: text (recognition result)
    WS->>Browser: {type:asr_result, text}
    Browser->>Browser: display user message
    WS->>CH: handle_voice_chat(messages, session_id)
    loop two-stage pipeline
        CH->>LLM: chat_stream(messages)
        LLM-->>CH: LLMChunk(text, voice=true)
        CH->>TTS: submit(task_id, text) → Future
        Note over CH: concurrent - LLM produces sentences while TTS synthesizes
        TTS-->>CH: TTSResult(audio_data)
        CH->>WS: push in task_id order
        WS->>Browser: {type:audio, audio:base64, format:opus}
        Browser->>Browser: enqueue to audioQueue → sequential playback
        Browser->>User: plays audio
    end
    CH->>TTS: submit(is_end=True)
    TTS-->>CH: TTSResult(is_end=True)
    WS->>Browser: {type:end}
    Browser->>Browser: checkAndSaveHistory()
    Browser->>Browser: returns to listening state
```
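The two-stage pipeline above can be sketched with asyncio. `fake_llm_sentences` and `fake_tts` are stand-ins for `chat_stream()` and the TTS backend; the real code streams over a WebSocket, but the overlap-then-reorder idea is the same.

```python
import asyncio

async def fake_llm_sentences():
    # Stand-in for chat_stream(): yields sentences as the LLM streams them.
    for s in ["Sentence A.", "Sentence B.", "Sentence C."]:
        await asyncio.sleep(0.01)
        yield s

async def fake_tts(task_id: int, text: str) -> bytes:
    # Stand-in for a TTS backend; later tasks may finish earlier.
    await asyncio.sleep(0.03 - task_id * 0.01)
    return f"<audio:{text}>".encode()

async def run_pipeline() -> list[bytes]:
    futures: dict[int, asyncio.Task] = {}
    task_id = 0
    async for sentence in fake_llm_sentences():
        # submit(task_id, text) → Future: synthesis overlaps LLM streaming
        futures[task_id] = asyncio.create_task(fake_tts(task_id, sentence))
        task_id += 1
    # Push results strictly in task_id order, regardless of completion order
    return [await futures[i] for i in range(task_id)]

print(asyncio.run(run_pipeline()))
```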
### 2. Agent tool-call flow

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant CH as ChatHandler
    participant AM as ApprovalManager
    participant AGC as AgentClient
    participant MCP as MCPClient
    participant LLM as LLM API
    User->>Browser: enters a request
    Browser->>WS: chat {use_agent:true, messages}
    WS->>CH: handle_text_chat(use_agent=True)
    Note over AGC,LLM: ── Round 0: planning round (no tools passed, intent detection) ──
    CH->>AGC: chat_stream(messages)
    AGC->>AGC: full_messages.append(planning_prompt)
    AGC->>LLM: completions (no tools, no tool descriptions)
    alt LLM outputs <no_tool> (small talk / Q&A)
        LLM-->>AGC: streams "<no_tool><voice>Hi there, how can I..."
        AGC->>AGC: detects <no_tool>, tag kept in raw_text
        AGC-->>CH: LLMChunk(voice=true, text="Hi there, how can I...")
        CH->>WS: {type:audio} voice reply (when use_voice=true)
        AGC->>AGC: removes planning_prompt, skips the Agent loop
        WS->>Browser: {type:end}
    else LLM outputs <need_tool> (tools required)
        LLM-->>AGC: streams "<need_tool><voice>OK, let me check..."
        AGC-->>CH: LLMChunk(voice=true, text="OK, let me check...")
        CH->>WS: {type:audio} announces the plan
        AGC-->>CH: LLMChunk(round_end=true)
        CH->>Browser: {type:round_end}
        Note over Browser: historySegments inserts a separator
        AGC->>AGC: removes planning_prompt, strips intent tags, appends assistant + [continue]
        loop Agent loop (at most max_agent_steps iterations)
            AGC->>LLM: completions (with tools definitions)
            LLM-->>AGC: streamed text + tool_calls
            AGC-->>CH: LLMChunk(tool_call: start)
            CH->>Browser: {type:tool_call, state:start}
            AGC-->>CH: LLMChunk(tool_call: approval_required)
            CH->>AM: register(session_id, task_id)
            CH->>Browser: {type:tool_call, state:approval_required}
            CH->>CH: create_task(approval_manager.wait(key))
            Note over CH: non-blocking, main loop keeps collecting TTS results
            alt user approves / auto-approve
                Browser->>WS: POST /api/tool-approve {approved:true}
                WS->>AM: respond(key, True) → Event.set()
            else user rejects
                Browser->>WS: POST /api/tool-approve {approved:false}
                WS->>AM: cancel() → Event.set()
                CH->>Browser: {type:tool_rejected}
                CH->>Browser: {type:interrupted}
                Note over CH: session interrupted
            end
            AGC->>MCP: call_tool(name, arguments)
            MCP-->>AGC: result
            AGC->>LLM: appends tool result, continues loop
            LLM-->>AGC: streams the final reply text
            AGC-->>CH: LLMChunk(voice=true, "The result is...")
            CH->>WS: {type:audio} announces the result
            AGC-->>CH: LLMChunk(round_end=true)
            CH->>Browser: {type:round_end}
        end
        WS->>Browser: {type:end}
    end
```
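The planning-round tag handling above can be sketched as follows. `parse_planning_chunk` is a hypothetical name (the real AgentClient works on streamed chunks), but the tag logic is the same idea: intent tags stay in `raw_text`, while the text destined for TTS has all tags stripped.

```python
import re

# Matches the intent/voice tags the planning prompt asks the LLM to emit
TAG_RE = re.compile(r"</?(?:no_tool|need_tool|voice)>")

def parse_planning_chunk(raw_text: str) -> dict:
    return {
        "raw_text": raw_text,                    # tags preserved
        "text": TAG_RE.sub("", raw_text),        # TTS-ready text
        "voice": "<voice>" in raw_text,          # drives TTS when use_voice=true
        "need_tool": "<need_tool>" in raw_text,  # enter the Agent loop?
    }

chunk = parse_planning_chunk("<need_tool><voice>OK, let me check...")
print(chunk["text"])       # OK, let me check...
print(chunk["need_tool"])  # True
```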
### 3. User interrupt flow

```mermaid
sequenceDiagram
    actor User
    participant Browser
    participant WS as WebSocket Server
    participant SM as SessionManager
    participant AM as ApprovalManager
    participant CH as ChatHandler
    participant LLM as LLMClient / AgentClient
    participant TTS as TTSClient Worker
    Note over Browser,TTS: conversation in progress (LLM streaming + TTS synthesizing)
    User->>Browser: clicks the "Interrupt" button
    Browser->>Browser: stops audio playback immediately (audioPlayer.pause)
    Browser->>Browser: clears audioQueue, resets approvalPendingTaskId
    Browser->>Browser: saves already-played content to conversationHistory
    Browser->>WS: POST /api/interrupt {session_id}
    WS->>SM: cancel(session_id)
    SM->>SM: sessions[id].cancelled = True
    SM->>AM: on_cancel callback → cancel_session(session_id)
    AM->>AM: wakes all pending Events (approved=False)
    Note over CH,LLM: each layer detects the interrupt on its next poll
    LLM->>SM: is_cancelled(session_id) → True
    LLM-->>LLM: exits the streaming loop (breaks out of async for chunk)
    Note over CH: approval_wait_task resolves with approved=False → interrupt
    CH->>SM: is_cancelled(session_id) → True
    CH-->>CH: exits the main loop
    TTS->>SM: is_cancelled(session_id) → True
    TTS-->>TTS: worker skips remaining TTS tasks
    WS->>Browser: {type:interrupted}
    Browser->>Browser: resets state → back to listening/input
```

LLM-layer interrupt checkpoints (polling-based, not real-time):
| Checkpoint | Component | Description |
|---|---|---|
| Planning-round streaming loop | AgentClient | checked before each stream chunk |
| Agent loop entry | AgentClient | checked before each round starts |
| Agent streaming loop | AgentClient | checked before each stream chunk |
| Before tool execution | AgentClient | checked before each tool call |
| After approval wait | AgentClient | checked after approval returns, in case the session was cancelled |
| LLM streaming loop | LLMClient | checked before each stream chunk (non-Agent mode) |
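The cancellation mechanics above can be sketched with asyncio primitives. The `SessionManager` / `ApprovalManager` names mirror the diagram; the internal attributes (`_pending`, `_cancelled`) are assumptions.

```python
import asyncio

class ApprovalManager:
    def __init__(self):
        # key → (wake-up Event, mutable approval state)
        self._pending: dict[str, tuple[asyncio.Event, dict]] = {}

    def register(self, key: str) -> None:
        self._pending[key] = (asyncio.Event(), {"approved": False})

    async def wait(self, key: str) -> bool:
        event, state = self._pending[key]
        await event.wait()
        return state["approved"]

    def respond(self, key: str, approved: bool) -> None:
        event, state = self._pending[key]
        state["approved"] = approved
        event.set()

    def cancel_session(self) -> None:
        # Wake every pending waiter with approved=False
        for key in list(self._pending):
            self.respond(key, False)

class SessionManager:
    def __init__(self, approval: ApprovalManager):
        self._cancelled: set[str] = set()
        self._approval = approval

    def cancel(self, session_id: str) -> None:
        self._cancelled.add(session_id)
        self._approval.cancel_session()  # on_cancel callback

    def is_cancelled(self, session_id: str) -> bool:
        # Polled by LLM / Agent / TTS layers at their checkpoints
        return session_id in self._cancelled

async def demo() -> bool:
    am = ApprovalManager()
    sm = SessionManager(am)
    am.register("s1:task0")
    waiter = asyncio.create_task(am.wait("s1:task0"))
    sm.cancel("s1")            # user interrupt
    approved = await waiter    # resolves with False
    return sm.is_cancelled("s1") and not approved

print(asyncio.run(demo()))  # True
```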
### 4. TTS load balancing + concurrency control

```mermaid
sequenceDiagram
    participant CH as ChatHandler
    participant TC as TTSClient
    participant SEM as Semaphore(3)
    participant Q as GlobalQueue
    participant W1 as Worker-1
    participant W2 as Worker-2
    participant LB as LoadBalancer:9880
    participant B1 as TTS backend:9881
    participant B2 as TTS backend:9882
    CH->>TC: submit(task_id=0, text="sentence A")
    TC->>Q: put {task_id:0, future_A}
    CH->>TC: submit(task_id=1, text="sentence B")
    TC->>Q: put {task_id:1, future_B}
    CH->>TC: submit(task_id=2, text="sentence C")
    TC->>Q: put {task_id:2, future_C}
    W1->>Q: get → task_id=0
    W2->>Q: get → task_id=1
    W1->>SEM: acquire (0 held)
    W2->>SEM: acquire (1 held)
    W1->>LB: GET /tts?text=sentence A
    LB->>B1: round-robin → :9881
    W2->>LB: GET /tts?text=sentence B
    LB->>B2: round-robin → :9882
    B1-->>W1: audio_data
    W1->>SEM: release
    W1-->>CH: future_A.set_result(TTSResult)
    B2-->>W2: audio_data
    W2->>SEM: release
    W2-->>CH: future_B.set_result(TTSResult)
    Note over CH: next_id=0 → push A first, then B (order preserved)
```
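A minimal sketch of this worker pool: a shared queue feeds the workers, a semaphore caps in-flight backend requests, and an `itertools.cycle` stands in for the round-robin load balancer on :9880. `fake_backend` replaces the real `GET /tts` call.

```python
import asyncio
import itertools

BACKENDS = [":9881", ":9882"]
_rr = itertools.cycle(BACKENDS)  # round-robin backend selector

async def fake_backend(backend: str, text: str) -> bytes:
    await asyncio.sleep(0.01)  # pretend synthesis latency
    return f"{backend}:{text}".encode()

async def worker(q: asyncio.Queue, sem: asyncio.Semaphore):
    while True:
        task_id, text, future = await q.get()
        async with sem:  # at most 3 in-flight backend requests
            audio = await fake_backend(next(_rr), text)
        future.set_result((task_id, audio))
        q.task_done()

async def run() -> list[tuple[int, bytes]]:
    q: asyncio.Queue = asyncio.Queue()
    sem = asyncio.Semaphore(3)
    workers = [asyncio.create_task(worker(q, sem)) for _ in range(2)]
    futures = []
    for task_id, text in enumerate(["sentence A", "sentence B", "sentence C"]):
        fut = asyncio.get_running_loop().create_future()
        futures.append(fut)
        await q.put((task_id, text, fut))
    # ChatHandler side: await futures in task_id order (the next_id pointer)
    results = [await fut for fut in futures]
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(run()))
```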
### 5. Multimodality (image attachments)

Users can upload images via the frontend attachment panel (click / drag-and-drop / paste); images are sent inline in the message as base64 data URIs.

#### Message format

The content field supports two formats:

```json
// plain text (legacy format)
{ "role": "user", "content": "Hello" }
// multimodal (image + text)
{ "role": "user", "content": [
    { "type": "text", "text": "What is this image?" },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
]}
```
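For providers without image support, list-form content has to collapse back to a plain string. `strip_images` is named in the processing flow; this body is an assumption about how it might work.

```python
def strip_images(messages: list[dict]) -> list[dict]:
    out = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):
            # keep only the text parts, drop image_url parts
            texts = [p["text"] for p in content if p.get("type") == "text"]
            content = "\n".join(texts)
        out.append({**msg, "content": content})
    return out

msgs = [{"role": "user", "content": [
    {"type": "text", "text": "What is this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]}]
print(strip_images(msgs))
# [{'role': 'user', 'content': 'What is this image?'}]
```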
#### Processing flow

```mermaid
sequenceDiagram
    participant FE as Frontend
    participant WS as WebSocket Server
    participant LC as LLMClient / AgentClient
    participant API as LLM API
    FE->>WS: content: [text, image_url]
    WS->>WS: validate_attachments()<br/>(MIME whitelist + 10 MB limit)
    alt validation fails
        WS-->>FE: {type:error, error:"unsupported image format"}
    else validation passes
        WS->>LC: messages (with multimodal content)
        LC->>LC: _build_messages()
        LC->>LC: sanitize_messages()<br/>(handles list content)
        alt multimodal = True
            LC->>API: sent as-is (image_url preserved)
        else multimodal = False
            LC->>LC: strip_images()<br/>(removes image_url, falls back to plain string)
            LC->>API: sends text-only messages
        end
        API-->>LC: streamed response
        LC-->>WS: LLMChunk
        WS-->>FE: text / audio
    end
```

#### Provider multimodal capability
Each provider in llm/providers.py is configured with "multimodal": True/False:

| Model | multimodal | Notes |
|---|---|---|
| claude-sonnet-4-6, claude-haiku-4-5, kimi-k2.5 | True | native image support |
| glm-5, qwen3.5-plus, deepseek-v3.2 | False | unsupported, images silently filtered |
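A sketch of what the per-provider config in llm/providers.py might look like; only the "multimodal" key is confirmed by this document, everything else here is illustrative.

```python
# Hypothetical shape of the provider table; real entries likely carry
# more fields (base_url, api_key env var, etc.).
PROVIDERS = {
    "claude-sonnet-4-6": {"multimodal": True},
    "glm-5": {"multimodal": False},
}

def supports_images(model: str) -> bool:
    # Unknown models default to text-only (assumption, to stay safe)
    return PROVIDERS.get(model, {}).get("multimodal", False)

print(supports_images("claude-sonnet-4-6"))  # True
print(supports_images("glm-5"))              # False
```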