[feat] Speaker diarization: optionally return each speaker's voiceprint centroid vector (spk_embedding_center) #2967
phoenixray2000 wants to merge 2 commits into
Add a return_spk_center option so AutoModel.generate surfaces the per-speaker centroid embeddings (mean of clustered chunk embeddings) that diarization already computes in postprocess() but currently discards. Lets downstream speaker voiceprint / identity reuse them without re-embedding.
Backward compatible: default off; postprocess return shape is unchanged unless return_spk_center=True.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request introduces a new option return_spk_center to retrieve per-speaker ERes2NetV2 centroids (speaker embedding centers) during speaker diarization. When enabled, the postprocess function returns both the results and the speaker centroids, which are then saved in the output dictionary. Feedback on these changes includes: 1) converting the PyTorch tensor spk_embedding.cpu() to a NumPy array using .numpy() before passing it to postprocess to match its type hint; 2) updating the return type hint of postprocess to Union[list, tuple] to reflect the conditional return type; and 3) optimizing performance by lazily computing spk_embs only when return_spk_center is enabled.
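Two of the review suggestions (the conditional return type and the lazy centroid computation) can be illustrated with a minimal sketch. This is not the actual funasr postprocess: the names postprocess_sketch and centroids_by_speaker, the argument layout, and the use of plain Python lists instead of NumPy arrays are all assumptions made so the example runs on its own.

```python
from typing import List, Tuple, Union

Vector = List[float]


def centroids_by_speaker(embeddings: List[Vector], labels: List[int]) -> List[Vector]:
    """Return the mean embedding per speaker id; row i belongs to speaker i.

    Assumes labels are contiguous ids 0..n-1 and every id occurs at least once.
    """
    n_speakers = max(labels) + 1
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(n_speakers)]
    counts = [0] * n_speakers
    for emb, spk in zip(embeddings, labels):
        counts[spk] += 1
        for j, value in enumerate(emb):
            sums[spk][j] += value
    return [[s / counts[i] for s in sums[i]] for i in range(n_speakers)]


def postprocess_sketch(
    segments: List[Tuple[float, float]],
    embeddings: List[Vector],
    labels: List[int],
    return_spk_center: bool = False,
) -> Union[list, tuple]:
    """Hypothetical stand-in for a diarization postprocess step."""
    results = [[start, end, spk] for (start, end), spk in zip(segments, labels)]
    if not return_spk_center:
        # Default path: same return shape as before, no centroid work done.
        return results
    # Centroids are computed only when the caller asks for them.
    return results, centroids_by_speaker(embeddings, labels)


segments = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]

plain = postprocess_sketch(segments, embeddings, labels)
with_centers = postprocess_sketch(segments, embeddings, labels, return_spk_center=True)
```

A caller that never passes return_spk_center keeps receiving a plain list, while an opted-in caller unpacks a (results, centroids) tuple.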
- pass np.ndarray (not torch.Tensor) to postprocess to match its type hint
- update postprocess return hint to Union[list, tuple]
- compute spk_embs lazily, only when return_spk_center=True
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
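Background / Motivation
When AutoModel.generate performs speaker diarization, postprocess() already computes a centroid vector for each speaker (the mean of the chunk embeddings in the same cluster), but discards it after use instead of returning it. Downstream code that needs speaker voiceprint or identity recognition then has to run the speaker embedding model a second time to re-extract the embeddings, which wastes compute.
Changes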
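Adds an optional parameter return_spk_center (default False):
- funasr/models/campplus/utils.py::postprocess: when return_spk_center=True, additionally returns spk_embs (one centroid per speaker, aligned with the speaker ids after correct_labels remapping).
- funasr/auto/auto_model.py: when generate(..., return_spk_center=True) is called, adds spk_embedding_center to the result, with shape [number of speakers, embedding dimension]; its row indices correspond one-to-one with the spk values in sentence_info. The per-chunk spk_embedding is still deleted as before, so the output does not grow.
Compatibility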
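Fully backward compatible: the option is off by default. Unless return_spk_center=True is passed, the return structure of postprocess() is unchanged, so existing callers (auto_model, auto_frontend) are unaffected.
Verification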
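Tested locally with paraformer-zh + ERes2NetV2 (spk_mode=punc_segment) on a two-speaker audio clip: spk_embedding_center has shape (2, 192), aligned with the 2 speakers in sentence_info; after L2 normalization, the cosine similarity between the two speakers' centroids is about 0.34 (distinguishable).
Usage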
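A minimal sketch of how downstream code might consume the new field, assuming a result dict shaped as this PR describes. The dict below is a hand-written mock with made-up 2-dimensional vectors, not real model output; the helper names l2_normalize and cosine_similarity and the "text" key are assumptions for illustration. In real use the dict would come from AutoModel.generate(..., return_spk_center=True), and the centroids would likely be an array rather than nested lists.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity computed as the dot product of L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Mock result (assumption): two speakers, toy 2-dimensional centroids.
result = {
    "sentence_info": [
        {"text": "hello", "spk": 0},
        {"text": "hi there", "spk": 1},
        {"text": "how are you", "spk": 0},
    ],
    "spk_embedding_center": [[3.0, 4.0], [4.0, 3.0]],
}

centers = result["spk_embedding_center"]

# Row i of spk_embedding_center is the centroid of speaker i, so the spk
# value of each sentence indexes directly into it.
sentence_centers = [centers[s["spk"]] for s in result["sentence_info"]]

# Compare the two speakers' centroids, as in the verification step above.
similarity = cosine_similarity(centers[0], centers[1])
```

The same cosine comparison could be run between a returned centroid and a stored reference voiceprint to decide whether a diarized speaker matches a known identity; the decision threshold is application-specific and not given in this PR.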
🤖 Generated with Claude Code