[feat] Speaker diarization: optionally return each speaker's voiceprint centroid vector (spk_embedding_center) #2967

Open
phoenixray2000 wants to merge 2 commits into modelscope:main from phoenixray2000:feat/spk-embedding-center


@phoenixray2000

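Background / Motivation

When AutoModel.generate performs speaker diarization, postprocess() already computes a centroid vector for each speaker (the mean of the chunk embeddings within one cluster), but discards it after use and never returns it. Downstream code that wants speaker voiceprint or identity recognition has to run a voiceprint model a second time to re-extract embeddings, which wastes a full pass of compute.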
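The centroid described above can be sketched as follows. This is a minimal illustration, not the FunASR code: the function name `speaker_centroids` and its arguments are hypothetical, and it assumes cluster labels run from 0 to K-1 with at least one chunk per label.

```python
import numpy as np


def speaker_centroids(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return one centroid per speaker id: the mean of that speaker's chunk embeddings.

    embeddings: [num_chunks, dim] chunk-level speaker embeddings.
    labels:     [num_chunks] cluster id per chunk, assumed to be 0..K-1
                with every id present at least once.
    """
    num_speakers = int(labels.max()) + 1
    # Row k is the mean over all chunks assigned to speaker k.
    return np.stack(
        [embeddings[labels == k].mean(axis=0) for k in range(num_speakers)]
    )


# Toy data: three chunks, two speakers, 2-dimensional embeddings.
chunk_embs = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
chunk_labels = np.array([0, 0, 1])
centers = speaker_centroids(chunk_embs, chunk_labels)
print(centers.shape)  # (2, 2): two speakers, two dimensions
```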

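Changes

Adds an optional parameter return_spk_center (default False):

  • funasr/models/campplus/utils.py::postprocess: when return_spk_center=True, additionally returns spk_embs (one centroid per speaker, aligned with the speaker ids after correct_labels has remapped them).
  • funasr/auto/auto_model.py: when generate(..., return_spk_center=True) is called, adds spk_embedding_center to the result, with shape [num_speakers, embedding_dim]; its row indices correspond one-to-one with spk in sentence_info. The per-chunk spk_embedding is still deleted as before, so the output does not grow.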
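The opt-in return pattern in the two changes above might look roughly like this. It is a sketch under stated assumptions: `postprocess_sketch` and its parameters are hypothetical stand-ins, not the actual postprocess() signature in funasr/models/campplus/utils.py, and label correction is omitted.

```python
from typing import Union

import numpy as np


def postprocess_sketch(
    segments: list,
    labels: np.ndarray,
    embeddings: np.ndarray,
    return_spk_center: bool = False,
) -> Union[list, tuple]:
    """Hypothetical stand-in for postprocess(); not the FunASR implementation."""
    # Attach a speaker id to each (start, end) segment.
    results = [[start, end, int(spk)] for (start, end), spk in zip(segments, labels)]
    if not return_spk_center:
        return results  # legacy return shape, unchanged for existing callers
    # Centroids are computed only when the caller asks for them.
    num_speakers = int(labels.max()) + 1
    spk_embs = np.stack(
        [embeddings[labels == k].mean(axis=0) for k in range(num_speakers)]
    )
    return results, spk_embs


segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.0)]
labels = np.array([0, 1, 0])
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]])

legacy = postprocess_sketch(segments, labels, embeddings)
results, spk_embs = postprocess_sketch(
    segments, labels, embeddings, return_spk_center=True
)
```

Keeping the flag off by default means a caller that unpacks a plain list keeps working, while a caller that opts in must unpack a two-element tuple.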

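Compatibility

Fully backward compatible: the option is off by default. Unless return_spk_center=True is passed, the return structure of postprocess() is unchanged, and existing callers (auto_model, auto_frontend) are unaffected.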

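Verification

Tested locally with paraformer-zh + ERes2NetV2 (spk_mode=punc_segment) on a two-speaker audio clip: spk_embedding_center has shape (2, 192) and aligns with the 2 speakers in sentence_info; after L2 normalization, the cosine similarity between the two speakers' centroids is about 0.34 (distinguishable).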
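A check of this kind can be reproduced with a few lines of NumPy. The helper name `centroid_cosine` is hypothetical and the input below is toy data, not the centroids from the test above.

```python
import numpy as np


def centroid_cosine(centers: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a [num_speakers, dim] array."""
    norms = np.linalg.norm(centers, axis=1, keepdims=True)
    unit = centers / np.clip(norms, 1e-12, None)  # L2-normalize, guard zero rows
    return unit @ unit.T


# Toy centroids; real ones would come from res[0]["spk_embedding_center"].
toy_centers = np.array([[3.0, 4.0], [4.0, 3.0]])
sim = centroid_cosine(toy_centers)
print(round(float(sim[0, 1]), 2))  # 0.96
```

The off-diagonal entry sim[0, 1] is the similarity between speaker 0 and speaker 1; lower values indicate centroids that are easier to tell apart.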

Usage

res = model.generate(input=wav, return_spk_center=True)
centers = res[0]["spk_embedding_center"]  # np.ndarray, shape [num_speakers, embedding_dim]
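As one possible downstream use, the returned centroids could be compared against previously enrolled voiceprints. Everything in this sketch is an assumption for illustration: the `identify_speakers` helper, the `enrolled` dictionary, and the 0.5 threshold are hypothetical and not part of this PR or of FunASR.

```python
import numpy as np


def identify_speakers(centers: np.ndarray, enrolled: dict, threshold: float = 0.5) -> dict:
    """Map each diarized speaker index to the best-matching enrolled name, or None.

    centers:   [num_speakers, dim], e.g. res[0]["spk_embedding_center"].
    enrolled:  hypothetical {name: [dim] reference embedding} lookup.
    threshold: arbitrary cosine cut-off chosen for this example only.
    """
    names = list(enrolled)
    ref = np.stack([enrolled[n] for n in names])
    # L2-normalize both sides so the dot product is cosine similarity.
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sims = centers @ ref.T  # [num_speakers, num_enrolled]
    best = sims.argmax(axis=1)
    return {
        spk: names[j] if sims[spk, j] >= threshold else None
        for spk, j in enumerate(best)
    }


# Toy data standing in for real centroids and enrolled voiceprints.
toy_centers = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0]])
enrolled = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
mapping = identify_speakers(toy_centers, enrolled)
```

Because the row index of spk_embedding_center matches spk in sentence_info, the keys of `mapping` can be used directly to label each sentence.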

🤖 Generated with Claude Code

Add a return_spk_center option so AutoModel.generate surfaces the per-speaker centroid embeddings (mean of clustered chunk embeddings) that diarization already computes in postprocess() but currently discards. Lets downstream speaker voiceprint / identity reuse them without re-embedding. Backward compatible: default off; postprocess return shape is unchanged unless return_spk_center=True.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new option return_spk_center to retrieve per-speaker ERes2NetV2 centroids (speaker embedding centers) during speaker diarization. When enabled, the postprocess function returns both the results and the speaker centroids, which are then saved in the output dictionary. Feedback on these changes includes: 1) converting the PyTorch tensor spk_embedding.cpu() to a NumPy array using .numpy() before passing it to postprocess to match its type hint; 2) updating the return type hint of postprocess to Union[list, tuple] to reflect the conditional return type; and 3) optimizing performance by lazily computing spk_embs only when return_spk_center is enabled.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread funasr/auto/auto_model.py Outdated
Comment thread funasr/models/campplus/utils.py Outdated
Comment thread funasr/models/campplus/utils.py
- pass np.ndarray (not torch.Tensor) to postprocess to match its type hint
- update postprocess return hint to Union[list, tuple]
- compute spk_embs lazily, only when return_spk_center=True

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
