# 3D-Speaker
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. Furthermore, we present a large-scale speech corpus, also called 3D-Speaker-Dataset, to facilitate research on speech representation disentanglement.
EER results on the VoxCeleb, CN-Celeb and 3D-Speaker datasets for fully supervised speaker verification:

| Model | Params | VoxCeleb1-O | CN-Celeb | 3D-Speaker |
|---|---|---|---|---|
| Res2Net | 4.03 M | 1.56% | 7.96% | 8.03% |
| ResNet34 | 6.34 M | 1.05% | 6.92% | 7.29% |
| ECAPA-TDNN | 20.8 M | 0.86% | 8.01% | 8.87% |
| ERes2Net-base | 6.61 M | 0.84% | 6.69% | 7.21% |
| CAM++ | 7.2 M | 0.65% | 6.78% | 7.75% |
| ERes2NetV2 | 17.8 M | 0.61% | 6.14% | 6.52% |
| ERes2Net-large | 22.46 M | 0.52% | 6.17% | 6.34% |
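The equal error rate (EER) reported above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal pure-Python sketch of the metric, using hypothetical trial scores rather than real toolkit outputs:

```python
def eer(target_scores, nontarget_scores):
    """Compute the equal error rate by sweeping a threshold over all scores.

    At each candidate threshold, FRR is the fraction of target (same-speaker)
    trials rejected and FAR is the fraction of non-target trials accepted;
    the EER is taken where the two rates cross.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy example: same-speaker trials mostly score high, different-speaker trials low.
target = [0.9, 0.8, 0.75, 0.6]
nontarget = [0.5, 0.4, 0.3, 0.7]
print(eer(target, nontarget))  # one trial misclassified on each side -> 0.25
```

Production scoring tools interpolate between thresholds for a smoother estimate; this discrete sweep is enough to illustrate the definition.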
DER results on public and internal multi-speaker datasets for speaker diarization:

| Test set | 3D-Speaker | pyannote.audio | DiariZen_WavLM |
|---|---|---|---|
| Aishell-4 | 10.30% | 12.2% | 11.7% |
| Alimeeting | 19.73% | 24.4% | 17.6% |
| AMI_SDM | 21.76% | 22.4% | 15.4% |
| VoxConverse | 11.75% | 11.3% | 28.39% |
| Meeting-CN_ZH-1 | 18.91% | 22.37% | 32.66% |
| Meeting-CN_ZH-2 | 12.78% | 17.86% | 18% |
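The diarization error rate (DER) above is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. A simplified frame-level sketch (real scoring additionally finds the optimal reference-to-hypothesis speaker mapping and may apply a forgiveness collar around boundaries):

```python
def frame_der(ref, hyp):
    """Frame-level diarization error rate.

    ref and hyp are equal-length lists of per-frame speaker labels,
    with None marking non-speech. DER = (miss + false alarm +
    speaker confusion) / total reference speech frames.
    """
    miss = fa = conf = 0
    speech = sum(r is not None for r in ref)
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            miss += 1          # reference speech, hypothesis silence
        elif r is None and h is not None:
            fa += 1            # hypothesis speech, reference silence
        elif r is not None and r != h:
            conf += 1          # both speak, wrong speaker label
    return (miss + fa + conf) / speech

ref = ["A", "A", "A", "B", "B", None, "B", "A"]   # 7 speech frames
hyp = ["A", "A", "B", "B", None, None, "B", "A"]  # one confusion, one miss
print(frame_der(ref, hyp))  # (1 miss + 0 FA + 1 confusion) / 7 ~= 0.286
```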
```sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```
```sh
# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio and multimodal speaker diarization
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
```
All pretrained models are released on ModelScope.
```sh
# Install modelscope
pip install modelscope

# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list

# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id

# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token
```
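Under the hood, a verification trial compares two speaker embeddings, typically with cosine similarity against a decision threshold. A minimal sketch with hypothetical low-dimensional embeddings standing in for the real model outputs that `infer_sv.py` extracts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim embeddings standing in for real high-dimensional outputs.
emb_enroll = [0.1, 0.9, 0.2, 0.4]
emb_test = [0.12, 0.85, 0.25, 0.35]

score = cosine_similarity(emb_enroll, emb_test)
same_speaker = score >= 0.6  # threshold is task-dependent; 0.6 is illustrative
print(round(score, 3), same_speaker)
```

In practice the threshold is calibrated on a development set (for example, at the EER operating point).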
Supervised Speaker Verification
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on 3D-Speaker.
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on VoxCeleb.
- CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet and Res2Net training recipes on CN-Celeb.
Self-supervised Speaker Verification
Speaker Diarization
- Speaker diarization inference recipes comprising multiple modules: overlap detection (optional), voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering.
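The final clustering module groups segment embeddings by speaker. A toy greedy single-linkage sketch with a cosine-distance threshold (the toolkit's actual clustering is more sophisticated, e.g. spectral clustering; names and threshold here are illustrative):

```python
import math

def cos_dist(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cluster(embeddings, threshold=0.3):
    """Greedy single-linkage clustering: merge a segment into the first
    cluster that has a member within `threshold` cosine distance."""
    clusters = []  # each cluster is a list of segment indices
    for i, e in enumerate(embeddings):
        for c in clusters:
            if any(cos_dist(e, embeddings[j]) < threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy segment embeddings: segments 0 and 2 point one way (speaker 1),
# segments 1 and 3 another (speaker 2).
segs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.8]]
print(cluster(segs))  # [[0, 2], [1, 3]]
```

Each resulting cluster becomes one speaker, and the segment time spans of its members form that speaker's portion of the diarization output.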
Language Identification
- Language identification training recipes on 3D-Speaker.
3D-Speaker Dataset
- Dataset introduction and download address: 3D-Speaker
- Related paper address: 3D-Speaker
- [2024.12] Update diarization recipes and add results on multiple diarization benchmarks.
- [2024.8] Releasing ERes2NetV2 and ERes2NetV2_w24s4ep4 pretrained models trained on 200k-speaker datasets.
- [2024.5] Releasing SDPN model and X-vector model training and inference recipes for VoxCeleb.
- [2024.5] Releasing visual module and semantic module training recipes.
- [2024.4] Releasing ONNX Runtime support and the relevant inference scripts.
- [2024.4] Releasing the ERes2NetV2 model with fewer parameters and faster inference speed on VoxCeleb datasets.
- [2024.2] Releasing language identification recipes that integrate phonetic information for higher recognition accuracy.
- [2024.2] Releasing multimodal diarization recipes that fuse audio and video input to produce more accurate results.
- [2024.1] Releasing ResNet34 and Res2Net model training and inference recipes for the 3D-Speaker, VoxCeleb and CN-Celeb datasets.
- [2024.1] Releasing large-margin finetune recipes for speaker verification and adding diarization recipes.
- [2023.11] ERes2Net-base pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
- [2023.10] Releasing ECAPA model training and inference recipes for three datasets.
- [2023.9] Releasing RDINO model training and inference recipes for CN-Celeb.
- [2023.8] Releasing CAM++, ERes2Net-Base and ERes2Net-Large benchmarks on CN-Celeb.
- [2023.8] Releasing ERes2Net and CAM++ in language identification for Mandarin and English.
- [2023.7] Releasing CAM++, ERes2Net-Base and ERes2Net-Large pretrained models trained on 3D-Speaker.
- [2023.7] Releasing Dialogue Detection and Semantic Speaker Change Detection in speaker diarization.
- [2023.7] Releasing CAM++ in language identification for Mandarin and English.
- [2023.6] Releasing the 3D-Speaker dataset and its corresponding benchmarks including ERes2Net, CAM++ and RDINO.
- [2023.5] ERes2Net and CAM++ pretrained models released, trained on a Mandarin dataset of 200k labeled speakers.
If you have any comments or questions about 3D-Speaker, please contact us by
- email: {yfchen97, wanghuii}@mail.ustc.edu.cn, {dengchong.d, zsq174630, shuli.cly}@alibaba-inc.com
3D-Speaker is released under the Apache License 2.0.
3D-Speaker contains third-party components and code modified from some open-source repos, including:
Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg, TalkNet-ASD, Ultra-Light-Fast-Generic-Face-Detector-1MB, pyannote.audio
If you find this repository useful, please consider giving a star ⭐ and citation 🦖:
```bibtex
@inproceedings{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={ICASSP},
  year={2025}
}
```