# DiariZen

A toolkit for speaker diarization.
DiariZen is a speaker diarization toolkit driven by AudioZen and Pyannote 3.1.
## Installation

```bash
# create virtual python environment
conda create --name diarizen python=3.10
conda activate diarizen

# install diarizen
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt && pip install -e .

# install pyannote-audio
cd pyannote-audio && pip install -e .[dev,testing]

# install dscore
git submodule init
git submodule update
```
## Usage

We use SDM (the first channel from the first far-field microphone array) data from the public AMI, AISHELL-4, and AliMeeting corpora for model training and evaluation. Please download these datasets first. Our data partition is here.
- download the WavLM Base+ model
- download the ResNet34-LM model
- modify the paths of the used dataset and configuration file
```bash
cd recipes/diar_ssl && bash -i run_stage.sh
```
- Our pre-trained checkpoints and the estimated RTTM files can be found here. The local experimental path has been anonymized. To use the pre-trained models, please check `diar_ssl/run_stage.sh`.
- In case you have trouble reproducing our experiments, we also provide the intermediate inference results of `EN2002a`, an AMI test recording, for debugging.
- Our model is also supported on Hugging Face 🤗. Please check `example/run_example.py`.
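The released RTTM files follow the standard NIST RTTM format: one `SPEAKER` line per segment, with the recording ID, onset, and duration in seconds, and the speaker label. A minimal loading sketch (`load_rttm` is a hypothetical helper for illustration, not part of DiariZen or dscore):

```python
from collections import defaultdict

def load_rttm(path):
    """Parse an RTTM file into {recording_id: [(start, end, speaker), ...]}."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            # RTTM fields: type file chnl onset dur ortho stype name conf slat
            if not fields or fields[0] != "SPEAKER":
                continue
            rec_id = fields[1]
            onset = float(fields[3])
            duration = float(fields[4])
            speaker = fields[7]
            segments[rec_id].append((onset, onset + duration, speaker))
    return dict(segments)
```

This is enough to inspect the provided reference and system outputs side by side before running the full scoring pipeline.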
## Results

We aim to make the whole pipeline as simple as possible. Therefore, for the results below:

- we **did not** use any simulated data
- we **did not** apply advanced learning-scheduler strategies
- we **did not** perform further domain adaptation to each dataset
- all experiments share the **same hyper-parameters** for clustering
collar = 0s:

| System | Features | AMI | AISHELL-4 | AliMeeting |
|---|---|---|---|---|
| Pyannote3 | SincNet | 21.1 | 13.9 | 22.8 |
| Proposed | Fbank | 19.7 | 12.5 | 21.0 |
| Proposed | WavLM-frozen | 17.0 | 11.7 | 19.9 |
| Proposed | WavLM-updated | 15.4 | 11.7 | 17.6 |

collar = 0.25s:

| System | Features | AMI | AISHELL-4 | AliMeeting |
|---|---|---|---|---|
| Pyannote3 | SincNet | 13.7 | 7.7 | 13.6 |
| Proposed | Fbank | 12.9 | 6.9 | 12.6 |
| Proposed | WavLM-frozen | 10.9 | 6.1 | 12.0 |
| Proposed | WavLM-updated | 9.8 | 5.9 | 10.2 |

Note: the results above differ from our ICASSP submission. We made a few updates to the experimental numbers, but the conclusions remain the same as in the paper.
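For intuition about the numbers above: diarization error rate (DER) is the sum of missed speech, false alarm, and speaker confusion time, divided by total reference speech time. A simplified frame-level sketch of that idea (a hypothetical illustration, not the dscore scorer used here, which additionally applies the collar and an optimal reference-to-system speaker mapping):

```python
def frame_der(ref, hyp):
    """Simplified frame-level DER for already-aligned label sequences.

    ref/hyp: per-frame speaker labels, with None marking silence.
    Any mismatched frame counts as an error (miss, false alarm, or
    confusion); the denominator is the number of reference speech frames.
    """
    assert len(ref) == len(hyp)
    errors = sum(1 for r, h in zip(ref, hyp) if r != h)
    speech = sum(1 for r in ref if r is not None)
    return errors / max(speech, 1)

# toy example: 8 reference speech frames, 1 mislabeled frame -> DER = 1/8
ref = [None, "A", "A", "A", "B", "B", "B", "B", "A", None]
hyp = [None, "A", "A", "A", "A", "B", "B", "B", "A", None]
print(frame_der(ref, hyp))  # 0.125
```

Real scoring also has to find the best mapping between reference and system speaker labels (e.g. via the Hungarian algorithm) before counting errors; this sketch assumes the labels are already aligned.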
## Citation

If you found this work helpful, please consider citing: J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, and L. Burget, "Leveraging Self-Supervised Learning for Speaker Diarization," in Proc. ICASSP, 2025.
```
@inproceedings{han2025leveraging,
  title={Leveraging self-supervised learning for speaker diarization},
  author={Han, Jiangyu and Landini, Federico and Rohdin, Johan and Silnova, Anna and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  booktitle={Proc. ICASSP},
  year={2025}
}
```
## License

This repository is under the MIT license.
## Contact

If you have any comments or questions, please contact ihan@fit.vut.cz.