# sslsv

Framework for training and evaluating self-supervised learning methods for speaker verification.
sslsv is a PyTorch-based Deep Learning framework consisting of a collection of Self-Supervised Learning (SSL) methods for learning speaker representations applicable to different speaker-related downstream tasks, notably Speaker Verification (SV).

Our aim is to: (1) provide self-supervised SOTA methods by porting algorithms from the computer vision domain; and (2) evaluate them in a comparable environment.
Our training framework is depicted by the figure below.
## News

- April 2024 – 👏 Introduction of various new methods and a complete refactoring (v2.0).
- June 2022 – 🌠 First release of sslsv (v1.0).
## Features

- Data:
  - Supervised and self-supervised datasets (siamese and DINO sampling)
  - Audio augmentation (noise and reverberation)
- Training:
  - CPU, GPU and multi-GPU (`DataParallel` and `DistributedDataParallel`)
  - Checkpointing, resuming, early stopping and logging
  - Tensorboard and wandb
- Evaluation:
  - Speaker verification
    - Backends: cosine scoring and PLDA
    - Metrics: EER, minDCF, actDCF, CLLR, AvgRPrec
  - Classification (emotion, language, ...)
- Notebooks: DET curve, score distributions, t-SNE on embeddings, ...
- Misc: scalable config, typing, documentation and tests
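To illustrate the speaker-verification evaluation listed above, cosine scoring and the EER metric can be sketched with NumPy. This is a minimal educational sketch, not the framework's actual implementation in sslsv:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def eer(scores, labels):
    """Equal Error Rate: operating point where false-accept
    and false-reject rates are (approximately) equal.

    scores: trial scores (higher = more likely same speaker)
    labels: 1 for target (same speaker) trials, 0 for non-target
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]      # sort trials by descending score
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # After accepting the top-k trials:
    # FAR = accepted non-targets / non-targets
    # FRR = rejected targets / targets
    far = np.cumsum(1 - labels) / n_nontarget
    frr = 1 - np.cumsum(labels) / n_target
    idx = np.argmin(np.abs(far - frr))    # closest FAR/FRR crossing
    return float((far[idx] + frr[idx]) / 2)
```

With perfectly separated scores the EER is 0; with fully interleaved targets and non-targets it approaches 50%.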
## Encoders

- TDNN (`sslsv.encoders.TDNN`)
  X-vectors: Robust DNN Embeddings for Speaker Recognition (PDF)
  David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur
- Simple Audio CNN (`sslsv.encoders.SimpleAudioCNN`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- ResNet-34 (`sslsv.encoders.ResNet34`)
  VoxCeleb2: Deep Speaker Recognition (arXiv)
  Joon Son Chung, Arsha Nagrani, Andrew Zisserman
- ECAPA-TDNN (`sslsv.encoders.ECAPATDNN`)
  ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arXiv)
  Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
## Methods

- LIM (`sslsv.methods.LIM`)
  Learning Speaker Representations with Mutual Information (arXiv)
  Mirco Ravanelli, Yoshua Bengio
- CPC (`sslsv.methods.CPC`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- SimCLR (`sslsv.methods.SimCLR`)
  A Simple Framework for Contrastive Learning of Visual Representations (arXiv)
  Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
- MoCo v2+ (`sslsv.methods.MoCo`)
  Improved Baselines with Momentum Contrastive Learning (arXiv)
  Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
- DeepCluster v2 (`sslsv.methods.DeepCluster`)
  Deep Clustering for Unsupervised Learning of Visual Features (arXiv)
  Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze
- SwAV (`sslsv.methods.SwAV`)
  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (arXiv)
  Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
- W-MSE (`sslsv.methods.WMSE`)
  Whitening for Self-Supervised Representation Learning (arXiv)
  Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe
- Barlow Twins (`sslsv.methods.BarlowTwins`)
  Barlow Twins: Self-Supervised Learning via Redundancy Reduction (arXiv)
  Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny
- VICReg (`sslsv.methods.VICReg`)
  VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (arXiv)
  Adrien Bardes, Jean Ponce, Yann LeCun
- VIbCReg (`sslsv.methods.VIbCReg`)
  Computer Vision Self-supervised Learning Methods on Time Series (arXiv)
  Daesoo Lee, Erlend Aune
- BYOL (`sslsv.methods.BYOL`)
  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (arXiv)
  Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
- SimSiam (`sslsv.methods.SimSiam`)
  Exploring Simple Siamese Representation Learning (arXiv)
  Xinlei Chen, Kaiming He
- DINO (`sslsv.methods.DINO`)
  Emerging Properties in Self-Supervised Vision Transformers (arXiv)
  Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
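As an example of the contrastive family listed above, the SimCLR NT-Xent loss can be sketched in plain NumPy. This is an educational reimplementation for illustration only; the actual `sslsv.methods.SimCLR` is PyTorch-based and may differ in detail:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (N, D) arrays of embeddings from two augmented views of
    the same N utterances; row i of z1 and row i of z2 form a
    positive pair, all other rows act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / temperature                       # pairwise cosine sims
    np.fill_diagonal(sim, -np.inf)                    # mask self-similarity
    n = len(z1)
    # index of the positive for each row: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # softmax cross-entropy over each row, positive as the target class
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

When the two views produce identical, well-separated embeddings, the loss is close to zero at a low temperature; it grows as positives and negatives become harder to distinguish.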
## Methods (ours)

- Combiner (`sslsv.methods.Combiner`)
  Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (arXiv)
  Theo Lepage, Reda Dehak
- SimCLR Margins (`sslsv.methods.SimCLRMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- MoCo Margins (`sslsv.methods.MoCoMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- SSPS (`sslsv.methods._SSPS`)
  Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (arXiv)
  Theo Lepage, Reda Dehak
## Requirements

sslsv runs on Python 3.8 with the following dependencies.

| Module        | Versions  |
|---------------|-----------|
| torch         | >= 1.11.0 |
| torchaudio    | >= 0.11.0 |
| numpy         | *         |
| pandas        | *         |
| soundfile     | *         |
| scikit-learn  | *         |
| speechbrain   | *         |
| tensorboard   | *         |
| wandb         | *         |
| ruamel.yaml   | *         |
| dacite        | *         |
| prettyprinter | *         |
| tqdm          | *         |
Note: developers will also need `pytest`, `pre-commit` and `twine` to work on this project.
## Datasets

- Speaker recognition:
- Language recognition:
- Emotion recognition:
- Data-augmentation:
Data used for the main experiments (conducted on VoxCeleb1 and VoxCeleb2 with data-augmentation) can be automatically downloaded, extracted and prepared using the following scripts.

```bash
python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/
```

The resulting `data` folder should have the structure presented below.

```
data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
```
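To sanity-check this layout after downloading, a small helper (hypothetical, not shipped with sslsv) can report which expected entries are missing from the `data` folder:

```python
from pathlib import Path

# Entries expected at the top level of the data folder
EXPECTED = [
    "musan_split", "simulated_rirs", "voxceleb1", "voxceleb2",
    "voxceleb1_test_O", "voxceleb1_test_H", "voxceleb1_test_E",
    "voxsrc2021_val", "voxceleb1_train.csv", "voxceleb2_train.csv",
]

def missing_entries(data_dir):
    """Return the expected files/folders absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

Running `missing_entries("data/")` before training surfaces incomplete downloads early instead of failing mid-epoch.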
Other datasets have to be manually downloaded and extracted, but their train and trials files can be created using the corresponding scripts from the `tools/prepare_data/` folder.

Example format of a train file (`voxceleb1_train.csv`):

```
File,Speaker
voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
...
voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
```

Example format of a trials file (`voxceleb1_test_O`):

```
1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
...
0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
```
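Both file formats are plain text and straightforward to parse. The sketch below uses hypothetical helper names for illustration; these are not utilities provided by sslsv:

```python
import csv
from io import StringIO

def parse_train_csv(text):
    """Map each utterance path to its speaker label.

    Expects the File,Speaker header shown in the train file example."""
    reader = csv.DictReader(StringIO(text))
    return {row["File"]: row["Speaker"] for row in reader}

def parse_trials(text):
    """Parse a trials file into (is_target, enroll_path, test_path) tuples.

    Each line is: <1|0> <enrollment utterance> <test utterance>."""
    trials = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, enroll, test = line.split()
        trials.append((int(label), enroll, test))
    return trials
```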
## Installation

- Clone this repository: `git clone https://github.com/theolepage/sslsv.git`.
- Install dependencies: `pip install -r requirements.txt`.

Note: sslsv can also be installed as a standalone package via pip with `pip install sslsv` or with `pip install .` (in the project root folder) to get the latest version.
## Usage

- Start a training (2 GPUs): `./train_ddp.sh 2 <config_path>`.
- Evaluate your model (2 GPUs): `./evaluate_ddp.sh 2 <config_path>`.

Note: use `sslsv/bin/train.py` and `sslsv/bin/evaluate.py` for non-distributed mode to run with a CPU, a single GPU or multiple GPUs (`DataParallel`).

You can visualize your experiments with `tensorboard --logdir models/your_model/`.

Use `wandb online` and `wandb offline` to toggle wandb. To log your experiments you first need to provide your API key with `wandb login API_KEY`.
Documentation is currently being developed...
## Results

- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: ECAPA-TDNN (C=1024)

| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | ssl/voxceleb2/simclr/simclr_e-ecapa-1024 | 6.41 | 0.5160 | 🔗 |
| MoCo | ssl/voxceleb2/moco/moco_e-ecapa-1024 | 6.38 | 0.5384 | 🔗 |
| SwAV | ssl/voxceleb2/swav/swav_e-ecapa-1024 | 8.33 | 0.6120 | 🔗 |
| VICReg | ssl/voxceleb2/vicreg/vicreg_e-ecapa-1024 | 7.85 | 0.6004 | 🔗 |
| DINO | ssl/voxceleb2/dino/dino+_e-ecapa-1024 | 2.92 | 0.3523 | 🔗 |
| Supervised | ssl/voxceleb2/supervised/supervised_e-ecapa-1024 | 1.34 | 0.1521 | 🔗 |
sslsv contains third-party components and code adapted from other open-source projects, including: voxceleb_trainer, voxceleb_unsupervised and solo-learn.
If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.

```bibtex
@Article{lepage2025SSLSVBootstrappedPositiveSampling,
  title   = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
  author  = {Lepage, Theo and Dehak, Reda},
  year    = {2025},
  journal = {arXiv preprint library},
  url     = {https://arxiv.org/abs/2501.17772},
}

@InProceedings{lepage2024AdditiveMarginSSLSV,
  title     = {Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2024},
  booktitle = {The Speaker and Language Recognition Workshop (Odyssey 2024)},
  pages     = {38--42},
  doi       = {10.21437/odyssey.2024-6},
  url       = {https://www.isca-archive.org/odyssey_2024/lepage24_odyssey.html},
}

@InProceedings{lepage2023ExperimentingAdditiveMarginsSSLSV,
  title     = {Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2023},
  booktitle = {Interspeech 2023},
  pages     = {4708--4712},
  doi       = {10.21437/Interspeech.2023-1479},
  url       = {https://www.isca-speech.org/archive/interspeech_2023/lepage23_interspeech.html},
}

@InProceedings{lepage2022LabelEfficientSSLSV,
  title     = {Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2022},
  booktitle = {Interspeech 2022},
  pages     = {4018--4022},
  doi       = {10.21437/Interspeech.2022-802},
  url       = {https://www.isca-speech.org/archive/interspeech_2022/lepage22_interspeech.html},
}
```
## License

This project is released under the MIT License.