WavLM Large + RawNetX Speaker Verification Base: End-to-End Speaker Verification Architecture
This architecture combines WavLM Large and RawNetX to learn both micro and macro features directly from raw waveforms. The goal is to obtain a fully end-to-end model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.
Note: If you would like to contribute to this repository, please read the CONTRIBUTING first.
- Introduction
- Architecture
- Reports
- Prerequisites
- Installation
- File Structure
- Version Control System
- Upcoming
- Documentations
- License
- Links
- Team
- Contact
- Citation
WavLM Large
- Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (frame-level) or "micro" acoustic features.
- It produces a 1024-dimensional embedding, focusing on localized, short-term variations in the speech signal.
RawNetX
- Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
- The Attentive Stats Pooling layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
- Outputs a 256-dimensional embedding, representing the overall, longer-term structure of the speech.
These two approaches complement each other: WavLM Large excels at fine-detailed temporal features, while RawNetX captures a more global, statistical overview.
Raw Audio Input
- No manual preprocessing (like MFCC or mel-spectrogram).
- A minimal Transform and Segment step (mono conversion, resample, slice/pad) formats the data into shape (B, T).
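A rough sketch of such a Transform and Segment step, assuming torchaudio is available and using illustrative values (16 kHz target rate, 3-second segments) that may differ from the repository's actual configuration:

```python
import torch
import torchaudio


def transform_segment(path: str, target_sr: int = 16000, num_samples: int = 48000) -> torch.Tensor:
    """Load an audio file, convert to mono, resample, and slice/pad to a fixed length."""
    waveform, sr = torchaudio.load(path)              # (channels, T)
    waveform = waveform.mean(dim=0, keepdim=True)     # mono: (1, T)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    if waveform.shape[1] >= num_samples:              # slice to fixed length
        waveform = waveform[:, :num_samples]
    else:                                             # or zero-pad up to it
        waveform = torch.nn.functional.pad(waveform, (0, num_samples - waveform.shape[1]))
    return waveform.squeeze(0)                        # (T,), later batched into (B, T)
```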
RawNetX (Macro Features)
- SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
- ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
- Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism (see the sketch after this list).
- A final FC layer yields a 256-dimensional embedding.
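As an illustration of the pooling component, here is a minimal Attentive Stats Pooling sketch in PyTorch; the hidden size and exact layer layout are assumptions and may differ from the repository's src/model/pooling.py:

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Weighted mean + std over time, with a small learnable attention head."""

    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) feature map produced by the residual stack
        alpha = torch.softmax(self.attention(x), dim=-1)        # attention weights over time
        mean = torch.sum(alpha * x, dim=-1)                     # weighted mean: (B, C)
        var = torch.sum(alpha * x ** 2, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))                   # weighted std: (B, C)
        return torch.cat([mean, std], dim=-1)                   # (B, 2C), fed to the final FC -> 256
```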
WavLM Large (Micro Features)
- Transformer layers operate at frame-level, capturing fine-grained details.
- Produces a 1024-dimensional embedding after mean pooling across time.
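A minimal sketch of this micro-feature path using the Hugging Face transformers library; the microsoft/wavlm-large checkpoint name and the loading code are illustrative rather than the repository's implementation:

```python
import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")


def wavlm_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (B, T) raw audio at 16 kHz -> (B, 1024) mean-pooled embedding."""
    with torch.no_grad():
        hidden = wavlm(waveform).last_hidden_state   # (B, frames, 1024) frame-level features
    return hidden.mean(dim=1)                        # mean pooling across time -> (B, 1024)
```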
Fusion Layer
- Concatenate the 256-dim RawNetX embedding with the 1024-dim WavLM embedding, resulting in 1280 dimensions.
- A Linear(1280 → 256) + ReLU layer reduces it to a 256-dim Fusion Embedding, combining micro and macro insights.
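A minimal sketch of such a fusion layer under the dimensions stated above; the module name and structure are illustrative:

```python
import torch
import torch.nn as nn


class FusionLayer(nn.Module):
    """Concatenate macro (256-dim) and micro (1024-dim) embeddings, then project to 256."""

    def __init__(self, rawnetx_dim: int = 256, wavlm_dim: int = 1024, out_dim: int = 256):
        super().__init__()
        self.projection = nn.Sequential(nn.Linear(rawnetx_dim + wavlm_dim, out_dim), nn.ReLU())

    def forward(self, macro: torch.Tensor, micro: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([macro, micro], dim=-1)    # (B, 1280)
        return self.projection(fused)                # (B, 256) fusion embedding
```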
AMSoftmax Loss
- During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale).
- Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in the angular space.
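A minimal AM-Softmax head as a sketch; the margin, scale, and speaker count (1211 speakers in the VoxCeleb1 dev split) shown here are illustrative defaults, not necessarily the repository's training settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxHead(nn.Module):
    """Additive-margin softmax: subtract m from cos(theta) of the target class, scale by s."""

    def __init__(self, embed_dim: int = 256, num_speakers: int = 1211,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between normalized embeddings and class weights: (B, num_speakers)
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.scale * (cosine - self.margin * one_hot)   # margin only on the target class
        return F.cross_entropy(logits, labels)
```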
- Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
- No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
- Data-Driven: The model itself figures out which frequency bands or time segments matter most.
- Enhanced Representation: WavLM delivers local detail, RawNetX captures global stats, leading to a more robust speaker representation.
- Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
- Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
- Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.
Micro + Macro Features Combined
- Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNetX).
Truly End-to-End
- Beyond minimal slicing/padding, all layers are trainable.
- No handcrafted feature extraction is involved.
VoxCeleb1 Test Results
- Achieved an EER of 4.67% on the VoxCeleb1 evaluation set.
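For reference, EER can be computed from trial scores and same/different-speaker labels; a minimal sketch using scikit-learn's ROC utilities (assumed to be available in the environment):

```python
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance and false-rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same speaker, 0 = different speaker
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2)
```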
Overall Benefits
- Potentially outperforms using WavLM or RawNetX alone on standard metrics like EER and minDCF.
- Combining both scales of analysis yields a richer speaker representation.
In essence, WavLM Large + RawNetX merges two scales of speaker representation to produce a unified 256-dim embedding. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.
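At inference time, verification then typically reduces to comparing the enrollment and test fusion embeddings, for example with cosine similarity against a threshold; a sketch (the scoring method and threshold value are assumptions, not taken from the repository):

```python
import torch
import torch.nn.functional as F


def verify(enroll_emb: torch.Tensor, test_emb: torch.Tensor, threshold: float = 0.5) -> bool:
    """Accept the trial if the cosine similarity of the two 256-dim fusion embeddings exceeds the threshold."""
    score = F.cosine_similarity(enroll_emb.unsqueeze(0), test_emb.unsqueeze(0)).item()
    return score >= threshold
```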
Speaker Verification Benchmark on VoxCeleb1 Dataset
| Model | EER (%) |
|---|---|
| ReDimNet-B6-SF2-LM-ASNorm | 0.37 |
| WavLM+ECAPA-TDNN | 0.39 |
| ... | ... |
| TitanNet-L | 0.68 |
| ... | ... |
| SpeechNAS | 1.02 |
| ... | ... |
| Multi Task SSL | 1.98 |
| ... | ... |
| WavLMRawNetXSVBase | 4.67 |
- Python 3.11 (or above)
- 10GB Disk Space (for VoxCeleb1 Dataset)
- 12GB VRAM GPU (or above)
sudo apt update -y && sudo apt upgrade -y
sudo apt install -y ffmpeg
git clone https://github.com/bunyaminergen/WavLMRawNetXSVBase
cd WavLMRawNetXSVBase
conda env create -f environment.yaml
conda activate WavLMRawNetXSVBase
Please go to the url and register: KAIST MM
After receiving the e-mail, you can download the dataset directly by clicking the link in the e-mail, or you can use the following commands.
Note: To download from the command line, copy the key parameter from the link in the e-mail and insert it where indicated (replace <YOUR_KEY>) in the commands below.
To download the List of trial pairs - VoxCeleb1 (cleaned), please go to the url: VoxCeleb
VoxCeleb1
Dev A
wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partaa"
Dev B
wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partab"
Dev C
wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partac"
Dev D
wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partad"
Concatenate
cat vox1_dev* > vox1_dev_wav.zip
Test
wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_test_wav.zip"
List of trial pairs - VoxCeleb1 (cleaned)
wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt
.
├── .data
│   ├── dataset
│   │   ├── raw
│   │   │   └── VoxCeleb1
│   │   │       ├── dev
│   │   │       │   └── vox1_dev_wav.zip
│   │   │       └── test
│   │   │           └── vox1_test_wav.zip
│   │   └── train
│   │       └── VoxCeleb1
│   │           ├── dev
│   │           │   └── vox1_dev_wav
│   │           │       └── wav
│   │           │           ├── id10001
│   │           │           │   ├── 1zcIwhmdeo4
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── 7gWzIy6yIIk
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id10002
│   │           │           │   ├── 6WO410QOeuo
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── C7k7C-PDvAA
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id10003
│   │           │           │   ├── 5ablueV_1tw
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── A7Hh1WKmHsg
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id11250
│   │           │           │   ├── 09AvzdGWvhA
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── 1BmQvhvvrhY
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           └── id11251
│   │           │               ├── 5-6lI5JQtb8
│   │           │               │   ├── 00001.wav
│   │           │               │   ├── 00002.wav
│   │           │               │   ├── 00003.wav
│   │           │               │   └── ...
│   │           │               └── XHCSVYEZvlM
│   │           │                   ├── 00001.wav
│   │           │                   ├── 00002.wav
│   │           │                   ├── 00003.wav
│   │           │                   └── ...
│   │           └── test
│   │               ├── veri_test2.txt
│   │               └── vox1_test_wav
│   │                   └── wav
│   │                       ├── id10270
│   │                       │   ├── 5r0dWxy17C8
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   ├── 5sJomL_D0_g
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       ├── id10271
│   │                       │   ├── 1gtz-CUIygI
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   ├── 37nktPRUJ58
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       ├── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       └── id10309
│   │                           ├── 0b1inHMAr6o
│   │                           │   ├── 00001.wav
│   │                           │   ├── 00002.wav
│   │                           │   ├── 00003.wav
│   │                           │   └── ...
│   │                           └── Zx-zA-D_DvI
│   │                               ├── 00001.wav
│   │                               ├── 00002.wav
│   │                               ├── 00003.wav
│   │                               └── ...
│   └── example
│       ├── enroll
│       │   ├── speaker1_enroll_en.wav
│       │   └── speaker1_enroll_tr.wav
│       └── test
│           ├── speaker1_test_en.wav
│           ├── speaker1_test_tr.wav
│           ├── speaker2_test_en.wav
│           └── speaker2_test_tr.wav
├── .docs
│   ├── documentation
│   │   ├── CONTRIBUTING.md
│   │   └── RESOURCES.md
│   └── img
│       └── architecture
│           ├── WavLMRawNetXSVBase.drawio
│           └── WavLMRawNetXSVBase.gif
├── environment.yaml
├── .github
│   └── CODEOWNERS
├── .gitignore
├── LICENSE
├── main.py
├── notebook
│   └── test.ipynb
├── README.md
├── requirements.txt
└── src
    ├── config
    │   ├── config.yaml
    │   └── schema.py
    ├── evaluate
    │   └── metric.py
    ├── model
    │   ├── backbone.py
    │   ├── block.py
    │   ├── convolution.py
    │   ├── fusion.py
    │   ├── loss.py
    │   └── pooling.py
    ├── preprocess
    │   ├── feature.py
    │   └── transformation.py
    ├── process
    │   ├── test.py
    │   └── train.py
    └── utils
        └── data
            └── manager.py

23779 directories, 153552 files
- BasePlus Model: Build a new architecture and train for a better EER.
- HuggingFace Model Hub: Add model to HuggingFace Model Hub.
- HuggingFace Space: Add demo to HuggingFace Space.
- Pytorch Hub: Add model to Pytorch Hub.
@software{WavLMRawNetXSVBase,
  author  = {Bunyamin Ergen},
  title   = {{WavLMRawNetXSVBase}},
  year    = {2025},
  month   = {02},
  url     = {https://github.com/bunyaminergen/WavLMRawNetXSVBase},
  version = {v1.0.0},
}