WavLM Large + RawNetX Speaker Verification Base: End-to-End Speaker Verification Architecture
This architecture combines WavLM Large and RawNetX to learn both micro and macro features directly from raw waveforms. The goal is to obtain a fully end-to-end model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.
Note: If you would like to contribute to this repository, please read the CONTRIBUTING first.
- Introduction
- Architecture
- Reports
- Prerequisites
- Installation
- File Structure
- Version Control System
- Upcoming
- Documentations
- License
- Links
- Team
- Contact
- Citation
WavLM Large
- Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (frame-level) or "micro" acoustic features.
- It produces a 1024-dimensional embedding, focusing on localized, short-term variations in the speech signal.
RawNetX
- Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
- The Attentive Stats Pooling layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
- Outputs a 256-dimensional embedding, representing the overall, longer-term structure of the speech.
These two approaches complement each other: WavLM Large excels at fine-detailed temporal features, while RawNetX captures a more global, statistical overview.
Raw Audio Input
- No manual preprocessing (like MFCC or mel-spectrogram).
- A minimal Transform and Segment step (mono conversion, resample, slice/pad) formats the data into shape (B, T).
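A rough sketch of such a Transform and Segment step, assuming torchaudio is available and using illustrative values (16 kHz target rate, 3-second segments) that may differ from the repository's actual configuration:

```python
import torch
import torchaudio


def transform_segment(path: str, target_sr: int = 16000, num_samples: int = 48000) -> torch.Tensor:
    """Load an audio file, convert to mono, resample, and slice/pad to a fixed length."""
    waveform, sr = torchaudio.load(path)              # (channels, T)
    waveform = waveform.mean(dim=0, keepdim=True)     # mono: (1, T)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    if waveform.shape[1] >= num_samples:              # slice to fixed length
        waveform = waveform[:, :num_samples]
    else:                                             # or zero-pad up to it
        waveform = torch.nn.functional.pad(waveform, (0, num_samples - waveform.shape[1]))
    return waveform.squeeze(0)                        # (T,), later batched into (B, T)
```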
RawNetX (Macro Features)
- SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
- ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
- Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism (see the sketch after this list).
- A final FC layer yields a 256-dimensional embedding.
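As an illustration of the pooling component, here is a minimal Attentive Stats Pooling sketch in PyTorch; the hidden size and exact layer layout are assumptions and may differ from the repository's src/model/pooling.py:

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Weighted mean + std over time, with a small learnable attention head."""

    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) feature map produced by the residual stack
        alpha = torch.softmax(self.attention(x), dim=-1)        # attention weights over time
        mean = torch.sum(alpha * x, dim=-1)                     # weighted mean: (B, C)
        var = torch.sum(alpha * x ** 2, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))                   # weighted std: (B, C)
        return torch.cat([mean, std], dim=-1)                   # (B, 2C), fed to the final FC -> 256
```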
WavLM Large (Micro Features)
- Transformer layers operate at frame-level, capturing fine-grained details.
- Produces a 1024-dimensional embedding after mean pooling across time.
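A minimal sketch of this micro-feature path using the Hugging Face transformers library; the microsoft/wavlm-large checkpoint name and the loading code are illustrative rather than the repository's implementation:

```python
import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")


def wavlm_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (B, T) raw audio at 16 kHz -> (B, 1024) mean-pooled embedding."""
    with torch.no_grad():
        hidden = wavlm(waveform).last_hidden_state   # (B, frames, 1024) frame-level features
    return hidden.mean(dim=1)                        # mean pooling across time -> (B, 1024)
```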
Fusion Layer
- Concatenate the 256-dim RawNetX embedding with the 1024-dim WavLM embedding, resulting in 1280 dimensions.
- A Linear(1280 → 256) + ReLU layer reduces it to a 256-dim Fusion Embedding, combining micro and macro insights.
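A minimal sketch of such a fusion layer under the dimensions stated above; the module name and structure are illustrative:

```python
import torch
import torch.nn as nn


class FusionLayer(nn.Module):
    """Concatenate macro (256-dim) and micro (1024-dim) embeddings, then project to 256."""

    def __init__(self, rawnetx_dim: int = 256, wavlm_dim: int = 1024, out_dim: int = 256):
        super().__init__()
        self.projection = nn.Sequential(nn.Linear(rawnetx_dim + wavlm_dim, out_dim), nn.ReLU())

    def forward(self, macro: torch.Tensor, micro: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([macro, micro], dim=-1)    # (B, 1280)
        return self.projection(fused)                # (B, 256) fusion embedding
```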
AMSoftmax Loss
- During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale).
- Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in the angular space.
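A minimal AM-Softmax head as a sketch; the margin, scale, and speaker count (1211 speakers in the VoxCeleb1 dev split) shown here are illustrative defaults, not necessarily the repository's training settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxHead(nn.Module):
    """Additive-margin softmax: subtract m from cos(theta) of the target class, scale by s."""

    def __init__(self, embed_dim: int = 256, num_speakers: int = 1211,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between normalized embeddings and class weights: (B, num_speakers)
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.scale * (cosine - self.margin * one_hot)   # margin only on the target class
        return F.cross_entropy(logits, labels)
```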
- Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
- No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
- Data-Driven: The model itself figures out which frequency bands or time segments matter most.
- Enhanced Representation: WavLM delivers local detail, RawNetX captures global stats, leading to a more robust speaker representation.
- Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
- Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
- Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.
Micro + Macro Features Combined
- Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNetX).
Truly End-to-End
- Beyond minimal slicing/padding, all layers are trainable.
- No handcrafted feature extraction is involved.
VoxCeleb1 Test Results
- Achieved an EER of 4.67% on the VoxCeleb1 evaluation set.
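For reference, EER can be computed from trial scores and same/different-speaker labels; a minimal sketch using scikit-learn's ROC utilities (assumed to be available in the environment):

```python
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance and false-rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same speaker, 0 = different speaker
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2)
```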
Overall Benefits
- Potentially outperforms using WavLM or RawNetX alone on standard metrics like EER and minDCF.
- Combining both scales of analysis yields a richer speaker representation.
In essence, WavLM Large + RawNetX merges two scales of speaker representation to produce a unified 256-dim embedding. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.
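At inference time, verification then typically reduces to comparing the enrollment and test fusion embeddings, for example with cosine similarity against a threshold; a sketch (the scoring method and threshold value are assumptions, not taken from the repository):

```python
import torch
import torch.nn.functional as F


def verify(enroll_emb: torch.Tensor, test_emb: torch.Tensor, threshold: float = 0.5) -> bool:
    """Accept the trial if the cosine similarity of the two 256-dim fusion embeddings exceeds the threshold."""
    score = F.cosine_similarity(enroll_emb.unsqueeze(0), test_emb.unsqueeze(0)).item()
    return score >= threshold
```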
Speaker Verification Benchmark on VoxCeleb1 Dataset
| Model | EER (%) |
|---|---|
| ReDimNet-B6-SF2-LM-ASNorm | 0.37 |
| WavLM+ECAPA-TDNN | 0.39 |
| ... | ... |
| TitanNet-L | 0.68 |
| ... | ... |
| SpeechNAS | 1.02 |
| ... | ... |
| Multi Task SSL | 1.98 |
| ... | ... |
| WavLMRawNetXSVBase | 4.67 |
- Python 3.11 (or above)
- 10GB Disk Space (for VoxCeleb1 Dataset)
- 12GB VRAM GPU (or above)
sudo apt update -y && sudo apt upgrade -y
sudo apt install -y ffmpeg
git clone https://github.com/bunyaminergen/WavLMRawNetXSVBase
cd WavLMRawNetXSVBase
conda env create -f environment.yaml
conda activate WavLMRawNetXSVBase
Please go to the url and register: KAIST MM
After receiving the e-mail, you can download the dataset directly by clicking the link in the e-mail, or you can use the following commands.
Note: To download from the command line, copy the key parameter from the link in the e-mail and insert it where indicated (replace <YOUR_KEY>) in the commands below.
To download the List of trial pairs - VoxCeleb1 (cleaned), please go to the url: VoxCeleb
VoxCeleb1
Dev A
wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partaa"
Dev B
wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partab"
Dev C
wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partac"
Dev D
wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partad"
Concatenate
cat vox1_dev* > vox1_dev_wav.zip
Test
wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_test_wav.zip"
List of trial pairs - VoxCeleb1 (cleaned)
wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt
.
├── .data
│   ├── dataset
│   │   ├── raw
│   │   │   └── VoxCeleb1
│   │   │       ├── dev
│   │   │       │   └── vox1_dev_wav.zip
│   │   │       └── test
│   │   │           └── vox1_test_wav.zip
│   │   └── train
│   │       └── VoxCeleb1
│   │           ├── dev
│   │           │   └── vox1_dev_wav
│   │           │       └── wav
│   │           │           ├── id10001
│   │           │           │   ├── 1zcIwhmdeo4
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── 7gWzIy6yIIk
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id10002
│   │           │           │   ├── 6WO410QOeuo
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── C7k7C-PDvAA
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id10003
│   │           │           │   ├── 5ablueV_1tw
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── A7Hh1WKmHsg
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           ├── id11250
│   │           │           │   ├── 09AvzdGWvhA
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   ├── 1BmQvhvvrhY
│   │           │           │   │   ├── 00001.wav
│   │           │           │   │   ├── 00002.wav
│   │           │           │   │   ├── 00003.wav
│   │           │           │   │   └── ...
│   │           │           │   └── ...
│   │           │           │       └── ...
│   │           │           └── id11251
│   │           │               ├── 5-6lI5JQtb8
│   │           │               │   ├── 00001.wav
│   │           │               │   ├── 00002.wav
│   │           │               │   ├── 00003.wav
│   │           │               │   └── ...
│   │           │               └── XHCSVYEZvlM
│   │           │                   ├── 00001.wav
│   │           │                   ├── 00002.wav
│   │           │                   ├── 00003.wav
│   │           │                   └── ...
│   │           └── test
│   │               ├── veri_test2.txt
│   │               └── vox1_test_wav
│   │                   └── wav
│   │                       ├── id10270
│   │                       │   ├── 5r0dWxy17C8
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   ├── 5sJomL_D0_g
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       ├── id10271
│   │                       │   ├── 1gtz-CUIygI
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   ├── 37nktPRUJ58
│   │                       │   │   ├── 00001.wav
│   │                       │   │   ├── 00002.wav
│   │                       │   │   ├── 00003.wav
│   │                       │   │   └── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       ├── ...
│   │                       │   └── ...
│   │                       │       └── ...
│   │                       └── id10309
│   │                           ├── 0b1inHMAr6o
│   │                           │   ├── 00001.wav
│   │                           │   ├── 00002.wav
│   │                           │   ├── 00003.wav
│   │                           │   └── ...
│   │                           └── Zx-zA-D_DvI
│   │                               ├── 00001.wav
│   │                               ├── 00002.wav
│   │                               ├── 00003.wav
│   │                               └── ...
│   └── example
│       ├── enroll
│       │   ├── speaker1_enroll_en.wav
│       │   └── speaker1_enroll_tr.wav
│       └── test
│           ├── speaker1_test_en.wav
│           ├── speaker1_test_tr.wav
│           ├── speaker2_test_en.wav
│           └── speaker2_test_tr.wav
├── .docs
│   ├── documentation
│   │   ├── CONTRIBUTING.md
│   │   └── RESOURCES.md
│   └── img
│       └── architecture
│           ├── WavLMRawNetXSVBase.drawio
│           └── WavLMRawNetXSVBase.gif
├── environment.yaml
├── .github
│   └── CODEOWNERS
├── .gitignore
├── LICENSE
├── main.py
├── notebook
│   └── test.ipynb
├── README.md
├── requirements.txt
└── src
    ├── config
    │   ├── config.yaml
    │   └── schema.py
    ├── evaluate
    │   └── metric.py
    ├── model
    │   ├── backbone.py
    │   ├── block.py
    │   ├── convolution.py
    │   ├── fusion.py
    │   ├── loss.py
    │   └── pooling.py
    ├── preprocess
    │   ├── feature.py
    │   └── transformation.py
    ├── process
    │   ├── test.py
    │   └── train.py
    └── utils
        └── data
            └── manager.py

23779 directories, 153552 files
- BasePlus Model: Build a new architecture and train for a better EER.
- HuggingFace Model Hub: Add model to HuggingFace Model Hub.
- HuggingFace Space: Add demo to HuggingFace Space.
- Pytorch Hub: Add model to Pytorch Hub.
@software{WavLMRawNetXSVBase,
  author  = {Bunyamin Ergen},
  title   = {{WavLMRawNetXSVBase}},
  year    = {2025},
  month   = {02},
  url     = {https://github.com/bunyaminergen/WavLMRawNetXSVBase},
  version = {v1.0.0},
}