[IJCAI 2024] EAT: Self-Supervised Pre-Training with Efficient Audio Transformer


Python · PyTorch · Fairseq · 🤗 EAT on HuggingFace · arXiv · License

Guides

News 🔥

  • [Update May 3, 2025] 🎉🎉🎉 EAT now supports Hugging Face integration! You can extract features or run inference without relying on Fairseq — try EAT as your new audio encoder today!
  • We release EAT-large (20 epochs) with SOTA performance on AS-2M, AS-20K, ESC-50, and SPC-2.
  • Checkpoints and code are updated — EAT now seamlessly supports variable-length audio across training, extraction, inference, and evaluation.

Introduction

EAT is a self-supervised audio model designed for both effectiveness and efficiency during pre-training. You can find details in the paper EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.

Requirements and Installation

The minimum environment requirements are Python >= 3.8 and PyTorch >= 1.13. Please use pip < 24.1 due to dependency issues.

We now support Hugging Face integration — if you're only performing feature extraction or inference, you no longer need to install Fairseq!

🟡 For feature extraction or inference only (Hugging Face)

No Fairseq needed. Simply run:

git clone https://github.com/cwx-worst-one/EAT
cd EAT
pip install -r requirements.txt

🔵 For pre-training or fine-tuning (Fairseq-based)

You need to install Fairseq manually:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
git clone https://github.com/cwx-worst-one/EAT
pip install -r EAT/requirements.txt

Model Checkpoints

We provide several checkpoints for download, including both the original paper version and enhanced variants.

🔹 EAT-base (introduced in the paper; for efficient pre-training)

🔹 Updated & Recommended Versions

These enhanced versions feature extended pre-training or larger backbones.

  • Checkpoints via Google Drive are compatible with Fairseq for further pre-training or fine-tuning.
  • Hugging Face versions support direct usage via AutoModel.from_pretrained for feature extraction or inference (see the sketch below the table).
Version                                     📦 Google Drive   🤗 Hugging Face
EAT-base (Epoch 30, Pre-trained)            Link              Link
EAT-base (Epoch 30, Fine-tuned on AS-2M)    Link              Link
EAT-large (Epoch 20, Pre-trained)           Link              Link
EAT-large (Epoch 20, Fine-tuned on AS-2M)   Link              Link

🧠 Browse the collection on Hugging Face
⚠️ Note: Because our AudioSet subset is smaller than those used by other models, we recommend pre-training EAT on your own data for better performance.
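
For a quick start with the Hugging Face checkpoints, here is a minimal loading sketch. The model id is a placeholder (take a real one from the collection linked above), and the input format assumed here (a raw mono 16 kHz waveform) should be verified against the model card, since the checkpoint's custom code defines the forward signature.

# Minimal sketch, not the official usage example. Assumptions: the model id is
# a placeholder, and the expected input (raw mono 16 kHz waveform) may differ
# from what the checkpoint's custom code actually accepts -- check the model card.
import torch
from transformers import AutoModel

model_id = "<an-EAT-id-from-the-collection-above>"  # placeholder
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

waveform = torch.randn(1, 16000 * 10)  # dummy 10-second mono clip at 16 kHz

with torch.no_grad():
    features = model(waveform)  # output type and shape depend on the checkpoint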

📈 Performance Summary

Model       Backbone   Params   Pre-train Epochs   AS-20K mAP (%)   AS-2M mAP (%)
EAT-base    ViT-B      88M      10                 40.3             48.6
EAT-base    ViT-B      88M      30                 41.3             48.9
EAT-large   ViT-L      309M     20                 42.0             49.5

Feature Extraction

We provide a simple script to extract audio features from the last layer of the EAT encoder. You can run feature extraction using either a Fairseq checkpoint or a Hugging Face model.

To get started, simply run:

bash EAT/scripts/feature_extract.sh

For more detailed instructions, see the feature extraction guide.
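
For reference, the sketch below illustrates the kind of Kaldi-style mel-filterbank input that EAT-style encoders consume. The parameters are assumptions based on the Audio-MAE conventions this codebase builds on (128 mel bins, 10 ms hop, 1024 frames for a 10-second clip, AudioSet normalization statistics); the extraction script above remains the reference implementation.

# Sketch of a typical Audio-MAE/EAT-style input pipeline. Assumptions: 128 mel
# bins at a 10 ms hop, padded/truncated to 1024 frames for 10 s of audio; the
# normalization constants are the AudioSet values used by Audio-MAE and may
# differ from those used by EAT/scripts/feature_extract.sh.
import torch
import torchaudio

def wav_to_fbank(path, target_length=1024):
    wav, sr = torchaudio.load(path)           # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)       # down-mix to mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    wav = wav - wav.mean()
    fbank = torchaudio.compliance.kaldi.fbank(
        wav,
        htk_compat=True,
        sample_frequency=16000,
        use_energy=False,
        window_type="hanning",
        num_mel_bins=128,
        dither=0.0,
        frame_shift=10,
    )                                         # (frames, 128)
    n = fbank.shape[0]
    if n < target_length:                     # pad or truncate along time
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_length - n))
    else:
        fbank = fbank[:target_length]
    # Assumed AudioSet normalization (mean/std taken from Audio-MAE).
    fbank = (fbank - (-4.268)) / (4.569 * 2)
    return fbank.unsqueeze(0).unsqueeze(0)    # (1, 1, target_length, 128)

features_input = wav_to_fbank("example.wav")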

Data Preparation

The main dataset in our experiments is AudioSet. Unfortunately, we cannot distribute the original audio files due to copyright restrictions.

However, you can access our data manifest here, which provides the metadata and file paths needed for processing. We follow the file format used in wav2vec and data2vec: the .tsv file serves as the index, while the .lbl and .csv files are specific to the classification task. You are free to modify these files to suit your own datasets or experimental needs.
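
As a concrete illustration of the wav2vec-style layout, the sketch below builds a .tsv index (root directory on the first line, then one path<TAB>num_samples row per clip). The label-file layout is not reproduced here; compare against the released manifest before reusing this for training.

# Sketch of building a wav2vec-style .tsv manifest. Assumption: the .lbl/.csv
# label files must follow the same ordering as the .tsv rows -- verify against
# the released data manifest before training.
import os
import soundfile as sf

def build_manifest(audio_root, out_tsv, ext=".wav"):
    with open(out_tsv, "w") as f:
        f.write(audio_root + "\n")                 # first line: root directory
        for dirpath, _, filenames in os.walk(audio_root):
            for name in sorted(filenames):
                if not name.endswith(ext):
                    continue
                path = os.path.join(dirpath, name)
                n_samples = sf.info(path).frames   # clip length in samples
                rel = os.path.relpath(path, audio_root)
                f.write(f"{rel}\t{n_samples}\n")   # path<TAB>num_samples

build_manifest("/path/to/audioset/eval", "eval.tsv")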

Pre-Training

Our code is adapted from Audio-MAE and data2vec. The default configuration file for pre-training is pretraining_AS2M.yaml. To pre-train the EAT model on AudioSet, simply run the following script:

bash EAT/scripts/pretraining_AS2M.sh

If you wish to pre-train the model on a different dataset where audio clips are not fixed to 10 seconds, please refer to the feature extraction guide for detailed instructions on adjusting the target length accordingly.

Fine-Tuning

We use finetuning.yaml as the default configuration file for fine-tuning. To fine-tune the EAT model on different downstream tasks, run the script finetuning_{task}.sh, where {task} is one of the supported datasets: AS20K, AS2M, ESC50, or SPCv2. For example, to fine-tune EAT on AS20K, run:

bash EAT/scripts/finetuning_AS20K.sh

Inference and Evaluation

We support both local inference and loading models via Hugging Face. To run inference on a single AudioSet audio clip using fine-tuned EAT models, you may use our checkpoint fine-tuned on AS-2M (recommended) or AS-20K. Alternatively, you can load the models directly from Hugging Face.

To start inference, run:

bash EAT/scripts/inference.sh

An example output is as follows:

# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL                          PREDICTION
Percussion                     0.523
Drum kit                       0.437
Vibraphone                     0.420
Drum                           0.316
Music                          0.303
Snare drum                     0.277
Glockenspiel                   0.225
Marimba, xylophone             0.223
Cymbal                         0.213
Bass drum                      0.207
Hi-hat                         0.196
Mallet percussion              0.170
**************************************************
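
For reference, the sketch below reproduces this kind of top-k printout from a vector of per-class probabilities and an AudioSet-style label map. The CSV column names ("index", "display_name") follow the standard class_labels_indices.csv and are an assumption here; the probabilities are random stand-ins for a real model output.

# Sketch: turn per-class probabilities into a top-k printout like the one
# above. Assumptions: `probs` would come from a fine-tuned EAT model after a
# sigmoid (multi-label AudioSet head), and the label map is an AudioSet-style
# CSV with "index" and "display_name" columns.
import csv
import torch

def print_top_k(probs, label_csv, k=12):
    with open(label_csv) as f:
        rows = list(csv.DictReader(f))
    names = [r["display_name"] for r in sorted(rows, key=lambda r: int(r["index"]))]
    values, indices = probs.topk(k)
    print("************ Acoustic Event Inference ************")
    print(f"{'LABEL':<30} PREDICTION")
    for v, i in zip(values.tolist(), indices.tolist()):
        print(f"{names[i]:<30} {v:.3f}")
    print("*" * 50)

# Example with random scores over the 527 AudioSet classes:
print_top_k(torch.rand(527), "class_labels_indices.csv")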

To evaluate on the full AudioSet evaluation set, use:

bash EAT/scripts/eval.sh

This reports the mAP (mean average precision) on the AudioSet evaluation set. Per-class AP scores are saved to ./EAT/ap_log.txt. You can also refer to our results for the fine-tuned EAT models on the AudioSet evaluation set in ./EAT/results.
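
For context on the reported metric, the sketch below computes mAP as the unweighted mean of per-class average precision using scikit-learn. This mirrors the standard AudioSet protocol, but the repository's own evaluation script is the authoritative implementation and may differ in details such as class filtering.

# Sketch: mean average precision (mAP) over classes, the metric reported by
# eval.sh. Assumption: a macro average of per-class AP, as in the standard
# AudioSet protocol; classes with no positive clips are skipped.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(targets, scores):
    """targets, scores: (num_clips, num_classes); targets are 0/1 multi-labels."""
    ap_per_class = [
        average_precision_score(targets[:, c], scores[:, c])
        for c in range(targets.shape[1])
        if targets[:, c].any()            # skip classes with no positives
    ]
    return float(np.mean(ap_per_class))

# Toy example: 8 clips, 5 classes.
rng = np.random.default_rng(0)
y_true = (rng.random((8, 5)) > 0.7).astype(int)
y_score = rng.random((8, 5))
print(mean_average_precision(y_true, y_score))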

Performance

Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification benchmarks, including AS-20K, AS-2M, ESC-50, and SPC-2.

Efficiency

EAT achieves a total pre-training time reduction of ~15x compared to BEATs and ~10x relative to Audio-MAE. The full pre-training on AS-2M requires only 10 epochs.

Experiment Logs

We report the experiment logs using wandb. We have published a short WandB report detailing the training process and performance metrics of the EAT model; you can view it here.

TODO

  • Release the final EAT large
  • Update codes and checkpoints for friendly usage
  • Provide model support on Hugging Face
  • Release the Docker image

Acknowledgement

Our codebase is based on the awesome Audio-MAE and data2vec repositories.

Institutional Contributors

Institution                     Contribution
Shanghai Jiao Tong University   Researchers; Computing power
Peng Cheng Laboratory           Researchers; Computing power

Citation

If you find our EAT codes and models useful, please cite the following paper:

@inproceedings{chen2024eat,
  title={{EAT}: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  booktitle={IJCAI},
  year={2024}
}
