This repo hosts the code and models of "Masked Autoencoders that Listen" (NeurIPS 2022).
- This repo follows the MAE repo; installation and preparation are the same as described there.
- Copy files and patch the timm package with `bash timm_patch.sh` (please change the path to your own timm package path). We use timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.
- Please find mae_env.yml for all the dependencies.
- You may also download the conda-packed conda env, untar it, and then activate it:
```bash
source path_to_env/bin/activate
```

Please download AudioSet here. Due to copyright we cannot release the data. The data annotation JSON parsed and used in this work is available here. The format follows the one in AST. Please be sure to modify the paths in the scripts to reflect your own setup.
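For reference, an AST-style annotation file is a JSON object with a `data` list, one entry per clip, giving the wav path and its comma-separated AudioSet label IDs. Below is a minimal sketch for writing such a file; the paths, output file name, and labels are placeholders, so check the released JSON for the exact fields:

```python
import json

# Hypothetical entries: replace the wav paths and AudioSet label IDs
# (e.g. "/m/09x0r" = Speech, "/m/04rlf" = Music) with your own data.
annotations = {
    "data": [
        {"wav": "/path/to/audioset/clip_0001.wav", "labels": "/m/09x0r"},
        {"wav": "/path/to/audioset/clip_0002.wav", "labels": "/m/09x0r,/m/04rlf"},
    ]
}

with open("audioset_train.json", "w") as f:
    json.dump(annotations, f, indent=2)
```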
For the brave ones who want to pre-train on AudioSet-2M, please use pretrain_audioset2M.sh:
```bash
bash pretrain_audioset2M.sh
```
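For background on what the pre-training script optimizes: AudioMAE masks a large fraction of spectrogram patches and reconstructs them, MAE-style. The sketch below shows only the per-sample random-masking step on patch embeddings; the shapes and the 0.8 masking ratio are illustrative, and this function is a sketch of the technique, not the repo's implementation:

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.8):
    """Randomly mask patch tokens, MAE-style.

    x: (batch, num_patches, embed_dim) patch embeddings of a log-mel spectrogram.
    Returns the kept (visible) tokens and a binary mask (1 = masked).
    """
    n, l, d = x.shape
    len_keep = int(l * (1 - mask_ratio))

    noise = torch.rand(n, l)                   # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # ascending: smallest noise is kept
    ids_keep = ids_shuffle[:, :len_keep]

    # Gather only the visible tokens; the encoder runs on these alone.
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(n, l)
    mask.scatter_(1, ids_keep, 0.0)            # 0 = kept, 1 = masked
    return x_visible, mask

# e.g. 512 patches from a 1024-frame x 128-mel spectrogram cut into 16x16 patches
tokens = torch.randn(4, 512, 768)
visible, mask = random_masking(tokens, mask_ratio=0.8)
print(visible.shape, mask.sum(dim=1))          # (4, 102, 768), 410 masked per clip
```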
For fine-tuning from an AudioSet-pretrained model, please use your own pre-trained model from the previous step, or download our pre-trained ckpt and put it under ./ckpt/. Then run the script submit_ft_mask_bal.sh:

```bash
bash submit_ft_mask_bal.sh 2e-4 0.2 0.2 ./ckpt/pretrained.pth
```

This will perform weighted distributed sampling on the unbalanced AudioSet to fine-tune the model with class-balanced data for 100 epochs (a sketch of the idea follows the example log below). The resulting mAP on AudioSet should be around 47.3. We provide our fine-tuned checkpoint here. An example fine-tuning log is as follows:
```
[07:10:32.717347] log_dir: /checkpoint/berniehuang/experiments/419909
[07:10:36.394431] Epoch: [99]  [  0/781]  eta: 0:47:51  lr: 0.000001  loss: 0.0066 (0.0066)  time: 3.6761  data: 1.6724  max mem: 2606
[07:12:24.728503] Epoch: [99]  [500/781]  eta: 0:01:02  lr: 0.000001  loss: 0.0116 (0.0128)  time: 0.2130  data: 0.0002  max mem: 2606
[07:13:24.602830] Epoch: [99]  [780/781]  eta: 0:00:00  lr: 0.000001  loss: 0.0122 (0.0128)  time: 0.1837  data: 0.0003  max mem: 2606
[07:13:24.853957] Epoch: [99] Total time: 0:02:52 (0.2204 s / it)
[07:13:25.085416] Averaged stats: lr: 0.000001  loss: 0.0122 (0.0126)
[07:13:28.343364] Test:  [ 0/79]  eta: 0:02:01  time: 1.5353  data: 1.5029  max mem: 2606
[07:13:30.942012] Test:  [78/79]  eta: 0:00:00  time: 0.0206  data: 0.0001  max mem: 2606
[07:13:31.180169] Test: Total time: 0:00:04 (0.0554 s / it)
[07:13:42.547896] mAP: 0.472873
[07:13:42.552120] mAP of the network on the 19148 test images: 0.4728
[07:13:42.552198] Max mAP: 0.473
[07:13:42.566228] Training time 5:16:14
submitit INFO (2022-04-22 07:13:43,404) - Job completed successfully
```
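For intuition about the weighted sampling mentioned above, here is a minimal single-GPU sketch built on torch's WeightedRandomSampler; the repo uses a distributed variant, and the exact weighting scheme here (inverse class frequency, summed over a clip's labels) is an assumption meant to illustrate the idea, not the script's implementation:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def per_sample_weights(labels_per_clip, num_classes):
    """labels_per_clip: list of label-index lists, one per audio clip."""
    class_count = np.zeros(num_classes)
    for labels in labels_per_clip:
        for c in labels:
            class_count[c] += 1
    # Rare classes get large weights; a multi-label clip sums its labels' weights.
    class_weight = 1.0 / np.maximum(class_count, 1)
    return np.array([class_weight[labels].sum() for labels in labels_per_clip])

# Toy 3-class set: label 0 is common, labels 1 and 2 are rare, so the clips
# carrying labels 1 and 2 are drawn more often than the label-0-only clips.
labels_per_clip = [[0], [0, 1], [0], [2]]
weights = per_sample_weights(labels_per_clip, num_classes=3)

sampler = WeightedRandomSampler(torch.as_tensor(weights), num_samples=4, replacement=True)
loader = DataLoader(list(range(4)), batch_size=2, sampler=sampler)
for batch in loader:
    print(batch)   # indices resampled toward the rare-label clips
```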
You can also try fine-tuning on AudioSet-20K for 60 epochs with:

```bash
sbatch ft_as.sh 1e-3 ./ckpt/pretrained.pth
```

The log.txt will look like:
{"train_lr": 2.1997867184321786e-06, "train_loss": 0.01310475811136991, "test_mAP": 0.36981118189071294, "epoch": 56, "n_parameters": 85659407}{"train_lr": 1.6171788925401227e-06, "train_loss": 0.01304934614071496, "test_mAP": 0.37001905352752995, "epoch": 57, "n_parameters": 85659407}{"train_lr": 1.2277041313086816e-06, "train_loss": 0.013038477757025324, "test_mAP": 0.36998449127640076, "epoch": 58, "n_parameters": 85659407}{"train_lr": 1.0325878664284776e-06, "train_loss": 0.012981618695671238, "test_mAP": 0.36999196624276054, "epoch": 59, "n_parameters": 85659407}The peformance on AudioSet-20K is around 37.0 mAP.
To run inference with the fine-tuned model, please put your fine-tuned model under ./ckpt, or download our fine-tuned ckpt. Then:
```bash
bash inf.sh ckpt/finetuned.pth
```

This should give you 47.3 mAP on AudioSet. An example log is as follows:
```
[18:22:12.877430] number of params (M): 85.66
[18:22:12.877460] base lr: 2.00e-03
[18:22:12.877479] actual lr: 1.25e-04
[18:22:12.877495] accumulate grad iterations: 1
[18:22:12.877511] effective batch size: 16
[18:22:12.898235] criterion = BCEWithLogitsLoss()
[18:22:14.068845] Test:  [   0/1197]  eta: 0:23:19  time: 1.1690  data: 1.0901  max mem: 1035
[18:22:55.447027] Test:  [ 300/1197]  eta: 0:02:06  time: 0.1402  data: 0.0001  max mem: 1046
[18:23:37.699615] Test:  [ 600/1197]  eta: 0:01:24  time: 0.1411  data: 0.0001  max mem: 1061
[18:24:20.110863] Test:  [ 900/1197]  eta: 0:00:41  time: 0.1417  data: 0.0001  max mem: 1075
[18:25:02.194206] Test:  [1196/1197]  eta: 0:00:00  time: 0.1526  data: 0.0001  max mem: 1090
[18:25:02.321579] Test: Total time: 0:02:49 (0.1415 s / it)
[18:25:11.997641] mAP: 0.472873
[18:25:12.004128] Accuracy of the network on the 19148 test images: 0.4729
```

Per-class APs can be found in ./aps.txt, and per-example results are in inf_output.npy.
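If you want to recompute metrics offline, mAP is the mean of the per-class average precisions. The sketch below uses scikit-learn; the layout of inf_output.npy and the existence of a matching multi-hot target file are assumptions for illustration, not the repo's documented interface:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Assumption: outputs are (num_examples, num_classes) scores saved by inf.sh,
# and targets is a matching multi-hot matrix built from the eval annotation json.
outputs = np.load("inf_output.npy")
targets = np.load("eval_targets.npy")   # hypothetical file; shapes must match

ap_per_class = [
    average_precision_score(targets[:, c], outputs[:, c])
    for c in range(targets.shape[1])
    if targets[:, c].sum() > 0          # skip classes absent from the eval set
]
print(f"mAP: {np.mean(ap_per_class):.4f} over {len(ap_per_class)} classes")
```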
Checkpoints:

- ViT-B, AS-2M pretrained
- ViT-B, AS-2M pretrained+finetuned
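A hedged sketch of loading one of these checkpoints into your own code; the assumption that the weights live under a "model" key follows MAE-style checkpoints and may not hold, so inspect the file if it fails:

```python
import torch

ckpt = torch.load("ckpt/pretrained.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)   # fall back to a bare state_dict
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
# model.load_state_dict(state_dict, strict=False)  # into your ViT-B definition
```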
TODO:

- Code and Model Release
- Provide conda-pack envs
- Notebook demos for reconstruction (legal blocked)
- Additional exps
```
@inproceedings{huang2022amae,
  title     = {Masked Autoencoders that Listen},
  author    = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year      = {2022}
}
```

Please contact Bernie Huang (berniehuang@meta.com) if you have any questions. Thank you.
The codebase is based on the awesome MAE and AST repos.
This project is under the CC-BY 4.0 license. See LICENSE for details.