Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
BMVC 2021 – Oral Presentation
[Project Page] • [ArXiv] • [BMVC Proceedings] • [Poster (for PAISS)] • [Presentation on YouTube] (Can't watch YouTube?)
Listen to the samples on our project page.
We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound given a set of visual cues as a prime.
The codebook is trained on spectrograms similarly to VQGAN (an upgraded VQVAE). We refer to it as Spectrogram VQGAN.
Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens conditioned on a set of visual features.
This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.
- Taming Visually Guided Sound Generation
- Overview
- Environment Preparation
- Data
- Pretrained Models
- Training
- Evaluation
- Sampling Tool
- The Neural Audio Codec Demo
- Citation
- Acknowledgments
During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.
Start by cloning this repo
git clone https://github.com/v-iashin/SpecVQGAN.git
Next, install the environment. For your convenience, we provide both conda and docker environments.
conda env create -f conda_env.yml
Test your environment
conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True
Download the image from Docker Hub and test if CUDA is available:
docker run \
    --mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
    --mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
    --mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
    --shm-size 8G \
    -it --gpus '"device=0"' \
    iashin/specvqgan:latest \
    python
>>> import torch; print(torch.cuda.is_available())
# True
or build it yourself
docker build - < Dockerfile --tag specvqgan
In this project, we used the VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.
The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:
cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh
The unpacked features are going to be saved in ./data/downloaded_features/*. Move them to ./data/vas and ./data/vggsound such that the folder structure matches the structure of the demo files. By default, the scripts download BN Inception features; to download ResNet50 features, uncomment the corresponding lines in ./download_*_features.sh.
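If you prefer to do the move in Python rather than by hand, here is a minimal sketch; it assumes the unpacked archives sit under ./data/downloaded_features/vas and ./data/downloaded_features/vggsound, which may not match your exact layout, so adjust the paths to mirror the demo files:

```python
# Hypothetical helper for moving unpacked feature folders into place.
# The source layout below is an assumption -- verify it against what the
# download scripts actually produced before running.
import shutil
from pathlib import Path

moves = {
    Path('./data/downloaded_features/vas'): Path('./data/vas'),
    Path('./data/downloaded_features/vggsound'): Path('./data/vggsound'),
}

for src_root, dst_root in moves.items():
    if not src_root.exists():
        continue
    dst_root.mkdir(parents=True, exist_ok=True)
    for folder in src_root.iterdir():
        # moves e.g. the melspec and feature folders as-is, keeping their names
        shutil.move(str(folder), str(dst_root / folder.name))
```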
If you wish to download the parts manually, use the following URL templates:
https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar
Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with the file names.
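If you want to verify the downloaded parts programmatically, a minimal sketch is below (not part of the repository); it assumes the md5 files follow the usual "hash filename" md5sum format and that the archives are stored in ./data/downloaded_features/, so adjust the paths if yours differ:

```python
# Verify downloaded .tar parts against the provided md5 files.
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

md5_file = Path('./data/md5sum_vas.md5')          # or ./data/md5sum_vggsound.md5
archive_dir = Path('./data/downloaded_features')  # assumed download location

for line in md5_file.read_text().splitlines():
    if not line.strip():
        continue
    expected, name = line.split()[0], line.split()[-1]
    archive = archive_dir / name
    if archive.exists():
        status = 'OK' if md5_of(archive) == expected else 'MISMATCH'
        print(f'{name}: {status}')
```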
Note, we distribute features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified though).
For BN Inception features, we employ the same procedure as RegNet.
For ResNet50 features, we rely on the video_features repository (branch specvqgan) and used these commands:
# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
    python main.py \
        --feature_type resnet50 \
        --device_ids 0 1 2 \
        --batch_size 86 \
        --extraction_fps 21.5 \
        --file_with_video_paths ./paths_to_mp4_${class}.txt \
        --output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
        --on_extraction save_pickle
done

# VGGSound (6 days on three 2080Ti)
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --batch_size 86 \
    --extraction_fps 21.5 \
    --file_with_video_paths ./paths_to_mp4s.txt \
    --output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
    --on_extraction save_pickle
Similar to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10s. For ResNet50, we achieve this by tiling the resulting frame-level features up to 215 frames along the temporal dimension (215 ≈ 21.5 fps × 10 s), e.g. as follows:
import pickle
import numpy as np

# load the frame-level features and cycle them along time up to 215 frames
feats = pickle.load(open(path, 'rb')).astype(np.float32)
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)
Unpack the pre-trained models to the ./logs/ directory.
Trained on | Evaluated on | FID ↓ | Avg. MKL ↓ | Link / MD5SUM |
---|---|---|---|---|
VGGSound | VGGSound | 1.0 | 0.8 | 7ea229427297b5d220fb1c80db32dbc5 |
VAS | VAS | 6.0 | 1.0 | 0024ad3705c5e58a11779d3d9e97cc8a |
Run the Sampling Tool to see the reconstruction results for the available data.
The setting (a): the transformer is trained on VGGSound to sample from the VGGSound codebook:
Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
---|---|---|---|---|---|
No Feats | | 13.5 | 9.7 | 7.7 | b1f9bb63d831611479249031a1203371 |
1 Feat | BN Inception | 8.6 | 7.7 | 7.7 | f2fe41dab17e232bd94c6d119a807fee |
1 Feat | ResNet50 | 11.5* | 7.3* | 7.7 | 27a61d4b74a72578d13579333ed056f6 |
5 Feats | BN Inception | 9.4 | 7.0 | 7.9 | b082d894b741f0d7a1af9c2732bad70f |
5 Feats | ResNet50 | 11.3* | 7.0* | 7.9 | f4d7105811589d441b69f00d7d0b8dc8 |
212 Feats | BN Inception | 9.6 | 6.8 | 11.8 | 79895ac08303b1536809cad1ec9a7502 |
212 Feats | ResNet50 | 10.5* | 6.9* | 11.8 | b222cc0e7aeb419f533d5806a08669fe |
* – calculated on 1 sample per video from the test set instead of the 10 samples per video used for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 10 samples per video, one might expect the values to improve a bit (~+0.1).
The setting (b): the transformer is trained on VAS to sample from the VGGSound codebook:
Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
---|---|---|---|---|---|
No Feats | | 33.7 | 9.6 | 7.7 | e6b0b5be1f8ac551700f49d29cda50d7 |
1 Feat | BN Inception | 38.6 | 7.3 | 7.7 | a98a124d6b3613923f28adfacba3890c |
1 Feat | ResNet50 | 26.5* | 6.7* | 7.7 | 37cd48f06d74176fa8d0f27303841d94 |
5 Feats | BN Inception | 29.1 | 6.9 | 7.9 | 38da002f900fb81275b73e158e919e16 |
5 Feats | ResNet50 | 22.3* | 6.5* | 7.9 | 7b6951a33771ef527f1c1b1f99b7595e |
212 Feats | BN Inception | 20.5 | 6.0 | 11.8 | 1c4e56077d737677eac524383e6d98d3 |
212 Feats | ResNet50 | 20.8* | 6.2* | 11.8 | 6e553ea44c8bc7a3310961f74e7974ea |
* – calculated on 10 samples per video from the test set instead of the 100 samples per video used for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).
The setting (c): the transformer is trained on VAS to sample from the VAS codebook:
Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
---|---|---|---|---|---|
No Feats | | 28.7 | 9.2 | 7.6 | ea4945802094f826061483e7b9892839 |
1 Feat | BN Inception | 25.1 | 6.6 | 7.6 | 8a3adf60baa049a79ae62e2e95014ff7 |
1 Feat | ResNet50 | 25.1* | 6.3* | 7.6 | a7a1342030653945e97f68a8112ed54a |
5 Feats | BN Inception | 24.8 | 6.2 | 7.8 | 4e1b24207780eff26a387dd9317d054d |
5 Feats | ResNet50 | 20.9* | 6.1* | 7.8 | 78b8d42be19dd1b0a346b1f512967302 |
212 Feats | BN Inception | 25.4 | 5.9 | 11.6 | 4542632b3c5bfbf827ea7868cedd4634 |
212 Feats | ResNet50 | 22.6* | 5.8* | 11.6 | dc2b5cbd28ad98d2f9ca4329e8aa0f64 |
* – calculated on 10 samples per video from the test set instead of the 100 samples per video used for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).
A transformer can also be trained to generate a spectrogram given a specific class. We also provide pre-trained models for all three settings:
Setting | Codebook | Sampling for | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
---|---|---|---|---|---|---|
(a) | VGGSound | VGGSound | 7.8 | 5.0 | 7.7 | 98a3788ab973f1c3cc02e2e41ad253bc |
(b) | VGGSound | VAS | 39.6 | 6.7 | 7.7 | 16a816a270f09a76bfd97fe0006c704b |
(c) | VAS | VAS | 23.9 | 5.5 | 7.6 | 412b01be179c2b8b02dfa0c0b49b9a0f |
These will be downloaded automatically during the first run. However, if you need them separately, here are the checkpoints:
- VGGish-ish (1.54GB, 197040c524a07ccacf7715d7080a80bd) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
- Melception (0.27GB, a71a41041e945b457c7d3d814bbcf72d) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
- MelGAN. If you wish to continue training it, here are the checkpoints: netD.pt, netG.pt, optD.pt, optG.pt.
The reference performance of VGGish-ish and Melception:
Model | Top-1 Acc | Top-5 Acc | mAP | mAUC |
---|---|---|---|---|
VGGish-ish | 34.70 | 63.71 | 36.63 | 95.70 |
Melception | 44.49 | 73.79 | 47.58 | 96.66 |
Run the Sampling Tool to see Melception and MelGAN in action.
The training is done in two stages. First, a spectrogram codebook is trained. Second, a transformer is trained to sample from the codebook. The first and second stages can be trained on the same or separate datasets as long as the process of spectrogram extraction is the same.
Erratum: during training with the default config, the code will silently fail to load the checkpoint of the perceptual loss. This leads to results that are as good as without the perceptual loss. For this reason, one may turn it off completely with perceptual_weight=0.0 and benefit from faster iterations. For details, please refer to Issue #13.
To train a spectrogram codebook, we tried two datasets: VAS and VGGSound. We ran our experiments on a relatively expensive hardware setup with four 40GB NVidia A100s, but the models can also be trained on one 12GB NVidia 2080Ti with a smaller batch size. When training on four 40GB NVidia A100s, change the arguments to --gpus 0,1,2,3 and data.params.batch_size=8 for the codebook and =16 for the transformer. The training will hang a bit at steps 0, 2, 4, 8, ... because of the logging. If folders with features and spectrograms are located elsewhere, the paths can be specified in the data.params.spec_dir_path, data.params.rgb_feats_dir_path, and data.params.flow_feats_dir_path arguments, but use the same format as in the config file, e.g. notice the * in the path which globs class folders.
# VAS Codebook
# mind the comma after `0,`
python train.py --base configs/vas_codebook.yaml -t True --gpus 0,
# or
# VGGSound codebook
python train.py --base configs/vggsound_codebook.yaml -t True --gpus 0,
A transformer (GPT-2) is trained to sample from the spectrogram codebook given a set of frame-level visual features.
# with the VAS codebook
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
# or with the VGGSound codebook which has 1024 codes
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.vocab_size=1024 \
    model.params.first_stage_config.params.n_embed=1024 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt
python train.py --base configs/vggsound_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt
The size of the visual condition is controlled by two arguments in the config file. feat_sample_size is the number of visual features resampled equidistantly from all available features (212), and block_size is the attention span. Make sure to use block_size = 53 * 5 + feat_sample_size. For instance, for feat_sample_size=212, block_size=477. However, the longer the condition, the more memory and time the sampling takes. By default, the configs use feat_sample_size=212 for VAS and 5 for VGGSound. Feel free to tweak it to your liking/application, for example:
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.block_size=318 \
    data.params.feat_sampler_cfg.params.feat_sample_size=53 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
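As a quick sanity check of the block_size formula, the snippet below (illustrative only, not part of the repository) computes the matching block_size for the feat_sample_size values mentioned above:

```python
# block_size = 53 * 5 + feat_sample_size: 265 spectrogram tokens plus the
# length of the visual condition.
def block_size_for(feat_sample_size: int) -> int:
    return 53 * 5 + feat_sample_size

for feat_sample_size in (1, 5, 53, 212):
    print(feat_sample_size, '->', block_size_for(feat_sample_size))
# prints: 1 -> 266, 5 -> 270, 53 -> 318, 212 -> 477
```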
The No Feats settings (without visual conditioning) are trained similarly to the settings with visual conditioning, with the condition replaced by random vectors. The optimal approach here is to use replace_feats_with_random=true along with feat_sample_size=1, for example (VAS):
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    data.params.replace_feats_with_random=true \
    model.params.transformer_config.params.GPT_config.block_size=266 \
    data.params.feat_sampler_cfg.params.feat_sample_size=1 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
We include all necessary files for training both vggishish and melception in ./specvqgan/modules/losses/vggishish. Run it on a 12GB GPU as:
cd ./specvqgan/modules/losses/vggishish
# vggish-ish
python train_vggishish.py config=./configs/vggish.yaml device='cuda:0'
# melception
python train_melception.py config=./configs/melception.yaml device='cuda:0'
To train the vocoder, use this command:
cd ./vocoder
python scripts/train.py \
    --save_path ./logs/`date +"%Y-%m-%dT%H-%M-%S"` \
    --data_path /path/to/melspec_10s_22050hz \
    --batch_size 64
The evaluation is done in two steps. First, the samples are generated for each video. Second, the evaluation script is run. The sampling procedure supports multi-GPU, multi-node parallelization. We provide a multi-GPU command which can easily be applied to a multi-node setup by setting --master_addr to your main machine and --node_rank for every worker's id (also see the sbatch script in ./evaluation/sbatch_sample.sh if you have a SLURM cluster at your disposal):
# Sample
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
    evaluation/generate_samples.py \
    sampler.config_sampler=evaluation/configs/sampler.yaml \
    sampler.model_logdir=$EXPERIMENT_PATH \
    sampler.splits=$SPLITS \
    sampler.samples_per_video=$SAMPLES_PER_VIDEO \
    sampler.batch_size=$SAMPLER_BATCHSIZE \
    sampler.top_k=$TOP_K \
    data.params.spec_dir_path=$SPEC_DIR_PATH \
    data.params.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
    data.params.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
    sampler.now=$NOW

# Evaluate
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
    evaluate.py \
    config=./evaluation/configs/eval_melception_${DATASET,,}.yaml \
    input2.path_to_exp=$EXPERIMENT_PATH \
    patch.specs_dir=$SPEC_DIR_PATH \
    patch.spec_dir_path=$SPEC_DIR_PATH \
    patch.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
    patch.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
    input1.params.root=$EXPERIMENT_PATH/samples_$NOW/$SAMPLES_FOLDER
The variables for theVAS dataset:
EXPERIMENT_PATH="./logs/<folder-name-of-vas-transformer-or-codebook>"SPEC_DIR_PATH="./data/vas/features/*/melspec_10s_22050hz/"RGB_FEATS_DIR_PATH="./data/vas/features/*/feature_rgb_bninception_dim1024_21.5fps/"FLOW_FEATS_DIR_PATH="./data/vas/features/*/feature_flow_bninception_dim1024_21.5fps/"SAMPLES_FOLDER="VAS_validation"SPLITS="\"[validation, ]\""SAMPLER_BATCHSIZE=4SAMPLES_PER_VIDEO=10TOP_K=64# use TOP_K=512 when evaluating a VAS transformer trained with a VGGSound codebookNOW=`date +"%Y-%m-%dT%H-%M-%S"`
The variables for theVGGSound dataset:
EXPERIMENT_PATH="./logs/<folder-name-of-vggsound-transformer-or-codebook>"SPEC_DIR_PATH="./data/vggsound/melspec_10s_22050hz/"RGB_FEATS_DIR_PATH="./data/vggsound/feature_rgb_bninception_dim1024_21.5fps/"FLOW_FEATS_DIR_PATH="./data/vggsound/feature_flow_bninception_dim1024_21.5fps/"SAMPLES_FOLDER="VGGSound_test"SPLITS="\"[test, ]\""SAMPLER_BATCHSIZE=32SAMPLES_PER_VIDEO=1TOP_K=512NOW=`date +"%Y-%m-%dT%H-%M-%S" the`
For interactive sampling, we rely on the Streamlit library. To start the Streamlit server locally, run:
# mind the trailing `--`
streamlit run --server.port 5555 ./sample_visualization.py --
# go to `localhost:5555` in your browser
Alternatively, we provide a similar notebook in ./generation_demo.ipynb to play with the demo on a local machine.
While the Spectrogram VQGAN was never designed to be a neural audio codec, it happened to be highly effective for this task. We can employ our Spectrogram VQGAN pre-trained on an open-domain dataset as a neural audio codec without any change.
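Conceptually, the codec round trip looks roughly like the sketch below. This is not the demo code: it assumes a taming-transformers-style first-stage interface (encode returning the quantized latents and code indices, decode reconstructing the spectrogram) and a MelGAN-style vocoder; see the Colab demo and the notebook mentioned below for the exact usage.

```python
# Conceptual sketch of using the Spectrogram VQGAN as a neural audio codec.
# `codebook_model` and `vocoder` are assumed to be the pre-trained first-stage
# model and the MelGAN vocoder; the interface is assumed, not guaranteed.
import torch

@torch.no_grad()
def codec_round_trip(codebook_model, vocoder, spec):
    # spec: a (1, 1, mel_bins, time) mel-spectrogram tensor
    quant, _, (_, _, code_indices) = codebook_model.encode(spec)  # compress to code indices
    spec_rec = codebook_model.decode(quant)                       # reconstruct the spectrogram
    wave_rec = vocoder(spec_rec.squeeze(1))                       # spectrogram -> waveform
    return code_indices, spec_rec, wave_rec
```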
If you wish to apply the SpecVQGAN for compression of arbitrary audio, please see our Google Colab demo.
Integrated into Hugging Face Spaces with Gradio. See the demo there.
Alternatively, we provide a similar notebook in ./neural_audio_codec_demo.ipynb to play with the demo on a local machine.
Our paper was accepted as an oral presentation at BMVC 2021. Please use this BibTeX if you would like to cite our work:
@InProceedings{SpecVQGAN_Iashin_2021,
  title={Taming Visually Guided Sound Generation},
  author={Iashin, Vladimir and Rahtu, Esa},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2021}
}
Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.
We also acknowledge the following work:
- The code base is built upon the amazing taming-transformers repo. Check it out if you are into high-res image generation.
- The implementation of some evaluation metrics is partially borrowed and adapted from torch-fidelity.
- The feature extraction pipeline for BN-Inception relies on the baseline implementation from RegNet.
- MelGAN training scripts are built upon the official implementation of text-to-speech MelGAN.
- Thanks to AK391 for adapting our neural audio codec demo as a Gradio app.