- Notifications
You must be signed in to change notification settings - Fork56
Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
License
v-iashin/BMT
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
Project Page•ArXiv•BMVC Page•Presentation (Can't watch YouTube? I gotchu! 🤗)•
This is a PyTorch implementation for our paper: A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (BMVC 2020).
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting the visual information alone, while completely neglecting the audio track.
To this end, we presentBi-modal Transformer with Proposal Generator (BMT), which efficiently utilizes audio and visual input sequences to select events in a video and, then, use these clips to generate a textual description.
Audio and visual features are encoded withVGGish andI3D while caption tokens withGloVe. First, VGGish and I3D features are passed through the stack ofN bi-modal encoder layers where audio and visual sequences are encoded to, what we call, audio-attended visual and video-attended audio features. These features are passed to the bi-modal multi-headed proposal generator, which generates a set of proposals using information from both modalities.
Then, the input features are trimmed according to the proposed segments and encoded in the bi-modal encoder again. The stack ofN bi-modal decoder layers inputs both: a) GloVe embeddings of the previously generated caption sequence, b) the internal representation from the last layer of the encoder for both modalities. The decoder produces its internal representation which is, then, used in the generator model the distribution over the vocabulary for the caption next word.
The code is tested onUbuntu 16.04/18.04 with oneNVIDIA GPU 1080Ti/2080Ti. If you are planning to use it with other software/hardware, you might need to adaptconda environment files or even the code.
Clone the repository. Mind the--recursive flag to make suresubmodules are also cloned (evaluation scripts for Python 3 and scripts for feature extraction).
git clone --recursive https://github.com/v-iashin/BMT.git
Download features (I3D and VGGish) and word embeddings (GloVe). The script will download them (~10 GB) and unpack into./data and./.vector_cache folders.Make sure to run it while being in BMT folder
bash ./download_data.sh
Set up aconda environment
conda env create -f ./conda_env.ymlconda activate bmt# install spacy language model. Make sure you activated the conda environmentpython -m spacy download enWe train our model in two staged: training of the captioning module on ground truth proposals and training of the proposal generator using the pre-trained encoder from the captioning module.
- Train the captioning module. You may also download the pre-trained modelbest_cap_model.pt (
md5 hash 7b4d48cd77ec49a027a4a1abc6867ee7).
python main.py \ --procedure train_cap \ --B 32
- Train proposal generation module. You may also download the pre-trained modelbest_prop_model.pt (
md5 hash 5f8b20826b09eadd41b7a5be662c198b)
python main.py \ --procedure train_prop \ --pretrained_cap_model_path /your_exp_path/best_cap_model.pt \ --B 16
Since a part of videos in ActivityNet Captions became unavailable over the time, we could only obtain ~91 % of videos in the dataset (see./data/available_mp4.txt for ids). To this end, we evaluate the performance of our model against ~91 % of the validation videos. We provide the validation sets without such videos in./data/val_*_no_missings.json. Please seeExperiments andSupplementary Material sections for details and performance of other models on the same validation sets.
- Ground truth proposals. The performance of the captioning module on ground truth segments might be obtained from the file with pre-trained captioning module. You may also want to use theofficial evaluation script with
./data/val_*_no_missings.jsonas references (-rargument).
importtorchcap_model_cpt=torch.load('./path_to_pre_trained_model/best_cap_model.pt',map_location='cpu')print(cap_model_cpt['val_1_metrics'])print(cap_model_cpt['val_2_metrics'])# To obtain the final results, average values in both dicts
- Learned proposals. Create a file with captions for every proposal provided in
--prop_pred_pathusing the captioning model specified in--pretrained_cap_model_path. The script will automatically evaluate it againts both ground truth validation sets. Alternatively, use the predictionsprop_results_val_1_e17_maxprop100.jsonin./resultsandofficial evaluation script with./data/val_*_no_missings.jsonas references (-rargument).
python main.py \ --procedure evaluate \ --pretrained_cap_model_path /path_to_best_cap_model.pt \ --prop_pred_path /path_to_generated_json_file \ --device_ids 0
Check out our script for extraction of I3D and VGGish features from a set of videos:video_features on GitHub (make sure to checkout to662ec51caf591e76724237f0454bdf7735a8dcb1 commit). Also see#7 for more details on configuration.
We would like to note that, despite a fixed random seed, some randomness occurs in our experimentation. Therefore, during the training of the captioning module, one might achieve slightly different results. Specifically, the numbers in your case might differ (higher or lower) from ours or the model will saturate in a different number of epochs. At the same time, we observed quite consistent results when training the proposal generation module with the pre-trained captioning module.
We relate this problem to padding and how it is implemented in PyTorch. (seePyTorch Reproducibility for details). Also, any suggestions on how to address this issue are greatly appreciated.
Comparison betweenMDVC and Bi-modal Transformer (BMT) on ActivityNet Captions validation set captioning ground truth proposals. BMT performs on par while having three times fewer parameters and using only two modalities.
| Model | Params (Mill) | BLEU@3 | BLEU@4 | METEOR |
|---|---|---|---|---|
| MDVC | 149 | 4.52 | 1.98 | 11.07 |
| BMT | 51 | 4.63 | 1.99 | 10.90 |
The experience with Google Colab is pretty poor. To ensure a more optimal experience, we recommend following the installation guide and setting up the software locally as described below.
Start by extracting audio and visual features from your video usingvideo_features repository. This repo is also included in./submodules/video_features (commit662ec51caf591e76724237f0454bdf7735a8dcb1).
Extract I3D features
# run this from the video_features folder:cd ./submodules/video_featuresconda deactivateconda activate i3dpython main.py \ --feature_type i3d \ --on_extraction save_numpy \ --device_ids 0 \ --extraction_fps 25 \ --video_paths ../../sample/women_long_jump.mp4 \ --output_path ../../sample/
Extract VGGish features (ifValueError, download the vggish model first--seeREADME.md in./submodules/video_features)
conda deactivateconda activate vggishpython main.py \ --feature_type vggish \ --on_extraction save_numpy \ --device_ids 0 \ --video_paths ../../sample/women_long_jump.mp4 \ --output_path ../../sample/
Run the inference
# run this from the BMT main folder:cd ../../conda deactivateconda activate bmtpython ./sample/single_video_prediction.py \ --prop_generator_model_path ./sample/best_prop_model.pt \ --pretrained_cap_model_path ./sample/best_cap_model.pt \ --vggish_features_path ./sample/women_long_jump_vggish.npy \ --rgb_features_path ./sample/women_long_jump_rgb.npy \ --flow_features_path ./sample/women_long_jump_flow.npy \ --duration_in_secs 35.155 \ --device_id 0 \ --max_prop_per_vid 100 \ --nms_tiou_thresh 0.4
Expected output
[ {'start': 0.1, 'end': 4.9, 'sentence': 'We see a title screen'}, {'start': 5.0, 'end': 7.9, 'sentence': 'A large group of people are seen standing around a building'}, {'start': 0.7, 'end': 11.9, 'sentence': 'A man is seen standing in front of a large crowd'}, {'start': 19.6, 'end': 33.3, 'sentence': 'The woman runs down a track and jumps into a sand pit'}, {'start': 7.5, 'end': 10.0, 'sentence': 'A large group of people are seen standing around a building'}, {'start': 0.6, 'end': 35.1, 'sentence': 'A large group of people are seen running down a track while others watch on the sides'}, {'start': 8.2, 'end': 13.7, 'sentence': 'A man runs down a track'}, {'start': 0.1, 'end': 2.0, 'sentence': 'We see a title screen'}]Note that in our research we avoided non-maximum suppression for computational efficiency and to allow the event prediction to be dense. Feel free to play with--nms_tiou_thresh parameter: for example, try to make it0.4 as in the provided example.
The sample video credits:Women's long jump historical World record in 1978
If you are having an error
RuntimeError: Vector for token b'<something>' has <some-number> dimensions, but previously read vectorshave 300 dimensions.try to remove*.txt and*.txt.pt from the hidden folder./.vector_cache/ and check if youare not running out of disk space (unpacking ofglove.840B.300d.zip requires extra ~8.5G).Then runsingle_video_prediction.py again.
Our paper was accepted at BMVC 2020. Please, use this bibtex if you would like to cite our work
@InProceedings{BMT_Iashin_2020, title={A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer}, author={Iashin, Vladimir and Rahtu, Esa}, booktitle={British Machine Vision Conference (BMVC)}, year={2020}}@InProceedings{MDVC_Iashin_2020, title = {Multi-Modal Dense Video Captioning}, author = {Iashin, Vladimir and Rahtu, Esa}, booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, pages={958--959}, year = {2020}}Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.
- Prithviraj contributed to theGoogle Colab demo
About
Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors3
Uh oh!
There was an error while loading.Please reload this page.