Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)
This repository contains a PyTorch implementation of the PVSE network and the MRW dataset proposed in our paper Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019). The code and data are free to use for academic purposes only.
Please also visit our project page.
- MRW Dataset
- Setting up an environment
- Download and prepare data
- Evaluate pretrained models
- Train your own model
Our My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, where videos display physical or emotional reactions to the situations described in sentences. The subreddit /r/reactiongifs contains several examples; some representative pairs are shown below.
Below are the descriptive statistics of the dataset. The word vocabulary size is 34,835. The dataset can be used for evaluating cross-modal retrieval systems under ambiguous/weak association between vision and language.
| | Train | Validation | Test | Total |
|---|---|---|---|---|
| #pairs | 44,107 | 1,000 | 5,000 | 50,107 |
| Avg. #frames | 104.91 | 209.04 | 209.55 | 117.43 |
| Avg. #words | 11.36 | 15.02 | 14.79 | 11.78 |
| Avg. word frequency | 15.48 | 4.80 | 8.57 | 16.94 |
We provide a detailed analysis of the dataset in the supplementary material of the main paper.
Follow the instructions below to download the dataset.
We recommend creating a virtual environment and installing packages there. Note that you must install the Cython package first.
```bash
python3 -m venv <your virtual environment name>
source <your virtual environment name>/bin/activate
pip3 install Cython
pip3 install -r requirements.txt
```
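If you want a quick check that the environment is usable before moving on, a minimal sketch like the following works. It only assumes PyTorch is installed (this is a PyTorch implementation); your `requirements.txt` may pin additional packages.

```python
# Optional sanity check that the environment resolved correctly.
# Only PyTorch is assumed here; requirements.txt may include more packages.
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA-capable GPU is visible
```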
To download and prepare the MRW dataset, run:
```bash
cd data
bash prepare_mrw_dataset.sh
```
This will download the dataset (without videos) in a JSON format, a vocabulary file, and train/val/test splits. It will then show the following prompt:
```
Do you wish to download video data and gulp them? [y/n]
```
We provide two ways to obtain the video data. The recommended option is to download pre-compiled data in a GulpIO binary storage format, which contains video frames sampled at 8 FPS. For this, simply hit `n` (this will terminate the script) and download our pre-compiled GulpIO data from this link (54 GB). After the download finishes, extract the tarball under `data/mrw/gulp` to train and/or test our models.
If you wish to download raw video clips and gulp them on your own, hit `y` when prompted with the message above. This will start downloading videos and, once finished, start gulping the video files at 8 FPS (you can change this in `download_gulp_mrw.py`). If you encounter any problems downloading the video files, you may also download them directly from this link (19 GB), and then continue gulping them using the script `download_gulp_mrw.py`.
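Either way, you end up with gulped data under `data/mrw/gulp`. If you want to sanity-check it independently of this repository's data loader, a minimal sketch along these lines can help; it assumes the GulpIO Python API (`GulpDirectory`), and the exact API may vary across gulpio versions:

```python
# Quick sanity check of the gulped MRW data, independent of the repository's
# own data loader. Assumes the gulpio package (GulpDirectory) and the data
# extracted under data/mrw/gulp.
from gulpio import GulpDirectory

gulp_dir = GulpDirectory('data/mrw/gulp')

# Iterate over the binary chunks and inspect the first stored video.
for chunk in gulp_dir:
    for frames, meta in chunk:
        # `frames` is a list of decoded frames (numpy arrays), sampled at
        # 8 FPS at gulping time; `meta` is the metadata stored alongside them.
        print(len(frames), frames[0].shape, meta)
        break
    break
```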
To download and prepare the TGIF dataset, run:
```bash
cd data
bash prepare_tgif_dataset.sh
```
This will download the dataset (without videos) in a TSV format, a vocabulary file, and train/val/test splits. Please note that we use a slightly modified version of the TGIF dataset because of invalid video files; the script will automatically download the modified version.
It will then show the following prompt:
```
Do you wish to gulp the data? [y/n]
```
Similar to the MRW data, we provide two options for obtaining the data: (1) download pre-compiled GulpIO data, or (2) download raw video clips and gulp them on your own. We recommend the first option for an easy start. For this, simply hit `n` and download our pre-compiled GulpIO data from this link (89 GB). After `tgif-gulp.tar.gz` finishes downloading, extract the tarball under `data/tgif/gulp`.
If you wish to gulp the dataset on your own, hit `y` and follow the prompt. Note that you must first download a tarball containing the videos before gulping. You can download the file `tgif.tar.gz` (124 GB) from this link and place it under `./data/tgif`. Once you have the video data, the script will start gulping the video files.
To download and prepare the COCO dataset, run:
```bash
cd data
bash prepare_coco_dataset.sh
```
Download all six pretrained models in a tarball at this link. You can also download each model individually using the links below.
| Dataset | Model | Command |
|---|---|---|
| COCO | PVSE (k=1) [download] | `python3 eval.py --data_name coco --num_embeds 1 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse_k1.pth` |
| COCO | PVSE [download] | `python3 eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse.pth` |
| MRW | PVSE (k=1) [download] | `python3 eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse_k1.pth` |
| MRW | PVSE [download] | `python3 eval.py --data_name mrw --num_embeds 5 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse.pth` |
| TGIF | PVSE (k=1) [download] | `python3 eval.py --data_name tgif --num_embeds 1 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse_k1.pth` |
| TGIF | PVSE [download] | `python3 eval.py --data_name tgif --num_embeds 3 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse.pth` |
Using the pretrained models, you should be able to reproduce the results in the table below:
| Dataset | Model | Image/Video-to-Text R@1 / R@5 / R@10 / Med r (nMR) | Text-to-Image/Video R@1 / R@5 / R@10 / Med r (nMR) |
|---|---|---|---|
| COCO 1K | PVSE (k=1) | 66.72 / 91.00 / 96.22 / 1 (0.00) | 53.49 / 85.14 / 92.70 / 1 (0.00) |
| COCO 1K | PVSE | 69.24 / 91.62 / 96.64 / 1 (0.00) | 55.21 / 86.50 / 93.73 / 1 (0.00) |
| COCO 5K | PVSE (k=1) | 41.72 / 72.96 / 82.90 / 2 (0.00) | 30.64 / 61.37 / 73.62 / 3 (0.00) |
| COCO 5K | PVSE | 45.18 / 74.28 / 84.46 / 2 (0.00) | 32.42 / 62.97 / 74.96 / 3 (0.00) |
| MRW | PVSE (k=1) | 0.16 / 0.68 / 0.90 / 1700 (0.34) | 0.16 / 0.56 / 0.88 / 1650 (0.33) |
| MRW | PVSE | 0.18 / 0.62 / 1.18 / 1624 (0.32) | 0.20 / 0.70 / 1.16 / 1552 (0.31) |
| TGIF | PVSE (k=1) | 2.82 / 9.07 / 14.02 / 128 (0.01) | 2.63 / 9.37 / 14.58 / 115 (0.01) |
| TGIF | PVSE | 3.28 / 9.87 / 15.56 / 115 (0.01) | 3.01 / 9.70 / 14.85 / 109 (0.01) |
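For reference, R@K is the percentage of queries whose ground-truth match appears in the top K retrieved items, and Med r is the median rank of the ground truth. Below is a minimal sketch of how these metrics can be computed from a query-by-gallery similarity matrix; it is an illustration only, not the repository's evaluation code (which also handles cases such as multiple captions per image).

```python
import numpy as np

def retrieval_metrics(sim):
    """R@1/5/10 and median rank from a query-by-gallery similarity matrix.

    sim[i, j] is the similarity between query i and gallery item j, and
    gallery item i is assumed to be the ground-truth match for query i
    (one match per query in this toy version).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                 # best match first
    ranks = np.array([np.where(order[i] == i)[0][0]  # 0-based rank of ground truth
                      for i in range(n)])
    recall = {k: 100.0 * np.mean(ranks < k) for k in (1, 5, 10)}
    median_rank = int(np.median(ranks)) + 1          # 1-based, as reported above
    return recall, median_rank

# Toy usage with a random similarity matrix:
print(retrieval_metrics(np.random.rand(100, 100)))
```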
You can train your own model using `train.py`; check `option.py` for all available options.
For example, you can train our PVSE model (k=2) on COCO using the command below. It uses ResNet152 as the backbone CNN, GloVe word embeddings, an MMD loss weight of 0.01, a DIV loss weight of 0.1, and a batch size of 256:
```bash
python3 train.py --data_name coco --cnn_type resnet152 --wemb_type glove --margin 0.1 --max_violation --num_embeds 2 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 256
```
For video models, you should set the parameter `--max_video_length`; otherwise it defaults to 1 (single frame). Here's an example command:
```bash
python3 train.py --data_name mrw --max_video_length 4 --cnn_type resnet18 --wemb_type glove --margin 0.1 --num_embeds 4 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 128
```
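The `--mmd_weight` and `--div_weight` flags above weight the MMD (maximum mean discrepancy) and diversity (DIV) loss terms mentioned earlier. As a rough sketch of the diversity idea only, and not necessarily the exact formulation used in this repository, such a penalty can be written as the distance between the Gram matrix of the k embeddings of an instance and the identity:

```python
import torch
import torch.nn.functional as F

def diversity_penalty(embeds):
    """Toy diversity regularizer over the k embeddings of each instance.

    embeds: tensor of shape (batch, k, dim), assumed L2-normalized.
    Pushes the Gram matrix of the k embeddings towards the identity,
    i.e. penalizes ||E E^T - I||_F^2, so the k embeddings stay distinct.
    This is an illustration, not the repository's exact loss.
    """
    b, k, _ = embeds.shape
    gram = torch.bmm(embeds, embeds.transpose(1, 2))              # (batch, k, k)
    identity = torch.eye(k, device=embeds.device).expand(b, k, k)
    return ((gram - identity) ** 2).sum(dim=(1, 2)).mean()

# Toy usage: batch of 8 instances, k=2 embeddings of dimension 1024 each.
e = F.normalize(torch.randn(8, 2, 1024), dim=-1)
print(diversity_penalty(e))
```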
If you use any of the material in this repository, we ask you to cite:
```
@inproceedings{song-pvse-cvpr19,
  author    = {Yale Song and Mohammad Soleymani},
  title     = {Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval},
  booktitle = {CVPR},
  year      = 2019
}
```
Our code is based on the implementation by Faghri et al.