Official PyTorch implementation of the IJCAI 2022 paper S2 Transformer for Image Captioning
Pengpeng Zeng, Haonan Zhang, Jingkuan Song, and Lianli Gao
Clone this repository and create the m2release conda environment using the environment.yml file:
conda env create -f environment.yml
conda activate m2release
Then download the spaCy data by executing the following command:
python -m spacy download en_core_web_md
Note
Python 3 is required to run our code. If you encounter network problems, please download the en_core_web_md library from here, unzip it, and place it in /your/anaconda/path/envs/m2release/lib/python*/site-packages/
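To verify the model is reachable from the m2release environment (whether installed via the download command or copied manually), a quick check like the following can be run; the sample sentence is arbitrary.

```python
# Optional sanity check: confirm the spaCy model loads from the m2release environment.
import spacy

nlp = spacy.load("en_core_web_md")            # raises OSError if the model is not installed
doc = nlp("a man riding a horse on a beach")  # arbitrary sample sentence
print(len(doc), doc[0].vector.shape)          # token count and word-vector dimensionality
```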
- Annotation. Download the annotation file m2_annotations [1]. Extract it and put it in the project root directory.
- Feature. Download the processed ResNeXt-101 and ResNeXt-152 image features [2] (code: 9vtB) and put them in the project root directory; a quick sanity check for the downloaded feature file is sketched below.
Update: the image features are also available on OneDrive.
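As a quick sanity check on the downloaded features, the HDF5 file can be opened with h5py. The file name and dataset layout below are assumptions for illustration; the keys in the released files may be named or grouped differently.

```python
import h5py

# Hypothetical file name; point this at the downloaded ResNeXt feature file.
features_path = "coco_grid_feats.hdf5"

with h5py.File(features_path, "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} top-level entries; first few keys: {keys[:5]}")
    first = f[keys[0]]
    # The entry may be a dataset or a group depending on how the file is organized.
    if isinstance(first, h5py.Dataset):
        print("shape:", first.shape, "dtype:", first.dtype)
    else:
        print("group with members:", list(first.keys()))
```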
Run python train_transformer.py using the following arguments:
Argument | Possible values |
---|---|
--exp_name | Experiment name |
--batch_size | Batch size (default: 50) |
--workers | Number of data-loading workers (accelerates training in the XE stage). |
--head | Number of heads (default: 8) |
--resume_last | If used, the training will be resumed from the last checkpoint. |
--resume_best | If used, the training will be resumed from the best checkpoint. |
--features_path | Path to visual features file (h5py) |
--annotation_folder | Path to annotations |
--num_clusters | Number of pseudo regions |
For example, to train the model, run the following command:
python train_transformer.py --exp_name S2 --batch_size 50 --m 40 --head 8 --features_path /path/to/features --num_clusters 5
or just run:
bash train.sh
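For reference, the flags above map naturally onto an argparse parser. The sketch below is only illustrative: defaults other than those stated in the table are placeholder assumptions, not the values used by train_transformer.py.

```python
import argparse

# Illustrative parser mirroring the documented flags; defaults not given in the
# table above are placeholders rather than the repository's actual values.
parser = argparse.ArgumentParser(description="S2 Transformer training (illustrative)")
parser.add_argument("--exp_name", type=str, default="S2", help="Experiment name")
parser.add_argument("--batch_size", type=int, default=50, help="Batch size")
parser.add_argument("--workers", type=int, default=4, help="Data-loading workers")
parser.add_argument("--head", type=int, default=8, help="Number of attention heads")
parser.add_argument("--resume_last", action="store_true", help="Resume from the last checkpoint")
parser.add_argument("--resume_best", action="store_true", help="Resume from the best checkpoint")
parser.add_argument("--features_path", type=str, required=True, help="Path to HDF5 visual features")
parser.add_argument("--annotation_folder", type=str, default="m2_annotations", help="Path to annotations")
parser.add_argument("--num_clusters", type=int, default=5, help="Number of pseudo regions")

if __name__ == "__main__":
    print(parser.parse_args())
```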
Note
We use torch.distributed to train our model; you can set worldSize in train_transformer.py to determine the number of GPUs used for training.
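As a generic illustration of the mechanism behind that setting (not the repository's actual launcher code), single-node multi-GPU training with torch.distributed typically spawns one process per GPU, roughly as follows:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; NCCL is the usual backend for CUDA tensors.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(512, 512).cuda(rank)  # stand-in for the captioning model
    model = DDP(model, device_ids=[rank])
    # ... build dataloaders with DistributedSampler and run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # analogous to the worldSize setting in train_transformer.py
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```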
Run python test_transformer.py to evaluate the model with the following arguments:
python test_transformer.py --batch_size 10 --features_path /path/to/features --model_path /path/to/saved_transformer_models/ckpt --num_clusters 5
Tip
We have removed the SPICE evaluation metric during training because it is time-consuming. You can add it back when evaluating the model: download this file and put it in /path/to/evaluation/, then uncomment the corresponding code in __init__.py.
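For reference, the scorers in evaluation/ are based on the standard COCO caption metrics. The standalone sketch below uses the pycocoevalcap package as an assumption for illustration; the repo's own wrappers may expose a different interface, and the SPICE scorer additionally needs the Java-based jar referenced above.

```python
# Standalone metric computation with pycocoevalcap (pip install pycocoevalcap).
# Illustrative only: the repo's evaluation/ package wraps scorers like these,
# but its exact interface may differ.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Keys are image ids; values are lists of reference / generated captions (toy data).
gts = {0: ["a man riding a horse on a beach", "a person rides a horse near the ocean"],
       1: ["two dogs play in the snow", "dogs playing in snow"]}
res = {0: ["a man is riding a horse on the beach"],
       1: ["two dogs are playing in the snow"]}

bleu, _ = Bleu(4).compute_score(gts, res)   # list of BLEU-1..BLEU-4
cider, _ = Cider().compute_score(gts, res)  # corpus-level CIDEr (degenerate on toy data)
print("BLEU-4:", bleu[3], "CIDEr:", cider)
```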
We provide a checkpoint here; using it, you should get the following results (second row):
Model | B@1 | B@4 | M | R | C | S |
---|---|---|---|---|---|---|
Our Paper (ResNeXt-101) | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
Reproduced Model (ResNeXt-101) | 81.2 | 39.9 | 29.6 | 59.1 | 133.7 | 23.3 |
We also report the performance of our model on the online COCO test server with an ensemble of four S2 models. The detailed online test code can be obtained in this repo.
[1] Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Zhang, X., Sun, X., Luo, Y., Ji, J., Zhou, Y., Wu, Y., Huang, F., & Ji, R. (2021). RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15465–15474).
@inproceedings{S2,
  author    = {Pengpeng Zeng* and Haonan Zhang* and Jingkuan Song and Lianli Gao},
  title     = {S2 Transformer for Image Captioning},
  booktitle = {IJCAI},
  pages     = {1608--1614},
  year      = {2022}
}
Thanks to Zhang et al. for releasing the visual features (ResNeXt-101 and ResNeXt-152); our code implementation is also based on their repo.
Thanks also for the original annotations prepared by M2 Transformer and the effective visual representations from grid-feats-vqa.