
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. CVPR 2019


This repository contains the reference code for the paper Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions (CVPR 2019).

Please cite with the following BibTeX:

@inproceedings{cornia2019show,
  title={{Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions}},
  author={Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2019}
}

[Figure: sample results]

Environment setup

Clone the repository and create the sct conda environment using the conda.yml file:

conda env create -f conda.yml
conda activate sct

Our code is based on SpeakSee, a Python package developed by us that provides utilities for working with visual-semantic data. The conda environment we provide already includes a beta version of this package.
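
As a quick sanity check after activating the environment, a minimal sketch like the following only verifies that the core dependencies import (the package names come from the setup above; anything beyond that would be an assumption):

```python
# Minimal sanity check for the `sct` environment; run after `conda activate sct`.
import torch     # deep learning backend used by the training/test scripts
import speaksee  # visual-semantic utilities; a beta version ships with the env

print('PyTorch version:', torch.__version__)
print('SpeakSee imported successfully')
```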

Data preparation

COCO Entities

Download the annotations and metadata file dataset_coco.tgz (~85.6 MB) and extract it in the code folder using tar -xzvf dataset_coco.tgz.

Download the pre-computed features file coco_detections.hdf5 (~53.5 GB) and place it under the datasets/coco folder, which gets created after decompressing the annotation file.
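
To verify the download, a sketch like the one below can inspect the HDF5 file. The internal key layout is not documented here, so this only lists the first few entries; the demo notebook referenced later shows the actual access pattern. The same approach works for the Flickr30k features file below.

```python
# A minimal sketch to peek inside the pre-computed features file.
# Assumption: only that it is a standard HDF5 file; the key layout is
# illustrated in the demo notebook, so we just list a few entries.
import h5py

with h5py.File('datasets/coco/coco_detections.hdf5', 'r') as f:
    for i, key in enumerate(f.keys()):
        item = f[key]
        shape = item.shape if isinstance(item, h5py.Dataset) else '(group)'
        print(key, shape)
        if i >= 4:  # the first five entries are enough for a sanity check
            break
```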

Flickr30k Entities

As before, download the annotations and metadata file dataset_flickr.tgz (~32.8 MB) and extract it in the code folder using tar -xzvf dataset_flickr.tgz.

Download the pre-computed features file flickr30k_detections.hdf5 (~13.1 GB) and place it under the datasets/flickr folder, which gets created after decompressing the annotation file.

Evaluation

To reproduce the results of the paper, download the pretrained model file saved_models.tgz (~4 GB) and extract it in the code folder with tar -xzvf saved_models.tgz.

Sequence controllability

Run python test_region_sequence.py using the following arguments:

| Argument | Possible values |
|---|---|
| `--dataset` | `coco`, `flickr` |
| `--exp_name` | `ours`, `ours_without_visual_sentinel`, `ours_with_single_sentinel` |
| `--sample_rl` | If used, tests the model with CIDEr optimization |
| `--sample_rl_nw` | If used, tests the model with CIDEr + NW optimization |
| `--batch_size` | Batch size (default: 16) |
| `--nb_workers` | Number of workers (default: 0) |

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 2, bottom right), use:

python test_region_sequence.py --dataset coco --exp_name ours --sample_rl_nw

Set controllability

Run python test_region_set.py using the following arguments:

| Argument | Possible values |
|---|---|
| `--dataset` | `coco`, `flickr` |
| `--exp_name` | `ours`, `ours_without_visual_sentinel`, `ours_with_single_sentinel` |
| `--sample_rl` | If used, tests the model with CIDEr optimization |
| `--sample_rl_nw` | If used, tests the model with CIDEr + NW optimization |
| `--batch_size` | Batch size (default: 16) |
| `--nb_workers` | Number of workers (default: 0) |

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 4, bottom row), use:

python test_region_set.py --dataset coco --exp_name ours --sample_rl_nw

Expected output

The expected output of all experiments can be found under logs/.

Training procedure

Run python train.py using the following arguments:

| Argument | Possible values |
|---|---|
| `--exp_name` | Experiment name |
| `--batch_size` | Batch size (default: 100) |
| `--lr` | Initial learning rate (default: 5e-4) |
| `--nb_workers` | Number of workers (default: 0) |
| `--sample_rl` | If used, the model will be trained with CIDEr optimization |
| `--sample_rl_nw` | If used, the model will be trained with CIDEr + NW optimization |

For example, to train the model with cross entropy, use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-4

To train the model with CIDEr optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl

To train the model with CIDEr + NW optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl_nw

Note: the current training code only supports the use of the COCO Entities dataset.
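
If you prefer to run the full schedule unattended, a small wrapper like the sketch below chains the two stages with the same flags as the commands above. The wrapper itself is not part of the repository, and the experiment name is hypothetical:

```python
# A convenience wrapper that chains the documented training schedule:
# cross-entropy pre-training followed by CIDEr + NW fine-tuning.
import subprocess

EXP_NAME = 'show_control_and_tell'  # hypothetical experiment name

# Stage 1: cross-entropy training (lr 5e-4)
subprocess.run(['python', 'train.py', '--exp_name', EXP_NAME,
                '--batch_size', '100', '--lr', '5e-4'], check=True)

# Stage 2: CIDEr + NW optimization with a lower learning rate (lr 5e-5)
subprocess.run(['python', 'train.py', '--exp_name', EXP_NAME,
                '--batch_size', '100', '--lr', '5e-5', '--sample_rl_nw'],
               check=True)
```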

[Figure: model architecture]

COCO Entities

If you want to use only the annotations of our COCO Entities dataset, you can download the annotation file coco_entities_release.json (~403 MB).

The annotation file contains a Python dictionary structured as follows:

coco_entities_release.json
 └── <id_image>
      └── <caption>
           └── 'det_sequences'
           └── 'noun_chunks'
           └── 'detections'
           └── 'split'

In detail, for each image-caption pair, we provide the following information:

  • det_sequences, which contains a list of detection classes associated with each word of the caption (for an exact match with the caption words, split the caption by spaces). None indicates words that are not part of a noun chunk, while _ indicates noun chunk words for which an association with a detection in the image was not possible.
  • noun_chunks, which is a list of tuples representing the noun chunks of the caption that are associated with a detection in the image. Each tuple is composed of two elements: the first is the noun chunk from the caption, while the second is the detection class associated with that noun chunk.
  • detections, which contains a dictionary with one entry for each detection class associated with at least one noun chunk in the caption. For each detection class, it provides a list of tuples representing the image regions detected by a Faster R-CNN re-trained on Visual Genome [1] that correspond to that detection class. Each tuple is composed of the detection id and the corresponding bounding box in the form [x1, y1, x2, y2]. The detection id can be used to recover the detection feature vector from the pre-computed features file coco_detections.hdf5 (~53.5 GB); see the demo section below and the sketch after this list for more details.
  • split, which indicates the dataset split of that sample (i.e. train, val or test), following the COCO splits provided by [2].

Note that this annotation file includes all image-caption pairs for which at least one noun chunk-detection association was found. However, during the validation and testing phases of our controllable captioning model, we dropped all captions with empty region sets (i.e. those captions with at least one _ in the det_sequences field).
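
To make the structure above concrete, here is a minimal sketch that loads the annotation file and walks one image-caption pair. It assumes only the JSON layout described in this section (in the loaded JSON, tuples appear as lists):

```python
# A minimal sketch that walks one image-caption pair of COCO Entities.
import json

with open('coco_entities_release.json') as f:
    entities = json.load(f)

image_id = next(iter(entities))            # pick an arbitrary image
caption = next(iter(entities[image_id]))   # and one of its captions
ann = entities[image_id][caption]

# Align each caption word with its detection class (split on spaces, as
# noted above): None = not part of a noun chunk, '_' = no detection match.
for word, det_class in zip(caption.split(' '), ann['det_sequences']):
    print(f'{word:>15} -> {det_class}')

print(ann['noun_chunks'])  # [[noun chunk, detection class], ...]
print(ann['split'])        # 'train', 'val' or 'test'

# Each detection class maps to [detection id, [x1, y1, x2, y2]] pairs;
# the detection id indexes the features in coco_detections.hdf5.
for det_class, regions in ann['detections'].items():
    for det_id, box in regions:
        print(det_class, det_id, box)
```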

[Figure: COCO Entities examples]

By downloading the dataset, you declare that you will use it for research and educational purposes only; any commercial use is prohibited.

Demo

An example of how to use the COCO Entities annotations can be found in the coco_entities_demo.ipynb file.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Contact

If you have any general questions about our work, please use the public issues section of this GitHub repo. Alternatively, drop us an e-mail at marcella.cornia [at] unimore.it or lorenzo.baraldi [at] unimore.it.
