Official Repository for ECCV 2020 paper "AiR: Attention with Reasoning Capability"
szzexpoi/AiR
Update 2021-6: We now provide a new hold-out testset of AiR-D for benchmarking task-driven saliency prediction and scanpath prediction!
This code implements the Attention with Reasoning capability (AiR) framework. It contains three principal components:
- AiR-E: a quantitative evaluation method for measuring the alignment between different attentions and the reasoning process,
- AiR-M: an attention supervision method for learning attention progressively throughout the reasoning process, and
- AiR-D: the first human eye-tracking dataset for the Visual Question Answering task.
An example for evaluating the human attentions with correct and incorrect answers is illustrated below.
If you use our code or data, please cite our paper:
@InProceedings{air,
  author = {Chen, Shi and Jiang, Ming and Yang, Jinhui and Zhao, Qi},
  title = {AiR: Attention with Reasoning Capability},
  booktitle = {ECCV},
  year = {2020}
}
We adopt the official implementation of the Bilinear Attention Network as a backbone model for attention supervision. We use the bottom-up features provided in this repository. For GRU with Bayesian Dropout (used in our UpDown reimplementation), we adopt the implementation from this repository. Please refer to these links for further README information.
- Requirements for PyTorch. We use PyTorch 1.2.0 in our experiments.
- Requirements for TensorFlow. We only use TensorBoard for visualization.
- Python 3.6+ (for most of our experiments)
- Python 2.7 (only for extracting features from the tsv files)
- You may need to install the OpenCV package (CV2) for Python.
- Download the GQA Dataset.
- Download the bottom-up features and unzip them.
- Extract features from the raw tsv files (Important: you need to run this code with Python 2):
python2 ./pre_processing/extract_tsv.py --input $TSV_FILE --output $FEATURE_DIR
- Generate our atomic operations abstracted from GQA annotations:
python ./pre_processing/process_semantics.py --question $GQA_ROOT/question --scene_graph $GQA_ROOT/scene_graph --mapping ./data --save ./data
- Generate ground truth attention mask for supervision:
python ./pre_processing/generate_att_mask.py --bbox_dir $FEATURE_DIR/box --question $GQA_ROOT/question --scene_graph $GQA_ROOT/scene_graph --semantics ./data --save ./AiR-M/processed_data
The AiR-E score can be computed with:
python ./AiR-E/attention_eval.py --image $GQA_ROOT/images --question $GQA_ROOT/question --scene_graph $GQA_ROOT/scene_graph --semantics ./data --att_dir $ATTENTION_DIR --bbox_dir $FEATURE_DIR/box --att_type ATTENTION_TYPE --save ./data
To evaluate the human attention in our dataset, set ATTENTION_TYPE to human and $ATTENTION_DIR to the directory storing the saliency maps. To evaluate machine attention, set ATTENTION_TYPE to spatial or object, and $ATTENTION_DIR to the file storing the attention. Our code assumes the machine attention is stored in a NumPy dictionary whose keys are question IDs (qid) and whose values are the attention maps. We use spatial attention with a spatial size of 7x7 and object-based attention with size 36.
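As a rough sketch of the storage format described above, the snippet below builds a small dictionary keyed by qid and writes it to a single file with NumPy. The file name `machine_att.npy`, the placeholder values, the normalization, and the way the dictionary is loaded back (`np.load(..., allow_pickle=True).item()`) are assumptions for illustration and are not taken from the evaluation code.

```python
import os
import tempfile

import numpy as np

# Sizes stated in the README: 7x7 spatial maps, 36 object regions.
SPATIAL_SHAPE = (7, 7)
NUM_OBJECTS = 36


def make_dummy_attention(qids, att_type="spatial", seed=0):
    """Build a {qid: attention map} dictionary filled with placeholder values."""
    rng = np.random.RandomState(seed)
    shape = SPATIAL_SHAPE if att_type == "spatial" else (NUM_OBJECTS,)
    att = {}
    for qid in qids:
        weights = rng.rand(*shape).astype(np.float32)
        # Normalizing to sum to 1 is an assumption, not a documented requirement.
        att[qid] = weights / weights.sum()
    return att


def save_attention(att, path):
    """Store the dictionary in one .npy file (NumPy pickles the dict object)."""
    np.save(path, att)


def load_attention(path):
    """Load the dictionary back; allow_pickle is needed for Python objects."""
    return np.load(path, allow_pickle=True).item()


if __name__ == "__main__":
    att = make_dummy_attention(["011000868"], att_type="object")
    # Hypothetical file name, written to a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "machine_att.npy")
    save_attention(att, path)
    print(load_attention(path)["011000868"].shape)
```

The resulting file path is what one would pass as $ATTENTION_DIR under this assumption; check attention_eval.py for the exact loading call before relying on it.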
The output is a JSON file storing the scores. You can access the scores for a specific question (for example, qid 011000868) as:
>>> import json
>>> data = json.load(open('./data/att_score.json'))
>>> data['011000868']
[['select', 1, '7.908'], ['relate', 2, '7.053'], ['query', 3, '6.197']]
For each step, the result contains the reasoning operation, level of dependency in the process, and the AiR-E score.
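Building on that structure, a minimal sketch of how the per-step scores could be summarized is shown below. The helper `mean_score_per_operation` is a hypothetical name, not part of the repository; it only assumes the `[operation, level, score]` layout shown in the example, with scores stored as strings.

```python
import json
from collections import defaultdict


def mean_score_per_operation(scores):
    """Average the AiR-E score of each reasoning operation over all questions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps in scores.values():
        for operation, _level, score in steps:
            totals[operation] += float(score)  # scores are stored as strings
            counts[operation] += 1
    return {op: totals[op] / counts[op] for op in totals}


if __name__ == "__main__":
    # Inline sample copied from the example above. In practice one would use
    # scores = json.load(open('./data/att_score.json'))
    sample = json.loads(
        '{"011000868": [["select", 1, "7.908"], '
        '["relate", 2, "7.053"], ["query", 3, "6.197"]]}'
    )
    print(mean_score_per_operation(sample))
```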
We provide the implementation of our method on two state-of-the-art VQA models, UpDown and BAN. Before training with our method, you first need to generate the dictionary for questions and answers:
cd ./AiR-E
python preprocess_lang.py --question $GQA_ROOT/question
Then the training process for UpDown can be called as:
python main_updown.py --mode train --anno_dir $GQA_ROOT/question --prep_dir ./processed_data --img_dir $FEATURE_DIR/features --checkpoint_dir $CHECKPOINT --ss_factor 2 --ss_step 1
We implement a scheduled sampling strategy for optimizing the operation prediction. However, it contributed little to performance, so we disabled it in our experiments with --ss_factor 2 and --ss_step 1. To enable it, replace the increase factor and step with appropriate values (e.g., --ss_factor 0.2 and --ss_step 3).
For training with BAN, you need to input the bounding box positions as well:
python main_ban.py --mode train --anno_dir $GQA_ROOT/question --prep_dir ./processed_data --img_dir $FEATURE_DIR/features --bbox_dir $FEATURE_DIR/box --checkpoint_dir $CHECKPOINT --ss_factor 2 --ss_step 1
Evaluating the performance on the GQA test-dev set:
python main_updown.py --mode eval --anno_dir $GQA_ROOT/question --prep_dir ./processed_data --img_dir $FEATURE_DIR/features --weights $CHECKPOINT/model_best.pth
Note that this will also generate the predicted attention, answers, and operations. To generate the data on the validation set (for comparison with human attention in our AiR-D dataset), modify line 27 of ./AiR-M/util/dataloader.py to load the annotations of the validation set.
To create a submission for the GQA online server (test-standard set), call:
python main_updown.py --mode submission --anno_dir $GQA_ROOT/question --prep_dir ./processed_data --img_dir $FEATURE_DIR/features --weights $CHECKPOINT/model_best.pth
Our data is available at https://drive.google.com/file/d/1_yWlv3GXYw0-qBan5pPEmV8vd-Y8Duh5/view?usp=sharing. We provide both the saliency maps with aggregated fixations throughout the reasoning process (aggregated_maps) and the saliency maps for different time steps (temporal_maps). Saliency maps for correct and incorrect answers are stored in separate folders (fixmaps_corr and fixmaps_incorr), and the saliency maps regardless of answer correctness are stored in fixmaps. The saliency maps for different time steps are placed in separate folders marked with their starting times (e.g., _1s for saliency maps of the 1-2 second interval).
We also provide the raw fixation sequences (scanpaths) for AiR-D, available at https://drive.google.com/file/d/1HFlX09kRB2lOa6qYihFjQ0Q8KSvxfPcZ/view?usp=sharing. The data contains both the answer responses from our participants and their fixation sequences. The responses are stored in consolidated_answers.json (nan means the participant did not attempt the question), together with the ground truth answers. The fixation sequences are stored in the fix folder, where each mat file corresponds to the sequence of a single (anonymized) participant on a specific question. The location and time interval (start time and end time) of each fixation are stored under the names xy and t, respectively.
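The sketch below shows one way the xy and t entries could be combined into per-fixation records. It assumes xy holds one (x, y) pair per fixation and t holds one (start, end) pair per fixation; the actual array layout in the mat files may differ (for example, it may be transposed), and `to_fixations` is a hypothetical helper rather than part of the released code.

```python
def to_fixations(xy, t):
    """Pair fixation locations with their time intervals.

    Assumes xy is a sequence of (x, y) pairs and t a sequence of
    (start, end) pairs of the same length. This layout is an assumption.
    """
    if len(xy) != len(t):
        raise ValueError("xy and t must describe the same number of fixations")
    fixations = []
    for (x, y), (start, end) in zip(xy, t):
        fixations.append({
            "x": float(x),
            "y": float(y),
            "start": float(start),
            "end": float(end),
            "duration": float(end) - float(start),
        })
    return fixations


if __name__ == "__main__":
    # Placeholder values, not real eye-tracking data.
    for fixation in to_fixations([[10, 20], [30, 40]], [[0.0, 0.25], [0.25, 1.0]]):
        print(fixation)
    # With SciPy installed, the arrays could be read from a mat file, e.g.:
    #   from scipy.io import loadmat
    #   mat = loadmat(path_to_mat_file)  # path_to_mat_file is a placeholder
    #   fixations = to_fixations(mat["xy"], mat["t"])
```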
A new computational model has been developed for task-driven scanpath prediction. Please refer to this repo for details.
We now provide a new hold-out testset for evaluating task-driven attention modeling. The testset consists of two evaluation splits: the general split, which is agnostic to answer correctness, and the correctness split, for evaluation with both correct and incorrect attention. For each sample in the testset, we provide its GQA IDs (question ID and image ID), the corresponding split in the GQA dataset, and the question-answer pair. The eye-tracking annotations are kept private and are used only for evaluation purposes.
To evaluate your models with the testset, please email your results to chen4595@umn.edu. The results should be in the following formats:
For saliency prediction, we accept results in the form of saliency maps stored as PNG files. Each result should be named based on the corresponding question ID in the GQA dataset. Results for different splits should be organized as:
<AiR-saliency>
  -- ./general_split # results for the general split
    201064885.png # saliency map for sample with qid 201064885
    201735202.png
    ...
  -- ./correctness_split # results for the correctness split
  -- ./correctness_split/correct # saliency maps for correct attention
    19209372.png
    16934027.png
    ...
  -- ./correctness_split/incorrect # saliency maps for incorrect attention
    19209372.png
    16934027.png
    ...
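The layout above can be reproduced with a small path helper such as the following sketch. The function name `saliency_map_path` and the `root` argument are illustrative assumptions; only the folder names and the qid-based file names come from the listing above.

```python
from pathlib import Path


def saliency_map_path(root, qid, split="general", correct=None):
    """Return the PNG path for one predicted saliency map.

    split is "general" or "correctness"; for the correctness split,
    correct selects the correct/ or incorrect/ subfolder.
    """
    root = Path(root)
    if split == "general":
        return root / "general_split" / f"{qid}.png"
    if split == "correctness":
        if correct is None:
            raise ValueError("correct must be True or False for the correctness split")
        subfolder = "correct" if correct else "incorrect"
        return root / "correctness_split" / subfolder / f"{qid}.png"
    raise ValueError(f"unknown split: {split}")


if __name__ == "__main__":
    # The root folder name here is a placeholder.
    print(saliency_map_path("AiR-saliency", 201064885))
    print(saliency_map_path("AiR-saliency", 19209372, "correctness", correct=False))
```

Creating the parent folder with `path.parent.mkdir(parents=True, exist_ok=True)` before writing each PNG keeps the directory tree consistent.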
For scanpath prediction, we accept results in the form of JSON files. Please submit a single file for the general split (i.e., air_scanpath_general.json), and two separate files for the correctness split (i.e., air_scanpath_correct.json and air_scanpath_incorrect.json). Our evaluation is carried out on 10 generated scanpaths for each sample, and assumes scanpaths for images with a width of 320 and a height of 240. Each scanpath should include the information for a sequence of fixations, including their x-coordinates, y-coordinates, and durations. An example of the submission file is included here, where the question IDs are used as keys.
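Before submitting, it may help to sanity-check a scanpath file against the constraints stated above. The sketch below is one possible check, not the official evaluation code. The field names "X", "Y", and "T" are hypothetical placeholders for the x-coordinate, y-coordinate, and duration lists; the linked example file defines the actual schema and should be followed instead.

```python
import json

WIDTH, HEIGHT = 320, 240  # image size assumed by the evaluation
NUM_SCANPATHS = 10        # scanpaths expected per sample


def check_submission(results):
    """Return a list of problems found in a {qid: [scanpath, ...]} dictionary."""
    problems = []
    for qid, scanpaths in results.items():
        if len(scanpaths) != NUM_SCANPATHS:
            problems.append(
                f"{qid}: expected {NUM_SCANPATHS} scanpaths, got {len(scanpaths)}"
            )
        for idx, scanpath in enumerate(scanpaths):
            # Hypothetical field names; adapt them to the example file.
            xs, ys, ts = scanpath["X"], scanpath["Y"], scanpath["T"]
            if not (len(xs) == len(ys) == len(ts)):
                problems.append(f"{qid}[{idx}]: X, Y and T differ in length")
                continue
            in_bounds = all(
                0 <= x <= WIDTH and 0 <= y <= HEIGHT for x, y in zip(xs, ys)
            )
            if not in_bounds:
                problems.append(f"{qid}[{idx}]: fixation outside the {WIDTH}x{HEIGHT} image")
            if any(duration <= 0 for duration in ts):
                problems.append(f"{qid}[{idx}]: non-positive duration")
    return problems


if __name__ == "__main__":
    # Placeholder scanpath repeated 10 times for a single question ID.
    scanpath = {"X": [160.0, 200.5], "Y": [120.0, 80.0], "T": [0.3, 0.25]}
    results = {"201064885": [scanpath] * NUM_SCANPATHS}
    print(check_submission(results))
    print(json.dumps(results)[:60])  # the dictionary serializes directly to JSON
```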