TAT-QA (Tabular And Textual dataset for Question Answering) contains 16,552 questions associated with 2,757 hybrid contexts from real-world financial reports.

You can download our TAT-QA dataset via the TAT-QA dataset page.

For more information, please refer to our TAT-QA website or read our ACL 2021 paper PDF.
Create an environment with MiniConda and activate it:

```bash
conda create -n tat-qa python==3.7
conda activate tat-qa
pip install -r requirement.txt
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+${CUDA}.html
```

We adopt RoBERTa as our encoder to develop our TagOp and use the following commands to prepare the RoBERTa model:

```bash
cd dataset_tagop
mkdir roberta.large && cd roberta.large
wget -O pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin
wget -O config.json https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json
wget -O vocab.json https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json
wget -O merges.txt https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt
```
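As a quick sanity check that the files downloaded correctly, they can be loaded with the `transformers` library. This snippet is not part of the pipeline; it assumes `transformers` was installed via `requirement.txt` and that the files live under `dataset_tagop/roberta.large` as created by the commands above.

```python
from transformers import RobertaModel, RobertaTokenizer

# Load the encoder from the local directory populated by the wget commands above.
tokenizer = RobertaTokenizer.from_pretrained("dataset_tagop/roberta.large")
model = RobertaModel.from_pretrained("dataset_tagop/roberta.large")

# Run one short example through the encoder and check the hidden-state shape.
inputs = tokenizer("TAT-QA is a hybrid QA benchmark.", return_tensors="pt")
hidden_states = model(**inputs)[0]
print(hidden_states.shape)  # (1, sequence_length, 1024) for roberta-large
```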
We heuristically generate the "facts" and "mapping" fields based on the raw dataset; the resulting files are stored under the folder `dataset_tagop`.

```bash
PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/prepare_dataset.py --mode [train/dev/test]
```

Note: The result will be written into the folder `./tag_op/cache` by default.
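To get a feel for the heuristically generated "facts" and "mapping" fields, you can inspect the processed JSON directly. The file name and field layout below are assumptions based on the description above, so adjust the path to match your local copy.

```python
import json

# Inspect one hybrid context from the TagOp-preprocessed data
# (path and field names are assumptions; adjust to your local files).
with open("dataset_tagop/tatqa_dataset_train.json") as f:
    data = json.load(f)

block = data[0]                    # one hybrid context: a table plus its paragraphs
question = block["questions"][0]   # the questions attached to that context
print(sorted(question.keys()))     # should include the generated "facts" and "mapping"
```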
```bash
CUDA_VISIBLE_DEVICES=2 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/trainer.py --data_dir tag_op/cache/ \
--save_dir ./checkpoint --batch_size 48 --eval_batch_size 8 --max_epoch 50 --warmup 0.06 --optimizer adam --learning_rate 5e-4 \
--weight_decay 5e-5 --seed 123 --gradient_accumulation_steps 4 --bert_learning_rate 1.5e-5 --bert_weight_decay 0.01 \
--log_per_updates 50 --eps 1e-6 --encoder roberta
```
```bash
CUDA_VISIBLE_DEVICES=2 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/predictor.py --data_dir tag_op/cache/ --test_data_dir tag_op/cache/ \
--save_dir tag_op/ --eval_batch_size 32 --model_path ./checkpoint --encoder roberta
```
Note: The training process may take around 2 days using a single 32GB V100 GPU.

You may download a checkpoint of the trained TagOp model via TagOp Checkpoint.
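Once prediction finishes, the output can be inspected directly. The sketch below assumes the predictor writes a JSON file named `answer_dev.json` under `--save_dir`, mapping each question uid to a `[predicted answer, predicted scale]` pair; both the file name and the layout are assumptions, so adjust them to whatever appears in your `save_dir`.

```python
import json

# Peek at the predictions written by tag_op/predictor.py.
# File name and structure are assumptions; check your --save_dir.
with open("tag_op/answer_dev.json") as f:
    predictions = json.load(f)

# Print a handful of (question uid, answer, scale) triples.
for uid, (answer, scale) in list(predictions.items())[:5]:
    print(f"{uid}: answer={answer!r}, scale={scale!r}")
```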
Please kindly cite our work if you use our dataset or code, thank you.

```bibtex
@inproceedings{zhu-etal-2021-tat,
    title = "{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance",
    author = "Zhu, Fengbin and Lei, Wenqiang and Huang, Youcheng and Wang, Chao and Zhang, Shuo and Lv, Jiancheng and Feng, Fuli and Chua, Tat-Seng",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.254",
    doi = "10.18653/v1/2021.acl-long.254",
    pages = "3277--3287"
}
```

The TAT-QA dataset is under the license of Creative Commons (CC BY) Attribution 4.0 International.
For any issues, please create an issue here or kindly drop an email to the author: Fengbin Zhu (zhfengbin@gmail.com), thank you.