Source code for "Denoising Table-Text Retrieval for Open-Domain Question Answering" (LREC-COLING 2024)
We provide a script to create a conda environment with all the required packages. Make sure conda is installed on your system. Then, run the following commands to create the environment.
```bash
conda create -n dotter python=3.10
conda activate dotter
sh create_env.sh
```
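As a quick sanity check that the environment is usable (a minimal sketch; it assumes `create_env.sh` installs PyTorch, which the training steps below require):

```bash
conda activate dotter
# Should print the PyTorch version, and True on a GPU machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```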
This codebase is built upon OTTeR. We follow the same data preprocessing steps as OTTeR, and the instructions below are mostly taken from OTTeR's README. For the rest of the README, we assume you are at the root of the repository unless a `cd` command says otherwise.
```bash
mkdir data_wikitable
mkdir data_ottqa
git clone https://github.com/wenhuchen/OTT-QA.git
cp OTT-QA/released_data/* ./data_ottqa
```
```bash
cd data_wikitable/
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_plain_tables.json
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_passages.json
cd ../
```
Download OTTeR's processed linked passages from all_constructed_blink_tables.json. Then unzip it with `gunzip` and move the JSON file to `./data_wikitable`.
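For example, assuming the downloaded archive is named `all_constructed_blink_tables.json.gz` and sits in the repository root:

```bash
# Unpack the linked-passage file and move it into place
gunzip all_constructed_blink_tables.json.gz
mv all_constructed_blink_tables.json ./data_wikitable/
```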
To denoise the OTT-QA dataset, we need to train a false-positive removal model. Run the commands below to prepare the data for training the model.
```bash
mkdir ./preprocessed_data/
mkdir ./preprocessed_data/false_positive_removal
mkdir ./model/
mkdir ./model/trained_models
cd ./preprocessing
python false_positive_removal_preprocess.py --split train --nega intable_bm25 --aug_blink
python false_positive_removal_preprocess.py --split dev --aug_blink
```
This will create `train_intable_bm25_blink_false_positive_removal.pkl` and `dev__blink_false_positive_removal.pkl` in `./preprocessed_data/false_positive_removal`. Let the path of the former be `TRAIN_FILE` and the latter be `DEV_FILE`.
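For example, from the repository root:

```bash
# Paths as produced by false_positive_removal_preprocess.py above
export TRAIN_FILE=./preprocessed_data/false_positive_removal/train_intable_bm25_blink_false_positive_removal.pkl
export DEV_FILE=./preprocessed_data/false_positive_removal/dev__blink_false_positive_removal.pkl
```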
Then, train the false-positive removal model with the following command.
```bash
#!/bin/bash
NUM_GPUS=2
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=${NUM_GPUS} ./scripts/train_false_positive_removal.py \
    --train_file ${TRAIN_FILE} \
    --dev_file ${DEV_FILE} \
    --seed 42 \
    --effective_batch_size 32 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --learning_rate 2e-5 \
    --num_epochs 5 \
    --model_name_or_path bert-base-cased \
    --do_train_and_eval \
    --logging_steps 10 \
    --output_dir "./model/trained_models/false_positive_removal"
```
This will save the best model to `./model/trained_models/false_positive_removal/best_model`. Let the path of the best model be `MODEL_PATH`.
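For example:

```bash
export MODEL_PATH=./model/trained_models/false_positive_removal/best_model
```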
Next, build the denoised retrieval training data with the trained false-positive removal model:

```bash
mkdir ./preprocessed_data/retrieval
cd ./preprocessing
CUDA_VISIBLE_DEVICES=0 python retriever_preprocess.py --split train --nega intable_contra --aug_blink --denoise --denoise_model_path ${MODEL_PATH}
CUDA_VISIBLE_DEVICES=1 python retriever_preprocess.py --split dev --nega intable_contra --aug_blink --denoise --denoise_model_path ${MODEL_PATH}
```
This will create `train_intable_contra_blink_row_denoise.pkl` and `dev_intable_contra_blink_row_denoise.pkl` in `./preprocessed_data/retrieval`. We denote the path of the former as `DENOISED_TRAIN_FILE` and the latter as `DENOISED_DEV_FILE`.
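For example:

```bash
export DENOISED_TRAIN_FILE=./preprocessed_data/retrieval/train_intable_contra_blink_row_denoise.pkl
export DENOISED_DEV_FILE=./preprocessed_data/retrieval/dev_intable_contra_blink_row_denoise.pkl
```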
Train the rank-aware column encoder (RATE) with the following command:

```bash
python -m scripts.train_RATE \
    --num_train_steps 60000 \
    --evaluation_steps 1000 \
    --logging_steps 20 \
    --batch_size 32 \
    --evaluation_batch_size 128 \
    --wikitable_path ${WIKITABLE_PATH} \
    --output_dir ${OUTPUT_DIR}
```
Here, `WIKITABLE_PATH` is the path to `./data_wikitable/all_plain_tables.json`, and `OUTPUT_DIR` is the output directory. We recommend setting `OUTPUT_DIR` to the absolute path of `./model/trained_models/RATE`.
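For example (`realpath` is used here only to turn the recommended directory into an absolute path):

```bash
export WIKITABLE_PATH=./data_wikitable/all_plain_tables.json
export OUTPUT_DIR=$(realpath ./model/trained_models/RATE)  # absolute path, as recommended
```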
This will save the best model to `./model/trained_models/RATE/best_checkpoint`. Let the path of the best model be `RATE_MODEL_PATH`.
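For example:

```bash
export RATE_MODEL_PATH=./model/trained_models/RATE/best_checkpoint
```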
We initialize the encoder with the mixed-modality synthetic pretrained checkpoint from OTTeR. Download the checkpoint from here.
```bash
unzip -d ./checkpoint-pretrain checkpoint-pretrain.zip
```
Then, move `./checkpoint-pretrain` to `./model/`.
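For example:

```bash
mv ./checkpoint-pretrain ./model/
```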
We provide a shell script to train the DoTTeR model. Before running the script, you need to specify the path to the preprocessed data and the path to the RATE model.
```bash
sh train_dotter.sh
```
This will save the best model as `checkpoint_best.pt` in `RT_MODEL_PATH`.
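In the encoding and evaluation steps below, `RT_MODEL_PATH` is assumed to be `./model/trained_models/dotter` (matching the exports in those scripts); adjust this if your `train_dotter.sh` is configured to write elsewhere:

```bash
export RT_MODEL_PATH=./model/trained_models/dotter
```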
Next, build the table corpus (fused blocks) for retrieval:

```bash
cd ./preprocessing
python corpus_preprocess.py
```
This will create `table_corpus_blink.pkl` in `./preprocessed_data/retrieval`.
We first encode the OTT-QA dev set, and then the table corpus (fused blocks), with the trained DoTTeR model.
```bash
export BASIC_PATH="."
export RATE_MODEL_PATH=${BASIC_PATH}/model/trained_models/RATE/best_checkpoint
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter

python -m scripts.encode_corpus \
    --do_predict \
    --predict_batch_size 100 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/dev.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 8

export DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval
export TABLE_CORPUS=table_corpus_blink

python -m scripts.encode_corpus \
    --do_predict \
    --encode_table \
    --shared_encoder \
    --predict_batch_size 800 \
    --model_name roberta-base \
    --predict_file ${DATA_PATH}/${TABLE_CORPUS}.pkl \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS} \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 24
```
Table recall can be evaluated with the following command.
```bash
python -m scripts.eval_ottqa_retrieval \
    --raw_data_path ${BASIC_PATH}/data_ottqa/dev.json \
    --eval_only_ans \
    --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev.npy \
    --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
    --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
    --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
    --beam_size 100
```
This will save the retrieval results to `dev_output_k100_${TABLE_CORPUS}.json` in `${RT_MODEL_PATH}/indexed_embeddings`.
Block recall can be evaluated with the following command, after evaluating table recall.
```bash
python -m scripts.eval_block_recall \
    --split dev \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json
```
This step will prepare the QA dev data from the retrieval outputs. We use the top 15 table-text blocks (fused blocks) for QA.
```bash
export CONCAT_TBS=15

python -m preprocessing.qa_preprocess \
    --split dev \
    --topk_tbs ${CONCAT_TBS} \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
    --qa_save_path ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json
```
This step will find the top 15 table-text blocks for each question in the training set using DoTTeR, and prepare the training data for the QA model.
```bash
export BASIC_PATH="."
export RATE_MODEL_PATH=${BASIC_PATH}/model/trained_models/RATE/best_checkpoint
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter
export TABLE_CORPUS=table_corpus_blink
export CONCAT_TBS=15

python -m scripts.encode_corpus \
    --do_predict \
    --predict_batch_size 100 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/train.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_train \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 16

python -m scripts.eval_ottqa_retrieval \
    --raw_data_path ${BASIC_PATH}/data_ottqa/train.json \
    --eval_only_ans \
    --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_train.npy \
    --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
    --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
    --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
    --beam_size 100

python -m preprocessing.qa_preprocess \
    --split train \
    --topk_tbs ${CONCAT_TBS} \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
    --qa_save_path ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json
```
We use the same training script from OTTeR to train the QA model.
```bash
export BASIC_PATH="."
export TABLE_CORPUS=table_corpus_blink
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/trained_models/qa_longformer_${TOPK}/dotter
export CONCAT_TBS=15
export SEED=42
export EXP_NAME=dotter_qa
mkdir -p ${QA_MODEL_PATH}

python -m scripts.train_final_qa \
    --do_train \
    --do_eval \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --evaluate_during_training \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --train_file ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --dev_file ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --max_seq_length 4096 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    --seed ${SEED} \
    --run_name ${EXP_NAME} \
    --eval_steps 2000
```
This script does not support setting an effective batch size directly; instead, set the per-GPU batch size and the number of GPUs. The example above uses a per-GPU batch size of 4 on 4 GPUs, i.e., an effective batch size of 16.
Run QA inference on the dev set with the following command:

```bash
export PREDICT_OUT=dotter_qa_dev_result
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/trained_models/qa_longformer_${TOPK}/dotter

python -m scripts.train_final_qa \
    --do_predict \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --predict_file ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --predict_output_file ${PREDICT_OUT}.json \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    --threads 4
```
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{kang-etal-2024-denoising-table,
    title = "Denoising Table-Text Retrieval for Open-Domain Question Answering",
    author = "Kang, Deokhyung and
      Jung, Baikjin and
      Kim, Yunsu and
      Lee, Gary Geunbae",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.414",
    pages = "4634--4640",
    abstract = "In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.",
}
```
This codebase is built upon the codebase from OTTeR. We thank the authors for open-sourcing it.