Source code for "Denoising Table-Text Retrieval for Open-Domain Question Answering" (LREC-COLING 2024)
We provide a script to create a conda environment with all the required packages. Make sure conda is installed on your system. Then, run the following commands to create the environment.
```bash
conda create -n dotter python=3.10
conda activate dotter
sh create_env.sh
```
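As a quick sanity check that the environment is usable (a minimal sketch; it assumes `create_env.sh` installs PyTorch, which the training steps below require):

```bash
conda activate dotter
# Should print the PyTorch version, and True on a GPU machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```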
This codebase is built upon OTTeR. We follow the same data preprocessing steps as OTTeR, and the instructions below are mostly taken from OTTeR's README. For the rest of the README, we assume you are at the root of the repository unless a `cd` command says otherwise.
```bash
mkdir data_wikitable
mkdir data_ottqa
git clone https://github.com/wenhuchen/OTT-QA.git
cp OTT-QA/released_data/* ./data_ottqa
```
```bash
cd data_wikitable/
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_plain_tables.json
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_passages.json
cd ../
```
Download OTTeR's processed linked passages from all_constructed_blink_tables.json. Then unzip it with `gunzip` and move the JSON file to `./data_wikitable`.
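For example, assuming the downloaded archive is named `all_constructed_blink_tables.json.gz` and sits in the repository root:

```bash
# Unpack the linked-passage file and move it into place
gunzip all_constructed_blink_tables.json.gz
mv all_constructed_blink_tables.json ./data_wikitable/
```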
To denoise the OTT-QA dataset, we need to train a false-positive removal model. Run the commands below to prepare the data for training the model.
```bash
mkdir ./preprocessed_data/
mkdir ./preprocessed_data/false_positive_removal
mkdir ./model/
mkdir ./model/trained_models
cd ./preprocessing
python false_positive_removal_preprocess.py --split train --nega intable_bm25 --aug_blink
python false_positive_removal_preprocess.py --split dev --aug_blink
```
This will create `train_intable_bm25_blink_false_positive_removal.pkl` and `dev__blink_false_positive_removal.pkl` in `./preprocessed_data/false_positive_removal`. Let the path of the former be `TRAIN_FILE` and the latter be `DEV_FILE`.
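For example, from the repository root:

```bash
# Paths as produced by false_positive_removal_preprocess.py above
export TRAIN_FILE=./preprocessed_data/false_positive_removal/train_intable_bm25_blink_false_positive_removal.pkl
export DEV_FILE=./preprocessed_data/false_positive_removal/dev__blink_false_positive_removal.pkl
```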
Then, train the false-positive removal model with the following command.
```bash
#!/bin/bash
NUM_GPUS=2
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=${NUM_GPUS} ./scripts/train_false_positive_removal.py \
    --train_file ${TRAIN_FILE} \
    --dev_file ${DEV_FILE} \
    --seed 42 \
    --effective_batch_size 32 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --learning_rate 2e-5 \
    --num_epochs 5 \
    --model_name_or_path bert-base-cased \
    --do_train_and_eval \
    --logging_steps 10 \
    --output_dir "./model/trained_models/false_positive_removal"
```
This will save the best model to `./model/trained_models/false_positive_removal/best_model`. Let the path of the best model be `MODEL_PATH`.
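For example:

```bash
export MODEL_PATH=./model/trained_models/false_positive_removal/best_model
```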
Next, build the denoised retrieval training data with the trained false-positive removal model:

```bash
mkdir ./preprocessed_data/retrieval
cd ./preprocessing
CUDA_VISIBLE_DEVICES=0 python retriever_preprocess.py --split train --nega intable_contra --aug_blink --denoise --denoise_model_path ${MODEL_PATH}
CUDA_VISIBLE_DEVICES=1 python retriever_preprocess.py --split dev --nega intable_contra --aug_blink --denoise --denoise_model_path ${MODEL_PATH}
```
This will create `train_intable_contra_blink_row_denoise.pkl` and `dev_intable_contra_blink_row_denoise.pkl` in `./preprocessed_data/retrieval`. We denote the path of the former as `DENOISED_TRAIN_FILE` and the latter as `DENOISED_DEV_FILE`.
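For example:

```bash
export DENOISED_TRAIN_FILE=./preprocessed_data/retrieval/train_intable_contra_blink_row_denoise.pkl
export DENOISED_DEV_FILE=./preprocessed_data/retrieval/dev_intable_contra_blink_row_denoise.pkl
```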
Train the rank-aware column encoder (RATE) with the following command:

```bash
python -m scripts.train_RATE \
    --num_train_steps 60000 \
    --evaluation_steps 1000 \
    --logging_steps 20 \
    --batch_size 32 \
    --evaluation_batch_size 128 \
    --wikitable_path ${WIKITABLE_PATH} \
    --output_dir ${OUTPUT_DIR}
```
Here, `WIKITABLE_PATH` is the path to `./data_wikitable/all_plain_tables.json`, and `OUTPUT_DIR` is the output directory. We recommend setting `OUTPUT_DIR` to the absolute path of `./model/trained_models/RATE`.
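For example (`realpath` is used here only to turn the recommended directory into an absolute path):

```bash
export WIKITABLE_PATH=./data_wikitable/all_plain_tables.json
export OUTPUT_DIR=$(realpath ./model/trained_models/RATE)  # absolute path, as recommended
```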
This will save the best model to `./model/trained_models/RATE/best_checkpoint`. Let the path of the best model be `RATE_MODEL_PATH`.
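For example:

```bash
export RATE_MODEL_PATH=./model/trained_models/RATE/best_checkpoint
```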
We initialize the encoder with the mixed-modality synthetic pretrained checkpoint from OTTeR. Download the checkpoint from here.
```bash
unzip -d ./checkpoint-pretrain checkpoint-pretrain.zip
```
Then, move `./checkpoint-pretrain` to `./model/`.
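For example:

```bash
mv ./checkpoint-pretrain ./model/
```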
We provide a shell script to train the DoTTeR model. Before running the script, you need to specify the path to the preprocessed data and the path to the RATE model.
```bash
sh train_dotter.sh
```
This will save the best model as `checkpoint_best.pt` in `RT_MODEL_PATH`.
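In the encoding and evaluation steps below, `RT_MODEL_PATH` is assumed to be `./model/trained_models/dotter` (matching the exports in those scripts); adjust this if your `train_dotter.sh` is configured to write elsewhere:

```bash
export RT_MODEL_PATH=./model/trained_models/dotter
```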
Next, build the table corpus (fused blocks) for retrieval:

```bash
cd ./preprocessing
python corpus_preprocess.py
```
This will create `table_corpus_blink.pkl` in `./preprocessed_data/retrieval`.
We first encode the OTT-QA dev set, and then the table corpus (fused blocks), with the trained DoTTeR model.
```bash
export BASIC_PATH="."
export RATE_MODEL_PATH=${BASIC_PATH}/model/trained_models/RATE/best_checkpoint
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter

python -m scripts.encode_corpus \
    --do_predict \
    --predict_batch_size 100 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/dev.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 8

export DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval
export TABLE_CORPUS=table_corpus_blink

python -m scripts.encode_corpus \
    --do_predict \
    --encode_table \
    --shared_encoder \
    --predict_batch_size 800 \
    --model_name roberta-base \
    --predict_file ${DATA_PATH}/${TABLE_CORPUS}.pkl \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS} \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 24
```
Table recall can be evaluated with the following command.
```bash
python -m scripts.eval_ottqa_retrieval \
    --raw_data_path ${BASIC_PATH}/data_ottqa/dev.json \
    --eval_only_ans \
    --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev.npy \
    --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
    --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
    --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
    --beam_size 100
```
This will save the retrieval results to `dev_output_k100_${TABLE_CORPUS}.json` in `${RT_MODEL_PATH}/indexed_embeddings`.
Block recall can be evaluated with the following command, after evaluating table recall.
```bash
python -m scripts.eval_block_recall \
    --split dev \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json
```
This step will prepare the QA dev data from the retrieval outputs. We use the top 15 table-text blocks (fused blocks) for QA.
```bash
export CONCAT_TBS=15

python -m preprocessing.qa_preprocess \
    --split dev \
    --topk_tbs ${CONCAT_TBS} \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
    --qa_save_path ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json
```
This step will find the top 15 table-text blocks for each question in the training set using DoTTeR, and prepare the training data for the QA model.
```bash
export BASIC_PATH="."
export RATE_MODEL_PATH=${BASIC_PATH}/model/trained_models/RATE/best_checkpoint
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter
export TABLE_CORPUS=table_corpus_blink
export CONCAT_TBS=15

python -m scripts.encode_corpus \
    --do_predict \
    --predict_batch_size 100 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/train.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_train \
    --inject_summary \
    --injection_scheme "column" \
    --rate_model_path ${RATE_MODEL_PATH} \
    --normalize_summary_table \
    --max_c_len 512 \
    --num_workers 16

python -m scripts.eval_ottqa_retrieval \
    --raw_data_path ${BASIC_PATH}/data_ottqa/train.json \
    --eval_only_ans \
    --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_train.npy \
    --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
    --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
    --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
    --beam_size 100

python -m preprocessing.qa_preprocess \
    --split train \
    --topk_tbs ${CONCAT_TBS} \
    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
    --qa_save_path ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json
```
We use the same training script from OTTeR to train the QA model.
```bash
export BASIC_PATH="."
export TABLE_CORPUS=table_corpus_blink
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export RT_MODEL_PATH=${BASIC_PATH}/model/trained_models/dotter
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/trained_models/qa_longformer_${TOPK}/dotter
export CONCAT_TBS=15
export SEED=42
export EXP_NAME=dotter_qa
mkdir -p ${QA_MODEL_PATH}

python -m scripts.train_final_qa \
    --do_train \
    --do_eval \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --evaluate_during_training \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --train_file ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --dev_file ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --max_seq_length 4096 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    --seed ${SEED} \
    --run_name ${EXP_NAME} \
    --eval_steps 2000
```
This script does not support setting an effective batch size directly; instead, set the per-GPU batch size and the number of GPUs. The example above uses a per-GPU batch size of 4 on 4 GPUs, i.e., an effective batch size of 16.
Run QA inference on the dev set with the following command:

```bash
export PREDICT_OUT=dotter_qa_dev_result
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/trained_models/qa_longformer_${TOPK}/dotter

python -m scripts.train_final_qa \
    --do_predict \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --predict_file ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --predict_output_file ${PREDICT_OUT}.json \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    --threads 4
```
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{kang-etal-2024-denoising-table,
    title = "Denoising Table-Text Retrieval for Open-Domain Question Answering",
    author = "Kang, Deokhyung and
      Jung, Baikjin and
      Kim, Yunsu and
      Lee, Gary Geunbae",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.414",
    pages = "4634--4640",
    abstract = "In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.",
}
```
This codebase is built upon the codebase from OTTeR. We thank the authors for open-sourcing it.