Code for the paper [Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models](https://arxiv.org/abs/2406.14848).
Figure 1: Comparison between RankGPT (upper) and PE-Rank (lower). RankGPT takes whole passages as input and outputs ordered numbers, while PE-Rank takes a list of special tokens as both input and output. On the right, we show reranking results on DL19 using different forms of input.
To get started, clone this repository:

```bash
git clone git@github.com:liuqi6777/pe_rank.git
```
The checkpoint of the PE-Rank model is available at this link: PE-Rank.
We provide scripts for first-stage retrieval. For example, you can use the following command to run BM25 as the retrieval model:

```bash
python src/scripts/run_evaluation.py --dataset dl19 --retriever bm25 --topk 100
```

This command will run BM25 retrieval on the DL19 dataset and save the retrieval results to `results/retrieval_results/bm25/dl19_top100.jsonl`.
As an alternative, we also provide all the retrieval results at this link: https://huggingface.co/liuqi6777/pyserini_retrieval_results. You can download the retrieval results to the `results/retrieval_results` folder.
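To sanity-check a generated or downloaded retrieval file, you can inspect it with a few lines of Python. This is a generic JSONL inspection sketch, not part of the repo's scripts; it prints each record's keys rather than assuming a particular schema:

```python
# Generic JSONL inspection sketch (not part of this repo's scripts).
import json

path = "results/retrieval_results/bm25/dl19_top100.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.keys())  # inspect the schema rather than assuming field names
        if i == 2:            # look at the first few records only
            break
```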
To run the reranking stage, you can use the following command:
```bash
python src/evaluate.py --datasets dl19 --model-path liuqi6777/pe-rank-mistral-jina --retriever bm25 --topk 100
```
The reranking results will be saved to `results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt`, and you can use the following command to compute the evaluation metrics:

```bash
python src/scripts/trec_eval.py --dataset dl19 --ranking results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt
```
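If you prefer to compute metrics in your own code, the run file can also be scored with the `pytrec_eval` package. The sketch below assumes the run file is in the standard six-column TREC format (which `trec_eval` consumes) and that you have a TREC-format qrels file locally; the qrels path is a placeholder, not something shipped with this repo:

```python
# Hedged sketch: score a TREC-format run file with pytrec_eval.
# Assumptions: standard TREC run/qrels formats; the qrels path is a placeholder.
import pytrec_eval

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels("qrels.dl19-passage.txt")  # placeholder path
run = load_run("results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt")

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = evaluator.evaluate(run)
print(sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query))
```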
For other datasets or other retrieval models, just replace the `--datasets` and `--retriever` arguments.
Coming soon.
If you want to train the PE-Rank model from scratch or with customized settings, you can follow the instructions below.
All datasets used in the paper are available at this link: pe_rank_data. Please download the data to the `data` folder.
For example, you can run the following command:
```bash
git clone git@hf.co:datasets/liuqi6777/pe_rank_data ./data
```
You can refer to the paper for more details about the datasets.
To run the alignment stage, you can use the following command:
deepspeed --include="localhost:0,1,2,3" src/train.py \ --deepspeed scripts/zero2.json \ --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \ --data_path ./data/wiki2m.jsonl \ --encoder_name jinaai/jina-embeddings-v2-base-en \ --encoder_pooling mean \ --projector_type mlp2x_gelu \ --freeze_backbone \ --tune_mlp_adapter \ --bf16 \ --output_dir ./checkpoints/mistral.jina.projector \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 1 \ --evaluation_strategy"no" \ --save_strategy"steps" \ --save_steps 1000 \ --save_total_limit 1 \ --learning_rate 1e-3 \ --warmup_ratio 0.03 \ --lr_scheduler_type"cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 512 \ --gradient_checkpointing \ --attn_implementation flash_attention_2 \ --dataloader_num_workers 4
This command will run the alignment stage with Mistral-7B-Instruct as the backbone and Jina-Embeddings as the encoder; the backbone is frozen (`--freeze_backbone`) and only the MLP projector is trained (`--tune_mlp_adapter`).
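The `--projector_type mlp2x_gelu` flag names a two-layer MLP with a GELU activation that maps encoder embeddings into the LLM's hidden space. The sketch below only illustrates that shape; it is not the repo's module, and the dimensions are simply the published sizes of the two models (768 for jina-embeddings-v2-base-en, 4096 for Mistral-7B):

```python
# Illustrative sketch of an "mlp2x_gelu"-style projector (assumption: follows
# the common two-layer Linear-GELU-Linear pattern; not this repo's exact code).
import torch.nn as nn

ENCODER_DIM = 768  # jina-embeddings-v2-base-en embedding size
LLM_DIM = 4096     # Mistral-7B hidden size

projector = nn.Sequential(
    nn.Linear(ENCODER_DIM, LLM_DIM),  # lift passage embeddings into the LLM space
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)
```

Since the backbone is frozen during alignment, only a projector of this kind is updated, which is what lets the LLM later consume a single embedding per passage instead of the full passage text.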
To run the learning-to-rank stage, you can use the following command:
deepspeed --include="localhost:4,5,6,7" --master_port="29700" src/train.py \ --deepspeed ./scripts/zero2.json \ --model_type rank_lm \ --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \ --data_path ./data/train.jsonl \ --use_embedding_with_content True \ --use_embedding_without_content True \ --kl_loss_weight 0.2 \ --loss1_weight 1 \ --loss2_weight 1 \ --encoder_name jinaai/jina-embeddings-v2-base-en \ --encoder_pooling mean \ --pretrain_mlp_adapter ./checkpoints/mistral.jina.projector/projector.bin \ --projector_type mlp2x_gelu \ --tune_mlp_adapter \ --bf16 True \ --tf32 True \ --output_dir"./checkpoints/pe-rank-mistral-jina" \ --overwrite_output_dir \ --num_train_epochs 1 \ --per_device_train_batch_size 4 \ --gradient_accumulation_steps 2 \ --save_strategy"steps" \ --save_steps 3000 \ --save_total_limit 2 \ --learning_rate 2e-5 \ --warmup_ratio 0.03 \ --lr_scheduler_type"cosine" \ --logging_steps 1 \ --model_max_length 4096 \ --gradient_checkpointing True \ --attn_implementation flash_attention_2 \ --dataloader_num_workers 2
This command will run the full learning-to-rank stage, initializing the projector from the weights saved by the alignment stage (`--pretrain_mlp_adapter`).
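The three loss-weight flags suggest that the training objective combines two supervised losses and a KL term linearly. The snippet below is only a hedged reading of those flags for orientation; the actual loss definitions live in `src/train.py` and the paper:

```python
# Hedged reading of the loss-weight flags above (illustrative, not the repo's code).
def combined_loss(loss1, loss2, kl_loss,
                  loss1_weight=1.0, loss2_weight=1.0, kl_loss_weight=0.2):
    return loss1_weight * loss1 + loss2_weight * loss2 + kl_loss_weight * kl_loss
```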
If you find this work helpful, please cite our paper:

```bibtex
@article{liu2024leveraging,
  title={Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models},
  author={Liu, Qi and Wang, Bo and Wang, Nan and Mao, Jiaxin},
  journal={arXiv preprint arXiv:2406.14848},
  year={2024}
}
```