Code for the paper Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models.
Figure 1: Comparison between RankGPT (upper) and PE-Rank (lower). RankGPT takes the whole passages as input and outputs ordered numbers, while PE-Rank takes a list of special tokens as both input and output. On the right side, we show the reranking results on DL19 using different forms of inputs.
git clone git@github.com:liuqi6777/pe_rank.git
The checkpoint of the PE-Rank model is available at this link: PE-Rank.
We provide scripts for first-stage retrieval. For example, you can use the following command to run BM25 as the retrieval model:
python src/scripts/run_evaluation.py --dataset dl19 --retriever bm25 --topk 100
This command will run the BM25 retrieval model on the DL19 dataset and save the retrieval results to results/retrieval_results/bm25/dl19_top100.jsonl.
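For reference, here is a minimal sketch of how you could peek at the retrieval results file. It only assumes the file is line-delimited JSON, as the .jsonl extension suggests; the exact field names are defined by the repository's own scripts.
import json

# Read the first record of the retrieval results file and print its keys
# so you can see the actual schema used by this repository.
with open("results/retrieval_results/bm25/dl19_top100.jsonl") as f:
    first = json.loads(f.readline())
print(list(first.keys()))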
Alternatively, we also provide all the retrieval results at this link: https://huggingface.co/liuqi6777/pyserini_retrieval_results. You can download the retrieval results to the results/retrieval_results folder.
To run the reranking stage, you can use the following command:
python src/evaluate.py --datasets dl19 --model-path liuqi6777/pe-rank-mistral-jina --retriever bm25 --topk 100
The reranking results will be saved to results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt, and you can use the following command to compute the evaluation metrics:
python src/scripts/trec_eval.py --dataset dl19 --ranking results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt
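If you prefer computing metrics directly in Python, the sketch below uses pytrec_eval to get nDCG@10. It assumes the rerank output is a standard TREC run file and that you have the DL19 passage qrels available locally; the qrels path is a placeholder, not a file shipped with this repository.
import pytrec_eval

# Placeholder paths; point these at your local qrels and the rerank run file.
qrels_path = "qrels.dl19-passage.txt"
run_path = "results/rerank_results/bm25/eval_dl19_pe-rank-mistral-jina_listwise-embedding_top100.txt"

def read_qrels(path):
    # TREC qrels format: "qid 0 docid relevance"
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path):
    # TREC run format: "qid Q0 docid rank score tag"
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

evaluator = pytrec_eval.RelevanceEvaluator(read_qrels(qrels_path), {"ndcg_cut.10"})
per_query = evaluator.evaluate(read_run(run_path))
print(sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query))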
For other datasets or other retrieval models, simply replace the --datasets and --retriever arguments, as in the sketch below.
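A small sketch that sweeps several dataset/retriever combinations by calling the evaluation script repeatedly; the extra dataset name here is illustrative, so use whichever datasets you actually have retrieval results for.
import subprocess

# Assumed dataset/retriever names for illustration; adjust to your setup.
for retriever in ["bm25"]:
    for dataset in ["dl19", "dl20"]:
        subprocess.run([
            "python", "src/evaluate.py",
            "--datasets", dataset,
            "--model-path", "liuqi6777/pe-rank-mistral-jina",
            "--retriever", retriever,
            "--topk", "100",
        ], check=True)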
Coming soon.
If you want to train the PE-Rank model from scratch or with customized settings, you can follow the instructions below.
All datasets used in the paper are available at this link: pe_rank_data. Please download the data to the data folder.
For example, you can run the following command:
git clone git@hf.co:datasets/liuqi6777/pe_rank_data ./data
You can refer to the paper for more details about the datasets.
To run the alignment stage, you can use the following command:
deepspeed --include="localhost:0,1,2,3" src/train.py \ --deepspeed scripts/zero2.json \ --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \ --data_path ./data/wiki2m.jsonl \ --encoder_name jinaai/jina-embeddings-v2-base-en \ --encoder_pooling mean \ --projector_type mlp2x_gelu \ --freeze_backbone \ --tune_mlp_adapter \ --bf16 \ --output_dir ./checkpoints/mistral.jina.projector \ --num_train_epochs 1 \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 1 \ --evaluation_strategy"no" \ --save_strategy"steps" \ --save_steps 1000 \ --save_total_limit 1 \ --learning_rate 1e-3 \ --warmup_ratio 0.03 \ --lr_scheduler_type"cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 512 \ --gradient_checkpointing \ --attn_implementation flash_attention_2 \ --dataloader_num_workers 4
This command will run the alignment stage using the Mistral-7B model as the backbone and Jina-Embeddings as the encoder.
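The --projector_type mlp2x_gelu flag names the adapter trained in this stage. As a rough illustration, a two-layer GELU MLP projector typically looks like the sketch below; the class name, dimensions, and layer layout are assumptions for illustration, not the repository's exact implementation.
import torch.nn as nn

# Hypothetical projector mapping encoder embeddings (e.g. 768-d Jina vectors)
# into the LLM hidden space (4096-d for Mistral-7B), in the common "mlp2x_gelu" pattern.
class EmbeddingProjector(nn.Module):
    def __init__(self, encoder_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, passage_embeddings):
        # (num_passages, encoder_dim) -> (num_passages, llm_dim)
        return self.proj(passage_embeddings)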
To run the learning-to-rank stage, you can use the following command:
deepspeed --include="localhost:4,5,6,7" --master_port="29700" src/train.py \ --deepspeed ./scripts/zero2.json \ --model_type rank_lm \ --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \ --data_path ./data/train.jsonl \ --use_embedding_with_content True \ --use_embedding_without_content True \ --kl_loss_weight 0.2 \ --loss1_weight 1 \ --loss2_weight 1 \ --encoder_name jinaai/jina-embeddings-v2-base-en \ --encoder_pooling mean \ --pretrain_mlp_adapter ./checkpoints/mistral.jina.projector/projector.bin \ --projector_type mlp2x_gelu \ --tune_mlp_adapter \ --bf16 True \ --tf32 True \ --output_dir"./checkpoints/pe-rank-mistral-jina" \ --overwrite_output_dir \ --num_train_epochs 1 \ --per_device_train_batch_size 4 \ --gradient_accumulation_steps 2 \ --save_strategy"steps" \ --save_steps 3000 \ --save_total_limit 2 \ --learning_rate 2e-5 \ --warmup_ratio 0.03 \ --lr_scheduler_type"cosine" \ --logging_steps 1 \ --model_max_length 4096 \ --gradient_checkpointing True \ --attn_implementation flash_attention_2 \ --dataloader_num_workers 2
This command will run the full learning-to-rank stage.
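The --loss1_weight, --loss2_weight, and --kl_loss_weight flags indicate that the training objective combines two weighted losses (corresponding to the with-content and without-content inputs) with a KL term. A hedged sketch of how such a combination might be assembled is shown below; this is an illustration of the weighting scheme implied by the flags, not the repository's actual training code.
import torch
import torch.nn.functional as F

def combined_loss(loss_with_content, loss_without_content, logits_a, logits_b,
                  loss1_weight=1.0, loss2_weight=1.0, kl_loss_weight=0.2):
    # Weighted sum of the two listwise losses plus a KL term that aligns the
    # two output distributions; weights mirror the command-line flags above.
    kl = F.kl_div(
        torch.log_softmax(logits_a, dim=-1),
        torch.softmax(logits_b, dim=-1),
        reduction="batchmean",
    )
    return (loss1_weight * loss_with_content
            + loss2_weight * loss_without_content
            + kl_loss_weight * kl)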
@article{liu2024leveraging,
  title={Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models},
  author={Liu, Qi and Wang, Bo and Wang, Nan and Mao, Jiaxin},
  journal={arXiv preprint arXiv:2406.14848},
  year={2024}
}