cambridgeltl/PairS


Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)

License

MIT license

Folders and files

data
figs
notebooks
pairs
scripts
.gitignore
LICENSE
README.md
example.py
pairs_flat_run.sh
requirements.txt


PairS: Pairwise Preference Search-based Rank Aggregation (PairS)

Code for the paper Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators.

Link to paper: Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (arXiv preprint arXiv:2403.16950)
This paper has been accepted by COLM 2024.

If you are interested in pairwise evaluators, please also check out our latest work on zero-shot automatic prompt optimization for pairwise evaluators.

Code

Ready-to-use Package

We provide a ready-to-use Python library for Pairwise preference ranking (PairS). A ranking demonstration is shown below. For an input source text and a sequence of output candidates, PairsGreedy and PairsBeam can be used to rank the output candidates in ascending order.

We currently support the following base models: google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Meta-Llama-3-8B-Instruct, microsoft/Phi-3-medium-4k-instruct, microsoft/Phi-3-mini-4k-instruct, mistralai/Mistral-7B-Instruct-v0.1, meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-13b-chat-hf, HuggingFaceH4/zephyr-7b-beta, gpt-3.5-turbo, gpt-4-turbo.

    from pairs import PairsGreedy, PairsBeam
    from scripts.utils import shuffle_lists, load_summEval

    # Load example data
    summ_eval_path = 'data/SummEval/model_annotations.aligned.paired.jsonl'
    input_doc, output_doc, _ = load_summEval(summ_eval_path, flat_output=False)

    doc_id = 42
    input, output = input_doc[doc_id], output_doc[doc_id]
    input, output = shuffle_lists(input, output)
    # The same input source text corresponds to multiple output summaries
    print('Number of summary candidates:', len(output))

    method = 'PairsGreedy'
    if method == 'PairsGreedy':
        # Set hyperparameters
        params = {
            # 'engine': "mistralai/Mistral-7B-Instruct-v0.1",
            'engine': "meta-llama/Llama-2-7b-chat-hf",
            'api_call': 0,
            'with_input': True,  # Use the prompt template for task with context input, e.g. Summarization
            'calibrate': False,  # For each pairwise comparison, we average the probabilities of both permutations to cancel the positional bias.
        }
        # Rank the output summaries from low to high quality
        indices = PairsGreedy(input[0], output, params)
        print(indices)

    elif method == 'PairsBeam':
        # Set hyperparameters
        params = {
            'engine': "mistralai/Mistral-7B-Instruct-v0.1",
            'beam_size': 2000,
            'api_call': 0,
            'prob_gap': 0.1,
            'with_input': True,
            'calibrate': False,
        }
        # Rank the output summaries from low to high quality
        indices = PairsBeam(input[0], output, params)
        print(indices)
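Once a ranking has been produced, the returned indices can be used to reorder the candidates. The following is a minimal sketch, assuming that `indices` is a permutation of candidate positions listed from lowest to highest quality. The README states that candidates are ranked in ascending order, but the exact return type is an assumption here, and `apply_ranking` as well as the sample values are hypothetical stand-ins rather than part of the library.

```python
def apply_ranking(candidates, indices):
    """Reorder candidates according to a ranking given as a list of positions."""
    # Guard against a ranking that does not cover every candidate exactly once.
    if sorted(indices) != list(range(len(candidates))):
        raise ValueError("indices must be a permutation of candidate positions")
    return [candidates[i] for i in indices]


# Hypothetical values for illustration only (not real model output).
candidates = ["summary A", "summary B", "summary C"]
indices = [2, 0, 1]

ranked = apply_ranking(candidates, indices)
worst, best = ranked[0], ranked[-1]  # ascending order: worst first, best last
print(ranked)
```

Under the ascending-order assumption, the last element of `ranked` is the candidate judged best.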

Evaluate on Datasets

We also provide the original code (in the folder scripts/) to evaluate on the datasets reported in the paper.

For NewsRoom and SummEval

    bash pairs_run.sh

Notebook Demo

We provide notebook demonstrations in notebooks/.

Break downs

Load dataset: All dataset loading code is in scripts/utils.py.

Prompts: All prompts and instructions are in scripts/prompts.py.

Base models: We support the following base models: mistralai/Mistral-7B-Instruct-v0.1, meta-llama/Llama-2-7b-chat-hf, and all versions of GPT-3.5-turbo and GPT-4-turbo.

Hyper-parameters:

dataset: We support 3 datasets: 'newsroom', 'SummEval' and 'hanna'.
eval_method: For all PairS methods, we use 'pairwise comparison'.
engine: The base model.
with_input: Whether the data format has an input text. For example, the summarization task has a source text as input, but the story writing task has no input text.
confidence_beam: True for PairS-beam and False for PairS-greedy.
prob_gap: The uncertainty tolerance. A value of $0.1$ means that beam candidates are created for both A and B if $0.5-0.1 < P(A \succ B) < 0.5+0.1$.
calibrate: LLMs suffer from positional bias. Setting this to True averages the probabilities of both permutations of A and B for each pairwise comparison, which cancels the positional bias.
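The calibrate and prob_gap settings can be illustrated with a few lines of arithmetic. This is a sketch of the two rules as described in the list above, not the library's code: the function names `calibrated_preference` and `is_uncertain` are hypothetical, and the probabilities are made-up numbers.

```python
def calibrated_preference(p_a_shown_first, p_a_shown_second):
    """Average P(A > B) over the two presentation orders of A and B.

    Both arguments are the probability that A is preferred over B; they
    differ only in whether A was placed first or second in the prompt.
    """
    return 0.5 * (p_a_shown_first + p_a_shown_second)


def is_uncertain(p_a_over_b, prob_gap=0.1):
    """True when 0.5 - prob_gap < P(A > B) < 0.5 + prob_gap."""
    return 0.5 - prob_gap < p_a_over_b < 0.5 + prob_gap


# Made-up probabilities: the model favours whichever candidate comes first.
p_confident = calibrated_preference(0.8, 0.5)   # averages to about 0.65
p_borderline = calibrated_preference(0.62, 0.5)  # averages to about 0.56

print(is_uncertain(p_confident))   # outside the (0.4, 0.6) band
print(is_uncertain(p_borderline))  # inside the (0.4, 0.6) band
```

With prob_gap set to 0.1, only the second comparison would lead to beam candidates for both A and B.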

More details and comments will be added soon.

Algorithm of PairS-Beam

PairS-Greedy can be understood as a merge sort in which each pairwise comparison is made by an LLM, while PairS-Beam performs a beam search for each merge operation. To improve the beam search efficiency and limit the search space, we also apply a local uncertainty-based pruning mechanism.
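The merge-sort view of PairS-Greedy can be sketched as follows. This is an illustrative sketch rather than the repository's implementation: `merge_sort_rank` and `toy_prefer` are hypothetical names, and `toy_prefer` is a deterministic stand-in for an LLM call that would return the probability that the first candidate is better than the second.

```python
def merge_sort_rank(items, prefer):
    """Return positions of items in ascending quality order.

    prefer(a, b) is assumed to return the probability that a is better than b.
    """
    def sort(idx):
        if len(idx) <= 1:
            return idx
        mid = len(idx) // 2
        left, right = sort(idx[:mid]), sort(idx[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            # Ascending order: emit the candidate judged worse first.
            if prefer(items[left[i]], items[right[j]]) > 0.5:
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    return sort(list(range(len(items))))


def toy_prefer(a, b):
    # Stand-in comparator for illustration: a longer string counts as "better".
    return 1.0 if len(a) > len(b) else 0.0


items = ["ccc", "a", "bb", "dddd"]
order = merge_sort_rank(items, toy_prefer)
print(order)  # → [1, 2, 0, 3]
```

Merge sort needs on the order of n log n comparisons, which matters when every comparison is a model call.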

We show the algorithm of the modified merge operation for PairS-Beam below.

A Beam-search Merge Operation Demonstration

For more details please check out our paper.
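As a rough illustration of a beam-search merge with uncertainty-based pruning, the sketch below branches only when a comparison falls inside the prob_gap band and keeps the beam_size most likely partial merges. It is a simplified sketch under stated assumptions, not the authors' exact algorithm: the scoring by cumulative log-probability, the names `beam_merge` and `toy_prefer`, and the numeric toy candidates are all assumptions made for this example.

```python
import math


def beam_merge(left, right, prefer, beam_size=4, prob_gap=0.1):
    """Merge two ascending lists, branching when a comparison is uncertain.

    prefer(a, b) is assumed to return the probability that a is better than b.
    """
    # Each beam state: (cumulative log-probability, i, j, merged so far).
    beams = [(0.0, 0, 0, [])]
    for _ in range(len(left) + len(right)):
        expanded = []
        for logp, i, j, merged in beams:
            if i == len(left):
                expanded.append((logp, i, j + 1, merged + [right[j]]))
            elif j == len(right):
                expanded.append((logp, i + 1, j, merged + [left[i]]))
            else:
                p = prefer(left[i], right[j])
                uncertain = 0.5 - prob_gap < p < 0.5 + prob_gap
                # Emitting right[j] next means left[i] is better (probability p).
                if p > 0.5 or uncertain:
                    expanded.append(
                        (logp + math.log(p), i, j + 1, merged + [right[j]]))
                # Emitting left[i] next means right[j] is better (probability 1 - p).
                if p <= 0.5 or uncertain:
                    expanded.append(
                        (logp + math.log(1.0 - p), i + 1, j, merged + [left[i]]))
        # Keep only the most likely partial merges.
        beams = sorted(expanded, key=lambda s: s[0], reverse=True)[:beam_size]
    return beams[0][3]


def toy_prefer(a, b):
    # Stand-in for an LLM comparison: logistic function of a score difference.
    return 1.0 / (1.0 + math.exp(-(a - b)))


merged = beam_merge([1, 4], [2, 3], toy_prefer)
print(merged)  # → [1, 2, 3, 4]
```

Comparisons outside the uncertainty band follow a single path, as in the greedy merge, so the search space grows only where the comparator is unsure.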

Citation

If you find our work helpful, please consider citing our paper:

@article{liu2024aligning,
  title={Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators},
  author={Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulic, Ivan and Korhonen, Anna and Collier, Nigel},
  journal={arXiv preprint arXiv:2403.16950},
  year={2024}
}
