Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)
Code for the paper *Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators* (arXiv preprint arXiv:2403.16950).

This paper has been accepted by COLM 2024.
If you are interested in pairwise evaluators, please also check out our latest work on zero-shot automatic prompt optimization for pairwise evaluators.
We provide a ready-to-use Python library for Pairwise preference ranking (PairS). We show a ranking demonstration below. For an input source text and a sequence of output candidates, `PairsGreedy` and `PairsBeam` can be used to rank the output candidates in ascending order. We currently support the following base models: `google/gemma-2-9b-it`, `google/gemma-2-27b-it`, `meta-llama/Meta-Llama-3-8B-Instruct`, `microsoft/Phi-3-medium-4k-instruct`, `microsoft/Phi-3-mini-4k-instruct`, `mistralai/Mistral-7B-Instruct-v0.1`, `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-13b-chat-hf`, `HuggingFaceH4/zephyr-7b-beta`, `gpt-3.5-turbo`, and `gpt-4-turbo`.
```python
from pairs import PairsGreedy, PairsBeam
from scripts.utils import shuffle_lists, load_summEval

# Load example data
summ_eval_path = 'data/SummEval/model_annotations.aligned.paired.jsonl'
input_doc, output_doc, _ = load_summEval(summ_eval_path, flat_output=False)
doc_id = 42
input, output = input_doc[doc_id], output_doc[doc_id]
input, output = shuffle_lists(input, output)

# The same input source text corresponds to multiple output summaries
print('Number of summary candidates:', len(output))

method = 'PairsGreedy'
if method == 'PairsGreedy':
    # Set hyperparameters
    params = {
        # 'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'engine': "meta-llama/Llama-2-7b-chat-hf",
        'api_call': 0,
        # Use the prompt template for tasks with context input, e.g. summarization
        'with_input': True,
        # For each pairwise comparison, average the probabilities of both
        # permutations to cancel the positional bias
        'calibrate': False,
    }
    # Rank the output summaries from low to high quality
    indices = PairsGreedy(input[0], output, params)
    print(indices)
elif method == 'PairsBeam':
    # Set hyperparameters
    params = {
        'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'beam_size': 2000,
        'api_call': 0,
        'prob_gap': 0.1,
        'with_input': True,
        'calibrate': False,
    }
    # Rank the output summaries from low to high quality
    indices = PairsBeam(input[0], output, params)
    print(indices)
```
We also provide the original code (in the folder `scripts/`) to evaluate on the datasets reported in the paper. For NewsRoom and SummEval:

```bash
bash pairs_run.sh
```
We provide notebook demonstrations in `notebooks/`.

- **Load dataset**: all dataset loading is in `scripts/utils.py`.
- **Prompts**: all prompts and instructions are in `scripts/prompts.py`.
- **Base models**: we support the following base models: `mistralai/Mistral-7B-Instruct-v0.1`, `meta-llama/Llama-2-7b-chat-hf`, and all versions of GPT-3.5-turbo and GPT-4-turbo.
Hyper-parameters:

- `dataset`: we support three datasets: 'newsroom', 'SummEval', and 'hanna'.
- `eval_method`: for all PairS methods, we use 'pairwise comparison'.
- `engine`: the base model.
- `with_input`: whether the data format has input text. For example, the summarization task has source text as input, but the story-writing task has no input text.
- `confidence_beam`: `True` for PairS-Beam and `False` for PairS-Greedy.
- `prob_gap`: the uncertainty tolerance. A value of $0.1$ means we create beam candidates for both A and B if $0.5 - 0.1 < P(A \succ B) < 0.5 + 0.1$.
- `calibrate`: LLMs suffer from positional bias. Setting this to `True` averages the probabilities of both permutations of A and B for each pairwise comparison, which cancels the positional bias.
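The `calibrate` and `prob_gap` options can be illustrated with a small sketch. This is not the library's internal code; the functions `calibrated_prob_a_wins` and `within_prob_gap` are hypothetical names, and the probabilities stand in for values a base model would return.

```python
# Illustrative sketch (not the library's internals) of calibration and the
# prob_gap uncertainty check used in beam expansion.

def calibrated_prob_a_wins(p_a_first: float, p_b_first: float) -> float:
    """Average P(A preferred) over both orderings of A and B.

    p_a_first: P(A preferred) when the prompt shows A before B.
    p_b_first: P(B preferred) when the prompt shows B before A,
               so P(A preferred) in that ordering is 1 - p_b_first.
    """
    return 0.5 * (p_a_first + (1.0 - p_b_first))

def within_prob_gap(p: float, prob_gap: float = 0.1) -> bool:
    """Beam candidates are created for both A and B when the
    comparison falls inside the uncertainty tolerance."""
    return 0.5 - prob_gap < p < 0.5 + prob_gap

# A positionally biased model: it slightly prefers whichever
# candidate appears first in the prompt.
p = calibrated_prob_a_wins(0.7, 0.6)
print(p)                  # -> 0.55 after averaging the two orderings
print(within_prob_gap(p)) # -> True: keep both beam branches
```

The averaging cancels any preference the model has for the first-listed candidate, since that bias inflates `p_a_first` and `p_b_first` by the same amount in opposite directions.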
More details and comments will be added soon.
PairS-Greedy can be understood as a merge sort with pairwise comparisons performed by LLMs, while PairS-Beam performs a beam search within each merge operation. To improve beam-search efficiency and limit the search space, we also apply a local uncertainty-based pruning mechanism.
We show the algorithm of the modified merge operation for PairS-Beam below.
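As a rough illustration of the merge-sort view (not the paper's beam-search algorithm), the greedy variant can be sketched with an ordinary merge sort whose comparator is a pairwise judge. Here `llm_prefers` is a stand-in using plain integer comparison; the real library would query a base model for $P(A \succ B)$ at this step.

```python
# A minimal sketch of the merge-sort view of PairS-Greedy.
# `llm_prefers` is a hypothetical stand-in for a pairwise LLM judgement.

def llm_prefers(a, b) -> bool:
    """Stand-in pairwise judge: here, plain integer comparison."""
    return a <= b

def pairwise_merge_sort(items: list) -> list:
    """Rank items in ascending order using only pairwise preferences."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = pairwise_merge_sort(items[:mid])
    right = pairwise_merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if llm_prefers(left[i], right[j]):  # one pairwise comparison
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(pairwise_merge_sort([3, 1, 4, 1, 5]))  # -> [1, 1, 3, 4, 5]
```

Merge sort needs only $O(n \log n)$ pairwise comparisons, which is what makes LLM-based pairwise ranking affordable relative to comparing all pairs; PairS-Beam replaces the single greedy choice inside each merge with a beam over plausible orderings.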
For more details, please check out our paper. If you find our work helpful, please consider citing it:
```bibtex
@article{liu2024aligning,
  title={Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators},
  author={Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulic, Ivan and Korhonen, Anna and Collier, Nigel},
  journal={arXiv preprint arXiv:2403.16950},
  year={2024}
}
```