Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)
cambridgeltl/PairS
Code for the paper Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators.

Link to paper: Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (arXiv preprint arXiv:2403.16950)

This paper has been accepted by COLM 2024.

If you are interested in pairwise evaluators, please also check out our latest work on zero-shot automatic prompt optimization for pairwise evaluators.
We provide a ready-to-use Python library for Pairwise preference ranking (PairS). We show a ranking demonstration below. For an input source text and a sequence of output candidates, `PairsGreedy` and `PairsBeam` can be used to rank the output candidates in ascending order (from low to high quality).

We currently support the following base models: `google/gemma-2-9b-it`, `google/gemma-2-27b-it`, `meta-llama/Meta-Llama-3-8B-Instruct`, `microsoft/Phi-3-medium-4k-instruct`, `microsoft/Phi-3-mini-4k-instruct`, `mistralai/Mistral-7B-Instruct-v0.1`, `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-13b-chat-hf`, `HuggingFaceH4/zephyr-7b-beta`, `gpt-3.5-turbo`, `gpt-4-turbo`.
```python
from pairs import PairsGreedy, PairsBeam
from scripts.utils import shuffle_lists, load_summEval

# Load example data
summ_eval_path = 'data/SummEval/model_annotations.aligned.paired.jsonl'
input_doc, output_doc, _ = load_summEval(summ_eval_path, flat_output=False)

doc_id = 42
input, output = input_doc[doc_id], output_doc[doc_id]
input, output = shuffle_lists(input, output)
# The same input source text corresponds to multiple output summaries
print('Number of summary candidates:', len(output))

method = 'PairsGreedy'
if method == 'PairsGreedy':
    # Set hyperparameters
    params = {
        # 'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'engine': "meta-llama/Llama-2-7b-chat-hf",
        'api_call': 0,
        'with_input': True,  # Use the prompt template for task with context input, e.g. Summarization
        'calibrate': False,  # For each pairwise comparison, we average the probabilities of both permutations to cancel the positional bias.
    }
    # Rank the output summaries from low to high quality
    indices = PairsGreedy(input[0], output, params)
    print(indices)

elif method == 'PairsBeam':
    # Set hyperparameters
    params = {
        'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'beam_size': 2000,
        'api_call': 0,
        'prob_gap': 0.1,
        'with_input': True,
        'calibrate': False,
    }
    # Rank the output summaries from low to high quality
    indices = PairsBeam(input[0], output, params)
    print(indices)
```
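The demonstration prints `indices` but does not show how to apply them. The sketch below is one way to do so, assuming that the returned `indices` is a permutation of candidate positions ordered from lowest to highest quality (this is inferred from the comments in the demonstration, not confirmed against the library). The helper `reorder_by_ranking` and the placeholder data are hypothetical and not part of PairS.

```python
from typing import List, Sequence


def reorder_by_ranking(candidates: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Reorder candidates by a ranking given as positions (assumed worst to best)."""
    # Guard against rankings that skip or repeat a candidate.
    if sorted(indices) != list(range(len(candidates))):
        raise ValueError("indices must be a permutation of range(len(candidates))")
    return [candidates[i] for i in indices]


# Placeholder data standing in for `output` and the value returned by PairsGreedy.
output = ["summary A", "summary B", "summary C"]
indices = [2, 0, 1]

ranked = reorder_by_ranking(output, indices)
print("Lowest ranked:", ranked[0])
print("Highest ranked:", ranked[-1])
```

Under that assumption, the last element of `ranked` is the candidate the evaluator prefers most.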
We also provide the original code (in the folder `scripts/`) to evaluate on the datasets reported in the paper.
For NewsRoom and SummEval, run `bash pairs_run.sh`.
We provide notebook demonstrations in `notebooks/`.
- Load dataset: All dataset loading code is in `scripts/utils.py`.
- Prompts: All prompts and instructions are in `scripts/prompts.py`.
- Base models: We support the following base models: `mistralai/Mistral-7B-Instruct-v0.1`, `meta-llama/Llama-2-7b-chat-hf`, and all versions of `GPT-3.5-turbo` and `GPT-4-turbo`.
- Hyper-parameters:
  - `dataset`: We support 3 datasets: 'newsroom', 'SummEval' and 'hanna'.
  - `eval_method`: For all PairS methods, we use 'pairwise comparison'.
  - `engine`: The base model.
  - `with_input`: Whether the data format has an input text. For example, the summarization task has a source text as input, but the story writing task has no input text.
  - `confidence_beam`: `True` for PairS-beam and `False` for PairS-greedy.
  - `prob_gap`: The uncertainty tolerance. A value of $0.1$ means that beam candidates are created for both A and B if $0.5 - 0.1 < P(A \succ B) < 0.5 + 0.1$.
  - `calibrate`: LLMs suffer from positional bias. Setting this to `True` averages the probabilities of both permutations of A and B for each pairwise comparison, which cancels the positional bias.
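The `calibrate` and `prob_gap` settings can be illustrated with a small sketch. It is one plausible reading of the descriptions above, not the library's actual code: the function names `calibrated_preference` and `is_uncertain`, and the convention that the second probability is measured with B shown first, are assumptions made for illustration.

```python
def calibrated_preference(p_a_first: float, p_b_first: float) -> float:
    """Average P(A > B) over both presentation orders.

    p_a_first: probability that A is preferred when A is shown first.
    p_b_first: probability that B is preferred when B is shown first
               (the same pair with positions swapped; an assumed convention).
    """
    # With B shown first, the probability that A wins is 1 - p_b_first.
    return 0.5 * (p_a_first + (1.0 - p_b_first))


def is_uncertain(p_a_over_b: float, prob_gap: float = 0.1) -> bool:
    """True when 0.5 - prob_gap < P(A > B) < 0.5 + prob_gap."""
    return abs(p_a_over_b - 0.5) < prob_gap


# A hypothetical evaluator that favours whichever candidate is shown first.
p = calibrated_preference(p_a_first=0.9, p_b_first=0.5)
print(round(p, 2))         # 0.7
print(is_uncertain(p))     # False
print(is_uncertain(0.55))  # True
```

In this reading, a comparison flagged as uncertain is the case where PairS-beam would keep both orderings as beam candidates.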
More details and comments will be added soon.
PairS-Greedy can be understood as a merge sort in which each pairwise comparison is made by an LLM, while PairS-Beam performs a beam search within each merge operation. To improve the efficiency of the beam search and limit the search space, we also apply a local uncertainty-based pruning mechanism.
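The merge-sort view of PairS-Greedy can be sketched as follows. This is a minimal illustration rather than the repository's implementation: `merge_sort_rank` and the `prefers` callback are hypothetical names, and the LLM call is replaced by a toy comparator that prefers longer strings so the example runs on its own.

```python
from typing import Callable, List, Sequence


def merge_sort_rank(items: Sequence[str],
                    prefers: Callable[[str, str], float]) -> List[int]:
    """Return item indices ordered from lowest to highest quality.

    prefers(a, b) should return an estimate of P(a is better than b);
    in PairS this role is played by an LLM (stubbed out here).
    """
    def sort(idx: List[int]) -> List[int]:
        if len(idx) <= 1:
            return idx
        mid = len(idx) // 2
        left, right = sort(idx[:mid]), sort(idx[mid:])
        merged: List[int] = []
        i = j = 0
        while i < len(left) and j < len(right):
            if prefers(items[left[i]], items[right[j]]) > 0.5:
                # Left candidate is better, so the right one comes first.
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        # One side is exhausted; the remainder is already sorted.
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    return sort(list(range(len(items))))


def toy_prefers(a: str, b: str) -> float:
    """Toy stand-in for the LLM: longer text counts as better."""
    return 1.0 if len(a) > len(b) else 0.0


candidates = ["bbb", "a", "dddd", "cc"]
print(merge_sort_rank(candidates, toy_prefers))  # [1, 3, 0, 2]
```

Because merge sort needs only O(n log n) comparisons, far fewer LLM calls are required than when comparing all pairs.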
We show the algorithm of the modified merge operation for PairS-Beam below.
For more details, please check out our paper.

If you find our work helpful, please consider citing our paper:
```bibtex
@article{liu2024aligning,
  title={Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators},
  author={Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulic, Ivan and Korhonen, Anna and Collier, Nigel},
  journal={arXiv preprint arXiv:2403.16950},
  year={2024}
}
```