
Code for paper: Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Link to paper: Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (arXiv preprint arXiv:2403.16950)
This paper has been accepted by COLM 2024.

If you are interested in pairwise evaluators, please also check out our latest work on zero-shot automatic prompt optimization for pairwise evaluators.

Code

Ready-to-use Package

We provide a ready-to-use Python library for Pairwise preference ranking (PairS). We show a ranking demonstration below. For an input source text and a sequence of output candidates, PairsGreedy and PairsBeam can be used to rank the output candidates in ascending order of quality. We currently support the following base models: google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Meta-Llama-3-8B-Instruct, microsoft/Phi-3-medium-4k-instruct, microsoft/Phi-3-mini-4k-instruct, mistralai/Mistral-7B-Instruct-v0.1, meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-13b-chat-hf, HuggingFaceH4/zephyr-7b-beta, gpt-3.5-turbo, and gpt-4-turbo.

```python
from pairs import PairsGreedy, PairsBeam
from scripts.utils import shuffle_lists, load_summEval

# Load example data
summ_eval_path = 'data/SummEval/model_annotations.aligned.paired.jsonl'
input_doc, output_doc, _ = load_summEval(summ_eval_path, flat_output=False)
doc_id = 42
input, output = input_doc[doc_id], output_doc[doc_id]
input, output = shuffle_lists(input, output)

# The same input source text corresponds to multiple output summaries
print('Number of summary candidates:', len(output))

method = 'PairsGreedy'
if method == 'PairsGreedy':
    # Set hyperparameters
    params = {
        # 'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'engine': "meta-llama/Llama-2-7b-chat-hf",
        'api_call': 0,
        'with_input': True,   # Use the prompt template for tasks with context input, e.g. summarization
        'calibrate': False,   # For each pairwise comparison, average the probabilities of both permutations to cancel positional bias
    }
    # Rank the output summaries from low to high quality
    indices = PairsGreedy(input[0], output, params)
    print(indices)
elif method == 'PairsBeam':
    # Set hyperparameters
    params = {
        'engine': "mistralai/Mistral-7B-Instruct-v0.1",
        'beam_size': 2000,
        'api_call': 0,
        'prob_gap': 0.1,
        'with_input': True,
        'calibrate': False,
    }
    # Rank the output summaries from low to high quality
    indices = PairsBeam(input[0], output, params)
    print(indices)
```

Evaluate on Datasets

We also provide the original code (in the folder scripts/) to evaluate on the datasets reported in the paper.

For NewsRoom and SummEval

```bash
bash pairs_run.sh
```

Notebook Demo

We provide notebook demonstrations in notebooks/.

Breakdown

Load dataset: We put all dataset loading in scripts/utils.py.

Prompts: We put all prompts and instructions in scripts/prompts.py.

Base models: We support the following base models: mistralai/Mistral-7B-Instruct-v0.1, meta-llama/Llama-2-7b-chat-hf, and all versions of GPT-3.5-turbo and GPT-4-turbo.

Hyper-parameters:

  • dataset: We support 3 datasets: 'newsroom', 'SummEval', and 'hanna'.
  • eval_method: For all PairS methods, we use 'pairwise comparison'.
  • engine: The base model.
  • with_input: Whether the data format has an input text. For example, the summarization task has a source text as input, while the story-writing task has no input text.
  • confidence_beam: True for PairS-beam and False for PairS-greedy.
  • prob_gap: The uncertainty tolerance. A value of $0.1$ means we create beam candidates for both A and B whenever $0.5-0.1 < P(A\succ B) < 0.5+0.1$.
  • calibrate: LLMs suffer from positional bias. Setting this to True averages the probabilities of both permutations of A and B for each pairwise comparison, which cancels the positional bias (see the sketch after this list).
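For intuition, here is a minimal sketch of what calibrate and prob_gap do in a single pairwise comparison. The helper llm_prob_a_first and the function names below are hypothetical illustrations, not the library's API; PairsGreedy and PairsBeam handle all of this internally.

```python
# Minimal sketch (hypothetical helpers, not the library's API) of how
# 'calibrate' and 'prob_gap' act on one pairwise comparison.

def llm_prob_a_first(item_a, item_b):
    """Hypothetical single LLM call: P(A is preferred) with A presented first."""
    raise NotImplementedError

def preference_prob(item_a, item_b, calibrate=True):
    """Estimate P(A > B); with calibrate=True, average both presentation orders."""
    p_ab = llm_prob_a_first(item_a, item_b)   # A presented first
    if not calibrate:
        return p_ab
    p_ba = llm_prob_a_first(item_b, item_a)   # B presented first
    # From the swapped call, P(A > B) = 1 - P(B > A); averaging the two
    # estimates cancels the model's positional bias.
    return 0.5 * (p_ab + (1.0 - p_ba))

def is_uncertain(p_a_wins, prob_gap=0.1):
    """PairS-beam branches on both A and B when the preference is this close to 0.5."""
    return 0.5 - prob_gap < p_a_wins < 0.5 + prob_gap
```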

More details and comments will be added soon.

Algorithm of PairS-Beam

PairS-Greedy can be understood as a merge sort whose comparator is a pairwise comparison by an LLM, while PairS-Beam performs a beam search within each merge operation. To improve beam-search efficiency and limit the search space, we also apply a local uncertainty-based pruning mechanism.
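As a mental model only (this is not the repository's implementation), PairS-Greedy is standard merge sort with the comparator replaced by one pairwise LLM judgment. A sketch, assuming a hypothetical comparator llm_not_better:

```python
# Conceptual sketch: PairS-Greedy viewed as merge sort with an LLM comparator.
# llm_not_better is hypothetical; in the repository the comparison is a
# prompted pairwise preference query to the chosen base model.

def llm_not_better(a, b):
    """Hypothetical: True if the LLM judges candidate `a` no better than `b`."""
    raise NotImplementedError

def pairs_greedy_sort(items):
    """Rank candidates in ascending quality with O(n log n) LLM comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = pairs_greedy_sort(items[:mid])
    right = pairs_greedy_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if llm_not_better(left[i], right[j]):   # one pairwise LLM judgment
            merged.append(left[i]); i += 1      # lower-quality item goes first
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```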

We show the algorithm of the modified merge operation for PairS-Beam below.

[Figure: algorithm of the modified merge operation for PairS-Beam]

A Beam-search Merge Operation Demonstration

[Figure: beam-search merge operation demonstration]
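To make the figure concrete, here is a heavily simplified sketch of one beam-search merge: branch when the preference probability falls inside the prob_gap window, otherwise follow the likelier side, and keep only the top beam_size partial merges by cumulative probability. preference_prob is the hypothetical calibrated comparator sketched earlier; the actual pruning in the paper has further details, so treat this as an illustration under those assumptions.

```python
import math
import heapq

def beam_merge(left, right, preference_prob, beam_size=2000, prob_gap=0.1):
    """Merge two ranked lists, branching only when the LLM preference is uncertain."""
    # Beam state: (neg_log_prob, i, j, merged) with i/j items consumed so far.
    beams = [(0.0, 0, 0, [])]
    for _ in range(len(left) + len(right)):          # each step consumes one item
        next_beams = []
        for score, i, j, merged in beams:
            if i == len(left):                       # left exhausted: take from right
                next_beams.append((score, i, j + 1, merged + [right[j]]))
            elif j == len(right):                    # right exhausted: take from left
                next_beams.append((score, i + 1, j, merged + [left[i]]))
            else:
                p = preference_prob(left[i], right[j])   # P(left[i] precedes right[j])
                take_l = (score - math.log(max(p, 1e-9)), i + 1, j, merged + [left[i]])
                take_r = (score - math.log(max(1 - p, 1e-9)), i, j + 1, merged + [right[j]])
                if 0.5 - prob_gap < p < 0.5 + prob_gap:
                    next_beams.extend([take_l, take_r])  # uncertain: keep both branches
                else:
                    next_beams.append(take_l if p >= 0.5 else take_r)  # prune to likelier branch
        beams = heapq.nsmallest(beam_size, next_beams)   # keep the most probable beams
    return min(beams)[3]                                 # most probable complete merge
```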

For more details, please check out our paper.

Citation

If you find our work helpful, please consider citing our paper:

```bibtex
@article{liu2024aligning,
  title={Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators},
  author={Liu, Yinhong and Zhou, Han and Guo, Zhijiang and Shareghi, Ehsan and Vulic, Ivan and Korhonen, Anna and Collier, Nigel},
  journal={arXiv preprint arXiv:2403.16950},
  year={2024}
}
```
