An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

License

Apache-2.0 license


AlpacaEval: An Automatic Evaluator for Instruction-following Language Models

AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena while costing less than $10 of OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:

Updates:

🎉 Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details here.

🎉 AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.


Overview

Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations like the preference for longer outputs. AlpacaEval provides the following:

Leaderboard: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
Automatic evaluator: an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.
Toolkit for building automatic evaluators: a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance etc).
Human evaluation data: 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
AlpacaEval dataset: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. Details here.

When to use and not use AlpacaEval?

When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.

When not to use AlpacaEval? As any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stake decision-making, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in limitations.

Quick Start

To install the stable release, run

pip install alpaca-eval

To install the nightly version, run

pip install git+https://github.com/tatsu-lab/alpaca_eval

Then you can use it as follows:

export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'

This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the model_outputs file. Important parameters are the following:

model_outputs: A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary should contain the keys instruction and output.
annotators_config: This is the annotator to use. We recommend using weighted_alpaca_eval_gpt4_turbo (default for AlpacaEval 2.0), which has a high agreement rate with our human annotation data, large context size, and is pretty cheap. For a comparison of all annotators see here.
reference_outputs: The outputs of the reference model. Same format as model_outputs. By default, this is gpt4_turbo for AlpacaEval 2.0.
output_path: Path for saving annotations and leaderboard.
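
As a rough illustration of the model_outputs format described above, the sketch below writes a small JSON file whose entries carry the instruction and output keys (plus an optional generator name). The helper save_model_outputs, the sample records, and the temporary path are hypothetical and not part of AlpacaEval.

```python
import json
import tempfile
from pathlib import Path


def save_model_outputs(examples, path):
    """Check that each record has the required keys, then write them as JSON.

    Hypothetical helper for illustration; it is not part of alpaca_eval.
    """
    for i, example in enumerate(examples):
        missing = {"instruction", "output"} - set(example)
        if missing:
            raise ValueError(f"example {i} is missing keys: {sorted(missing)}")
    Path(path).write_text(json.dumps(examples, indent=2), encoding="utf-8")
    return Path(path)


# Made-up records; "generator" is the optional model name.
outputs = [
    {"instruction": "What is the capital of France?", "output": "Paris.", "generator": "my_model"},
    {"instruction": "Name a prime number.", "output": "7", "generator": "my_model"},
]

out_path = save_model_outputs(outputs, Path(tempfile.mkdtemp()) / "outputs.json")
print(out_path)
```

The resulting file could then be passed to the CLI via --model_outputs.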

If you don't have the model outputs, you can use evaluate_from_model and pass a local path or a name of a HuggingFace model, or a model from a standard API (OpenAI, Anthropic, Cohere, google, ...). Other commands:

>>> alpaca_eval -- --help

SYNOPSIS
    alpaca_eval COMMAND

COMMANDS
    COMMAND is one of the following:

     evaluate
       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

     evaluate_from_model
       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

     make_leaderboard
       Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

     analyze_evaluators
       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

For more information about each function use alpaca_eval <command> -- --help.

Leaderboards and how to interpret them

Models

Our leaderboards are computed on the AlpacaEval dataset. We precomputed the leaderboard for important models using different baseline models and autoannotators. Our two main leaderboards ("AlpacaEval 2.0" and "AlpacaEval") can be found on this page. "AlpacaEval 2.0" uses weighted_alpaca_eval_gpt4_turbo for the annotator and gpt4_turbo for the baseline. "AlpacaEval" uses alpaca_eval_gpt4 for the annotator and text_davinci_003 for the baseline. For all precomputed leaderboards see here. Later we also show how to add your model to the leaderboard and how to make a new leaderboard for your evaluator/dataset. See here for the configs of all models that are available out of the box.

AlpacaEval minimal leaderboard:

	Win Rate	Std Error
gpt4	95.3	0.7
claude	88.4	1.1
chatgpt	86.1	1.2
guanaco-65b	71.8	1.6
vicuna-13b	70.4	1.6
text_davinci_003	50.0	0.0
alpaca-farm-ppo-human	41.2	1.7
alpaca-7b	26.5	1.5
text_davinci_001	15.2	1.2

How exactly are those metrics computed?

Win Rate: the win rate measures the fraction of time the model's output is preferred over the reference's outputs (text-davinci-003 for AlpacaEval and gpt4_turbo for AlpacaEval 2.0). More specifically, to compute the win rate we collect the outputs of the desired model on every instruction from the AlpacaEval dataset. We then pair each output with the output of our reference model (e.g. text-davinci-003) on the same instruction. We then ask our automatic evaluator which output they prefer. See AlpacaEval's and AlpacaEval 2.0's prompts and configs; in particular, we randomize the order of outputs to avoid position bias. We then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline. If both outputs are exactly the same we use a half preference for both models.

Standard error: this is the standard error (normalized by N-1) of the win rate, i.e., the preferences averaged over the different instructions.
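
The two definitions above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name toy_win_rate is hypothetical, and the preference encoding (0 for a draw, 1 when the reference wins, 2 when the model wins) is taken from the CLI help text for fn_metric.

```python
import math
import statistics


def toy_win_rate(preferences):
    """Convert pairwise preferences into a win rate and standard error (in %).

    Assumed encoding: 0 = draw, 1 = reference wins, 2 = model wins.
    A draw counts as half a win for each side.
    """
    score_by_preference = {2: 1.0, 1: 0.0, 0: 0.5}
    scores = [score_by_preference[p] for p in preferences]
    n = len(scores)
    win_rate = sum(scores) / n
    # statistics.stdev normalizes by N-1; dividing by sqrt(N) gives the standard error.
    standard_error = statistics.stdev(scores) / math.sqrt(n) if n > 1 else 0.0
    return {"win_rate": 100 * win_rate, "standard_error": 100 * standard_error, "n_total": n}


result = toy_win_rate([2, 2, 1, 0])
print(result)
```

With four made-up preferences (two wins, one loss, one draw) the win rate is 62.5%.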

Details about our auto-annotator: alpaca_eval_gpt4

Our alpaca_eval_gpt4 (see configs) annotator averages over preferences, where preferences are obtained as follows:

it takes in an instruction and a pair of outputs (from the desired model and the reference model)
if a preference for this triple was already computed, it returns it (i.e. it uses caching)
it randomizes the order of the outputs to avoid position bias
it formats the instruction and outputs into the following zero-shot prompt, which asks to order the outputs in order of preference
it completes the prompt using GPT4 with temperature=0
it parses the preference from the completions and returns it
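
The steps above can be sketched as a toy annotator. Everything here is an assumption made for illustration: the class ToyPairwiseAnnotator, the prompt text, the instruction-based seeding scheme, and the stub that stands in for the GPT-4 call are all hypothetical, and the real prompt and parser differ.

```python
import hashlib
import random

# Hypothetical prompt; the real AlpacaEval prompt is a longer ranking prompt.
PROMPT = (
    "Instruction: {instruction}\n"
    "Output (a): {output_a}\n"
    "Output (b): {output_b}\n"
    "Which output is better? Answer with 'a' or 'b'."
)


class ToyPairwiseAnnotator:
    """Returns 1 if the reference output is preferred, 2 if the model output is."""

    def __init__(self, complete):
        self.complete = complete  # callable: prompt -> completion text
        self.cache = {}

    def annotate(self, instruction, model_output, reference_output):
        key = (instruction, model_output, reference_output)
        if key in self.cache:  # caching: never recompute a known triple
            return self.cache[key]

        # Randomize the output order, seeded on the instruction for reproducibility.
        digest = hashlib.sha256(instruction.encode("utf-8")).hexdigest()
        swap = random.Random(int(digest, 16) % (2**32)).random() < 0.5
        output_a, output_b = (
            (reference_output, model_output) if swap else (model_output, reference_output)
        )

        prompt = PROMPT.format(instruction=instruction, output_a=output_a, output_b=output_b)
        completion = self.complete(prompt)

        # Parse the answer, then undo the shuffle to recover who actually won.
        picked_a = completion.strip().lower().startswith("a")
        preference = 2 if picked_a != swap else 1
        self.cache[key] = preference
        return preference


calls = []


def prefer_longer_stub(prompt):
    """Stand-in for the LLM call: always picks the longer output."""
    calls.append(prompt)
    output_a = prompt.split("Output (a): ")[1].split("\nOutput (b): ")[0]
    output_b = prompt.split("\nOutput (b): ")[1].split("\nWhich output")[0]
    return "a" if len(output_a) >= len(output_b) else "b"


annotator = ToyPairwiseAnnotator(prefer_longer_stub)
first = annotator.annotate("Name a color.", "Blue, like the sky.", "Red")
second = annotator.annotate("Name a color.", "Blue, like the sky.", "Red")
print(first, second, len(calls))
```

The second call hits the cache, so the stub is only invoked once.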

The annotator is a mix between (and was highly influenced by) AlpacaFarm and Aviary evaluators. In particular, we use the same code as for AlpacaFarm (caching/randomization/hyperparameters) but use a ranking prompt similar to that of Aviary. We make changes to Aviary's prompt to decrease the bias for longer outputs. Details in Related work.

For AlpacaEval 2.0 we use weighted_alpaca_eval_gpt4_turbo, which uses logprobs to compute continuous preference and uses GPT4_turbo as model (see configs).
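
One plausible way to turn logprobs into a continuous preference is to normalize the probabilities of the two answer tokens. The sketch below assumes this reading; the function continuous_preference and the [1, 2] output range (1 = reference preferred, 2 = model preferred) are assumptions for illustration, not the confirmed implementation.

```python
import math


def continuous_preference(logprob_reference, logprob_model):
    """Map the log-probabilities of the two answer tokens to a value in [1, 2].

    Assumption: the preference is 1 + P(model preferred), where the probability
    is renormalized over the two candidate answers only.
    """
    # Subtract the max before exponentiating for numerical stability.
    shift = max(logprob_reference, logprob_model)
    p_reference = math.exp(logprob_reference - shift)
    p_model = math.exp(logprob_model - shift)
    return 1.0 + p_model / (p_reference + p_model)


print(continuous_preference(math.log(0.25), math.log(0.75)))
```

Under this assumption, averaging (preference - 1) over all instructions would give a weighted win rate.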

Evaluators

We evaluate different automatic annotators on the AlpacaEval set by comparing to 2.5K human annotations we collected (~650 instructions each with 4 human annotations). Below we show metrics for our suggested evaluators (weighted_alpaca_eval_gpt4_turbo, alpaca_eval_gpt4), for prior automatic evaluators (alpaca_farm_greedy_gpt4, aviary_gpt4, lmsys_gpt4), for humans (humans), and for different base models with essentially the same prompt (gpt4, claude, text_davinci_003, chatgpt_fn, guanaco_33b, chatgpt). See here for the configs of all evaluators that are available out of the box and their associated metrics.

	Human agreement	Price [$/1000 examples]	Time [seconds/1000 examples]	Spearman corr.	Pearson corr.	Bias	Variance	Proba. prefer longer
alpaca_eval_gpt4	69.2	13.6	1455	0.97	0.93	28.4	14.6	0.68
alpaca_eval_cot_gpt4_turbo_fn	68.6	6.3	1989	0.97	0.90	29.3	18.4	0.67
alpaca_eval_gpt4_turbo_fn	68.1	5.5	864	0.93	0.82	30.2	15.6	0.65
alpaca_eval_llama3_70b_fn	67.5	0.4	209	0.90	0.86	32.3	8.2	0.79
gpt4	66.9	12.5	1037	0.88	0.87	31.5	14.6	0.65
alpaca_farm_greedy_gpt4	66.4	15.3	878	0.85	0.75	30.2	19.3	0.60
alpaca_eval_cot_gpt4_turbo_fn	65.7	4.3	228	0.78	0.77	33.9	23.7	0.61
humans	65.7	300.0	36800	1.00	1.00	0.0	34.3	0.64
claude	65.3	3.3	173	0.93	0.90	32.4	18.5	0.66
lmsys_gpt4	65.3	13.9	17982	0.98	0.97	31.6	15.9	0.74
text_davinci_003	64.1	8.7	121	0.85	0.83	33.8	22.7	0.70
longest	62.2	0.0	0	0.27	0.56	37.8	0.0	1.00
chatgpt	57.3	0.8	285	0.72	0.71	39.4	34.1	0.59

How exactly are those metrics computed?

We now explain in words how we compute the metrics in the table above. The code is here.

Human agreement: this measures the agreement between the current annotator and the majority preferences of humans on our ~650 annotations from our cross-annotation set, which contains 4 human annotations per example. To estimate the agreement between a single human (humans row in the table above) and the majority of humans, we take one of the 4 annotations and compute the accuracy that it has when predicting the mode of the other 3 annotations. We then average this accuracy over all 4 annotations and over the 650 instructions to get the human agreement, i.e., we compute the expected (over humans and samples) leave-one-out agreement. If the mode is not unique, we take one of the modes at random. We perform exactly the same computation for the automatic annotators, so that the final numbers are comparable.
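
The leave-one-out agreement for the humans row can be sketched as follows. This is a simplified illustration on made-up labels; the function names and the seeded tie-breaking are assumptions, and the linked code remains the reference.

```python
import random
from collections import Counter


def mode_with_random_ties(labels, rng):
    """Most common label; if several labels tie, pick one of them at random."""
    counts = Counter(labels)
    top = max(counts.values())
    modes = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(modes)


def leave_one_out_agreement(annotations, seed=0):
    """annotations: one list of labels per example (e.g. 4 human labels each).

    Each label is held out in turn and scored against the mode of the others.
    """
    rng = random.Random(seed)
    hits, total = 0, 0
    for labels in annotations:
        for i, held_out in enumerate(labels):
            others = labels[:i] + labels[i + 1:]
            hits += held_out == mode_with_random_ties(others, rng)
            total += 1
    return hits / total


# Made-up cross-annotations: 1 = first output preferred, 2 = second output preferred.
print(leave_one_out_agreement([[1, 1, 1, 2]]))
```

In the single example shown, three of the four held-out labels match the majority of the remaining three, giving an agreement of 0.75.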

Price [$/1000 examples]: this is the average price of every 1000 annotations. For humans, it is the price that we paid Mechanical Turkers to collect those annotations ($21/hour). If the price depends on the machine used to compute the annotations (e.g. Guanaco) we leave it empty.

Time [seconds/1000 examples]: this is the average time it takes to compute 1000 annotations. For humans, it is the estimated median time that each Mechanical Turker took to annotate 1000 examples. For automatic annotators, it is the average time that it took us when running the annotations. Note that this can depend on API limits that are different for different users and the number of requests that the clusters are processing.

Spearman corr.: this measures the Spearman correlation between a leaderboard computed with the auto-annotator's preference and the leaderboard computed with human preferences. As with Human agreement, we use the human annotations from AlpacaFarm but we now consider the method-level agreement rather than only the sample-wise agreement with humans. Note that we only have 9 models and so the correlation is not very reliable.

Pearson corr.: same as with Spearman corr. but with Pearson correlation.
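
To make the two correlation metrics concrete, here is a dependency-free sketch that compares two leaderboards given as lists of win rates for the same models. The win rates are invented for illustration, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up win rates of three models under an auto-annotator and under humans.
auto_win_rates = [95.3, 88.4, 86.1]
human_win_rates = [90.0, 80.0, 85.0]
print(spearman(auto_win_rates, human_win_rates), pearson(auto_win_rates, human_win_rates))
```

With so few models a single swapped pair moves the Spearman value a lot, which is why the text cautions that a correlation over 9 models is not very reliable.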

Bias: agreement between the most likely human label and the most likely automatic one. For automatic annotators we estimate it by sampling 4 different annotations for each example. The randomness here comes from the order of the outputs in the prompt, sampling from the LLM, and if applicable the order of the instruction in the batch and the choice of annotator in the pool. We then take the mode of the 4 annotations and compute the accuracy of the mode when predicting the mode of the 4 human annotations. Note that this is likely an overestimate of the real bias that we would get if we had an "infinite" number of cross-annotations. A low bias means that the annotator has in expectation the same preferences as humans. For the case of humans, the bias is zero by definition. Note that this is related to but not the standard statistical bias, because we take the mode instead of average over annotations and we consider 0-1 loss instead of squared loss.

Variance: expected agreement between a single automatic preference and the most likely one. We estimate it the same way as we estimated "human agreement" for humans, i.e., we take the expected leave-one-out error when predicting the mode of the 3 annotations using the 4th annotation. A low variance means that the annotator is consistent with its preference, i.e., if you sample from it with different seeds it will give the same result. As with the bias, this is not exactly the standard statistical variance, because we take the mode instead of average over annotations and we consider 0-1 loss instead of squared loss.

Note that the "human agreement" is tightly related to the bias and variance. In particular, the variance measures the error due to the fact that we only use a single annotation while the bias aims to measure the irreducible error for the current annotator.
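
The sketch below gives one reading of these two metrics on made-up labels. It assumes both are reported as 0-1 error rates (lower is better), which matches the humans row having a bias of 0.0 and a variance equal to 100 minus the human agreement. The function names are hypothetical, and ties are broken deterministically here rather than at random.

```python
from collections import Counter


def mode(labels):
    """Most common label; ties go to the smallest label to keep the sketch deterministic."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def toy_bias(auto_annotations, human_annotations):
    """Share of examples where the mode of the automatic labels differs from the human mode."""
    errors = [mode(a) != mode(h) for a, h in zip(auto_annotations, human_annotations)]
    return sum(errors) / len(errors)


def toy_variance(auto_annotations):
    """Leave-one-out error: a single label scored against the mode of the other labels."""
    errors = []
    for labels in auto_annotations:
        for i, held_out in enumerate(labels):
            errors.append(held_out != mode(labels[:i] + labels[i + 1:]))
    return sum(errors) / len(errors)


# Made-up data: 4 automatic and 4 human annotations for each of 2 examples.
auto = [[2, 2, 2, 2], [1, 1, 1, 2]]
human = [[2, 2, 2, 1], [2, 2, 2, 1]]
print(toy_bias(auto, human), toy_variance(auto))
```

Here the annotator disagrees with the human majority on one of two examples (bias 0.5) and is inconsistent with itself on one of eight held-out labels (variance 0.125).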

Proba. prefer longer: this is the probability that the annotator prefers the longer output when one of the two outputs is significantly longer than the other (more than 30 characters difference).
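
A minimal sketch of this metric, assuming each record holds the two outputs and a preference of 1 or 2 for the first or second output. The function name and record layout are hypothetical.

```python
def proba_prefer_longer(records, min_char_diff=30):
    """Fraction of annotations that prefer the longer output.

    records: iterable of (output_1, output_2, preference), preference in {1, 2}.
    Pairs whose lengths differ by at most min_char_diff characters are skipped.
    """
    n_prefer_longer, n_considered = 0, 0
    for output_1, output_2, preference in records:
        diff = len(output_1) - len(output_2)
        if abs(diff) <= min_char_diff:
            continue
        longer = 1 if diff > 0 else 2
        n_considered += 1
        n_prefer_longer += preference == longer
    return n_prefer_longer / n_considered if n_considered else float("nan")


# Made-up annotations: the third pair is ignored (only 10 characters apart).
records = [
    ("a" * 100, "b" * 10, 1),
    ("a" * 10, "b" * 100, 1),
    ("a" * 20, "b" * 10, 2),
]
print(proba_prefer_longer(records))
```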

In the full table we also provide the following metrics:

Proba. prefer lists: this is the probability that the annotator prefers the output that contains a list/bullet points when one output does but not the other.

Proba. prefer 1: this is the probability that the annotator prefers the first of the pair of outputs. All our proposed annotators randomize over outputs in the prompt, so this should be 0.5. Prior annotators, such as lmsys and aviary, do not.

# parsed: this is the number of examples that the annotator was able to parse.

Note that if the variance and bias are empty, it means that we only performed a single annotation for each of the 648 examples due to resource (time and price) constraints. This explains why the #parsed is 648, otherwise it should be 2592.

Tips for choosing evaluators

Overall we recommend using annotators_config=weighted_alpaca_eval_gpt4_turbo if you want high agreement with humans, and annotators_config=chatgpt_fn if you are on a tight budget.

When choosing an annotator we recommend considering the following (the first three are obvious):

"Human agreement [%]"
"Price [$/1000 examples]"
"Time [seconds/1000 examples]"
"* corr." approx. > 0.7. It is important that the correlation is not too low, but we do not recommend using it as the main metric as the correlation is computed on only 9 models.
"Proba. prefer longer" approx. < 0.7. Indeed, we found that the majority preference of human annotators has a strong bias for longer answers (as shown by the high performance=62.2 of the "longest" evaluator that always prefers the longest output). This suggests that it might be more of a bias with the human annotators. In order to avoid having leaderboards with strong biases for length, we suggest using automatic annotators with less than 0.7 "Proba. prefer longer".
"Variance" approx. < 0.2. We believe that a good evaluator should have as little variance as possible so that results are mostly reproducible. Note that variance can be desirable in the case where we are simulating humans as shown in AlpacaFarm.

We filtered the annotators that do not satisfy those requirements in the table above (besides humans / ChatGPT / 003 / lmsys for reference purposes). For all results see here. In general, we found weighted_alpaca_eval_gpt4_turbo to be a good trade-off between quality / price / time / variance / length bias.
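
The filtering described above can be illustrated on a few rows copied from the evaluators table. This is only a sketch: the thresholds are stated as approximate, the helper passes_tips is hypothetical, and the variance threshold of 0.2 is assumed to correspond to 20 on the percentage scale used in the table.

```python
# A few rows copied from the evaluators table:
# (name, Spearman corr., Pearson corr., Proba. prefer longer, Variance in %)
ROWS = [
    ("alpaca_eval_gpt4", 0.97, 0.93, 0.68, 14.6),
    ("alpaca_eval_llama3_70b_fn", 0.90, 0.86, 0.79, 8.2),
    ("claude", 0.93, 0.90, 0.66, 18.5),
    ("lmsys_gpt4", 0.98, 0.97, 0.74, 15.9),
    ("longest", 0.27, 0.56, 1.00, 0.0),
    ("chatgpt", 0.72, 0.71, 0.59, 34.1),
]


def passes_tips(spearman, pearson, prefer_longer, variance_pct):
    """Apply the approximate thresholds from the tips (variance assumed to be in %)."""
    return (
        spearman > 0.7
        and pearson > 0.7
        and prefer_longer < 0.7
        and variance_pct < 20.0
    )


kept = [name for name, *metrics in ROWS if passes_tips(*metrics)]
print(kept)
```

Since the thresholds are approximate, a strict cut like this one is best treated as a first pass before weighing human agreement, price, and time.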

The above metrics are computed with respect to annotations from crowd-workers. Although useful, those annotations are not perfect, e.g., crowd-workers often favor style over factuality. We thus recommend users to validate automatic evaluators on their own instructions and human annotations. Details in limitations.

Use-cases

Evaluating a model

>>> alpaca_eval evaluate -- --help

NAME
    alpaca_eval evaluate - Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

SYNOPSIS
    alpaca_eval evaluate <flags>

DESCRIPTION
    Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

FLAGS
    --model_outputs=MODEL_OUTPUTS
        Type: Optional[Union]
        Default: None
        The outputs of the model to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv) or a function to generate those. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. If None, we just print the leaderboard.
    -r, --reference_outputs=REFERENCE_OUTPUTS
        Type: Union
        Default: <func...
        The outputs of the reference model. Same format as `model_outputs`. If None, the reference outputs are a specific set of Davinci 003 outputs on the AlpacaEval set:
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file. For details see the docstring of `PairwiseAnnotator`.
    -n, --name=NAME
        Type: Optional[Optional]
        Default: None
        The name of the model to add to the leaderboard. If None we check if `generator is in model_outputs` if not we use "Current model".
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to the directory where the new leaderboard and the annotations should be stored. If None we don't save. If `auto` we use `model_outputs` if it is a path, and otherwise use the directory from which we call the script.
    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
        Type: Union
        Default: 'auto'
        The precomputed leaderboard or a path to it (json, csv, or tsv). The leaderboard should contain at least the column `win_rate`. If `auto` we will try to use the corresponding leaderboard for the reference outputs (only if in CORRESPONDING_OUTPUTS_LEADERBOARDS). If `None` we won't add other models from the leaderboard.
    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
        Type: bool
        Default: False
        Whether to overwrite the leaderboard if the model is already in it.
    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
        Type: Optional
        Default: 'minimal'
        The mode of the leaderboard to use. Only used if the precomputed leaderboard has a column `mode`, in which case it will filter the leaderboard by this mode. If None keeps all.
    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
        Type: str
        Default: 'community'
        The mode of the leaderboard for the current method.
    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the metrics instead of printing the results.
    -f, --fn_metric=FN_METRIC
        Type: Union
        Default: 'pairwise_to_winrate'
        The function or function name in `metrics.py` that will be used to convert preference to metrics. The function should take a sequence of preferences (0 for draw, 1 for base win, 2 when the model to compare wins) and return a dictionary of metrics and the key by which to sort the leaderboard.
    -s, --sort_by=SORT_BY
        Type: str
        Default: 'win_rate'
        The key by which to sort the leaderboard.
    --is_cache_leaderboard=IS_CACHE_LEADERBOARD
        Type: Optional[Optional]
        Default: None
        Whether to save the result leaderboard to `precomputed_leaderboard`. If None we save only if max_instances not None. A preferred way of adding models to the leaderboard is to set `precomputed_leaderboard` to the previously saved leaderboard at `<output_path>/leaderboard.csv`.
    --max_instances=MAX_INSTANCES
        Type: Optional[Optional]
        Default: None
        The maximum number of instances to annotate. Useful for testing.
    --annotation_kwargs=ANNOTATION_KWARGS
        Type: Optional[Optional]
        Default: None
        Additional arguments to pass to `PairwiseAnnotator.annotate_head2head`.
    -A, --Annotator=ANNOTATOR
        Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
        The annotator class to use.
    Additional flags are accepted.
        Additional arguments to pass to `PairwiseAnnotator`.

>>> alpaca_eval evaluate_from_model -- --help

NAME
    alpaca_eval evaluate_from_model - Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

SYNOPSIS
    alpaca_eval evaluate_from_model MODEL_CONFIGS <flags>

DESCRIPTION
    Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

POSITIONAL ARGUMENTS
    MODEL_CONFIGS
        Type: Union
        A dictionary or path (relative to `models_configs`) to a yaml file containing the configuration of the model to decode from. If a directory, we search for 'configs.yaml' in it. The keys in the first dictionary should be the generator's name, and the value should be a dictionary of the generator's configuration which should have the

FLAGS
    -r, --reference_model_configs=REFERENCE_MODEL_CONFIGS
        Type: Optional[Union]
        Default: None
        Same as in `model_configs` but for the reference model. If None, we use the default Davinci003 outputs.
    -e, --evaluation_dataset=EVALUATION_DATASET
        Type: Union
        Default: <func...
        Path to the evaluation dataset or a function that returns a dataframe. If None, we use the default evaluation
    -a, --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        Path to the annotators configuration or a dictionary. If None, we use the default annotators configuration.
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to save the generations, annotations and leaderboard. If auto saves at `results/<model_name>`
    -m, --max_instances=MAX_INSTANCES
        Type: Optional[int]
        Default: None
        Maximum number of instances to generate and evaluate. If None, we evaluate all instances.
    --is_strip_output=IS_STRIP_OUTPUT
        Type: bool
        Default: True
        Whether to strip trailing and leading whitespaces from the outputs.
    --is_load_outputs=IS_LOAD_OUTPUTS
        Type: bool
        Default: True
        Whether to try to load outputs from the output path. If True and outputs exist we only generate outputs for instructions that don't have outputs yet.
    -c, --chunksize=CHUNKSIZE
        Type: int
        Default: 64
        Number of instances to generate before saving. If None, we save after all generations.
    Additional flags are accepted.
        Other kwargs to `evaluate`

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

To evaluate a model you need to:

1. Choose an evaluation set and compute outputs specified as model_outputs. By default, we use the 805 examples from AlpacaEval. To compute outputs on AlpacaEval use:

import datasets

eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]
for example in eval_set:
    # generate here is a placeholder for your models generations
    example["output"] = generate(example["instruction"])
    example["generator"] = "my_model"  # name of your model

If your model is a HuggingFace model or from a standard API provider (OpenAI, Anthropic, Cohere), then you can directly use alpaca_eval evaluate_from_model to also take care of generating outputs.

2. Compute the reference outputs reference_outputs. By default, we use precomputed outputs of gpt4_turbo on AlpacaEval. If you want to use a different model or a different dataset follow the same steps as (1.).
3. Choose an evaluator specified via annotators_config. We recommend using alpaca_eval_gpt4_turbo_fn. For other options and comparisons see this table. Depending on the evaluator you might need to set the appropriate API_KEY in your environment or in the client_configs.

Running all together:

alpaca_eval --model_outputs 'example/outputs.json' \
  --annotators_config 'alpaca_eval_gpt4_turbo_fn'

If you don't have decoded outputs, you can use evaluate_from_model which takes care of decoding (model and reference) for you. Here's an example:

# need a GPU for local models
alpaca_eval evaluate_from_model \
  --model_configs 'oasst_pythia_12b' \
  --annotators_config 'alpaca_eval_gpt4_turbo_fn'

Here the model_configs and reference_model_configs (optional) are paths to a directory that specifies the prompt, the model provider (here HuggingFace) and decoding parameters. See this directory for examples. For all model providers that are available out-of-the-box see here.

Information about annotators

Caching: by default all annotations are cached on disk at caching_path. Annotations are thus never recomputed, which makes annotations faster, cheaper and allows for reproducibility. This helps even when evaluating different models as many models have the same outputs.
Output randomization: by default, we randomize over the examples of outputs, as we found that annotators tend to prefer the first examples they see.
Batching: we provide code and examples to batch annotations, which decreases cost and time for annotations if the prompt is long. See for example alpaca_farm_greedy_gpt4.
Pool of annotators: we provide code and examples to evaluate using a pool of automatic annotators, which is helpful for replicating the variance of human annotations. See for example alpaca_farm.
Seeding based on instructions: for reproducibility and more fair comparison between models, we seed all randomness (output order, order in batches, examples for each annotator in a pool) based on the instruction.

Making a new leaderboard

>>> alpaca_eval make_leaderboard -- --help

NAME
    alpaca_eval make_leaderboard - Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

SYNOPSIS
    alpaca_eval make_leaderboard <flags>

DESCRIPTION
    Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

FLAGS
    --leaderboard_path=LEADERBOARD_PATH
        Type: Optional[Union]
        Default: None
        The path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists it will
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file.
    --all_model_outputs=ALL_MODEL_OUTPUTS
        Type: Union
        Default: <fu...
        The outputs of all models to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv potentially with globbing) or a function to generate those. If the path contains a globbing pattern, we will read all files matching the pattern and concatenate them. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. It should also contain a column `generator` with the name of the current model.
    -r, --reference_outputs=REFERENCE_OUTPUTS
        Type: Union
        Default: <func...
        The outputs of the reference model. Same format as `all_model_outputs` but without needing `generator`. By default, the reference outputs are the 003 outputs on AlpacaEval set.
    -f, --fn_add_to_leaderboard=FN_ADD_TO_LEADERBOARD
        Type: Callable
        Default: 'evaluate'
        The function to use to add a model to the leaderboard. If a string, it should be the name of a function in `main.py`. The function should take the arguments: `model_outputs`, `annotators_config`, `name`, `precomputed_leaderboard`, `is_return_instead_of_print`, `reference_outputs`.
    --leaderboard_mode=LEADERBOARD_MODE
        Type: str
        Default: 'verified'
        The mode of the leaderboard to save all new entries with.
    -i, --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the metrics instead of printing the results.
    Additional flags are accepted.
        Additional arguments to pass to `fn_add_to_leaderboard`.

If you want to make a new leaderboard using a single command (rather than multiple alpaca_eval calls), for your desired evaluation set and evaluators, you can use the following:

alpaca_eval make_leaderboard \
  --leaderboard_path <path_to_save_leaderboard> \
  --all_model_outputs <model_outputs_path> \
  --reference_outputs <reference_outputs_path> \
  --annotators_config <path_to_config.yaml>

where:

leaderboard_path: path to save the leaderboard to. The leaderboard will be saved as a csv file; if it already exists it will append.
all_model_outputs: The json path to the outputs of all models to add to the leaderboard (as a single file or by globbing multiple files). Each dictionary should contain the keys (instruction and output) that are formatted in the prompts and a column generator with the name of the current model. As an example see this file.
reference_outputs: the path to the outputs of the reference model. Each dictionary should contain the keys (instruction and output) that are formatted in the prompts. By default, the reference outputs are the 003 outputs on AlpacaEval set.
annotators_config: The path to the annotator's config file. Defaults to alpaca_eval_gpt4.

Making a new evaluator

>>> alpaca_eval analyze_evaluators -- --help

NAME
    alpaca_eval analyze_evaluators - Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

SYNOPSIS
    alpaca_eval analyze_evaluators <flags>

DESCRIPTION
    Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

FLAGS
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file.
    -A, --Annotator=ANNOTATOR
        Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
        The annotator class to use.
    --analyzer_kwargs=ANALYZER_KWARGS
        Type: Optional[Optional]
        Default: None
        Additional arguments to pass to the analyzer.
    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
        Type: Union
        Default: PosixPath('/Users/yanndubois/Desktop/GitHub/alpaca_eval/src/...
        The precomputed (meta)leaderboard of annotators or a path to it (json, csv, or tsv).
    --is_save_leaderboard=IS_SAVE_LEADERBOARD
        Type: bool
        Default: False
        Whether to save the leaderboard (ie analyzed results).
    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the leaderboard (ie analyzed results). If True, it will not print the results.
    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
        Type: bool
        Default: False
        Whether to overwrite the leaderboard if it already exists.
    -m, --max_instances=MAX_INSTANCES
        Type: Optional[Optional]
        Default: None
        The maximum number of instances to analyze.
    --is_single_annotator=IS_SINGLE_ANNOTATOR
        Type: bool
        Default: False
        Whether to analyze a single annotator. If True, will not be able to estimate the annotator's bias.
    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
        Type: str
        Default: 'minimal'
        The mode of the leaderboard to print.
    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
        Type: str
        Default: 'minimal'
        The mode of the leaderboard to save all new entries with.
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to save the leaderboard and annotataions. If None, we don't save.
    Additional flags are accepted.
        Additional arguments to pass to `Annotator`.

AlpacaEval provides a simple way of making new evaluators. All you need is to make a new configs.yaml configuration file, which you will then pass as --annotators_config <path_to_config.yaml> to alpaca_eval. Here are some ways you can make a new evaluator:

Changing the prompt: Write a new prompt in a text file and specify the path in prompt_template of the configuration file. Paths are relative to the configuration file.
Changing decoding parameters: Specify the desired parameters in completions_kwargs in the configuration file. To see all available parameters refer to the docstrings of the corresponding function in this file specified by fn_completions in the configuration file.
Changing the model: Specify the desired model in model_name and the corresponding prompt in prompt_template. If the model comes from another provider you will have to change fn_completions which maps to the corresponding function in this file. We provide fn_completions functions to use models from OpenAI, Anthropic, Cohere, or HuggingFace. To install packages needed for all providers use pip install alpaca_eval[all].

Other parameters in the configuration file

The easiest is to check the docstrings of SinglePairwiseAnnotator. Here are some important ones:

Parameters
----------
prompt_template : path
    A prompt that will be given to `fn_prompter` or path to the prompts. Path is relative to
    `evaluators_configs/`

fn_completion_parser : callable or str
    Function in `completion_parsers.py` to use for parsing the completions into preferences. For each completion,
    the number of preferences should be equal to the batch_size if not we set all the preferences in that batch to
    NaN.

completion_parser_kwargs : dict
    Kwargs for fn_completion_parser.

fn_completions : callable or str
    Function in `decoders.py` to use for decoding the output.

completions_kwargs : dict
    kwargs for fn_completions. E.g. model_name, max_tokens, temperature, top_p, top_k, stop_seq.

is_randomize_output_order : bool
    Whether to randomize output_1, output_2 when formatting.

batch_size : int
    Number of examples that will be added in a single prompt.
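As a rough illustration of how these parameters fit together, the sketch below builds an evaluator configuration as a Python dictionary and prints it as JSON (which most YAML parsers also accept). The evaluator name my_evaluator, the prompt path, the parser placeholder, and all values are hypothetical, and the top-level layout (one entry per evaluator name) is an assumption; check an existing config in the repository for the exact structure.

```python
import json


def make_evaluator_config(name, prompt_path, model_name):
    """Build a hypothetical evaluator config from the parameters documented above."""
    return {
        name: {  # assumption: one top-level entry per evaluator
            "prompt_template": prompt_path,
            # huggingface_local_completions is one of the provided completion functions
            "fn_completions": "huggingface_local_completions",
            "completions_kwargs": {
                "model_name": model_name,
                "max_tokens": 100,
                "temperature": 0,
            },
            # placeholder: replace with a function from completion_parsers.py
            "fn_completion_parser": "<parser_name>",
            "completion_parser_kwargs": {},
            "is_randomize_output_order": True,
            "batch_size": 1,
        }
    }


config = make_evaluator_config("my_evaluator", "my_evaluator/prompt.txt", "<model_name>")
print(json.dumps(config, indent=2))
```

The printed text can be saved as the configs.yaml file and adapted by hand.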

Once you have made the evaluator you can also analyze it and add it to the evaluator's leaderboard using the following command:

alpaca_eval analyze_evaluators --annotators_config '<path_to_config.yaml>'

To estimate the bias and variance this evaluates every example with 4 seeds, i.e., 2.5K evaluations. If you want a cheaper evaluation you can use a single seed using --is_single_annotator True which will skip the estimation of bias and variance.

Contributing

We are accepting PRs for new models, evaluators, and eval sets, in addition to bug fixes. We will update the leaderboard website regularly with new community contributions. We have also created a support discord for AlpacaEval in case you run into any issues and wish to ask help from the community.

To get started, please first fork the repo, and install the package from source: pip install -e .

Contributing a model

First, you'll need to add a model config definition in the models_configs folder. As an example, you can look at the falcon-7b-instruct yaml. Please make sure the folder name and key name in the yaml match exactly.

Then, please follow the steps in Evaluating a model to run inference on the model to produce outputs on the eval set and score the model according to one of the evaluators. An example command may look like:

alpaca_eval evaluate_from_model \
  --model_configs 'falcon-7b-instruct'

After running this command, you should have generated an outputs json and a new entry in the corresponding leaderboard file. Please make a PR with the config, outputs file, and updated leaderboard.

Concretely you should do something like:

Fork the repository in github
Clone the forked repository: git clone <URL>
Make a model config at src/alpaca_eval/models_configs/<model_name> and evaluate it with evaluate_from_model --model_configs '<model_name>'
Add the model configs, output, and leaderboard entry to the forked repository

git add src/alpaca_eval/models_configs/<model_name> # add the model config
git add src/alpaca_eval/leaderboards/ # add the actual leaderboard entry
git add src/alpaca_eval/metrics/weights # add the weights for LC
git add -f results/<model_name>/model_outputs.json # force add the outputs on the dataset
git add -f results/<model_name>/*/annotations.json # force add the evaluations from the annotators
git commit -m "Add <model_name> to AlpacaEval"
git push

Create a pull request on AlpacaEval

Note: if you are generating outputs outside of AlpacaEval you should still add a model config but with fn_completions: null. See this config for an example.

Getting your model verified

A verified result in AlpacaEval indicates that a core maintainer has decoded the outputs from the model and performed the evaluation. Unfortunately, we, the AlpacaEval maintainers, lack the resources to verify all the models and so we will only do that for models that are in the top-5 of the leaderboard. We apologize for any inconvenience this may cause and appreciate your understanding. To have your model verified, please follow the steps below:

Contact @yann on Discord, or email us if you have our email, providing a brief rationale for why your model should be verified.
Await our response and approval before proceeding.
Prepare a script to decode from your model that does not require a GPU, typically the same script used for your model contribution. It should run using alpaca_eval evaluate_from_model --model_configs '<your_model_name>' without requiring a local GPU.
Generate temporary API keys for running the script and share them with us. Specifically, we need the keys for both decoding your model and for evaluation (e.g., OpenAI or Anthropic key).
We will execute alpaca_eval evaluate_from_model --model_configs '<your_model_name>', update the results, and inform you so that you can revoke the temporary keys.

Note that we will not re-evaluate the same model. Due to sampling variance, the results might slightly differ from your initial ones. We will replace your previous community results with the verified ones.

Contributing an evaluator

Please first follow the directions in Making a new evaluator. Once you've created the annotator config, we ask that you create a new leaderboard for the annotator by evaluating the minimal set of models. The outputs for these models can be found by downloading alpaca_eval_all_outputs.json.

alpaca_eval make_leaderboard \
  --leaderboard_path src/alpaca_eval/leaderboards/data_AlpacaEval/<evaluator>_leaderboard.csv \
  --all_model_outputs alpaca_eval_all_outputs.json \
  --annotators_config <evaluator_config>

Then, please create a PR with the annotator config and leaderboard csv.

Contributing an eval set

To contribute a new eval set, you'll first need to specify a set of textual instructions. Then, you'll need to specify a set of reference outputs (model win-rates are computed against this reference). For ease of use, you may use the default text-davinci-003 reference config.

Place these together into a json, where each entry specifies the fields instruction, output, and generator. You can look to alpaca_eval.json as a guide (the dataset field is not necessary).
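As a minimal sketch of that layout, the snippet below builds a two-entry eval set with the three fields named above and checks that every entry has them before serializing. The instructions and outputs are made-up placeholders, and validate_eval_set is a hypothetical helper, not part of alpaca_eval.

```python
import json

REQUIRED_FIELDS = ("instruction", "output", "generator")


def validate_eval_set(entries):
    """Return the indices of entries that lack a required field or hold a non-string value."""
    bad = []
    for i, entry in enumerate(entries):
        if not all(isinstance(entry.get(field), str) for field in REQUIRED_FIELDS):
            bad.append(i)
    return bad


# Made-up reference entries, for illustration only.
eval_set = [
    {
        "instruction": "Name three primary colors.",
        "output": "Red, blue, and yellow.",
        "generator": "text-davinci-003",
    },
    {
        "instruction": "Translate 'hello' to French.",
        "output": "Bonjour.",
        "generator": "text-davinci-003",
    },
]

invalid = validate_eval_set(eval_set)
eval_set_json = json.dumps(eval_set, indent=2)  # text to save as the eval set json
```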

Finally, we ask that you create a minimal leaderboard on this new evaluation set. You can do this with the following:

alpaca_eval make_leaderboard \
  --leaderboard_path <src/alpaca_eval/leaderboards/data_AlpacaEval/your_leaderboard_name.csv> \
  --all_model_outputs alpaca_eval_all_outputs.json \
  --reference_outputs <path_to_json_file>

Please submit a PR with the eval set json and corresponding leaderboard csv.

Contributing a completion function

Currently, we allow different completion functions, e.g., openai, anthropic, huggingface_local, huggingface_hub_api ... If you want to contribute a new completion function / API with which to perform inference then follow these steps:

add a .py file with a function <name>_completions(prompts : Sequence[str], model_name :str, ... ) in the decoder folder. This function should take as argument the prompts + kwargs and return the completions. Please look at other completion functions in the directory for templates, e.g. huggingface_local_completions or anthropic.
add <name>_completions and dependencies in __init__. Again you can follow the example of huggingface_local_completions
update optional dependencies in setup.py
add a model you want to evaluate in the models configs
evaluate your model using alpaca_eval evaluate_from_model --model_configs '<model_configs>'
(optional) push the results from the previous model on AlpacaEval leaderboard following those steps
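To make the first step more concrete, here is a toy sketch of a completion function with the signature described above. The name echo_completions, the max_chars parameter, and the returned dictionary with a "completions" key are all assumptions for illustration; compare with the existing functions in the decoder folder for the actual return format before contributing.

```python
from typing import Sequence


def echo_completions(
    prompts: Sequence[str], model_name: str, max_chars: int = 200, **kwargs
) -> dict:
    """Toy completion function: 'completes' each prompt by echoing it back.

    A real implementation would call the provider's API here, once per prompt
    or in batches, and pass decoding kwargs through.
    """
    completions = [f"[{model_name}] {prompt}"[:max_chars] for prompt in prompts]
    # Assumption: completions are returned in a dict, in the same order as the prompts.
    return {"completions": completions}


out = echo_completions(["Say hi.", "Count to three."], model_name="toy")
```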

Feel free to start a PR early, we'll be able to provide some help in the process!

Limitations

The AlpacaEval evaluation pipeline, like other current evaluators, has important limitations and should therefore not be used as a replacement for human evaluation in important settings, such as to decide whether a model is ready to be deployed. Those can broadly be clustered into 3 categories:

Instructions might not be representative of real-usage: the AlpacaEval set contains examples from a variety of datasets (self-instruct, open-assistant, vicuna, koala, hh-rlhf) which might not be representative of real-usage and advanced applications of better models like GPT4. This likely makes the best closed models (GPT4 / Claude / ChatGPT / ...) seem more similar to the open models than they are. Indeed, those closed models seem to be pretrained/finetuned on much more diverse data. See for example this blog for preliminary results on more complex instructions. Note, however, that in AlpacaFarm we showed that win-rates on our evaluation set are highly correlated (0.97 R2) with win-rates on instructions from user interactions with the Alpaca Demo. Furthermore, the AlpacaEval leaderboard shows a larger gap between the open models and OpenAI models than other leaderboards (e.g. lmsys).
Biases of automatic annotators: the raw automatic annotators seem to have implicit biases. In particular, we found that they tend to prefer longer outputs and outputs that contain lists (e.g. 0.68 / 0.69 for alpaca_eval_gpt4 and 0.62 / 0.58 for claude). Although we found that humans have similar biases (0.64 / 0.61), we believe that this could be more of a limitation of the human annotation pipeline we used rather than a true human bias. More generally, through qualitative analysis, we found that automatic annotators give more importance to the style of the output than its content (e.g. factuality). Finally, we found that automatic evaluators tend to prefer outputs from models that are similar (likely trained on the same data) as suggested by the big difference between ChatGPT/GPT4 on claude's and alpaca_eval_gpt4's leaderboard. Note that the length bias is partially mitigated in our length-controlled win-rates.
Lack of safety evaluation: importantly, AlpacaEval only evaluates the instruction-following capabilities of models rather than the harm that they could cause (e.g. toxic behavior or bias). As a result the small gap between current ChatGPT and the best open source models should not be interpreted as meaning that the latter are ready to be deployed.

Beyond those limitations about the evaluation pipelines, there are also limitations about our validation of the evaluators and our proposed approach to selecting evaluation sets.

Limitations about our validation pipeline

First, our validation of evaluators based on human cross-annotations suffers from the following limitations: (1) we qualitatively found that our crowd-workers tend to also favor style such as length and presence of lists over factuality; (2) this does not validate whether win-rates against a reference model is a good evaluation strategy in the first place; (3) preferences from 16 crowd-workers are not representative of preferences of all humans.

Second, our suggested approach to selecting evaluation sets based on statistical power suffers from the following limitations: (1) statistical power does not ensure the right direction, e.g. you can have an unnatural set of instructions where Alpaca "performs" better than a better model; and (2) this can push users to select data to support the hypothesis that they want to validate.

Additional analysis and plots

Length-controlled AlpacaEval (LCAE)

Length-controlled AlpacaEval Visualizations:

Length-controlled AlpacaEval Development:

The notebook shows different options that we considered for mitigating the length bias of automatic annotators.

Here we briefly summarize the main results. Namely:

LCAE increases the correlation with Chat Arena to 0.98 from 0.94 for AlpacaEval 2.0. This makes LCAE the most highly correlated benchmark with Chat Arena as seen in the plot below.

LCAE decreases length gameability: one of the major issues of AlpacaEval is that you can increase your win-rate by increasing the length of your outputs. For example, in AlpacaEval 2.0 the win-rate for the baseline (50%) increases to 64% when prompted to “give as much detail as possible” and decreases to 23% when prompted to “be as concise as possible while still providing all the necessary information to answer the question”. More generally the relative length gameability was ~21% for AlpacaEval and decreases to ~6% for LCAE, so it's 3x less gameable through prompt length. This is shown in the plot below.

We can predict performance for different baselines: one other benefit of using a GLM for controlling for length bias is that we now have a model that can predict the win-rate of a model for different baselines. In particular, our GLM has many nice properties, for example win_rate(m,b) = 1 - win_rate(b,m) \in [0,1] and win_rate(m,m) = 0.5. This is shown in the plot below.

Finally, note that we are only controlling for length bias. There are other known biases that we are not controlling for, such as the fact that auto-annotators prefer outputs similar to their model. Although we could control for that, in practice we have found that to be less of an issue than length bias, for two reasons: (1) this mostly concerns a single model in the leaderboard, because fine-tuning on outputs from the auto-annotator doesn't seem to impact the win-rate as much, and (2) the bias is actually less strong than one might think. For example we show below a subset of the leaderboards auto-annotated by three different models, and we see that the ranking of models is exactly the same. In particular, claude-3-opus prefers gpt4_preview, and mistral-large prefers the former two.

Analyzing an evaluator

Caution: all the following results are about AlpacaEval 1.0 and have not been updated since.

Analyzing evaluators:

As we saw in the evaluator's leaderboard, there are many metrics to consider when selecting an evaluator, e.g. the quality, price, and speed. To assist with selection of the evaluator we provide a few functions to plot those metrics. The following shows for example the price/time/agreement of the different evaluators.

Here we see that alpaca_eval_gpt4 performs very well and is better than humans on all the considered metrics.

Previously we only considered the agreement with human annotators overall. An additional validation that one could do is checking whether making a leaderboard using our automatic annotator gives similar results as a leaderboard from humans. To enable such analysis, we release human annotations of outputs from 22 methods from AlpacaFarm => 22*805 = ~18K annotations. As a result we can test the correlation between the win-rates of the 22 models as evaluated by the humans and our automatic annotator. Note that this is arguably a better way of selecting an automatic evaluator than using "human agreement [%]" but is expensive given that it requires 18K annotations. The plot below shows such correlation for the alpaca_eval_gpt4 evaluator.

We see that the alpaca_eval_gpt4 leaderboard is highly correlated (0.94 Pearson correlation) to the leaderboard from humans, which further suggests that automatic evaluation is a good proxy for human evaluation. For the code and more analysis, see this notebook, or the colab notebook above.
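The leaderboard comparison described above boils down to a Pearson correlation between two lists of per-model win-rates. The sketch below shows that computation with the standard library only; the win-rates are made-up numbers for five hypothetical models, not results from the released annotations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up win-rates (in %) of five hypothetical models, one entry per model.
human_win_rates = [10.0, 25.0, 40.0, 55.0, 70.0]
auto_win_rates = [12.0, 22.0, 45.0, 50.0, 75.0]

r = pearson(human_win_rates, auto_win_rates)
print(f"Pearson correlation: {r:.2f}")
```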

Analyzing an eval set

Caution: all the following results are about AlpacaEval 1.0 and have not been updated since.

Making evaluation sets:

When creating an evaluation set there are two main factors to consider: how much data to use? and what data?

One way of answering those questions is by considering a leaderboard of models that you believe are of different quality and checking what and how much data is needed to distinguish between them in a statistically significant way. We will do so below using a paired t-test to test if the difference in win-rates between every pair of models is statistically significant.

First, let us consider the question of how much data to use. Below we show the number of random samples needed from AlpacaEval for the paired t-test to give a p-value < 0.05 for each pair of models in the minimal alpaca_eval_gpt4 leaderboard. Grey cells correspond to pairs that are not significantly different on the 805 samples. y- and x-axis are ordered by the win-rate of the first and second model respectively.

We see that most models can already be distinguished with 50 samples, and that 150 samples allows distinguishing the majority of pairs (74 out of 78). This suggests that we can decrease the evaluation set size by a factor of 4 when testing two models that have similar performance gaps as those on the minimal alpaca_eval_gpt4 leaderboard.
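For readers who want to try this on their own annotations, the following is a minimal sketch of a paired t-test on per-instruction preferences of two models against the same baseline. It uses only the standard library, so the p-value comes from a normal approximation to the t distribution (an assumption that is only reasonable for large samples, and not necessarily what the notebook does). The preference lists are made up.

```python
import math
import statistics


def paired_t_statistic(prefs_a, prefs_b):
    """t statistic of the per-instruction differences between two models."""
    if len(prefs_a) != len(prefs_b):
        raise ValueError("both models must be scored on the same instructions")
    diffs = [a - b for a, b in zip(prefs_a, prefs_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def approx_two_sided_p_value(t):
    """Two-sided p-value using a normal approximation (large-sample assumption)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))


# Made-up win indicators (1 = beats the baseline) on the same 8 instructions.
model_a = [1, 1, 1, 0, 1, 1, 0, 1]
model_b = [0, 0, 1, 0, 0, 1, 0, 0]

t = paired_t_statistic(model_a, model_b)
p = approx_two_sided_p_value(t)
```

With real data the lists would hold one entry per instruction (e.g. preference - 1), and an exact p-value would use the t distribution with n - 1 degrees of freedom.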

The second question is what data to use. Again we can try to answer this question from a statistical power perspective: what data allows us to best distinguish between models. Let's consider this for all the datasets that are part of AlpacaEval, but let us control for the size of the evaluation sets as we only care about the quality of the data. The following plot shows the p-values from the paired t-test of each pair of models on 80 examples of each subset of AlpacaEval.

We see for example that the self-instruct dataset yields the least statistical power, which suggests that one could remove this dataset from the evaluation set. The exact reason should be analyzed in future work. For the code and more analysis see this notebook, or the colab notebook above.

Citation

Please consider citing the following depending on what you are using and referring to:

Code, results, and general benchmark: alpaca_eval (this repo). Specify whether you are using AlpacaEval or AlpacaEval 2.0. For length-controlled win-rates see below.
Length-controlled (LC) win rates: alpaca_eval_length.
Human annotations: dubois2023alpacafarm (AlpacaFarm)
AlpacaEval evaluation set: alpaca_eval and self-instruct, open-assistant, vicuna, koala, hh-rlhf.

Here are the bibtex entries:

@misc{alpaca_eval,
  author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},
  year = {2023},
  month = {5},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/alpaca_eval}}
}

@article{dubois2024length,
  title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},
  author={Dubois, Yann and Galambosi, Bal{\'a}zs and Liang, Percy and Hashimoto, Tatsunori B},
  journal={arXiv preprint arXiv:2404.04475},
  year={2024}
}

@misc{dubois2023alpacafarm,
  title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback},
  author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  year={2023},
  eprint={2305.14387},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

More information

Length-Controlled Win Rates

Length controlled (LC) win-rates are a debiased version of the win-rates that control for the length of the outputs.

The main idea is that for each model we will fit a logistic regression to predict the preference of the autoannotator given: (1) the instruction, (2) the model, and (3) the difference of length between the baseline and model output. Given such a logistic regression we can then try to predict the counterfactual "what would the preference be if the model's output had the same length as the baseline" by setting the length difference to 0. By averaging over this length-controlled preference, we then obtain the length-controlled win-rate. The exact form of the logistic regression is taken such that the interpretation of LC win rates is similar to the raw win rates, for example for any model m1 and m2 we have win_rate(m1, m2) = 1 - win_rate(m2, m1) \in [0,1] and win_rate(m1, m1) = 0.5. Length controlled win-rates increase the correlation between AlpacaEval's leaderboard and Chat Arena from 0.93 to 0.98 Spearman correlation, while significantly decreasing the length gameability of the annotator. For more information and results about length controlled win-rates see this notebook.

This idea of estimating the controlled direct effect, by predicting the outcome while conditioning on the mediator (the length difference), is common in statistical inference.
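The counterfactual step can be illustrated with a deliberately simplified toy model. The sketch below is not the exact GLM used by AlpacaEval: the coefficients are hand-picked rather than fitted, the per-instruction term is a single made-up offset per example, and squashing the length difference with tanh is an assumption. It only shows how setting the length difference to 0 and averaging yields a length-controlled win-rate with the symmetry properties mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def win_rate(theta_model, theta_baseline, length_coef, examples, control_length=True):
    """Average predicted preference for the model over the baseline.

    examples: list of (instruction_effect, length_diff) pairs, where length_diff
    is the (standardized) model-minus-baseline output length.
    With control_length=True the length difference is set to 0 (the counterfactual).
    """
    predictions = []
    for instruction_effect, length_diff in examples:
        diff = 0.0 if control_length else length_diff
        logit = (
            theta_model
            - theta_baseline
            + instruction_effect
            + length_coef * math.tanh(diff)
        )
        predictions.append(sigmoid(logit))
    return sum(predictions) / len(predictions)


# Hand-picked toy values: the model writes longer outputs than the baseline.
examples = [(0.2, 1.5), (-0.1, 0.8), (0.0, 2.0)]
raw = win_rate(0.3, 0.0, 1.0, examples, control_length=False)
lc = win_rate(0.3, 0.0, 1.0, examples, control_length=True)
```

In this toy setting the raw win-rate is higher than the length-controlled one, because part of the model's advantage comes from its longer outputs.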

To get LC win rates on previously annotated models, you can use the following command:

pip install -U alpaca_eval
alpaca_eval --model_outputs … --is_recompute_metrics_only True

AlpacaEval 2.0

AlpacaEval 2.0 is a new version of AlpacaEval. Here are the differences:

reference: gpt4_turbo: we upgraded the baseline from text-davinci-003 to gpt4_turbo to make the benchmark more challenging and have a metric that better reflects the current state of the art.
annotator: weighted_alpaca_eval_gpt4_turbo: we improved the annotator in quality and price. First, we use the gpt4_turbo model for annotating, which is approximately 2x cheaper than gpt4. Second, we changed the prompt such that the model outputs a single token, which further reduced cost and improved speed. Finally, instead of using a binary preference, we used the logprobs to compute a continuous preference, which gives the final weighted win-rate. Note that the latter two changes had the surprising effect of decreasing the annotators' length bias.

By default, AlpacaEval 2.0 will be used from pip install alpaca_eval==0.5. If you wish to use the old configs by default, you can set IS_ALPACA_EVAL_2=False in your environment.

Data Release

As part of AlpacaEval, we release the following data:

Human annotations (17701): in order to develop and understand automatic evaluators, we release all the human pairwise evaluation that we collected for AlpacaFarm. This contains comparisons between 22 models with the text-davinci-003 reference on the AlpacaFarm evaluation set. Annotations are from a pool of 16 crowd workers on Amazon Mechanical Turk. The different models are: 6 from OpenAI, 2 SFT models from AlpacaFarm, 13 RLHF methods from AlpacaFarm, and LLaMA 7B.
Human cross-annotations (2596): in order to further analyze automatic evaluators we selected (via stratified sampling across models and datasets) 650 examples from the AlpacaFarm evaluation set and collected 4 human annotations per example.
AlpacaEval set (805): we made slight modifications/simplifications of the AlpacaFarm evaluation set. In particular, we first merged the instruction and input fields into a single instruction field. This affects 1/4 of the examples in the AlpacaFarm evaluation set, all of which are from the self-instruct evaluation set. Second, we regenerated the text-davinci-003 reference outputs without limiting the length of its outputs.

For more details about the human annotations refer to the AlpacaFarm paper.

Differences with AlpacaFarm

AlpacaEval is an improvement and simplification of the automatic pairwise preference simulator from AlpacaFarm. Outside AlpacaFarm, you should be using AlpacaEval. Here are the main differences:

AlpacaEval merges instructions and inputs: The AlpacaEval evaluation is the same as the AlpacaFarm evaluation except that the instruction and input fields are merged as {instruction}\n\n{input}. This affects 1/4 of the examples in the AlpacaFarm evaluation set (the self-instruct subset). This simplification provides a fairer comparison for models that were not trained by distinguishing between the two fields.
AlpacaEval handles longer generations: Models in AlpacaFarm were limited to a maximum number of 300 tokens for generations. We change this number to 2000 for AlpacaEval. Note that this also affects the reference generations (text-davinci-003), so the results on AlpacaEval are not comparable to those on AlpacaFarm even for examples that had no input field.
AlpacaEval removes intra- and inter-annotator variance: The AlpacaFarm simulator replicates human annotation in terms of both mode behavior and diversity. In particular, AlpacaFarm's simulator uses a pool of models and prompts and adds noise to replicate human intra- and inter-annotator variance. If the goal is to use an automatic annotator for evaluation or simply training better models, then this variance may not be desirable. The default annotators in AlpacaEval thus don't have this variance. We give the option to add it back by using --annotators_config 'alpaca_farm' and --p_label_flip 0.25 when creating an evaluator.

Related work

There have been several works that propose new automatic annotators for instruction-following models. Here we list the ones that we are aware of and discuss how they differ from ours. We evaluated all of those in our evaluator's leaderboard.

Vicuna/lmsys The lmsys annotator (lmsys_gpt4) evaluates the pair by asking the annotator for a score from 1-10 for each output, and then selecting the output with the highest score as preferred. They do not randomize over output order and they ask for an explanation after the score. Overall, we found that this annotator has strong bias towards longer outputs (0.74) and relatively low correlation with human annotations (63.2).
AlpacaFarm The best AlpacaFarm annotator (alpaca_farm_greedy_gpt4) evaluates the pair by directly asking the annotator which output it prefers. Furthermore, it batches 5 examples together to amortize the length of the prompt and randomizes the order of outputs. Overall, we found that this annotator has much less bias towards longer outputs (0.60) and is faster (878 seconds/1000 examples) than others. It has a slightly higher correlation with the majority of human annotations (66.4) than humans themselves (65.7). However, it is more expensive ($15.3/1000 examples) and doesn't work with very long outputs given the batching.
Aviary The Aviary annotator (aviary_gpt4) asks the annotator to order the output by its preference, rather than simply selecting the preferred output. It does not randomize the order of outputs and uses high temperature for decoding (0.9). Overall, we found that this annotator has relatively strong bias towards longer outputs (0.70) and very high correlation with human annotations (69.1). By decreasing the temperature and randomizing the order of outputs, we further improved the correlation to 69.8 (improved_aviary_gpt4) but this further increased the length bias to 0.73.

Our alpaca_eval_gpt4 is a mix between the AlpacaFarm and Aviary annotators. It asks the annotator to order the outputs by preference, but it uses temperature 0, randomizes over outputs, and uses a modified prompt to decrease length bias to 0.68.

Other related work includes recent papers which analyze automatic evaluators. For example:

AlpacaFarm Appx C and Large Language Models are not Fair Evaluators both found that automatic annotators have a position bias.
AlpacaFarm Sec. 5.2. and The False Promise of Imitating Proprietary LLMs both found that automatic annotators favor style (e.g. use of list, tone, word choice, length) over factuality.

Interpreting annotations

For all models you can find the auto-annotations under results/<model_name>/*/annotations.json. The annotations have the following columns:

instruction: the prompt
generator_1: the baseline model
output_1: the output of the baseline model
generator_2: the model being evaluated
output_2: the output of the model being evaluated
annotator: the auto-annotator
preference: the result of the auto-annotator. This is a float between 1 and 2. Closer to 1 means that the auto-annotator prefers output_1, closer to 2 means that it prefers output_2. For AlpacaEval 2.0, preference-1 corresponds to the probability of output_2 being preferred. For AlpacaEval 1.0, preference is 1 if output_1 is preferred, 2 if output_2 is preferred, and 1.5 if they are the same. The win rate is always (preference -1).mean().
raw_completion: the raw output of the auto-annotator. This field contains the completions before de-randomization of the order between output_1 and output_2! It is thus much harder to interpret, see below for more information.
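As a small worked example of the formula above, the sketch below computes the win rate from a list of annotation records. The records are made up and only contain the preference column; in practice they would be loaded from annotations.json with json.load. Skipping entries without a preference is an assumption of this sketch.

```python
def win_rate_from_annotations(annotations):
    """Win rate of the evaluated model (output_2), i.e. the mean of preference - 1."""
    prefs = [a["preference"] for a in annotations if a.get("preference") is not None]
    if not prefs:
        raise ValueError("no annotation has a preference")
    return sum(p - 1 for p in prefs) / len(prefs)


# Made-up AlpacaEval 1.0-style records: output_1 wins once, output_2 wins twice, one tie.
annotations = [
    {"preference": 1.0},
    {"preference": 2.0},
    {"preference": 1.5},
    {"preference": 2.0},
]

rate = win_rate_from_annotations(annotations)
print(f"win rate: {100 * rate:.1f}%")
```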

Chain of thought

For some annotators, e.g. alpaca_eval_cot_gpt4_turbo_fn, we use chain of thought reasoning to make the model's preferences more interpretable. Those can then be found under concise_explanation. To interpret them, you should also look at referenced_models which translates the temporary model name (in the prompt) to the actual output. Below, we provide more explanation as to what is happening behind the scenes.

You can check the raw_annotations["concise_explanation"] column in annotations.json (e.g. here) which contains the chain of thought reasoning of the auto annotator. Note that the raw_annotations is not modified by the randomization of the order of the outputs. In particular, "m" and "M" can sometimes refer to the first model (the reference) and sometimes to the second model (the model being evaluated). To understand which model is being referred to, you should use the columns preference and ordered_models. To make it easier we add a column "referenced_models" mapping the model names to the corresponding outputs. For example in the following annotation we see that the preference is 1.0 (i.e. output_1) and corresponds to model M in concise_explanation (see ordered_models).

{
  "instruction": "How did US states get their names?",
  "output_1": "The names of U.S. states are derived from a variety of languages and sources, including indigenous American languages, European languages (primarily English, Spanish, and French), as well as other historical influences. Here is a brief overview of how some states got their names:\n\n1. **Indigenous American Languages**: Many state names come from Native American words or names of tribes, often from the languages of the indigenous peoples who lived in or near those areas.\n   - **Alabama**: Possibly derived from the Choctaw language, meaning \"thicket clearers.\"\n   - **Connecticut**: From a Mohegan-Pequot word meaning \"long tidal river.\"\n   - **Massachusetts**: [...]",
  "generator_1": "gpt4_1106_preview",
  "dataset": "helpful_base",
  "output_2": "The names of the 50 U.S. states come from a variety of sources, including Native American languages, European languages, and historical figures. Here's a brief overview of how some states got their names:\n\n1. Native American origins: Many states have names derived from Native American languages. For example, Alabama comes from the Choctaw word \"Albah amo,\" meaning \"plant gatherers\" or \"herb gatherers.\" Similarly, the name Mississippi comes from the Ojibwe word \"Misi-ziibi,\" meaning \"great river.\"\n\n2. European languages: [...].",
  "generator_2": "gpt4",
  "annotator": "alpaca_eval_cot_gpt4_turbo_fn",
  "preference": 1.0,
  "raw_completion": {
    "concise_explanation": "Model M provided a more detailed and structured response, including bold headings for each category and a wider range of examples. It also included additional categories such as 'Other European Languages' and 'Combination of Languages and Influences', which added depth to the explanation. Model m's response was accurate but less comprehensive and lacked the clear structure found in Model M's output.",
    "ordered_models": [
      {"model": "M", "rank": 1},
      {"model": "m", "rank": 2}
    ]
  },
  "referenced_models": {
    "M": "output_1",
    "m": "output_2"
  }
}
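The lookup described above can be sketched in a few lines of Python. The helper preferred_output is hypothetical (not part of alpaca_eval), and the record is a trimmed copy of the annotation above that keeps only the fields needed for the lookup.

```python
def preferred_output(annotation):
    """Return which output ("output_1" or "output_2") the annotator ranked first."""
    ordered = annotation["raw_completion"]["ordered_models"]
    # The top-ranked temporary name ("M" or "m") is the one with the smallest rank.
    top_name = min(ordered, key=lambda entry: entry["rank"])["model"]
    # referenced_models translates the temporary name back to the actual output.
    return annotation["referenced_models"][top_name]


# Trimmed copy of the example annotation.
annotation = {
    "preference": 1.0,
    "raw_completion": {
        "ordered_models": [
            {"model": "M", "rank": 1},
            {"model": "m", "rank": 2},
        ]
    },
    "referenced_models": {"M": "output_1", "m": "output_2"},
}

winner = preferred_output(annotation)
```

Here the winner is output_1, which agrees with the preference of 1.0 stored in the same record.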

Major updates

12th March 2024: updated to use length-controlled (LC) win rates. This is a debiased version of the win-rates that control for the length of the outputs.
3rd January 2024: updated to AlpacaEval 2.0, which uses GPT4-turbo as baseline and annotator.
2nd January 2024: added Azure API and more general way of setting client configs. See here.
19th June 2023: add leaderboard chatgpt_fn that anyone can use (no waiting lists).
19th June 2023: update to use OpenAI's function calling. Example: chatgpt_fn or alpaca_eval_gpt4_fn.