This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
License
Apache-2.0, MIT licenses found
declare-lab/instruct-eval
🔥 If you are interested in IQ testing LLMs, check out our new work: AlgoPuzzleVQA
📣 Introducing Resta: Safety Re-alignment of Language Models. Paper, Github
📣 Red-Eval, the benchmark for Safety Evaluation of LLMs, has been added: Red-Eval
📣 Introducing Red-Eval to evaluate the safety of LLMs using several jailbreaking prompts. With Red-Eval, one could jailbreak/red-team GPT-4 with a 65.1% attack success rate, and ChatGPT could be jailbroken 73% of the time, as measured on the DangerousQA and HarmfulQA benchmarks. More details are here: Code and Paper.
📣 We developed Flacuna by fine-tuning Vicuna-13B on the Flan collection. Flacuna is better than Vicuna at problem-solving. Access the model here: https://huggingface.co/declare-lab/flacuna-13b-v1.0.
📣 The InstructEval benchmark and leaderboard have been released.
📣 The paper reporting instruction-tuned LLMs on the InstructEval benchmark suite has been released on arXiv. Read it here: https://arxiv.org/pdf/2306.04757.pdf
📣 We are releasing IMPACT, a dataset for evaluating the writing capability of LLMs in four aspects: Informative, Professional, Argumentative, and Creative. Download it from Huggingface: https://huggingface.co/datasets/declare-lab/InstructEvalImpact.
📣 FLAN-T5 is also useful in text-to-audio generation. Find our work at https://github.com/declare-lab/tango if you are interested.
This repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. We aim to facilitate simple and convenient benchmarking across multiple tasks and models.
Instruction-tuned models such as Flan-T5 and Alpaca represent an exciting direction to approximate the performance of large language models (LLMs) like ChatGPT at lower cost. However, it is challenging to compare the performance of different models qualitatively. To evaluate how well the models generalize across a wide range of unseen and challenging tasks, we can use academic benchmarks such as MMLU and BBH. Compared to existing libraries such as evaluation-harness and HELM, this repo enables simple and convenient evaluation for multiple models. Notably, we support most models from HuggingFace Transformers 🤗 (check here for a list of models we support):
- AutoModelForCausalLM (e.g. GPT-2, GPT-J, OPT-IML, BLOOMZ)
- AutoModelForSeq2SeqLM (e.g. Flan-T5, Flan-UL2, TK-Instruct)
- LlamaForCausalLM (e.g. LLaMA, Alpaca, Vicuna)
- ChatGLM
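To illustrate how a model family name could select one of the Transformers classes listed above, here is a minimal sketch. It is not this repository's actual loader: the names `MODEL_CLASS_NAMES`, `resolve_model_class`, and `load_model` are hypothetical, and the keys are simply the `--model_name` values that appear elsewhere in this README. ChatGLM is left out because its loading path is not described here.

```python
import importlib

# Hypothetical mapping (an assumption, not the repo's code):
# --model_name values seen in this README -> Transformers class names.
MODEL_CLASS_NAMES = {
    "causal": "AutoModelForCausalLM",
    "seq_to_seq": "AutoModelForSeq2SeqLM",
    "llama": "LlamaForCausalLM",
}


def resolve_model_class(model_name: str):
    """Look up the Transformers class for a model family, importing lazily."""
    if model_name not in MODEL_CLASS_NAMES:
        raise ValueError(f"Unsupported model_name: {model_name!r}")
    # Imported only when needed, so the mapping can be inspected
    # on a machine without transformers installed.
    transformers = importlib.import_module("transformers")
    return getattr(transformers, MODEL_CLASS_NAMES[model_name])


def load_model(model_name: str, model_path: str):
    """Load pretrained weights from a Hub id or a local path."""
    return resolve_model_class(model_name).from_pretrained(model_path)
```

With `transformers` installed, a call such as `load_model("seq_to_seq", "google/flan-t5-xl")` would download the weights on first use, so expect a long wait and substantial disk usage.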
For detailed results, please go to our leaderboard.
| Model Name | Model Path | Paper | Size | MMLU | BBH | DROP | HumanEval |
|---|---|---|---|---|---|---|---|
| | GPT-4 | Link | ? | 86.4 | | 80.9 | 67.0 |
| | ChatGPT | Link | ? | 70.0 | | 64.1 | 48.1 |
| seq_to_seq | google/flan-t5-xxl | Link | 11B | 54.5 | 43.9 | | |
| seq_to_seq | google/flan-t5-xl | Link | 3B | 49.2 | 40.2 | 56.3 | |
| llama | eachadea/vicuna-13b | Link | 13B | 49.7 | 37.1 | 32.9 | 15.2 |
| llama | decapoda-research/llama-13b-hf | Link | 13B | 46.2 | 37.1 | 35.3 | 13.4 |
| seq_to_seq | declare-lab/flan-alpaca-gpt4-xl | Link | 3B | 45.6 | 34.8 | | |
| llama | TheBloke/koala-13B-HF | Link | 13B | 44.6 | 34.6 | 28.3 | 11.0 |
| llama | chavinlo/alpaca-native | Link | 7B | 41.6 | 33.3 | 26.3 | 10.3 |
| llama | TheBloke/wizardLM-7B-HF | Link | 7B | 36.4 | 32.9 | | 15.2 |
| chatglm | THUDM/chatglm-6b | Link | 6B | 36.1 | 31.3 | 44.2 | 3.1 |
| llama | decapoda-research/llama-7b-hf | Link | 7B | 35.2 | 30.9 | 27.6 | 10.3 |
| llama | wombat-7b-gpt4-delta | Link | 7B | 33.0 | 32.4 | | 7.9 |
| seq_to_seq | bigscience/mt0-xl | Link | 3B | 30.4 | | | |
| causal | facebook/opt-iml-max-1.3b | Link | 1B | 27.5 | | | 1.8 |
| causal | OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | Link | 12B | 27.0 | 30.0 | | 9.1 |
| causal | stabilityai/stablelm-base-alpha-7b | Link | 7B | 26.2 | | | 1.8 |
| causal | databricks/dolly-v2-12b | Link | 12B | 25.7 | | | 7.9 |
| causal | Salesforce/codegen-6B-mono | Link | 6B | | | | 27.4 |
| causal | togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | Link | 7B | 38.1 | 31.3 | 24.7 | 5.5 |
Evaluate on Massive Multitask Language Understanding (MMLU), which includes exam questions from 57 tasks such as mathematics, history, law, and medicine. We use 5-shot direct prompting and measure the exact-match score.

    python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
    # 0.4163936761145136

    python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl
    # 0.49252243270189433

Evaluate on Big Bench Hard (BBH), which includes 23 challenging tasks for which PaLM (540B) performs below an average human rater. We use 3-shot direct prompting and measure the exact-match score.

    python main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit
    # 0.3468942926723247

Evaluate on DROP, which is a math question answering benchmark. We use 3-shot direct prompting and measure the exact-match score.

    python main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl
    # 0.5632458233890215

Evaluate on HumanEval, which includes 164 coding questions in Python. We use 0-shot direct prompting and measure the pass@1 score.

    python main.py humaneval --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit
    # {'pass@1': 0.1524390243902439}

Install dependencies and download data.

    conda create -n instruct-eval python=3.8 -y
    conda activate instruct-eval
    pip install -r requirements.txt
    mkdir -p data
    wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
    tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
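For readers unfamiliar with the terms used in the evaluation commands, the sketch below shows roughly what k-shot direct prompting, the exact-match score, and pass@1 with one sample per problem mean. It is an illustration only, not this repository's implementation. The function names, the `Question:`/`Answer:` prompt template, and the whitespace-stripping rule are all assumptions, and the actual code may normalize answers differently.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Join k solved examples and one unsolved question into a direct prompt."""
    # Assumed template; the real prompts may be formatted differently.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def pass_at_1(passed: Sequence[bool]) -> float:
    """With one sample per problem, pass@1 is the share of problems whose sample passed."""
    if not passed:
        raise ValueError("passed must be non-empty")
    return sum(passed) / len(passed)


if __name__ == "__main__":
    demos = [("2 + 2 = ?", "4"), ("1 + 1 = ?", "2")]
    print(build_few_shot_prompt(demos, "3 + 5 = ?"))
    print(exact_match_score(["4", " Paris ", "9"], ["4", "Paris", "8"]))
    # 25 passes out of 164 problems; the count of 25 is inferred, not stated above.
    print(pass_at_1([True] * 25 + [False] * 139))
```

As a sanity check on the arithmetic, 25 passing problems out of HumanEval's 164 gives about 0.1524, which agrees with the pass@1 value printed by the HumanEval command. The count of 25 is inferred from that score rather than reported in this README.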
