
ku-nlp/ja-vicuna-qa-benchmark


This repository contains the code for the Japanese Vicuna QA Benchmark, described in the paper: Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese.

We released the Japanese Vicuna QA Benchmark for measuring the comprehensive capabilities of Japanese LLMs. It consists of 80 diverse questions in 10 categories (generic, coding, roleplay, writing, etc.). You can use this package to evaluate the answers of your Japanese LLM in a reference-free manner with LLM-as-a-judge. To automate the evaluation process, we prompt strong LLMs such as GPT-4 to act as judges and assess the quality of the models' responses.

Note that such zero-shot QA-style evaluation may be more suitable for LLMs that have been fine-tuned on instructions. The 80 questions are manually translated from the English Vicuna benchmark.

Install

pip install -e .

Evaluate a model with Japanese Vicuna QA Benchmark

Step 1. Generate model answers to the Japanese Vicuna QA questions (referred to as jp-bench).

python llm_judge/gen_model_answer.py --config <CONFIG-PATH>

Arguments & Options:

  • <CONFIG-PATH> is the path to a configuration file. Examples are in configs/.

For example:

python llm_judge/gen_model_answer.py --config configs/rinna--japanese-gpt-neox-3.6b-instruction-ppo.json
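Under the hood, this step loads the model specified by the configuration file, generates an answer for each benchmark question, and writes the answers to data/jp_bench/model_answer. The following is a minimal sketch of that loop, assuming a Hugging Face causal LM; the configuration keys, file paths, and record fields shown here are illustrative assumptions, not the repository's actual schema.

# Minimal sketch of the answer-generation step (illustrative only).
# Config keys, file paths, and record fields below are hypothetical.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

with open("configs/rinna--japanese-gpt-neox-3.6b-instruction-ppo.json") as f:
    config = json.load(f)  # hypothetical key: "model_id"

tokenizer = AutoTokenizer.from_pretrained(config["model_id"])
model = AutoModelForCausalLM.from_pretrained(config["model_id"])

answers = []
with open("data/jp_bench/question.jsonl") as f:  # hypothetical question file
    for line in f:
        question = json.loads(line)
        inputs = tokenizer(question["turns"][0], return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=512)
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        answers.append({
            "question_id": question["question_id"],
            "answer": tokenizer.decode(new_tokens, skip_special_tokens=True),
        })

with open("data/jp_bench/model_answer/my-model.jsonl", "w") as f:
    for record in answers:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")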

Step 2. Generate GPT-4 judgments

There are several ways to use GPT-4 as a judge, such as pairwise win-rate comparison and single-answer grading.

OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
    --mode {single|pairwise-baseline|pairwise-all} \
    [--baseline-model <BASELINE-MODEL-ID>] \
    [--model-list <LIST-OF-MODEL-IDS>] \
    [--yes] \
    [--wandb]

Arguments & Options:

  • --mode {single|pairwise-baseline|pairwise-all} is the mode of judgment.
    • pairwise-baseline: run pairwise comparison against a baseline model. This mode will be used by default.
    • pairwise-all: run pairwise comparison between all model pairs.
    • single: run score-based single-model grading.
  • --baseline-model <BASELINE-MODEL-ID> is the model ID of the baseline model. This option is only available in pairwise-baseline mode. If not specified, the baseline model is set to text-davinci-003.
  • --model-list <LIST-OF-MODEL-IDS> is a list of model IDs to be evaluated. If not specified, all models in data/jp_bench/model_answer will be evaluated.
  • --yes is a flag to skip the confirmation prompt.
  • --wandb is a flag to enable logging to W&B. You can also upload the results to W&B later by running upload_result.py, as described in the next section.

Mode: pairwise-baseline (Default)

This mode runs pairwise comparison against a baseline model. By default, the baseline model is set to text-davinci-003. For example:

OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
    --mode pairwise-baseline \
    --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo

To show the scores:

python llm_judge/show_result.py \
    --mode pairwise-baseline \
    --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
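For intuition, the pairwise win rate reported here is simply the fraction of judgments in which the model beats the baseline, with ties counted separately. The sketch below shows one way such judgments could be aggregated; the record fields ("model", "winner") are assumptions for illustration, not the actual output format of gen_judgment.py.

# Illustrative aggregation of pairwise judgments into win/tie/loss rates.
# The judgment record fields used here are hypothetical.
from collections import Counter

def win_rate(judgments, model_id):
    """Fraction of pairwise judgments won, tied, and lost by `model_id`."""
    outcomes = Counter(j["winner"] for j in judgments if j["model"] == model_id)
    total = sum(outcomes.values())
    if total == 0:
        return {}
    return {
        "win": outcomes["model"] / total,      # judged better than the baseline
        "tie": outcomes["tie"] / total,
        "loss": outcomes["baseline"] / total,  # judged worse than the baseline
    }

judgments = [
    {"model": "rinna--japanese-gpt-neox-3.6b-instruction-ppo", "winner": "model"},
    {"model": "rinna--japanese-gpt-neox-3.6b-instruction-ppo", "winner": "tie"},
    {"model": "rinna--japanese-gpt-neox-3.6b-instruction-ppo", "winner": "baseline"},
    {"model": "rinna--japanese-gpt-neox-3.6b-instruction-ppo", "winner": "model"},
]
print(win_rate(judgments, "rinna--japanese-gpt-neox-3.6b-instruction-ppo"))
# {'win': 0.5, 'tie': 0.25, 'loss': 0.25}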

Mode: pairwise-all

This mode runs pairwise comparison between all model pairs. For example:

OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
    --mode pairwise-all

To show the scores:

python llm_judge/show_result.py \
    --mode pairwise-all

Note that the pairwise-all mode can become very inefficient when evaluating many LLMs, as it judges every pair of models. In such cases, we recommend the pairwise-baseline mode, which compares all models against a fixed baseline such as ChatGPT.
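The quadratic growth is easy to quantify: with n evaluated models, pairwise-all judges n*(n-1)/2 model pairs (each over all 80 questions), while pairwise-baseline judges only n comparisons against the fixed baseline. A quick check:

# Number of model pairs judged per question in each mode.
for n in (4, 8, 16, 32):
    pairwise_all = n * (n - 1) // 2  # every pair of evaluated models
    pairwise_baseline = n            # each model vs. the fixed baseline
    print(f"{n} models: pairwise-all={pairwise_all}, pairwise-baseline={pairwise_baseline}")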

Mode: single

This mode runs score-based single-model grading. For example:

OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
    --mode single \
    --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo

To show the scores:

python llm_judge/show_result.py \
    --mode single \
    --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo

Step 3. Upload the results to W&B (Optional)

If you want to upload the results to W&B, you can run the following command:

WANDB_ENTITY=<USER-NAME or ORGANIZATION-NAME> WANDB_PROJECT=<PROJECT-NAME> python llm_judge/upload_result.py \
    --mode {single|pairwise-baseline|pairwise-all} \
    [--baseline-model <BASELINE-MODEL-ID>] \
    [--model-list <LIST-OF-MODEL-IDS>]

By default, the entity is configured to use your username, and the project name is set to ja-vicuna-qa-benchmark-dev-<VERSION>.

Pairwise win-rate compared with GPT-3.5 (text-davinci-003)

See the leaderboard (in Japanese).

Supported baseline models

To make it more convenient for users to run pairwise comparisons against existing Japanese LLMs, we provide the predictions of the following four baseline models in data/jp_bench/model_answer.

Questions

If you have any questions or feedback, please feel free to open an issue in the Issues list.

Citation

If you use our code in your research, please cite our work:

@inproceedings{sun2024rapidly,
  title={Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese},
  author={Sun, Yikun and Wan, Zhen and Ueda, Nobuhiro and Yahata, Sakiko and Cheng, Fei and Chu, Chenhui and Kurohashi, Sadao},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024}
}
