# llm-jp-eval-mm
llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)

```bash
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```
- Option 2: Install via PyPI

```bash
pip install eval_mm
```
To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:

- For Azure: Set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: Set `OPENAI_API_KEY`

If you are not using LLM-as-a-Judge, you can assign any value in the `.env` file to bypass the error.
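A minimal `.env` sketch (the values below are placeholders, not real credentials):

```
# Azure OpenAI (placeholders -- replace with your own endpoint and key)
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key

# Or, for the OpenAI API
OPENAI_API_KEY=your-openai-key
```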
To evaluate a model on a task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite
```
The evaluation results will be saved in the `result` directory:

```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
To evaluate multiple models on multiple tasks, please check `eval_all.sh`.
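If you prefer to script the sweep yourself, here is a rough sketch of the kind of loop `eval_all.sh` performs; the task IDs other than `japanese-heron-bench` and the metric choice are illustrative, so check `eval_all.sh` for the real combinations:

```bash
# Hypothetical sketch: evaluate one model on several tasks via examples/sample.py.
MODEL_ID=llava-hf/llava-1.5-7b-hf
for task_id in japanese-heron-bench ja-vlm-bench-in-the-wild; do
  uv run --group normal python examples/sample.py \
    --model_id "$MODEL_ID" \
    --task_id "$task_id" \
    --result_dir result \
    --metrics rougel \
    --judge_model gpt-4o-2024-11-20
done
```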
You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image

from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig


class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"


task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer("rougel", ScorerConfig(docs=task.dataset))
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```
To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```
This will create a `leaderboard.md` file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
The official leaderboard is available on the project's GitHub Pages site.
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
- CC-OCR (multi_lan_ocr split, ja subset)
- CVQA (ja subset)
English Tasks:
- MMMU
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
See `eval_all.sh` for the complete list of model dependencies.
When adding a new group, remember to configure `conflict` in `pyproject.toml` so that uv knows which groups are mutually exclusive.
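For reference, uv declares mutually exclusive groups under `[tool.uv]` in `pyproject.toml`. A minimal sketch, using group names from this README as examples (the repository's actual conflict entries may differ):

```toml
[tool.uv]
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```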
To browse model predictions in a Streamlit viewer, run:

```bash
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
```
To add a new task, implement the Task class in `src/eval_mm/tasks/task.py`.
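Judging from the library example above, a task exposes a `dataset` plus `doc_to_text`, `doc_to_visual`, and `doc_to_answer` accessors. A hedged sketch of that interface (the actual base class in `src/eval_mm/tasks/task.py` may require additional methods and registration):

```python
from PIL import Image


# Hypothetical sketch only; mirror the real Task base class when implementing.
class MyNewTask:
    def __init__(self):
        # Each element of dataset is one evaluation example.
        self.dataset = [
            {"question": "この画像には何が写っていますか?", "image": Image.new("RGB", (64, 64)), "answer": "犬"}
        ]

    def doc_to_text(self, doc) -> str:
        return doc["question"]

    def doc_to_visual(self, doc) -> list[Image.Image]:
        return [doc["image"]]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]
```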
To add a new metric, implement the Scorer class in `src/eval_mm/metrics/scorer.py`.
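The library example also shows that a scorer is used as `scorer.score(refs, preds)` followed by `scorer.aggregate(scores)`. A hedged sketch of a custom metric (the real base class defines the exact return types, e.g. `AggregateOutput`):

```python
# Hypothetical sketch only; follow the real Scorer base class for exact signatures.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # One score per (reference, prediction) pair.
        return [1.0 if ref == pred else 0.0 for ref, pred in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> float:
        # Collapse per-example scores into one overall number.
        return sum(scores) / len(scores) if scores else 0.0
```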
To add a new model, implement the VLM class in `examples/base_vlm.py`.
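The expected model interface matches the `MockVLM` shown earlier: a `generate` method that takes a list of PIL images and a prompt string and returns the generated text. A minimal sketch (the model loading and inference call are placeholders):

```python
from PIL import Image


# Hypothetical sketch only; see examples/base_vlm.py for the actual base class.
class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id
        # Load your model and processor here, e.g. with transformers.

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Run inference on (images, text) and return the answer string.
        return "placeholder answer"
```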
Install a new dependency using the following command:
```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
Run the following commands to test tasks, metrics, and models:

```bash
bash test.sh
bash test_model.sh
```
Ensure code consistency with:
```bash
uv run ruff format src
uv run ruff check --fix src
```
To release a new version:
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
For website updates, see `github_pages/README.md`.
To update leaderboard data:
```bash
python scripts/make_leaderboard.py --update_pages
```
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.