llm-jp-eval-mm

llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.

(Overview figure of llm-jp-eval-mm)

Getting Started

You can install llm-jp-eval-mm from GitHub or via PyPI.

  • Option 1: Clone from GitHub (Recommended)

    git clone git@github.com:llm-jp/llm-jp-eval-mm.git
    cd llm-jp-eval-mm
    uv sync

  • Option 2: Install via PyPI

    pip install eval_mm

To use LLM-as-a-Judge, configure your OpenAI API keys in a .env file:

  • For Azure: Set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
  • For OpenAI: Set OPENAI_API_KEY

If you are not using LLM-as-a-Judge, you can set these variables to any placeholder value in the .env file to bypass the error.
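
If you want to confirm the configuration before launching a run, the following is a minimal sketch (not part of the framework; it only assumes the variable names listed above and that your .env values have already been loaded into the process environment, e.g. by a dotenv loader or your shell):

import os

# Sketch: check that at least one LLM-as-a-Judge backend is configured.
# Assumes the .env values are already present in the environment.
azure_ok = bool(os.environ.get("AZURE_OPENAI_ENDPOINT")) and bool(os.environ.get("AZURE_OPENAI_KEY"))
openai_ok = bool(os.environ.get("OPENAI_API_KEY"))

if not (azure_ok or openai_ok):
    print("No judge backend configured; set placeholder values if you skip LLM-as-a-Judge.")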

Usage

To evaluate a model on a task, run the following command:

uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite

The evaluation results will be saved in the result directory:

result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
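
Both evaluation.jsonl and prediction.jsonl use the .jsonl extension; assuming standard JSON Lines (one JSON object per line, with fields depending on the task and metric), a minimal inspection sketch looks like this:

import json
from pathlib import Path

# Sketch: load the per-example records for one model on one task,
# following the directory layout shown above.
run_dir = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf")
predictions = [json.loads(line) for line in (run_dir / "prediction.jsonl").read_text(encoding="utf-8").splitlines() if line]
evaluations = [json.loads(line) for line in (run_dir / "evaluation.jsonl").read_text(encoding="utf-8").splitlines() if line]
print(len(predictions), len(evaluations))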

To evaluate multiple models on multiple tasks, please check eval_all.sh.

Hello World Example

You can integrate llm-jp-eval-mm into your own code. Here's an example:

from PIL import Image

from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig


class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"


task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer("rougel", ScorerConfig(docs=task.dataset))
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})

Leaderboard

To generate a leaderboard from your evaluation results, run:

python scripts/make_leaderboard.py --result_dir result

This will create a leaderboard.md file with your model performance:

Model                                  Heron/LLM   JVB-ItW/LLM   JVB-ItW/Rouge
llm-jp/llm-jp-3-vila-14b               68.03       4.08          52.4
Qwen/Qwen2.5-VL-7B-Instruct            70.29       4.28          29.63
google/gemma-3-27b-it                  69.15       4.36          30.89
microsoft/Phi-4-multimodal-instruct    45.52       3.2           26.8
gpt-4o-2024-11-20                      93.7        4.44          32.2

The official leaderboard is available here.

Supported Tasks

Japanese Tasks:

English Tasks:

Managing Dependencies

We use uv’s dependency groups to manage each model’s dependencies.

For example, to use llm-jp/llm-jp-3-vila-14b, run:

uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py

See eval_all.sh for the complete list of model dependencies.

When adding a new group, remember to configure the conflict settings for it as well.

Browse Predictions with Streamlit

uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf

(Screenshot of the Streamlit prediction browser)

Development

Adding a new task

To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
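
The authoritative interface is that Task class; as a rough, duck-typed illustration only (the class and field names below are hypothetical, and a real task should subclass the actual base class), a task exposes the pieces the Hello World example uses: a dataset plus doc_to_text, doc_to_visual, and doc_to_answer.

from PIL import Image

# Hypothetical sketch of the interface a task exposes, mirroring what the
# Hello World example calls; subclass the real Task base class in
# src/eval_mm/tasks/task.py instead of copying this as-is.
class MyNewTask:
    def __init__(self):
        # A real task would load its benchmark dataset here.
        self.dataset = [
            {"question": "これは何ですか?", "image": Image.new("RGB", (64, 64)), "answer": "猫"}
        ]

    def doc_to_text(self, doc) -> str:
        return doc["question"]

    def doc_to_visual(self, doc) -> list[Image.Image]:
        return [doc["image"]]

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]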

Adding a new metric

To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
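
For orientation only, here is a hypothetical exact-match scorer that mirrors the score()/aggregate() calls made in the Hello World example; a real metric should subclass the Scorer base class in src/eval_mm/metrics/scorer.py and return its AggregateOutput type rather than a plain dict.

# Hypothetical exact-match scorer mirroring the score()/aggregate() interface
# exercised in the Hello World example; not the real Scorer base class.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # One score per (reference, prediction) pair.
        return [1.0 if ref == pred else 0.0 for ref, pred in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> dict:
        overall = sum(scores) / len(scores) if scores else 0.0
        return {"overall_score": overall, "details": {"exact_match": overall}}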

Adding a new model

To add a new model, implement the VLM class in examples/base_vlm.py.
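
The generate() contract is the one MockVLM demonstrates in the Hello World example: it takes a list of PIL images plus a prompt and returns a string. The sketch below is hypothetical (the wrapper name and constructor are made up); a real model should subclass the VLM base class in examples/base_vlm.py and run actual inference inside generate().

from PIL import Image

# Hypothetical wrapper showing the generate() contract used by the example
# scripts; replace the body with a real forward pass through your model.
class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id  # e.g. a Hugging Face model id

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Real implementation: preprocess images + text, run the model,
        # and decode the generated answer.
        return f"[{self.model_id}] placeholder answer for: {text[:30]}"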

Adding a new dependency

Install a new dependency using the following command:

uv add <package_name>
uv add --group <group_name> <package_name>

Testing

Run the following commands to test tasks, metrics, and models:

bash test.sh
bash test_model.sh

Formatting and Linting

Ensure code consistency with:

uv run ruff format src
uv run ruff check --fix src

Releasing to PyPI

To release a new version:

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags

Updating the Website

For website updates, see github_pages/README.md.

To update leaderboard data:

python scripts/make_leaderboard.py --update_pages

Acknowledgements

  • Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
  • lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.

We also thank the developers of the evaluation datasets for their hard work.
