TouchStone: Evaluating Vision-Language Models by Language Models

Paper

TOUCHSTONE is a comprehensive assessment of multimodal language models, covering not only basic recognition and comprehension but also literary creation. By converting multimodal information into text and using strong LLMs as judges, TouchStone enables efficient and accurate assessment of dialogue quality, leveraging the power of advanced language models without manual intervention.

DATASET

TouchStone is a diverse and comprehensive dataset that covers five key dimensions: Basic Descriptive Ability, Visual Recognition Ability, Visual Comprehension Ability, Visual Storytelling Ability, and Multi-image Analysis Ability. You can download the dataset here.

Our dataset currently places more emphasis on assessing basic abilities. Recognition questions make up the largest share at about 44.1%, followed by comprehension questions at 29.6%. The remaining categories account for 15.3% (basic descriptive ability), 7.4% (visual storytelling ability), and 3.6% (multi-image analysis ability). There are 908 dialogues in total.
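As a quick sanity check, the category distribution can be recomputed directly from the released TSV. The sketch below assumes the file name used in the "Read image" snippet further down and that the category column carries the five ability labels; adjust both to match the file you actually downloaded.

```python
import pandas as pd

# Load the released evaluation file (tab-separated).
df = pd.read_csv("touchstone_20230831.tsv", sep='\t')

# Total number of dialogues; expected to be 908.
print(len(df))

# Share of each ability category, as a percentage of all dialogues.
print((df['category'].value_counts(normalize=True) * 100).round(1))
```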

Methods

TouchStone leverages fine-grained annotations and strong LLMs to evaluate LVLMs. First, fine-grained descriptions of the images are obtained through manual annotation and inspection. These descriptions, together with the questions, are fed into GPT-4 (text-only) to generate reference answers. In parallel, each LVLM under evaluation takes the visual signals and questions directly as input and generates its own answers. The generated answers, reference answers, questions, and fine-grained descriptions are then given to GPT-4 for scoring. The per-question scores are averaged and used to rank the models, representing their overall performance.
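The exact judging prompt is defined in eval.py; the snippet below is only a minimal sketch of the flow described above. The `build_judge_prompt` helper, its prompt wording, and the example row contents are all hypothetical, and the real template and scoring instructions used by TouchStone may differ.

```python
# Hypothetical illustration of how the pieces fit together: the fine-grained
# human annotation, the question, GPT-4's reference answer, and the LVLM's
# answer are packed into one text-only prompt for the judge model.
def build_judge_prompt(human_annotation, question, gpt4_ha_answer, model_answer):
    return (
        "You are grading a vision-language model's answer.\n"
        f"Fine-grained image description: {human_annotation}\n"
        f"Question: {question}\n"
        f"Reference answer: {gpt4_ha_answer}\n"
        f"Model answer: {model_answer}\n"
        "Rate the model answer for correctness and helpfulness."
    )


# Made-up example row, for illustration only.
prompt = build_judge_prompt(
    human_annotation="A golden retriever lying on a red sofa in a living room.",
    question="What animal is in the picture?",
    gpt4_ha_answer="A golden retriever (a dog).",
    model_answer="A dog is lying on a couch.",
)
print(prompt)
# The judge's per-question scores are averaged per model to produce the ranking.
```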

New Results

| Rank | Model | Score |
|------|-------|-------|
| 🏅️ | GPT-4V | 803.5 |
| 🥈 | CogVLM | 742.0 |
| 🥉 | Qwen-VL | 711.6 |
| 4 | Emu2 | 703.8 |
| 5 | mPLUG-Owl | 605.4 |
| 6 | LLaVA | 602.7 |
| 7 | LLaMA-AdapterV2 | 590.1 |
| 8 | InstructBLIP | 552.4 |
| 9 | MiniGPT4 | 531.7 |
| 10 | PandaGPT | 488.5 |

Evaluation Results

Run Evaluation

Read image
```python
import io
import base64

import pandas as pd
from PIL import Image


def decode_base64_to_image(base64_string):
    # Images are stored as base64-encoded strings in the TSV.
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image


df = pd.read_csv("touchstone_20230831.tsv", sep='\t')

index = 0
image = decode_base64_to_image(df.iloc[index]['image'])
question = df.iloc[index]['question']
human_annotation = df.iloc[index]['human_annotation']
gpt4_ha_answer = df.iloc[index]['gpt4_ha_answer']
category = df.iloc[index]['category']
task_name = df.iloc[index]['task_name']
```
Format requirement
  • The submitted file should be in CSV format with the delimiter set as '\t'.
  • The submitted file must contain the following fields: index, question, human_annotation, gpt4_ha_answer, category, task_name, and response. The "response" field represents the model's answer, while the other fields should match the evaluation dataset file (see the sketch after this list).
  • The number of rows in the submission.xlsx file (excluding the header) should be consistent with the evaluation dataset, which is 908 rows.
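One way to assemble a file in this format is to copy the required columns from the evaluation dataset and append your model's answers as a response column. This is a minimal sketch: it assumes the TSV exposes an `index` column alongside the other required fields, `my_model_answer` is a placeholder for your own inference code, and the output filename is only an example.

```python
import pandas as pd

# Load the evaluation dataset (908 rows, tab-separated).
df = pd.read_csv("touchstone_20230831.tsv", sep='\t')


def my_model_answer(row):
    # Placeholder: decode row['image'], run your LVLM on the image and
    # row['question'], and return the model's textual answer.
    return "model answer for this dialogue"


# Keep the required fields and add the model's responses.
submission = df[['index', 'question', 'human_annotation',
                 'gpt4_ha_answer', 'category', 'task_name']].copy()
submission['response'] = df.apply(my_model_answer, axis=1)

# Write with '\t' as the delimiter, as required by the format above.
submission.to_csv("submission.csv", sep='\t', index=False)
```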

The evaluation script is provided in eval.py.

```bash
python eval.py submit_file openai_key --model-name your_model
```

Citation

```bibtex
@misc{bai2023touchstone,
      title={TouchStone: Evaluating Vision-Language Models by Language Models},
      author={Shuai Bai and Shusheng Yang and Jinze Bai and Peng Wang and Xingxuan Zhang and Junyang Lin and Xinggang Wang and Chang Zhou and Jingren Zhou},
      year={2023},
      eprint={2308.16890},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
