OFA-Sys/TouchStone
TOUCHSTONE is a comprehensive assessment of multimodal language models, encompassing not only basic recognition and comprehension but also extending to literary creation. By using strong LLMs as judges and converting multimodal information into text, our TouchStone allows for efficient and accurate assessment of dialogue quality, leveraging the power of advanced language models without the need for manual intervention.
TouchStone is a diverse and comprehensive dataset that covers five key dimensions: Basic Descriptive Ability, Visual Recognition Ability, Visual Comprehension Ability, Visual Storytelling Ability, and Multi-image Analysis Ability. You can download the dataset here.
Our dataset currently places more emphasis on assessing basic abilities. Recognition questions make up the largest share at about 44.1%, followed by comprehension questions at 29.6%. The remaining categories are basic descriptive ability at 15.3%, visual storytelling ability at 7.4%, and multi-image analysis ability at 3.6%. The dataset contains 908 dialogues in total.
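The breakdown above can be recomputed from the dataset file. The following is a minimal sketch, assuming the `category` column of the TSV holds the dimension labels; the helper name `category_shares` and the toy labels are hypothetical and not part of the repository.

```python
import pandas as pd


def category_shares(df: pd.DataFrame, column: str = "category") -> pd.Series:
    """Return the percentage of rows per category, largest share first."""
    shares = df[column].value_counts(normalize=True) * 100
    return shares.round(1)


# Toy data for illustration only; the labels below are made up.
toy = pd.DataFrame(
    {"category": ["recognition", "recognition", "comprehension", "storytelling"]}
)
print(category_shares(toy))
```

On the real data, the same call would be applied to the DataFrame loaded from `touchstone_20230831.tsv`.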
TouchStone leverages fine-grained annotation and strong LLMs to evaluate LVLMs. First, fine-grained descriptions of the images are obtained through manual annotation and inspection. These descriptions, along with the questions, are fed into GPT-4 (text-only) to generate reference answers. In parallel, each LVLM takes the visual signals and questions directly as input and generates its own answers. GPT-4 then scores the generated answers, using the reference answers, questions, and fine-grained descriptions as context. The final scores are averaged and used to rank the models, representing their comprehensive performance.
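The final averaging and ranking step can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the function name `rank_models`, the input layout (one list of judge scores per model), and the example numbers are all hypothetical, and the sketch does not reproduce how the leaderboard scores are scaled.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_models(scores_by_model: Dict[str, List[float]]) -> List[Tuple[int, str, float]]:
    """Average each model's per-question scores and rank by the mean, highest first."""
    # Models with no scores are skipped so that mean() never sees an empty list.
    averages = {name: mean(scores) for name, scores in scores_by_model.items() if scores}
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, avg) for rank, (name, avg) in enumerate(ordered, start=1)]


# Hypothetical judge scores, not real results.
example = {
    "model_a": [7.0, 8.0, 9.0],
    "model_b": [6.0, 6.0, 9.0],
}
for rank, name, avg in rank_models(example):
    print(rank, name, round(avg, 2))
```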
Rank | Model | Score |
---|---|---|
🏅️ | GPT-4V | 803.5 |
🥈 | CogVLM | 742.0 |
🥉 | Qwen-VL | 711.6 |
4 | Emu2 | 703.8 |
5 | mPLUG-Owl | 605.4 |
6 | LLaVA | 602.7 |
7 | LLaMA-AdapterV2 | 590.1 |
8 | InstructBLIP | 552.4 |
9 | MiniGPT4 | 531.7 |
10 | PandaGPT | 488.5 |
Read image
```python
import base64
import io

import pandas as pd
from PIL import Image


def decode_base64_to_image(base64_string):
    # Decode the base64 payload and open it as a PIL image.
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image


# The dataset is a tab-separated file with one dialogue per row.
df = pd.read_csv("touchstone_20230831.tsv", sep='\t')

index = 0
image = decode_base64_to_image(df.iloc[index]['image'])
question = df.iloc[index]['question']
human_annotation = df.iloc[index]['human_annotation']
gpt4_ha_answer = df.iloc[index]['gpt4_ha_answer']
category = df.iloc[index]['category']
task_name = df.iloc[index]['task_name']
```
Format requirement
- The submitted file should be in CSV format with the delimiter set as '\t'.
- The submitted file must contain the following fields: index, question, human_annotation, gpt4_ha_answer, category, task_name, and response. The "response" field holds the model's answer, while the other fields should match the evaluation dataset file.
- The number of rows in the submitted file (excluding the header) should match the evaluation dataset, which has 908 rows.
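The format requirements above can be checked before submitting. Below is a minimal sketch, assuming the evaluation dataset already contains an `index` column alongside the other listed fields; the helper names `build_submission` and `check_submission`, the output file name, and `my_model_answers` are hypothetical and not part of the repository.

```python
import pandas as pd

REQUIRED_FIELDS = [
    "index", "question", "human_annotation", "gpt4_ha_answer",
    "category", "task_name", "response",
]
EXPECTED_ROWS = 908


def build_submission(dataset: pd.DataFrame, responses: list) -> pd.DataFrame:
    """Attach one model response per dataset row and keep only the required fields."""
    if len(responses) != len(dataset):
        raise ValueError("need exactly one response per dataset row")
    submission = dataset.copy()
    submission["response"] = list(responses)
    return submission[REQUIRED_FIELDS]


def check_submission(submission: pd.DataFrame, expected_rows: int = EXPECTED_ROWS) -> None:
    """Raise ValueError if required fields are missing or the row count is off."""
    missing = [field for field in REQUIRED_FIELDS if field not in submission.columns]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if len(submission) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(submission)}")


# Usage sketch (placeholders: my_model_answers, submission.tsv):
# dataset = pd.read_csv("touchstone_20230831.tsv", sep="\t")
# submission = build_submission(dataset, my_model_answers)
# check_submission(submission)
# submission.to_csv("submission.tsv", sep="\t", index=False)
```

Selecting only the required fields drops the base64 `image` column, which keeps the submitted file small.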
The evaluation script is provided in eval.py.
python eval.py submit_file openai_key --model-name your_model
```
@misc{bai2023touchstone,
    title={TouchStone: Evaluating Vision-Language Models by Language Models},
    author={Shuai Bai and Shusheng Yang and Jinze Bai and Peng Wang and Xingxuan Zhang and Junyang Lin and Xinggang Wang and Chang Zhou and Jingren Zhou},
    year={2023},
    eprint={2308.16890},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```