InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites - A Pioneering Open-Source Alternative to GPT-4V

[Update Blog] [Paper] [InternVL 1.5 Technical Report] [Chat Demo] [HuggingFace Demo] [Quick Start] [Community-hosted API] [Explanation in Chinese]


News 🚀🚀🚀

  • 2024/05/13: 🔥 InternVL can now be used as the text encoder for diffusion models, natively supporting multilingual generation in over 110 languages. See MuLan for more details.
  • 2024/04/28: We release the INT8 version of InternVL-Chat-V1-5, see HF link.
  • 2024/04/28: We achieve SOTA performance (75.74) on the Infographics VQA benchmark, see here.
  • 2024/04/18: InternVL-Chat-V1.5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
  • 2024/02/27: InternVL is accepted by CVPR 2024! 🎉
  • 2024/02/24: InternVL-Chat models have been included in VLMEvalKit.
  • 2024/02/21: InternVL-Chat-V1.2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
  • 2024/02/12: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog and SFT data, or try our demo. The model is now available on HuggingFace, and both the training/evaluation data and scripts are open-sourced.
  • 2024/02/04: InternVL-Chat-V1.1 achieves 44.67% on MMVP, higher than GPT-4V!
  • 2024/01/27: We release the 448 resolution model, achieving 76.6 on MMBench dev, see here.
  • 2024/01/24: InternVL-Chat-V1.1 is released. It supports Chinese and has stronger OCR capability; see here or try our demo.
  • 2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

Documents

  • How to install the environment? [link]
  • How to reproduce the SFT stage of InternVL-Chat-V1.2? [link]
  • How to fine-tune InternVL-Chat-V1.2 on a custom dataset? [link]
  • How to evaluate InternVL-Chat-V1-5? [link]
  • How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (recommended) [link]
  • How to deploy a local demo? [link]
  • How to run InternVL 1.5-8bit on an Nvidia V100 GPU? [link] [tutorial in Chinese]
  • How to perform batch inference? [link]
  • Inference acceleration with LMDeploy [link] [tutorial in Chinese]

Compared with SOTA VLLMs

(Benchmark comparison figures.)

What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with the LLM.
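For a quick sense of that scale, the minimal sketch below reuses the InternViT-6B checkpoint from the Quick Start section, loads the vision encoder, and counts its parameters; the count should come out around the 5.9B reported in the tables below.

```python
# Minimal sketch: load the InternViT-6B vision encoder and count its parameters.
# Reuses the same checkpoint as the Quick Start section; expect roughly 5.9B.
import torch
from transformers import AutoModel

vit = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True)

num_params = sum(p.numel() for p in vit.parameters())
print(f'{num_params / 1e9:.1f}B parameters')
```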

Model Zoo

Vision Large Language Model

| Model | Date | Download | Note |
| ----- | ---- | -------- | ---- |
| Mini-InternVL-Chat-2B-V1.5 (preview version) | 2024.05.19 | 🤗 HF link | 🚀🚀 Only 2B parameters, anyone can deploy it locally. |
| InternVL-Chat-V1.5-Int8 | 2024.04.28 | 🤗 HF link | the INT8 version of InternVL-Chat-V1-5 |
| InternVL-Chat-V1.5 | 2024.04.18 | 🤗 HF link | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥 new) |
| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 HF link | more SFT data and stronger |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 HF link | scales the LLM up to 34B |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 HF link | supports Chinese and stronger OCR |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |

Vision-Language Foundation Model

| Model | Date | Download | Note |
| ----- | ---- | -------- | ---- |
| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 HF link | supports dynamic resolution, super strong OCR (🔥 new) |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 HF link | 448 resolution |
| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 HF link | 448 resolution |
| InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |

What can InternVL do?

Visual Perception (click to expand)
  • Linear-Probe Image Classification [see details]

    ViT-22B uses the private JFT-3B dataset.

| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------ | ------ | ----- | ------- | ----- | ---- | ---- | --------- |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
  • Semantic Segmentation [see details]

| method | decoder | #param (train/total) | crop size | mIoU |
| ------ | ------- | -------------------- | --------- | ---- |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
  • Zero-Shot Image Classification [see details]

| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ------ | ----- | ---- | ---- | ----- | --------- | --------- |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
  • Multilingual Zero-Shot Image Classification [see details]

    EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ------ | ---------- | ---------- | ---------- | ---------- | ---------- |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
  • Zero-Shot Video Classification [see details]

| method | #frame | K400 | K600 | K700 |
| ------ | ------ | ---- | ---- | ---- |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
Cross-Modal Retrieval (click to expand)
  • English Zero-Shot Image-Text Retrieval [see details]

| model | Flickr30K image→text R@1/R@5/R@10 | Flickr30K text→image R@1/R@5/R@10 | COCO image→text R@1/R@5/R@10 | COCO text→image R@1/R@5/R@10 | avg |
| ----- | --------------------------------- | --------------------------------- | ---------------------------- | ---------------------------- | --- |
| OpenCLIP-G | 92.9 / 99.3 / 99.8 | 79.5 / 95.0 / 97.1 | 67.3 / 86.9 / 92.6 | 51.4 / 74.9 / 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 / 99.4 / 99.8 | 78.8 / 94.2 / 96.8 | 68.8 / 87.8 / 92.8 | 51.1 / 75.0 / 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 / 99.6 / 99.9 | 80.8 / 95.5 / 97.6 | 70.3 / 89.3 / 93.9 | 53.0 / 76.0 / 83.4 | 86.2 |
| InternVL-C (ours) | 94.7 / 99.6 / 99.9 | 81.7 / 96.0 / 98.2 | 70.6 / 89.0 / 93.5 | 54.1 / 77.3 / 84.6 | 86.6 |
| InternVL-G (ours) | 95.7 / 99.7 / 99.9 | 85.0 / 97.0 / 98.6 | 74.9 / 91.3 / 95.2 | 58.6 / 81.3 / 88.0 | 88.8 |
  • Chinese Zero-Shot Image-Text Retrieval [see details]

| model | Flickr30K-CN image→text R@1/R@5/R@10 | Flickr30K-CN text→image R@1/R@5/R@10 | COCO-CN image→text R@1/R@5/R@10 | COCO-CN text→image R@1/R@5/R@10 | avg |
| ----- | ------------------------------------ | ------------------------------------ | ------------------------------- | ------------------------------- | --- |
| CN-CLIP-ViT-H | 81.6 / 97.5 / 98.8 | 71.2 / 91.4 / 95.5 | 63.0 / 86.6 / 92.9 | 69.2 / 89.9 / 96.1 | 86.1 |
| OpenCLIP-XLM-R-H | 86.1 / 97.5 / 99.2 | 71.0 / 90.5 / 94.9 | 70.0 / 91.5 / 97.0 | 66.1 / 90.8 / 96.0 | 87.6 |
| InternVL-C (ours) | 90.3 / 98.8 / 99.7 | 75.1 / 92.9 / 96.4 | 68.8 / 92.0 / 96.7 | 68.9 / 91.9 / 96.5 | 89.0 |
| InternVL-G (ours) | 92.9 / 99.4 / 99.8 | 77.7 / 94.8 / 97.3 | 71.4 / 93.9 / 97.7 | 73.8 / 94.4 / 98.1 | 90.9 |
  • Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ------ | -- | -- | -- | -- | -- | -- | -- | -- | ------- |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
Multimodal Dialogue (see "Compared with SOTA VLLMs")

Quick Start with Huggingface

using InternViT-6B (click to expand)
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
```
using InternVL-C(ontrastive) and InternVL-G(enerative) (click to expand)
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
```
using InternVL-Chat (click to expand)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height  # calculate the existing image aspect ratio

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = "OpenGVLab/InternVL-Chat-V1-5"
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
# model = AutoModel.from_pretrained(
#     path,
#     torch_dtype=torch.bfloat16,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
#     device_map='auto').eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

# single-round single-image conversation
question = "请详细描述图片"  # Please describe the picture in detail
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)

# multi-round single-image conversation
question = "请详细描述图片"  # Please describe the picture in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "请根据图片写一首诗"  # Please write a poem according to the picture
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)

# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片"  # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么"  # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)

# batch inference (single image per sample)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
image_counts = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ["Describe the image in detail."] * len(image_counts)
responses = model.batch_chat(tokenizer, pixel_values,
                             image_counts=image_counts,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(question)
    print(response)
```

Inference Acceleration by LMDeploy

We recommend using LMDeploy if you need to optimize InternVL-Chat model inference.

In the following subsections, we will introduce the usage of LMDeploy with the InternVL-Chat-V1-5 model as an example.

First of all, please set up the inference environment as follows:

```shell
conda create -n internvl python=3.10 -y
conda activate internvl
pip install timm torchvision==0.17.2
pip install lmdeploy
```

The LMDeploy PyPI package depends on CUDA 12.x by default. For a CUDA 11.x environment, please refer to the installation guide.

Offline Inference Pipeline

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
image = load_image('examples/image2.jpg')
response = pipe(('describe this image', image))
print(response)
```

For more on using the VLM pipeline, including multi-image inference and multi-turn chat, please refer to this guide.
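As a rough illustration, the same pipeline object can also take several images per prompt or a list of prompts for batch inference. The sketch below is a minimal, unverified example based on the LMDeploy VLM pipeline documentation; the image paths are placeholders taken from the examples above.

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')

# multi-image input: pass a list of images together with one prompt
images = [load_image('examples/image1.jpg'), load_image('examples/image2.jpg')]
response = pipe(('describe these two images', images))
print(response)

# batch inference: pass a list of (prompt, image) tuples
prompts = [('describe this image', load_image(p))
           for p in ('examples/image1.jpg', 'examples/image2.jpg')]
responses = pipe(prompts)
print(responses)
```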

Online Inference Service

LMDeploy supports one-click packaging of a VLM model into a service compatible with the OpenAI API, providing seamless integration with existing OpenAI clients.

The service can be launched with a single command:

```shell
lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5
```

The arguments of api_server can be viewed with the command lmdeploy serve api_server -h, for instance, --tp to set tensor parallelism, --session-len to specify the maximum context window length, and --cache-max-entry-count to adjust the GPU memory ratio for the k/v cache.
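Since the server speaks the OpenAI API, a standard OpenAI client can talk to it. The snippet below is a minimal sketch, assuming the server runs locally on LMDeploy's default port 23333 and that the openai Python package is installed; the image URL is a placeholder.

```python
from openai import OpenAI

# point the official OpenAI client at the local LMDeploy server
client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id  # ask the server which model it serves

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image.'},
            {'type': 'image_url',
             'image_url': {'url': 'https://example.com/some_image.jpg'}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```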

For more details, including service startup with Docker, the RESTful API, and OpenAI integration methods, please refer to this guide.

License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

```bibtex
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
```

Acknowledgement

InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their awesome work!


If you want to join our WeChat group, please scan the QR code below to add our assistant as a WeChat friend:

(WeChat group QR code)
