Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.16821 (cs)

[Submitted on 25 Apr 2024 (v1), last revised 29 Apr 2024 (this version, v2)]

Title:How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Authors:Zhe Chen,Weiyun Wang,Hao Tian,Shenglong Ye,Zhangwei Gao,Erfei Cui,Wenwen Tong,Kongzhi Hu,Jiapeng Luo,Zheng Ma,Ji Ma,Jiaqi Wang,Xiaoyi Dong,Hang Yan,Hewei Guo,Conghui He,Botian Shi,Zhenjiang Jin,Chao Xu,Bin Wang,Xingjian Wei,Wei Li,Wenjian Zhang,Bo Zhang,Pinlong Cai,Licheng Wen,Xiangchao Yan,Min Dou,Lewei Lu,Xizhou Zhu,Tong Lu,Dahua Lin,Yu Qiao,Jifeng Dai,Wenhai Wang

View PDF HTML (experimental)

Abstract:In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released atthis https URL.

Comments:	Technical report
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.16821 [cs.CV]
	(orarXiv:2404.16821v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.16821

Submission history

From: Zhe Chen [view email]
[v1] Thu, 25 Apr 2024 17:59:19 UTC (6,452 KB)
[v2] Mon, 29 Apr 2024 20:24:30 UTC (7,083 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.CV

< prev | next >

new |recent |2024-04

Change to browse by:

References & Citations

export BibTeX citation

Bookmark

Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer(What is the Explorer?)

Connected Papers Toggle

Connected Papers(What is Connected Papers?)

Litmaps Toggle

Litmaps(What is Litmaps?)

scite.ai Toggle

scite Smart Citations(What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv(What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers(What is CatalyzeX?)

DagsHub Toggle

DagsHub(What is DagsHub?)

GotitPub Toggle

Gotit.pub(What is GotitPub?)

Huggingface Toggle

Hugging Face(What is Huggingface?)

Links to Code Toggle

Papers with Code(What is Papers with Code?)

ScienceCast Toggle

ScienceCast(What is ScienceCast?)

Demos

Replicate Toggle

Replicate(What is Replicate?)

Spaces Toggle

Hugging Face Spaces(What is Spaces?)

Spaces Toggle

TXYZ.AI(What is TXYZ.AI?)

Recommenders and Search Tools

Link to Influence Flower

Influence Flower(What are Influence Flowers?)

Core recommender toggle

CORE Recommender(What is CORE?)

Author
Venue
Institution
Topic

About arXivLabs

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community?Learn more about arXivLabs.

Movatterモバイル変換