LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng*

LLaVA-Mini is a unified large multimodal model (LMM) that supports the understanding of images, high-resolution images, and videos in an efficient manner. Guided by interpretability analyses of LMMs, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. The model and demo of LLaVA-Mini are available now!

Note

LLaVA-Mini requires only 1 token to represent each image, which improves the efficiency of image and video understanding, including:

  • Computational effort: 77% reduction in FLOPs
  • Response latency: reduced from 100 ms to 40 ms
  • VRAM usage: reduced from 360 MB/image to 0.6 MB/image, enabling 3-hour video processing

performance

💡Highlight:

  1. Good Performance: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (a compression rate of 0.17%).
  2. High Efficiency: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and can process over 10,000 frames of video on GPU hardware with 24 GB of memory.
  3. Insights: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conducted a preliminary analysis of how large multimodal models (LMMs) process visual tokens. Please refer to our paper for the detailed analysis and conclusions.
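
For reference, these figures are mutually consistent: keeping 1 of the original 576 vision tokens corresponds to a compression rate of 1/576 ≈ 0.17%, and, assuming per-image memory scales roughly linearly with the number of cached vision tokens, 360 MB / 576 ≈ 0.6 MB per image.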

🖥 Demo

llava_mini

  • Download the LLaVA-Mini model (ICTNLP/llava-mini-llama-3.1-8b).

  • Run these scripts and interact with LLaVA-Mini in your browser:

    # Launch a controller
    python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

    # Build the API of LLaVA-Mini (if VRAM is less than 20 GB, try using --load-8bit)
    CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 \
        --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 \
        --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

    # Start the interactive interface
    python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 \
        --model-list-mode reload --port 7860
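
Once all three processes are running, the web interface should be reachable at http://localhost:7860 (the --port value passed to gradio_web_server above). If ports 10000 or 40000 are already in use, change them consistently across the controller and worker commands.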

🔥 Quick Start

Requirements

  • Install packages:

    conda create -n llavamini python=3.10 -y
    conda activate llavamini
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
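
Note that the pip install -e . commands above assume this repository (ictnlp/LLaVA-Mini) has already been cloned and that they are run from its root directory.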

Command Interaction

  • Image understanding, using --image-file.

  • If VRAM is less than 20 GB, try using --load-8bit.

    # Image understanding
    CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
        --model-path ICTNLP/llava-mini-llama-3.1-8b \
        --image-file llavamini/serve/examples/baby_cake.png \
        --conv-mode llava_llama_3_1 --model-name "llava-mini" \
        --query "What's the text on the cake?"
  • Video understanding, using --video-file:

    # Video understanding
    CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
        --model-path ICTNLP/llava-mini-llama-3.1-8b \
        --video-file llavamini/serve/examples/fifa.mp4 \
        --conv-mode llava_llama_3_1 --model-name "llava-mini" \
        --query "What happened in this video?"

Reproduction and Evaluation

  • Refer to Evaluation.md for the evaluation of LLaVA-Mini on image/video benchmarks.

Cases

  • LLaVA-Mini achieves high-quality image understanding and video understanding.

case1

More cases

case2

case3

case4

  • LLaVA-Mini dynamically compresses images to capture important visual information (brighter areas are weighted more heavily during compression); a conceptual sketch of this weighting follows the figure below.

compression
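
For intuition, the sketch below shows one way such weighting can be realized: a learnable query cross-attends over the 576 patch tokens and pools them into a single vision token, with the attention weights playing the role of the brightness map above. This is only an illustrative PyTorch sketch; the module name, dimensions, and structure are assumptions for exposition, not the repository's actual implementation.

    # Illustrative sketch only: query-based cross-attention pooling that compresses
    # many vision tokens into one. Names, shapes, and structure are assumptions,
    # not the exact LLaVA-Mini implementation.
    import torch
    import torch.nn as nn

    class VisionTokenCompressor(nn.Module):
        def __init__(self, dim: int = 1024, num_compressed: int = 1):
            super().__init__()
            # Learnable query token(s) that attend over all vision tokens.
            self.query = nn.Parameter(torch.randn(num_compressed, dim) * 0.02)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)

        def forward(self, vision_tokens: torch.Tensor):
            # vision_tokens: (batch, num_patches, dim), e.g. (B, 576, 1024)
            B = vision_tokens.size(0)
            q = self.query.unsqueeze(0).expand(B, -1, -1)   # (B, 1, dim)
            k = self.to_k(vision_tokens)                     # (B, 576, dim)
            v = self.to_v(vision_tokens)                     # (B, 576, dim)
            # Attention over patches; larger weights correspond to the brighter
            # regions in the visualization above.
            attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
            compressed = attn @ v                            # (B, 1, dim)
            return compressed, attn

    # Example: compress 576 patch tokens into a single vision token.
    tokens = torch.randn(2, 576, 1024)
    compressed, weights = VisionTokenCompressor()(tokens)
    print(compressed.shape, weights.shape)  # (2, 1, 1024) and (2, 1, 576)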

🤝 Acknowledgement

  • LLaVA: LLaVA-Mini is built upon the LLaVA codebase, a large language and vision assistant.
  • Video-ChatGPT: The training of LLaVA-Mini involves the video instruction data provided by Video-ChatGPT.
  • LLaVA-OneVision: The training of LLaVA-Mini involves the image instruction data provided by LLaVA-OneVision.

🖋 Citation

If this repository is useful for you, please cite as:

@misc{llavamini,
      title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token},
      author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
      year={2025},
      eprint={2501.03895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03895},
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei20z@ict.ac.cn.
