large-multimodal-models

Here are 78 public repositories matching this topic...

Language: All

All (78), Python (63), Jupyter Notebook (3)

Sort: Most stars

VITA-MLLM/VITA

Stars: 2.4k

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

multimodal-large-language-models large-multimodal-models

Updated Mar 28, 2025
Python

OpenAdaptAI/OpenAdapt

Stars: 1.4k

Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models

python transformers openai agents process-mining ai-agents process-automation huggingface ultralytics large-language-models anthropic segment-anything ai-agents-framework large-multimodal-models google-gemini large-action-model omniparser gpt4o generative-process-automation computer-use

Updated Mar 16, 2025
Python

NVlabs/describe-anything

Stars: 1.4k

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

vision-language-model large-multimodal-models describe-anything detailed-localized-captioning

Updated Jun 26, 2025
Python

ShareGPT4Omni/ShareGPT4Video

Stars: 1.1k

[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"

gpt sora text-to-video large-language-models chatgpt large-vision-language-models large-multimodal-models gpt-4v large-video-language-models

Updated Oct 9, 2024
Python

TinyLLaVA/TinyLLaVA_Factory

Stars: 914

A Framework of Small-scale Large Multimodal Models

nlp transformers llama vision-language llava large-multimodal-models tinyllama

Updated Apr 26, 2025
Python

richard-peng-xia/awesome-multimodal-in-medical-imaging

Stars: 861

A collection of resources on applications of multi-modal learning in medical imaging.

medical-imaging multimodal-learning visual-question-answering multimodal-deep-learning large-language-models medical-report-generation multimodal-large-language-models large-multimodal-models

Updated Aug 26, 2025

LLaVA-VL/LLaVA-Plus-Codebase

Stars: 760

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

agent tool-use large-language-models multimodal-large-language-models large-multimodal-models

Updated Feb 1, 2024
Python

ictnlp/LLaVA-Mini

Stars: 537

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

video efficient vision llama multimodal large-language-models vision-language-model llava visual-instruction-tuning multimodal-large-language-models gpt4v large-multimodal-models gpt4o

Updated Jun 29, 2025
Python

MMMU-Benchmark/MMMU

Stars: 513

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

machine-learning natural-language-processing deep-neural-networks computer-vision deep-learning evaluation question-answering stem multimodality multimodal-learning visual-question-answering multimodal multimodal-deep-learning foundation-models large-language-models llm llms large-multimodal-models

Updated May 19, 2025
Python

xiaoachen98/Open-LLaVA-NeXT

Stars: 425

An open-source implementation for training LLaVA-NeXT.

chatbot llama multimodal multi-modality gpt-4 visual-language-learning chatgpt vision-language-model llava large-multimodal-models llama3 gpt4o llava-next

Updated Oct 23, 2024
Python

shikiw/OPERA

Stars: 377

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

chatbot llama multimodal gpt-4 chatgpt vision-language-model vision-language-learning large-multimodal-models

Updated Aug 24, 2024
Python

ictnlp/Stream-Omni

Stars: 354

Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.

chatbot speech tts speech-synthesis vision speech-recognition question-answering llama speech-to-text interaction asr large-language-models llm chatgpt vision-language-model large-multimodal-models gpt-4o mutlimodal

Updated Jun 17, 2025
Python

zjysteven/lmms-finetune

Stars: 352

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.

finetuning multimodal vision-language foundation-models instruction-tuning large-language-model llava visual-instruction-tuning multimodal-large-language-models large-multimodal-models qwen-vl llava-next