multimodal-large-language-models

Here are 380 public repositories matching this topic...

Repositories by language: All 380, Python 263, Jupyter Notebook 54, HTML 6, JavaScript 3, C++ 2, Java 2, TypeScript 2, C# 1, CSS 1

BradyFU/Awesome-Multimodal-Large-Language-Models

Star 17k

✨✨Latest Advances on Multimodal Large Language Models

multi-modality instruction-following in-context-learning large-language-models chain-of-thought instruction-tuning visual-instruction-tuning large-vision-language-model multimodal-instruction-tuning large-vision-language-models multimodal-large-language-models multimodal-in-context-learning multimodal-chain-of-thought

Updated Dec 12, 2025

X-PLUG/MobileAgent

Star 6.7k

Mobile-Agent: The Powerful GUI Agent Family

android agent app gui automation mobile copilot multimodal mobile-agents mllm multimodal-large-language-models multimodal-agent

Updated Dec 2, 2025
Python

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

svg vlm llm multimodal-large-language-models

Updated Nov 7, 2025
Python

ictnlp/LLaMA-Omni

Star 3.1k

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

speech-to-text speech-to-speech large-language-models multimodal-large-language-models speech-language-model speech-interaction

Updated May 19, 2025
Python

VITA-MLLM/VITA

Star 2.5k

✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

multimodal-large-language-models large-multimodal-models omni-modal-video-understanding omni-language-model omni-model

Updated Mar 28, 2025
Python

X-PLUG/mPLUG-DocOwl

Star 2.3k

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

multimodal table-understanding document-understanding mllm multimodal-large-language-models chart-understanding

Updated May 30, 2025
Python

cambrian-mllm/cambrian

Star 2k

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

computer-vision chatbot representation-learning clip dino large-language-models llms instruction-tuning mllm multimodal-large-language-models

Updated Nov 7, 2025
Python

sherlockchou86/VideoPipe

Star 1.9k

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

opencv ai deep-learning gstreamer cv feature-extraction openai image-classification face-recognition object-detection deepstream image-segmentation similarity-search video-analysis license-plate-recognition reid behaviour-analysis llm multimodal-large-language-models ollama

Updated Nov 5, 2025
C++

YangLing0818/RPG-DiffusionMaster

Star 1.8k

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

text-to-image image-editting large-language-models multimodal-large-language-models

Updated Feb 1, 2025
Jupyter Notebook

ByteDance-Seed/Seed1.5-VL

Star 1.5k

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

cookbook large-language-model vision-language-model multimodal-large-language-models

Updated Jun 14, 2025
Jupyter Notebook

AIDC-AI/Ovis

Star 1.4k

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot multimodality multimodal vision-language-model multimodal-large-language-models vision-language-learning qwen llama3

Updated Sep 22, 2025
Python

Henry-23/VideoChat

Star 1.2k

Real-time voice-interactive digital human, supporting end-to-end voice solutions (GLM-4-Voice - THG) and cascaded solutions (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.

streaming real-time end-to-end tts lip-sync dialogue-systems asr talking-head digital-human multimodal-large-language-models musetalk gradio-python-app