multimodal-large-language-models
Here are 380 public repositories matching this topic...
✨✨Latest Advances on Multimodal Large Language Models
- Updated Dec 12, 2025
Mobile-Agent: The Powerful GUI Agent Family
- Updated Dec 2, 2025 - Python
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
- Updated Nov 7, 2025 - Python
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
- Updated May 19, 2025 - Python
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
- Updated Mar 28, 2025 - Python
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
- Updated May 30, 2025 - Python
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
- Updated Nov 7, 2025 - Python
A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)
- Updated Nov 5, 2025 - C++
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
- Updated Feb 1, 2025 - Jupyter Notebook
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
- Updated Jun 14, 2025 - Jupyter Notebook
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
- Updated Sep 22, 2025 - Python
Real-time voice-interactive digital human, supporting end-to-end voice solutions (GLM-4-Voice - THG) and cascaded solutions (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.
- Updated Oct 31, 2025 - Python
Awesome Unified Multimodal Models
- Updated Aug 17, 2025
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
- Updated Oct 24, 2025 - Python
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
- Updated Dec 15, 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
- Updated Nov 14, 2025
A collection of resources on applications of multi-modal learning in medical imaging.
- Updated Aug 26, 2025
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
- Updated Feb 1, 2024 - Python
Large-Scale Visual Representation Model
- Updated Dec 8, 2025 - Python