#multimodal-large-language-models

Here are 380 public repositories matching this topic...

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

  • Updated Nov 7, 2025
  • Python

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

  • Updated May 19, 2025
  • Python

✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Updated Mar 28, 2025
  • Python

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

  • Updated May 30, 2025
  • Python

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

  • Updated Nov 7, 2025
  • Python

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star.

  • Updated Nov 5, 2025
  • C++

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

  • Updated Feb 1, 2025
  • Jupyter Notebook

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

  • Updated Jun 14, 2025
  • Jupyter Notebook

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

  • Updated Sep 22, 2025
  • Python

Real-time voice-interactive digital human, supporting both end-to-end voice solutions (GLM-4-Voice - THG) and cascaded solutions (ASR-LLM-TTS-THG). Appearance and voice are customizable without training, voice cloning is supported, and first-packet latency is as low as 3 s.

  • Updated Oct 31, 2025
  • Python

A family of lightweight multimodal models.

  • Updated Nov 18, 2024
  • Python

A Framework for Speech, Language, Audio, Music Processing with Large Language Model

  • Updated Oct 24, 2025
  • Python

PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models

  • Updated Dec 15, 2025

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

  • Updated Feb 1, 2024
  • Python
