MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.
The easiest way to get started is to install the `mlx-vlm` package using pip:
```bash
pip install -U mlx-vlm
```
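A quick sanity check from Python after installing (this sketch assumes the package exposes a `__version__` attribute, which may vary between releases):

```python
# Verify the package imports correctly; __version__ is assumed to be exported.
import mlx_vlm

print(mlx_vlm.__version__)
```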
Generate output from a model using the CLI:
```bash
# Image generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg

# Audio generation (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
```
Launch a chat interface using Gradio:
```bash
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```
Here's an example of how to use MLX-VLM in a Python script:
```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# Apply chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image), num_audios=len(audio)
)

# Generate output
output = generate(
    model, processor, formatted_prompt, image, audio=audio, verbose=False
)
print(output)
```
Start the server:
```bash
mlx_vlm.server
```
The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
- `/generate` - Main generation endpoint with support for images, audio, and text
- `/chat` - Chat-style interaction endpoint
- `/responses` - OpenAI-compatible endpoint
- `/health` - Check server status
- `/unload` - Unload current model from memory
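For the lighter endpoints, a minimal Python sketch is shown below; it assumes the server is running on the default `localhost:8000`, that `/health` answers GET requests, and that `/unload` answers POST requests:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed default host and port

# Check that the server is up (assumes /health responds to GET).
health = requests.get(f"{BASE_URL}/health")
print(health.status_code, health.text)

# Unload the currently cached model to free memory (assumes /unload responds to POST).
unload = requests.post(f"{BASE_URL}/unload")
print(unload.status_code, unload.text)
```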
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
    "image": ["/path/to/repo/examples/images/renewables_california.png"],
    "prompt": "This is today'\''s chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?",
    "system": "You are a helpful assistant.",
    "stream": true,
    "max_tokens": 1000
  }'
```
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "audio": ["/path/to/audio1.wav", "https://example.com/audio2.mp3"],
    "prompt": "Describe what you hear in these audio files",
    "stream": true,
    "max_tokens": 500
  }'
```
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "image": ["/path/to/image.jpg"],
    "audio": ["/path/to/audio.wav"],
    "prompt": "",
    "max_tokens": 1000
  }'
```
```bash
curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "What is in this image?",
        "images": ["/path/to/image.jpg"]
      }
    ],
    "max_tokens": 100
  }'
```
```bash
curl -X POST "http://localhost:8000/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_tokens": 100
  }'
```
The endpoints accept the following request parameters:

- `model`: Model identifier (required)
- `prompt`: Text prompt for generation
- `image`: List of image URLs or local paths (optional)
- `audio`: List of audio URLs or local paths (optional, new)
- `system`: System prompt (optional)
- `messages`: Chat messages for chat/OpenAI endpoints
- `max_tokens`: Maximum tokens to generate
- `temperature`: Sampling temperature
- `top_p`: Top-p sampling parameter
- `stream`: Enable streaming responses
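Putting these parameters together, a minimal non-streaming Python client for `/generate` might look like the sketch below; it assumes the server is running on the default `localhost:8000` and that the response body is JSON:

```python
import requests

payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "prompt": "What is in this image?",
    "image": ["/path/to/image.jpg"],
    "system": "You are a helpful assistant.",
    "max_tokens": 200,
    "temperature": 0.0,
    "top_p": 1.0,
    "stream": False,  # set to True to receive streamed chunks instead
}

# Assumes the server was started with `mlx_vlm.server` on the default port.
response = requests.post("http://localhost:8000/generate", json=payload)
response.raise_for_status()
print(response.json())
```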
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```
```bash
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
```
MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.
The following models support video chat:
- Qwen2-VL
- Qwen2.5-VL
- Idefics3
- LLaVA
With more coming soon.
```bash
mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0
```
These examples demonstrate how to use multiple images and video inputs with MLX-VLM for more complex visual reasoning tasks.
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
To learn more about LoRA, please refer to the LoRA.md file.
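Once an adapter has been trained, it can typically be applied at load time. The snippet below is only a sketch: the `adapter_path` argument and the adapter directory name are assumptions, so check LoRA.md for the exact interface:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Hypothetical: load a base model together with trained LoRA adapter weights.
# The adapter_path argument is an assumption; see LoRA.md for the supported options.
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path, adapter_path="path/to/adapters")
config = load_config(model_path)

image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```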