MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
The easiest way to get started is to install the `mlx-vlm` package using pip:
pip install mlx-vlm
Generate output from a model using the CLI:
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
Launch a chat interface using Gradio:
python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
Here's an example of how to use MLX-VLM in a Python script:
```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
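As the comment above notes, the image list also accepts `PIL.Image.Image` objects in place of URLs or file paths. A minimal sketch that reuses the model, processor, config, and prompt loaded above; the local file path is hypothetical:

```python
from PIL import Image

# Hypothetical local file; model, processor, config, and prompt come from the example above
image = [Image.open("path/to/local_image.jpg")]
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```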
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
The following models support multi-image chat:
- Idefics 2
- LLaVA (Interleave)
- Qwen2-VL
- Phi3-Vision
- Pixtral
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare multiple images and a prompt
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

# Apply chat template and generate
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.
The following models support video chat:
- Qwen2-VL
- Qwen2.5-VL
- Idefics3
- LLaVA
With more coming soon.
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0
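For programmatic use, a rough workaround is to sample frames yourself and pass them to `generate()` as a list of images, exactly like the multi-image example above. The sketch below does this with OpenCV; the frame-sampling approach, frame count, and file path are assumptions rather than the library's video API, so prefer `mlx_vlm.video_generate` for real video handling.

```python
# Minimal sketch: approximate video input by sampling frames and treating them as multiple images.
import cv2
from PIL import Image

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Sample up to 8 evenly spaced frames from the video (path and count are assumptions)
video_path = "path/to/video.mp4"
capture = cv2.VideoCapture(video_path)
total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
step = max(total_frames // 8, 1)

frames = []
for index in range(0, total_frames, step):
    capture.set(cv2.CAP_PROP_POS_FRAMES, index)
    ok, frame = capture.read()
    if not ok:
        break
    # OpenCV returns BGR arrays; convert to RGB PIL images before passing to the processor
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
capture.release()

prompt = "Describe this video."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(frames))
output = generate(model, processor, formatted_prompt, frames, verbose=False)
print(output)
```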
These examples demonstrate how to use multiple images and video input with MLX-VLM for more complex visual reasoning tasks.
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
To learn more about LoRA, please refer to the LoRA.md file.