Llama 3.2 Vision Models#
Meta’s Llama 3.2 models introduce advanced capabilities in visual recognition, image reasoning, captioning, and answering general image-related questions.
The Llama 3.2 vision-language models are available in two parameter sizes: 11B and 90B. Each size is offered in both base and instruction-tuned versions, providing flexibility for various use cases.
Resources:
Hugging Face Llama 3.2 Collection: HF llama32 collection
Meta LLaMA Source Code: GitHub Repository
Import from Hugging Face to NeMo 2.0#
To import the Hugging Face (HF) model and convert it to NeMo 2.0 format, run the following command. This step only needs to be performed once:
```python
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm

if __name__ == '__main__':
    hf_model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    import_ckpt(
        model=vlm.MLlamaModel(vlm.MLlamaConfig11BInstruct()),
        source=f"hf://{hf_model_id}",
    )
```
The command above saves the converted file in the NeMo cache folder, located at ~/.cache/nemo.
If needed, you can change the default cache directory by setting the NEMO_MODELS_CACHE environment variable before running the script.
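For example, a minimal way to do this from Python is to set the variable before the import script runs; the path below is only a placeholder:

```python
import os

# Redirect the NeMo checkpoint cache to a custom location (placeholder path)
# before calling import_ckpt.
os.environ["NEMO_MODELS_CACHE"] = "/custom/cache/dir"
```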
NeMo 2.0 Fine-Tuning Recipes#
We provide pre-defined recipes for fine-tuning Llama 3.2 vision models (also known as MLlama) using NeMo 2.0 and NeMo-Run. These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the mllama_11b and mllama_90b files.
Note
The recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
By default, the non-instruct version of the model is loaded. To load a different model, set finetune.resume.restore_config.path=nemo://<hf_model_id> or finetune.resume.restore_config.path=<local_model_path>.
We provide an example below of how to invoke the default recipe:
```python
from nemo.collections import vlm

finetune = vlm.mllama_11b.finetune_recipe(
    name="mllama_11b_finetune",
    dir=f"/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora',  # 'lora', 'none'
)
```
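Continuing from the recipe above, the restore-path override described earlier would look like the sketch below; the HF model ID is only an example, so substitute your own model ID or local path:

```python
# Load the instruction-tuned checkpoint instead of the default non-instruct model
# (example value; any "nemo://<hf_model_id>" or local path works here).
finetune.resume.restore_config.path = "nemo://meta-llama/Llama-3.2-11B-Vision-Instruct"
```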
By default, the fine-tuning recipe applies LoRA to all linear layers in the language model, including cross-attention layers, while keeping the vision model unfrozen.
- To configure which layers LoRA is applied to, set finetune.peft.target_modules. For example, to apply LoRA only on the self-attention qkv projection layers, set finetune.peft.target_modules=["*.language_model.*.linear_qkv"] (see the sketch after this list).
- To freeze the vision model, set finetune.peft.freeze_vision_model=True.
- To fine-tune the entire model without LoRA, set peft_scheme='none' in the recipe argument.
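Putting the first two options together, a minimal sketch of adjusting the PEFT settings on the finetune recipe object, using the example values above:

```python
# Apply LoRA only to the self-attention qkv projections in the language model
finetune.peft.target_modules = ["*.language_model.*.linear_qkv"]

# Keep the vision model frozen during fine-tuning
finetune.peft.freeze_vision_model = True
```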
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which runs the fine-tuning locally in a separate process. You can use it as follows:
```python
import nemo_run as run

run.run(finetune, executor=run.LocalExecutor())
```
Alternatively, you can run it directly in the same Python process as follows:
```python
run.run(finetune, direct=True)
```
Bring Your Own Data#
Replace the MockDataModule in the default recipes with your custom dataset.
```python
from nemo.collections import vlm

# Define the fine-tuning recipe
finetune = vlm.mllama_11b.finetune_recipe(
    name="mllama_11b_finetune",
    dir=f"/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora',  # 'lora', 'none'
)

# The following is an example of a custom dataset configuration.
data_config = vlm.ImageDataConfig(
    image_folder="/path/to/images",
    conv_template="mllama",  # Customize based on your dataset needs
)

# Data module setup
custom_data = vlm.MLlamaPreloadedDataModule(
    paths="/path/to/dataset.json",  # Path to your llava-like dataset
    data_config=data_config,
    seq_length=6404,
    decoder_sequence_length=2048,
    global_batch_size=16,  # Global batch size
    micro_batch_size=1,  # Micro batch size
    tokenizer=None,  # Define your tokenizer if needed
    image_processor=None,  # Add an image processor if required
    num_workers=8,  # Number of workers for data loading
)

# Assign custom data to the fine-tuning recipe
finetune.data = custom_data
```
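The paths above are placeholders. As a rough illustration only, a LLaVA-style dataset.json typically contains records like the sketch below; the field names follow the common LLaVA convention and are an assumption, not a schema guaranteed by MLlamaPreloadedDataModule, so check them against your own data pipeline:

```python
import json

# Hypothetical LLaVA-style records; adjust the fields to match what your
# data pipeline actually expects.
example_records = [
    {
        "id": "0001",
        "image": "0001.jpg",  # relative to image_folder
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
            {"from": "gpt", "value": "A golden retriever playing in a park."},
        ],
    }
]

with open("dataset.json", "w") as f:
    json.dump(example_records, f, indent=2)
```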
A comprehensive list of the fine-tuning recipes that we currently support or plan to support soon is provided below for reference:
| Recipe | Status |
|---|---|
| Llama 3.2 11B LoRA | Yes |
| Llama 3.2 11B Full fine-tuning | Yes |
| Llama 3.2 90B LoRA | Yes |
| Llama 3.2 90B Full fine-tuning | Yes |