Quickstart with NeMo-Run#

This tutorial explains how to run any of the supported NeMo 2.0 recipes using NeMo-Run. We will demonstrate how to run a pretraining and fine-tuning recipe both locally and remotely on a Slurm-based cluster. Let’s get started!

For a high-level overview of NeMo-Run, please refer to the NeMo-Run README.

Minimum Requirements#

This tutorial requires a minimum of 1 NVIDIA GPU with 48GB of memory for fine-tuning and 2 NVIDIA GPUs with 48GB of memory each for pretraining. Pretraining can also be done on a single GPU, or on GPUs with less memory, by decreasing the model size. Each section can be followed individually based on your needs. You will also need to run this tutorial inside the NeMo container with the dev tag.

You can launch the NeMo container using the following command:

docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev

Pretraining#

Note

The default pretraining recipe uses the MockDataModule. If you want to use a real dataset, follow the instructions here.

For this pretraining quickstart, we will use a relatively small model. We will begin with the Nemotron 3 4B pretraining recipe and go through the steps required to configure and launch pretraining.

As mentioned in the requirements, this tutorial was run on a node with 2 GPUs (each RTX 5880 with 48GB of memory). If you intend to run it on just 1 GPU, or on GPUs with less memory, please adjust the configuration to match your host. For example, you can reduce num_layers or hidden_size in the model config to make it fit on a single GPU.
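To get a feel for how much these knobs matter, here is a rough back-of-the-envelope estimate (a simplification for intuition only, not NeMo’s exact Nemotron architecture), assuming the recipe’s defaults of 32 layers and a hidden size of 3072. Most transformer parameters come from the attention and MLP weights, which scale with num_layers * hidden_size**2:

```python
def approx_param_count(num_layers: int, hidden_size: int, vocab_size: int = 256_000) -> int:
    # Rough estimate: attention projections (~4*h^2) plus the MLP (~8*h^2)
    # dominate each layer; embeddings add vocab_size * h. Biases, norms,
    # and the exact MLP width are ignored.
    per_layer = 12 * hidden_size**2
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings


full = approx_param_count(num_layers=32, hidden_size=3072)
reduced = approx_param_count(num_layers=8, hidden_size=3072)
print(f"{full / 1e9:.1f}B -> {reduced / 1e9:.1f}B parameters")  # -> 4.4B -> 1.7B parameters
```

So cutting the layer count to 8 shrinks the model (and its memory footprint) by more than half before even touching hidden_size.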

Set Up the Prerequisites#

Run the following commands to set up your workspace and files:

# Check GPU access
nvidia-smi

# Create and go to workspace
mkdir -p /workspace/nemo-run
cd /workspace/nemo-run

# Create a python file to run pre-training
touch nemotron_pretraining.py

Configure the Recipe#

Important

In any script you write, please make sure you wrap your code in an if __name__ == "__main__": block. See Working with scripts in NeMo 2.0 for details.

Configure the recipe inside nemotron_pretraining.py:

import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron",  # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100,  # Setting a small value for the quickstart
    )

    # Add overrides here

    return recipe

Here, the recipe variable holds a configured run.Partial object. For those familiar with the NeMo 1.0-style YAML configuration, this recipe is just a Pythonic version of a YAML config file for pretraining.
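If the idea of a configured-but-not-yet-executed task is unfamiliar, functools.partial from the Python standard library is a close analogy (this is an illustration only, not NeMo-Run code): like a recipe, it records a target function and its arguments so they can be inspected and overridden before anything actually runs.

```python
from functools import partial


def pretrain(name: str, max_steps: int) -> str:
    return f"pretraining {name} for {max_steps} steps"


# Like a recipe, this only records the call; nothing runs yet.
recipe = partial(pretrain, name="nemotron_pretraining", max_steps=100)

# Arguments can still be inspected and overridden before execution...
recipe.keywords["max_steps"] = 200

# ...and the task only runs when explicitly invoked.
print(recipe())  # -> pretraining nemotron_pretraining for 200 steps
```

run.Partial works on the same principle, but additionally supports nested attribute access (e.g. recipe.trainer.max_steps) and serialization for remote execution.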

Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.

Override the Attributes#

You can override its attributes just like you would with any normal Python object. For example, if you want to change the val_check_interval, you can do so after defining your recipe by setting:

recipe.trainer.val_check_interval = 100

Note

An important thing to remember is that you are only configuring your task at this stage; the underlying code is not being executed at this time.

Swap Recipes#

The recipes in NeMo 2.0 are easily swappable. For instance, if you want to swap the Nemotron recipe with a Llama 3 recipe, you can simply replace it as follows:

recipe = llm.llama3_8b.pretrain_recipe(
    dir="/checkpoints/llama3",  # Path to store checkpoints
    name="llama3_pretraining",
    num_nodes=nodes,
    num_gpus_per_node=gpus_per_node,
)

Once you have the final recipe configured, you are ready to move to the execution stage.

Execute Locally#

  1. Execute locally using torchrun. To do so, we will define a LocalExecutor as shown:

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

To find out more about NeMo-Run executors, see the execution guide.

  2. Combine the recipe and executor to launch the pretraining run:

def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()

The full code for nemotron_pretraining.py looks like this:

import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron",  # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100,  # Setting a small value for the quickstart
    )

    recipe.trainer.val_check_interval = 100

    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")


# This condition is necessary for the script to be compatible with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()
  3. Run the file using the following command:

python nemotron_pretraining.py


Change the Number of GPUs#

Let’s see how we can change the configuration to run on just 1 GPU instead of 2. All you need to do is change the configuration in run_pretraining, as shown below:

def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    # Change to 1 GPU

    # Change executor params
    executor.ntasks_per_node = 1
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Change recipe params
    # The default number of layers comes from the recipe in nemo where num_layers is 32
    # Ref: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/nemotron.py
    # To run on 1 GPU without TP, we can reduce the number of layers to 8
    recipe.model.config.num_layers = 8
    # We also need to set TP to 1, since we had used 2 for 2 GPUs.
    recipe.trainer.strategy.tensor_model_parallel_size = 1
    # Lastly, we need to set devices to 1 in the trainer.
    recipe.trainer.devices = 1

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

Execute on a Slurm Cluster#

One of the benefits of NeMo-Run is that it allows you to easily scale from local execution to remote Slurm-based clusters. Next, let’s see how we can launch the same pretraining recipe on a Slurm cluster.

Note

Each cluster might have different settings. It is recommended that you reach out to the cluster administrators for specific details.

  1. Define a Slurm executor:

from typing import Optional


def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    devices: int,
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    if not (user and host and remote_job_dir and account and partition and nodes and devices):
        raise RuntimeError(
            "Please set user, host, remote_job_dir, account, partition, nodes, and devices args for using this function."
        )

    # Custom mounts are defined here.
    mounts = []
    if custom_mounts:
        mounts.extend(custom_mounts)

    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    # This defines the slurm executor.
    # We connect to the executor via the tunnel defined by user, host, and remote_job_dir.
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir,  # This is where the results of the run will be stored by default.
            # identity="/path/to/identity/file"  # OPTIONAL: Provide the path to a private key that can be used to establish the SSH connection without entering your password.
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        mem="0",
        exclusive=True,
        gres="gpu:8",
        packager=run.Packager(),
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor
  2. Replace the local executor with the Slurm executor, as shown below:

def run_pretraining_with_slurm():
    recipe = configure_recipe(nodes=1, gpus_per_node=8)
    executor = slurm_executor(
        user="",  # TODO: Set the username you want to use
        host="",  # TODO: Set the host of your cluster
        remote_job_dir="",  # TODO: Set the directory on the cluster where you want to save results
        account="",  # TODO: Set the account for your cluster
        partition="",  # TODO: Set the partition for your cluster
        container_image="",  # TODO: Set the container image you want to use for your job
        # custom_mounts=[],  # TODO: Set any custom mounts
        # custom_env_vars={},  # TODO: Set any custom env vars
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
    )

    run.run(recipe, executor=executor, detach=True, name="nemotron3_4b_pretraining")
  3. Run the pretraining as follows:

if __name__ == "__main__":
    run_pretraining_with_slurm()

python nemotron_pretraining.py

Since we have set detach=True, the process will exit after scheduling the job on the cluster. It will provide information about directories and commands to manage the run/experiment.

Continue Pretraining#

If you want to continue pretraining from a previous checkpoint with a longer context length, you can follow the guide here.

Fine-Tuning#

Note

The default fine-tuning recipe uses the SquadDataModule. If you want to use your own dataset, follow the instructions here.

One of the main benefits of NeMo-Run is that it decouples configuration and execution, allowing us to reuse predefined executors and simply change the recipe. For the purpose of this tutorial, we will include the executor definition so that this section can be followed independently.

Set Up the Prerequisites#

Run the following commands to set up your Hugging Face token, which enables automatic conversion of the model from Hugging Face.

mkdir -p /tokens

# Fetch Huggingface token and export it.
# See https://huggingface.co/docs/hub/en/security-tokens for instructions.
export HF_TOKEN="hf_your_token"  # Change this to your Huggingface token

# Save token to /tokens/huggingface
echo "$HF_TOKEN" > /tokens/huggingface

Configure the Recipe#

In this section, we will fine-tune a Llama 3 8B model from Hugging Face on a single GPU. To achieve this, we need to follow two steps:

  1. Convert the checkpoint from Hugging Face to NeMo.

  2. Run fine-tuning using the converted checkpoint from step 1.

We will accomplish this using a NeMo-Run experiment, which allows you to define these two tasks and execute them sequentially with ease. We will create a new file, nemotron_finetuning.py, in the same directory. For the fine-tuning configuration, we will use the Llama3 8B fine-tuning recipe. This recipe uses LoRA, enabling it to fit on 1 GPU (this example uses a GPU with 48GB of memory).

Let’s first define the configuration for the two tasks:

import nemo_run as run

from nemo.collections import llm


def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )


def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning",  # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe

You can refer to overrides for details on overriding more of the default attributes.

Execute Locally#

Note

You will need to import the checkpoint first by running the recipe returned by configure_checkpoint_conversion(). Skipping this step will most likely result in an error, unless you have a pre-converted checkpoint.

Execution should be pretty straightforward, since we will reuse the local executor (its definition is included here for reference). Next, we will define the experiment and launch it. Here’s what it looks like:

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        # We don't need torchrun for the checkpoint conversion
        exp.add(import_ckpt, executor=run.LocalExecutor(), name="import_from_hf")
        exp.add(finetune, executor=executor, name="peft_finetuning")

        # This will run the tasks sequentially and stream the logs
        exp.run(sequential=True, tail_logs=True)


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()

The full file looks like this:

import nemo_run as run

from nemo.collections import llm


def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )


def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning",  # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Async checkpointing doesn't work with PEFT
    recipe.trainer.strategy.ckpt_async_save = False

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        # We don't need torchrun for the checkpoint conversion
        exp.add(import_ckpt, executor=run.LocalExecutor(), name="import_from_hf")
        exp.add(finetune, executor=executor, name="peft_finetuning")

        # This will run the tasks sequentially and stream the logs
        exp.run(sequential=True, tail_logs=True)


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()


Switch from PEFT to Full Fine-Tuning#

The default recipe uses PEFT for fine-tuning. If you want to use full fine-tuning, you will need to use a minimum of 2 GPUs and pass peft_scheme=None to the recipe.

Warning

When using import_ckpt in NeMo 2.0, ensure your script includes if __name__ == "__main__":. Without this, Python’s multiprocessing won’t initialize threads properly, causing a “Failure to acquire lock” error.

You can change the code as follows:

from typing import Optional


def configure_finetuning_recipe(
    nodes: int = 1,
    gpus_per_node: int = 2,  # Minimum of 2 GPUs
    peft_scheme: Optional[str] = None,
):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning",  # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme=peft_scheme,  # Passing None here will disable PEFT and use full fine-tuning
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe


...


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=2, peft_scheme=None)

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0,1"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        # We don't need torchrun for the checkpoint conversion
        exp.add(import_ckpt, executor=run.LocalExecutor(), name="import_from_hf")
        exp.add(finetune, executor=executor, name="peft_finetuning")

        # This will run the tasks sequentially and stream the logs
        exp.run(sequential=True, tail_logs=True)

Use a NeMo 2.0 Pretraining Checkpoint as the Base#

If you already have a checkpoint pretrained with NeMo 2.0 and want to use it as the starting point for fine-tuning instead of the Hugging Face checkpoint, you can do the following:

def run_finetuning():
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)
    finetune.resume.restore_config.path = "/path/to/pretrained/NeMo-2/checkpoint"

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(finetune, executor=executor, name="peft_finetuning")

        # This will run the tasks sequentially and stream the logs
        exp.run(sequential=True, tail_logs=True)

Execute on a Slurm Cluster with More Nodes#

You can reuse the Slurm executor from above. The experiment can then be configured as follows:

Note

The import_ckpt configuration should write to a shared filesystem accessible by all nodes in the cluster for multi-node training.

You can control the default cache location by setting theNEMO_HOME environment variable.
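If you are experimenting locally before submitting to the cluster, you can set the variable in your shell before launching; a minimal sketch (the path below is only a placeholder — for multi-node runs it must be a directory every node can reach):

```shell
# Point NeMo's cache directory at a shared location so that the
# checkpoint converted by import_ckpt is visible to every node.
export NEMO_HOME="$HOME/shared_nemo_cache"  # placeholder path
mkdir -p "$NEMO_HOME"
```

When launching through an executor, the equivalent is to set it in the executor's env_vars, as the Slurm example below does.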

Warning

When using import_ckpt in NeMo 2.0, ensure your script includes if __name__ == "__main__":. Without this, Python’s multiprocessing won’t initialize threads properly, causing a “Failure to acquire lock” error.

def run_finetuning_on_slurm():
    import_ckpt = configure_checkpoint_conversion()

    # This will make finetuning run on 2 nodes with 8 GPUs each.
    recipe = configure_finetuning_recipe(gpus_per_node=8, nodes=2)
    executor = slurm_executor(
        ...
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
        ...
    )
    executor.env_vars["NEMO_HOME"] = "/path/to/a/shared/filesystem"

    # Importing the checkpoint always requires only 1 node and 1 task per node
    import_executor = executor.clone()
    import_executor.nodes = 1
    import_executor.ntasks_per_node = 1

    # Set this env var for model download from huggingface
    import_executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning-slurm") as exp:
        exp.add(import_ckpt, executor=import_executor, name="import_from_hf")
        exp.add(recipe, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True)