NeMo 2.0#

In NeMo 1.0, the main interface for configuring experiments is YAML files. This approach allows experiments to be set up declaratively, but it limits flexibility and programmatic control. NeMo 2.0 shifts to a Python-based configuration, which offers several advantages:

  • More flexibility and control over the configuration.

  • Better integration with IDEs for code completion and type checking.

  • Easier to extend and customize configurations programmatically.

By adopting PyTorch Lightning’s modular abstractions, NeMo 2.0 makes it easy for users to adapt the framework to their specific use cases and experiment with various configurations. This section offers an overview of the new features in NeMo 2.0 and includes a migration guide with step-by-step instructions for transitioning your models from NeMo 1.0 to NeMo 2.0.

Install NeMo 2.0#

NeMo 2.0 installation instructions can be found in the Getting Started guide.

Quickstart#

Important

In any script you write, please make sure you wrap your code in an if __name__ == "__main__": block. See Working with scripts in NeMo 2.0 for details.
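For example, a training script can be structured as in the minimal sketch below; the guard ensures that worker processes which re-import the script do not re-execute the training code:

def main():
    # build your data module, model, and trainer, then call the training API here
    ...

if __name__ == "__main__":
    # only executed when the script is run directly, not when it is imported
    main()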

The following is an example of running a simple training loop using NeMo 2.0. This example uses the train API from the NeMo Framework LLM collection. Once you have set up your environment using the instructions above, you’re ready to run this simple train script.

import torch
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig

if __name__ == "__main__":
    seq_length = 2048
    global_batch_size = 16

    ## setup the dummy dataset
    data = llm.MockDataModule(seq_length=seq_length, global_batch_size=global_batch_size)

    ## initialize a small GPT model
    gpt_config = llm.GPTConfig(
        num_layers=6,
        hidden_size=384,
        ffn_hidden_size=1536,
        num_attention_heads=6,
        seq_length=seq_length,
        init_method_std=0.023,
        hidden_dropout=0.1,
        attention_dropout=0.1,
        layernorm_epsilon=1e-5,
        make_vocab_size_divisible_by=128,
    )
    model = llm.GPTModel(gpt_config, tokenizer=data.tokenizer)

    ## initialize the strategy
    strategy = nl.MegatronStrategy(
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        pipeline_dtype=torch.bfloat16,
    )

    ## setup the optimizer
    opt_config = OptimizerConfig(
        optimizer='adam',
        lr=6e-4,
        bf16=True,
    )
    opt = nl.MegatronOptimizerModule(config=opt_config)

    trainer = nl.Trainer(
        devices=1,  ## you can change the number of devices to suit your setup
        max_steps=50,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    )

    nemo_logger = nl.NeMoLogger(
        log_dir="test_logdir",  ## logs and checkpoints will be written here
    )

    llm.train(
        model=model,
        data=data,
        trainer=trainer,
        log=nemo_logger,
        tokenizer='data',
        optim=opt,
    )

CLI Quickstart#

NeMo comes equipped with a CLI that allows you to launch experiments locally or on a remote cluster. Every command has a help flag that you can use to get more information about the command.

To list all the commands in the llm collection, you can use the following command:

$ nemo llm --help
Usage: nemo llm [OPTIONS] COMMAND [ARGS]...

[Module] llm

╭─ Options ────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                              │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────╮
│ train      [Entrypoint] train                                            │
│ pretrain   [Entrypoint] pretrain                                         │
│ finetune   [Entrypoint] finetune                                         │
│ validate   [Entrypoint] validate                                         │
│ prune      [Entrypoint] prune                                            │
│ distill    [Entrypoint] distill                                          │
│ ptq        [Entrypoint] ptq                                              │
│ deploy     [Entrypoint] deploy                                           │
│ import     [Entrypoint] import                                           │
│ export     [Entrypoint] export                                           │
│ generate   [Entrypoint] generate                                         │
╰──────────────────────────────────────────────────────────────────────────╯

Most commands come with various pre-configured recipes. To list all the recipes for a given command, you can use the following command:

$ nemo llm finetune --help
Usage: nemo llm finetune [OPTIONS] [ARGUMENTS]

[Entrypoint] finetune

Finetunes a model using the specified data and trainer, with optional logging, resuming, and PEFT.

╭─ Pre-loaded entrypoint factories, run with --factory ──────────────────────────────────╮
│ baichuan2_7b                nemo.collections.llm.recipes.baichuan2_7b.fi…  line 236    │
│ chatglm3_6b                 nemo.collections.llm.recipes.chatglm3_6b.fin…  line 236    │
│ deepseek_v2                 nemo.collections.llm.recipes.deepseek_v2.fin…  line 108    │
│ deepseek_v2_lite            nemo.collections.llm.recipes.deepseek_v2_lit…  line 107    │
│ gemma2_2b                   nemo.collections.llm.recipes.gemma2_2b.finet…  line 173    │
│ gemma2_9b                   nemo.collections.llm.recipes.gemma2_9b.finet…  line 173    │
│ llama2_7b                   nemo.collections.llm.recipes.llama2_7b.finet…  line 230    │
│ llama3_8b                   nemo.collections.llm.recipes.llama3_8b.finet…  line 245    │
│ mixtral_8x7b                nemo.collections.llm.recipes.mixtral_8x7b.fi…  line 240    │
│ nemotron3_8b                nemo.collections.llm.recipes.nemotron3_8b.fi…  line 253    │
│ nemotron4_15b               nemo.collections.llm.recipes.nemotron4_15b.f…  line 227    │
│ ...                         (output truncated)                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

You can also use the --factory flag to run a specific recipe. For example, to run the llama32_1b recipe, you can use the following command:

$ nemo llm finetune --factory llama32_1b

The NeMo CLI supports overriding any configuration parameter using Hydra-style dot notation, which lets you customize any aspect of a recipe without modifying the source code. For example, to change the number of GPUs used for training to a single device:

$ nemo llm finetune --factory llama32_1b trainer.devices=1
Configuring global options
Dry run for task nemo.collections.llm.api:finetune

Resolved Arguments
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Argument Name        ┃ Resolved Value                                               ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ trainer              │ Trainer(                                                     │
│                      │   ...                                                        │
│                      │   devices='1',                                               │
│                      │   ...                                                        │
└──────────────────────┴──────────────────────────────────────────────────────────────┘
Continue? [y/N]:

This syntax follows the pattern component.parameter=value, allowing you to navigate nested configurations. You can override multiple parameters at once by adding more space-separated overrides:

$ nemo llm finetune --factory llama32_1b trainer.devices=1 trainer.max_steps=500 optim.config.lr=5e-5

The command prints a preview of the resolved configuration values so you can verify your changes before starting the training run.

NeMo 2.0 also seamlessly supports scaling to thousands of GPUs using NeMo-Run. For examples of launching large-scale experiments using NeMo-Run, refer to Quickstart with NeMo-Run.

Note

If you are an existing user of NeMo 1.0 and would like to use a NeMo 1.0 dataset in place of the MockDataModule in the example, refer to the data migration guide for instructions.

Extend Quickstart with NeMo-Run#

While Quickstart with NeMo-Run covers how to configure your NeMo 2.0 experiment using NeMo-Run, it is not mandatory to use the configuration system from NeMo-Run. In fact, you can take the Python script from the Quickstart above and launch it on remote clusters directly using NeMo-Run. For more details about NeMo-Run, refer to the NeMo-Run GitHub repository and the hello_scripts example. Below, we will walk through how to do this.

Prerequisites#

  1. Save the script above as train.py in your working directory.

  2. Install NeMo-Run using the following command:

pip install git+https://github.com/NVIDIA/NeMo-Run.git

With train.py saved in your current working directory and NeMo-Run installed, you are ready to launch the experiment.

Launch the Experiment Locally#

Locally here means from your local workstation. It can be a venv on your workstation or an interactive NeMo Docker container.

  1. Write a new file called run.py with the following contents:

import os
import nemo_run as run

if __name__ == "__main__":
    training_job = run.Script(
        inline="""
# This string will get saved to a sh file and executed with bash
# Run any preprocessing commands

# Run the training command
python train.py

# Run any post processing commands
"""
    )

    # Run it locally
    executor = run.LocalExecutor()

    with run.Experiment("nemo_2.0_training_experiment", log_level="INFO") as exp:
        exp.add(training_job, executor=executor, tail_logs=True, name="training")
        # Add more jobs as needed

        # Run the experiment
        exp.run(detach=False)
  2. Launch the experiment using the following command:

python run.py

Launch the Experiment on Slurm#

Writing an extra script just to launch locally is not very useful, so let’s see how we can extend run.py to launch the job on any supported NeMo-Run executor. For this tutorial, we will use the Slurm executor.

Note

Each cluster might have different settings. It is recommended that you reach out to the cluster administrators for specific details.

  1. Define a function to configure your Slurm executor as follows:

from typing import Optional  # add this import at the top of run.py


def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    devices: int,
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    if not (user and host and remote_job_dir and account and partition and nodes and devices):
        raise RuntimeError(
            "Please set user, host, remote_job_dir, account, partition, nodes, and devices args for using this function."
        )

    mounts = []
    # Custom mounts are defined here.
    if custom_mounts:
        mounts.extend(custom_mounts)

    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    # This will package the train.py script in the current working directory to the remote cluster.
    # If you are inside a git repo, you can also use https://github.com/NVIDIA/NeMo-Run/blob/main/src/nemo_run/core/packaging/git.py.
    # If the script already exists on your container and you call it with the absolute path, you can also just use `run.Packager()`.
    packager = run.PatternPackager(include_pattern="train.py", relative_path=os.getcwd())

    # This defines the slurm executor.
    # We connect to the executor via the tunnel defined by user, host and remote_job_dir.
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir,  # This is where the results of the run will be stored by default.
            # identity="/path/to/identity/file" OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password.
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        mem="0",
        exclusive=True,
        gres="gpu:8",
        packager=packager,
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor
  2. Replace the executor in run.py as follows (a filled-in example with placeholder values follows this list):

executor = slurm_executor(...)  # pass in args relevant to your cluster
  3. Run the file with the same command and it will launch your job on the cluster. Similarly, you can define multiple Slurm executors for multiple Slurm clusters and use them interchangeably, or use any of the supported executors in NeMo-Run.
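For reference, a filled-in call to the slurm_executor function defined above might look like the following; every value is a placeholder for illustration only, so substitute your own cluster details:

# In run.py, replace the LocalExecutor with your Slurm executor.
# All values below are hypothetical placeholders.
executor = slurm_executor(
    user="jdoe",
    host="login.my-cluster.example.com",
    remote_job_dir="/home/jdoe/nemo-runs",
    account="my_account",
    partition="gpu",
    nodes=1,
    devices=8,
)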

Where to Find NeMo 2.0#

Currently, the code for NeMo 2.0 can be found in the following main locations within the NeMo GitHub repository:

  1. LLM collection: This is the first collection to adopt the NeMo 2.0 APIs. It provides NeMo 2.0 implementations of common language models; refer to the model-specific documentation for the list of currently supported models.

  2. NeMo 2.0 LLM Recipes: Provides comprehensive recipes for pretraining and fine-tuning large language models. Recipes can be easily configured and modified for specific use cases with the help of NeMo-Run.

  3. NeMo Lightning: Provides custom PyTorch Lightning-compatible objects that make it possible to train Megatron Core-based models using PTL in a modular fashion. NeMo 2.0 employs these objects to train models in a simple and efficient manner.

Pretraining, Supervised Fine-Tuning (SFT), and Parameter-Efficient Fine-Tuning (PEFT) are all supported by the LLM collection. More information about each model can be found in the model-specific documentation.
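As a rough illustration of how the LLM recipes and NeMo-Run fit together, the sketch below configures and launches the llama3_8b pretraining recipe locally. The recipe module is listed in the CLI output above, but the exact factory arguments shown here (dir, name, num_nodes, num_gpus_per_node) are assumptions that may differ between NeMo versions; check the recipe documentation before running.

import nemo_run as run
from nemo.collections import llm

if __name__ == "__main__":
    # Recipe factories return a configuration object that can be modified in Python
    # before launch. The keyword arguments below are assumptions; verify them against
    # the recipe signature in your NeMo version.
    recipe = llm.llama3_8b.pretrain_recipe(
        dir="/checkpoints/llama3_8b",   # hypothetical output directory
        name="llama3_8b_pretrain",
        num_nodes=1,
        num_gpus_per_node=8,
    )

    # Any field of the recipe can be overridden programmatically, e.g. shorten the run.
    recipe.trainer.max_steps = 100

    # Launch locally; swap in a Slurm executor (as shown above) to run on a cluster.
    run.run(recipe, executor=run.LocalExecutor())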

Long context recipes are also supported with the help of context parallelism. For more information on the available long context recipes, refer to the long context documentation.
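As a hedged sketch of what enabling context parallelism looks like, the strategy from the Quickstart can be extended with a context_parallel_size argument; this mirrors the tensor and pipeline parallel arguments shown above, but treat the exact argument name as an assumption and consult the long context documentation for the supported recipes.

import torch
from nemo import lightning as nl

# Context parallelism shards the sequence dimension across GPUs, enabling longer
# sequence lengths. The context_parallel_size argument is assumed here; verify it
# against your NeMo version.
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    context_parallel_size=2,
    pipeline_dtype=torch.bfloat16,
)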

Inference via TensorRT-LLM is supported in NeMo 2.0. For more information, refer to the TRT-LLM deployment documentation.

Additional Resources#