Run prebuilt workloads
This guide shows you how to use the NVIDIA NeMo ecosystem on a Managed Training cluster for end-to-end generative AI model development. It provides step-by-step instructions for the following distinct but related workflows, each covered in its own dedicated section:
- NVIDIA NeMo: For foundational model development, follow these instructions to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT).
- NVIDIA NeMo-RL: For model alignment and preference tuning, use this section to apply advanced techniques like Reinforcement Learning (RL) to align your model with human instructions and preferences.
Whether you're building a model from scratch or refining an existing one, this document guides you through setting up your environment, managing containerized jobs, and launching training scripts on the cluster.
NVIDIA NeMo
The NVIDIA NeMo framework is an end-to-end platform for building, customizing, and deploying generative AI models. This section of the guide is specifically for developers and researchers focused on the foundational stages of model development. It provides step-by-step instructions for using NeMo to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT) on a Managed Training cluster.
This guide provides the complete workflow for running a training job with the NeMo framework. The process is divided into two main parts: the initial one-time setup of your environment and the recurring steps for launching a job.
Set up your environment
Before launching a job, you need to prepare your environment by ensuring youhave a container image and the necessary training scripts.
Prepare a container image
You have two options for the container image: use a prebuilt image (recommended)or build a custom one.
Use a prebuilt image (recommended)
Prebuilt container images are provided in the .squashfs format. Copy the appropriate image for your region into your working directory.
```
# Example for the US region
gcloud storage cp gs://managed-containers-us/nemo_squashfs/nemo-20250721.sqsh .
```

Build a customized container (advanced)
Follow these steps only if the prebuilt containers don't meet your needs. This procedure guides you through converting a custom container image into the .squashfs format using enroot.
Step 1: Authenticate with Google Cloud.
Use the following commands to ensure that both your Google Cloud user account and the Docker registry where your image is hosted are authenticated:
```
gcloud auth login
gcloud auth configure-docker us-docker.pkg.dev
```

The address (us-docker.pkg.dev) is an example for an Artifact Registry repository. This address must match the hostname of the registry where your image is stored (for example, an Artifact Registry or Container Registry address). For example, if your image URI is gcr.io/my-project/my-image, you must use gcr.io.

Step 2: Create the conversion script.
Create a file named `enroot-convert.sh` and add the following script content. Before running this script, you must update the `REMOTE_IMG` and `LOCAL_IMG` variables to point to your container image and your chosen output path.
```
#!/bin/bash
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --ntasks-per-node=1

# Run this script on the slurm login node:
# sbatch -N 1 enroot-convert.sh

set -x
set -e

# The remote docker image URI.
REMOTE_IMG="docker://us-docker.pkg.dev/{YOUR_CONTAINER_IMG_URI}:{YOUR_CONTAINER_IMAGE_TAG}"

# The local path to the to-be-imported enroot squash file.
LOCAL_IMG="${HOME}/my_nemo.sqsh"

# The path to the enroot config.
TMP_ENROOT_CONFIG_PATH="/tmp/\$(id -u --name)/config/enroot"

# Download the docker image to each node.
srun -l -N "${SLURM_NNODES}" \
  bash -c "mkdir -p ${TMP_ENROOT_CONFIG_PATH};
    echo 'machine us-docker.pkg.dev login oauth2accesstoken password $(gcloud auth print-access-token)' > ${TMP_ENROOT_CONFIG_PATH}/.credentials;
    rm -f ${LOCAL_IMG};
    ENROOT_CONFIG_PATH=${TMP_ENROOT_CONFIG_PATH} ENROOT_MAX_PROCESSORS=$(($(nproc)/2)) enroot import -o ${LOCAL_IMG} ${REMOTE_IMG};"
```

Step 3: Run the script and verify the output.
Execute the script on the Slurm login node.
```
sbatch -N 1 enroot-convert.sh
```

After the job completes, find the conversion logs in a file named `slurm-<JOB_ID>.out` and the final container image at the path you specified for `LOCAL_IMG`.
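For example, you can verify the job and the resulting image from the login node. This is a minimal sketch; `<JOB_ID>` is a placeholder, and the image path assumes the default `LOCAL_IMG` value from the script above:

```
# Check whether the conversion job is still queued or running.
squeue -u $USER

# After the job finishes, inspect the conversion log and confirm the image exists.
cat slurm-<JOB_ID>.out
ls -lh ${HOME}/my_nemo.sqsh
```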
Download the training recipes
The training recipes are stored in a private googlesource.com repository. To access them with the Git command line, you must first generate authentication credentials.
Generate authentication credentials.
Visit the following URL and follow the on-screen instructions. This configures your local environment to authenticate with the repository: https://www.googlesource.com/new-password
Clone the repository.
After you configure the credentials, run the following command to download the recipes.
```
git clone https://vertex-model-garden.googlesource.com/vertex-oss-training
```
Launch a training job
Once your environment is set up, you can launch a training job.
Step 1: Set environment variables
The following environment variables may be required for your job:
- The `HF_TOKEN` is required to download models and datasets from Hugging Face.
- The `WANDB_API_KEY` is required to use Weights & Biases for experiment analysis.
```
export HF_TOKEN=YOUR_HF_TOKEN
export WANDB_API_KEY=YOUR_WANDB_API_KEY
```

Step 2: Run the launch script
Navigate to your working directory and run the `run.py` script to start a job. This example kicks off a demo training job with Llama 3.1-2b.
```
# Set the working directory
export WORK_DIR=$HOME/vertex-oss-training/nemo
cd $WORK_DIR

# Launch the training job
export NEMORUN_HOME=$WORK_DIR && \
python3 run.py -e slurm --slurm-type hcc-a3m --partition a3m \
  -d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
  -s pretrain/llama3p1_2b_pt.py -n 2 \
  --experiment-name nemo-demo-run
```

Launch parameters
- `--slurm-type` is set based on the cluster type (for example, `hcc-a3m`, `hcc-a3u`, `hcc-a4`).
- `--partition` must be set to an available partition. You can check partition names with the `sinfo` command.
- The `run.py` script automatically mounts several directories to the Docker container, including `--log-dir`, `--cache-dir`, and `--data-dir`, if they are set. See the example after this list.
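For example, assuming your logs, cache, and datasets live under the shared working directory, a launch command that mounts them might look like the following sketch (the directory values are illustrative; the flags are the ones listed above):

```
python3 run.py -e slurm --slurm-type hcc-a3m --partition a3m \
  -d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
  -s pretrain/llama3p1_2b_pt.py -n 2 \
  --experiment-name nemo-demo-run \
  --log-dir $WORK_DIR/logs \
  --cache-dir $WORK_DIR/cache \
  --data-dir $WORK_DIR/data
```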
Monitoring job status and logs
After you launch the job, a status block is displayed:
```
Experiment Status for nemo-demo-run_1753123402
Task 0: nemo-demo-run
- Status: RUNNING
- Executor: SlurmExecutor on @localhost
- Job id: 75
- Local Directory: $NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run
```

The execution logs are written to the path shown in the Local Directory field from the status output. For example, you can find the log files at a path similar to this:
```
$NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run/<JOB_ID>.log
```

Common errors and solutions
This section describes common issues that may arise during job execution andprovides recommended steps to resolve them.
Invalid partition error
By default, jobs attempt to launch on the general partition. If the general partition doesn't exist or isn't available, the job fails with the following error:
```
sbatch: error: invalid partition specified: general
sbatch: error: Batch job submission failed: Invalid partition name specified
```

Solution:
Specify an available partition using the `--partition` or `-p` argument in your launch command. To see a list of available partitions, run the `sinfo` command on the Slurm login node.
```
sinfo
```

The output shows the available partition names, such as `a3u` in this example:
| PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
|---|---|---|---|---|---|
| a3u* | up | infinite | 2 | idle~ | alice-a3u-[2-3] |
| a3u* | up | infinite | 2 | idle | alice-a3u-[0-1] |
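For example, if your cluster exposes the `a3u` partition shown above, the demo launch command from earlier in this guide could be adjusted as follows (a sketch that assumes an A3-Ultra cluster; keep your other arguments unchanged):

```
python3 run.py -e slurm --slurm-type hcc-a3u --partition a3u \
  -d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
  -s pretrain/llama3p1_2b_pt.py -n 2 \
  --experiment-name nemo-demo-run
```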
Tokenizer download error
You may encounter an OSError related to a cross-device link when the script attempts to download the GPT2 tokenizer:
```
OSError: [Errno 18] Invalid cross-device link: 'gpt2-vocab.json' -> '/root/.cache/torch/megatron/megatron-gpt-345m_vocab'
```

Solutions:
You have two options to resolve this issue:
- Option #1: Rerun the job. This error is often transient. Rerunning the job using the same `--cache-dir` may resolve the issue.
- Option #2: Manually download the tokenizer files. If rerunning the job fails, follow these steps:
  - Download the following two files:
    - `gpt2-vocab.json`
    - `gpt2-merges.txt`
  - Move the downloaded files into the `torch/megatron/` subdirectory within your cache directory (for example, `YOUR_CACHE_DIR/torch/megatron/`).
  - Rename the files as follows:
    - Rename `gpt2-vocab.json` to `megatron-gpt-345m_vocab`.
    - Rename `gpt2-merges.txt` to `megatron-gpt-345m_merges`.
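The recipe doesn't prescribe a download source for these files. One possible option, shown as a sketch only, is to fetch the equivalent GPT-2 tokenizer files from the public Hugging Face `gpt2` repository and write them directly to the renamed paths (the URLs and `YOUR_CACHE_DIR` are assumptions, not part of the recipe):

```
# Sketch: download the GPT-2 tokenizer files and place them where Megatron expects them.
CACHE_DIR=YOUR_CACHE_DIR   # the directory you pass as --cache-dir
mkdir -p "${CACHE_DIR}/torch/megatron"
curl -L https://huggingface.co/gpt2/resolve/main/vocab.json \
  -o "${CACHE_DIR}/torch/megatron/megatron-gpt-345m_vocab"
curl -L https://huggingface.co/gpt2/resolve/main/merges.txt \
  -o "${CACHE_DIR}/torch/megatron/megatron-gpt-345m_merges"
```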
NVIDIA NeMo-RL
The NVIDIA NeMo-RL framework is designed to align large language models with human preferences and instructions. This section guides you through using NeMo-RL on a cluster to perform advanced alignment tasks, including supervised fine-tuning (SFT), preference-tuning (such as Direct Preference Optimization, or DPO), and Reinforcement Learning (RL).
The guide covers two primary workflows: running a standard batch training joband using the interactive development environment for debugging.
Prerequisites
Before you begin, create a cluster by following the instructions on the Create cluster page, or use an existing Managed Training cluster, if you have one.
Connect to the cluster login node
To connect to the cluster's login node, find the correct Google Cloud CLI command by navigating to the Compute Engine Virtual Machine page in the Google Cloud console and clicking SSH > View Google Cloud CLI command. The command looks similar to this:
```
ssh $USER_NAME@machine-addr
```

Example:
```
ssh $USER_NAME@nic0.sliua3m1-login-001.europe-north1-c.c.infinipod-shared-dev.internal.gcpnode.com
```

Use the prebuilt Docker image
Converted .sqsh files are provided for prebuilt container images. You can select a container for your region and either set it directly as the container image parameter or download it to the cluster's file system.
To set it directly as the container image parameter, use one of the following paths. Replace `<region>` with your specific region (for example, `europe`, `asia`, `us`):
```
/gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh
```

To download the image to the cluster's Lustre storage, use the following command:
```
gcloud storage cp gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh DESTINATION
```

Download code
To get access to the training recipes with the Git command line, visit https://www.googlesource.com/new-password. Then download the recipes with the following command:
```
cd $HOME
git clone https://vertex-model-garden.googlesource.com/vertex-oss-training
```

Launch jobs
Step 1: Set environment variables.
To pull models and data from Hugging Face, the `HF_TOKEN` may need to be set. To use Weights & Biases for experiment analysis, the `WANDB_API_KEY` needs to be set. Update these variables in the following file:
File to update: `$HOME/vertex-oss-training/nemo_rl/configs/auth.sh`
If you don't want to use Weights & Biases, set `logger.wandb_enabled` to `False` in your launch script.
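The exact contents of `auth.sh` may differ between recipe versions; as a minimal sketch, it exports the two tokens described above:

```
# $HOME/vertex-oss-training/nemo_rl/configs/auth.sh (sketch; adapt to the file
# shipped with the recipes)
export HF_TOKEN=YOUR_HF_TOKEN
export WANDB_API_KEY=YOUR_WANDB_API_KEY
```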
Step 2: Download or copy the container file to your launch folder.
The following commands show both approaches:
```
gcloud storage cp \
  gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
  $HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh

# OR

cp /gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
  $HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh

cd $HOME/vertex-oss-training/nemo_rl/
```

Step 3: Prepare or clone the NeMo-RL repository.
Create a clone of the NeMo-RL code if it isn't already present. Note that you may need to run `git submodule update --init --recursive` if you've already cloned the repository without the `--recursive` flag.
```
git clone https://github.com/NVIDIA-NeMo/RL --recursive
```

Step 4: Launch the training job.
```
sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --job_script algorithms/dpo.sh
```

Where:
- `--cluster_type` is set based on the cluster type:
  - A3-Mega: `hcc-a3m`
  - A3-Ultra: `hcc-a3u`
  - A4: `hcc-a4`
  - A3H: `hcc-a3h`
- `--partition` should be set to an available partition; you can use `sinfo` to check the Slurm partitions. See the example after this list.
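For example, a two-node DPO job on an A3-Ultra cluster that targets the partition reported by `sinfo` might be launched as follows (a sketch; the partition name `a3u` is an assumption):

```
sbatch -N 2 --partition a3u launch.sh --cluster_type hcc-a3u --job_script algorithms/dpo.sh
```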
After your job starts, a new directory named after its Slurm job ID is created in your current location. Inside, you'll find all the logs and checkpoints belonging to this job. More precisely, it contains the following directories and files:
- `checkpoints/` → This directory is mounted inside the NeMo-RL container and contains all of the checkpoints from the training.
- `ray-logs/` → This directory contains the logs from the Ray head and Ray workers.
- `nemo_rl_output.log` → This file contains the Slurm logs from your submitted job.
- `attach.sh` (interactive jobs only) → This is a bash script that lets you attach to an interactive job. If your job launches successfully, it might take a couple of minutes for this file to be created.
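For example, to follow a running job's output from the login node (`<job_id>` is a placeholder for your Slurm job ID):

```
# Follow the Slurm logs of the submitted job.
tail -f <job_id>/nemo_rl_output.log

# List the checkpoints written so far.
ls -lh <job_id>/checkpoints/
```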
Development with NeMo-RL
Interactive setup
Two options are available for quick interactive development with NeMo-RL.
nemorlinteractive
This is a straightforward helper command that lets you choose a GPU node from the cluster (for example, node number 5) and then takes you to a running NeMo-RL container on your selected node. This command is helpful for single-node workflows.
To use `nemorlinteractive`, follow these prerequisite steps:
- Provide all auth tokens you want (for example, HF and WandB) loaded to the job in the `configs/auth.sh` file.
- Set the `CLUSTER_TYPE` environment variable according to the following guideline:

  ```
  export CLUSTER_TYPE="hcc-a3m"  # if you have an A3-Mega cluster
  export CLUSTER_TYPE="hcc-a3u"  # if you have an A3-Ultra cluster
  export CLUSTER_TYPE="hcc-a4"   # if you have an A4 cluster
  export CLUSTER_TYPE="hcc-a3h"  # if you have an A3H cluster
  ```

- Import `nemorlinteractive` into your bash terminal by sourcing `bash_utils.sh`:

  ```
  source bash_utils.sh
  ```

- Run the `nemorlinteractive` command. For example:

  ```
  # Assuming you want to take compute node number 5.
  nemorlinteractive 5
  ```
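Once attached to the container on the selected node, you can sanity-check the environment before starting a workload, for example:

```
# Verify that the node's GPUs are visible inside the NeMo-RL container.
nvidia-smi
```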
Interactive launch
This option lets you run workloads interactively on multiple compute nodes. Interactive jobs are most suitable for debugging and verification use cases. These workloads reserve the nodes indefinitely until the developer decides that the debugging has concluded and releases the resources.
Follow these steps for this option:
Provide all auth tokens you want (for example, HF and WandB) loaded to the job in the `configs/auth.sh` file.
Launch the job with the `--interactive` flag:

```
sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --interactive
```

Wait for 2-5 minutes and you should see `<job_id>/attach.sh` created. To monitor the progress of the launch, check `<job_id>/nemo_rl_output.log` to see the progress of the launch script, and check `<job_id>/ray_logs/` to see the progress of the Ray head and workers launch.

Connect to the interactive job. This script lets you connect again even if you lose connection:
```
bash <job_id>/attach.sh
```

What's next
Running a prebuilt workload verifies the cluster's operational status. The next step is to run your own custom training application.
- Run your own custom workload: Package your training code into a container and submit the container as a `CustomJob` to your training cluster. This process includes configuring the job for a distributed environment.
- Monitor your training jobs: Effectively track the progress, resource utilization, and logs for the jobs running on your cluster using the Google Cloud console or Cloud Logging.
- Manage your cluster: After running your tests, check the status of your cluster or delete it to manage costs.
- Orchestrate jobs with Vertex AI Pipelines: After running jobs manually, automate the process by creating a pipeline to orchestrate your training workflows.