Run prebuilt workloads

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

This guide shows you how to use the NVIDIA NeMo ecosystem on a Managed Training cluster for end-to-end generative AI model development. It provides step-by-step instructions for the following distinct but related workflows, each covered in its own dedicated section:

  • NVIDIA NeMo: For foundational model development, follow these instructions to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT).
  • NVIDIA NeMo-RL: For model alignment and preference tuning, use this section to apply advanced techniques like Reinforcement Learning (RL) to align your model with human instructions and preferences.

Whether you're building a model from scratch or refining an existing one, this document guides you through setting up your environment, managing containerized jobs, and launching training scripts on the cluster.

NVIDIA NeMo

The NVIDIA NeMo framework is an end-to-end platform for building, customizing, and deploying generative AI models. This section of the guide is specifically for developers and researchers focused on the foundational stages of model development. It provides step-by-step instructions for using NeMo to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT) on a Managed Training cluster.

This section provides the complete workflow for running a training job with the NeMo framework. The process is divided into two main parts: the initial one-time setup of your environment and the recurring steps for launching a job.

Set up your environment

Before launching a job, you need to prepare your environment by ensuring you have a container image and the necessary training scripts.

Prepare a container image

You have two options for the container image: use a prebuilt image (recommended) or build a custom one.

Use a prebuilt image (recommended)

Prebuilt container images are provided in the .squashfs format. Copy the appropriate image for your region into your working directory.

# Example for the US region
gcloud storage cp gs://managed-containers-us/nemo_squashfs/nemo-20250721.sqsh .

Build a customized container (advanced)

Follow these steps only if the prebuilt containers don't meet your needs. This procedure guides you through converting a custom container image into the .squashfs format using enroot.

Step 1: Authenticate with Google Cloud.

Use the following commands to ensure that both your Google Cloud user account and the Docker registry where your image is hosted are authenticated:

gcloud auth login
gcloud auth configure-docker us-docker.pkg.dev
Important: The address used here (us-docker.pkg.dev) is an example for an Artifact Registry repository. This address must match the hostname of the registry where your image is stored (for example, an Artifact Registry or Container Registry address). For example, if your image URI is gcr.io/my-project/my-image, you must use gcr.io.
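
For instance, if your image URI were gcr.io/my-project/my-image, you would configure Docker authentication for that hostname instead:

gcloud auth configure-docker gcr.io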

Step 2: Create the conversion script.

Create a file named enroot-convert.sh and add the following script content. Before running this script, you must update the REMOTE_IMG and LOCAL_IMG variables to point to your container image and your chosen output path.

#!/bin/bash
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --ntasks-per-node=1

# Run this script on the slurm login node:
# sbatch -N 1 enroot-convert.sh

set -x
set -e

# The remote docker image URI.
REMOTE_IMG="docker://us-docker.pkg.dev/{YOUR_CONTAINER_IMG_URI}:{YOUR_CONTAINER_IMAGE_TAG}"

# The local path to the to-be-imported enroot squash file.
LOCAL_IMG="${HOME}/my_nemo.sqsh"

# The path to the enroot config file.
TMP_ENROOT_CONFIG_PATH="/tmp/\$(id -u --name)/config/enroot"

# Download the docker image to each node.
srun -l -N "${SLURM_NNODES}" \
  bash -c "mkdir -p ${TMP_ENROOT_CONFIG_PATH}; \
    echo 'machine us-docker.pkg.dev login oauth2accesstoken password $(gcloud auth print-access-token)' > ${TMP_ENROOT_CONFIG_PATH}/.credentials; \
    rm -f ${LOCAL_IMG}; \
    ENROOT_CONFIG_PATH=${TMP_ENROOT_CONFIG_PATH} ENROOT_MAX_PROCESSORS=$(($(nproc)/2)) enroot import -o ${LOCAL_IMG} ${REMOTE_IMG};"

Step 3: Run the script and verify the output.

Execute the script on the Slurm login node.

sbatch -N 1 enroot-convert.sh

After the job completes, find the conversion logs in a file named slurm-<JOB_ID>.out and the final container image at the path you specified for LOCAL_IMG.
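
For example, you could verify the result as follows, assuming the default LOCAL_IMG path from the script above:

# Check the end of the conversion log (replace <JOB_ID> with the Slurm job ID)
tail slurm-<JOB_ID>.out

# Confirm that the squash file was created
ls -lh ${HOME}/my_nemo.sqsh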

Download the training recipes

The training recipes are stored in a private googlesource.com repository. To access them with the Git command line, you must first generate authentication credentials.

  1. Generate authentication credentials.

    Visit the following URL and follow the on-screen instructions. This configures your local environment to authenticate with the repository: https://www.googlesource.com/new-password

  2. Clone the repository.

    Once your credentials are configured, run the following command to download the recipes.

    git clone https://vertex-model-garden.googlesource.com/vertex-oss-training

Launch a training job

Once your environment is set up, you can launch a training job.

Step 1: Set environment variables

The following environment variables may be required for your job:

  • The HF_TOKEN is required to download models and datasets from Hugging Face.
  • The WANDB_API_KEY is required to use Weights & Biases for experiment analysis.

export HF_TOKEN=YOUR_HF_TOKEN
export WANDB_API_KEY=YOUR_WANDB_API_KEY

Step 2: Run the launch script

Navigate to your working directory and run the run.py script to start a job. This example kicks off a demo training job with Llama 3.1-2b.

# Set the working directory
export WORK_DIR=$HOME/vertex-oss-training/nemo
cd $WORK_DIR

# Launch the training job
export NEMORUN_HOME=$WORK_DIR && \
python3 run.py -e slurm --slurm-type hcc-a3m --partition a3m \
  -d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
  -s pretrain/llama3p1_2b_pt.py -n 2 \
  --experiment-name nemo-demo-run

Launch parameters

  • --slurm-type is set based on the cluster type (for example, hcc-a3m, hcc-a3u, hcc-a4).
  • --partition must be set to an available partition. You can check partition names with the sinfo command.
  • The run.py script automatically mounts several directories to the Docker container, including --log-dir, --cache-dir, and --data-dir, if they are set (see the example after this list).
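
For example, a launch command that targets a different cluster type and passes an explicit cache directory might look like the following. This is a sketch: the flag names come from the list above, and the paths are placeholders you would adjust for your cluster.

python3 run.py -e slurm --slurm-type hcc-a3u --partition a3u \
  -d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
  -s pretrain/llama3p1_2b_pt.py -n 2 \
  --cache-dir $WORK_DIR/cache \
  --experiment-name nemo-demo-run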

Monitoring job status and logs

After you launch the job, a status block is displayed:

Experiment Status for nemo-demo-run_1753123402
Task 0: nemo-demo-run
- Status: RUNNING
- Executor: SlurmExecutor on @localhost
- Job id: 75
- Local Directory: $NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run

The execution logs are written to the path shown in the Local Directory field from the status output. For example, you can find the log files at a path similar to this:

$NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run/<JOB_ID>.log
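
To watch the job, you can check its status with standard Slurm tooling and follow this log file as it is written:

# Check the status of the Slurm job
squeue

# Follow the execution logs (path taken from the Local Directory field)
tail -f $NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run/<JOB_ID>.log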

Common errors and solutions

This section describes common issues that may arise during job execution and provides recommended steps to resolve them.

Invalid partition error

By default, jobs attempt to launch on the general partition. If the general partition doesn't exist or isn't available, the job will fail with the following error:

sbatch: error: invalid partition specified: general
sbatch: error: Batch job submission failed: Invalid partition name specified

Solution:

Specify an available partition using the --partition or -p argument in your launch command. To see a list of available partitions, run the sinfo command on the Slurm login node.

sinfo

The output shows the available partition names, such as a3u in this example:

PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
a3u*      up    infinite      2 idle~  alice-a3u-[2-3]
a3u*      up    infinite      2 idle   alice-a3u-[0-1]

Tokenizer download error

You may encounter an OSError related to a cross-device link when the script attempts to download the GPT2 tokenizer:

OSError: [Errno 18] Invalid cross-device link: 'gpt2-vocab.json' -> '/root/.cache/torch/megatron/megatron-gpt-345m_vocab'

Solutions:

You have two options to resolve this issue:

  • Option #1: Rerun the job. This error is often transient. Rerunning the job using the same --cache-dir may resolve the issue.
  • Option #2: Manually download the tokenizer files. If rerunning the job fails, follow these steps (a worked example appears after this list):
    • Download the following two files:
      • gpt2-vocab.json
      • gpt2-merges.txt
    • Move the downloaded files into the torch/megatron/ subdirectory within your cache directory (for example, YOUR_CACHE_DIR/torch/megatron/).
    • Rename the files as follows:
      • Rename gpt2-vocab.json to megatron-gpt-345m_vocab.
      • Rename gpt2-merges.txt to megatron-gpt-345m_merges.
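
The following sketch shows what the manual fix might look like. The download URLs are assumptions (any copy of the standard GPT-2 vocab and merges files works); the target directory and file names come from the steps above, and YOUR_CACHE_DIR is the directory you pass as --cache-dir.

# Assumed source: the public GPT-2 tokenizer files on Hugging Face; substitute your preferred mirror.
wget https://huggingface.co/gpt2/resolve/main/vocab.json -O gpt2-vocab.json
wget https://huggingface.co/gpt2/resolve/main/merges.txt -O gpt2-merges.txt

# Move and rename the files into the cache directory used by the job.
mkdir -p YOUR_CACHE_DIR/torch/megatron
mv gpt2-vocab.json YOUR_CACHE_DIR/torch/megatron/megatron-gpt-345m_vocab
mv gpt2-merges.txt YOUR_CACHE_DIR/torch/megatron/megatron-gpt-345m_merges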

NVIDIA NeMo-RL

The NVIDIA NeMo-RL framework is designed to align large language models with human preferences and instructions. This section guides you through using NeMo-RL on a cluster to perform advanced alignment tasks, including supervised fine-tuning (SFT), preference-tuning (such as Direct Preference Optimization, or DPO), and Reinforcement Learning (RL).

The guide covers two primary workflows: running a standard batch training job and using the interactive development environment for debugging.

Prerequisites

Before you begin, create a cluster by following the instructions on the Create cluster page, or use an existing Managed Training cluster if you have one.

Connect to the cluster login node

To connect to the cluster's login node, find the correct Google Cloud CLI command by navigating to the Compute Engine Virtual Machines page in the Google Cloud console and clicking SSH > View Google Cloud CLI command. It will look similar to this:

ssh $USER_NAME@machine-addr

Example:

ssh $USER_NAME@nic0.sliua3m1-login-001.europe-north1-c.c.infinipod-shared-dev.internal.gcpnode.com

Use the prebuilt Docker image

Converted .sqsh files are provided for prebuilt container images. You can select a container for your region and either set it directly as the container image parameter or download it to the cluster's file system.

To set it directly as the container image parameter, use one of the following paths. Replace <region> with your specific region (for example, europe, asia, us):

/gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh

To download the image to the cluster's Lustre storage, use the following command:

gcloud storage cp gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh DESTINATION

Download code

To access the training recipes with the Git CLI, visit https://www.googlesource.com/new-password to generate credentials. Then download the recipes with the following command:

cd $HOME
git clone https://vertex-model-garden.googlesource.com/vertex-oss-training

Launch jobs

Step 1: Set environment variables.

To pull models and data from Hugging Face, the HF_TOKEN may need to be set. To use Weights & Biases for experiment analysis, the WANDB_API_KEY needs to be set. Update these variables in the following file:

File to update: $HOME/vertex-oss-training/nemo_rl/configs/auth.sh

If you don't want to use Weights & Biases, set logger.wandb_enabled to False in your launch script.
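
As a minimal sketch, assuming auth.sh simply exports these variables, the file might look like this:

# $HOME/vertex-oss-training/nemo_rl/configs/auth.sh (assumed structure)
export HF_TOKEN=YOUR_HF_TOKEN            # needed to pull models and data from Hugging Face
export WANDB_API_KEY=YOUR_WANDB_API_KEY  # only needed if Weights & Biases logging is enabled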

Step 2: Download or copy the container file to your launch folder.

Here are some examples:

gcloud storage cp \
  gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
  $HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh

# OR
cp /gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
  $HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh

cd $HOME/vertex-oss-training/nemo_rl/

Step 3: Prepare or clone the NeMo-RL repository.

Create a clone of the NeMo-RL code if it's not already present. Note that you may need to use git submodule update --init --recursive if you've already cloned the repository without the --recursive flag.

git clone https://github.com/NVIDIA-NeMo/RL --recursive

Step 4: Launch the training job.

sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --job_script algorithms/dpo.sh

Where:

  • --cluster_type is set based on the cluster type:
    • A3-Mega: hcc-a3m
    • A3-Ultra: hcc-a3u
    • A4: hcc-a4
    • A3H: hcc-a3h
  • --partition should be set accordingly; use sinfo to check the Slurm partitions (see the example after this list).
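
For example, a two-node DPO job on an A3-Mega cluster with partition a3m might be launched as follows. This is a sketch that assumes the partition is passed through the standard sbatch --partition flag:

sbatch -N 2 --partition a3m launch.sh --cluster_type hcc-a3m --job_script algorithms/dpo.sh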

After your job starts, a new directory named after its Slurm job ID is created in your current location. Inside, you'll find all the logs and checkpoints belonging to this job. More precisely, it contains the following directories and files (see the example after this list):

  • checkpoints/ → This directory is mounted inside the NeMo-RL container and contains all of the checkpoints from the training.
  • ray-logs/ → This directory contains the logs from the ray head and ray workers.
  • nemo_rl_output.log → This file contains the Slurm logs from your submitted job.
  • attach.sh (Interactive jobs only) → This is a bash script which lets you attach to an interactive job. If your job is launched successfully, it might take a couple of minutes for this file to be created.
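
For example, to inspect a running job's output from the directory where you launched it:

# Follow the Slurm logs for the job (replace <job_id> with the Slurm job ID)
tail -f <job_id>/nemo_rl_output.log

# List the checkpoints written so far
ls -lh <job_id>/checkpoints/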

Development with NeMo-RL

Interactive setup

Two options are available for quick interactive development with NeMo-RL.

nemorlinteractive

This is a straightforward helper command that lets you choose a GPU node from the cluster (let's say node number 5) and then takes you to a running container for NeMo-RL inside your selected node. This command is helpful for single-node workflows.

To use nemorlinteractive, follow these prerequisite steps:

  1. Provide all the auth tokens you want loaded into the job (for example, HF and WandB) in the configs/auth.sh file.
  2. Set the CLUSTER_TYPE environment variable according to the following guideline:

    export CLUSTER_TYPE="hcc-a3m"  # if you have an A3-Mega cluster
    export CLUSTER_TYPE="hcc-a3u"  # if you have an A3-Ultra cluster
    export CLUSTER_TYPE="hcc-a4"   # if you have an A4 cluster
    export CLUSTER_TYPE="hcc-a3h"  # if you have an A3H cluster
  3. Import nemorlinteractive into your bash terminal by sourcing bash_utils.sh:

    source bash_utils.sh
  4. Run the nemorlinteractive command. For example:

    # Assuming you want to take compute node number 5.
    nemorlinteractive 5

Interactive launch

This option lets you run workloads interactively on multiple compute nodes. Interactive jobs are most suitable for debugging and verification use cases. These workloads reserve the nodes indefinitely until the developer decides that debugging has concluded and releases the resources.

Follow these steps for this option:

Provide all the auth tokens you want loaded into the job (for example, HF and WandB) in the configs/auth.sh file.

sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --interactive

  • Wait for 2-5 minutes and you should see <job_id>/attach.sh created.

  • To monitor the progress of the launch, check <job_id>/nemo_rl_output.log to see the progress of the launch script, and check <job_id>/ray_logs/ to see the progress of the ray head and workers launch.

  • Connect to the interactive job. This script lets you connect again even if you lose connection:

bash <job_id>/attach.sh

What's next

Running a prebuilt workload verifies the cluster's operational status. The next step is to run your own custom training application.
