Monitor and debug training with an interactive shell

This page shows you how to use an interactive shell to inspect the container where your training code is running. You can browse the file system and run debugging utilities in each prebuilt container or custom container running on Vertex AI.

Using an interactive shell to inspect your training container can help you debug problems with your training code or your Vertex AI configuration. For example, you can use an interactive shell to do the following:

  • Run tracing and profiling tools.
  • Analyze GPU usage.
  • Check Google Cloud permissions available to the container.

You can also use Cloud Profiler to debug model training performance for your Vertex AI serverless training jobs. For details, see Profile model training performance using Profiler.

Before you begin

You can use an interactive shell when you perform serverless training with a CustomJob resource, a HyperparameterTuningJob resource, or a custom TrainingPipeline resource. As you prepare your training code and configure the serverless training resource of your choice, make sure to meet the following requirements:

  • Ensure that your training container has bash installed.

    All prebuilt training containers have bash installed. If you create a custom container for training, use a base container that includes bash or install bash in your Dockerfile.

  • Perform serverless training in a region that supports interactive shells.

  • Ensure that anyone who wants to access an interactive shell has the following permissions for the Google Cloud project where serverless training is running:

    • aiplatform.customJobs.create
    • aiplatform.customJobs.get
    • aiplatform.customJobs.cancel

    If you initiate serverless training yourself, then you most likely already have these permissions and can access an interactive shell. However, if you want to use an interactive shell to inspect a serverless training resource created by someone else in your organization, then you might need to obtain these permissions.

    One way to obtain these permissions is to ask an administrator of your organization to grant you the Vertex AI User role (roles/aiplatform.user).

Requirements for advanced cases

If you are using certain advanced features, meet the following additional requirements:

  • If you attach a custom service account to your serverless training resource, then make sure that any user who wants to access an interactive shell has the iam.serviceAccounts.actAs permission for the attached service account.

    The guide to custom service accounts notes that you must have this permission to attach a service account. You also need this permission to view an interactive shell during serverless training.

    For example, to create a CustomJob with a service account attached, you must have the iam.serviceAccounts.actAs permission for the service account. If one of your colleagues then wants to view an interactive shell for this CustomJob, they must also have the same iam.serviceAccounts.actAs permission.

  • If you have configured your project to use VPC Service Controls with Vertex AI, then account for the following additional limitations:

Enable interactive shells

To enable interactive shells for a serverless training resource, set the enableWebAccess API field to true when you create a CustomJob, HyperparameterTuningJob, or custom TrainingPipeline.

The following examples show how to do this using several different tools:

Console

Follow the guide to creating a custom TrainingPipeline in the Google Cloud console. In the Train new model pane, when you reach the Model details step, do the following:

  1. Click Advanced options.

  2. Select the Enable training debugging checkbox.

Then, complete the rest of the Train new model workflow.

gcloud

To learn how to use these commands, see the guide to creating a CustomJob and the guide to creating a HyperparameterTuningJob.

API

The following partial REST request bodies show where to specify the enableWebAccess field for each type of serverless training resource:

CustomJob

The following example is a partial request body for the projects.locations.customJobs.create API method:

{
  ...
  "jobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}

For an example of sending an API request to create a CustomJob, see Creating serverless training jobs.

HyperparameterTuningJob

The following example is a partial request body for the projects.locations.hyperparameterTuningJobs.create API method:

{
  ...
  "trialJobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}

For an example of sending an API request to create a HyperparameterTuningJob, see Using hyperparameter tuning.

Custom TrainingPipeline

The following examples show partial request bodies for the projects.locations.trainingPipelines.create API method. Select one of the following tabs, depending on whether you are using hyperparameter tuning:

Without hyperparameter tuning

{
  ...
  "trainingTaskInputs": {
    ...
    "enableWebAccess": true
  }
  ...
}

With hyperparameter tuning

{
  ...
  "trainingTaskInputs": {
    ...
    "trialJobSpec": {
      ...
      "enableWebAccess": true
    }
  }
  ...
}

For an example of sending an API request to create a custom TrainingPipeline, see Creating training pipelines.
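As a rough illustration of where the field sits in each request body, the following Python sketch sets enableWebAccess at the paths shown in the partial request bodies above. The function name and the resource labels (custom_job and so on) are hypothetical names chosen for this sketch; they are not part of the Vertex AI API.

```python
import copy
import json

# Path to the enableWebAccess field for each kind of request body, taken from
# the partial request bodies on this page. The dictionary keys are
# hypothetical labels for this sketch, not API values.
WEB_ACCESS_PATHS = {
    "custom_job": ["jobSpec"],
    "hyperparameter_tuning_job": ["trialJobSpec"],
    "training_pipeline": ["trainingTaskInputs"],
    "training_pipeline_with_tuning": ["trainingTaskInputs", "trialJobSpec"],
}


def enable_web_access(body, kind):
    """Return a copy of `body` with enableWebAccess set to True at the right path."""
    result = copy.deepcopy(body)
    node = result
    for key in WEB_ACCESS_PATHS[kind]:
        # Create intermediate objects if the partial body does not have them yet.
        node = node.setdefault(key, {})
    node["enableWebAccess"] = True
    return result


if __name__ == "__main__":
    body = {"displayName": "my-job", "jobSpec": {"workerPoolSpecs": []}}
    print(json.dumps(enable_web_access(body, "custom_job"), indent=2))
```

The helper copies the input rather than mutating it, so the same base body can be reused for jobs with and without debugging enabled.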

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

Set the enable_web_access parameter to True when you run one of the following methods:

Navigate to an interactive shell

After you have initiated serverless training according to the guidance in the preceding section, Vertex AI generates one or more URIs that you can use to access interactive shells. Vertex AI generates a unique URI for each training node in your job.

You can navigate to an interactive shell in one of the following ways:

  • Click a link in the Google Cloud console
  • Use the Vertex AI API to get the shell's web access URI

Navigate from the Google Cloud console

  1. In the Google Cloud console, in the Vertex AI section, go to one of the following pages:

  2. Click the name of your serverless training resource.

    If you created a TrainingPipeline for serverless training, click the name of the CustomJob or HyperparameterTuningJob that was created by your TrainingPipeline. For example, if your pipeline has the name PIPELINE_NAME, this might be called PIPELINE_NAME-custom-job or PIPELINE_NAME-hyperparameter-tuning-job.

  3. On the page for your job, click Launch web terminal. If your job uses multiple nodes, click Launch web terminal next to the node for which you want an interactive shell.

    Note that you can only access an interactive shell while the job is running. If you don't see Launch web terminal, this might be because Vertex AI hasn't started running your job yet, or because the job has already finished or failed. If the job's Status is Queued or Pending, wait a minute; then try refreshing the page.

    If you are using hyperparameter tuning, there are separate Launch web terminal links for each trial.

Get the web access URI from the API

Use the projects.locations.customJobs.get API method or the projects.locations.hyperparameterTuningJobs.get API method to see the URIs that you can use to access interactive shells.

Note: If you created a TrainingPipeline for serverless training, run the appropriate get method on the CustomJob identified by the TrainingPipeline's trainingTaskMetadata.backingCustomJob field or the HyperparameterTuningJob identified by the TrainingPipeline's trainingTaskMetadata.backingHyperparameterTuningJob field.

Depending on which type of serverless training resource you are using, select one of the following tabs to see examples of how to find the webAccessUris API field, which contains an interactive shell URI for each node in your job:

CustomJob

The following tabs show different ways to send a projects.locations.customJobs.get request:

gcloud

Run the gcloud ai custom-jobs describe command:

gcloud ai custom-jobs describe JOB_ID \
  --region=LOCATION \
  --format=json

Replace the following:

  • JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai custom-jobs list command and look for the appropriate job.)

  • LOCATION: The region where you created the job.

REST

Before using any of the request data, make the following replacements:

  • LOCATION: The region where you created the job.

  • PROJECT_ID: Your project ID.

  • JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID"

PowerShell (Windows)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID" | Select-Object -Expand Content
 

In the output, look for the following:

{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "webAccessUris": {
    "workerpool0-0": "INTERACTIVE_SHELL_URI"
  }
}

If you don't see the webAccessUris field, this might be because Vertex AI hasn't started running your job yet. Verify that you see JOB_STATE_RUNNING in the state field. If the state is JOB_STATE_QUEUED or JOB_STATE_PENDING, wait a minute; then try getting the job info again.
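The checks described above might be scripted along these lines. This is a minimal sketch that works on an already-parsed JSON response; the function names and the returned labels (ready, wait, unavailable) are assumptions made for illustration, not part of the API.

```python
# States in which the page suggests waiting a minute and trying again.
WAITING_STATES = {"JOB_STATE_QUEUED", "JOB_STATE_PENDING"}


def job_id_from_name(name):
    """Return the last path segment of a job's name field."""
    return name.rstrip("/").rsplit("/", 1)[-1]


def custom_job_url(location, project_id, job_id):
    """Build the customJobs.get URL in the format shown in the REST tab."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs/{job_id}"
    )


def shell_status(job):
    """Classify a parsed job response by interactive shell availability."""
    if job.get("webAccessUris"):
        return "ready"
    if job.get("state") in WAITING_STATES:
        return "wait"
    # Covers finished or failed jobs, and running jobs that expose no URIs.
    return "unavailable"
```

For example, shell_status applied to a response whose state is JOB_STATE_PENDING returns "wait", which a calling script could use as the signal to sleep and fetch the job again.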

HyperparameterTuningJob

The following tabs show different ways to send a projects.locations.hyperparameterTuningJobs.get request:

gcloud

Run the gcloud ai hp-tuning-jobs describe command:

gcloud ai hp-tuning-jobs describe JOB_ID \
  --region=LOCATION \
  --format=json

Replace the following:

  • JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai hp-tuning-jobs list command and look for the appropriate job.)

  • LOCATION: The region where you created the job.

REST

Before using any of the request data, make the following replacements:

  • LOCATION: The region where you created the job.

  • PROJECT_ID: Your project ID.

  • JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID"

PowerShell (Windows)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID" | Select-Object -Expand Content
 

In the output, look for the following:

{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "trials": [
    ...
    {
      ...
      "state": "ACTIVE",
      ...
      "webAccessUris": {
        "workerpool0-0": "INTERACTIVE_SHELL_URI"
      }
    }
  ]
}

If you don't see the webAccessUris field, this might be because Vertex AI hasn't started running your job yet. Verify that you see JOB_STATE_RUNNING in the state field. If the state is JOB_STATE_QUEUED or JOB_STATE_PENDING, wait a minute; then try getting the job info again.

Vertex AI provides a set of interactive shell URIs for each hyperparameter tuning trial as the trial enters the ACTIVE state. If you want to get interactive shell URIs for later trials, get the job info again after those trials start.

The preceding example shows the expected output for single-replica training: one URI for the primary training node. If you are performing distributed training, the output contains one URI for each training node, identified by worker pool.

For example, if your job has a primary worker pool with one replica and a secondary worker pool with two replicas, then the webAccessUris field looks similar to the following:

{
  "workerpool0-0": "URI_FOR_PRIMARY",
  "workerpool1-0": "URI_FOR_FIRST_SECONDARY",
  "workerpool1-1": "URI_FOR_SECOND_SECONDARY"
}
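To make the key format concrete, the following Python sketch orders the URIs by worker pool and replica and collects the URIs that each tuning trial exposes. It assumes the keys always follow the workerpoolN-M pattern shown above and that each trial object carries an id field (elided in the sample output); the function names are hypothetical.

```python
def parse_node_key(key):
    """Split a key such as 'workerpool1-0' into (pool_index, replica_index)."""
    pool, replica = key.rsplit("-", 1)
    return int(pool[len("workerpool"):]), int(replica)


def sorted_nodes(web_access_uris):
    """Return (pool_index, replica_index, uri) tuples in worker pool order."""
    return sorted(
        parse_node_key(key) + (uri,) for key, uri in web_access_uris.items()
    )


def trial_shell_uris(tuning_job):
    """Map each trial ID to its webAccessUris, skipping trials without any."""
    result = {}
    for trial in tuning_job.get("trials", []):
        uris = trial.get("webAccessUris")
        if uris:
            # Assumption: the trial object has an "id" field.
            result[trial.get("id")] = uris
    return result
```

Trials that have not yet entered the ACTIVE state have no webAccessUris field, so they are absent from the result until the job info is fetched again.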

Use an interactive shell

To use the interactive shell for a training node, navigate to one of the URIs that you found in the preceding section. A Bash shell appears in your browser, giving you access to the file system of the container where Vertex AI is running your training code.

The following sections describe some things to consider as you use the shell and provide some examples of monitoring tools you might use in the shell.

Prevent the job from ending

When Vertex AI finishes running your job or trial, you will immediately lose access to your interactive shell. If this happens, you might see the message command terminated with exit code 137 or the shell might stop responding. If you created any files in the container's file system, they will not persist after the job ends.

In some cases, you might want to purposefully make your job run longer in order to debug with an interactive shell. For example, you can add code like the following to your training code in order to make the job keep running for at least an hour after an exception occurs:

import time
import traceback

try:
    # Replace with a function that runs your training code
    train_model()
except Exception as e:
    traceback.print_exc()
    time.sleep(60 * 60)  # 1 hour

However, note that you incur Vertex AI Training charges as long as the job keeps running.
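One possible variation on the preceding snippet wraps the same idea in a reusable function. This is a sketch, not a recommended pattern from Vertex AI: the name run_with_debug_window and its parameters are hypothetical, and unlike the snippet above it re-raises the exception after the wait so that the process still exits with an error.

```python
import time
import traceback


def run_with_debug_window(train_fn, hold_seconds=60 * 60, sleep=time.sleep):
    """Run train_fn; on failure, print the traceback and hold before re-raising.

    The `sleep` parameter exists only so the wait can be replaced in tests.
    """
    try:
        return train_fn()
    except Exception:
        traceback.print_exc()
        # Keep the container alive so an interactive shell stays reachable.
        sleep(hold_seconds)
        # Design choice of this sketch: surface the failure after the wait.
        raise
```

A training entry point could call run_with_debug_window(train_model, hold_seconds=30 * 60) to cap the billed debugging window at 30 minutes.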

Check permissions issues

The interactive shell environment is authenticated using application default credentials (ADC) for the service account that Vertex AI uses to run your training code. You can run gcloud auth list in the shell for more details.

In the shell, you can use bq and other tools that support ADC. This can help you verify that the job is able to access a particular Cloud Storage bucket, BigQuery table, or other Google Cloud resource that your training code needs.

Visualize Python execution with py-spy

py-spy lets you profile an executing Python program, without modifying it. To use py-spy in an interactive shell, do the following:

  1. Install py-spy:

    pip3 install py-spy

  2. Run ps aux in the shell, and look for the PID of the Python training program.

  3. Run any of the subcommands described in the py-spy documentation, using the PID that you found in the preceding step.

  4. If you use py-spy record to create an SVG file, copy this file to a Cloud Storage bucket so you can view it later on your local computer. For example:

    gcloud storage cp profile.svg gs://BUCKET

    Replace BUCKET with the name of a bucket you have access to.
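If the ps aux output is long, the lookup in step 2 can be narrowed with a small helper. The following Python sketch assumes the usual 11-column ps aux layout (USER, PID, and so on, with COMMAND last); the function name and the sample process list in the usage note are illustrative only.

```python
def python_pids(ps_output):
    """Return (pid, command) pairs for processes whose executable is Python."""
    pids = []
    # Skip the header row, then split each row into at most 11 columns so the
    # COMMAND column keeps its embedded spaces.
    for line in ps_output.splitlines()[1:]:
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable.startswith("python"):
            pids.append((int(fields[1]), command))
    return pids
```

Given the text printed by ps aux as a string, the function returns entries such as (17, "python3 -m trainer.task"), and the first element is the PID to pass to py-spy.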

Analyze performance with perf

perf lets you analyze the performance of your training node. To install the version of perf appropriate for your node's Linux kernel, run the following commands:

apt-get update
apt-get install -y linux-tools-generic
rm /usr/bin/perf
LINUX_TOOLS_VERSION=$(ls /usr/lib/linux-tools | tail -n 1)
ln -s "/usr/lib/linux-tools/${LINUX_TOOLS_VERSION}/perf" /usr/bin/perf

After this, you can run any of the subcommands described in the perf documentation.

Retrieve information about GPU usage

GPU-enabled containers running on nodes with GPUs typically have several command-line tools preinstalled that can help you monitor GPU usage. For example:

  • Use nvidia-smi to monitor GPU utilization of various processes.

  • Use nvprof to collect a variety of GPU profiling information. Since nvprof can't attach to an existing process, you might want to use the tool to start an additional process running your training code. (This means your training code will run twice on the node.) For example:

    nvprof -o prof.nvvp python3 -m MODULE_NAME

    Replace MODULE_NAME with the fully-qualified name of your training application's entry point module; for example, trainer.task.

    Then transfer the output file to a Cloud Storage bucket so you can analyze it later on your local computer. For example:

    gcloud storage cp prof.nvvp gs://BUCKET

    Replace BUCKET with the name of a bucket you have access to.

  • If you encounter a GPU error (not a problem with your configuration or with Vertex AI), use nvidia-bug-report.sh to create a bug report.

    Then transfer the report to a Cloud Storage bucket so you can analyze it later on your local computer or send it to NVIDIA. For example:

    gcloud storage cp nvidia-bug-report.log.gz gs://BUCKET

    Replace BUCKET with the name of a bucket you have access to.

If bash can't find any of these NVIDIA commands, try adding /usr/local/nvidia/bin and /usr/local/cuda/bin to the shell's PATH:

export PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}"
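To see which of these tools are reachable before editing PATH, a short Python check along the following lines can help. This is a sketch that uses only the standard library; the function name locate_tools and its parameters are hypothetical, and the two directories are the ones named above.

```python
import os
import shutil

# Directories suggested on this page for the NVIDIA command-line tools.
NVIDIA_DIRS = ["/usr/local/nvidia/bin", "/usr/local/cuda/bin"]
TOOLS = ["nvidia-smi", "nvprof", "nvidia-bug-report.sh"]


def locate_tools(tools=TOOLS, extra_dirs=NVIDIA_DIRS, path=None):
    """Map each tool name to its resolved path, or None if it is not found.

    The extra directories are searched before the existing PATH entries.
    """
    base = os.environ.get("PATH", "") if path is None else path
    search_path = os.pathsep.join(list(extra_dirs) + [base])
    return {tool: shutil.which(tool, path=search_path) for tool in tools}
```

Calling locate_tools() inside the interactive shell returns a dictionary in which a None value marks a tool that is missing from both PATH and the two NVIDIA directories.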

Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering

  1. Configure peered-dns-domains.

    {
      VPC_NAME=NETWORK_NAME
      REGION=LOCATION
      gcloud services peered-dns-domains create training-cloud \
        --network=$VPC_NAME \
        --dns-suffix=$REGION.aiplatform-training.cloud.google.com.

      # Verify
      gcloud beta services peered-dns-domains list --network $VPC_NAME;
    }

    • NETWORK_NAME: The name of your peered network.

    • LOCATION: Desired location (for example, us-central1).

  2. Configure a DNS managed zone.

    {
      PROJECT_ID=PROJECT_ID
      ZONE_NAME=$PROJECT_ID-aiplatform-training-cloud-google-com
      DNS_NAME=aiplatform-training.cloud.google.com
      DESCRIPTION=aiplatform-training.cloud.google.com

      gcloud dns managed-zones create $ZONE_NAME \
        --visibility=private \
        --networks=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/global/networks/$VPC_NAME \
        --dns-name=$DNS_NAME \
        --description="Training $DESCRIPTION"
    }

    • PROJECT_ID: Your project ID. You can find this ID on the Google Cloud console welcome page.

  3. Record DNS transaction.

    {
      gcloud dns record-sets transaction start --zone=$ZONE_NAME

      gcloud dns record-sets transaction add \
        --name=$DNS_NAME. \
        --type=A 199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7 \
        --zone=$ZONE_NAME \
        --ttl=300

      gcloud dns record-sets transaction add \
        --name=*.$DNS_NAME. \
        --type=CNAME $DNS_NAME. \
        --zone=$ZONE_NAME \
        --ttl=300

      gcloud dns record-sets transaction execute --zone=$ZONE_NAME
    }

  4. Submit a training job with the interactive shell + VPC-SC + VPC Peering enabled.


Last updated 2025-12-15 UTC.