Monitor and debug training with an interactive shell

This page shows you how to use an interactive shell to inspect the container where your training code is running. You can browse the file system and run debugging utilities in each prebuilt container or custom container running on Vertex AI.

Using an interactive shell to inspect your training container can help you debug problems with your training code or your Vertex AI configuration. For example, you can use an interactive shell to do the following:

Run tracing and profiling tools.
Analyze GPU usage.
Check Google Cloud permissions available to the container.

You can also use Cloud Profiler to debug model training performance for your Vertex AI serverless training jobs. For details, see Profile model training performance using Profiler.

Before you begin

You can use an interactive shell when you perform serverless training with a CustomJob resource, a HyperparameterTuningJob resource, or a custom TrainingPipeline resource. As you prepare your training code and configure the serverless training resource of your choice, make sure to meet the following requirements:

Ensure that your training container has bash installed.
All prebuilt training containers have bash installed. If you create a custom container for training, use a base container that includes bash or install bash in your Dockerfile.
Perform serverless training in a region that supports interactive shells.
Ensure that anyone who wants to access an interactive shell has the following permissions for the Google Cloud project where serverless training is running:
- aiplatform.customJobs.create
- aiplatform.customJobs.get
- aiplatform.customJobs.cancel
If you initiate serverless training yourself, then you most likely already have these permissions and can access an interactive shell. However, if you want to use an interactive shell to inspect a serverless training resource created by someone else in your organization, then you might need to obtain these permissions.
One way to obtain these permissions is to ask an administrator of your organization to grant you the Vertex AI User role (roles/aiplatform.user).

Requirements for advanced cases

If you are using certain advanced features, meet the following additional requirements:

If you attach a custom service account to your serverless training resource, then make sure that any user who wants to access an interactive shell has the iam.serviceAccounts.actAs permission for the attached service account.
The guide to custom service accounts notes that you must have this permission to attach a service account. You also need this permission to view an interactive shell during serverless training.
For example, to create a CustomJob with a service account attached, you must have the iam.serviceAccounts.actAs permission for the service account. If one of your colleagues then wants to view an interactive shell for this CustomJob, they must also have the same iam.serviceAccounts.actAs permission.
If you have configured your project to use VPC Service Controls with Vertex AI, then account for the following additional limitations:
- You can't use private IP for serverless training. If you require VPC-SC with VPC Peering, extra setup is required to use the interactive shell. Follow the instructions in Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering to configure the interactive shell setup with VPC-SC and VPC Peering in your user project.
- From within an interactive shell, you can't access the public internet or Google Cloud resources outside your service perimeter.
- To secure access to interactive shells, you must add notebooks.googleapis.com as a restricted service in your service perimeter, in addition to aiplatform.googleapis.com. If you only restrict aiplatform.googleapis.com and not notebooks.googleapis.com, then users can access interactive shells from machines outside the service perimeter, which reduces the security benefit of using VPC Service Controls.
  Note: More generally, we recommend that you restrict all services when you create a service perimeter. See the VPC Service Controls guide to creating a service perimeter.

Enable interactive shells

To enable interactive shells for a serverless training resource, set the enableWebAccess API field to true when you create a CustomJob, HyperparameterTuningJob, or custom TrainingPipeline.

The following examples show how to do this using several different tools:

Console

Follow the guide to creating a custom TrainingPipeline in the Google Cloud console. In the Train new model pane, when you reach the Model details step, do the following:

Click Advanced options.
Select the Enable training debugging checkbox.

Then, complete the rest of the Train new model workflow.

gcloud

If you want to create a CustomJob, run the gcloud ai custom-jobs create command, and specify the --enable-web-access flag on this command.
If you want to create a HyperparameterTuningJob, run the gcloud ai hp-tuning-jobs create command, and specify the --enable-web-access flag on this command.

To learn how to use these commands, see the guide to creating a CustomJob and the guide to creating a HyperparameterTuningJob.

API

The following partial REST request bodies show where to specify the enableWebAccess field for each type of serverless training resource:

CustomJob

The following example is a partial request body for the projects.locations.customJobs.create API method:

{
  ...
  "jobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}

For an example of sending an API request to create a CustomJob, see Creating serverless training jobs.
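As a minimal sketch, the partial CustomJob body above could be assembled in Python before you send it. The display name, machine type, and image URI are placeholder values chosen for illustration, and the helper name `build_custom_job_body` is hypothetical; only the position of `enableWebAccess` directly inside `jobSpec` comes from the example above.

```python
import json


def build_custom_job_body(display_name, image_uri, machine_type="n1-standard-4"):
    """Return a CustomJob request body with interactive shells enabled."""
    return {
        "displayName": display_name,
        "jobSpec": {
            # One single-replica worker pool running a custom container
            # (illustrative values, not taken from this guide).
            "workerPoolSpecs": [
                {
                    "machineSpec": {"machineType": machine_type},
                    "replicaCount": 1,
                    "containerSpec": {"imageUri": image_uri},
                }
            ],
            # The field this guide is about: it sits directly inside jobSpec.
            "enableWebAccess": True,
        },
    }


if __name__ == "__main__":
    body = build_custom_job_body("debug-job", "gcr.io/PROJECT_ID/IMAGE:TAG")
    print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` turns Python's `True` into the JSON literal `true` that the API expects.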

HyperparameterTuningJob

The following example is a partial request body for the projects.locations.hyperparameterTuningJobs.create API method:

{
  ...
  "trialJobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}

For an example of sending an API request to create a HyperparameterTuningJob, see Using hyperparameter tuning.

Custom TrainingPipeline

The following examples show partial request bodies for the projects.locations.trainingPipelines.create API method. Select one of the following tabs, depending on whether you are using hyperparameter tuning:

Without hyperparameter tuning

{
  ...
  "trainingTaskInputs": {
    ...
    "enableWebAccess": true
  }
  ...
}

With hyperparameter tuning

{
  ...
  "trainingTaskInputs": {
    ...
    "trialJobSpec": {
      ...
      "enableWebAccess": true
    }
  }
  ...
}

For an example of sending an API request to create a custom TrainingPipeline, see Creating training pipelines.
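The field sits at a different depth for each resource type, which is easy to get wrong. The following sketch records the four nesting paths shown in the partial bodies above and sets the flag at the right place. The helper name and the dictionary keys (such as `TrainingPipelineWithTuning`) are hypothetical labels for illustration, not API names.

```python
import copy

# Path to enableWebAccess for each resource type, as shown in the partial
# request bodies in this section. The dictionary keys are illustrative labels.
WEB_ACCESS_PATHS = {
    "CustomJob": ["jobSpec"],
    "HyperparameterTuningJob": ["trialJobSpec"],
    "TrainingPipeline": ["trainingTaskInputs"],
    "TrainingPipelineWithTuning": ["trainingTaskInputs", "trialJobSpec"],
}


def with_web_access(body, resource_kind):
    """Return a copy of body with enableWebAccess set at the right depth."""
    result = copy.deepcopy(body)
    node = result
    for key in WEB_ACCESS_PATHS[resource_kind]:
        # Create intermediate objects if the caller has not filled them in yet.
        node = node.setdefault(key, {})
    node["enableWebAccess"] = True
    return result
```

For example, `with_web_access({}, "TrainingPipelineWithTuning")` produces the same nesting as the "With hyperparameter tuning" body above, and the input dictionary is left unmodified.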

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

Set the enable_web_access parameter to True when you run one of the following methods:

If you want to create a CustomJob, use the CustomJob.run method.
If you want to create a HyperparameterTuningJob, use the HyperparameterTuningJob.run method.
If you want to create a custom TrainingPipeline, use the run method of the training job class that you use to create the pipeline.
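A minimal sketch of the CustomJob case with the Vertex AI SDK for Python might look like the following. The display name, machine type, and the helper names `build_worker_pool_specs` and `run_debuggable_job` are assumptions made for illustration; the only detail taken from this section is passing `enable_web_access=True` to `CustomJob.run`. Running it requires the SDK, credentials, and a real project, bucket, and image.

```python
def build_worker_pool_specs(image_uri, machine_type="n1-standard-4"):
    """Return a single-replica worker pool spec for a custom container."""
    return [
        {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": 1,
            "container_spec": {"image_uri": image_uri},
        }
    ]


def run_debuggable_job(project, location, staging_bucket, image_uri):
    """Create and run a CustomJob with interactive shells enabled."""
    # Imported here so the helper above can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)
    job = aiplatform.CustomJob(
        display_name="debuggable-job",  # illustrative name
        worker_pool_specs=build_worker_pool_specs(image_uri),
    )
    # enable_web_access is the parameter named in this section.
    job.run(enable_web_access=True)
    return job
```

By default `run` waits for the job to finish, so you would typically open the interactive shell from a separate browser window or terminal while this call is still in progress.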

Navigate to an interactive shell

After you have initiated serverless training according to the guidance in the preceding section, Vertex AI generates one or more URIs that you can use to access interactive shells. Vertex AI generates a unique URI for each training node in your job.

You can navigate to an interactive shell in one of the following ways:

Click a link in the Google Cloud console
Use the Vertex AI API to get the shell's web access URI

Navigate from the Google Cloud console

In the Google Cloud console, in the Vertex AI section, go to one of the following pages:
- If you aren't using hyperparameter tuning, go to the Custom jobs page:
  Go to Custom jobs
- If you are using hyperparameter tuning, go to the Hyperparameter tuning jobs page:
  Go to Hyperparameter tuning jobs
Click the name of your serverless training resource.
If you created a TrainingPipeline for serverless training, click the name of the CustomJob or HyperparameterTuningJob that was created by your TrainingPipeline. For example, if your pipeline has the name PIPELINE_NAME, this might be called PIPELINE_NAME-custom-job or PIPELINE_NAME-hyperparameter-tuning-job.
On the page for your job, click Launch web terminal. If your job uses multiple nodes, click Launch web terminal next to the node for which you want an interactive shell.
Note that you can only access an interactive shell while the job is running. If you don't see Launch web terminal, this might be because Vertex AI hasn't started running your job yet, or because the job has already finished or failed. If the job's Status is Queued or Pending, wait a minute; then try refreshing the page.
If you are using hyperparameter tuning, there are separate Launch web terminal links for each trial.

Get the web access URI from the API

Use the projects.locations.customJobs.get API method or the projects.locations.hyperparameterTuningJobs.get API method to see the URIs that you can use to access interactive shells.

Note: If you created a TrainingPipeline for serverless training, run the appropriate get method on the CustomJob identified by the TrainingPipeline's trainingTaskMetadata.backingCustomJob field or the HyperparameterTuningJob identified by the TrainingPipeline's trainingTaskMetadata.backingHyperparameterTuningJob field.

Depending on which type of serverless training resource you are using, select one of the following tabs to see examples of how to find the webAccessUris API field, which contains an interactive shell URI for each node in your job:

CustomJob

The following tabs show different ways to send a projects.locations.customJobs.get request:

gcloud

Run the gcloud ai custom-jobs describe command:

gcloud ai custom-jobs describe JOB_ID \
    --region=LOCATION \
    --format=json

Replace the following:

JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai custom-jobs list command and look for the appropriate job.)
LOCATION: The region where you created the job.

REST

Before using any of the request data, make the following replacements:

LOCATION: The region where you created the job.
PROJECT_ID: Your project ID.
JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID
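If you already have a job's name field, the following sketch shows one way to derive JOB_ID from it and fill in the URL template above. The function names are hypothetical, and the sample name in the usage note assumes the name follows the same path layout as the URL.

```python
def job_id_from_name(name):
    """Return the last path segment of a job's name field."""
    return name.rstrip("/").rsplit("/", 1)[-1]


def custom_job_url(location, project_id, job_id):
    """Fill in the customJobs.get URL template shown above."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs/{job_id}"
    )
```

For example, `job_id_from_name("projects/123/locations/us-central1/customJobs/456")` returns `"456"`, which you can then pass to `custom_job_url` together with the location and project ID.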

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID"

PowerShell (Windows)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID" | Select-Object -Expand Content

In the output, look for the following:

{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "webAccessUris": {
    "workerpool0-0": "INTERACTIVE_SHELL_URI"
  }
}

If you don't see the webAccessUris field, this might be because Vertex AI hasn't started running your job yet. Verify that you see JOB_STATE_RUNNING in the state field. If the state is JOB_STATE_QUEUED or JOB_STATE_PENDING, wait a minute; then try getting the job info again.
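The check described above can be sketched in Python against the parsed JSON response. The function name `shell_uris` is hypothetical; the field names and state value are the ones shown in the sample output.

```python
import json


def shell_uris(job):
    """Return the webAccessUris mapping, or {} if the job is not running."""
    if job.get("state") != "JOB_STATE_RUNNING":
        return {}
    return dict(job.get("webAccessUris", {}))


if __name__ == "__main__":
    # Stand-in for the JSON printed by the gcloud or REST request.
    sample = json.loads(
        '{"state": "JOB_STATE_RUNNING",'
        ' "webAccessUris": {"workerpool0-0": "INTERACTIVE_SHELL_URI"}}'
    )
    for node, uri in shell_uris(sample).items():
        print(node, uri)
```

An empty result means there is nothing to open yet (or anymore), so a caller can decide whether to wait and fetch the job again.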

HyperparameterTuningJob

The following tabs show different ways to send a projects.locations.hyperparameterTuningJobs.get request:

gcloud

Run the gcloud ai hp-tuning-jobs describe command:

gcloud ai hp-tuning-jobs describe JOB_ID \
    --region=LOCATION \
    --format=json

Replace the following:

JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai hp-tuning-jobs list command and look for the appropriate job.)
LOCATION: The region where you created the job.

REST

Before using any of the request data, make the following replacements:

LOCATION: The region where you created the job.
PROJECT_ID: Your project ID.
JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID"

PowerShell (Windows)

Note: Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID" | Select-Object -Expand Content

In the output, look for the following:

{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "trials": [
    ...
    {
      ...
      "state": "ACTIVE",
      ...
      "webAccessUris": {
        "workerpool0-0": "INTERACTIVE_SHELL_URI"
      }
    }
  ]
}

Vertex AI provides a set of interactive shell URIs for each hyperparameter tuning trial as the trial enters the ACTIVE state. If you want to get interactive shell URIs for later trials, get the job info again after those trials start.
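Because URIs appear trial by trial, it can help to collect them from the parsed response each time you fetch the job. The sketch below assumes that each trial object carries an `id` field, which the abbreviated output above doesn't show; the function name is hypothetical.

```python
def active_trial_uris(tuning_job):
    """Map each ACTIVE trial's id to its webAccessUris mapping."""
    uris = {}
    for trial in tuning_job.get("trials", []):
        # Only ACTIVE trials with a populated webAccessUris field are included.
        if trial.get("state") == "ACTIVE" and trial.get("webAccessUris"):
            # Assumption: trials are identified by an "id" field.
            uris[trial.get("id")] = dict(trial["webAccessUris"])
    return uris
```

Calling this again after a later fetch picks up trials that have started in the meantime.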

The preceding example shows the expected output for single-replica training: one URI for the primary training node. If you are performing distributed training, the output contains one URI for each training node, identified by worker pool.

For example, if your job has a primary worker pool with one replica and a secondary worker pool with two replicas, then the webAccessUris field looks similar to the following:

{
  "workerpool0-0": "URI_FOR_PRIMARY",
  "workerpool1-0": "URI_FOR_FIRST_SECONDARY",
  "workerpool1-1": "URI_FOR_SECOND_SECONDARY"
}
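To pick the shell for a particular node programmatically, you can split the keys into their parts. The sketch below assumes, based only on the example above, that keys follow the pattern `workerpool<pool index>-<replica index>`; the function names are hypothetical.

```python
import re

# Assumed key layout, inferred from the example: workerpool<pool>-<replica>.
_NODE_KEY = re.compile(r"^workerpool(\d+)-(\d+)$")


def parse_node_key(key):
    """Split a key such as 'workerpool1-0' into (pool_index, replica_index)."""
    match = _NODE_KEY.match(key)
    if match is None:
        raise ValueError(f"unexpected webAccessUris key: {key!r}")
    return int(match.group(1)), int(match.group(2))


def uris_by_pool(web_access_uris):
    """Group URIs by worker pool index, ordered by replica index."""
    grouped = {}
    for key in sorted(web_access_uris, key=parse_node_key):
        pool, _ = parse_node_key(key)
        grouped.setdefault(pool, []).append(web_access_uris[key])
    return grouped
```

Sorting on the parsed integers rather than on the raw strings keeps replica 10 after replica 9 in larger worker pools.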

Use an interactive shell

To use the interactive shell for a training node, navigate to one of the URIs that you found in the preceding section. A Bash shell appears in your browser, giving you access to the file system of the container where Vertex AI is running your training code.

The following sections describe some things to consider as you use the shell and provide some examples of monitoring tools you might use in the shell.

Prevent the job from ending

When Vertex AI finishes running your job or trial, you will immediately lose access to your interactive shell. If this happens, you might see the message command terminated with exit code 137 or the shell might stop responding. If you created any files in the container's file system, they will not persist after the job ends.

In some cases, you might want to purposefully make your job run longer in order to debug with an interactive shell. For example, you can add code like the following to your training code in order to make the job keep running for at least an hour after an exception occurs:

import time
import traceback

try:
    # Replace with a function that runs your training code
    train_model()
except Exception as e:
    traceback.print_exc()
    time.sleep(60 * 60)  # 1 hour

However, note that you incur Vertex AI Training charges as long as the job keeps running.
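Because the job is billed while it sleeps, one possible variation is to let yourself end the wait early from the shell. The following is a sketch of that idea, not something this guide prescribes: the function name `hold_for_debugging` and the sentinel path `/tmp/stop_debugging` are assumptions chosen for illustration.

```python
import os
import time


def hold_for_debugging(max_seconds=3600.0, sentinel="/tmp/stop_debugging", poll_seconds=5.0):
    """Sleep until the sentinel file appears or max_seconds elapse.

    Returns True if the sentinel file ended the wait, False on timeout.
    """
    deadline = time.monotonic() + max_seconds
    while True:
        if os.path.exists(sentinel):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_seconds, remaining))
```

You could call `hold_for_debugging()` in the `except` block in place of the fixed `time.sleep`, and run `touch /tmp/stop_debugging` in the interactive shell once you have finished debugging, so the job ends without waiting out the full hour.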

Check permissions issues

The interactive shell environment is authenticated using application default credentials (ADC) for the service account that Vertex AI uses to run your training code. You can run gcloud auth list in the shell for more details.

In the shell, you can use bq and other tools that support ADC. This can help you verify that the job is able to access a particular Cloud Storage bucket, BigQuery table, or other Google Cloud resource that your training code needs.

Visualize Python execution with `py-spy`

py-spy lets you profile an executing Python program, without modifying it. To use py-spy in an interactive shell, do the following:

Install py-spy:
```
pip3 install py-spy
```
Run ps aux in the shell, and look for the PID of the Python training program.
Run any of the subcommands described in the py-spy documentation, using the PID that you found in the preceding step.
If you use py-spy record to create an SVG file, copy this file to a Cloud Storage bucket so you can view it later on your local computer. For example:
```
gcloud storage cp profile.svg gs://BUCKET
```
Replace BUCKET with the name of a bucket you have access to.
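If the process list is long, the PID lookup in the second step can be scripted. The sketch below parses text in the usual `ps aux` column layout (ten fixed columns followed by the command line) and keeps processes whose executable name starts with `python`. The function name is hypothetical, and the column layout is an assumption about the `ps` version in your container.

```python
import os


def python_processes(ps_aux_output):
    """Return (pid, command) pairs whose executable name starts with 'python'."""
    matches = []
    for line in ps_aux_output.splitlines()[1:]:  # skip the header row
        # Assumed layout: 10 whitespace-separated columns, then the command.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        executable = os.path.basename(command.split()[0])
        if executable.startswith("python"):
            matches.append((int(fields[1]), command))
    return matches
```

In the shell, you could feed it the text from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` and pass the resulting PID to a py-spy subcommand.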

Analyze performance with `perf`

perf lets you analyze the performance of your training node. To install the version of perf appropriate for your node's Linux kernel, run the following commands:

apt-get update
apt-get install -y linux-tools-generic
rm /usr/bin/perf
LINUX_TOOLS_VERSION=$(ls /usr/lib/linux-tools | tail -n 1)
ln -s "/usr/lib/linux-tools/${LINUX_TOOLS_VERSION}/perf" /usr/bin/perf

After this, you can run any of the subcommands described in the perf documentation.

Retrieve information about GPU usage

GPU-enabled containers running on nodes with GPUs typically have several command-line tools preinstalled that can help you monitor GPU usage. For example:

Use nvidia-smi to monitor GPU utilization of various processes.
Use nvprof to collect a variety of GPU profiling information. Since nvprof can't attach to an existing process, you might want to use the tool to start an additional process running your training code. (This means your training code will run twice on the node.) For example:
```
nvprof -o prof.nvvp python3 -m MODULE_NAME
```
Replace MODULE_NAME with the fully-qualified name of your training application's entry point module; for example, trainer.task.
Then transfer the output file to a Cloud Storage bucket so you can analyze it later on your local computer. For example:
```
gcloud storage cp prof.nvvp gs://BUCKET
```
Replace BUCKET with the name of a bucket you have access to.
If you encounter a GPU error (not a problem with your configuration or with Vertex AI), use nvidia-bug-report.sh to create a bug report.
Then transfer the report to a Cloud Storage bucket so you can analyze it later on your local computer or send it to NVIDIA. For example:
```
gcloud storage cp nvidia-bug-report.log.gz gs://BUCKET
```
Replace BUCKET with the name of a bucket you have access to.

If bash can't find any of these NVIDIA commands, try adding /usr/local/nvidia/bin and /usr/local/cuda/bin to the shell's PATH:

export PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}"
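If you want to log GPU usage over time rather than read it interactively, nvidia-smi can print machine-readable rows, for example with `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. That query is an assumption chosen for this sketch rather than something this guide specifies, and the function name is hypothetical. The parser expects four integer columns per row in that order.

```python
import csv
import io


def parse_gpu_stats(csv_text):
    """Parse rows of 'index, utilization %, memory used MiB, memory total MiB'."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        index, util, used, total = (int(value) for value in row)
        stats.append(
            {
                "gpu": index,
                "utilization_pct": util,
                "memory_used_mib": used,
                "memory_total_mib": total,
            }
        )
    return stats
```

Rows that contain non-numeric values raise `ValueError`, which is a deliberate choice in this sketch so that unexpected output is not silently misread.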

Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering

Configure peered-dns-domains.

{
  VPC_NAME=NETWORK_NAME
  REGION=LOCATION
  gcloud services peered-dns-domains create training-cloud \
    --network=$VPC_NAME \
    --dns-suffix=$REGION.aiplatform-training.cloud.google.com.

  # Verify
  gcloud beta services peered-dns-domains list --network $VPC_NAME;
}

NETWORK_NAME: The name of your peered network.
LOCATION: Desired location (for example, us-central1).

Configure DNS managed zone.

{
  PROJECT_ID=PROJECT_ID
  ZONE_NAME=$PROJECT_ID-aiplatform-training-cloud-google-com
  DNS_NAME=aiplatform-training.cloud.google.com
  DESCRIPTION=aiplatform-training.cloud.google.com

  gcloud dns managed-zones create $ZONE_NAME \
    --visibility=private \
    --networks=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/global/networks/$VPC_NAME \
    --dns-name=$DNS_NAME \
    --description="Training $DESCRIPTION"
}

PROJECT_ID: Your project ID. You can find this ID on the Google Cloud console welcome page.

Record DNS transaction.

{
  gcloud dns record-sets transaction start --zone=$ZONE_NAME

  gcloud dns record-sets transaction add \
    --name=$DNS_NAME. \
    --type=A 199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7 \
    --zone=$ZONE_NAME \
    --ttl=300

  gcloud dns record-sets transaction add \
    --name=*.$DNS_NAME. \
    --type=CNAME $DNS_NAME. \
    --zone=$ZONE_NAME \
    --ttl=300

  gcloud dns record-sets transaction execute --zone=$ZONE_NAME
}

Submit a training job with the interactive shell + VPC-SC + VPC Peering enabled.

What's next

Learn how to optimize the performance of your serverless training jobs using Profiler.
Learn more about how Vertex AI orchestrates custom training.
Read about Training code requirements.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.
