gcloud ai-platform jobs submit training

NAME
gcloud ai-platform jobs submit training - submit an AI Platform training job
SYNOPSIS
gcloud ai-platform jobs submit training JOB [--config=CONFIG] [--enable-web-access] [--job-dir=JOB_DIR] [--labels=[KEY=VALUE,…]] [--master-accelerator=[count=COUNT],[type=TYPE]] [--master-image-uri=MASTER_IMAGE_URI] [--master-machine-type=MASTER_MACHINE_TYPE] [--module-name=MODULE_NAME] [--package-path=PACKAGE_PATH] [--packages=[PACKAGE,…]] [--parameter-server-accelerator=[count=COUNT],[type=TYPE]] [--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI] [--python-version=PYTHON_VERSION] [--region=REGION] [--runtime-version=RUNTIME_VERSION] [--scale-tier=SCALE_TIER] [--service-account=SERVICE_ACCOUNT] [--staging-bucket=STAGING_BUCKET] [--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG] [--worker-accelerator=[count=COUNT],[type=TYPE]] [--worker-image-uri=WORKER_IMAGE_URI] [--async | --stream-logs] [--kms-key=KMS_KEY : --kms-keyring=KMS_KEYRING --kms-location=KMS_LOCATION --kms-project=KMS_PROJECT] [--parameter-server-count=PARAMETER_SERVER_COUNT --parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE] [--worker-count=WORKER_COUNT --worker-machine-type=WORKER_MACHINE_TYPE] [GCLOUD_WIDE_FLAG] [-- USER_ARGS …]
DESCRIPTION
Submit an AI Platform training job.

This creates temporary files and executes Python code staged by a user on Cloud Storage. Model code can either be specified with a path, e.g.:

gcloud ai-platform jobs submit training my_job \
    --module-name trainer.task \
    --staging-bucket gs://my-bucket \
    --package-path /my/code/path/trainer \
    --packages additional-dep1.tar.gz,dep2.whl

Or by specifying an already built package:

gcloud ai-platform jobs submit training my_job \
    --module-name trainer.task \
    --staging-bucket gs://my-bucket \
    --packages trainer-0.0.1.tar.gz,additional-dep1.tar.gz,dep2.whl

If --package-path=/my/code/path/trainer is specified and there is a setup.py file at /my/code/path/setup.py, the setup file will be invoked with sdist and the generated tar files will be uploaded to Cloud Storage. Otherwise, a temporary setup.py file will be generated for the build.
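
As a rough illustration, the two invocation styles above could be assembled from a script. The following Python sketch only builds the argument list and does not run gcloud; the helper name build_submit_command is hypothetical, and it uses only flags documented on this page.

```python
from typing import List, Optional, Sequence


def build_submit_command(
    job: str,
    module_name: str,
    staging_bucket: str,
    package_path: Optional[str] = None,
    packages: Sequence[str] = (),
) -> List[str]:
    """Assemble the argv list for `gcloud ai-platform jobs submit training`.

    Hypothetical helper: it only builds arguments and never invokes gcloud.
    """
    cmd = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job,
        "--module-name", module_name,
        "--staging-bucket", staging_bucket,
    ]
    if package_path is not None:
        # Local source directory that gets built before upload.
        cmd += ["--package-path", package_path]
    if packages:
        # --packages takes a single comma-separated list.
        cmd += ["--packages", ",".join(packages)]
    return cmd


# Model code given as a path, plus extra dependencies.
from_path = build_submit_command(
    "my_job", "trainer.task", "gs://my-bucket",
    package_path="/my/code/path/trainer",
    packages=["additional-dep1.tar.gz", "dep2.whl"],
)

# Already built trainer package, so no --package-path.
prebuilt = build_submit_command(
    "my_job", "trainer.task", "gs://my-bucket",
    packages=["trainer-0.0.1.tar.gz", "additional-dep1.tar.gz", "dep2.whl"],
)

print(" ".join(from_path))
print(" ".join(prebuilt))
```

The resulting lists could be passed to subprocess.run if gcloud is installed and authenticated; that step is left out here.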

By default, this command runs asynchronously; it exits once the job is successfully submitted.

To follow the progress of your job, pass the --stream-logs flag (note that even with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID).

For more information, see: https://cloud.google.com/ai-platform/training/docs/overview

POSITIONAL ARGUMENTS
JOB
Name of the job.
[-- USER_ARGS …]
Additional user arguments to be forwarded to user code.

The '--' argument must be specified between gcloud specific args on the left and USER_ARGS on the right.
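
The separator convention can be pictured with a short Python sketch. It is not gcloud's actual parser; split_user_args is a hypothetical helper, and --epochs stands in for any user-defined argument.

```python
from typing import List, Tuple


def split_user_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare '--' into (gcloud_args, user_args)."""
    if "--" not in argv:
        return list(argv), []
    i = argv.index("--")
    return argv[:i], argv[i + 1:]


# '--epochs 10' is a made-up user argument for illustration.
gcloud_args, user_args = split_user_args(
    ["my_job", "--module-name", "trainer.task", "--", "--epochs", "10"]
)
print(gcloud_args)
print(user_args)
```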

FLAGS
--config=CONFIG
Path to the job configuration file. This file should be a YAML document (JSON also accepted) containing a Job resource as defined in the API (all fields are optional): https://cloud.google.com/ml/reference/rest/v1/projects.jobs

EXAMPLES:

JSON:

{
  "jobId": "my_job",
  "labels": {
    "type": "prod",
    "owner": "alice"
  },
  "trainingInput": {
    "scaleTier": "BASIC",
    "packageUris": [
      "gs://my/package/path"
    ],
    "region": "us-east1"
  }
}

YAML:

jobId: my_job
labels:
  type: prod
  owner: alice
trainingInput:
  scaleTier: BASIC
  packageUris:
  - gs://my/package/path
  region: us-east1

If an option is specified both in the configuration file **and** via command line arguments, the command line arguments override the configuration file.
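
The override rule can be sketched in Python as a dictionary merge. This is only an illustration of the precedence, not how gcloud implements it; effective_options is a hypothetical name, and a value of None stands for a flag that was not passed.

```python
from typing import Any, Dict


def effective_options(config: Dict[str, Any], cli: Dict[str, Any]) -> Dict[str, Any]:
    """Return config-file values, overridden by any command-line value that was set."""
    merged = dict(config)
    # Only flags that were actually passed (not None) replace config values.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


config_file = {"scaleTier": "BASIC", "region": "us-east1"}
command_line = {"scaleTier": None, "region": "us-central1"}
print(effective_options(config_file, command_line))
```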
--enable-web-access
Whether you want AI Platform Training to enable [interactive shell access](https://cloud.google.com/ai-platform/training/docs/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by TrainingOutput.web_access_uris or HyperparameterOutput.web_access_uris (within TrainingOutput.trials).
--job-dir=JOB_DIR
Cloud Storage path in which to store training outputs and other data needed for training.

This path will be passed to your TensorFlow program as the --job-dir command-line arg. The benefit of specifying this field is that AI Platform will validate the path for use in training. However, note that your training program will need to parse the provided --job-dir argument.

If packages must be uploaded and --staging-bucket is not provided, this path will be used instead.
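
A minimal sketch of how a training program might parse the forwarded --job-dir argument, using Python's standard argparse module. The function name parse_args and the extra --epochs argument are hypothetical.

```python
import argparse
from typing import List, Optional, Tuple


def parse_args(argv: Optional[List[str]] = None) -> Tuple[argparse.Namespace, List[str]]:
    """Parse --job-dir and return any other arguments untouched."""
    parser = argparse.ArgumentParser(description="Example trainer entry point")
    parser.add_argument(
        "--job-dir",
        default=None,
        help="Cloud Storage path for training outputs",
    )
    # parse_known_args keeps unrecognized arguments instead of failing on them.
    return parser.parse_known_args(argv)


args, extra = parse_args(["--job-dir", "gs://my-bucket/output", "--epochs", "10"])
print(args.job_dir)
print(extra)
```

Using parse_known_args rather than parse_args is a design choice for this sketch: the program does not exit with an error when it receives arguments it has not declared.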

--labels=[KEY=VALUE,…]
List of label KEY=VALUE pairs to add.

Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase characters, and numbers. Values must contain only hyphens (-), underscores (_), lowercase characters, and numbers.
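
The label rules can be checked before submitting with a short Python sketch. It assumes "lowercase characters" means ASCII a-z, which may be narrower than what the service accepts; invalid_labels and to_flag_value are hypothetical helpers.

```python
import re
from typing import Dict, List

# Assumption: "lowercase characters" is narrowed to ASCII a-z in this sketch.
_KEY_RE = re.compile(r"[a-z][a-z0-9_-]*")
_VALUE_RE = re.compile(r"[a-z0-9_-]*")


def invalid_labels(labels: Dict[str, str]) -> List[str]:
    """Return the keys whose key or value breaks the rules quoted above."""
    return [
        key
        for key, value in labels.items()
        if not _KEY_RE.fullmatch(key) or not _VALUE_RE.fullmatch(value)
    ]


def to_flag_value(labels: Dict[str, str]) -> str:
    """Format labels as the KEY=VALUE,... list expected by --labels."""
    return ",".join(f"{key}={value}" for key, value in labels.items())


labels = {"type": "prod", "owner": "alice"}
print(invalid_labels(labels))
print(to_flag_value(labels))
```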

--master-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the master worker. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
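
As an illustration, the value for an accelerator flag could be built and checked in Python. The list of types is copied from the choices above; accelerator_flag_value is a hypothetical helper and does not reflect how gcloud validates the flag.

```python
ACCELERATOR_TYPES = (
    "nvidia-tesla-a100", "nvidia-tesla-k80", "nvidia-tesla-p100",
    "nvidia-tesla-p4", "nvidia-tesla-t4", "nvidia-tesla-v100",
    "tpu-v2", "tpu-v2-pod", "tpu-v3", "tpu-v3-pod", "tpu-v4-pod",
)


def accelerator_flag_value(accel_type: str, count: int) -> str:
    """Return a 'count=COUNT,type=TYPE' string after basic checks."""
    if accel_type not in ACCELERATOR_TYPES:
        raise ValueError(f"unknown accelerator type: {accel_type}")
    if count <= 0:
        raise ValueError("count must be greater than 0")
    return f"count={count},type={accel_type}"


print("--master-accelerator=" + accelerator_flag_value("nvidia-tesla-t4", 2))
```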
--master-image-uri=MASTER_IMAGE_URI
Docker image to run on each master worker. This image must be in Container Registry. Only one of --master-image-uri and --runtime-version must be specified.
--master-machine-type=MASTER_MACHINE_TYPE
Specifies the type of virtual machine to use for the training job's master worker.

You must set this value when --scale-tier is set to CUSTOM.

--module-name=MODULE_NAME
Name of the module to run.
--package-path=PACKAGE_PATH
Path to a Python package to build. This should point to a local directory containing the Python source for the job. It will be built using setuptools (which must be installed) using its parent directory as context. If the parent directory contains a setup.py file, the build will use that; otherwise, it will use a simple built-in one.
--packages=[PACKAGE,…]
Path to Python archives used for training. These can be local paths (absolute or relative), in which case they will be uploaded to the Cloud Storage bucket given by --staging-bucket, or Cloud Storage URLs ('gs://bucket-name/path/to/package.tar.gz').
--parameter-server-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the parameter servers. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI
Docker image to run on each parameter server. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.
--python-version=PYTHON_VERSION
Version of Python used during training. Choices are 3.7, 3.5, and 2.7. However, this value must be compatible with the chosen runtime version for the job.

Must be used with a compatible runtime version:

  • 3.7 is compatible with runtime versions 1.15 and later.
  • 3.5 is compatible with runtime versions 1.4 through 1.14.
  • 2.7 is compatible with runtime versions 1.15 and earlier.
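
The compatibility list above can be written as a small Python check. It is a sketch that assumes runtime versions have the numeric form MAJOR.MINOR; is_compatible is a hypothetical helper. Versions are compared as integer tuples because comparing them as strings would order "1.4" after "1.14".

```python
from typing import Tuple


def _version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.15' into (1, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(python_version: str, runtime_version: str) -> bool:
    """Apply the Python/runtime compatibility rules listed above."""
    runtime = _version_tuple(runtime_version)
    if python_version == "3.7":
        return runtime >= (1, 15)
    if python_version == "3.5":
        return (1, 4) <= runtime <= (1, 14)
    if python_version == "2.7":
        return runtime <= (1, 15)
    raise ValueError(f"unsupported Python version: {python_version}")


print(is_compatible("3.7", "2.1"))
print(is_compatible("3.5", "1.15"))
```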
--region=REGION
Region of the machine learning training job to submit. If not specified, you might be prompted to select a region (interactive mode only).

To avoid prompting when this flag is omitted, you can set the compute/region property:

gcloud config set compute/region REGION

A list of regions can be fetched by running:

gcloud compute regions list

To unset the property, run:

gcloud config unset compute/region

Alternatively, the region can be stored in the environment variable CLOUDSDK_COMPUTE_REGION.

--runtime-version=RUNTIME_VERSION
AI Platform runtime version for this job. Must be specified unless --master-image-uri is specified instead. It is defined in documentation along with the list of supported versions: https://cloud.google.com/ai-platform/prediction/docs/runtime-version-list
--scale-tier=SCALE_TIER
Specify the machine types, the number of replicas for workers, and parameter servers. SCALE_TIER must be one of:
basic
Single worker instance. This tier is suitable for learning how to use AI Platform, and for experimenting with new models using small datasets.
basic-gpu
Single worker instance with a GPU.
basic-tpu
Single worker instance with a Cloud TPU.
custom
CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines (using the --config flag):
  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Note that all of your workers must use the same machine type, which can be different from your parameter server type and master type. Your parameter servers must likewise use the same machine type, which can be different from your worker type and master type.
premium-1
Large number of workers with many parameter servers.
standard-1
Many workers and a few parameter servers.
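
The CUSTOM tier guidelines listed above can be checked against a trainingInput mapping before it is written to the --config file. The following Python sketch uses the field names from those guidelines; custom_tier_errors is a hypothetical helper, and n1-standard-4 is only a placeholder machine type.

```python
from typing import Any, Dict, List


def custom_tier_errors(training_input: Dict[str, Any]) -> List[str]:
    """Return the CUSTOM tier rules that a trainingInput mapping violates."""
    errors = []
    if not training_input.get("masterType"):
        errors.append("masterType is required")
    pairs = (
        ("workerCount", "workerType"),
        ("parameterServerCount", "parameterServerType"),
    )
    for count_field, type_field in pairs:
        # Counts may arrive as strings from a config file, so normalize to int.
        count = int(training_input.get(count_field) or 0)
        if count > 0 and not training_input.get(type_field):
            errors.append(f"{type_field} is required when {count_field} is {count}")
    return errors


# Placeholder values for illustration only.
print(custom_tier_errors({"masterType": "n1-standard-4", "workerCount": 2}))
```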
--service-account=SERVICE_ACCOUNT
The email address of a service account to use when running the training application. You must have the iam.serviceAccounts.actAs permission for the specified service account. In addition, the AI Platform Training Google-managed service account must have the roles/iam.serviceAccountAdmin role for the specified service account. Learn more about configuring a service account. If not specified, the AI Platform Training Google-managed service account is used by default.
--staging-bucket=STAGING_BUCKET
Bucket in which to stage training archives.

Required only if a file upload is necessary (that is, other flags include local paths) and no other flags implicitly specify an upload path.

--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG
Use "chief" role in the cluster instead of "master". This is required for TensorFlow 2.0 and newer versions. Unlike "master" node, "chief" node does not run evaluation.
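
Training code that has to work with either role name might inspect its task type. The Python sketch below assumes the usual TF_CONFIG environment variable layout, a JSON object with a "task" entry that has a "type" field; task_type and is_coordinator are hypothetical helpers.

```python
import json
import os
from typing import Mapping


def task_type(environ: Mapping[str, str] = os.environ) -> str:
    """Return the task type from TF_CONFIG, or "" when it is absent."""
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    return tf_config.get("task", {}).get("type", "")


def is_coordinator(environ: Mapping[str, str] = os.environ) -> bool:
    """True for the coordinating node under either role name."""
    return task_type(environ) in ("chief", "master")


# Simulated environment for illustration; a real job reads os.environ.
fake_env = {"TF_CONFIG": json.dumps({"task": {"type": "chief", "index": 0}})}
print(task_type(fake_env), is_coordinator(fake_env))
```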
--worker-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the worker nodes. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
--worker-image-uri=WORKER_IMAGE_URI
Docker image to run on each worker node. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.
At most one of these can be specified:
--async
(DEPRECATED) Display information about the operation in progress without waiting for the operation to complete. Enabled by default and can be omitted; use --stream-logs to run synchronously.
--stream-logs
Block until job completion and stream the logs while the job runs.

Note that even if command execution is halted, the job will still run until cancelled with

gcloud ai-platform jobs cancel JOB_ID
Key resource - The Cloud KMS (Key Management Service) cryptokey that will be used to protect the job. The 'AI Platform Service Agent' service account must hold permission 'Cloud KMS CryptoKey Encrypter/Decrypter'. The arguments in this group can be used to specify the attributes of this resource.
--kms-key=KMS_KEY
ID of the key or fully qualified identifier for the key.

To set the kms-key attribute:

  • provide the argument --kms-key on the command line.

This flag argument must be specified if any of the other arguments in this group are specified.

--kms-keyring=KMS_KEYRING
The KMS keyring of the key.

To set the kms-keyring attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-keyring on the command line.
--kms-location=KMS_LOCATION
The Google Cloud location for the key.

To set the kms-location attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-location on the command line.
--kms-project=KMS_PROJECT
The Google Cloud project for the key.

To set the kms-project attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-project on the command line;
  • set the property core/project.
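
The relationship between a fully specified key name and the four separate attributes can be sketched in Python. The layout below is an assumption based on the usual Cloud KMS resource-name format; kms_key_name is a hypothetical helper and every identifier in the example is a placeholder.

```python
def kms_key_name(project: str, location: str, keyring: str, key: str) -> str:
    """Compose a fully specified key name from its four attributes.

    Assumes the usual Cloud KMS resource-name layout.
    """
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{keyring}/cryptoKeys/{key}"
    )


# Placeholder identifiers for illustration only.
full_name = kms_key_name("my-project", "us-central1", "my-keyring", "my-key")
print("--kms-key=" + full_name)
```

Passing one fully specified value to --kms-key carries the same information as passing --kms-key, --kms-keyring, --kms-location, and --kms-project separately.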
Configure parameter server machine type settings.
--parameter-server-count=PARAMETER_SERVER_COUNT
Number of parameter servers to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE
Type of virtual machine to use for the training job's parameter servers.

This flag argument must be specified if any of the other arguments in this group are specified.

Configure worker node machine type settings.
--worker-count=WORKER_COUNT
Number of worker nodes to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--worker-machine-type=WORKER_MACHINE_TYPE
Type of virtual machine to use for the training job's worker nodes.

This flag argument must be specified if any of the other arguments in this group are specified.

GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

NOTES
These variants are also available:
gcloud alpha ai-platform jobs submit training
gcloud beta ai-platform jobs submit training

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-01-21 UTC.