gcloud ai-platform jobs submit training
- NAME
- gcloud ai-platform jobs submit training - submit an AI Platform training job
- SYNOPSIS
gcloud ai-platform jobs submit training JOB [--config=CONFIG] [--enable-web-access] [--job-dir=JOB_DIR] [--labels=[KEY=VALUE,…]] [--master-accelerator=[count=COUNT],[type=TYPE]] [--master-image-uri=MASTER_IMAGE_URI] [--master-machine-type=MASTER_MACHINE_TYPE] [--module-name=MODULE_NAME] [--package-path=PACKAGE_PATH] [--packages=[PACKAGE,…]] [--parameter-server-accelerator=[count=COUNT],[type=TYPE]] [--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI] [--python-version=PYTHON_VERSION] [--region=REGION] [--runtime-version=RUNTIME_VERSION] [--scale-tier=SCALE_TIER] [--service-account=SERVICE_ACCOUNT] [--staging-bucket=STAGING_BUCKET] [--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG] [--worker-accelerator=[count=COUNT],[type=TYPE]] [--worker-image-uri=WORKER_IMAGE_URI] [--async | --stream-logs] [--kms-key=KMS_KEY : --kms-keyring=KMS_KEYRING --kms-location=KMS_LOCATION --kms-project=KMS_PROJECT] [--parameter-server-count=PARAMETER_SERVER_COUNT --parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE] [--worker-count=WORKER_COUNT --worker-machine-type=WORKER_MACHINE_TYPE] [GCLOUD_WIDE_FLAG …] [-- USER_ARGS …]
- DESCRIPTION
- Submit an AI Platform training job.
This creates temporary files and executes Python code staged by a user on Cloud Storage. Model code can either be specified with a path, e.g.:

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --package-path /my/code/path/trainer \
      --packages additional-dep1.tar.gz,dep2.whl

Or by specifying an already-built package:

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --packages trainer-0.0.1.tar.gz,additional-dep1.tar.gz,dep2.whl

If --package-path=/my/code/path/trainer is specified and there is a setup.py file at /my/code/path/setup.py, the setup file will be invoked with sdist and the generated tar files will be uploaded to Cloud Storage. Otherwise, a temporary setup.py file will be generated for the build.

By default, this command runs asynchronously; it exits once the job is successfully submitted.

To follow the progress of your job, pass the --stream-logs flag (note that even with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID). For more information, see: https://cloud.google.com/ai-platform/training/docs/overview
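For example, a minimal synchronous submission, reusing the staged package from the examples above and streaming logs until the job finishes:

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --packages trainer-0.0.1.tar.gz \
      --stream-logs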
- POSITIONAL ARGUMENTS
JOB - Name of the job.
- [-- USER_ARGS …] - Additional user arguments to be forwarded to user code.

The '--' argument must be specified between gcloud-specific args on the left and USER_ARGS on the right.
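For example, everything after the bare '--' below is forwarded untouched to the trainer module (--learning-rate and --num-epochs are hypothetical flags parsed by your own training code, not by gcloud):

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --package-path /my/code/path/trainer \
      -- \
      --learning-rate=0.01 --num-epochs=5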
- FLAGS
--config=CONFIG - Path to the job configuration file. This file should be a YAML document (JSON also accepted) containing a Job resource as defined in the API (all fields are optional): https://cloud.google.com/ml/reference/rest/v1/projects.jobs
EXAMPLES:
JSON:
{"jobId":"my_job","labels":{"type":"prod","owner":"alice"},"trainingInput":{"scaleTier":"BASIC","packageUris":["gs://my/package/path"],"region":"us-east1"}}
YAML:
jobId: my_job
labels:
  type: prod
  owner: alice
trainingInput:
  scaleTier: BASIC
  packageUris:
  - gs://my/package/path
  region: us-east1

If an option is specified both in the configuration file and via command-line arguments, the command-line arguments override the configuration file.
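A sketch of submitting with such a file (config.yaml is a hypothetical local path holding the YAML above):

  gcloud ai-platform jobs submit training my_job --config=config.yaml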
--enable-web-access - Whether you want AI Platform Training to enable interactive shell access (https://cloud.google.com/ai-platform/training/docs/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by TrainingOutput.web_access_uris or HyperparameterOutput.web_access_uris (within TrainingOutput.trials).

--job-dir=JOB_DIR - Cloud Storage path in which to store training outputs and other data needed for training.
This path will be passed to your TensorFlow program as the --job-dir command-line arg. The benefit of specifying this field is that AI Platform will validate the path for use in training. However, note that your training program will need to parse the provided --job-dir argument.

If packages must be uploaded and --staging-bucket is not provided, this path will be used instead.

--labels=[KEY=VALUE,…] - List of label KEY=VALUE pairs to add.
Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase characters, and numbers. Values must contain only hyphens (-), underscores (_), lowercase characters, and numbers.

--master-accelerator=[count=COUNT],[type=TYPE] - Hardware accelerator config for the master worker. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type - Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod.

count - Number of accelerators to attach to each machine running the job. Must be greater than 0.
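For example, two V100 GPUs on the master worker (the count and type here are illustrative picks from the choices above):

  --master-accelerator count=2,type=nvidia-tesla-v100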
--master-image-uri=MASTER_IMAGE_URI - Docker image to run on each master worker. This image must be in Container Registry. Only one of --master-image-uri and --runtime-version must be specified.

--master-machine-type=MASTER_MACHINE_TYPE - Specifies the type of virtual machine to use for the training job's master worker.

You must set this value when --scale-tier is set to CUSTOM.

--module-name=MODULE_NAME - Name of the module to run.
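As a sketch of the custom-container route described under --master-image-uri above (the image path is hypothetical; --runtime-version is deliberately omitted since only one of the two may be set):

  gcloud ai-platform jobs submit training my_job \
      --region us-east1 \
      --master-image-uri gcr.io/my-project/my-training-image:latest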
--package-path=PACKAGE_PATH - Path to a Python package to build. This should point to a local directory containing the Python source for the job. It will be built using setuptools (which must be installed) using its parent directory as context. If the parent directory contains a setup.py file, the build will use that; otherwise, it will use a simple built-in one.

--packages=[PACKAGE,…] - Path to Python archives used for training. These can be local paths (absolute or relative), in which case they will be uploaded to the Cloud Storage bucket given by --staging-bucket, or Cloud Storage URLs ('gs://bucket-name/path/to/package.tar.gz').

--parameter-server-accelerator=[count=COUNT],[type=TYPE] - Hardware accelerator config for the parameter servers. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type - Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod.

count - Number of accelerators to attach to each machine running the job. Must be greater than 0.
--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI - Docker image to run on each parameter server. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.

--python-version=PYTHON_VERSION - Version of Python used during training. Choices are 3.7, 3.5, and 2.7. However, this value must be compatible with the chosen runtime version for the job.
Must be used with a compatible runtime version:
- 3.7 is compatible with runtime versions 1.15 and later.
- 3.5 is compatible with runtime versions 1.4 through 1.14.
- 2.7 is compatible with runtime versions 1.15 and earlier.
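For example, a compatible pairing per the list above, reusing the staged package from the earlier examples:

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --packages trainer-0.0.1.tar.gz \
      --region us-east1 \
      --runtime-version 1.15 \
      --python-version 3.7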
--region=REGION - Region of the machine learning training job to submit. If not specified, you might be prompted to select a region (interactive mode only).

To avoid prompting when this flag is omitted, you can set the compute/region property:

  gcloud config set compute/region REGION

A list of regions can be fetched by running:

  gcloud compute regions list

To unset the property, run:

  gcloud config unset compute/region

Alternatively, the region can be stored in the environment variable CLOUDSDK_COMPUTE_REGION.

--runtime-version=RUNTIME_VERSION - AI Platform runtime version for this job. Must be specified unless --master-image-uri is specified instead. It is defined in documentation along with the list of supported versions: https://cloud.google.com/ai-platform/prediction/docs/runtime-version-list
--scale-tier=SCALE_TIER - Specify the machine types, the number of replicas for workers, and parameter servers. SCALE_TIER must be one of:

basic - Single worker instance. This tier is suitable for learning how to use AI Platform, and for experimenting with new models using small datasets.
basic-gpu - Single worker instance with a GPU.

basic-tpu - Single worker instance with a Cloud TPU.
custom - CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines, using the --config flag (a sketch follows this list):

- You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting.
- You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
- You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Note that all of your workers must use the same machine type, which can be different from your parameter server type and master type. Your parameter servers must likewise use the same machine type, which can be different from your worker type and master type.
premium-1 - Large number of workers with many parameter servers.

standard-1 - Many workers and a few parameter servers.
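A minimal config sketch for the custom tier, as referenced in the guidelines above (the machine type values are illustrative choices):

  trainingInput:
    scaleTier: CUSTOM
    masterType: n1-highcpu-16
    workerType: n1-highcpu-16
    workerCount: 2
    parameterServerType: n1-standard-4
    parameterServerCount: 1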
--service-account=SERVICE_ACCOUNT - The email address of a service account to use when running the training application. You must have the iam.serviceAccounts.actAs permission for the specified service account. In addition, the AI Platform Training Google-managed service account must have the roles/iam.serviceAccountAdmin role for the specified service account. Learn more about configuring a service account. If not specified, the AI Platform Training Google-managed service account is used by default.

--staging-bucket=STAGING_BUCKET - Bucket in which to stage training archives.

Required only if a file upload is necessary (that is, other flags include local paths) and no other flags implicitly specify an upload path.
--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG - Use the "chief" role in the cluster instead of "master". This is required for TensorFlow 2.0 and newer versions. Unlike the "master" node, the "chief" node does not run evaluation.
--worker-accelerator=[count=COUNT],[type=TYPE] - Hardware accelerator config for the worker nodes. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).

type - Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod, tpu-v4-pod.

count - Number of accelerators to attach to each machine running the job. Must be greater than 0.
--worker-image-uri=WORKER_IMAGE_URI - Docker image to run on each worker node. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.

At most one of these can be specified:

--async - (DEPRECATED) Display information about the operation in progress without waiting for the operation to complete. Enabled by default and can be omitted; use --stream-logs to run synchronously.

--stream-logs - Block until job completion and stream the logs while the job runs.
Note that even if command execution is halted, the job will still run until cancelled with:

  gcloud ai-platform jobs cancel JOB_ID
- Key resource - The Cloud KMS (Key Management Service) cryptokey that will be used to protect the job. The 'AI Platform Service Agent' service account must hold permission 'Cloud KMS CryptoKey Encrypter/Decrypter'. The arguments in this group can be used to specify the attributes of this resource.
--kms-key=KMS_KEY - ID of the key or fully qualified identifier for the key.

This flag argument must be specified if any of the other arguments in this group are specified.

To set the kms-key attribute:

- provide the argument --kms-key on the command line.

--kms-keyring=KMS_KEYRING - The KMS keyring of the key.

To set the kms-keyring attribute:

- provide the argument --kms-key on the command line with a fully specified name;
- provide the argument --kms-keyring on the command line.

--kms-location=KMS_LOCATION - The Google Cloud location for the key.

To set the kms-location attribute:

- provide the argument --kms-key on the command line with a fully specified name;
- provide the argument --kms-location on the command line.

--kms-project=KMS_PROJECT - The Google Cloud project for the key.

To set the kms-project attribute:

- provide the argument --kms-key on the command line with a fully specified name;
- provide the argument --kms-project on the command line;
- set the property core/project.
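A sketch of setting the key via its component flags (the project, location, keyring, and key names here are hypothetical):

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --packages trainer-0.0.1.tar.gz \
      --region us-east1 \
      --kms-project my-kms-project \
      --kms-location us-east1 \
      --kms-keyring my-keyring \
      --kms-key my-key

Equivalently, the same key can be given as a single fully specified name:

  --kms-key projects/my-kms-project/locations/us-east1/keyRings/my-keyring/cryptoKeys/my-key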
- Configure parameter server machine type settings.

--parameter-server-count=PARAMETER_SERVER_COUNT - Number of parameter servers to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE - Type of virtual machine to use for the training job's parameter servers.

This flag argument must be specified if any of the other arguments in this group are specified.
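For example, two parameter servers on a common machine type (the type is an illustrative value):

  --parameter-server-count 2 --parameter-server-machine-type n1-standard-4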
- Configure worker node machine type settings.

--worker-count=WORKER_COUNT - Number of worker nodes to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--worker-machine-type=WORKER_MACHINE_TYPE - Type of virtual machine to use for the training job's worker nodes.

This flag argument must be specified if any of the other arguments in this group are specified.
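Putting the cluster-shape flags together, a custom-tier sketch with explicit master, worker, and parameter server machine types (all machine type values are illustrative):

  gcloud ai-platform jobs submit training my_job \
      --module-name trainer.task \
      --staging-bucket gs://my-bucket \
      --packages trainer-0.0.1.tar.gz \
      --region us-east1 \
      --scale-tier custom \
      --master-machine-type n1-highcpu-16 \
      --worker-count 2 \
      --worker-machine-type n1-highcpu-16 \
      --parameter-server-count 2 \
      --parameter-server-machine-type n1-standard-4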
- GCLOUD WIDE FLAGS
- These flags are available to all commands:
--access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.
- NOTES
- These variants are also available:
gcloud alpha ai-platform jobs submit training
gcloud beta ai-platform jobs submit training