Create training pipelines
Training pipelines let you perform custom machine learning (ML) training and automatically create a Model resource based on your training output.

If your task only involves running a training job and automatically creating a model resource, without orchestrating a full workflow, using a standalone training pipeline as described in this document might be sufficient. However, if your goal is to build a robust, automated, and repeatable end-to-end ML lifecycle that involves multiple steps (like data processing, training, evaluation, deployment, or monitoring), Vertex AI Pipelines is the recommended approach, since it's designed for workflow orchestration.
Before you create a pipeline
Before you create a training pipeline on Vertex AI, you need to create a Python training application or a custom container to define the training code and dependencies you want to run on Vertex AI. If you create a Python training application using PyTorch, TensorFlow, scikit-learn, or XGBoost, you can use our prebuilt containers to run your code. If you're not sure which of these options to choose, refer to the training code requirements to learn more.
Training pipeline options
A training pipeline encapsulates training jobs with additional steps. This guide explains two different training pipelines:

- Launch a CustomJob and upload the resulting model to Vertex AI
- Launch a hyperparameter tuning job and upload the resulting model to Vertex AI

Additionally, you can use managed datasets in your training pipeline. Learn more about configuring your training pipeline to use a managed dataset.
What a CustomJob includes

When you create a custom job, you specify settings that Vertex AI needs to run your training code, including:

- One worker pool (WorkerPoolSpec) for single-node training, or multiple worker pools for distributed training
- Optional settings for configuring job scheduling (Scheduling), setting certain environment variables for your training code, using a custom service account, and using VPC Network Peering
Within the worker pool(s), you can specify the following settings:
- Machine types and accelerators
- Configuration of what type of training code the worker pool runs: either a Python training application (PythonPackageSpec) or a custom container (ContainerSpec)
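To make these worker pool settings concrete, here is a minimal sketch of a single workerPoolSpec in the JSON shape the REST API expects. The helper function is our own illustration, not part of any SDK, and the machine type, image URI, and arguments are placeholder values:

```python
def build_worker_pool_spec(image_uri, args, machine_type="n1-standard-4", replica_count=1):
    """Return a workerPoolSpec dict for a custom-container training job."""
    return {
        "machineSpec": {"machineType": machine_type},
        "replicaCount": replica_count,
        # For a Python training application in a prebuilt container,
        # use "pythonPackageSpec" here instead of "containerSpec".
        "containerSpec": {"imageUri": image_uri, "args": list(args)},
    }

spec = build_worker_pool_spec(
    "us-docker.pkg.dev/my-project/my-repo/trainer:latest",  # hypothetical image
    ["--epochs=10"],
)
print(spec["replicaCount"])  # 1
```

A list of one or more of these dicts becomes the workerPoolSpecs field of the pipeline's trainingTaskInputs.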
If you want to create a standalone custom job outside of a Vertex AI training pipeline, refer to the guide on custom jobs.
Configure your pipeline to use a managed dataset
Within your training pipeline, you can configure your serverless training job or hyperparameter tuning job to use a managed dataset. Managed datasets let you manage your datasets with your training applications and models.
To use a managed dataset in your training pipeline:
- Create your dataset.
- Update your training application to use a managed dataset. For more information, see how Vertex AI passes your dataset to your training application.
- Specify a managed dataset when you create your training pipeline. For example, if you create your training pipeline using the REST API, specify the dataset settings in the inputDataConfig section. You must create the training pipeline in the same region where you created the dataset.

To learn more, refer to the API reference on TrainingPipeline.
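As a sketch of how this looks in a request body, the following builds an inputDataConfig dict that references a managed dataset and uses a fraction split. The helper function and its default fractions are illustrative, not part of the API:

```python
def build_input_data_config(dataset_id, output_uri_prefix,
                            training=0.8, validation=0.1, test=0.1):
    """Return an inputDataConfig dict that references a managed dataset."""
    return {
        "datasetId": dataset_id,
        # One of the mutually exclusive split options; fractionSplit
        # sizes the training, validation, and test sets by fraction.
        "fractionSplit": {
            "trainingFraction": training,
            "validationFraction": validation,
            "testFraction": test,
        },
        # Where Vertex AI exports the split dataset.
        "gcsDestination": {"outputUriPrefix": output_uri_prefix},
    }

config = build_input_data_config("1234567890", "gs://my-bucket/data")  # placeholder values
print(config["fractionSplit"]["trainingFraction"])  # 0.8
```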
Configure distributed training
Within your training pipeline, you can configure your serverless training job or hyperparameter tuning job for distributed training by specifying multiple worker pools.
All the examples on this page show single-replica training jobs with one workerpool. To modify them for distributed training:
- Use your first worker pool to configure your primary replica, and set the replica count to 1.
- Add more worker pools to configure worker replicas, parameter server replicas, or evaluator replicas, if your machine learning framework supports these additional cluster tasks for distributed training.

Learn more about using distributed training.
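The two steps above can be sketched as a helper that builds a workerPoolSpecs list: the first pool is the primary replica with a count of 1, and a second pool holds the remaining workers. The function name and values are illustrative placeholders:

```python
def build_distributed_worker_pools(image_uri, num_workers, machine_type="n1-standard-4"):
    """Return a workerPoolSpecs list for distributed training."""
    def pool(replica_count):
        return {
            "machineSpec": {"machineType": machine_type},
            "replicaCount": replica_count,
            "containerSpec": {"imageUri": image_uri},
        }

    # The first worker pool is always the primary replica, with a count of 1.
    pools = [pool(1)]
    # Additional pools hold worker replicas (or parameter server / evaluator
    # replicas, each in its own pool, if your framework uses them).
    if num_workers > 0:
        pools.append(pool(num_workers))
    return pools

pools = build_distributed_worker_pools(
    "us-docker.pkg.dev/my-project/my-repo/trainer:latest", num_workers=3)
print(len(pools))  # 2
```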
CustomJob and model upload
This training pipeline encapsulates a custom job with an added convenience step that makes it easier to deploy your model to Vertex AI after training. This training pipeline does two main things:

1. The training pipeline creates a CustomJob resource. The custom job runs the training application using the computing resources that you specify.
2. After the custom job completes, the training pipeline finds the model artifacts that your training application creates in the output directory you specified for your Cloud Storage bucket. It uses these artifacts to create a model resource, which sets you up for model deployment.
There are two different ways to set the location for your model artifacts:
- If you set a baseOutputDirectory for your training job, make sure your training code saves your model artifacts to that location, using the $AIP_MODEL_DIR environment variable set by Vertex AI. After the training job is completed, Vertex AI searches for the resulting model artifacts in gs://BASE_OUTPUT_DIRECTORY/model.
- If you set the modelToUpload.artifactUri field, the training pipeline uploads the model artifacts from that URI. You must set this field if you didn't set baseOutputDirectory.

If you specify both baseOutputDirectory and modelToUpload.artifactUri, Vertex AI uses modelToUpload.artifactUri.
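From the training code's side, honoring baseOutputDirectory just means writing artifacts to the path in the AIP_MODEL_DIR environment variable. A minimal sketch follows; the fallback path is only for local testing, and the commented save calls depend on your framework:

```python
import os

# Vertex AI sets AIP_MODEL_DIR to BASE_OUTPUT_DIRECTORY/model when
# baseOutputDirectory is configured on the training job.
model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")

# After training, save artifacts there with your framework's own API, e.g.:
#   model.save(model_dir)                                        # TensorFlow/Keras
#   joblib.dump(model, os.path.join(model_dir, "model.joblib"))  # scikit-learn
print(model_dir)
```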
To create this type of training pipeline:
Console
In the Google Cloud console, in the Vertex AI section, go to the Training pipelines page.

Click Create to open the Train new model pane.

On the Training method step, specify the following settings:

If you want to use a managed dataset for training, then specify a Dataset and an Annotation set.

Otherwise, in the Dataset drop-down list, select No managed dataset.

Select Custom training (advanced).

Click Continue.

On the Model details step, choose Train new model or Train new version. If you select Train new model, enter a name of your choice, MODEL_NAME, for your model. Click Continue.

On the Training container step, specify the following settings:

Select whether to use a Prebuilt container or a Custom container for training.
Depending on your choice, do one of the following:
If you want to use a prebuilt container for training, then provide Vertex AI with information it needs to use the training package that you have uploaded to Cloud Storage:

Use the Model framework and Model framework version drop-down lists to specify the prebuilt container that you want to use.

In the Package location field, specify the Cloud Storage URI of the Python training application that you have created and uploaded. This file usually ends with .tar.gz.

In the Python module field, enter the module name of your training application's entry point.

If you want to use a custom container for training, then in the Container image field, specify the Artifact Registry or Docker Hub URI of your container image.
In the Model output directory field, specify the Cloud Storage URI of a directory in a bucket that you have access to. The directory does not need to exist yet.

This value gets passed to Vertex AI in the baseOutputDirectory API field, which sets several environment variables that your training application can access when it runs.

At the end of training, Vertex AI looks for model artifacts in a subdirectory of this URI in order to create a Model. (This subdirectory is available to your training code as the AIP_MODEL_DIR environment variable.) When you don't use hyperparameter tuning, Vertex AI expects to find model artifacts in BASE_OUTPUT_DIRECTORY/model/.

Optional: In the Arguments field, you can specify arguments for Vertex AI to use when it starts running your training code. The maximum length for all arguments combined is 100,000 characters. The behavior of these arguments differs depending on what type of container you are using:
If you are using a prebuilt container, then Vertex AI passes the arguments as command-line flags to your Python module.

If you are using a custom container, then Vertex AI overrides your container's CMD instruction with the arguments.
Click Continue.

On the Hyperparameter tuning step, make sure that the Enable hyperparameter tuning checkbox is not selected. Click Continue.

On the Compute and pricing step, specify the following settings:

In the Region drop-down list, select a region that supports custom training.

In the Worker pool 0 section, specify compute resources to use for training.

If you specify accelerators, make sure the type of accelerator that you choose is available in your selected region.

If you want to perform distributed training, then click Add more worker pools and specify an additional set of compute resources for each additional worker pool that you want.

Click Continue.

On the Prediction container step, specify the following settings:

Select whether to use a Prebuilt container or a Custom container to serve predictions from your trained model.
Depending on your choice, do one of the following:
If you want to use a prebuilt container to serve predictions, then use the Model framework, Model framework version, and Accelerator type fields to choose which prebuilt prediction container to use for prediction.

Match Model framework and Model framework version to the machine learning framework you used for training. Only specify an Accelerator type if you want to later use GPUs for online or batch predictions.

If you want to use a custom container to serve predictions, then do the following:

In the Container image field, specify the Artifact Registry URI of your container image.

Optionally, you may specify a Command to override the container's ENTRYPOINT instruction.

The Model directory field contains the value that you previously set in the Model output directory field of the Training container step. Changing either of these fields has the same effect. See the previous instruction for more information about this field.

Leave the fields in the Predict schemata section blank.

Click Start training to start the serverless training pipeline.
REST
Use the following code sample to create a training pipeline using the create method of the trainingPipeline resource.

Note: If you want to set this pipeline to create a new model version, you can optionally add the PARENT_MODEL in the trainingPipeline field.

To learn more, see Model versioning with Vertex AI Model Registry.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where the training code is run and where the Model is stored.
- PROJECT_ID: Your project ID.
- TRAINING_PIPELINE_NAME: Required. A display name for the trainingPipeline.
- If your training application uses a Vertex AI dataset, specify the following:
- DATASET_ID: The ID of the dataset.
- ANNOTATIONS_FILTER: Filters the dataset by the annotations that you specify.
- ANNOTATION_SCHEMA_URI: Filters the dataset by the specified annotation schema URI.
- Use one of the following options to specify how data items are split into training, validation, and test sets.
- To split the dataset based on fractions defining the size of each set, specify the following:
- TRAINING_FRACTION: The fraction of the dataset to use to train your model.
- VALIDATION_FRACTION: The fraction of the dataset to use to validate your model.
- TEST_FRACTION: The fraction of the dataset to use to evaluate your model.
- To split the dataset based on filters, specify the following:
- TRAINING_FILTER: Filters the dataset to data items to use for training your model.
- VALIDATION_FILTER: Filters the dataset to data items to use for validating your model.
- TEST_FILTER: Filters the dataset to data items to use for evaluating your model.
- To use a predefined split, specify the following:
- PREDEFINED_SPLIT_KEY: The name of the column to use to split the dataset. Acceptable values in this column include `training`, `validation`, and `test`.
- To split the dataset based on the timestamp on the data items, specify the following:
- TIMESTAMP_TRAINING_FRACTION: The fraction of the dataset to use to train your model.
- TIMESTAMP_VALIDATION_FRACTION: The fraction of the dataset to use to validate your model.
- TIMESTAMP_TEST_FRACTION: The fraction of the dataset to use to evaluate your model.
- TIMESTAMP_SPLIT_KEY: The name of the timestamp column to use to split the dataset.
- OUTPUT_URI_PREFIX: The Cloud Storage location where Vertex AI exports your training dataset, once it has been split into training, validation, and test sets.
- Define the custom training job:
- MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
- ACCELERATOR_TYPE: (Optional.) The type of accelerator to attach to each trial.
- ACCELERATOR_COUNT: (Optional.) The number of accelerators to attach to each trial.
- REPLICA_COUNT: The number of worker replicas to use for each trial.
- If your training application runs in a custom container, specify the following:
- CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry or Docker Hub that is to be run on each worker replica.
- CUSTOM_CONTAINER_COMMAND: (Optional.) The command to be invoked when the container is started. This command overrides the container's default entrypoint.
- CUSTOM_CONTAINER_ARGS: (Optional.) The arguments to be passed when starting the container. The maximum length for all arguments combined is 100,000 characters.
- If your training application is a Python package that runs in a prebuilt container, specify the following:
- PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to theavailable prebuilt containers for training.
- PYTHON_PACKAGE_URIS: The Cloud Storage location of the Python package files which are the training program and its dependent packages. The maximum number of package URIs is 100.
- PYTHON_MODULE: The Python module name to run after installing the packages.
- PYTHON_PACKAGE_ARGS: (Optional.) Command-line arguments to be passed to the Python module. The maximum length for all arguments combined is 100,000 characters.
- TIMEOUT: (Optional.) The maximum running time for the job.
- MODEL_NAME: A display name for the model uploaded (created) by the TrainingPipeline.
- MODEL_DESCRIPTION: A description for the model.
- IMAGE_URI: The URI of the container image to use for running predictions. For example, us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-1:latest. Use prebuilt containers or custom containers.
- modelToUpload.labels: Any set of key-value pairs to organize your models. For example:
- "env": "prod"
- "tier": "backend"
- Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this training pipeline.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines
Request JSON body:
{
  "displayName": "TRAINING_PIPELINE_NAME",
  "inputDataConfig": {
    "datasetId": DATASET_ID,
    "annotationsFilter": ANNOTATIONS_FILTER,
    "annotationSchemaUri": ANNOTATION_SCHEMA_URI,
    // Union field split can be only one of the following:
    "fractionSplit": {
      "trainingFraction": TRAINING_FRACTION,
      "validationFraction": VALIDATION_FRACTION,
      "testFraction": TEST_FRACTION
    },
    "filterSplit": {
      "trainingFilter": TRAINING_FILTER,
      "validationFilter": VALIDATION_FILTER,
      "testFilter": TEST_FILTER
    },
    "predefinedSplit": {
      "key": PREDEFINED_SPLIT_KEY
    },
    "timestampSplit": {
      "trainingFraction": TIMESTAMP_TRAINING_FRACTION,
      "validationFraction": TIMESTAMP_VALIDATION_FRACTION,
      "testFraction": TIMESTAMP_TEST_FRACTION,
      "key": TIMESTAMP_SPLIT_KEY
    },
    // End of list of possible types for union field split.
    "gcsDestination": {
      "outputUriPrefix": OUTPUT_URI_PREFIX
    }
  },
  "trainingTaskDefinition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml",
  "trainingTaskInputs": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": MACHINE_TYPE,
          "acceleratorType": ACCELERATOR_TYPE,
          "acceleratorCount": ACCELERATOR_COUNT
        },
        "replicaCount": REPLICA_COUNT,
        // Union field task can be only one of the following:
        "containerSpec": {
          "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
          "command": [ CUSTOM_CONTAINER_COMMAND ],
          "args": [ CUSTOM_CONTAINER_ARGS ]
        },
        "pythonPackageSpec": {
          "executorImageUri": PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,
          "packageUris": [ PYTHON_PACKAGE_URIS ],
          "pythonModule": PYTHON_MODULE,
          "args": [ PYTHON_PACKAGE_ARGS ]
        }
        // End of list of possible types for union field task.
      }
    ],
    "scheduling": {
      "timeout": TIMEOUT
    }
  },
  "modelToUpload": {
    "displayName": "MODEL_NAME",
    "predictSchemata": {},
    "containerSpec": {
      "imageUri": "IMAGE_URI"
    }
  },
  "labels": {
    "LABEL_NAME_1": LABEL_VALUE_1,
    "LABEL_NAME_2": LABEL_VALUE_2
  }
}

To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines"
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines" | Select-Object -Expand Content
The response contains information about specifications as well as the TRAININGPIPELINE_ID.
Response
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines/TRAININGPIPELINE_ID",
  "displayName": "TRAINING_PIPELINE_NAME",
  "inputDataConfig": {
    "datasetId": "1234567891011121314",
    "gcsDestination": {
      "outputUriPrefix": "gs://BUCKET_NAME/data/20200915191342"
    },
    "annotationSchemaUri": "gs://google-cloud-aiplatform/schema/dataset/annotation/image_classification_1.0.0.yaml"
  },
  "trainingTaskDefinition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml",
  "trainingTaskInputs": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-4"
        },
        "replicaCount": "1",
        "pythonPackageSpec": {
          "executorImageUri": "us-docker.pkg.dev/vertex-ai/training/training-tf-cpu.2-1:latest",
          "packageUris": [
            "gs://BUCKET_NAME/training/hello-custom-training-1.0.tar.gz"
          ],
          "pythonModule": "trainer.task",
          "args": [
            "--model-dir=gs://BUCKET_NAME/output/"
          ]
        }
      }
    ]
  },
  "trainingTaskMetadata": {
    "backingCustomJob": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID"
  },
  "modelToUpload": {
    "displayName": "MODEL_NAME",
    "predictSchemata": {},
    "containerSpec": {
      "imageUri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-1:latest"
    }
  },
  "state": "PIPELINE_STATE_PENDING",
  "createTime": "2020-09-15T19:09:54.342080Z",
  "startTime": "2020-09-15T19:13:42.991045Z"
}

Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.Model;
import com.google.cloud.aiplatform.v1.ModelContainerSpec;
import com.google.cloud.aiplatform.v1.PipelineServiceClient;
import com.google.cloud.aiplatform.v1.PipelineServiceSettings;
import com.google.cloud.aiplatform.v1.TrainingPipeline;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;

public class CreateTrainingPipelineCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";
    String modelDisplayName = "MODEL_DISPLAY_NAME";
    String containerImageUri = "CONTAINER_IMAGE_URI";
    String baseOutputDirectoryPrefix = "BASE_OUTPUT_DIRECTORY_PREFIX";
    createTrainingPipelineCustomJobSample(
        project, displayName, modelDisplayName, containerImageUri, baseOutputDirectoryPrefix);
  }

  static void createTrainingPipelineCustomJobSample(
      String project,
      String displayName,
      String modelDisplayName,
      String containerImageUri,
      String baseOutputDirectoryPrefix)
      throws IOException {
    PipelineServiceSettings settings =
        PipelineServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (PipelineServiceClient client = PipelineServiceClient.create(settings)) {
      JsonObject jsonMachineSpec = new JsonObject();
      jsonMachineSpec.addProperty("machineType", "n1-standard-4");

      // A working docker image can be found at
      // gs://cloud-samples-data/ai-platform/mnist_tfrecord/custom_job
      // This sample image accepts a set of arguments including model_dir.
      JsonObject jsonContainerSpec = new JsonObject();
      jsonContainerSpec.addProperty("imageUri", containerImageUri);
      JsonArray jsonArgs = new JsonArray();
      jsonArgs.add("--model_dir=$(AIP_MODEL_DIR)");
      jsonContainerSpec.add("args", jsonArgs);

      JsonObject jsonJsonWorkerPoolSpec0 = new JsonObject();
      jsonJsonWorkerPoolSpec0.addProperty("replicaCount", 1);
      jsonJsonWorkerPoolSpec0.add("machineSpec", jsonMachineSpec);
      jsonJsonWorkerPoolSpec0.add("containerSpec", jsonContainerSpec);
      JsonArray jsonWorkerPoolSpecs = new JsonArray();
      jsonWorkerPoolSpecs.add(jsonJsonWorkerPoolSpec0);

      JsonObject jsonBaseOutputDirectory = new JsonObject();
      // The GCS location for outputs must be accessible by the project's AI Platform
      // service account.
      jsonBaseOutputDirectory.addProperty("output_uri_prefix", baseOutputDirectoryPrefix);

      JsonObject jsonTrainingTaskInputs = new JsonObject();
      jsonTrainingTaskInputs.add("workerPoolSpecs", jsonWorkerPoolSpecs);
      jsonTrainingTaskInputs.add("baseOutputDirectory", jsonBaseOutputDirectory);

      Value.Builder trainingTaskInputsBuilder = Value.newBuilder();
      JsonFormat.parser().merge(jsonTrainingTaskInputs.toString(), trainingTaskInputsBuilder);
      Value trainingTaskInputs = trainingTaskInputsBuilder.build();
      String trainingTaskDefinition =
          "gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml";
      String imageUri = "gcr.io/cloud-aiplatform/prediction/tf-cpu.1-15:latest";
      ModelContainerSpec containerSpec =
          ModelContainerSpec.newBuilder().setImageUri(imageUri).build();
      Model modelToUpload =
          Model.newBuilder()
              .setDisplayName(modelDisplayName)
              .setContainerSpec(containerSpec)
              .build();
      TrainingPipeline trainingPipeline =
          TrainingPipeline.newBuilder()
              .setDisplayName(displayName)
              .setTrainingTaskDefinition(trainingTaskDefinition)
              .setTrainingTaskInputs(trainingTaskInputs)
              .setModelToUpload(modelToUpload)
              .build();

      LocationName parent = LocationName.of(project, location);
      TrainingPipeline response = client.createTrainingPipeline(parent, trainingPipeline);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}

Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

The following examples show how to use the Vertex AI SDK for Python to create a serverless training pipeline. Choose whether you plan to use a custom container or a prebuilt container for training:
Prebuilt container
When you use the Vertex AI SDK for Python to create a training pipeline that runs your Python code in a prebuilt container, you can provide your training code in one of the following ways:

Specify the URI of a Python source distribution package in Cloud Storage.

(This option is also available when you create a training pipeline without using the Vertex AI SDK for Python.)
Specify the path to a Python script on your local machine. Before it createsa training pipeline, the Vertex AI SDK for Python packages your script as a sourcedistribution and uploads it to the Cloud Storage bucket of yourchoice.
(This option is only available when you use the Vertex AI SDK for Python.)
To see a code sample for each of these options, select the corresponding tab:
Package
The following sample uses the CustomPythonPackageTrainingJob class.

from typing import List, Optional, Union

from google.cloud import aiplatform


def create_training_pipeline_custom_package_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    python_package_gcs_uri: str,
    python_module_name: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 1,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomPythonPackageTrainingJob(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module_name=python_module_name,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

Script
The following sample uses the CustomTrainingJob class.

from typing import List, Optional, Union

from google.cloud import aiplatform


def create_training_pipeline_custom_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 0,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomTrainingJob(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

Custom container
The following sample uses the CustomContainerTrainingJob class.

from typing import List, Optional, Union

from google.cloud import aiplatform


def create_training_pipeline_custom_container_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 1,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomContainerTrainingJob(
        display_name=display_name,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

Hyperparameter tuning job and model upload
This training pipeline encapsulates a hyperparameter tuning job with an added convenience step that makes it easier to deploy your model to Vertex AI after training. This training pipeline does two main things:

1. The training pipeline creates a hyperparameter tuning job resource. The hyperparameter tuning job creates multiple trials. For each trial, a custom job runs your training application using the computing resources and hyperparameters that you specify.
2. After the hyperparameter tuning job completes, the training pipeline finds the model artifacts from the best trial, within the output directory (baseOutputDirectory) you specified for your Cloud Storage bucket. The training pipeline uses these artifacts to create a model resource, which sets you up for model deployment.

For this training pipeline, you must specify a baseOutputDirectory where Vertex AI searches for the model artifacts from the best trial.

Hyperparameter tuning jobs have additional settings to configure. Learn more about the settings for a HyperparameterTuningJob.
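As a sketch of how those settings fit together, the trainingTaskInputs for this pipeline pairs a study spec (metrics, parameters, trial counts) with a trial job spec containing the worker pools. The helper function, metric name, and hyperparameter below are illustrative placeholders, not a fixed API surface:

```python
def build_hparam_tuning_task_inputs(worker_pool_specs,
                                    metric_id="accuracy",
                                    max_trials=10,
                                    parallel_trials=2):
    """Return trainingTaskInputs for a hyperparameter tuning pipeline."""
    return {
        "studySpec": {
            # The metric your training code reports, and whether to
            # MAXIMIZE or MINIMIZE it.
            "metrics": [{"metricId": metric_id, "goal": "MAXIMIZE"}],
            "parameters": [
                {
                    "parameterId": "learning_rate",  # placeholder hyperparameter
                    "scaleType": "UNIT_LOG_SCALE",
                    "doubleValueSpec": {"minValue": 1e-4, "maxValue": 1e-1},
                }
            ],
        },
        "maxTrialCount": max_trials,
        "parallelTrialCount": parallel_trials,
        # Each trial runs a custom job with these worker pools.
        "trialJobSpec": {"workerPoolSpecs": worker_pool_specs},
    }

inputs = build_hparam_tuning_task_inputs(
    [{"machineSpec": {"machineType": "n1-standard-4"},
      "replicaCount": 1,
      "containerSpec": {"imageUri": "IMAGE_URI"}}])
print(inputs["maxTrialCount"])  # 10
```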
REST
Use the following code sample to create a training pipeline using the create method of the trainingPipeline resource.
Before using any of the request data, make the following replacements:
- LOCATION_ID: Your project's region.
- PROJECT_ID: Your project ID.
- TRAINING_PIPELINE_NAME: Required. A display name for the trainingPipeline.
- If your training application uses a Vertex AI dataset, specify the following:
- DATASET_ID: The ID of the dataset.
- ANNOTATIONS_FILTER: Filters the dataset by the annotations that you specify.
- ANNOTATION_SCHEMA_URI: Filters the dataset by the specified annotation schema URI.
- Use one of the following options to specify how data items are split into training, validation, and test sets.
- To split the dataset based on fractions defining the size of each set, specify the following:
- TRAINING_FRACTION: The fraction of the dataset to use to train your model.
- VALIDATION_FRACTION: The fraction of the dataset to use to validate your model.
- TEST_FRACTION: The fraction of the dataset to use to evaluate your model.
- To split the dataset based on filters, specify the following:
- TRAINING_FILTER: Filters the dataset to data items to use for training your model.
- VALIDATION_FILTER: Filters the dataset to data items to use for validating your model.
- TEST_FILTER: Filters the dataset to data items to use for evaluating your model.
- To use a predefined split, specify the following:
- PREDEFINED_SPLIT_KEY: The name of the column to use to split the dataset. Acceptable values in this column include `training`, `validation`, and `test`.
- To split the dataset based on the timestamp on the dataitems, specify the following:
- TIMESTAMP_TRAINING_FRACTION: The fraction of the dataset to use to train your model.
- TIMESTAMP_VALIDATION_FRACTION: The fraction of the dataset to use to validate your model.
- TIMESTAMP_TEST_FRACTION: The fraction of the dataset to use to evaluate your model.
- TIMESTAMP_SPLIT_KEY: The name of the timestamp column to use to split the dataset.
- OUTPUT_URI_PREFIX: The Cloud Storage location where Vertex AI exports your training dataset, after it has been split into training, validation, and test sets.
- Specify your hyperparameter tuning job:
- Specify your metrics:
- METRIC_ID: The name of this metric.
- METRIC_GOAL: The goal of this metric. Can be MAXIMIZE or MINIMIZE.
- Specify your hyperparameters:
- PARAMETER_ID: The name of this hyperparameter.
- PARAMETER_SCALE: (Optional.) How the parameter should be scaled. Leave unset for CATEGORICAL parameters. Can be UNIT_LINEAR_SCALE, UNIT_LOG_SCALE, UNIT_REVERSE_LOG_SCALE, or SCALE_TYPE_UNSPECIFIED.
- If this hyperparameter's type is DOUBLE, specify the minimum (DOUBLE_MIN_VALUE) and maximum (DOUBLE_MAX_VALUE) values for this hyperparameter.
- If this hyperparameter's type is INTEGER, specify the minimum (INTEGER_MIN_VALUE) and maximum (INTEGER_MAX_VALUE) values for this hyperparameter.
- If this hyperparameter's type is CATEGORICAL, specify the acceptable values (CATEGORICAL_VALUES) as an array of strings.
- If this hyperparameter's type is DISCRETE, specify the acceptable values (DISCRETE_VALUES) as an array of numbers.
- ALGORITHM: (Optional.) The search algorithm to use in this hyperparameter tuning job. Can be ALGORITHM_UNSPECIFIED, GRID_SEARCH, or RANDOM_SEARCH.
- MAX_TRIAL_COUNT: The maximum number of trials to run in this job.
- PARALLEL_TRIAL_COUNT: The maximum number of trials that can run in parallel.
- MAX_FAILED_TRIAL_COUNT: The number of trials that can fail before the hyperparameter tuning job fails.
- Define the trial custom training job:
- MACHINE_TYPE: The type of the machine. Refer to the available machine types for training.
- ACCELERATOR_TYPE: (Optional.) The type of accelerator to attach to each trial.
- ACCELERATOR_COUNT: (Optional.) The number of accelerators to attach to each trial.
- REPLICA_COUNT: The number of worker replicas to use for each trial.
- If your training application runs in a custom container, specify the following:
- CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry or Docker Hub that is to be run on each worker replica.
- CUSTOM_CONTAINER_COMMAND: (Optional.) The command to be invoked when the container is started. This command overrides the container's default entrypoint.
- CUSTOM_CONTAINER_ARGS: (Optional.) The arguments to be passed when starting the container.
- If your training application is a Python package that runs in a prebuilt container, specify the following:
- PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available prebuilt containers for training.
- PYTHON_PACKAGE_URIS: The Cloud Storage location of the Python package files which are the training program and its dependent packages. The maximum number of package URIs is 100.
- PYTHON_MODULE: The Python module name to run after installing the packages.
- PYTHON_PACKAGE_ARGS: (Optional.) Command-line arguments to be passed to the Python module.
- Learn about job scheduling options.
- TIMEOUT: (Optional.) The maximum running time for each trial.
- Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this hyperparameter tuning job.
- Specify the model to upload:
- MODEL_NAME: A display name for the model uploaded (created) by the TrainingPipeline.
- MODEL_DESCRIPTION: Optional. A description for the model.
- PREDICTION_IMAGE_URI: Required. Specify one of the two following options:
- The image URI of the prebuilt container to use for prediction, such as "tf2-cpu.2-1:latest".
- The image URI of your own custom container to use for prediction.
- modelToUpload.labels: Optional. Any set of key-value pairs to organize your models. For example:
- "env": "prod"
- "tier": "backend"
- Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this training pipeline.
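Assembled in code, the replacements above map onto a request body like the one shown in this section. The following is a minimal, hypothetical sketch that covers a single metric and a single DOUBLE parameter and hard-codes a few values (trial counts, machine type, scale type); optional fields and the dataset split options are omitted:

```python
def build_hpt_pipeline_body(display_name, metric_id, goal,
                            parameter_id, min_value, max_value,
                            executor_image_uri, package_uri, python_module,
                            model_name, prediction_image_uri):
    """Assemble a minimal trainingPipelines.create request body for a
    hyperparameter tuning task."""
    return {
        "displayName": display_name,
        "trainingTaskDefinition": (
            "gs://google-cloud-aiplatform/schema/trainingjob/definition/"
            "hyperparameter_tuning_task_1.0.0.yaml"
        ),
        "trainingTaskInputs": {
            "studySpec": {
                "metrics": [{"metricId": metric_id, "goal": goal}],
                "parameters": [{
                    "parameterId": parameter_id,
                    "scaleType": "UNIT_LINEAR_SCALE",
                    # Union field: set exactly one *ValueSpec per parameter.
                    "doubleValueSpec": {
                        "minValue": min_value,
                        "maxValue": max_value,
                    },
                }],
            },
            "maxTrialCount": 20,
            "parallelTrialCount": 1,
            "trialJobSpec": {
                "workerPoolSpecs": [{
                    "machineSpec": {"machineType": "n1-standard-4"},
                    "replicaCount": 1,
                    # Union field task: containerSpec or pythonPackageSpec.
                    "pythonPackageSpec": {
                        "executorImageUri": executor_image_uri,
                        "packageUris": [package_uri],
                        "pythonModule": python_module,
                    },
                }],
            },
        },
        "modelToUpload": {
            "displayName": model_name,
            "containerSpec": {"imageUri": prediction_image_uri},
        },
    }
```

Serializing the returned dict with json.dumps produces a request.json suitable for the curl and PowerShell commands below.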
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines
Request JSON body:
{
  "displayName": "TRAINING_PIPELINE_NAME",
  "inputDataConfig": {
    "datasetId": DATASET_ID,
    "annotationsFilter": ANNOTATIONS_FILTER,
    "annotationSchemaUri": ANNOTATION_SCHEMA_URI,
    // Union field split can be only one of the following:
    "fractionSplit": {
      "trainingFraction": TRAINING_FRACTION,
      "validationFraction": VALIDATION_FRACTION,
      "testFraction": TEST_FRACTION
    },
    "filterSplit": {
      "trainingFilter": TRAINING_FILTER,
      "validationFilter": VALIDATION_FILTER,
      "testFilter": TEST_FILTER
    },
    "predefinedSplit": {
      "key": PREDEFINED_SPLIT_KEY
    },
    "timestampSplit": {
      "trainingFraction": TIMESTAMP_TRAINING_FRACTION,
      "validationFraction": TIMESTAMP_VALIDATION_FRACTION,
      "testFraction": TIMESTAMP_TEST_FRACTION,
      "key": TIMESTAMP_SPLIT_KEY
    },
    // End of list of possible types for union field split.
    "gcsDestination": {
      "outputUriPrefix": OUTPUT_URI_PREFIX
    }
  },
  "trainingTaskDefinition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/hyperparameter_tuning_task_1.0.0.yaml",
  "trainingTaskInputs": {
    "studySpec": {
      "metrics": [
        {
          "metricId": METRIC_ID,
          "goal": METRIC_GOAL
        }
      ],
      "parameters": [
        {
          "parameterId": PARAMETER_ID,
          "scaleType": PARAMETER_SCALE,
          // Union field parameter_value_spec can be only one of the following:
          "doubleValueSpec": {
            "minValue": DOUBLE_MIN_VALUE,
            "maxValue": DOUBLE_MAX_VALUE
          },
          "integerValueSpec": {
            "minValue": INTEGER_MIN_VALUE,
            "maxValue": INTEGER_MAX_VALUE
          },
          "categoricalValueSpec": {
            "values": [ CATEGORICAL_VALUES ]
          },
          "discreteValueSpec": {
            "values": [ DISCRETE_VALUES ]
          }
          // End of list of possible types for union field parameter_value_spec.
        }
      ],
      "algorithm": ALGORITHM
    },
    "maxTrialCount": MAX_TRIAL_COUNT,
    "parallelTrialCount": PARALLEL_TRIAL_COUNT,
    "maxFailedTrialCount": MAX_FAILED_TRIAL_COUNT,
    "trialJobSpec": {
      "workerPoolSpecs": [
        {
          "machineSpec": {
            "machineType": MACHINE_TYPE,
            "acceleratorType": ACCELERATOR_TYPE,
            "acceleratorCount": ACCELERATOR_COUNT
          },
          "replicaCount": REPLICA_COUNT,
          // Union field task can be only one of the following:
          "containerSpec": {
            "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
            "command": [ CUSTOM_CONTAINER_COMMAND ],
            "args": [ CUSTOM_CONTAINER_ARGS ]
          },
          "pythonPackageSpec": {
            "executorImageUri": PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,
            "packageUris": [ PYTHON_PACKAGE_URIS ],
            "pythonModule": PYTHON_MODULE,
            "args": [ PYTHON_PACKAGE_ARGS ]
          }
          // End of list of possible types for union field task.
        }
      ],
      "scheduling": {
        "timeout": TIMEOUT
      }
    },
    "labels": {
      "LABEL_NAME_1": LABEL_VALUE_1,
      "LABEL_NAME_2": LABEL_VALUE_2
    }
  },
  "modelToUpload": {
    "displayName": "MODEL_NAME",
    "description": "MODEL_DESCRIPTION",
    "predictSchemata": {},
    "containerSpec": {
      "imageUri": "PREDICTION_IMAGE_URI"
    }
  },
  "labels": {
    "LABEL_NAME_1": LABEL_VALUE_1,
    "LABEL_NAME_2": LABEL_VALUE_2
  }
}
To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines"
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines" | Select-Object -Expand Content
The response contains information about specifications as well as the TRAININGPIPELINE_ID.
Response
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines/TRAININGPIPELINE_ID",
  "displayName": "TRAINING_PIPELINE_NAME",
  "inputDataConfig": {
    "datasetId": "1234567891011121314",
    "gcsDestination": {
      "outputUriPrefix": "gs://BUCKET_NAME/data/20200915191342"
    },
    "annotationSchemaUri": "gs://google-cloud-aiplatform/schema/dataset/annotation/image_classification_1.0.0.yaml"
  },
  "trainingTaskDefinition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml",
  "trainingTaskInputs": {
    "name": "projects/12345/locations/us-central1/hyperparameterTuningJobs/6789",
    "displayName": "myHyperparameterTuningJob",
    "studySpec": {
      "metrics": [
        {
          "metricId": "myMetric",
          "goal": "MINIMIZE"
        }
      ],
      "parameters": [
        {
          "parameterId": "myParameter1",
          "integerValueSpec": {
            "minValue": "1",
            "maxValue": "128"
          },
          "scaleType": "UNIT_LINEAR_SCALE"
        },
        {
          "parameterId": "myParameter2",
          "doubleValueSpec": {
            "minValue": 1e-07,
            "maxValue": 1
          },
          "scaleType": "UNIT_LINEAR_SCALE"
        }
      ],
      "algorithm": "RANDOM_SEARCH"
    },
    "maxTrialCount": 20,
    "parallelTrialCount": 1,
    "trialJobSpec": {
      "workerPoolSpecs": [
        {
          "machineSpec": {
            "machineType": "n1-standard-4"
          },
          "replicaCount": "1",
          "pythonPackageSpec": {
            "executorImageUri": "us-docker.pkg.dev/vertex-ai/training/training-tf-cpu.2-1:latest",
            "packageUris": [
              "gs://my-bucket/my-training-application/trainer.tar.bz2"
            ],
            "pythonModule": "my-trainer.trainer"
          }
        }
      ]
    }
  },
  "state": "PIPELINE_STATE_PENDING",
  "createTime": "2020-09-15T19:09:54.342080Z",
  "startTime": "2020-09-15T19:13:42.991045Z"
}
Monitor training
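The pipeline ID in the response's name field can be used with the trainingPipelines.get method to poll the pipeline state programmatically. A small sketch, assuming the create response text is available as a string:

```python
import json


def pipeline_id_from_response(response_text: str) -> str:
    """Extract the trailing pipeline ID from the create response's name field.

    The name has the form
    projects/PROJECT_ID/locations/LOCATION_ID/trainingPipelines/ID.
    """
    name = json.loads(response_text)["name"]
    return name.rsplit("/", 1)[-1]


# The returned ID plugs into a GET request such as:
# https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/
#   locations/LOCATION_ID/trainingPipelines/TRAININGPIPELINE_ID
# whose response includes the current "state" field.
```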
To view training logs, do the following:
In the Google Cloud console, in the Vertex AI section, go to the Training page.
Click the name of your job to go to the custom job page.
Click View logs.
You can also use an interactive shell to inspect your training containers while the training pipeline is running.
View your trained model
When the serverless training pipeline completes, you can find the trained model in the Google Cloud console, in the Vertex AI section, on the Models page.
What's next
- Learn how to pinpoint training performance bottlenecks to train models faster and cheaper using Cloud Profiler.
- Deploy your model to an endpoint.
- Create a hyperparameter tuning job.
- Learn how to schedule serverless training jobs based on resource availability.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.