Custom training beginner's guide

This beginner's guide is an introduction to custom training on Vertex AI. Custom training refers to training a model using an ML framework such as TensorFlow, PyTorch, or XGBoost.

Learning objectives

Vertex AI experience level: Beginner

Estimated reading time: 15 minutes

What you'll learn:

  • Benefits of using a managed service for custom training.
  • Best practices for packaging training code.
  • How to submit and monitor a training job.

Why use a managed training service?

Imagine you're working on a new ML problem. You open up a notebook, import your data, and run experimentation. In this scenario, you create a model with the ML framework of your choice, and execute notebook cells to run a training loop. When training completes, you evaluate the results of your model, make changes, and then re-run training. This workflow is useful for experimentation, but as you start to think about building production applications with ML, you might find that manually executing the cells of your notebook isn't the most convenient option.

For example, if your dataset and model are large you might want to try out distributed training. Additionally, in a production setting it's unlikely that you'll only need to train your model once. Over time, you'll retrain your model to make sure it stays fresh and keeps producing valuable results. When you want to automate experimentation at scale, or retrain models for a production application, using a managed ML training service will simplify your workflows.

This guide provides an introduction to training custom models on Vertex AI. Because the training service is fully managed, Vertex AI automatically provisions compute resources, performs the training task, and deletes the compute resources once the training job is finished. Note that there are additional customizations, features, and ways to interface with the service that are not covered here. This guide is intended to provide an overview. For more detail, refer to the Vertex AI Training documentation.

Overview of custom training

Training custom models on Vertex AI follows this standard workflow:

  1. Package up your training application code.

  2. Configure and submit a custom training job.

  3. Monitor the custom training job.

Packaging training application code

Running a custom training job on Vertex AI is done with containers. Containers are packages of your application code, in this case your training code, together with dependencies such as specific versions of libraries required to run your code. In addition to helping with dependency management, containers can run virtually anywhere, allowing for increased portability. Packaging your training code with its parameters and dependencies into a container to create a portable component is an important step when moving your ML applications from prototype to production.

Before you can launch a custom training job, you'll need to package up your training application. A training application in this case refers to a file, or multiple files, that perform tasks like loading data, preprocessing data, defining a model, and executing a training loop. The Vertex AI training service runs whatever code you provide, so it's entirely up to you what steps you include in your training application.

Vertex AI provides prebuilt containers for TensorFlow, PyTorch, XGBoost, and scikit-learn. These containers are updated regularly and include common libraries you might need in your training code. You can choose to run your training code with one of these containers, or create a custom container that has your training code and dependencies pre-installed.

There are three options for packaging your code on Vertex AI:

  1. Submit a single Python file.
  2. Create a Python source distribution.
  3. Use custom containers.

Python file

This option is suitable for quick experimentation. You can use this option if all of the code needed to execute your training application is in one Python file and one of the prebuilt Vertex AI training containers has all of the libraries needed to run your application. For an example of packaging your training application as a single Python file, see the notebook tutorial Custom training and batch inference.
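
For illustration, a minimal single-file application might look like the following sketch. It assumes the scikit-learn prebuilt container and the AIP_MODEL_DIR environment variable, which Vertex AI sets to the Cloud Storage location where it expects model artifacts; the dataset and model choices here are placeholders.

    # trainer.py -- minimal sketch of a single-file training application.
    # Assumes the scikit-learn prebuilt container; dataset and model are placeholders.
    import os

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    # Load data and train. A real job would typically read from Cloud Storage or BigQuery.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    # Vertex AI sets AIP_MODEL_DIR to a gs:// URI; the /gcs/ Cloud Storage FUSE
    # mount available in custom training lets us write to it like a local path.
    model_dir = os.environ.get("AIP_MODEL_DIR", "model").replace("gs://", "/gcs/")
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))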

Python source distribution

You can create a Python source distribution that contains your training application. You'll store your source distribution with the training code and dependencies in a Cloud Storage bucket. For an example of packaging your training application as a Python source distribution, see the notebook tutorial Training, tuning and deploying a PyTorch classification model.
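
The setup.py for this option can stay small. The following sketch assumes the recommended directory structure shown later in this guide; the package name, version, and dependencies are placeholders.

    # setup.py -- sketch for packaging the trainer/ directory as a source distribution.
    # Name, version, and dependencies are placeholders.
    from setuptools import find_packages, setup

    setup(
        name="trainer",
        version="0.1.0",
        packages=find_packages(),  # finds trainer/ if it contains an __init__.py
        install_requires=[
            # List only dependencies not already in the prebuilt container.
        ],
    )

Running python setup.py sdist then produces a .tar.gz archive under dist/ that you can copy to your Cloud Storage bucket.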

Custom container

This option is useful when you want more control over your application, or maybe you want to run code not written in Python. In this case you'll need to write a Dockerfile, build your custom image, and push it to Artifact Registry. For an example of containerizing your training application, see the notebook tutorial Profile model training performance using Profiler.

Recommended training application structure

If you choose to package up your code as a Python source distribution or as a custom container, it's recommended that you structure your application as follows:

training-application-dir/
    setup.py
    Dockerfile
    trainer/
        task.py
        model.py
        utils.py

Create a directory to store all of your training application code, in this case, training-application-dir. This directory will contain a setup.py file if you're using a Python source distribution, or a Dockerfile if you're using a custom container.

In both scenarios, this high-level directory will also contain a trainer subdirectory that contains all the code to execute training. Within trainer, task.py is the main entrypoint to your application. This file executes model training. You can choose to put all of your code in this file, but for production applications you're likely to have additional files, for example model.py, data.py, and utils.py, to name a few.
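
As an illustration, task.py often does little more than parse hyperparameters from the command line and delegate to the other modules. In the sketch below, trainer.model and its train() function are hypothetical placeholders.

    # trainer/task.py -- sketch of a typical entrypoint.
    # trainer.model and its train() function are hypothetical placeholders.
    import argparse

    from trainer import model

    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--epochs", type=int, default=10)
        parser.add_argument("--learning-rate", type=float, default=0.001)
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        # Delegate the actual training loop to model.py.
        model.train(epochs=args.epochs, learning_rate=args.learning_rate)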

Running custom training

When you run a training job, Vertex AI automatically provisions compute resources, executes your training application code, and deletes the compute resources once the training job is finished.

As you build out more complicated workflows, it's likely that you'll use the Vertex AI SDK for Python to configure, submit, and monitor your training jobs. However, the first time you run a custom training job it can be easier to use the Google Cloud console.
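
For comparison, submitting a job for a Python source distribution with the SDK might look like the following sketch, where the project, bucket, package path, and container image are placeholders to replace with your own values; the console walkthrough follows.

    # Sketch: submitting a custom training job with the Vertex AI SDK for Python.
    # All names, URIs, and paths below are placeholders.
    from google.cloud import aiplatform

    aiplatform.init(
        project="my-project",
        location="us-central1",
        staging_bucket="gs://my-bucket",
    )

    job = aiplatform.CustomPythonPackageTrainingJob(
        display_name="my-training-job",
        python_package_gcs_uri="gs://my-bucket/trainer-0.1.0.tar.gz",
        python_module_name="trainer.task",
        container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    )

    # run() provisions the hardware, executes trainer.task, and streams logs
    # until the job finishes.
    job.run(
        args=["--epochs", "20"],
        replica_count=1,
        machine_type="n1-standard-8",
    )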

  1. In the Google Cloud console, go to the Training page.

  2. Click Train new model.

  3. Under Training method, select Custom training (advanced).

  4. Under the Training container section, select either a prebuilt or a custom container, depending on how you packaged your application.

  5. Under Compute and pricing, specify the hardware for the training job. For single-node training, you only need to configure Worker Pool 0. If you're interested in running distributed training, you'll need to understand the other worker pools; a programmatic sketch follows these steps, and you can learn more in Distributed training.

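To make the worker pools concrete, here's a hedged sketch of a multi-node configuration expressed programmatically with aiplatform.CustomJob; the machine shapes, replica counts, and image URI are placeholders.

    # Sketch: a multi-node training job. Worker pool 0 is the chief; values
    # are placeholders. Assumes aiplatform.init(...) as shown earlier.
    from google.cloud import aiplatform

    worker_pool_specs = [
        {
            # Worker pool 0: the chief (always a single replica).
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 1,
            "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest"},
        },
        {
            # Worker pool 1: additional workers for distributed training.
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 2,
            "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest"},
        },
    ]

    job = aiplatform.CustomJob(
        display_name="my-distributed-job",
        worker_pool_specs=worker_pool_specs,
        staging_bucket="gs://my-bucket",
    )
    job.run()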

Configuring the inference container is optional. If you only want to train a model on Vertex AI and access the resulting saved model artifacts, you can skip this step. If you want to host and deploy the resulting model on the Vertex AI managed inference service, you'll need to configure an inference container. To learn more, see Get inferences from a custom trained model.
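
If you do configure one, the SDK accepts the serving image alongside the training configuration. In this sketch, which assumes aiplatform.init as shown earlier, the container URIs are placeholders, and run() returns a Model you can deploy.

    # Sketch: pairing training with a serving container so that run() returns
    # a deployable Model. URIs are placeholders.
    from google.cloud import aiplatform

    job = aiplatform.CustomTrainingJob(
        display_name="train-and-register",
        script_path="trainer.py",
        container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",
        model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
    )

    model = job.run(model_display_name="my-model", replica_count=1)
    # The returned Model can then be deployed to a Vertex AI endpoint.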

Monitoring training jobs

You can monitor your training job in the Google Cloud console. You'll see a list of all the jobs that have run. You can click a particular job and examine the logs if something goes wrong.
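
The SDK offers the same visibility. For example, a short sketch that lists your custom jobs and their states, again assuming aiplatform.init as shown earlier:

    # Sketch: listing custom jobs and their states with the SDK.
    from google.cloud import aiplatform

    for job in aiplatform.CustomJob.list():
        print(job.display_name, job.state)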

training dashboard

Notebooks

To see examples of how to enable Profiler for a custom training job, see the notebook tutorial Profile model training performance using Profiler, referenced earlier in this guide, which you can run in the environment of your choice.
