Train a model with GPUs on GKE Autopilot mode

This quickstart shows you how to run a model training workload with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This document is intended for GKE administrators who have existing Autopilot mode clusters and want to run GPU workloads for the first time.

You can also run these workloads on Standard clusters if you create separate GPU node pools in your clusters. For instructions, see Train a model with GPUs on GKE Standard mode.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the GKE and Cloud Storage APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  5. Install the Google Cloud CLI.

    Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.
  6. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  7. To initialize the gcloud CLI, run the following command:

    gcloud init
  8. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Clone the sample repository

In Cloud Shell, run the following command:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
    cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a cluster

  1. In the Google Cloud console, go to the Create an Autopilot cluster page:

    Go to Create an Autopilot cluster

  2. In the Name field, enter gke-gpu-cluster.

  3. In the Region list, select us-central1.

  4. Click Create.

Create a Cloud Storage bucket

  1. In the Google Cloud console, go to the Create a bucket page:

    Go to Create a bucket

  2. In the Get started section, enter a globally unique name for your bucket:

    PROJECT_ID-gke-gpu-bucket

    Replace PROJECT_ID with your Google Cloud project ID.

  3. Click Continue.

  4. For Location type, select Region.

  5. In the Region list, select us-central1 (Iowa) and click Continue.

  6. In the Choose a storage class for your data section, click Continue.

  7. In the Choose how to control access to objects section, for Access control, select Uniform.

  8. Click Create.

  9. In the Public access will be prevented dialog, ensure that the Enforce public access prevention on this bucket checkbox is selected, and click Confirm.

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, you do the following:

  1. Create a Kubernetes ServiceAccount in your cluster.
  2. Create an IAM allow policy that lets the ServiceAccount access the bucket.

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

  1. Connect to your cluster:

    gcloud container clusters get-credentials gke-gpu-cluster \
        --location=us-central1
  2. Create a Kubernetes namespace:

    kubectl create namespace gke-gpu-namespace
  3. Create a Kubernetes ServiceAccount in the namespace:

    kubectl create serviceaccount gpu-k8s-sa --namespace=gke-gpu-namespace
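If you prefer a declarative setup, the two kubectl create commands above correspond to a manifest like the following sketch, which you could save to a file and apply with kubectl apply -f:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gke-gpu-namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-k8s-sa
  namespace: gke-gpu-namespace
```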

Create an IAM allow policy on the bucket

Grant the Storage Object Admin (roles/storage.objectAdmin) role on the bucket to the Kubernetes ServiceAccount:

gcloud storage buckets add-iam-policy-binding gs://PROJECT_ID-gke-gpu-bucket \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/gke-gpu-namespace/sa/gpu-k8s-sa \
    --role=roles/storage.objectAdmin \
    --condition=None

Replace PROJECT_ID with your Google Cloud project ID and PROJECT_NUMBER with your Google Cloud project number.
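The --member value follows the Workload Identity Federation for GKE principal format. As a local sketch with hypothetical project values, you can assemble the identifier in the shell and inspect it before running the binding command:

```shell
# Hypothetical example values; substitute your own project ID and number.
PROJECT_ID="my-project"
PROJECT_NUMBER="123456789012"
NAMESPACE="gke-gpu-namespace"
KSA="gpu-k8s-sa"

# Principal format: the project's workload identity pool, scoped to a
# Kubernetes ServiceAccount subject (ns/<namespace>/sa/<name>).
MEMBER="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA}"
echo "${MEMBER}"
```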

Verify that Pods can access the Cloud Storage bucket

  1. In Cloud Shell, create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

    Replace PROJECT_ID with your Google Cloud project ID.

  2. Create a Pod that has a TensorFlow container:

    envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-gpu-namespace apply -f -

    This command inserts the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  3. Create a sample file in the bucket:

    touch sample-file
    gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket
  4. Wait for your Pod to become ready:

    kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n gke-gpu-namespace --timeout=180s

    When the Pod is ready, the output is the following:

    pod/test-tensorflow-pod condition met

    If the command times out, GKE might still be creating new nodes to run the Pods. Run the command again and wait for the Pod to become ready.

  5. Open a shell in the TensorFlow container:

    kubectl -n gke-gpu-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash
  6. Try to read the sample file that you created:

    ls /data

    The output shows the sample file.

  7. Check that a GPU is attached to the Pod:

    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

    The output shows the GPU attached to the Pod, similar to the following:

    ...PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
  8. Exit the container:

    exit
  9. Delete the sample Pod:

    kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
        --namespace=gke-gpu-namespace
Success: At this point, your cluster runs a GPU node pool and can communicate with the Cloud Storage bucket using the Kubernetes ServiceAccount.
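A note on the envsubst steps in this guide: the substitution is ordinary environment-variable expansion. As a local sketch (with a hypothetical bucket name), you can preview the values that get injected by expanding a small template with an unquoted heredoc, without touching the repository manifests:

```shell
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=my-project-gke-gpu-bucket  # hypothetical; use your bucket name

# An unquoted heredoc expands $VARS the same way envsubst fills in the
# $K8S_SA_NAME and $BUCKET_NAME references in the sample manifests.
RENDERED=$(cat <<EOF
serviceAccountName: $K8S_SA_NAME
bucketName: $BUCKET_NAME
EOF
)
echo "$RENDERED"
```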

Train and predict using the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.

  1. Copy the example data to the Cloud Storage bucket:

    gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive
  2. Create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
  3. Review the training Job:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-training-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  4. Deploy the training Job:

    envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-gpu-namespace apply -f -

    This command inserts the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  5. Wait until the Job has the Completed status:

    kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

    When the Job completes, the output is similar to the following:

    job.batch/mnist-training-job condition met

    If the command times out, GKE might still be creating new nodes to run the Pods. Run the command again and wait for the Job to complete.

  6. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-gpu-namespace

    The output shows that the following events occur:

    • Install required Python packages
    • Download the MNIST dataset
    • Train the model using a GPU
    • Save the model
    • Evaluate the model
    ...
    Epoch 12/12
    927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
    Learning rate for epoch 12 is 9.999999747378752e-06
    938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
    157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
    Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
    Training finished. Model saved
  7. Delete the training workload:

    kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
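The kubectl wait step can time out while GKE provisions GPU nodes, and the guide tells you to re-run it by hand. A small retry wrapper (a sketch; not part of the sample repository) automates that by re-running any command until it succeeds or a retry budget is exhausted:

```shell
# Sketch: retry a command up to N times with a fixed delay between attempts.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0          # stop as soon as the command succeeds
    sleep "$delay"
  done
  return 1                    # all attempts failed
}

# Placeholder demonstration; in this guide you would wrap the wait command:
#   retry 5 30 kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-training-job --timeout=180s
retry 3 1 true && echo "command succeeded"
```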

Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

  1. Copy the images for prediction to the bucket:

    gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive
  2. Review the inference workload:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-batch-prediction-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  3. Deploy the inference workload:

    envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-gpu-namespace apply -f -

    This command inserts the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  4. Wait until the Job has the Completed status:

    kubectl wait -n gke-gpu-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

    The output is similar to the following:

    job.batch/mnist-batch-prediction-job condition met
  5. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-gpu-namespace

    The output is the prediction for each image and the model's confidence in the prediction, similar to the following:

    Found 10 files belonging to 1 classes.
    1/1 [==============================] - 2s 2s/step
    The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
    The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
    The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
    The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
    The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
    The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
    The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
    The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
    The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
    The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
Success: You've successfully trained a model and used it to evaluate new data.
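If you want to spot the least-confident predictions in the Job's log output, a small text-processing sketch can sort the prediction lines by their reported percentage. The log lines below are hypothetical stand-ins for the output shown above:

```shell
# Sketch: pull "<confidence> <image>" pairs out of prediction log lines and
# sort ascending, so the least-confident predictions appear first.
predictions='The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.'

# Each matching line ends "... with a <pct> percent confidence.", so the
# percentage is the third-to-last field and the image path is field 3.
echo "$predictions" \
  | awk '/percent confidence/ { print $(NF-2), $3 }' \
  | sort -n
```

In practice you would pipe the kubectl logs output shown above into the same awk filter instead of the $predictions variable.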

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Delete the Kubernetes resources in the cluster and the Google Cloud resources

  1. Delete the Kubernetes namespace and the workloads that you deployed:

    kubectl -n gke-gpu-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
    kubectl delete namespace gke-gpu-namespace
  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

  1. Delete the GKE cluster:

    1. Go to the Clusters page:

      Go to Clusters

    2. Select the checkbox for gke-gpu-cluster.

    3. Click Delete.

    4. To confirm deletion, type gke-gpu-cluster and click Delete.

  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-gpu-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.

Delete the project

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.