Train a model with GPUs on GKE Standard mode

This quickstart tutorial shows you how to deploy a training model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This tutorial uses a TensorFlow model and GKE Standard clusters. You can also run these workloads on Autopilot clusters with fewer setup steps. For instructions, see Train a model with GPUs on GKE Autopilot mode.

This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

  4. Verify that billing is enabled for your Google Cloud project.

  5. Enable the Kubernetes Engine and Cloud Storage APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  6. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Required roles

To get the permissions that you need to train a model on GPUs, ask your administrator to grant you the required IAM roles on your project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Clone the sample repository

In Cloud Shell, run the following command:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a Standard mode cluster and a GPU node pool

Use Cloud Shell to do the following:

  1. Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:

    gcloud container clusters create gke-gpu-cluster \
        --addons GcsFuseCsiDriver \
        --location=us-central1 \
        --num-nodes=1 \
        --workload-pool=PROJECT_ID.svc.id.goog

    Replace PROJECT_ID with your Google Cloud project ID.

    Cluster creation might take several minutes.

  2. Create a GPU node pool:

    gcloud container node-pools create gke-gpu-pool-1 \
        --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
        --machine-type=n1-standard-16 \
        --num-nodes=1 \
        --location=us-central1 \
        --cluster=gke-gpu-cluster

Create a Cloud Storage bucket

  1. In the Google Cloud console, go to the Create a bucket page:

    Go to Create a bucket

  2. In the Name your bucket field, enter the following name:

    PROJECT_ID-gke-gpu-bucket
  3. Click Continue.

  4. For Location type, select Region.

  5. In the Region list, select us-central1 (Iowa) and click Continue.

  6. In the Choose a storage class for your data section, click Continue.

  7. In the Choose how to control access to objects section, for Access control, select Uniform.

  8. Click Create.

  9. In the Public access will be prevented dialog, ensure that the Enforce public access prevention on this bucket checkbox is selected, and click Confirm.

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, do the following:

  1. Create a Google Cloud service account.
  2. Create a Kubernetes ServiceAccount in your cluster.
  3. Bind the Kubernetes ServiceAccount to the Google Cloud service account.

Create a Google Cloud service account

  1. In the Google Cloud console, go to the Create service account page:

    Go to Create service account

  2. In the Service account ID field, enter gke-ai-sa.

  3. Click Create and continue.

  4. In the Role list, select the Cloud Storage > Storage Insights Collector Service role.

  5. Click Add another role.

  6. In the Select a role list, select the Cloud Storage > Storage Object Admin role.

  7. Click Continue, and then click Done.

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

  1. Create a Kubernetes namespace:

    kubectl create namespace gke-ai-namespace
  2. Create a Kubernetes ServiceAccount in the namespace:

    kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace

Bind the Kubernetes ServiceAccount to the Google Cloud service account

In Cloud Shell, run the following commands:

  1. Add an IAM binding to the Google Cloud service account:

    gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

    The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.
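    The member identity has a fixed shape: the literal prefix serviceAccount:, the cluster's workload pool, and the namespace and ServiceAccount name in brackets. As a purely illustrative sketch (this helper function is made up and is not part of the tutorial), the string can be assembled like this:

    ```python
    # Illustrative only: compose the Workload Identity member string that the
    # --member flag of add-iam-policy-binding expects.
    def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
        # Pattern: serviceAccount:WORKLOAD_POOL[NAMESPACE/KSA_NAME]
        return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"

    member = workload_identity_member("my-project", "gke-ai-namespace", "gpu-k8s-sa")
    print(member)
    # serviceAccount:my-project.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]
    ```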

  2. Annotate the Kubernetes ServiceAccount:

    kubectl annotate serviceaccount gpu-k8s-sa \
        --namespace gke-ai-namespace \
        iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com

Verify that Pods can access the Cloud Storage bucket

  1. In Cloud Shell, create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

    Replace PROJECT_ID with your Google Cloud project ID.

  2. Create a Pod that has a TensorFlow container:

    envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.
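    If envsubst is unfamiliar, its effect can be approximated in a few lines of Python with string.Template, which uses the same $NAME placeholder syntax as the manifests in this tutorial. This is only a sketch of the substitution step, not something you need to run:

    ```python
    # Sketch of what envsubst does: replace $VAR placeholders in a manifest
    # with values taken from the environment.
    import os
    from string import Template

    os.environ["K8S_SA_NAME"] = "gpu-k8s-sa"
    os.environ["BUCKET_NAME"] = "my-project-gke-gpu-bucket"

    # A tiny stand-in for the real manifest file.
    manifest = "serviceAccountName: $K8S_SA_NAME\nbucketName: $BUCKET_NAME\n"
    rendered = Template(manifest).substitute(os.environ)
    print(rendered)
    ```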

  3. Create a sample file in the bucket:

    touch sample-file
    gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket
  4. Wait for your Pod to become ready:

    kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n gke-ai-namespace --timeout=180s

    When the Pod is ready, the output is the following:

    pod/test-tensorflow-pod condition met
  5. Open a shell in the TensorFlow container:

    kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash
  6. Try to read the sample file that you created:

    ls /data

    The output shows the sample file.

  7. Verify that the Pod can access the GPU:

    python3 -m pip install 'tensorflow[and-cuda]'
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

    The output shows the GPU attached to the Pod, similar to the following:

    ...PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
  8. Exit the container:

    exit
  9. Delete the sample Pod:

    kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
        --namespace=gke-ai-namespace
Success: At this point, your cluster runs a GPU node pool and can communicate with the Cloud Storage bucket using the Kubernetes ServiceAccount.

Train and predict using the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.

  1. Copy the example data to the Cloud Storage bucket:

    gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive
  2. Create the following environment variables:

    export K8S_SA_NAME=gpu-k8s-sa
    export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
  3. Review the training Job:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-training-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  4. Deploy the training Job:

    envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  5. Wait until the Job has the Completed status:

    kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

    The output is similar to the following:

    job.batch/mnist-training-job condition met
  6. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

    The output shows that the following events occur:

    • Install required Python packages
    • Download the MNIST dataset
    • Train the model using a GPU
    • Save the model
    • Evaluate the model
    ...
    Epoch 12/12
    927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
    Learning rate for epoch 12 is 9.999999747378752e-06
    938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
    157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
    Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
    Training finished. Model saved
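    A side note on the log above: the learning rate prints as 9.999999747378752e-06 rather than exactly 1e-05 because the value is stored as a 32-bit float, and the nearest float32 to 1e-5, printed at full precision, is exactly that number. A small Python sketch (illustrative only) reproduces the printed value:

    ```python
    # Round-trip a Python float through a 4-byte IEEE 754 float32 to see
    # the value that actually gets stored and logged.
    import struct

    def as_float32(x: float) -> float:
        # pack rounds to the nearest float32; unpack widens it back to float64.
        return struct.unpack("f", struct.pack("f", x))[0]

    print(repr(as_float32(1e-5)))  # 9.999999747378752e-06
    ```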
  7. Delete the training workload:

    kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml

Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

  1. Copy the images for prediction to the bucket:

    gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive
  2. Review the inference workload:

    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-batch-prediction-job
    spec:
      template:
        metadata:
          name: mnist
          annotations:
            gke-gcsfuse/volumes: "true"
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            command: ["/bin/bash", "-c", "--"]
            args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 1
                memory: 3Gi
            volumeMounts:
            - name: gcs-fuse-csi-vol
              mountPath: /data
              readOnly: false
          serviceAccountName: $K8S_SA_NAME
          volumes:
          - name: gcs-fuse-csi-vol
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
                mountOptions: "implicit-dirs"
          restartPolicy: "Never"
  3. Deploy the inference workload:

    envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -

    This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

  4. Wait until the Job has the Completed status:

    kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

    The output is similar to the following:

    job.batch/mnist-batch-prediction-job condition met
  5. Check the logs from the TensorFlow container:

    kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

    The output is the prediction for each image and the model's confidence in the prediction, similar to the following:

    Found 10 files belonging to 1 classes.
    1/1 [==============================] - 2s 2s/step
    The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
    The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
    The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
    The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
    The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
    The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
    The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
    The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
    The image /data/mnist_predict/8.png is the number 8 with a 99.65 percent confidence.
    The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
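    The per-image percent confidence in this output is, in the usual classification setup, the largest softmax probability across the ten digit classes. The following Python sketch illustrates the idea with made-up logits; it is not the tutorial's model code:

    ```python
    # Illustrative only: turn a vector of class logits into a predicted digit
    # and a confidence, the way a classifier's softmax output is usually read.
    import math

    def softmax_confidence(logits):
        # Subtract the max for numerical stability before exponentiating.
        exps = [math.exp(v - max(logits)) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    # Made-up logits where class 1 dominates.
    digit, conf = softmax_confidence([0.1, 9.0, 0.3, 0.2, 0.1, 0.0, 0.2, 0.1, 0.4, 0.3])
    print(f"the number {digit} with a {conf * 100:.2f} percent confidence")
    ```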
Success: You've successfully trained a model and used it to evaluate new data.

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Delete the Kubernetes resources in the cluster and the Google Cloud resources

  1. Delete the Kubernetes namespace and the workloads that you deployed:

    kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
    kubectl delete namespace gke-ai-namespace
  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

  1. Delete the GKE cluster:

    1. Go to the Clusters page:

      Go to Clusters

    2. Select the checkbox for gke-gpu-cluster.

    3. Click Delete.

    4. To confirm deletion, type gke-gpu-cluster and click Delete.

  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.

Delete the project

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.


Last updated 2026-02-18 UTC.