Train a model with GPUs on GKE Standard mode
This quickstart tutorial shows you how to deploy a training model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This tutorial uses a TensorFlow model and GKE Standard clusters. You can also run these workloads on Autopilot clusters with fewer setup steps. For instructions, see Train a model with GPUs on GKE Autopilot mode.
This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.
Verify that billing is enabled for your Google Cloud project.
Enable the Kubernetes Engine and Cloud Storage APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Required roles
To get the permissions that you need to train a model on GPUs, ask your administrator to grant you the following IAM roles on your project:
- Manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
- Manage Cloud Storage buckets: Storage Admin (roles/storage.admin)
- Grant IAM roles on the project: Project IAM Admin (roles/resourcemanager.projectIamAdmin)
- Create and grant roles on IAM service accounts: Service Account Admin (roles/iam.serviceAccountAdmin)
For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
Clone the sample repository
In Cloud Shell, run the following command:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu

Create a Standard mode cluster and a GPU node pool
Use Cloud Shell to do the following:
Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:
gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.svc.id.goog

Replace PROJECT_ID with your Google Cloud project ID.

Cluster creation might take several minutes.
Create a GPU node pool:
gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 \
    --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster
Create a Cloud Storage bucket
In the Google Cloud console, go to the Create a bucket page:
In the Name your bucket field, enter the following name:

PROJECT_ID-gke-gpu-bucket

Click Continue.

For Location type, select Region.

In the Region list, select us-central1 (Iowa), and then click Continue.

In the Choose a storage class for your data section, click Continue.

In the Choose how to control access to objects section, for Access control, select Uniform.

Click Create.

In the Public access will be prevented dialog, ensure that the Enforce public access prevention on this bucket checkbox is selected, and click Confirm.
Configure your cluster to access the bucket using Workload Identity Federation for GKE
To let your cluster access the Cloud Storage bucket, you do the following:
- Create a Google Cloud service account.
- Create a Kubernetes ServiceAccount in your cluster.
- Bind the Kubernetes ServiceAccount to the Google Cloud service account.
Create a Google Cloud service account
In the Google Cloud console, go to the Create service account page:
In the Service account ID field, enter gke-ai-sa.

Click Create and continue.

In the Role list, select the Cloud Storage > Storage Insights Collector Service role.

Click Add another role.

In the Select a role list, select the Cloud Storage > Storage Object Admin role.

Click Continue, and then click Done.
Create a Kubernetes ServiceAccount in your cluster
In Cloud Shell, do the following:
Create a Kubernetes namespace:
kubectl create namespace gke-ai-namespace

Create a Kubernetes ServiceAccount in the namespace:

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
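The two imperative kubectl commands above can also be expressed declaratively. As an optional illustration only (the tutorial itself uses the commands), an equivalent manifest might look like this:

```yaml
# Declarative equivalent of the namespace and ServiceAccount created above.
apiVersion: v1
kind: Namespace
metadata:
  name: gke-ai-namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-k8s-sa
  namespace: gke-ai-namespace
```

Applying this file with kubectl apply -f creates the same two resources.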
Bind the Kubernetes ServiceAccount to the Google Cloud service account
In Cloud Shell, run the following commands:
Add an IAM binding to the Google Cloud service account:
gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.

Annotate the Kubernetes ServiceAccount:

kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
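The --member string used in the binding follows a fixed pattern: the project's workload identity pool, followed by the namespace and ServiceAccount name in square brackets. A minimal sketch of how the pieces compose (the project ID here is a hypothetical placeholder):

```shell
# Compose the Workload Identity member string from its parts.
PROJECT_ID=my-project            # hypothetical; substitute your own project ID
NAMESPACE=gke-ai-namespace
KSA_NAME=gpu-k8s-sa
MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
echo "${MEMBER}"
```

Keeping the pieces in variables like this makes it harder to mistype the bracketed namespace/ServiceAccount pair, which is a common cause of Workload Identity authentication failures.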
Verify that Pods can access the Cloud Storage bucket
In Cloud Shell, create the following environment variables:
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Replace PROJECT_ID with your Google Cloud project ID.

Create a Pod that has a TensorFlow container:
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Create a sample file in the bucket:
touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for your Pod to become ready:

kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n gke-ai-namespace --timeout=180s

When the Pod is ready, the output is the following:

pod/test-tensorflow-pod condition met

Open a shell in the TensorFlow container:
kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:

ls /data

The output shows the sample file.
Check the logs to identify the GPU attached to the Pod:
python3 -m pip install 'tensorflow[and-cuda]'
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU attached to the Pod, similar to the following:

...
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Exit the container:

exit

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace
Train and predict using the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.
Copy the example data to the Cloud Storage bucket:
gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the training Job:
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

The output is similar to the following:

job.batch/mnist-training-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

The output shows the following events occur:
- Install required Python packages
- Download the MNIST dataset
- Train the model using a GPU
- Save the model
- Evaluate the model
...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved

Delete the training workload:

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
Deploy an inference workload
In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.
Copy the images for prediction to the bucket:
gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"

Deploy the inference workload:
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:

kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

The output is the prediction for each image and the model's confidence in the prediction, similar to the following:

Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
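Because every prediction line has the same fixed sentence structure, the digit and confidence are easy to pull out with standard tools if you want to post-process the logs. A small sketch over a saved excerpt of the log (the file path and excerpt are hypothetical):

```shell
# Extract the predicted digit (7th field) and confidence (10th field)
# from prediction log lines of the form shown above.
cat > /tmp/predictions.log <<'EOF'
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
EOF
awk '/percent confidence/ {print $7, $10}' /tmp/predictions.log
```

This prints one "digit confidence" pair per prediction line, which you could redirect to a CSV or feed into further analysis.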
Clean up
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:
- Keep the GKE cluster: Delete the Kubernetes resources inthe cluster and the Google Cloud resources
- Keep the Google Cloud project: Delete the GKE clusterand the Google Cloud resources
- Delete the project
Delete the Kubernetes resources in the cluster and the Google Cloud resources
Delete the Kubernetes namespace and the workloads that you deployed:
kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace

Delete the Cloud Storage bucket:

Go to the Buckets page:

Select the checkbox for PROJECT_ID-gke-gpu-bucket.

Click Delete.

To confirm deletion, type DELETE and click Delete.

Delete the Google Cloud service account:

Go to the Service accounts page:

Select your project.

Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

Click Delete.

To confirm deletion, click Delete.
Delete the GKE cluster and the Google Cloud resources
Delete the GKE cluster:
Go to the Clusters page:

Select the checkbox for gke-gpu-cluster.

Click Delete.

To confirm deletion, type gke-gpu-cluster and click Delete.

Delete the Cloud Storage bucket:

Go to the Buckets page:

Select the checkbox for PROJECT_ID-gke-gpu-bucket.

Click Delete.

To confirm deletion, type DELETE and click Delete.

Delete the Google Cloud service account:

Go to the Service accounts page:

Select your project.

Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

Click Delete.

To confirm deletion, click Delete.
Delete the project
What's next
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-18 UTC.