Fine-tune Gemma open models using multiple GPUs on GKE
This tutorial shows you how to fine-tune Gemma, a family of open large language models (LLMs), using graphical processing units (GPUs) on Google Kubernetes Engine (GKE) with the Transformers library from Hugging Face. Fine-tuning is a supervised learning process that improves a pre-trained model's ability to perform specific tasks by updating its parameters with a new dataset. In this tutorial, you download the 2B-parameter pretrained Gemma model from Hugging Face and fine-tune it on a GKE Autopilot or Standard cluster.
This guide is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when fine-tuning an LLM.
Try our Vertex AI solution if you need a unified managed AI platform to rapidly build and serve ML models cost-effectively.
Background
By serving Gemma using GPUs on GKE with the Transformers library, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and higher availability. This section describes the key technologies used in this guide.
Gemma
Gemma is a set of openly available, lightweight generative artificial intelligence (AI) models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services.
In this guide, we introduce Gemma for text generation. You can also tune these models to specialize in performing specific tasks.
The dataset you use in this document is b-mc2/sql-create-context.
To learn more, see the Gemma documentation.
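To give a sense of the training data, the following Python sketch loads the b-mc2/sql-create-context dataset with the Hugging Face datasets library and prints one example. The field names (question, context, answer) are an assumption based on the public dataset card, not something this guide defines.

# Sketch: inspect the fine-tuning dataset used in this tutorial.
# Assumes the Hugging Face "datasets" library is installed (pip install datasets)
# and that the dataset exposes "question", "context", and "answer" fields.
from datasets import load_dataset

dataset = load_dataset("b-mc2/sql-create-context", split="train")
example = dataset[0]
print(example["question"])  # natural-language question
print(example["context"])   # CREATE TABLE statement the query runs against
print(example["answer"])    # target SQL query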
GPUs
GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.
Before you use GPUs in GKE, consider completing the following learning path:
- Learn about current GPU version availability
- Learn about GPUs in GKE
Hugging Face Transformers
With the Transformers library from Hugging Face, you can access cutting-edge pretrained models. The Transformers library lets you reduce the time, resources, and computational costs associated with training a complete model from scratch.
In this tutorial, you use the Hugging Face APIs and tools to download and fine-tune these pretrained models.
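To illustrate what the fine-tuning container does conceptually, the following is a minimal Python sketch that loads Gemma 2B with Transformers and attaches a LoRA adapter with the PEFT library, using the same r and alpha values as the fine-tuning Job later in this tutorial. It is a simplified sketch, not the exact training script from the sample repository, and it assumes you have accepted the Gemma license and are authenticated with Hugging Face.

# Minimal sketch: load Gemma 2B and attach a LoRA adapter with PEFT.
# This is not the exact script from the sample repository; it only
# illustrates the approach. Requires transformers, peft, and torch,
# plus access to the gated google/gemma-2b model on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA freezes the base weights and trains only small low-rank matrices,
# which keeps GPU memory usage low during fine-tuning.
lora_config = LoraConfig(
    r=8,            # matches LORA_R in the fine-tuning Job
    lora_alpha=16,  # matches LORA_ALPHA in the fine-tuning Job
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()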
Objectives
This guide is intended for new or existing users of GKE, ML engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities to fine-tune LLMs on H100, A100, and L4 GPU hardware.
By the end of this guide, you should be able to perform the following steps:
- Prepare your environment with a GKE cluster in Autopilot mode.
- Create a fine-tuning container.
- Use GPUs to fine-tune the Gemma 2B model and upload the model to Hugging Face.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for L4 GPUs. To learn more, see About GPUs and Allocation quotas.
Get access to the model
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, and then generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page on Kaggle.com.
- Verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
To access the model through Hugging Face, you'll need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Write.
- Select Generate a token.
- Copy the generated token to your clipboard.
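Optionally, before storing the token in your cluster, you can check that it works. One way (a sketch, assuming the huggingface_hub Python library is installed) is to call whoami with the token:

# Optional sanity check: confirm the access token is valid.
# Requires the huggingface_hub library (pip install huggingface_hub).
from huggingface_hub import whoami

token = "hf_..."  # paste the token you just generated
print(whoami(token=token)["name"])  # prints your Hugging Face profile ID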
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export CLUSTER_NAME=CLUSTER_NAME
export HF_TOKEN=HF_TOKEN
export HF_PROFILE=HF_PROFILE

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
- HF_TOKEN: the Hugging Face token you generated earlier.
- HF_PROFILE: the Hugging Face Profile ID that you created earlier.
Clone the sample code repository from GitHub:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/ai-ml/llm-finetuning-gemma
Create and configure Google Cloud resources
Follow these instructions to create the required resources.
Create a GKE cluster and node pool
You can serve Gemma on GPUs in a GKE Autopilot or Standard cluster. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Use Autopilot for a fully managed Kubernetes experience.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --release-channel=rapid \
    --cluster-version=1.29

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
Standard
In Cloud Shell, run the following command to create a Standard cluster:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --num-nodes=1

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.
Run the following command to create a node pool for your cluster:

gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=8,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --node-locations=CONTROL_PLANE_LOCATION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=g2-standard-96 \
    --num-nodes=1

GKE creates a single node pool containing eight L4 GPUs for each node.
Create a Kubernetes secret for Hugging Face credentials
In Cloud Shell, do the following:
Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

Replace the following values:
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster.
- CLUSTER_NAME: the name of your cluster.
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=$HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -

Replace $HF_TOKEN with the Hugging Face token you generated earlier, or use the environment variable if you set it.
Create a fine-tuning container with Docker and Cloud Build
This container uses the PyTorch and Hugging Face Transformers code to fine-tune the existing pre-trained Gemma model.
Create an Artifact Registry Docker repository:

gcloud artifacts repositories create gemma \
    --project=PROJECT_ID \
    --repository-format=docker \
    --location=us \
    --description="Gemma Repo"

Replace PROJECT_ID with your Google Cloud project ID.

Build and push the image:

gcloud builds submit .

Export the IMAGE_URL for later use in this tutorial:

export IMAGE_URL=us-docker.pkg.dev/PROJECT_ID/gemma/finetune-gemma-gpu:1.0.0
Run a fine-tuning Job on GKE
In this section, you deploy the Gemma fine-tuning Job. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.
Open the finetune.yaml file:

apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-job
  namespace: default
spec:
  backoffLimit: 2
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: finetuner
    spec:
      terminationGracePeriodSeconds: 600
      containers:
      - name: finetuner
        image: $IMAGE_URL
        resources:
          limits:
            nvidia.com/gpu: "8"
        env:
        - name: MODEL_NAME
          value: "google/gemma-2b"
        - name: NEW_MODEL
          value: "gemma-2b-sql-finetuned"
        - name: LORA_R
          value: "8"
        - name: LORA_ALPHA
          value: "16"
        - name: TRAIN_BATCH_SIZE
          value: "1"
        - name: EVAL_BATCH_SIZE
          value: "2"
        - name: GRADIENT_ACCUMULATION_STEPS
          value: "2"
        - name: DATASET_LIMIT
          value: "1000"
        - name: MAX_SEQ_LENGTH
          value: "512"
        - name: LOGGING_STEPS
          value: "5"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      restartPolicy: OnFailure

Apply the manifest to create the fine-tuning Job:

envsubst < finetune.yaml | kubectl apply -f -

This command replaces $IMAGE_URL in the manifest with the value of the environment variable.

Monitor the Job by running the following command:

watch kubectl get pods

Check the logs of the Job by running the following command:

kubectl logs job.batch/finetune-job -f

The Job resource downloads the model data, then fine-tunes the model across all eight GPUs. This process can take up to 20 minutes.
After the Job is complete, go to your Hugging Face account. A new model named HF_PROFILE/gemma-2b-sql-finetuned appears in your Hugging Face profile.
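If you prefer to verify from code instead of the website, the following Python sketch (assuming the huggingface_hub library and the HF_PROFILE value you set earlier) checks that the fine-tuned model exists on the Hub:

# Optional: confirm the fine-tuned model was uploaded to Hugging Face Hub.
# Replace HF_PROFILE with your Hugging Face profile ID; pass token=... if
# the repository is private.
from huggingface_hub import model_info

info = model_info("HF_PROFILE/gemma-2b-sql-finetuned")
print(info.id)  # raises an error if the repository does not exist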
Serve the fine-tuned model on GKE
In this section, you deploy the vLLM container to serve the Gemma model. This tutorial uses a Kubernetes Deployment to deploy the vLLM container. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Create the following serve-gemma.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:v0.10.0
        resources:
          requests:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-2b
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

Create the environment variable for the new MODEL_ID:

export MODEL_ID=HF_PROFILE/gemma-2b-sql-finetuned

Replace HF_PROFILE with the Hugging Face Profile ID that you created earlier.

Replace MODEL_ID in the manifest:

sed -i "s|google/gemma-2b|$MODEL_ID|g" serve-gemma.yaml

Apply the manifest:

kubectl apply -f serve-gemma.yaml

A Pod in the cluster downloads the model weights from Hugging Face and starts the serving engine.

Wait for the Deployment to be available:

kubectl wait --for=condition=Available --timeout=700s deployment/vllm-gemma-deployment

View the logs from the running Deployment:

kubectl logs -f -l app=gemma-server
The Deployment resource downloads the model data. This process can take a few minutes. The output is similar to the following:

INFO 01-26 19:02:54 model_runner.py:689] Graph capturing finished in 4 secs.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Make sure the model is fully downloaded before proceeding to the next section.
Serve the model
In this section, you interact with the model.
Set up port forwarding
Once the model is deployed, run the following command to set up port forwarding to the model:

kubectl port-forward service/llm-service 8000:8000

The output is similar to the following:

Forwarding from 127.0.0.1:8000 -> 8000

Interact with the model using curl
In a new terminal session, use curl to chat with your model:
The following example command sends a prompt to the model server:
USER_PROMPT="Question: What is the total number of attendees with age over 30 at kubecon eu? Context: CREATE TABLE attendees (name VARCHAR, age INTEGER, kubecon VARCHAR)"curl-XPOSThttp://localhost:8000/generate\-H"Content-Type: application/json"\-d@-<<EOF{"prompt":"${USER_PROMPT}","temperature":0.1,"top_p":1.0,"max_tokens":24}EOFThe following output shows an example of the model response:
{"generated_text":" Answer: SELECT COUNT(age) FROM attendees WHERE age > 30 AND kubecon = 'eu'\n"}Depending on your query, you might have to change themax_token to get a better result. You can also use theinstruction tunded model for better chat experience.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

Replace the following values:
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
What's next
- Learn more about GPUs in GKE.
- Learn how to use Gemma with TGI on other accelerators, including A100 and H100 GPUs, by viewing the sample code in GitHub.
- Learn how to deploy GPU workloads in Autopilot.
- Learn how to deploy GPU workloads in Standard.
- Explore the Vertex AI Model Garden.
- Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn how to use Assured Workloads to apply controls to a folder in Google Cloud to meet regulatory requirements.