Fine-tune Gemma 3 on an A4 GKE cluster

This tutorial shows you how to fine-tune a Gemma 3 large language model (LLM) on a multi-node, multi-GPU GKE cluster on Google Cloud. This cluster uses an A4 virtual machine (VM) instance, which has 8 NVIDIA B200 GPUs.

The two main processes described in this tutorial are as follows:

  1. Deploy a high-performance GKE cluster by using GKE Autopilot. As part of this deployment, you build a custom container image with the necessary software pre-installed.
  2. After the cluster is deployed, you run a distributed fine-tuning job by using the set of scripts that accompany this tutorial. The job uses the Hugging Face Accelerate library.
Important: To complete this tutorial, you must have reserved the capacity to create an A4 VM. To learn more about your options for reserving capacity in AI Hypercomputer for a future date and time, see Choose a consumption option.

This tutorial is intended for machine learning (ML) engineers, researchers, platform administrators and operators, and data and AI specialists who are interested in deploying GKE clusters on Google Cloud to train LLMs.

Objectives

  • Access the Gemma 3 model by using Hugging Face.

  • Prepare your environment.

  • Create and deploy an A4 GKE cluster.

  • Fine-tune the Gemma 3 model by using the Hugging Face Accelerate library with fully sharded data parallel (FSDP).

  • Monitor your job.

  • Clean up.

Costs

In this document, you use billable components of Google Cloud, including GKE, Compute Engine, Artifact Registry, and Cloud Build.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

    gcloud init
  5. Create or select a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project that you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Verify that billing is enabled for your Google Cloud project.

  7. Enable the required APIs (you can verify that they're enabled by using the check shown after this list):

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    gcloud services enable compute.googleapis.com container.googleapis.com file.googleapis.com logging.googleapis.com cloudresourcemanager.googleapis.com servicenetworking.googleapis.com
  8. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/compute.admin, roles/iam.serviceAccountUser, roles/cloudbuild.builds.editor, roles/artifactregistry.admin, roles/storage.admin, roles/serviceusage.serviceUsageAdmin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID: Your project ID.
    • USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
    • ROLE: The IAM role that you grant to your user account.
  9. Enable the default service account for your Google Cloud project:
    gcloud iam service-accounts enable PROJECT_NUMBER-compute@developer.gserviceaccount.com \
        --project=PROJECT_ID

    Replace PROJECT_NUMBER with your project number. To review your project number, see Get an existing project.

  10. Grant the Editor role (roles/editor) to the default service account:
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
        --role=roles/editor
  11. Create local authentication credentials for your user account:
    gcloud auth application-default login
    Note: If you use a local shell and an external identity provider (IdP), and you encounter an authentication error after running the preceding command, then sign in to the gcloud CLI with your federated identity.
  12. Enable OS Login for your project:
    gcloud compute project-info add-metadata --metadata=enable-oslogin=TRUE
  13. Sign in to or create a Hugging Face account.
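
As an optional check before you continue, you can verify that the APIs from step 7 are enabled. The following sketch assumes a Unix-like shell with grep available; each of the six services should appear in the output:

gcloud services list --enabled --project=PROJECT_ID | grep -E 'compute|container|file|logging|cloudresourcemanager|servicenetworking'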

Access Gemma 3 by using Hugging Face

To use Hugging Face to access Gemma 3, do the following:

  1. Sign in to Hugging Face.
  2. Create a Hugging Face read access token:
    Click Your Profile > Settings > Access tokens > + Create new token.
  3. Copy and save the read access token value. You use it later in this tutorial.
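
Optionally, confirm that the token works before you continue. The following check calls the Hugging Face whoami endpoint with your token; a valid read token returns your account details rather than an authorization error:

curl -s -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/whoami-v2

Replace YOUR_TOKEN with the read access token that you created.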

Prepare your environment

To prepare your environment, set the following variables:

gcloud config set project PROJECT_NAME
gcloud config set billing/quota_project PROJECT_NAME
export RESERVATION=YOUR_RESERVATION_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=CLUSTER_REGION
export CLUSTER_NAME=CLUSTER_NAME
export HF_TOKEN=YOUR_TOKEN
export NETWORK=default

Replace the following:

  • PROJECT_NAME: the name of the Google Cloud project where you want to create the GKE cluster.

  • YOUR_RESERVATION_ID: the identifier for your reserved capacity.

  • CLUSTER_REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists (you can look up the reservation's zone with the command shown after this list).

  • CLUSTER_NAME: the name of the GKE cluster to create.

  • HF_TOKEN: the Hugging Face access token that you created in the previous section.
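
If you're unsure which region your reservation is in, the following lookup is a minimal sketch that lists the reservation by the name you exported; the zone in the output tells you which region to use:

gcloud compute reservations list \
    --project=${PROJECT_ID} \
    --filter="name=${RESERVATION}" \
    --format="table(name, zone)"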

Create a GKE cluster in Autopilot mode

To create a GKE cluster in Autopilot mode, run thefollowing command:

gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --release-channel=rapid

Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to Kubernetes clusters in the Google Cloud console.
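
Alternatively, you can check the cluster status from the command line. This minimal check assumes that the environment variables from the previous section are still set; the cluster is ready when the reported status is RUNNING:

gcloud container clusters describe ${CLUSTER_NAME} \
    --location=${REGION} \
    --format="value(status)"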

Create a Kubernetes secret for Hugging Face credentials

To create a Kubernetes secret for Hugging Face credentials, follow these steps:

  1. Configure kubectl to communicate with your GKE cluster:

    gcloud container clusters get-credentials $CLUSTER_NAME \
        --location=$REGION
  2. Create a Kubernetes secret to store your Hugging Face token:

    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
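
To confirm that the secret holds the value you expect, you can decode it locally. This check assumes that base64 is available in your shell:

kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode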

Prepare your workload

To prepare your workload, you do the following:

  1. Create workload scripts.

  2. Use Docker and Cloud Build to create a fine-tuning container.

Create workload scripts

To create the scripts that your fine-tuning workload uses, do the following:

  1. Create a directory for the workload scripts. Use this directory as your working directory.

    mkdir llm-finetuning-gemma
    cd llm-finetuning-gemma
  2. Create the cloudbuild.yaml file to use Cloud Build. This file creates your workload container and stores it in Artifact Registry:

    steps:
    - name: 'gcr.io/cloud-builders/docker'
      args: ['build', '-t', 'us-docker.pkg.dev/$PROJECT_ID/gemma/finetune-gemma-gpu:1.0.0', '.']
    images:
    - 'us-docker.pkg.dev/$PROJECT_ID/gemma/finetune-gemma-gpu:1.0.0'
  3. Create a Dockerfile to execute the fine-tuning job:

    FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04

    RUN apt-get update && \
        apt-get -y install python3 python3-dev gcc python3-pip python3-venv git curl vim

    RUN python3 -m venv /opt/venv
    ENV PATH="/opt/venv/bin:/usr/local/nvidia/bin:$PATH"
    ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64:$LD_LIBRARY_PATH"

    RUN pip3 install setuptools wheel packaging ninja
    RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
    RUN pip3 install \
        transformers==4.53.3 \
        datasets==4.0.0 \
        accelerate==1.9.0 \
        evaluate==0.4.5 \
        bitsandbytes==0.46.1 \
        trl==0.19.1 \
        peft==0.16.0 \
        tensorboard==2.20.0 \
        protobuf==6.31.1 \
        sentencepiece==0.2.0

    COPY finetune.py /finetune.py
    COPY accel_fsdp_gemma3_config.yaml /accel_fsdp_gemma3_config.yaml

    CMD accelerate launch --config_file accel_fsdp_gemma3_config.yaml finetune.py
    Note: This tutorial doesn't use the latest versions of the NVIDIA and PyTorch dependencies. If you need newer dependencies, see the NVIDIA documentation and the PyTorch documentation.
  4. Create the accel_fsdp_gemma3_config.yaml file. This configuration file directs Hugging Face Accelerate to split the tuning job across multiple GPUs:

    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: FSDP
    downcast_bf16: 'no'
    enable_cpu_affinity: false
    fsdp_config:
      fsdp_activation_checkpointing: false
      fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
      fsdp_cpu_ram_efficient_loading: true
      fsdp_offload_params: false
      fsdp_reshard_after_forward: true
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_transformer_layer_cls_to_wrap: Gemma3DecoderLayer
      fsdp_version: 2
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 8
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
  5. Create the finetune.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: finetune-job
      namespace: default
    spec:
      backoffLimit: 2
      template:
        metadata:
          annotations:
            kubectl.kubernetes.io/default-container: finetuner
        spec:
          terminationGracePeriodSeconds: 600
          containers:
          - name: finetuner
            image: $IMAGE_URL
            command: ["accelerate", "launch"]
            args:
            - "--config_file"
            - "accel_fsdp_gemma3_config.yaml"
            - "finetune.py"
            - "--model_id"
            - "google/gemma-3-12b-pt"
            - "--output_dir"
            - "gemma-12b-text-to-sql"
            - "--per_device_train_batch_size"
            - "8"
            - "--gradient_accumulation_steps"
            - "8"
            - "--num_train_epochs"
            - "3"
            - "--learning_rate"
            - "1e-5"
            - "--save_strategy"
            - "steps"
            - "--save_steps"
            - "100"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-b200
            cloud.google.com/reservation-name: $RESERVATION
            cloud.google.com/reservation-affinity: "specific"
            cloud.google.com/gke-gpu-driver-version: latest
          restartPolicy: OnFailure
  6. Create the finetune.py file:

    import torch
    import argparse
    import subprocess
    from datasets import load_dataset
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig
    from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
    from trl import SFTTrainer, SFTConfig
    from huggingface_hub import login


    def get_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--model_id", type=str, default="google/gemma-3-12b-pt", help="Hugging Face model ID")
        parser.add_argument("--hf_token", type=str, default=None, help="Hugging Face token for private models")
        parser.add_argument("--trust_remote", type=bool, default="False", help="Trust remote code when loading tokenizer")
        parser.add_argument("--use_fast", type=bool, default="True", help="Determines if a fast Rust-based tokenizer should be used")
        parser.add_argument("--dataset_name", type=str, default="philschmid/gretel-synthetic-text-to-sql", help="Hugging Face dataset name")
        parser.add_argument("--output_dir", type=str, default="gemma-12b-text-to-sql", help="Directory to save model checkpoints")
        # LoRA arguments
        parser.add_argument("--lora_r", type=int, default=16, help="LoRA attention dimension")
        parser.add_argument("--lora_alpha", type=int, default=16, help="LoRA alpha scaling factor")
        parser.add_argument("--lora_dropout", type=float, default=0.05, help="LoRA dropout probability")
        # SFTConfig arguments
        parser.add_argument("--max_seq_length", type=int, default=512, help="Maximum sequence length")
        parser.add_argument("--num_train_epochs", type=int, default=3, help="Number of training epochs")
        parser.add_argument("--per_device_train_batch_size", type=int, default=8, help="Batch size per device during training")
        parser.add_argument("--gradient_accumulation_steps", type=int, default=1, help="Gradient accumulation steps")
        parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate")
        parser.add_argument("--logging_steps", type=int, default=10, help="Log every X steps")
        parser.add_argument("--save_strategy", type=str, default="steps", help="Checkpoint save strategy")
        parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X steps")
        parser.add_argument("--push_to_hub", action='store_true', help="Push model back up to HF")
        parser.add_argument("--hub_private_repo", type=bool, default="True", help="Push to a private repo")
        return parser.parse_args()


    def main():
        args = get_args()

        # --- 1. Setup and Login ---
        if args.hf_token:
            login(args.hf_token)

        # --- 2. Create and prepare the fine-tuning dataset ---
        # The `create_conversation` function is no longer needed.
        # The SFTTrainer will use the `formatting_func` to apply the chat template.
        dataset = load_dataset(args.dataset_name, split="train")
        dataset = dataset.shuffle().select(range(12500))
        dataset = dataset.train_test_split(test_size=2500 / 12500)

        # --- 3. Configure Model and Tokenizer ---
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            torch_dtype_obj = torch.bfloat16
            torch_dtype_str = "bfloat16"
        else:
            torch_dtype_obj = torch.float16
            torch_dtype_str = "float16"

        tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=args.trust_remote, use_fast=args.use_fast)
        tokenizer.pad_token = tokenizer.eos_token
        gemma_chat_template = ""  # Set this to the Gemma 3 chat template string for the tokenizer.
        tokenizer.chat_template = gemma_chat_template

        # --- 4. Define the Formatting Function ---
        # This function will be used by the SFTTrainer to format each sample
        # from the dataset into the correct chat template format.
        def formatting_func(example):
            # The create_conversation logic is now implicitly handled by this.
            # We need to construct the messages list here.
            system_message = "You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA."
            user_prompt = "Given the <USER_QUERY> and the <SCHEMA>, generate the corresponding SQL command to retrieve the desired data, considering the query's syntax, semantics, and schema constraints.\n\n<SCHEMA>\n{context}\n</SCHEMA>\n\n<USER_QUERY>\n{question}\n</USER_QUERY>\n"
            messages = [
                {"role": "user", "content": user_prompt.format(question=example["sql_prompt"][0], context=example["sql_context"][0])},
                {"role": "assistant", "content": example["sql"][0]}
            ]
            return tokenizer.apply_chat_template(messages, tokenize=False)

        # --- 5. Load Model and Apply PEFT ---
        config = AutoConfig.from_pretrained(args.model_id)
        config.use_cache = False

        # We'll be loading this model full precision because we're planning to do FSDP
        # Load the base model with quantization
        print("Loading base model...")
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            config=config,
            attn_implementation="eager",
            torch_dtype=torch_dtype_obj,
        )

        # Prepare the model for k-bit training
        model = prepare_model_for_kbit_training(model)

        # Configure LoRA.
        peft_config = LoraConfig(
            lora_alpha=args.lora_alpha,
            lora_dropout=args.lora_dropout,
            r=args.lora_r,
            bias="none",
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
            task_type="CAUSAL_LM",
        )

        # Apply the PEFT config to the model
        print("Applying PEFT configuration...")
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()

        # --- 6. Configure Training Arguments ---
        training_args = SFTConfig(
            output_dir=args.output_dir,
            max_seq_length=args.max_seq_length,
            num_train_epochs=args.num_train_epochs,
            per_device_train_batch_size=args.per_device_train_batch_size,
            gradient_accumulation_steps=args.gradient_accumulation_steps,
            learning_rate=args.learning_rate,
            logging_steps=args.logging_steps,
            save_strategy=args.save_strategy,
            save_steps=args.save_steps,
            packing=False,
            label_names=["domain"],
            gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": False},
            optim="adamw_torch",
            fp16=True if torch_dtype_obj == torch.float16 else False,
            bf16=True if torch_dtype_obj == torch.bfloat16 else False,
            max_grad_norm=0.3,
            warmup_ratio=0.03,
            lr_scheduler_type="constant",
            push_to_hub=True,
            report_to="tensorboard",
            dataset_kwargs={
                "add_special_tokens": False,
                "append_concat_token": True,
            },
        )

        # --- 7. Create Trainer and Start Training ---
        trainer = SFTTrainer(
            model=model,
            args=training_args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["test"],
            formatting_func=formatting_func,
        )

        print("Starting training...")
        trainer.train()
        print("Training finished.")

        # --- 8. Save the final model ---
        print(f"Saving final model to {args.output_dir}")
        model.cpu()
        trainer.save_model(args.output_dir)
        torch.distributed.destroy_process_group()


    if __name__ == "__main__":
        main()
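
After you complete these steps, your working directory should contain the five files that you created. A quick listing to confirm before you build the container:

ls
# Expected files: Dockerfile  accel_fsdp_gemma3_config.yaml  cloudbuild.yaml  finetune.py  finetune.yaml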

Use Docker and Cloud Build to create a fine-tuning container

  1. Create an Artifact Registry Docker repository:

    gcloud artifacts repositories create gemma \
        --project=${PROJECT_ID} \
        --repository-format=docker \
        --location=us \
        --description="Gemma Repo"
  2. In the llm-finetuning-gemma directory that you created in an earlier step, run the following command to create the fine-tuning container and push it to Artifact Registry (you can verify the pushed image with the command shown after this list):

    gcloud builds submit .
  3. Export the image URL. You use it later in this tutorial:

    export IMAGE_URL=us-docker.pkg.dev/${PROJECT_ID}/gemma/finetune-gemma-gpu:1.0.0
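
To confirm that the build succeeded and the image was pushed, you can list the images in the repository. This assumes the image name and tag defined in the cloudbuild.yaml file that you created earlier:

gcloud artifacts docker images list us-docker.pkg.dev/${PROJECT_ID}/gemma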

Start your fine-tuning workload

To start your fine-tuning workload, do the following:

  1. Apply the finetune manifest to create the fine-tuning job:

    envsubst < finetune.yaml | kubectl apply -f -

    Because you're using clusters in GKE Autopilot mode, it might take a few minutes to start your GPU-enabled node.

  2. Monitor the job by running the following command:

    watch kubectl get pods
  3. Check the logs of the job by running the following command:

    kubectl logs job.batch/finetune-job -f

    The Job resource downloads the model data and then fine-tunes the model across all eight GPUs. The download takes around five minutes to complete. After the download is complete, the fine-tuning process takes approximately two hours and 30 minutes to complete.
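
If the Pod stays in the Pending state for longer than expected, the GPU node probably hasn't been provisioned yet. The following checks are a minimal sketch; they rely on the job-name label that Kubernetes adds to Pods created by a Job and on the node selector values from finetune.yaml:

kubectl describe job finetune-job
kubectl get pods -l job-name=finetune-job -o wide
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-b200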

Monitor your workload

You can monitor the GPU usage in your GKE cluster to verify that your fine-tuning job is running efficiently. To do so, open the following link in your browser:

https://console.cloud.google.com/kubernetes/clusters/details/[CLUSTER_REGION]/[CLUSTER_NAME]/observability?mods=monitoring_api_prod&project=[YOUR_PROJECT_ID]&pageState=("timeRange":("duration":"PT1H"),"nav":("section":"gpu"),"groupBy":("groupByType":"namespacesTop5"))

When you monitor your workload, you can see the following:

  • GPU usage: for a healthy fine-tuning job, you can expect the usage of all eight GPUs to rise and stabilize at a high level throughout training.
  • Job duration: the job should take approximately 10 minutes to complete on the specified A4 cluster.
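
If you prefer to check GPU utilization from the command line rather than the console, one option is to run nvidia-smi inside the fine-tuning container. This sketch assumes that the Pod carries the standard job-name label and that the NVIDIA tools injected by GKE are on the container's PATH:

POD=$(kubectl get pods -l job-name=finetune-job -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it ${POD} -- nvidia-smi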

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your project

Caution: Deleting a project has the following effects:
  • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
  • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete your resources

  1. To delete your fine-tuning job, run the following command:

    kubectl delete job finetune-job
  2. To delete your GKE cluster, run the following command:

    gcloud container clusters delete $CLUSTER_NAME \
        --region=$REGION
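
  3. Optionally, delete the Artifact Registry repository that holds the fine-tuning image. Deleting the cluster doesn't remove this repository; the following command assumes the repository name and location used earlier in this tutorial:

    gcloud artifacts repositories delete gemma \
        --project=${PROJECT_ID} \
        --location=us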

What's next
