Storage
Choosing the right storage configuration is critical for the performance and stability of your training cluster. The service integrates with two distinct, high-performance storage solutions:
- Filestore: A required managed file service that provides the shared /home directories for all nodes in the cluster.
- Google Cloud Managed Lustre: An optional parallel file system designed for extreme I/O performance, ideal for training on massive datasets.
This page provides an overview of their key uses and outlines the specific networking and deployment requirements for a successful integration with your cluster.
Storage integration for training clusters
Vertex AI training clusters rely on specific, networked storage solutions for their operation. Filestore is required to provide the shared /home directories for the cluster, while Managed Lustre is an optional high-performance file system for demanding workloads.
It's critical to configure the networking for these storage services correctly before deploying your cluster.
Filestore for home directories
This service uses a Filestore instance to provide the shared /home directory for the cluster. To ensure proper connectivity, you must create your cloud resources in this specific order:
- Create the VPC network: First, deploy a VPC network configured with the recommended MTU (for example, 8896), as shown in the sketch after this list.
- Create the Filestore instance: Next, deploy the Filestore instance into the VPC you just created.
- Create the training cluster: Finally, deploy the cluster, which will then be able to connect to the Filestore instance within the same network.
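For step 1, the following is a minimal sketch of creating such a network with the gcloud CLI. The network name, project ID, region, and subnet range are placeholders, not values prescribed by this service:

# Create a custom-mode VPC with the recommended MTU
gcloud compute networks create NETWORK \
  --project=PROJECT_ID \
  --subnet-mode=custom \
  --mtu=8896

# Add a subnet for the cluster nodes (example range; choose one that fits your plan)
gcloud compute networks subnets create NETWORK-subnet \
  --project=PROJECT_ID \
  --network=NETWORK \
  --region=REGION \
  --range=10.0.0.0/16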
Google Cloud Managed Lustre for high-performance workloads
For workloads that require maximum I/O performance, you can attach a Managed Lustre file system. This service connects to your VPC using Private Service Access.
Critical networking limitation: No transitive peering
A critical limitation for both Filestore and Google Cloud Managed Lustre is that they don't support transitive peering. This means only resources within the directly connected VPC can access the storage service. For example, if your cluster's VPC (N1) is peered with the storage service, another VPC (N2) that is peered with N1 won't have access.
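To see which networks are directly peered with your cluster's VPC, and therefore can reach the storage services, you can list its peering connections. This is a quick sketch with the same placeholder names used elsewhere on this page:

# List the peerings on the cluster's VPC; only the VPC itself and its
# directly peered service networks can reach the storage instances
gcloud compute networks peerings list \
  --project=PROJECT_ID \
  --network=NETWORK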
Filestore
Key uses of Filestore with training clusters
Beyond its role as the mandatory home directory, Filestore provides a flexible way to share data with your cluster.
- Additional shared storage: You can attach one or more additional Filestore instances to any node pool. This is useful for providing shared datasets, application binaries, or other common files to your training jobs. When specified in the node pool configuration, training clusters automatically mount these instances to the /mnt/filestore directory on each node (see the quick check after this list).
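As a quick check on any node in the pool, assuming an additional instance was attached through the node pool configuration, something like the following should show the mounted share:

# Attached instances appear under the /mnt/filestore convention described above
df -h | grep /mnt/filestore
ls /mnt/filestore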
Filestore requirements
A successful Filestore integration with training clusters requires the following configuration:
- Enable the API: The Filestore API must be enabled in your Google Cloud project before you can create the cluster (see the command after this list).
- Mandatory /home directory: Every training cluster requires a dedicated Filestore instance to serve as the shared /home directory. This instance has specific configuration requirements:
  - Network: It must reside in the same VPC network as the cluster's compute and login nodes.
  - Location: It must be located in the same region or zone as the cluster.
  - Configuration: You must specify the full resource name of this instance in the orchestrator_spec.slurm_spec.home_directory_storage field when creating the cluster via the API.
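Enabling the Filestore API is a single gcloud call; this sketch assumes only the PROJECT_ID placeholder:

# Enable the Filestore API in the project that will host the cluster
gcloud services enable file.googleapis.com --project=PROJECT_ID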
Configure Filestore storage
Create a zonal or regional Filestore instance in the zone where you want to create the cluster. The Vertex AI API requires a Filestore instance to be attached to the cluster to serve as the /home directory. This Filestore instance has to be in the same zone or region and in the same network as all the compute nodes and login nodes. In the example below, 172.16.10.0/24 is used for the Filestore deployment.
SERVICE_TIER=ZONAL # Can use BASIC_SSD

# Create reserved IP address range
gcloud compute addresses create CLUSTER_ID-fs-ip-range \
  --project=PROJECT_ID \
  --global \
  --purpose=VPC_PEERING \
  --addresses=172.16.10.0 \
  --prefix-length=24 \
  --description="Filestore instance reserved IP range" \
  --network=NETWORK

# Get the CIDR range
FS_IP_RANGE=$(gcloud compute addresses describe CLUSTER_ID-fs-ip-range \
  --project=PROJECT_ID \
  --global \
  --format="value[separator=/](address, prefixLength)")

# Create the Filestore instance
gcloud filestore instances create FS_INSTANCE_ID \
  --project=PROJECT_ID \
  --location=ZONE \
  --tier="${SERVICE_TIER}" \
  --file-share=name="nfsshare",capacity=1024 \
  --network=name=NETWORK,connect-mode=DIRECT_PEERING,reserved-ip-range="${FS_IP_RANGE}"
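To confirm the instance is ready before creating the cluster, a describe call like the following (same placeholders as above) shows its state and exported file share:

# Check the instance state and the exported file share
gcloud filestore instances describe FS_INSTANCE_ID \
  --project=PROJECT_ID \
  --location=ZONE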
Lustre
Google Cloud Managed Lustre delivers a high-performance, fully managed parallel file system optimized for AI and HPC applications. With multi-petabyte-scale capacity and up to 1 TBps throughput, Managed Lustre facilitates the migration of demanding workloads to the cloud.
Managed Lustre instances live in zones within regions. A region is a specific geographical location where you can run your resources. Each region is subdivided into several zones. For example, the us-central1 region in the central United States has zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. For more information, see Geography and regions.
To decrease network latency, we recommend creating a Managed Lustre instance in a region andzone that's close to where you plan to use it.
When creating a Managed Lustre instance, you must define the following properties:
- The name of the instance used by Google Cloud.
- The file system name used by client-side tools, for example lfs.
- The storage capacity in gibibytes (GiB). Capacity can range from 9,000 GiB to ~8 PiB (7,632,000 GiB). The maximum size of an instance depends on its performance tier.
- Managed Lustre offers performance tiers ranging from 125 MBps per TiB to 1000 MBps per TiB.
- For best performance, create your instance in the same zone as your training cluster.
- The VPC network for this instance must be the same one your training cluster uses.
Managed Lustre offers 4 performance tiers, each with a different maximum throughput speed per TiB. Performance tiers also affect the minimum and maximum instance size, and the step size between acceptable capacity values. You can't change an instance's performance tier after it's been created.
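As a rough worked example of the per-TiB scaling, using the example instance size from the create command later on this page: a 36,000 GiB instance is about 35 TiB, so on the 500 MBps per TiB tier it delivers on the order of 35 × 500 ≈ 17,500 MBps (roughly 17.5 GBps) of aggregate throughput. Treat this as an illustration of how throughput scales with capacity, not as a quoted specification.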
Deploying Managed Lustre requires Private Service Access, which establishes VPC peering between the training cluster's VPC and the VPC hosting Managed Lustre, using a dedicated /20 subnet.
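A minimal sketch of setting up Private Service Access for this peering, assuming the placeholder range name lustre-psa-range and the same NETWORK and PROJECT_ID placeholders used elsewhere on this page:

# Reserve a dedicated /20 range for Private Service Access
gcloud compute addresses create lustre-psa-range \
  --project=PROJECT_ID \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=20 \
  --network=NETWORK

# Peer the VPC with the service producer network over that range
gcloud services vpc-peerings connect \
  --project=PROJECT_ID \
  --service=servicenetworking.googleapis.com \
  --ranges=lustre-psa-range \
  --network=NETWORK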
Configure Managed Lustre instance (optional)
Use Google Cloud Managed Lustre only if you want to use Managed Lustre in Model Development Service.
Google Cloud Managed Lustre is a fully managed, high-performance parallel file system service on Google Cloud. It's specifically designed to accelerate demanding workloads in AI/Machine Learning and High-Performance Computing (HPC).
For optimal performance when using training clusters, deploy Google Cloud Managed Lustre in the same VPC and zone as your training cluster, using VPC peering to services networking.
Create Lustre instance
gcloud lustre instances create LUSTRE_INSTANCE_ID \
  --project=PROJECT_ID \
  --location=ZONE \
  --filesystem=lustrefs \
  --per-unit-storage-throughput=500 \
  --capacity-gib=36000 \
  --network=NETWORK_NAME
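To verify the instance after creation, a hedged sketch using the same placeholders (the describe subcommand is assumed to be available in your gcloud version):

# Check the instance state, capacity, and mount information
gcloud lustre instances describe LUSTRE_INSTANCE_ID \
  --project=PROJECT_ID \
  --location=ZONE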
Cloud Storage mounting
As a prerequisite, make sure that the VM service account has the Storage Object User role.
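One way to grant that role on a specific bucket; SA_EMAIL and BUCKET_NAME are placeholders for your node service account and bucket:

# Grant the Storage Object User role on the bucket to the VM service account
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/storage.objectUser"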
Default mount
Vertex AI training clusters use Cloud Storage FUSE to dynamically mount your Cloud Storage buckets on all login and compute nodes, making them accessible under the /gcs directory. Dynamically mounted buckets can't be listed from the root mount point /gcs. You can access the dynamically mounted buckets as subdirectories of /gcs. The bucket name must be specified as part of the operation.

user@testcluster:$ ls /gcs/your-bucket-name
user@testcluster:$ cd /gcs/your-bucket-name

Custom mount
To mount a specific Cloud Storage bucket to a local directory with custom options, use the following command structure, either by passing it as part of the startup script on cluster creation or by running it directly on the node after the cluster is created.
sudo mkdir -p $MOUNT_DIR
echo "$GCS_BUCKET $MOUNT_DIR gcsfuse $OPTION_1,$OPTION_2,..." | sudo tee -a /etc/fstab
sudo mount -a

For example, to mount the bucket mtdata to the /data directory, use the following command:
sudo mkdir -p /data
echo "mtdata /data gcsfuse defaults,_netdev,implicit_dirs,allow_other,dir_mode=777,file-mode=777,metadata_cache_negative_ttl_secs=0,metadata_cache_ttl_secs=-1,stat_cache_max_size_mb=-1,type_cache_max_size_mb=-1,enable_streaming_writes=true" | sudo tee -a /etc/fstab
sudo mount -a

For a fully automated and consistent setup, include your custom mount scripts within the cluster's startup scripts. This practice ensures that your Cloud Storage buckets are automatically mounted across all nodes on startup, eliminating the need for manual configuration.
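As a sketch of that startup-script approach, reusing the example bucket mtdata and mount point /data from above (the grep guard makes the script safe to run more than once):

#!/bin/bash
# Startup script: mount the bucket on every node at boot
mkdir -p /data
# Append the fstab entry only if it isn't already present
grep -q "^mtdata /data gcsfuse" /etc/fstab || \
  echo "mtdata /data gcsfuse defaults,_netdev,implicit_dirs,allow_other" >> /etc/fstab
mount -a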
For additional configuration recommendations tailored to AI/ML workloads, see the Performance tuning best practices guide. It provides specific guidance for optimizing Cloud Storage FUSE for training, inference, and checkpointing.
What's next
The next steps focus on using your cluster effectively for large-scale training.
- Adapt your code for distributed training: To take full advantage of a multi-node cluster and high-performance storage, adapt your training code for a distributed environment.
- Orchestrate your jobs with Vertex AI Pipelines: For production workflows, automate the process of data preparation, job submission, and model registration using Vertex AI Pipelines.
- Monitor and debug your training jobs: Track the progress and resource utilization of your distributed training jobs to identify and resolve issues.