Design storage for AI and ML workloads in Google Cloud

Last reviewed 2025-09-16 UTC

This document provides design guidance on how to choose and integrate Google Cloud storage services for your AI and ML workloads. Each stage in the ML lifecycle has different storage requirements. For example, when you upload the training dataset, you might prioritize storage capacity for training and high throughput for large datasets. Similarly, the training, tuning, serving, and archiving stages have different requirements.

This document helps you assess your capacity, latency, and throughput requirements to make informed choices about the appropriate storage solution. This document assumes that you've selected a compute platform that meets the requirements of your workload. For AI and ML workloads, we recommend that you use either Compute Engine or Google Kubernetes Engine (GKE). For more information about selecting a compute platform, see Hosting Applications on Google Cloud.

The following sections provide a brief summary of the recommended storage choices for each stage of the ML workflow. For more information, see Choose appropriate storage.

Prepare

In the preparation stage of the ML workflow, you do the following:

  1. Upload and ingest data.
  2. Transform the data into the correct format before you train the model.

To optimize storage costs by using multiple storage classes, we recommend that you use the Cloud Storage Autoclass feature or object lifecycle management.

Train

In the training stage of the ML workflow, you do the following:

  1. Model development: Develop your model by using notebooks and applying iterative trial and error.
  2. Model training:
    • Use small-scale to large-scale numbers of machine accelerators to repeatedly read the training dataset.
    • Apply an iterative process to model development and training.
  3. Checkpointing and restarting:
    • Save state periodically during model training by creating a checkpoint so that the training can restart after a node failure.
    • Make the checkpointing selection based on the I/O pattern and the amount of data that needs to be saved at the checkpoint.

For training, we recommend that you use Managed Lustre for most workloads. When you choose a storage option, consider your workload characteristics:

  • Use Managed Lustre if your workload has these characteristics:
    • Training data that consists of small files of less than 50 MB to take advantage of low latency capabilities.
    • A latency requirement of less than 1 millisecond to meet storage requirements for random I/O and metadata access.
    • A requirement to perform frequent high-performance checkpointing.
    • A desktop-like experience with full POSIX support to view and manage the data for your users.
  • Use Cloud Storage with Cloud Storage FUSE and Anywhere Cache if your workload has these characteristics:
    • Training data that consists of large files of 50 MB or more.
    • Tolerance for higher storage latency in the tens of milliseconds.
    • A priority of data durability and high availability over storage performance.

To optimize costs, we recommend that you use the same storage service throughout all of the stages of model training.

Serve

In the serving stage of the ML workflow, you do the following:

  1. Store the model.
  2. Load the model into an instance that runs machine accelerators at startup.
  3. Store results of model inference, such as generated images.
  4. Optionally, store and load the dataset used for model inference.

For serving, we recommend that you use Cloud Storage with Cloud Storage FUSE and Anywhere Cache for most workloads. When you choose a storage option, consider your workload characteristics:

  • Use Cloud Storage with Cloud Storage FUSE and Anywhere Cache if your workload has these characteristics:
    • A requirement for a dynamic environment where the number of inferencing nodes can change.
    • Infrequent updates to your model.
    • A requirement to serve models from multiple zones and regions within a continent.
    • A priority for high availability and durability for your models, even in the event of regional disruptions.
  • Use Managed Lustre if your workload has these characteristics:
    • Your training and checkpointing workload uses Managed Lustre.
    • A requirement to serve models from a single zone.
    • A requirement for reliable high throughput and consistent, low-latency I/O for performance-sensitive models.
    • Frequent updates to your model.

Archive

In the archiving stage of ML workloads, you retain the training data and the model for extended time periods.

To optimize storage costs with multiple storage classes, we recommend that you use Cloud Storage Autoclass or object lifecycle management.

Overview of the design process

To determine the appropriate storage options for your AI and ML workload in Google Cloud, you do the following:

  1. Consider the characteristics of your workload, performance expectations, and cost goals.
  2. Review the recommended storage services and features in Google Cloud.
  3. Based on your requirements and the available options, choose the storage services and features that you need for each stage in the ML workflow: prepare, train, serve, and archive.

This document focuses on the stages of the ML workflow where careful consideration of storage options is most critical, but it doesn't cover the entirety of the ML lifecycle, processes, and capabilities.

The following provides an overview of the three-phase design process for choosing storage for your AI and ML workload:

  1. Define your requirements:
    • Workload characteristics
    • Security constraints
    • Resilience requirements
    • Performance expectations
    • Cost goals
  2. Review storage options:
    • Managed Lustre
    • Cloud Storage
  3. Choose appropriate storage: Choose storage services, features, and design options based on your workload characteristics at each stage of the ML workflow.

Define your requirements

Before you choose storage options for your AI and ML workload in Google Cloud, you must define the storage requirements for the workload. To define storage requirements, you should consider factors such as compute platform, capacity, throughput, and latency requirements.

To help you choose a storage option for your AI and ML workloads, consider thecharacteristics of your workload:

  • Are your I/O request sizes and file sizes small (KBs), medium, or large (MBs or GBs)?
  • Does your workload primarily exhibit sequential or random file access patterns?
  • Are your AI and ML workloads sensitive to I/O latency and time to first byte (TTFB)?
  • Do you require high read and write throughput for single clients, aggregated clients, or both?
  • What is the largest number of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) that your single largest AI and ML training workload requires?

You use your answers to these questions to choose appropriate storage later in this document.

Review storage options

Google Cloud offers storage services for all of the primary storage formats: block, file, parallel file system, and object. The following table describes options that you can consider for your AI and ML workload on Google Cloud. The table includes the Google-managed storage options that this document focuses on for your AI and ML workloads. However, if you have specific requirements that aren't addressed by these offerings, consider partner-managed storage solutions that are available in Google Cloud Marketplace.

Review and evaluate the features, design options, and relative advantages of the services available for each storage format.

Storage service     Storage type            Features
Managed Lustre      Parallel file system    See the Managed Lustre section.
Cloud Storage       Object                  See the Cloud Storage section.

Managed Lustre

Managed Lustre is a fully managed file system in Google Cloud. Managed Lustre provides persistent, zonal instances that are built on the DDN EXAScaler Lustre file system. Managed Lustre is ideal for AI and ML workloads that need low-latency access of less than one millisecond, high throughput, and high input/output operations per second (IOPS). Managed Lustre can maintain high throughput and high IOPS for a few VMs or for thousands of VMs.

Managed Lustre provides the following benefits:

  • POSIX compliance: Support for the POSIX standard, which helps to ensure compatibility with many existing applications and tools.
  • Lower total cost of ownership (TCO) for training: Accelerate training time by efficiently delivering data to compute nodes. This acceleration helps to reduce the TCO for AI and ML model training.
  • Lower TCO for serving: Enable faster model loading and optimized inference serving in comparison to Cloud Storage. These capabilities help to lower compute costs and help to improve resource utilization.
  • Efficient resource utilization: Combine checkpointing and training within a single instance. This resource utilization helps to maximize the efficient use of read and write throughput in a single, high-performance storage system.

Cloud Storage

Cloud Storage is a fully managed object storage service that's suitable for AI and ML workloads of any scale. Cloud Storage excels at handling unstructured data for all of the phases of the AI and ML workflow.

Cloud Storage provides the following benefits:

  • Massive scalability: Gain unlimited storage capacity that scales to exabytes on a global basis.
  • High throughput: Scale up to 1 TB/s with the required planning.
  • Flexible location options: Choose from regional, multi-region, and dual-region storage options for AI and ML workloads.
  • Cost-effectiveness: Benefit from a range of storage classes that are designed to optimize costs based on your data access patterns.

Cloud Storage excels in scale and cost-effectiveness, but it's important to consider its latency and I/O characteristics. Expect latency in the tens of milliseconds, which is higher than other storage options. To maximize throughput, you need to use hundreds or thousands of threads, large files, and large I/O requests. Cloud Storage provides client libraries in various programming languages, and it provides Cloud Storage FUSE and Anywhere Cache.
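For example, the Cloud Storage Python client library includes a transfer manager that issues many object requests in parallel from a single client. The following is a minimal sketch of that pattern; the bucket name, object prefix, destination path, and worker count are illustrative assumptions, and it assumes a google-cloud-storage release that includes the transfer_manager module.

```python
# Sketch: parallel reads from Cloud Storage with the Python client's
# transfer manager. Bucket name, prefix, destination, and worker count
# are placeholders.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket("my-training-data")  # hypothetical bucket name

# Objects to fetch, for example one shard of the training set.
blob_names = [blob.name for blob in bucket.list_blobs(prefix="shards/")]

# More workers keep more requests in flight, which is how Cloud Storage
# reaches high aggregate throughput from a single client.
results = transfer_manager.download_many_to_path(
    bucket,
    blob_names,
    destination_directory="/tmp/training-shards",
    max_workers=32,
)
for name, result in zip(blob_names, results):
    if isinstance(result, Exception):
        print(f"Failed to download {name}: {result}")
```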

Cloud Storage FUSE is an open source FUSE adapter that's supported by Google. Cloud Storage FUSE lets you mount Cloud Storage buckets as local drives. Cloud Storage FUSE isn't fully compliant with POSIX. Therefore, it's important that you understand the Cloud Storage FUSE limitations and differences from traditional file systems. With Cloud Storage FUSE, you can access your training data, models, and checkpoints with the scale, affordability, and performance of Cloud Storage.

Cloud Storage FUSE provides the following benefits:

  • Portability: Mount and access Cloud Storage buckets by using standard file system semantics, which makes your applications more portable.
  • Compatibility: Eliminate the need to refactor applications to use cloud-specific APIs, which saves you time and resources.
  • Reduced idle time: Start training jobs quickly by directly accessing data in Cloud Storage, which minimizes idle time for your GPUs and TPUs.
  • High throughput: Take advantage of the built-in scalability and performance of Cloud Storage, which is optimized for read-heavy ML workloads with GPUs or TPUs.
  • Client-local file cache: Accelerate training with a client-local cache that speeds up repeated file reads. This acceleration can be further enhanced when you use it with the 6 TiB local SSD that's bundled with A3 machine types.

Anywhere Cache is a Cloud Storage feature that provides up to 1 PiB of SSD-backed zonal read cache for Cloud Storage buckets. Anywhere Cache is designed to accelerate data-intensive applications by providing a local, fast-access layer for frequently read data within a specific zone.

Anywhere Cache provides the following benefits:

  • Accelerated throughput: Automatically scale cache capacity and bandwidth to deliver high throughput, exceeding regional bandwidth quotas, with consistent and predictable latencies.
  • Reduced cost: Avoid data transfer egress fees or storage class retrieval fees for cached data. Anywhere Cache automatically sizes the cache and available bandwidth to meet your workload needs.

Partner storage solutions

For workload requirements that the preceding storage services don't meet, you can use partner storage solutions that are available in Cloud Marketplace.

These partner solutions are not managed by Google. You need to manage the deployment and operational tasks to ensure optimal integration and performance within your infrastructure.

Comparative analysis

The following comparison shows the key capabilities of Managed Lustre and Cloud Storage.

Capacity
  • Managed Lustre: 18 TiB to 8 PiB.
  • Cloud Storage: No lower or upper limit.

Scaling
  • Managed Lustre: Scalable.
  • Cloud Storage: Scales automatically based on usage.

Sharing
  • Managed Lustre: Mountable on multiple Compute Engine VMs and GKE clusters.
  • Cloud Storage: Read/write from anywhere; integrates with Cloud CDN and third-party CDNs.

Encryption key options
  • Managed Lustre: Google-owned and Google-managed encryption keys.
  • Cloud Storage: Google-owned and Google-managed, customer-managed, or customer-supplied encryption keys.

Persistence
  • Managed Lustre: Lifetime of the Managed Lustre instance.
  • Cloud Storage: Lifetime of the bucket.

Availability
  • Managed Lustre: Zonal.
  • Cloud Storage: Regional, dual-region, or multi-region.

Performance
  • Managed Lustre: Linear scaling with provisioned capacity and multiple performance tier options.
  • Cloud Storage: Autoscaling read-write rates and dynamic load redistribution.

Management
  • Managed Lustre: Fully managed, POSIX compliant.
  • Cloud Storage: Fully managed.

Data transfer tools

This section describes your options for moving data between storage services on Google Cloud. When you perform AI and ML tasks, you might need to move your data from one location to another. For example, if your data starts in Cloud Storage, you might move it elsewhere to train the model, and then copy the checkpoint snapshots or trained model back to Cloud Storage.

You can use the following methods to transfer data to Google Cloud:

  • Transfer data online by using Storage Transfer Service: Automate the transfer of large amounts of data between object and file storage systems, including Cloud Storage, Amazon S3, Azure storage services, and on-premises data sources. Storage Transfer Service lets you copy your data securely from the source location to the target location and perform periodic transfers of changed data. It also provides data integrity validation, automatic retries, and load balancing.
  • Upload data to Cloud Storage: Upload data online to Cloud Storage buckets by using the Google Cloud console, the gcloud CLI, the Cloud Storage APIs, or the client libraries, as shown in the sketch after this list.
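The following is a minimal sketch of an upload with the Cloud Storage Python client library; the bucket name, object name, and local file path are placeholders.

```python
# Sketch: upload a local file to a Cloud Storage bucket with the Python
# client library. Bucket, object, and file names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-data")       # hypothetical bucket name
blob = bucket.blob("raw/images/batch-0001.tar")  # destination object name

# Larger chunks mean fewer requests for large files.
blob.chunk_size = 64 * 1024 * 1024  # 64 MiB, a multiple of 256 KiB
blob.upload_from_filename("/data/batch-0001.tar")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```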

When you choose a data transfer method, consider factors like the data size, time constraints, bandwidth availability, cost goals, and security and compliance requirements. For information about planning and implementing data transfers to Google Cloud, see Migrate to Google Cloud: Transfer your large datasets.

Choose appropriate storage

AI and ML workloads typically involve four primary stages: prepare, train, serve, and archive. Each of these stages has unique storage requirements, and choosing the right solution can significantly impact performance, cost, and operational efficiency. A hybrid or locally optimized approach lets you tailor your storage choices to the specific demands of each stage for your AI and ML workload. However, if your priorities are unified management and ease of operation, then a globally simplified approach that uses a consistent solution across all of the stages can be beneficial for workloads of any scale. The effectiveness of the storage choice depends on the dataset properties, the scale of the required compute and storage resources, latency, and the workload requirements that you defined earlier.

The following sections provide details about the primary stages of AI and ML workloads and the factors that might influence your storage choice.

Prepare

The preparation stage sets the foundation for your AI and ML application. It involves uploading raw data from various sources into your cloud environment and transforming the data into a usable format for training your AI and ML model. This process includes tasks like cleaning, processing, and converting data types to ensure compatibility with your chosen AI and ML framework.

Cloud Storage is well-suited for the preparation stage due to its scalability, durability, and cost-efficiency, particularly for the large datasets that are common in AI. Cloud Storage offers seamless integration with other Google Cloud services, which lets you take advantage of potential optimizations for data-intensive training.

During the data preparation phase, you can reorganize your data into large chunks to improve access efficiency and avoid random read requests, as shown in the sketch that follows. To further reduce the I/O performance requirements on the storage system, you can increase the number of I/O threads by using pipelining, training optimization, or both.
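One way to create larger chunks without downloading the data is to combine small objects server-side with the Cloud Storage compose operation. The following is a minimal sketch; the bucket name, prefixes, and grouping scheme are illustrative assumptions.

```python
# Sketch: combine small objects into larger composite objects so that the
# training job can issue fewer, larger reads. Names are placeholders;
# Blob.compose() accepts up to 32 source objects per call.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-data")  # hypothetical bucket name

small_blobs = list(bucket.list_blobs(prefix="raw/records/"))
CHUNK = 32  # maximum number of source objects per compose request

for i in range(0, len(small_blobs), CHUNK):
    group = small_blobs[i : i + CHUNK]
    combined = bucket.blob(f"prepared/records-{i // CHUNK:05d}.bin")
    combined.compose(group)  # server-side concatenation, no local download
```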

Train

The training stage is the core of model development, where your AI and ML model learns from the provided data. This stage involves two key aspects that have distinct requirements: efficient data loading for accessing training data and reliable checkpointing for saving model progress. The following sections provide recommendations and the factors to consider when you choose appropriate storage options for data loading and checkpointing.

Data loading

During data loading, GPUs or TPUs repeatedly import batches of data to train the model. In this phase, you can use a cache to optimize data-loading tasks, depending on the size of the batches and the order in which you request them. Your goal during data loading is to train the model with maximum efficiency but at the lowest cost.

If the size of your training data scales to petabytes, the data might need to be re-read multiple times. Such a scale requires intensive processing by a GPU or TPU accelerator. However, you need to ensure that your GPUs and TPUs aren't idle, and ensure that they process your data actively. Otherwise, you pay for an expensive, idle accelerator while you copy the data from one location to another.

To optimize performance and cost for data loading, consider the following factors:

  • Dataset size: The size of your overall training data corpus, and the size of each training dataset.
  • Access patterns: Which one of the following options best categorizes your training workload I/O access pattern:
    • Parallel and sequential access: A file is assigned to a single node and is read sequentially.
    • Parallel and random access: A file is assigned to a single node and is read randomly to create a batch of samples.
    • Fully random access: A node can read any range from any file to create a batch.
  • File size: The typical read request sizes.

Managed Lustre for data loading

We generally recommend that you use Managed Lustre for training and checkpointing. Additionally, use Managed Lustre if any of the following conditions apply:

  • Your training data consists of small files of less than 50 MB to take advantage of low latency capabilities.
  • You have a latency requirement of less than 1 millisecond to meet storage requirements for random I/O and metadata access.
  • You need a desktop-like experience with full POSIX support to view and manage the data for your users.

You can use Managed Lustre as a high-performance cache on top of Cloud Storage to accelerate AI and ML workloads that require extremely high throughput and low latency I/O operations with a fully managed parallel file system. To minimize latency during training, you can import data from Cloud Storage to Managed Lustre and export it back to Cloud Storage. If you use GKE as your compute platform, you can use the GKE Managed Lustre CSI driver to pre-populate PersistentVolumeClaims with data from Cloud Storage. After the training is complete, you can minimize your long-term storage expenses by exporting your data to a lower-cost Cloud Storage class.

Cloud Storage for data loading

You should choose Cloud Storage with Cloud Storage FUSE and Anywhere Cache to load your data if any of the following conditions apply:

  • Your training data consists of large files of 50 MB or more.
  • You can tolerate higher storage latency in the tens of milliseconds.
  • You prioritize data durability and high availability over storage performance.

Cloud Storage offers a scalable solution for storing massive datasets, and Cloud Storage FUSE lets you access the data as a local file system. Cloud Storage FUSE accelerates data access during training by keeping the training data close to the machine accelerators, which increases throughput.
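The following is a minimal PyTorch sketch of this pattern. It assumes that the bucket is already mounted, for example with `gcsfuse my-training-data /mnt/gcs`; the mount point, data layout, and worker count are illustrative.

```python
# Sketch: read training samples through a Cloud Storage FUSE mount by using
# plain file I/O. The mount path and data layout are placeholders, and the
# sketch assumes each file is a fixed-size record so that default batching
# works.
import os
import torch
from torch.utils.data import Dataset, DataLoader

class MountedSampleDataset(Dataset):
    def __init__(self, root="/mnt/gcs/prepared"):  # hypothetical mount path
        self.paths = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Ordinary file reads; Cloud Storage FUSE turns them into object reads.
        with open(self.paths[idx], "rb") as f:
            return torch.frombuffer(bytearray(f.read()), dtype=torch.uint8)

# Multiple workers keep several reads in flight, which helps hide
# Cloud Storage latency.
loader = DataLoader(MountedSampleDataset(), batch_size=8, num_workers=8)
```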

For workloads that demand over 1 TB/s throughput, Anywhere Cache accelerates read speeds by caching data and scaling beyond regional bandwidth quotas. Anywhere Cache provides low latency for cache hits, which eliminates the need to read from the Cloud Storage bucket. To assess if Anywhere Cache is suitable for your workload, use the Anywhere Cache recommender to analyze your data usage and storage.

To enhance data access and organization, create Cloud Storage buckets with hierarchical namespaces enabled. Hierarchical namespaces let you organize data in a file system structure, which improves performance, ensures consistency, and simplifies management for AI and ML workloads. Hierarchical namespaces enable higher initial queries per second (QPS) and fast atomic directory renames.
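The following is a minimal sketch of creating such a bucket with the Python client library. It assumes a recent google-cloud-storage release that exposes the hierarchical_namespace_enabled property; the bucket name and location are placeholders.

```python
# Sketch: create a bucket with hierarchical namespace enabled. The property
# names assume a recent google-cloud-storage release; bucket name and
# location are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-hns-training-data")  # hypothetical bucket name
bucket.hierarchical_namespace_enabled = True
# Hierarchical namespace requires uniform bucket-level access.
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
client.create_bucket(bucket, location="us-central1")
```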

Checkpointing and restore

For checkpointing and restore, training jobs need to periodically save their state so that they can recover quickly from instance failures. When the failure happens, jobs must restart, ingest the latest checkpoint, and then resume training. The exact mechanism that's used to create and ingest checkpoints is typically specific to a framework. To learn about checkpoints and optimization techniques for TensorFlow Core, see Training checkpoints. To learn about checkpoints and optimization techniques for PyTorch, see Saving and Loading Models.

You only need to save a few checkpoints at any one point in time. Checkpoint workloads usually consist of mostly writes, several deletes, and, ideally, infrequent reads when failures occur.

To optimize checkpointing and restore performance, consider the following factors:

  • Model size: The number of parameters that are in your AI and ML model. The size of your model directly impacts the size of its checkpoint files, which can range from GiB to TiB.
  • Checkpoint frequency: How often your model saves checkpoints. Frequent saves provide better fault tolerance, but increase storage costs and can impact training speed.
  • Checkpoint recovery time: The recovery time that you want for loading checkpoints and resuming training. If you need to restore a checkpoint, consider that your model training is paused until the recovery is complete. To minimize recovery time, consider factors like checkpoint size, storage performance, and network bandwidth.
  • Accelerator idle time: The time that accelerators aren't processing data because they are waiting for a checkpoint write or restore operation to complete. To minimize this idle time, select a storage solution that offers high throughput and low latency.
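Regardless of the storage option, the checkpoint itself is usually written with framework APIs to a mounted file system path. The following is a minimal PyTorch sketch of periodic checkpointing and recovery; the mount path and save interval are illustrative assumptions and apply equally to a Managed Lustre mount or a Cloud Storage FUSE mount.

```python
# Sketch: periodic PyTorch checkpointing to a mounted file system
# (Managed Lustre or Cloud Storage FUSE). Paths and the save interval
# are placeholders.
import os
import torch

CKPT_DIR = "/mnt/lustre/checkpoints"  # hypothetical mount path
SAVE_EVERY = 1000                     # training steps between checkpoints

def save_checkpoint(step, model, optimizer):
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pt")
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_latest_checkpoint(model, optimizer):
    ckpts = sorted(os.listdir(CKPT_DIR))
    if not ckpts:
        return 0  # no checkpoint yet; start from step 0
    state = torch.load(os.path.join(CKPT_DIR, ckpts[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```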

Managed Lustre for checkpointing

You should choose Managed Lustre for checkpointing if any of the following conditions apply:

  • Your training workload already uses Managed Lustre for data loading.
  • You perform frequent high-performance checkpointing.

To maximize resource utilization and minimize accelerator idle time, use Managed Lustre for training and checkpointing. Managed Lustre can deliver fast checkpoint writes with high per-VM throughput. You can keep checkpoints in the persistent Managed Lustre instance, or you can optimize costs by periodically exporting checkpoints to Cloud Storage. Managed Lustre can be used alongside Cloud Storage FUSE during training: you can use Cloud Storage FUSE for data loading and training, and use Managed Lustre for increased performance during checkpointing.

Cloud Storage for checkpointing

You should choose Cloud Storage for checkpointing if any of the following conditions apply:

  • Your training workload uses Cloud Storage FUSE.
  • You prioritize data durability and high availability over storage performance.

To improve checkpointing performance, use Cloud Storage FUSE with hierarchical namespaces enabled to take advantage of the fast atomic rename operation and to save checkpoints asynchronously. To prevent accidental exposure of sensitive information from your training dataset during serving, you need to store checkpoints in a separate Cloud Storage bucket. To help reduce tail-end write latencies for stalled uploads, Cloud Storage FUSE attempts a retry after 10 seconds.
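The following is a minimal sketch of saving a checkpoint asynchronously so that the training loop isn't blocked while the file is written through Cloud Storage FUSE. The mount path is a placeholder, and the copy-then-write pattern is one simple way to keep the accelerator busy during the write.

```python
# Sketch: asynchronous checkpoint save through a Cloud Storage FUSE mount.
# The mount path is a placeholder.
import threading
import torch

CKPT_PATH = "/mnt/gcs/checkpoints/latest.pt"  # hypothetical FUSE mount path

def save_async(step, model):
    # Snapshot the state on the main thread so training can keep mutating
    # the model, then write the snapshot in the background.
    state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    thread = threading.Thread(
        target=torch.save, args=({"step": step, "model": state}, CKPT_PATH)
    )
    thread.start()
    return thread  # call join() before exiting so the write finishes
```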

Serve

When you serve your model, which is also known as inference, the primary I/O pattern is read-only in order to load the model into GPU or TPU memory. Your goal at the serving stage is to run your model in production. The model is much smaller than the training data, which means that you can replicate and scale the model across multiple instances. When you serve data, it's important that you have high availability and protection against zonal and regional failures. Therefore, you must ensure that your model is available for a variety of failure scenarios.

For many generative AI and ML use cases, the input data to the model might be quite small and the data might not need to be stored persistently. In other cases, you might need to run large volumes of data over the model (for example, scientific datasets). To run large volumes of data, choose a storage option that can minimize GPU or TPU idle time during the analysis of the dataset, and use a persistent location to store the inference results.

Model load times directly affect accelerator idle time, which incurs substantial costs. An increase in per-node model load time can be amplified across many nodes, which can lead to a significant cost increase. Therefore, to achieve cost-efficiency in serving infrastructure, it's important that you optimize for rapid model loading.

To optimize serving performance and cost, consider the following factors:

  • Model size: The size of your model in GiB or TiB. Larger models require more computational resources and memory, which can increase latency.
  • Model load frequency: How often you plan to update your model. Frequent loading and unloading consume computational resources and increase latency.
  • Number of serving nodes: How many nodes will be serving your model. More nodes generally reduce latency and increase throughput, but they also increase infrastructure costs.

Cloud Storage for serving

You should choose Cloud Storage with Cloud Storage FUSE and Anywhere Cache for serving your model if any of the following conditions apply:

  • You require a dynamic environment where the number of inferencing nodes can change.
  • You make infrequent updates to your model.
  • You serve models from multiple zones and regions within a continent.
  • You prioritize high availability and durability for your models, even in the event of zonal disruptions.

With multi-region or dual-region architecture, Cloud Storage provides high availability and protects your workload from zonal and regional failures. To accelerate model loading, you can use Cloud Storage FUSE with parallel downloads enabled so that parts of the model are fetched in parallel.
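The following is a minimal sketch of mounting a model bucket with parallel downloads enabled and then loading the model. The gcsfuse configuration keys, bucket name, and paths are assumptions to verify against the current Cloud Storage FUSE documentation.

```python
# Sketch: mount a model bucket with Cloud Storage FUSE parallel downloads
# enabled, then load the model weights. Configuration keys, bucket name,
# and paths are assumptions, not verified defaults.
import subprocess
import torch

GCSFUSE_CONFIG = """\
file-cache:
  max-size-mb: -1
  enable-parallel-downloads: true
cache-dir: /mnt/ssd/gcsfuse-cache
"""

with open("/etc/gcsfuse-model.yaml", "w") as f:
    f.write(GCSFUSE_CONFIG)

subprocess.run(
    ["gcsfuse", "--config-file", "/etc/gcsfuse-model.yaml",
     "my-model-bucket", "/mnt/models"],  # hypothetical bucket and mount point
    check=True,
)

state_dict = torch.load("/mnt/models/llm/weights.pt", map_location="cpu")
```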

To achieve model serving with over 1 TB/s throughput, or for deployments exceeding a hundred serving nodes, use Anywhere Cache with a multi-region bucket. This combination provides high-performance, redundant storage across regions, and flexibility. Anywhere Cache also eliminates data egress and storage class retrieval fees on cached data.

Managed Lustre for serving

You should choose Managed Lustre for serving your model if any of the following conditions apply:

  • Your training and checkpointing workload uses Managed Lustre.
  • You serve models from a single zone.
  • You have a requirement for reliable high throughput and consistent, low-latency I/O for performance-sensitive models.
  • You make frequent updates to your model.

If you're already using Managed Lustre for training and checkpointing, it can be a cost-effective and high-performance option for serving your models. Managed Lustre offers high per-VM throughput and aggregate cluster throughput that helps to reduce model load time. You can use Managed Lustre for any number of serving VMs.

Archive

The archive stage has an I/O pattern of "write once, read rarely." Your goal is to store the different sets of training data and the different versions of models that you generated. You can use these incremental versions of data and models for backup and disaster-recovery purposes. You must also store these items in a durable location for a long period of time. Although you might not require access to the data and models very often, you want the items to be available when you need them.

Because of its extreme durability, expansive scale, and low cost, the best Google Cloud option for storing object data over a long period of time is Cloud Storage. Depending on how often you access the dataset, model, and backup files, Cloud Storage offers cost optimization through different storage classes. You can select a storage class based on how often you expect to access your archived data:

  • Frequent data access: Standard storage
  • Monthly data access: Nearline storage
  • Quarterly data access: Coldline storage
  • Annual data access: Archive storage

Using object lifecycle management, you can create policies to automatically move data to longer-term storage classes or to delete data based on specific criteria. If you're not sure how often you might access your data, you can use the Autoclass feature to move data between storage classes automatically based on your access patterns.
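The following is a minimal sketch of these two options with the Python client library; the bucket name and age thresholds are placeholders, and the Autoclass property assumes a client library version that exposes it.

```python
# Sketch: archive-stage cost controls on a Cloud Storage bucket. The bucket
# name and age thresholds are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-ml-archive")  # hypothetical bucket name

# Option 1: explicit lifecycle rules that move older data to colder classes.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
bucket.patch()

# Option 2: let Autoclass transition objects automatically based on access
# patterns (assumes a client library version that exposes this property).
# bucket.autoclass_enabled = True
# bucket.patch()
```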

What's next

For more information about storage options and AI and ML workloads, see the following resources:

Contributors

Author: Samantha He | Technical Writer

Other contributors:
