Parallel file systems for HPC workloads

Last reviewed 2025-05-19 UTC

This document introduces the storage options in Google Cloud for high performance computing (HPC) workloads, and explains when to use parallel file systems for HPC workloads. In a parallel file system, several clients use parallel I/O paths to access shared data that's stored across multiple networked storage nodes.

The information in this document is intended for architects and administrators who design, provision, and manage storage for data-intensive HPC workloads. The document assumes that you have a conceptual understanding of network file systems (NFS), parallel file systems, POSIX, and the storage requirements of HPC applications.

What is HPC?

HPC systems solve large computational problems fast by aggregating multiple computing resources. HPC drives research and innovation across industries such as healthcare, life sciences, media, entertainment, financial services, and energy. Researchers, scientists, and analysts use HPC systems to perform experiments, run simulations, and evaluate prototypes. HPC workloads such as seismic processing, genomics sequencing, media rendering, and climate modeling generate and access large volumes of data at ever-increasing data rates and ever-decreasing latencies. High-performance storage and data management are critical building blocks of HPC infrastructure.

Storage options for HPC workloads in Google Cloud

Setting up and operating HPC infrastructure on-premises is expensive, and the infrastructure requires ongoing maintenance. In addition, on-premises infrastructure typically can't be scaled quickly to match changes in demand. Planning, procuring, deploying, and decommissioning hardware on-premises takes considerable time, resulting in delayed addition of HPC resources or underutilized capacity. In the cloud, you can efficiently provision HPC infrastructure that uses the latest technology, and you can scale your capacity on demand.

Google Cloud and our technology partners offer cost-efficient, flexible, and scalable storage options for deploying HPC infrastructure in the cloud and for augmenting your on-premises HPC infrastructure. Scientists, researchers, and analysts can quickly access additional HPC capacity for their projects when they need it.

To deploy an HPC workload in Google Cloud, you can choose from the following storage services and products, depending on the requirements of your workload:

  • Workloads that need low-latency access to data but don't require extreme I/O to shared datasets, and that have limited data sharing between clients: use NFS storage.
  • Workloads that generate complex, interdependent, and large-scale I/O, such as tightly coupled HPC applications that use the Message Passing Interface (MPI) for reliable inter-process communication: use a parallel file system.
For more information about the workload requirements that parallel file systems can support, see When to use parallel file systems.
Note: For workloads that don't require low latency or concurrent write access, you can use low-cost Cloud Storage, which supports parallel read access and automatically scales to meet your workload's capacity requirement.

When to use parallel file systems

In a parallel file system, several clients store and access shared data across multiple networked storage nodes by using parallel I/O paths. Parallel file systems are ideal for tightly coupled HPC workloads such as data-intensive artificial intelligence (AI) workloads and analytics workloads that use SAS applications. Consider using a parallel file system like Managed Lustre for latency-sensitive HPC workloads that have any of the following requirements:

  • Tightly coupled data processing: HPC workloads like weather modeling and seismic exploration need to process data repetitively by using many interdependent jobs that run simultaneously on multiple servers. These processes typically use MPI to exchange data at regular intervals, and they use checkpointing to recover quickly from failures. Parallel file systems enable interdependent clients to store and access large volumes of shared data concurrently over a low-latency network.
  • Support for the POSIX I/O API and for POSIX semantics: Parallel file systems like Managed Lustre are ideal for workloads that need both the POSIX API and POSIX semantics. A file system's API and its semantics are independent capabilities. For example, NFS supports the POSIX API, which is how applications read and write data by using functions like open(), read(), and write(). But the way NFS coordinates data access between different clients is not the same as the POSIX semantics for coordinating data access between different threads on a machine. For example, NFS doesn't support POSIX read-after-write cache consistency between clients; it relies on weak consistency in NFSv3 and close-to-open consistency in NFSv4.
  • Petabytes of capacity: Parallel file systems can be scaled to multiple petabytes of capacity in a single file system namespace. By comparison, NetApp Volumes supports up to 1 PB, and Filestore Regional and Zonal support up to 100 TiB per file system. Cloud Storage offers low-cost and reliable capacity that scales automatically, but might not meet the data-sharing semantics and low-latency requirements of HPC workloads.
  • Low latency and high bandwidth: For HPC workloads that need high-speed access to very large files or to millions of small files, parallel file systems can outperform NFS and object storage. Parallel file systems provide sub-millisecond latency, which is significantly lower than that of object storage, whose higher latency limits the maximum achievable IOPS. In addition, the maximum bandwidth that parallel file systems support can be orders of magnitude higher than that of NFS-based systems, and can saturate a VM's NIC.
  • Extreme client scaling: NFS storage can support thousands of clients. Parallel file systems can scale to support concurrent access to shared data from over 10,000 clients, and can provide high throughput regardless of the number of clients.
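
To make the API-versus-semantics distinction above concrete, the following minimal Python sketch exercises the POSIX calls that the list mentions (open(), write(), read()) through the standard os module, which maps directly onto the POSIX I/O API. The file name is illustrative:

```python
import os
import tempfile

# Illustrative only: exercise the POSIX calls named above through
# Python's os module, which wraps the POSIX I/O API directly.
path = os.path.join(tempfile.mkdtemp(), "result.dat")

# POSIX open() with create/write-only flags and 0644 permissions.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"simulation step 42")
os.close(fd)

# On a file system with POSIX semantics, a read that follows a
# completed write observes the written data (read-after-write
# consistency), even from another client. NFS only guarantees this
# across clients at close-to-open boundaries (NFSv4).
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 1024)
os.close(fd)
print(data.decode())  # simulation step 42
```

The API (the calls themselves) works identically on NFS and on a parallel file system; it's the cross-client consistency behavior commented above that differs.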

Examples of tightly coupled HPC applications

This section describes examples of tightly coupled HPC applications that need the low-latency and high-throughput storage provided by parallel file systems.

AI-enabled molecular modeling

Pharmaceutical research is an expensive and data-intensive process. Modern drug research organizations rely on AI to reduce the cost of research and development, to scale operations efficiently, and to accelerate scientific research. For example, researchers use AI-enabled applications to simulate the interactions between the molecules in a drug and to predict the effect of changes to the compounds in the drug. These applications run on powerful, parallelized GPUs that load, organize, and analyze an extreme amount of data to complete simulations quickly. Parallel file systems provide the storage IOPS and throughput that are necessary to maximize the performance of AI applications.

Credit risk analysis using SAS applications

Financial services institutions like mortgage lenders and investment banks need to constantly analyze and monitor the creditworthiness of their clients and of their investment portfolios. For example, large mortgage lenders collect risk-related data about thousands of potential clients every day. Teams of credit analysts use analytics applications to collaboratively review different parts of the data for each client, such as income, credit history, and spending patterns. The insights from this analysis help the credit analysts make accurate and timely lending recommendations.

To accelerate and scale analytics for large datasets, financial services institutions use grid computing platforms such as SAS Grid Manager. Parallel file systems like Managed Lustre support the high-throughput and low-latency storage requirements of multi-threaded SAS applications.

Weather forecasting

To predict weather patterns in a given geographic region, meteorologists divide the region into several cells, and deploy monitoring devices such as ground radars and weather balloons in every cell. These devices observe and measure atmospheric conditions at regular intervals. The devices stream data continuously to a weather-prediction application running in an HPC cluster.

The weather-prediction application processes the streamed data by using mathematical models that are based on known physical relationships between the measured weather parameters. A separate job processes the data from each cell in the region. As the application receives new measurements, every job iterates through the latest data for its assigned cell, and exchanges output with the jobs for the other cells in the region. To predict weather patterns reliably, the application needs to store and share terabytes of data that thousands of jobs running in parallel generate and access.
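
The per-cell iteration pattern described above can be sketched in a few lines of Python. This is a deliberately simplified, single-process stand-in: the averaging rule, the cell values, and the checkpoint file layout are all invented for illustration, and a real weather application would use MPI across many servers with the checkpoint written to a shared parallel file system:

```python
import json
import os
import tempfile

# Each "job" updates its cell from its own and its neighbours'
# previous values (a toy stand-in for exchanging output between
# cells), then periodically checkpoints shared state to a file,
# standing in for a parallel file system.
def step(cells):
    # Every cell averages itself with its two neighbours (circular).
    n = len(cells)
    return [
        (cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]) / 3
        for i in range(n)
    ]

def checkpoint(cells, path):
    with open(path, "w") as f:
        json.dump(cells, f)

def restore(path):
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "weather.ckpt")
cells = [0.0, 3.0, 6.0, 3.0]   # illustrative initial measurements
for i in range(10):
    cells = step(cells)
    if i % 5 == 4:             # checkpoint every 5 iterations
        checkpoint(cells, path)

# After a failure, a job recovers the latest checkpointed state
# instead of recomputing from the start.
recovered = restore(path)
assert recovered == cells
print(f"recovered {len(recovered)} cells")
```

The checkpoint-every-N-iterations pattern is why these workloads stress storage: thousands of jobs hit the shared file system simultaneously at each checkpoint interval, which is the bursty write load that parallel file systems are designed to absorb.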

CFD for aircraft design

Computational fluid dynamics (CFD) involves the use of mathematical models, physical laws, and computational logic to simulate the behavior of a gas or liquid around a moving object. When aircraft engineers design the body of an airplane, one of the factors that they consider is aerodynamics. CFD enables designers to quickly simulate the effect of design changes on aerodynamics before investing time and money in building expensive prototypes. After analyzing the results of each simulation run, the designers optimize attributes such as the volume and shape of individual components of the airplane's body, and re-simulate the aerodynamics. CFD enables aircraft designers to collaboratively simulate the effect of hundreds of such design changes quickly.

To complete design simulations efficiently, CFD applications need sub-millisecond access to shared data and the ability to store large volumes of data at speeds of up to 100 GBps.
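
A rough sizing calculation shows why a figure like 100 GBps pushes past a single NFS endpoint. The NIC speed and efficiency numbers below are illustrative assumptions, not product limits:

```python
import math

# Back-of-the-envelope sizing for the 100 GBps (gigabytes per second)
# aggregate figure above. NIC speeds are quoted in gigabits per
# second, so convert first. All numbers are illustrative assumptions.
target_gbytes_per_s = 100   # aggregate CFD read/write demand
nic_gbits_per_s = 100       # per-VM NIC line rate (assumption)
nic_efficiency = 0.8        # usable fraction of line rate (assumption)

target_gbits_per_s = target_gbytes_per_s * 8
usable_per_nic = nic_gbits_per_s * nic_efficiency
clients_needed = math.ceil(target_gbits_per_s / usable_per_nic)
print(clients_needed)  # 10
```

Even under these optimistic assumptions, the load must be spread across about ten client NICs in parallel, which is the kind of striped, multi-path access that a parallel file system provides and a single NFS server does not.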

Overview of parallel file system options

This section provides a high-level overview of the options that are available in Google Cloud for parallel file systems.

Google Cloud Managed Lustre

Managed Lustre is a Google-managed service that provides high-throughput and low-latency storage for tightly coupled HPC workloads. It significantly accelerates HPC workloads and AI training and inference by providing high-throughput, low-latency access to massive datasets. For information about using Managed Lustre for AI and ML workloads, see Design storage for AI and ML workloads in Google Cloud. Managed Lustre distributes data across multiple storage nodes, which enables concurrent access by many VMs. This parallel access eliminates bottlenecks that occur with conventional file systems, and it enables workloads to rapidly ingest and process the vast amounts of data that they require.

DDN Infinia

If you need advanced AI data orchestration, you can use DDN Infinia, which is available in Google Cloud Marketplace. Infinia provides an AI-focused data intelligence solution that's optimized for inference, training, and real-time analytics. It enables ultra-fast data ingestion, metadata-rich indexing, and seamless integration with AI frameworks like TensorFlow and PyTorch.

The following are the key features of DDN Infinia:

  • High performance: Delivers sub-millisecond latency and multiple TB/s of throughput.
  • Scalability: Supports scaling from terabytes to exabytes, and can accommodate more than 100,000 GPUs and one million simultaneous clients in a single deployment.
  • Multi-tenancy with predictable quality of service (QoS): Offers secure, isolated environments for multiple tenants with predictable QoS for consistent performance across workloads.
  • Unified data access: Enables seamless integration with existing applications and workflows through built-in multi-protocol support, including for Amazon S3-compatible APIs, CSI, and Cinder.
  • Advanced security: Features built-in encryption, fault-domain-aware erasure coding, and snapshots that help to ensure data protection and compliance.

Sycomp Intelligent Data Storage Platform

Sycomp Intelligent Data Storage Platform, which is available in Google Cloud Marketplace, lets you run your high performance computing (HPC), AI and ML, and big data workloads in Google Cloud. With Sycomp Storage, you can concurrently access data from thousands of VMs, reduce costs by automatically managing tiers of storage, and run your applications on-premises or in Google Cloud. Sycomp Storage can be deployed quickly, and it supports access to your data through NFS and the IBM Storage Scale client.

Sycomp Storage is based on IBM Storage Scale, a parallel file system that helps to securely manage large volumes (PBs) of data and is well suited for HPC, AI, ML, big data, and other applications that require a POSIX-compliant shared file system. With adaptable storage capacity and performance scaling, Sycomp Storage can support small to large HPC, AI, and ML workloads.

After you deploy a cluster in Google Cloud, you decide how you want to use it. Choose whether you want to use the cluster only in the cloud or in hybrid mode by connecting to existing on-premises IBM Storage Scale clusters, third-party NFS NAS solutions, or other object-based storage solutions.

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-05-19 UTC.