Parallel file systems explained: Metadata, striping, and throughput

Parallel file systems can be confusing. They are needed in high-performance computing (HPC) to provide sufficiently high aggregate I/O bandwidth and low-latency access to shared data for thousands or tens of thousands of compute nodes (or millions of cores) working on the same problem simultaneously. Such file systems maximize use of the compute hardware and make processing applications run faster.

A parallel file system can deliver many files in parallel, or deliver many parts of one file (stripes) in parallel.

The well-established HPC parallel file systems ship individual files from storage to compute faster by pumping out file stripes in parallel, so that the HPC processors receiving them can work in parallel and complete jobs sooner.

The two variants, striping and non-striping, are quite different in their implementation.

Non-striping parallel file systems store an incoming file on a single node. There is no striping and, hence, the metadata burden is lower. The client front end operates at file-level granularity, not stripe granularity.

Striping parallel file systems have to accept an incoming file, split it into shards, and store these on (stripe them across) different storage (data server) nodes, each with its own network link. The file system therefore has to maintain a node-stripe storage map for the file, listing which stripes are on which node. For example, File A (stripe 1: node M; stripe 2: node N; stripe 3: node O; and so on). That means a lot of file striping map metadata.
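As a rough illustration of what such a striping map looks like, here is a minimal sketch. The data structure and function names are invented for this article, not any product's API, and the 100 GB stripe size is an assumption:

```python
# Hypothetical sketch of the per-file striping map a striping parallel
# file system must maintain. Names and sizes are illustrative only.
STRIPE_SIZE = 100 * 1024**3  # assume 100 GB stripes

# File A's stripes scattered across data server nodes, as in the
# File A example above.
stripe_map = {
    "FileA": [
        {"stripe": 1, "node": "M"},
        {"stripe": 2, "node": "N"},
        {"stripe": 3, "node": "O"},
    ],
}

def node_for_byte(filename: str, byte_offset: int) -> str:
    """Resolve which data server node holds a given byte of the file."""
    index = byte_offset // STRIPE_SIZE  # which stripe the byte falls in
    return stripe_map[filename][index]["node"]

print(node_for_byte("FileA", 150 * 1024**3))  # byte 150 GB is in stripe 2, on node M? No: node N
```

Every file in the system needs an entry like this, which is why the metadata burden grows with striping.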

In order for the stripes to be delivered in parallel, the requesting client front end has to know about them. It has to know, when it receives a request for the file, that the storage backend has striped the parts or shards across several nodes, so that it can send read requests to each of these nodes. In other words, it has to have access to the file’s striping map. That requires a link, and coordination, between the client front end and the backend storage system.
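Conceptually, the client's job once it holds the striping map can be sketched as issuing one read per node concurrently. This is a toy illustration, assuming an invented `fetch_stripe` stand-in for the real network read:

```python
# Illustrative sketch: a client front end that holds a file's striping
# map issues one read per data server node concurrently, then reassembles
# the stripes in order. fetch_stripe is a placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor

striping_map = {"FileA": {1: "nodeM", 2: "nodeN", 3: "nodeO"}}

def fetch_stripe(node: str, stripe_no: int) -> bytes:
    # Placeholder for an RPC/network read to the data server node.
    return f"<stripe {stripe_no} from {node}>".encode()

def read_file_parallel(name: str) -> bytes:
    stripes = striping_map[name]
    with ThreadPoolExecutor(max_workers=len(stripes)) as pool:
        # Submit all reads at once; each node streams independently.
        futures = [pool.submit(fetch_stripe, node, n)
                   for n, node in sorted(stripes.items())]
        # Reassemble stripes in file order.
        return b"".join(f.result() for f in futures)

data = read_file_parallel("FileA")
```

The point is that the reads overlap in time, so total delivery time approaches that of the slowest single stripe rather than the sum of all stripes.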

If large files are to be delivered, then parallel access at the stripe level will deliver the file faster. If smaller files are to be delivered then parallel access at the file level will work better.¹ 

High-level example

As a simplistic, high-level exercise, imagine 10 TB of file data to be delivered through a network pipe to a client system from a data server system. We’ll have 10 x 1 TB files on a non-striping parallel file system and a single 10 TB file, striped across 10 nodes, on a striping parallel file system. The 10 TB of file data should get delivered in the same time by either system.

Now let’s give the backend data server configuration 100 nodes, and have 10 x 10 TB files, 100 TB in total. The non-striped parallel file system stores them on 10 data server nodes, which can operate in parallel. The striped parallel file system stores, using 100 GB stripes, the 10 x 10 TB files across the 100 nodes, all of which can operate in parallel. 

A client then requests the full 100 TB of file data from each system.

The 100 TB is delivered from the non-striping parallel file system by 10 nodes, each shipping its 10 TB file through the network pipe. But the striping parallel file system streams 100 GB stripes from all 100 nodes in parallel through the pipe, and accomplishes the 100 TB file data delivery in a tenth of the time.
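The arithmetic behind that comparison can be made explicit. Assuming, purely for illustration, that every data server node streams at the same rate (10 GB/s here, an invented figure) and that the network pipe is not the bottleneck:

```python
# Back-of-the-envelope arithmetic for the example above. The per-node
# rate is an assumption for illustration, not a product figure.
NODE_RATE_GBPS = 10        # assumed GB/s per data server node
TOTAL_DATA_GB = 100_000    # 100 TB of file data

def delivery_time(active_nodes: int) -> float:
    """Seconds to ship all the data with this many nodes streaming in parallel."""
    return TOTAL_DATA_GB / (active_nodes * NODE_RATE_GBPS)

print(delivery_time(10))    # non-striping: only 10 nodes active -> 1000.0 s
print(delivery_time(100))   # striping: all 100 nodes active -> 100.0 s
```

With only 10 of its nodes holding files, the non-striping system takes ten times longer than the striping system, which keeps all 100 nodes busy.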

Parallel file system products

GlusterFS, Quobyte, and Qumulo operate at file-level parallelism. BeeGFS, DDN’s Lustre, Spectrum Scale, and DAOS are striping parallel file systems. WEKA and VAST Data also have striping parallel file system functionality.

Analyst Chris Evans tells us: “In the context of [pre-pNFS PowerScale/Isilon] and, of course, NetApp legacy stuff, the performance of a single node is effectively the performance of the system. Isilon and NetApp benefit from a legacy of delivering many small file operations that can be scattered across the nodes. They have always struggled with a workload that represents a smaller number of big file operations.  

“Interestingly, this is where object storage has the advantage. If you don’t need the locking capabilities of file, object gets you massive parallel performance across multiple nodes (if the implementation is right), generally because the data is widely spread across a scale-out design.”

In his view: “I think the addition of pNFS and the new parallel architectures are a defensive play by the legacy vendors to address a new workload mix that works better with object storage.”

pNFS (parallel NFS) can underpin either a striping parallel file system, using the FlexFiles layout, or a non-striping one without it. The FlexFiles layout is one key; front-end client software with access to file striping maps is the other.

FlexFile layout types

Parallel NFS (pNFS) is defined in NFS v4.1 and extended in NFS v4.2. It separates a control path (metadata) from the data path. A FlexFiles layout enables striping-level parallelism.

A Hammerspace pNFS tech brief states: “To use this [pNFS] architecture, a compatible NFS client that needs to access data first contacts the metadata server. The metadata server provides a layout, which includes byte-range and storage location information. While a client holds the layout, it may directly access the corresponding storage locations. The layout is an abstraction that saves the client from having to understand the details of the underlying storage.”

The compatible NFS client is a standard v4.2 client.

Hammerspace diagram showing pNFS with the FlexFiles layout

In effect, the metadata server (MDS) tells the client: “Here is a layout that describes exactly where the bytes of this file live and how to access them directly.” The client then reads/writes data directly to the storage nodes, bypassing the MDS for data traffic.
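That control-path/data-path split can be sketched schematically. The class and method names below are invented for illustration; real pNFS clients do this inside the kernel NFS implementation:

```python
# Schematic of the pNFS control/data path split. The MDS hands out a
# layout (control path); the client then talks to data servers directly
# (data path). All names and the layout shape are illustrative only.

class MetadataServer:
    """Control path: hands out layouts, never carries file data."""
    def get_layout(self, path: str) -> dict:
        # A layout maps byte ranges of the file to data server addresses.
        return {
            "path": path,
            "segments": [
                {"range": (0, 100), "data_server": "ds1"},
                {"range": (100, 200), "data_server": "ds2"},
            ],
        }

class Client:
    def __init__(self, mds: MetadataServer):
        self.mds = mds

    def read(self, path: str) -> list:
        layout = self.mds.get_layout(path)  # one metadata round trip
        # Data path: contact each data server directly, bypassing the
        # MDS for the actual bytes.
        return [seg["data_server"] for seg in layout["segments"]]

servers_contacted = Client(MetadataServer()).read("/data/model.bin")
print(servers_contacted)  # ['ds1', 'ds2']: the MDS is never in the data path
```

Once the layout is cached, subsequent reads and writes need no MDS involvement at all, which is what removes the metadata server as a data-path bottleneck.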

There are four official layout types in the pNFS standard, the most recent of which, Flexible Files, is defined in RFC 8435 and usable with NFS v4.2. Each layout type describes a different way of mapping file bytes to storage servers:

- Files layout (RFC 5661, NFS v4.1): file data striped across NFS v4.1 data servers
- Block volume layout (RFC 5663): the client reads and writes blocks directly on SAN volumes
- Object layout (RFC 5664): file data stored on object storage devices
- Flexible Files (FlexFiles) layout (RFC 8435): file data held on standalone NFS v3 or v4 data servers, with per-file striping and mirroring controlled by the metadata server

Project Lightning: Extreme parallelism

Dell’s Project Lightning is taking PowerScale parallelism a step further by adding a client-side driver optimized for serving data to thousands of GPUs simultaneously in an AI Factory-type configuration. It is an extension of PowerScale.

This part of Project Lightning requires, as a Dell spokesperson told us, “a proprietary client-side driver for representing the entirety of the file system as ‘local’ to the client. This requires more resource consumption on the client, and management of said proprietary driver. The benefit means extreme direct-to-drive performance to fully saturate the network–even with fully random reads. Lightning is a ground-up Parallel File System with a targeted application to satisfy the very highest performance requirements where the infrastructure team has the ability to optimize the full stack.”

Dell is opting for a proprietary client-side driver to “represent the entirety of the file system as ‘local’ to the client.” This isn’t a strict requirement of pNFS itself (which can use open-source, NFS v4.2-compliant, Linux kernel modules), but is a deliberate architectural choice by Dell.

This is because, for AI workloads such as training large models on Nvidia DGX SuperPODs, the standard client still has overhead. It must interpret FlexFile layouts, manage multiple connections – e.g. via nconnect for multi-pathing – and traverse protocol layers. This can limit throughput to ~400 Gbps per client and increase latency for random I/O. When you have thousands of GPUs, such overheads mount up.

As we understand it, Dell’s forthcoming driver will enable direct client-to-device access, skipping traditional file system traversal on both client and server sides. It provides access to the data stored on drives managed by Project Lightning. We understand this is somewhat similar to how Lustre or GPFS clients fuse a cluster of nodes into a single virtual namespace. This aggregates I/O from all cluster nodes through one NFS mount, using RDMA for near-line-rate efficiency, to offer 97 percent network saturation.

It will include custom Dell multipath optimizations, building on Linux nconnect, to dynamically balance loads and handle RDMA striping. It requires installation as a kernel module, and management to enforce these optimizations, consuming more client CPU and memory for layout caching and failover.

The benefit is consistent high-speed file access, with no central metadata bottlenecks, enabling massive I/O concurrency – between 500 and 900 GB/s per client in test runs. This is crucial for Dell’s Ethernet-based AI factories.

It will enable Dell to position the Lightning-extended PowerScale as “the world’s fastest parallel file system” for AI, and avoids the deployment complexities of full Lustre-like clients while closing the gap with rivals like WEKA or VAST.

As of November 2025, it’s in private preview for select customers/partners (e.g. Cambridge University, WWT), with general availability targeted for late 2025/early 2026.

Footnotes

  1. Dell comments that parallel file systems typically make a small file equal to one stripe, so they can deliver both small files and large files in parallel, faster. They have more metadata management overhead, i.e. distributed locking overhead.
  2. Dell says OneFS has always striped files across nodes, and hence has always been in the Flex Files Layout category (last row of the table) in the sense of the data layout. The part that has been missing from OneFS (prior to adding support for pNFS) is the support for providing clients a layout that allows them to take advantage of this in the standard NFS protocol client-server interaction.