-
I'm looking for ways to improve the performance of my data-loading pipeline and I found Zarr. To get an idea about throughput, I wrote a small benchmark script in Python. To get a baseline I also ran tests using numpy memory-mapped arrays. I'm working with 4D arrays which are quite large. One of my criteria is that I need to access them as a key-value store. From each value, I access random indices along the first axis. I created some dummy arrays to test throughput. Here is my complete benchmarking code that compares Zarr to accessing raw Numpy arrays on disk:

```python
import numpy as np
from pathlib import Path
import time
from tqdm import tqdm
import os
import random
import zarr

def write_numpy_dataset(data_path, raw_data):
    Path(data_path).mkdir(parents=True, exist_ok=True)
    for sample_name, sample_data in tqdm(raw_data.items()):
        mmap = np.memmap(
            f"{data_path}/{sample_name}.npy",
            dtype=sample_data.dtype,
            mode='w+',
            shape=sample_data.shape
        )
        mmap[:] = sample_data[:]
        mmap.flush()

def write_zarr_dataset(data_path, raw_data):
    zarr_store = zarr.DirectoryStore(f'{data_path}/data.zarr')
    zarr_group = zarr.group(zarr_store, overwrite=True)
    for data_key, sample in tqdm(raw_data.items()):
        zarr_group.create_dataset(
            data_key,
            data=sample,
            chunks=[28, 28, 28],  # I also tried sample.shape[1:] and sample.shape here
            overwrite=True,
            compressor=None  # remove compression because I want as much throughput as possible
        )

def load_samples_numpy(data_path):
    numpy_files = [f"{data_path}/{f}" for f in os.listdir(data_path)]
    while True:
        for numpy_file in numpy_files:
            index = random.randint(0, 499)
            # open a new memory map for every access. This way I will never cache samples in RAM
            volume = np.memmap(
                numpy_file,
                mode='r',
                dtype=np.float32
            ).reshape([500, -1, 28, 28, 28])
            yield volume[index]

def load_samples_zarr(data_path):
    zarr_store = zarr.DirectoryStore(f"{data_path}/data.zarr")
    zarr_group = zarr.open_group(zarr_store, mode="r")
    data_keys = [s for s in zarr_group]
    while True:
        for data_key in data_keys:
            index = random.randint(0, 499)
            yield zarr_group[data_key][index]

def benchmark(dataset_path, load_fn):
    total_size = 0
    start_time = time.time()
    print("\nStarting benchmark")
    print(f"reading from: {dataset_path}")
    sample_sums = []
    for i, sample in tqdm(enumerate(load_fn(dataset_path))):
        if i == 2000:
            break
        total_size += sample.nbytes
        sample_sums.append(sample.sum())  # do some dummy calculation to force the data into RAM
    end_time = time.time()
    throughput = (total_size / (end_time - start_time)) / (1024 * 1024)
    print(f"Throughput: {throughput:.2f} MB/s")

if __name__ == '__main__':
    data_path_numpy = "/mnt/cache/dataset_numpy"
    data_path_zarr = "/mnt/cache/dataset_zarr"
    # the second axis of my data samples is not fixed
    raw_data = {
        'sample0': np.random.normal(size=(500, 14, 28, 28, 28)).astype(np.float32),
        'sample1': np.random.normal(size=(500, 16, 28, 28, 28)).astype(np.float32),
        'sample2': np.random.normal(size=(500, 12, 28, 28, 28)).astype(np.float32),
        'sample3': np.random.normal(size=(500, 11, 28, 28, 28)).astype(np.float32),
        'sample4': np.random.normal(size=(500, 13, 28, 28, 28)).astype(np.float32),
        'sample5': np.random.normal(size=(500, 19, 28, 28, 28)).astype(np.float32),
        'sample6': np.random.normal(size=(500, 20, 28, 28, 28)).astype(np.float32),
        'sample7': np.random.normal(size=(500, 15, 28, 28, 28)).astype(np.float32),
        'sample8': np.random.normal(size=(500, 18, 28, 28, 28)).astype(np.float32),
        'sample9': np.random.normal(size=(500, 17, 28, 28, 28)).astype(np.float32),
    }
    write_numpy_dataset(data_path_numpy, raw_data)
    benchmark(data_path_numpy, load_samples_numpy)
    write_zarr_dataset(data_path_zarr, raw_data)
    benchmark(data_path_zarr, load_samples_zarr)
```

It turns out that accessing the Numpy arrays outperforms Zarr by a factor of ~6-7. This difference is so big that I feel like I must be missing something. I tried different chunk sizes and disabled compression completely. Still, Zarr never reaches a throughput that comes even close to Numpy's memory-mapped arrays. Does anyone have a hint on what I'm missing? Is Zarr generally slow for my use case of accessing large 4D arrays?
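For what it's worth, one chunk layout that aligns with this access pattern is chunk size 1 along the first axis, so each random read touches exactly one chunk file. A minimal sketch using the same zarr 2.x API as the script above (path and sizes are illustrative, not what the benchmark used):

```python
import numpy as np
import zarr

# Illustrative standalone variant: chunk one sample per first-axis index so a
# random read of a single sample maps to a single chunk file on disk.
sample = np.random.normal(size=(500, 14, 28, 28, 28)).astype(np.float32)
store = zarr.DirectoryStore("/tmp/chunk_layout_test.zarr")  # placeholder path
group = zarr.group(store, overwrite=True)
arr = group.create_dataset(
    "sample0",
    data=sample,
    chunks=(1,) + sample.shape[1:],  # (1, 14, 28, 28, 28): one chunk per sample
    overwrite=True,
    compressor=None,
)
volume = arr[123]  # reads exactly one chunk file
```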
-
It's important to distinguish between zarr the format and zarr-python the library. We are trying to make indexing faster in zarr-python 3.
-
Thank you for this quick response. Indeed, performance is my main criterion here. Previously I've used the TFRecords format, which has worked great for me so far but only allows sequential read access to my data instead of (random) index-based access, which would be really beneficial for my use case. My datasets exceed 50 TB nowadays, and since I got some new hardware to play with, data loading has become my main bottleneck. I'm currently benchmarking on a Linux system with an SSD that gives me 500 MB/s read access, but later I want to read data from an NFS share that has much higher throughput but also adds latency when opening file handles. This is why I don't think I can afford to store each sample as a single file on disk, so I group samples with common shapes into chunks of 500, as you can see in the code example above. Still, I think I'm opening far too many file handles when storing my data like this. I thought about grouping even more samples into single files and accessing those with memory mapping. Can you think of any better approach? My main criteria are:
Other stuff like prefetching and batching is an afterthought until I've figured out how to max out my throughput in a single process. I'm willing to redo my whole data pipeline if necessary, and I'm happy for any hints. So far I've tried TFRecords, HDF5, DALI, memory-mapped Numpy arrays and Zarr. Thanks again for the response.
-
I think the tensorstore tutorial is a good place to start. You would probably not want to use the format in that example (n5); instead, you can use tensorstore to read and write zarr v2 and v3 arrays (to write zarr groups you can just use …).
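A minimal sketch of reading one sample from an existing zarr v2 array with tensorstore; the path and index are placeholders, and the spec follows tensorstore's standard driver/kvstore pattern:

```python
import tensorstore as ts

# Open the array written by the benchmark above (placeholder path).
# Use driver='zarr3' instead for zarr v3 arrays.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/mnt/cache/dataset_zarr/data.zarr/sample0'},
}).result()

# Random access along the first axis; .read().result() returns a numpy array.
sample = arr[42].read().result()
print(sample.shape)
```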
-
Do you also happen to have any guidance on how to set up a Linux system to improve reading speed through caching? I'm asking because my current benchmarks seem to improve when they run multiple times, and it looks like I'm hitting some filesystem caching mechanism on Linux, even though my files exceed my available RAM. Running some new benchmarks on my system multiple times shows this:

```
$ python benchmark.py
Bytes read: 20580.00 MB
Throughput: 530.77 MB/s
$ python benchmark.py
Bytes read: 20580.00 MB
Throughput: 775.01 MB/s
$ python benchmark.py
Bytes read: 20580.00 MB
Throughput: 996.57 MB/s
```

I've been reading this StackExchange post: but I'm not sure it applies to my NFS mount with filesystem type nfs4. File handle caching could also be helpful to me. Anyway, thanks a lot for the help! We can also close this discussion since it's moving away from Zarr-related questions.
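One generic way to take the Linux page cache out of the equation between runs (a sketch, not something suggested in the thread; whether it also clears the NFS client cache depends on the mount):

```python
import subprocess

# Flush dirty pages, then ask the kernel to drop the page cache, dentries and
# inodes (requires root). Run between benchmark iterations for cold-cache numbers.
subprocess.run(["sync"], check=True)
subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)
```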
-
Unfortunately I know basically nothing about tuning local file systems for throughput -- most of the data I work with is on cloud storage, where we have different challenges :) That being said, if you come up with a good story for how to get the most out of zarr on local disk(s), I'd be interested in reading about it (and it could potentially end up in our docs).
-
So what turned out to work really well for me is saving my data as HDF5 with LZF compression and an appropriate chunk size. I group a ton of samples into single files to avoid file handle overhead. This way I can read 13.4 Gbit/s of data on my test setup, even though my bandwidth would only allow for 10 Gbit/s of raw data (the compression makes up the difference). With Zarr I can't get similar performance, and it seems to me this is only because of the on-disk layout: HDF5 puts everything into single files, while Zarr stores chunks as separate files. Maybe this is too specific to my setup, but I think there is a use case for a storage option that minimizes the number of files on disk.
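For reference, a minimal h5py sketch of the layout described here (LZF compression, one chunk per sample, many samples per file); shapes and names are illustrative, not the poster's actual code:

```python
import h5py
import numpy as np

data = np.random.normal(size=(500, 14, 28, 28, 28)).astype(np.float32)

# Write one dataset per sample group; a single .h5 file can hold many of these.
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset(
        "sample0",
        data=data,
        chunks=(1,) + data.shape[1:],  # one chunk per first-axis index
        compression="lzf",             # fast, lightweight compression
    )

# Random-access read of a single sample touches only one compressed chunk.
with h5py.File("dataset.h5", "r") as f:
    sample = f["sample0"][123]
```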
-
Zarr v3 introduces a storage feature (sharding) that can pack multiple "read chunks" inside a larger file, called a "shard", to address the problem of too many files. The upcoming …
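A rough sketch of what this can look like, assuming the `shards=` argument of `zarr.create_array` available in recent zarr-python 3 releases (path and sizes are illustrative):

```python
import numpy as np
import zarr

data = np.random.normal(size=(500, 14, 28, 28, 28)).astype(np.float32)

arr = zarr.create_array(
    store="data_sharded.zarr",       # placeholder local path
    shape=data.shape,
    dtype=data.dtype,
    chunks=(1, 14, 28, 28, 28),      # read chunk: one sample per chunk
    shards=(100, 14, 28, 28, 28),    # shard: 100 read chunks packed into one file
    compressors=None,
)
arr[:] = data

sample = arr[123]  # still reads only the bytes of one chunk inside its shard
```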
-
Sooo, 5 months later and I'm still here 😓 Since you mentioned you are using cloud storage for your zarr datasets, I wondered which kind of cloud technology you use. Is it as simple as some S3 buckets, or are you using something that is tuned for latency and throughput? Is there anything you can recommend for reading data from a network as fast as possible?
-
I tend to use S3, and reading from S3 or any other network-bound storage efficiently typically requires making a lot of concurrent requests.
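As a rough illustration of "a lot of concurrent requests" (not code from this reply), a thread pool issuing many small reads at once; the bucket URL and keys are placeholders and assume s3fs/fsspec is installed so zarr can resolve the s3:// path:

```python
import random
from concurrent.futures import ThreadPoolExecutor

import zarr

group = zarr.open_group("s3://my-bucket/data.zarr", mode="r")  # hypothetical bucket
arrays = {key: group[key] for key in group}  # open each array once
keys = list(arrays)

def fetch(_):
    # one small random read; each call becomes an independent request to S3
    key = random.choice(keys)
    index = random.randint(0, arrays[key].shape[0] - 1)
    return arrays[key][index]

# Keep many requests in flight; tune max_workers to the storage backend.
with ThreadPoolExecutor(max_workers=64) as pool:
    batch = list(pool.map(fetch, range(1024)))
```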
-
I see. Do you host your own S3, or do you use some publicly available cloud? What kind of throughput can you expect if you store huge datasets and access small chunks like 0.5 MB per sample? I have a storage system attached that could theoretically run S3. To feed my GPUs I'd need to reach something like 4 GB/s while accessing only small chunks (about 500 KB per sample, with a batch size of 1024). This means I would have to reach a throughput of roughly 8,000 samples per second. The complete dataset would be above 20 TB if uncompressed. I guess I would have to run thousands of requests in parallel? Just trying to avoid another dead end here, because I've wasted too much time on other approaches.
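A quick sanity check of that figure, using the numbers in this post (4 GB/s target, ~500 KB per sample):

```python
# required request rate = target throughput / per-sample read size
print((4 * 1024**3) / (500 * 1024))  # ≈ 8389 samples per second, i.e. roughly 8,000
```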
-
If you already have a fast hard drive on your system, it's very unlikely that you would ever get better performance by putting an additional S3 layer on top of it; that would only introduce overhead. We should return to your original problem and understand why your throughput is not as high as expected.
-
I have many of those datasets, and I already mount my storage as NFS because I don't have local hard drives of the required size on my system. Basically I'm killed by random access: I can read my dataset sequentially and reach my desired throughput, but it breaks down once I do random access into those datasets. HDF5 becomes really slow with random indexing. I also tried saving those datasets as raw numpy arrays on disk and memory-mapping them, but even if I access them with many processes/threads at once I can't reach more than 1 GB/s with random access. I can give more details on my previous approaches if you think there's still something to be gained. I'm totally open to the idea that I'm overlooking something obvious.
-
I just tried running the original example on my MacBook using Zarr 3.0.5. I made a few changes.
In this example, Zarr was about 4x slower than memmapped numpy. Memmapping is very efficient, so I'm not super surprised by this. Interestingly, when I ran the exact same code from a Jupyter notebook in VS Code, the timing was almost identical! I wonder if there is some clue here.
-
Alright, digging deeper again. The folder I'm using is located on an NFS share that is mounted on my Linux system and connected with 100 Gbit/s Ethernet.

```python
import random
import time

import numpy as np
from tqdm import tqdm

def load_samples_numpy(data_path, dataset_len):
    while True:
        index = random.randint(0, dataset_len - 1)
        volume = np.memmap(
            data_path,
            mode='r',
            dtype=np.float32,
            shape=(dataset_len, 10, 28, 28, 28)
        )
        yield volume[index]

def benchmark(dataset_path, dataset_len):
    total_size = 0
    start_time = time.time()
    print("\nStarting benchmark")
    print(f"reading from: {dataset_path}")
    sample_sums = []
    loader = load_samples_numpy(dataset_path, dataset_len)
    for i, sample in tqdm(enumerate(loader)):
        if i == 5000:
            break
        total_size += sample.nbytes
        sample_sums.append(sample.sum())
    end_time = time.time()
    throughput = (total_size / (end_time - start_time)) / (1024 * 1024)
    print(f"Throughput: {throughput:.2f} MB/s")

dataset_path_numpy = '...'
dataset_len = 100
benchmark(dataset_path_numpy, dataset_len)
```

```
5000it [00:02, 2106.75it/s]
Throughput: 1758.60 MB/s
```

Looks good so far; however, this dataset is much bigger than 100 samples.

```python
dataset_path_numpy = '...'
dataset_len = 1800000
benchmark(dataset_path_numpy, dataset_len)
```

```
5000it [00:31, 157.78it/s]
Throughput: 132.09 MB/s
```

Seems like using the actual file size is killing my performance.

```python
dataset_path_numpy = '...'
dataset_len = 10000
benchmark(dataset_path_numpy, dataset_len)
```

```
5000it [00:10, 470.81it/s]
Throughput: 393.96 MB/s
```

Still quite slow. From playing around, it seems like I should split my files into something like ~500 samples per file; otherwise I lose too much throughput. However, I'm actually using float16 instead of float32, which hurts my throughput even more (fewer bytes per read). Also, this is only a single dataset. I will need to use at least 15 of those, and I need to sample from them with a predefined probability. I'll easily reach 20,000 files if I only save 500 samples per file.
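One hypothetical workaround (not something tried in this post) is to avoid mapping the whole multi-terabyte file and instead seek to the sample's byte offset and read only that record. The per-sample shape below matches the benchmark above; the path is a placeholder:

```python
import numpy as np

SAMPLE_SHAPE = (10, 28, 28, 28)  # per-sample shape, as in the benchmark above
SAMPLE_BYTES = int(np.prod(SAMPLE_SHAPE)) * np.dtype(np.float32).itemsize

def read_sample(data_path, index):
    # Read exactly one record: seek to its offset and pull SAMPLE_BYTES bytes,
    # instead of memory-mapping the entire file.
    with open(data_path, 'rb') as f:
        f.seek(index * SAMPLE_BYTES)
        buf = f.read(SAMPLE_BYTES)
    return np.frombuffer(buf, dtype=np.float32).reshape(SAMPLE_SHAPE)
```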
-
I dug into this example a bit more with the goal of understanding the impact of memmap. I wrote a very naive flat binary storage layer which uses regular filesystem calls:

```python
import numpy as np
from pathlib import Path
import time
from tqdm import tqdm
import os
import random
import zarr
import tempfile
import zarr.codecs

def write_numpy_dataset(data_path, raw_data):
    Path(data_path).mkdir(parents=True, exist_ok=True)
    for sample_name, sample_data in tqdm(raw_data.items()):
        mmap = np.memmap(
            f"{data_path}/{sample_name}.npy",
            dtype=sample_data.dtype,
            mode='w+',
            shape=sample_data.shape
        )
        mmap[:] = sample_data[:]
        mmap.flush()

def write_raw_dataset(data_path, raw_data):
    for sample_name, sample_data in tqdm(raw_data.items()):
        Path(f"{data_path}/{sample_name}").mkdir(parents=True, exist_ok=True)
        for n in range(sample_data.shape[0]):
            raw = sample_data[n].tobytes()
            with open(f"{data_path}/{sample_name}/{n:02d}.raw", 'wb') as f:
                f.write(raw)

def write_zarr_dataset(data_path, raw_data):
    zarr_store = f'{data_path}/data.zarr'
    zarr_group = zarr.group(zarr_store, zarr_format=3, overwrite=True)
    for data_key, sample in tqdm(raw_data.items()):
        a = zarr_group.create_array(
            data_key,
            dtype=sample.dtype,
            shape=sample.shape,
            chunks=(1,) + tuple(sample.shape[1:]),
            compressors=None,
        )
        a[:] = sample

def load_samples_numpy(data_path):
    numpy_files = [f"{data_path}/{f}" for f in os.listdir(data_path)]
    while True:
        for numpy_file in numpy_files:
            index = random.randint(0, 99)
            # open a new memory map for every access. This way I will never cache samples in RAM
            volume = np.memmap(
                numpy_file,
                mode='r',
                dtype=np.float32
            ).reshape([100, -1, 28, 28, 28])
            yield volume[index]

def load_samples_raw(data_path):
    sample_names = os.listdir(data_path)
    while True:
        for sample_name in sample_names:
            index = random.randint(0, 99)
            fname = f"{data_path}/{sample_name}/{index:02d}.raw"
            with open(fname, 'rb') as f:
                data = np.frombuffer(f.read(), dtype=np.float32)
            data = data.reshape([-1, 28, 28, 28])
            yield data

def load_samples_zarr(data_path):
    zarr_store = f"{data_path}/data.zarr"
    zarr_group = zarr.open_group(zarr_store, mode="r")
    data_keys = [s for s in zarr_group]
    while True:
        for data_key in data_keys:
            index = random.randint(0, 99)
            yield zarr_group[data_key][index]

def benchmark(dataset_path, load_fn):
    total_size = 0
    start_time = time.time()
    print("\nStarting benchmark")
    print(f"reading from: {dataset_path}")
    sample_sums = []
    for i, sample in tqdm(enumerate(load_fn(dataset_path))):
        if i == 2000:
            break
        total_size += sample.nbytes
        # assert sample.shape == (1, 1, 28, 28, 28), f"Unexpected shape: {sample.shape}"
        sample_sums.append(sample.sum())  # do some dummy calculation to force the data into RAM
    end_time = time.time()
    throughput = (total_size / (end_time - start_time)) / (1024 * 1024)
    print(f"Throughput: {throughput:.2f} MB/s")

def main():
    data_path_numpy = tempfile.TemporaryDirectory()
    data_path_raw = tempfile.TemporaryDirectory()
    data_path_zarr = tempfile.TemporaryDirectory()
    raw_data = {
        'sample0': np.random.normal(size=(100, 14, 28, 28, 28)).astype(np.float32),
        'sample1': np.random.normal(size=(100, 16, 28, 28, 28)).astype(np.float32),
        'sample2': np.random.normal(size=(100, 12, 28, 28, 28)).astype(np.float32),
        'sample3': np.random.normal(size=(100, 11, 28, 28, 28)).astype(np.float32),
        'sample4': np.random.normal(size=(100, 13, 28, 28, 28)).astype(np.float32),
        'sample5': np.random.normal(size=(100, 19, 28, 28, 28)).astype(np.float32),
        'sample6': np.random.normal(size=(100, 20, 28, 28, 28)).astype(np.float32),
        'sample7': np.random.normal(size=(100, 15, 28, 28, 28)).astype(np.float32),
        'sample8': np.random.normal(size=(100, 18, 28, 28, 28)).astype(np.float32),
        'sample9': np.random.normal(size=(100, 17, 28, 28, 28)).astype(np.float32),
    }
    write_numpy_dataset(data_path_numpy.name, raw_data)
    write_raw_dataset(data_path_raw.name, raw_data)
    write_zarr_dataset(data_path_zarr.name, raw_data)
    benchmark(data_path_numpy.name, load_samples_numpy)
    benchmark(data_path_raw.name, load_samples_raw)
    benchmark(data_path_zarr.name, load_samples_zarr)

if __name__ == "__main__":
    main()
```

I feel that Zarr should be able to at least match the "Numpy raw" performance. I wonder if this is related to the unnecessary memory copies issue described in #2904.