.zarr.zip on s3 #1613

jeffpeck10x started this conversation in General
Dec 18, 2023 · 7 comments · 6 replies

I searched and could not find an example of accessing a .zarr.zip from an s3 endpoint without having to first download it entirely. The provided zarr.storage.ZipStore only works on a local path (right?).

I experimented and found that this works:

```python
import s3fs
import zarr
from fsspec import FSMap
from fsspec.implementations.zip import ZipFileSystem


def get_s3_zarr_zip(s3_path):
    s3 = s3fs.S3FileSystem()
    f = s3.open(s3_path)
    fs = ZipFileSystem(f, mode="r")
    store = FSMap("", fs, check=False)
    # optionally use an LRU cache, or otherwise return zarr.group(store=store)
    cache = zarr.storage.LRUStoreCache(store, max_size=2**28)
    return zarr.group(store=cache)
```

I am wondering if this is alright. Is there anything that could be improved with this approach?

My use-case is read-only. I understand that this approach would not be able to handle updates without updating the entire.zarr.zip.

And more generally, if someone else is searching for a solution like this, I hope this helps!

---

This is great @jeffpeck10x! Would you be willing to add this to the tutorial?

I wonder whether the built-in ZipStore has any use for reading from S3. Or is fsspec's ZipFileSystem always required here?

@jeffpeck10x

I would be very happy to contribute this @rabernat. Please point me to where it would best fit in the tutorial and I will submit a PR.

I think what is novel here is that the ZipFileSystem will handle fetching parts of the zip file without pulling down the whole thing. And s3fs will translate those fetches into ranged requests.
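For illustration, a minimal sketch of that behavior (the bucket, key, and chunk names are hypothetical):

```python
import s3fs
from fsspec.implementations.zip import ZipFileSystem

s3 = s3fs.S3FileSystem(anon=True)
f = s3.open("my-bucket/data.zarr.zip")  # lazy file-like object; nothing downloaded yet
zfs = ZipFileSystem(f, mode="r")
# Listing reads only the zip central directory via ranged GETs;
# reading a member fetches only that member's byte range.
print(zfs.ls("/"))
chunk_bytes = zfs.cat_file("myarray/0.0")  # hypothetical chunk key
```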

My understanding is that the caveat to this formally being part of ZipStore is that it must be read-only. (Actually, I am not sure what would happen if I tried to pass mode="w" and tried to make an update...)

An alternative might be to update the FSStore to look for the .zip extension? 🤷

Either way, since this works in only a few lines of code, it perhaps does not need its own storage class, as long as it is documented.

@rabernat

> Please point me to where it would best fit in the tutorial and I will submit a PR.

https://github.com/zarr-developers/zarr-python/blob/main/docs/tutorial.rst

There is already a section addressing Zip stores here - https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives - but only covering file storage, not object storage.

Zarr's built-in ZipStore class does support write mode. However, that would never work on object storage, because object storage doesn't support partial writes to a single object. Instead, the workflow for creating a zip store in the cloud seems to be:

  • Create the data on a local disk
  • Zip it up
  • Upload to object storage
  • Access using the method you described above

It would be nice to address this full life cycle in the documentation.
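A hedged sketch of that life cycle with the zarr-python 2.x API (the bucket and file names are hypothetical):

```python
import numpy as np
import s3fs
import zarr

# Steps 1-2: create the data directly into a local zip
store = zarr.storage.ZipStore("data.zarr.zip", mode="w")
root = zarr.group(store=store)
root.create_dataset("x", data=np.arange(100), chunks=(10,))
store.close()  # finalizes the zip central directory

# Step 3: upload to object storage
s3 = s3fs.S3FileSystem()
s3.put("data.zarr.zip", "my-bucket/data.zarr.zip")

# Step 4: read back with ranged requests, e.g. using get_s3_zarr_zip()
# from the original post above
```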

---

An additional option would be to use fsspec's handler chaining:

```python
store = zarr.storage.FSStore(
    url=f"zip::file://{filepath}",
    mode="r",
)
```

With thanks to @caviere (see https://github.com/zarr-developers/outreachy_2022_testing_zipstore/blob/main/real%20%20world%20data/main.py#L54).
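Presumably the same chaining works over object storage, though I have not verified it (the bucket and key are hypothetical):

```python
import zarr

# zip:: chained over s3://; fsspec resolves the chain to a ZipFileSystem
# backed by ranged S3 reads
store = zarr.storage.FSStore(url="zip::s3://my-bucket/data.zarr.zip", mode="r")
root = zarr.group(store=store)
```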

Regardless, 👍 for your version and/or this one being in the tutorial location that would have helped you find them! Thanks.

---

Made a small PR to update the docs: #1615

---

With Zarr v3, I used the following code to create a read-only Zarr store from a zipped Zarr in S3:

```python
import zarr
import s3fs


class S3ZipStore(zarr.storage.ZipStore):
    def __init__(self, path: s3fs.S3File) -> None:
        super().__init__(path="", mode="r")
        self.path = path


s3 = s3fs.S3FileSystem(anon=True, endpoint_url=S3_ENDPOINT, asynchronous=False)
file = s3.open(f"s3://{S3_BUCKET}/{ZIP_PATH}")
zarr_store = S3ZipStore(file)
```

It relies on the fact that ZipStore only checks that path is a string or a Path in __init__ and stores it in self.path for later use with zipfile.ZipFile(), which also accepts the file-like s3fs.S3File. I'm a bit worried about asynchronous=False perhaps negatively affecting performance; with a non-zipped Zarr in S3 I can use asynchronous=True.

Edit: A load time benchmark shows that a ZipStore created from a local zip file suffers a similar performance hit compared to LocalStore:

Random Sentinel 2 patch time series load time benchmark (5100 m x 5100 m, 1 year):

| Store | Mean load time (s) |
| --- | ---: |
| S3 Zarr | 7.53 |
| S3 zipped Zarr | 23.9 |
| NVMe Zarr | 1.12 |
| NVMe zipped Zarr | 3.21 |
@joshmoore

cc: @jwindhager

@remithiebaut

Thank you for this solution; I had exactly the same issue.
I had some trouble initializing the ZipStore with a dummy path, so it worked for me to rewrite the entire __init__ without the assertion:

```python
import zipfile
from pathlib import Path

from s3fs import S3File
from zarr.abc.store import Store
from zarr.storage._zip import ZipStoreAccessModeLiteral, ZipStore


class S3ZipStore(ZipStore):
    """
    Custom ZipStore that accepts an s3fs.S3File object for zipfile-based zarr storage.
    """

    def __init__(
        self,
        path: Path | str | S3File,
        *,
        mode: ZipStoreAccessModeLiteral = "r",
        read_only: bool | None = None,
        compression: int = zipfile.ZIP_STORED,
        allowZip64: bool = True,
    ) -> None:
        if read_only is None:
            read_only = mode == "r"
        Store.__init__(self, read_only=read_only)
        if isinstance(path, str):
            path = Path(path)
        self.path = path  # root?
        self._zmode = mode
        self.compression = compression
        self.allowZip64 = allowZip64
```
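A usage sketch under the same assumptions as the parent comment (the endpoint, bucket, and key are hypothetical):

```python
import s3fs
import zarr

s3 = s3fs.S3FileSystem(anon=True, endpoint_url="https://example-endpoint")
f = s3.open("my-bucket/data.zarr.zip")  # file-like object passed straight through
store = S3ZipStore(f, mode="r")
group = zarr.open_group(store, mode="r")
```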
---

Here's a rudimentary async fsspec read-only file system for uncompressed ("store" compression) zip files that minimizes the number of reads (initially 2 reads, then 1 read per file). With this async file system, reading from a zipped Zarr v3 in S3 or local storage is no longer slower than reading from a non-zipped Zarr. I think it could be made even faster by combining reads for files that are consecutive in the zip, but the FsspecStore doesn't receive any _get_many() calls, probably because of this issue.

The code is also in our benchmark repo.

Random Sentinel 2 patch time series load time benchmark (5100 m x 5100 m, 1 year):

| Store | Mean load time (s) |
| --- | ---: |
| S3 Zarr | 7.72 |
| S3 zipped Zarr (async) | 7.55 |
| NVMe Zarr | 1.11 |
| NVMe zipped Zarr (async) | 1.33 |

Usage:

```python
# Async S3 zipped Zarr
import s3fs

s3 = s3fs.S3FileSystem(anon=True, endpoint_url=S3_ENDPOINT, asynchronous=True)
zipfs = ReadOnlyZipFileSystem(s3, f"{S3_BUCKET}/{ZIP_PATH}")
zarr_store = zarr.storage.FsspecStore(fs=zipfs, read_only=True, path="")

# Async local zipped Zarr
from fsspec.implementations.local import LocalFileSystem
from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper

local_fs = LocalFileSystem()
async_local_fs = AsyncFileSystemWrapper(local_fs)
zipfs = ReadOnlyZipFileSystem(async_local_fs, ZIP_PATH)
zarr_store = zarr.storage.FsspecStore(fs=zipfs, read_only=True, path="")
```
Code
```python
from fsspec.asyn import AsyncFileSystem
import asyncio
import posixpath
from typing import List, Optional
import struct


class ReadOnlyZipFileSystem(AsyncFileSystem):
    """An async read-only file system for uncompressed zip files using fsspec.

    Mounts an uncompressed zip file as a read-only file system. Supports ZIP64.
    Only supports ZIP_STORED (uncompressed). Multi-disk zip files are not
    supported. Written for reading a zipped Zarr from S3 storage and therefore
    only includes methods used by Zarr.

    Reads 64KB from the end of the zip file to capture the ZIP64 end of central
    directory (EOCD), the ZIP64 EOCD locator, and the standard EOCD, using
    _cat_file with negative start offset and therefore not needing to query the
    file size first. Reads the CD using _cat_file with positive start offset.
    Parses the CD for the file names and the header offsets. Requires that the
    CD contains an entry of each directory (except for the autogenerated root
    dir) before the entries of their files and subdirectories. Requires that
    the CD entries appear in the same order as the file headers and the files.
    Assumes that there are no gaps between file headers and files. This way the
    file headers need not be read, because the file offset will be the offset
    of the next file header (or CD for the last file) minus the file size. File
    datetimes are not available.

    While not currently supported, support for compressed zip files could be
    implemented by reading the file headers (ending the read at the next header
    or CD) and by decompression.
    """

    protocol = "zipfs"
    MAX_ZIP_TAIL_READ = 64 * 1024

    def __init__(self, fs: AsyncFileSystem, path: str, **kwargs):
        """Initialize the ReadOnlyZipFileSystem.

        Args:
            fs: The underlying AsyncFileSystem containing the zip file.
            path: Path to the zip file in the underlying file system.
            **kwargs: Additional arguments passed to AsyncFileSystem.
        """
        super().__init__(**kwargs)
        self.asynchronous = True
        self.fs = fs
        self.path = path
        self._files = None
        self._lock = asyncio.Lock()

    async def _initialize(self):
        """Initialize self._files by reading and parsing the central directory.

        All other methods that require self._files first await this method.
        They never modify self._files. Locking is used to ensure that the
        initialization is thread-safe. The other methods need not lock
        explicitly.
        """
        async with self._lock:
            if self._files is not None:
                return
            # Read tail of file (up to MAX_ZIP_TAIL_READ) from the end
            data = await self.fs._cat_file(
                self.path, start=-self.MAX_ZIP_TAIL_READ, end=None
            )
            if len(data) < 22:  # Minimum size for standard EOCD
                raise ValueError(f"EOCD doesn't fit in {self.path}: {len(data)} bytes")
            # EOCD variables and their lengths (order matters). These dicts use
            # the wording of https://pkwaredownloads.blob.core.windows.net/pkware-general/Documentation/APPNOTE-6.3.9.TXT
            eocd_var_lengths = {
                'end of central dir signature': 4,  # Signature (0x06054b50)
                'number of this disk': 2,
                'number of the disk with the start of the central directory': 2,
                'total number of entries in the central directory on this disk': 2,
                'total number of entries in the central directory': 2,
                'size of the central directory': 4,
                'offset of start of central directory with respect to the starting disk number': 4,
                '.ZIP file comment length': 2,
            }
            # ZIP64 EOCD variables and their lengths (order matters)
            zip64_eocd_var_lengths = {
                'zip64 end of central dir signature': 4,
                'size of zip64 end of central directory record': 8,
                'version made by': 2,
                'version needed to extract': 2,
                'number of this disk': 4,
                'number of the disk with the start of the central directory': 4,
                'total number of entries in the central directory on this disk': 8,
                'total number of entries in the central directory': 8,
                'size of the central directory': 8,
                'offset of start of central directory with respect to the starting disk number': 8,
            }
            # Map lengths to struct formats
            var_length_to_format = {
                2: 'H',  # Unsigned short (2 bytes)
                4: 'L',  # Unsigned long (4 bytes)
                8: 'Q',  # Unsigned long long (8 bytes)
            }
            # Parse EOCD
            eocd = {}
            is_zip64 = False
            # 20 bytes for EOCD, 4 bytes for EOCD signature
            eocd_pos = data[:-20 + 4].rfind(b'\x50\x4b\x05\x06')
            if eocd_pos == -1:
                raise ValueError(
                    f"No EOCD in the last {self.MAX_ZIP_TAIL_READ} bytes of {self.path}"
                )
            pos = eocd_pos
            for var, length in eocd_var_lengths.items():
                eocd[var] = struct.unpack_from(
                    f'<{var_length_to_format[length]}', data, pos
                )[0]
                # Check for ZIP64 markers 0xFFFF or 0xFFFFFFFF
                if eocd[var] == 2 ** (length * 8) - 1:
                    eocd[var] = None
                    is_zip64 = True
                pos += length
            if is_zip64:
                # 56 bytes for ZIP64 EOCD + 20 bytes for locator
                if len(data) - 22 < 56 + 20:
                    raise ValueError(
                        f"ZIP64 EOCD and ZIP64 EOCD locator do not fit in {self.path}"
                    )
                # Find ZIP64 EOCD (20 bytes for EOCD, 56 bytes for ZIP64 EOCD,
                # 4 bytes for ZIP64 EOCD signature)
                zip64_eocd_pos = data[:eocd_pos - 56 + 4].rfind(b'\x50\x4b\x06\x06')
                if zip64_eocd_pos == -1:
                    raise ValueError(
                        f"No ZIP64 EOCD in the last {self.MAX_ZIP_TAIL_READ} bytes of {self.path}"
                    )
                pos = zip64_eocd_pos
                for var, length in zip64_eocd_var_lengths.items():
                    eocd[var] = struct.unpack_from(
                        f'<{var_length_to_format[length]}', data, pos
                    )[0]
                    pos += length
            # Require single-disk zip
            if (
                eocd['number of this disk'] != 0
                or eocd['number of the disk with the start of the central directory'] != 0
                or eocd['total number of entries in the central directory on this disk']
                != eocd['total number of entries in the central directory']
            ):
                raise ValueError(f"Unsupported multi-disk central directory in {self.path}")
            # Convenience variables
            cd_size = eocd['size of the central directory']
            cd_offset = eocd['offset of start of central directory with respect to the starting disk number']
            cd_entries = eocd['total number of entries in the central directory']
            # Read and parse central directory
            if cd_size == 0:
                # No central directory, empty zip file
                return
            cd_data = await self.fs._cat_file(
                self.path, start=cd_offset, end=cd_offset + cd_size
            )
            if len(cd_data) != cd_size:
                raise ValueError(
                    f"Failed to read central directory: expected {cd_size} bytes, got {len(cd_data)}"
                )
            # Central directory file header variables and their lengths
            cd_file_header_var_lengths = {
                'central file header signature': 4,
                'version made by': 2,
                'version needed to extract': 2,
                'general purpose bit flag': 2,
                'compression method': 2,
                'last mod file time': 2,
                'last mod file date': 2,
                'crc-32': 4,
                'compressed size': 4,
                'uncompressed size': 4,
                'file name length': 2,
                'extra field length': 2,
                'file comment length': 2,
                'disk number start': 2,
                'internal file attributes': 2,
                'external file attributes': 4,
                'relative offset of local header': 4,
            }
            # Central or local file header ZIP64 extended information extra
            # field variables and their lengths. These share names with the
            # standard fields but have different lengths.
            zip64_file_header_var_lengths = {
                'uncompressed size': 8,
                'compressed size': 8,
                'relative offset of local header': 8,
                'disk number start': 4,
            }
            # Autocreate root dir
            self._files = {'': {'children': []}}
            # Parse central directory entries
            pos = 0
            previous_file = None
            for file_index in range(cd_entries):
                # 46 bytes for the central directory file header
                if pos + 46 > len(cd_data):
                    raise ValueError(f"Truncated central directory entry in {self.path}")
                cd_file_header = {}
                for var, length in cd_file_header_var_lengths.items():
                    cd_file_header[var] = struct.unpack_from(
                        f'<{var_length_to_format[length]}', cd_data, pos
                    )[0]
                    # Check for ZIP64 markers 0xFFFF or 0xFFFFFFFF
                    if cd_file_header[var] == 2 ** (length * 8) - 1:
                        cd_file_header[var] = None
                    pos += length
                if cd_file_header['central file header signature'] != 0x02014b50:
                    raise ValueError(
                        f"Invalid central directory header signature in {self.path}"
                    )
                if (
                    cd_file_header['compression method'] != 0
                    or cd_file_header['compressed size'] != cd_file_header['uncompressed size']
                ):
                    raise ValueError(f"File in {self.path} is not stored (uncompressed)")
                utf8 = cd_file_header['general purpose bit flag'] & 0x800 != 0  # Bit 11
                # Convenience variables
                fname_len = cd_file_header['file name length']
                extra_len = cd_file_header['extra field length']
                comment_len = cd_file_header['file comment length']
                # Read filename
                if pos + fname_len > len(cd_data):
                    raise ValueError(f"Truncated filename in {self.path}")
                if utf8:
                    fname = cd_data[pos:pos + fname_len].decode('utf-8')
                else:
                    fname = cd_data[pos:pos + fname_len].decode('ascii')
                pos += fname_len
                # Parse extra field
                extra_end = pos + extra_len
                if extra_end > len(cd_data):
                    raise ValueError(f"Truncated extra field in {self.path}")
                while pos < extra_end:
                    if pos + 4 > extra_end:
                        raise ValueError(f"Truncated extra field in {self.path}")
                    tag, size = struct.unpack_from('<HH', cd_data, pos)
                    pos += 4
                    if pos + size > extra_end:
                        raise ValueError(f"Truncated extra field in {self.path}")
                    backup_pos = pos
                    if tag == 0x0001:  # ZIP64 extended information extra field
                        for var, length in zip64_file_header_var_lengths.items():
                            # Only parse variables that were marked as ZIP64
                            if cd_file_header[var] is None:
                                cd_file_header[var] = struct.unpack_from(
                                    f'<{var_length_to_format[length]}', cd_data, pos
                                )[0]
                                pos += length
                    else:
                        if tag == 0x5455:
                            pass  # Extended Timestamp, not handled
                        elif tag == 0x7875:
                            pass  # Info-ZIP New Unix Extra Field, skip it
                        else:
                            pass  # Unknown extra field, skip it
                        pos += size
                    if backup_pos + size != pos:
                        raise ValueError(f"Invalid extra field size in {self.path}")
                    pos = backup_pos + size
                # Convenience variables
                size = cd_file_header['compressed size']
                offset = cd_file_header['relative offset of local header']
                # Skip comment
                pos += comment_len
                # Detect directory
                is_dir = fname.endswith('/')
                if is_dir:
                    # Remove trailing slash
                    fname = fname[:-1]
                # Store file info
                if is_dir:
                    # Dirs can be recognized by the key 'children' existing
                    file = {'children': []}
                else:
                    # It's a file
                    file = {'size': size, 'offset': offset}
                self._files[fname] = file
                # Add file as the parent directory's child
                if fname != '':
                    parent = posixpath.dirname(fname)
                    if parent not in self._files:
                        raise NotImplementedError(
                            f"Autocreation of parent folder {parent} not implemented, in {self.path}"
                        )
                    self._files[parent]['children'].append(fname)
                # Calculate offset of previous file
                if previous_file is not None and 'children' not in previous_file:
                    # We require the files to be stored in the order of the central
                    # directory entries, although sorting would be an option
                    if offset <= previous_file['offset']:
                        raise ValueError(
                            f"Non-ascending order of local header offsets in {self.path}"
                        )
                    # Previous file's offset = current file's offset - previous file's size.
                    # This should work unless there is empty space in the zip file,
                    # which is unlikely.
                    previous_file['offset'] = offset - previous_file['size']
                previous_file = file
            # Calculate offset of the last file
            if previous_file is not None and 'children' not in previous_file:
                if cd_offset <= previous_file['offset']:
                    raise ValueError(
                        f"Non-ascending order of local header offsets in {self.path}"
                    )
                # Last file ends where the central directory begins
                previous_file['offset'] = cd_offset - previous_file['size']

    async def _ls(self, path: str, detail: bool = True, **kwargs) -> List:
        """List files and directories in the given path.

        If the path points to a file, list just the file.
        """
        # Always await self._initialize() in functions needing self._files
        await self._initialize()
        # Internally we don't use a root slash, so strip it. Also strip any
        # trailing slash.
        path = posixpath.normpath(path).lstrip('/').rstrip('/')

        # Helper function to get file name or details
        def get_file_listing(file, fname, detail):
            if detail:
                if 'children' in file:
                    return {
                        'name': f'/{fname}',
                        'type': 'directory',
                        'size': 0,
                        'created': None,
                        'islink': False,
                    }
                else:
                    return {
                        'name': f'/{fname}',
                        'type': 'file',
                        'size': file['size'],
                        'created': None,
                        'islink': False,
                    }
            else:
                return f'/{fname}'

        # List children
        results = []
        if path not in self._files:
            raise FileNotFoundError(f"Path {path} not found")
        file = self._files[path]
        if 'children' in file:
            # Path points to a dir
            for child in file['children']:
                results.append(get_file_listing(self._files[child], child, detail))
        else:
            # Path points to a file
            results = [get_file_listing(file, path, detail)]
        return results

    async def _cat_file(
        self,
        path: str,
        start: Optional[int] = None,
        end: Optional[int] = None,
        **kwargs,
    ) -> bytes:
        """Read the contents of a file in the zip."""
        # Always await self._initialize() in functions needing self._files
        await self._initialize()
        # Internally we don't use a root slash, so strip it. Also strip any
        # trailing slash.
        path = posixpath.normpath(path).lstrip('/').rstrip('/')
        # Check if the file is available
        if path not in self._files:
            raise FileNotFoundError(f"File {path} not found")
        elif 'children' in self._files[path]:
            raise FileNotFoundError(f"{path} is a directory")
        # Get offset and size of the file in the zip file
        info = self._files[path]
        offset = info['offset']
        size = info['size']
        # Set start to beginning of file if not specified
        start = start or 0
        # Convert negative start (relative to end of file) to positive start
        if start < 0:
            # Clamp a too-large negative start to the beginning of the file
            start = max(0, size + start)
        # Set end to end of file if not specified
        end = end or size
        # Convert negative end (relative to end of file) to positive end
        if end < 0:
            # Clamp a too-large negative end to the beginning of the file
            end = max(0, size + end)
        # For start beyond the end of the file or beyond end, return empty data
        if start >= size or end <= start:
            return b''
        # Calculate zip file read start and read size
        read_start = offset + start
        read_size = min(end, size) - start  # Clamp a too-large end at size
        # Read data
        data = await self.fs._cat_file(
            self.path, start=read_start, end=read_start + read_size
        )
        return data
```
---

Hi,

My two cents on this issue: I had a similar problem accessing a zipped Zarr over HTTP byte-range requests.

I ended up with the following code:

```python
from fsspec.implementations.http import HTTPFileSystem
from xarray import open_datatree
from zarr.storage import ZipStore


class HttpZipStore(ZipStore):
    def __init__(self, path) -> None:
        super().__init__(path="", mode="r")
        self.path = path


def _load_zip_zarr(**kwargs):
    fs = HTTPFileSystem(asynchronous=False, block_size=10000)
    zipfile = fs.open(LOCAL_ZARR_ZIP)  # URL of the zipped Zarr
    store = HttpZipStore(zipfile)
    return open_datatree(store, engine="zarr", **kwargs)
```

I think this is an important use case to support:

  • Zarr files are hard to handle / move around if not zipped.
  • It's easy to drop a zipped Zarr on some cloud services / Zenodo repository.
  • Most modern servers support byte-range requests.

As a user, I should not have to deal with custom store setup. I would expect something like this to work out of the box:

```python
open_datatree("https://myserver/dataset.zarr.zip")
```

Or at least, there should be an example or code snippet for this use case in the documentation.

Thanks for your work and this great project!

@d-v-b

Thanks for this example! I agree that this kind of thing would be great to include in the docs and/or in the zarr repo as a stand-alone piece of code.

---

I am curious how many byte-range requests reading a zip file over HTTP or S3 requires versus a Zarr array with a single shard. From the above, it seems like making large requests and making some assumptions can speed up loading from a zip file. How often do those assumptions hold, and what do the common and worst-case scenarios look like?
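One way to measure this would be to count the calls into the underlying filesystem. A rough sketch (it hooks the fsspec-internal _cat_file method, so this is an assumption about stable internals, for benchmarking only):

```python
import s3fs


class CountingS3FileSystem(s3fs.S3FileSystem):
    """s3fs filesystem that counts the ranged reads it issues."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read_count = 0

    async def _cat_file(self, path, start=None, end=None, **kwargs):
        self.read_count += 1
        return await super()._cat_file(path, start=start, end=end, **kwargs)


# Use an instance wherever s3fs.S3FileSystem appears in the snippets above,
# then inspect fs.read_count after a load to compare zipped vs. sharded access.
```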

@d-v-b

Aug 9, 2025 · Maintainer

> How often do those assumptions hold, and what do the common and worst-case scenarios look like?

I think it would be extremely valuable for someone to do these experiments and recommend ways to improve zarr-python's support for single-file archives. As this discussion indicates, zarr-python is pretty limited today and has a lot of room for improvement here.

Category: General
8 participants: @jeffpeck10x, @joshmoore, @rabernat, @raphaeljolivet, @d-v-b, @mkitti, @remithiebaut, @olli4
