Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-51067: Addremove() andrepack() toZipFile#134627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
danny0838 wants to merge37 commits intopython:main
base:main
Choose a base branch
Loading
fromdanny0838:gh-51067-2

Conversation

danny0838
Copy link

@danny0838danny0838 commentedMay 24, 2025
edited
Loading

This is a revised version ofPR #103033, implementing two new methods inzipfile.ZipFile:remove() andrepack(), as suggested inthis comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing astr path orZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when astr path is provided.
  • Returns the removedZipInfo instance.
  • Supported in modes:'a','w','x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • Ifremoved is passed (as a sequence of removedZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode'a'.

Rationales

Heuristics Used inrepack()

Sincerepack() does not immediately clean up removed entries at the time aremove() is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.

While local file entries begin with the magic signaturePK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space,repack() assumes that in a normal ZIP file, local file entries arestored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, aBadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceeding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.

Check the doc in the source code of_ZipRepacker.repack() (which is internally called byZipFile.repack()) for more details.

Supported Modes

There has been opinions that a repacking should support mode'w' and'x' (e. g.#51067 (comment)).

This isNOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode'w' and'x' may be used on an unseekable file and will be broken by such change. OTOH, mode'a' is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.



📚 Documentation preview 📚:https://cpython-previews--134627.org.readthedocs.build/

@bedevere-app
Copy link

Most changes to Pythonrequire a NEWS entry. Add one using theblurb_it web app or theblurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply theskip news label instead.

sharktide

This comment was marked as off-topic.

@danny0838danny0838 requested a review fromsharktideMay 24, 2025 17:29
Copy link
Contributor

@sharktidesharktide left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

@danny0838
Copy link
Author

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

This behavior simply resemblesopen() andwrite(), which raises aValueError in various cases. Furthermore there has been a change from raisingRuntimeError since Python 3.6:

Changed in version 3.6: Callingopen() on a closed ZipFile will raise aValueError. Previously, aRuntimeError was raised.

Changed in version 3.6: Callingwrite() on a ZipFile created with mode 'r' or a closed ZipFile will raise aValueError. Previously, aRuntimeError was raised.

@danny0838danny0838 requested a review fromsharktideMay 24, 2025 17:58
@danny0838
Copy link
Author

danny0838 commentedMay 24, 2025
edited
Loading

Nicely inform@ubershmekel,@barneygale,@merwok, and@wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

- Separate individual validation tests.- Check underlying repacker not called in validation.- Use `unlink` to prevent FileNotFoundError.- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.- Move the protection logic to `ZipFile.repack()`.
@sharktide
Copy link
Contributor

Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails!

@emmatypingemmatyping self-requested a reviewMay 26, 2025 05:12
@emmatyping
Copy link
Member

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned aboutrepack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped, if it is in the central directory, then it is kept.

Copy link
Contributor

@sharktidesharktide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips?

Maybe, but I think that would just add extra complexity and a whole new plethora of issues. LGTM so far!

@danny0838 Thanks for the PR!

@danny0838
Copy link
Author

danny0838 commentedMay 26, 2025
edited
Loading

@emmatyping

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned aboutrepack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped, if it is in the central directory, then it is kept.

Added support ofrepack(removed) to pass a sequence of removedZipInfos, and reclaims space only for corresponding local file headers. So one can do something like:

with zipfile.ZipFile('archive.zip', 'a') as zh:    zinfos = [zh.remove(n) for n in zh.namelist() if n.startswith('folder/')]    zh.repack(zinfos)

Though the result should not change in most cases, except for being more performant by eliminating the need to scan file entries in the bytes.

emmatyping reacted with hooray emoji

- According to ZIP spec, both uncompressed and compressed size should be 0xffffffff when zip64 is used.
- The previous implementation might cause [archive decryption header] and/or [archive extra data record] preceeding [central directory] be stripped.
@danny0838
Copy link
Author

danny0838 commentedMay 31, 2025
edited
Loading

Is there still any problem about this PR?

Here are something that may need further consideration/discussion:

1. Shouldstrict_descriptor option forrepack() default to True or False?

Summary of trade-offs

OptionPros ✅Cons ❌
strict_descriptor=True- Correctly strips any entry with an unsigned data descriptor
- Better strict to ZIP spec
- ~150× slower in worst cases
- Might open a hole for DoS (if attacker crafts offensive entries)
- Slightly higher false-positive risk on random bytes
strict_descriptor=False- Much faster
- Safer against DoS scenarios
- Lower false-positive risk
- Cannot strip unsigned descriptors
- Less strict to ZIP spec (but doesn't violate it)

Background

When a local file entry has the flag bit indicating usage of data descriptor:

  1. This method first attempts to scan for a signed data descriptor.
  2. If no valid one is found:
    1. For supported compression methods (ZIP_DEFLATED,ZIP_BZIP2, orZIP_ZSTANDARD), it decompresses the data to find its end offset.
    2. Otherwise it performs a byte-by-byte scan for an unsigned data descriptor.

This option only affects case2.2, which is used when neither signed descriptor nor decompression-based validation is applicable.

Performance comparison

Based on the benchmark (see tests intest_zipfile64.py):

  • 8 GiBZIP_STORED file with signed data descriptor: ~56.7s
  • 400 MiBZIP_STORED file with unsigned data descriptor: ~270s

The latter is over150× slower due to the byte-by-byte scanning for a valid data descriptor.

This may also raise a security concern sincestrict_descriptor=False may open a path for a DoS Attack (if an attacker crafts a ZIP file with offensive entries).

False-positive risk

It's not possible to guarantee the "real" file size of a local entry with a data descriptor without the information from the central directory.

If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size. Checking for CRCs could reduce false positives but would significantly deteriate performance.) The byte-by-byte validation can increase the risk of a false positive compared to the signature or decompression based validation, which only checks for certain points and has more prerequisites.

A false positive should be unlikely to happen in practice. If it were to happen, a stale local file entry is stripped incompletely (e.g. a 30MiB entry be treated as 20MiB, leaving 10MiB random bytes over) and cause following entries not stripped (since the algorithm requires consecutive entries). However, the ZIP file remains uncorrupted.

Spec compliance

According to theZIP file format specification: Applications SHOULD write signed descriptors and SHOULD support both forms when reading.

Unsigned descriptors are thus consideredlegacy, but it is unclear whether they are still used widely.

strict_descriptor=True adheres less strictly to the spec, but doesnot violate it — because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.

2. Should we also implementcopy() forZipFile?

Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:

  • Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
  • Use a temporary file for buffered copying.

Both approaches are more complex and less performant, due to the need to decompress and recompress data.

If would be much more performant and friendly by implementing acopy() method, using the similar internal buffered copying technique that_ZipRepacker has used.

Additionally, this also opens the door to support an efficientmove() operation, composed ofcopy(),remove(), and optionallyrepack().

And an additional question: whether this should be included in the current PR, or proposed separately as a follow-up?

@gpshead
Copy link
Member

gpshead commentedMay 31, 2025
edited
Loading

A higher level question: Why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features and their resulting trail of bugfixes to people who might want them. All features have an ongoing cost.

Is there a realneed for these zipfile features to not simply be advanced ones available in a PyPI package?@jaraco FYI

I appreciate the enthusiasm for implementing something interesting. I'm not sure we actually want to maintain that within the CPython project though. Are there compelling use cases for these features to be part of Python rather than external?

(edit: posted this on the Issue, leaving here for posterity)

@danny0838
Copy link
Author

danny0838 commentedMay 31, 2025
edited
Loading

@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠

gpshead reacted with thumbs up emoji

@gpshead
Copy link
Member

good point, moved, thanks!(where to have what discussions is an area where I consider github's UX to be... not great)

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sharktidesharktidesharktide approved these changes

@emmatypingemmatypingAwaiting requested review from emmatyping

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

5 participants
@danny0838@merwok@emmatyping@sharktide@gpshead

[8]ページ先頭

©2009-2025 Movatter.jp