Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-51067: Addremove()
andrepack()
toZipFile
#134627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
Most changes to Pythonrequire a NEWS entry. Add one using theblurb_it web app or theblurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
Uh oh!
There was an error while loading.Please reload this page.
sharktide left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have
This behavior simply resembles
|
danny0838 commentedMay 24, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Nicely inform@ubershmekel,@barneygale,@merwok, and@wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces. The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted. |
- Separate individual validation tests.- Check underlying repacker not called in validation.- Use `unlink` to prevent FileNotFoundError.- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.- Move the protection logic to `ZipFile.repack()`.
Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails! |
Uh oh!
There was an error while loading.Please reload this page.
@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles. I'm a bit concerned about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips?
Maybe, but I think that would just add extra complexity and a whole new plethora of issues. LGTM so far!
@danny0838 Thanks for the PR!
danny0838 commentedMay 26, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Added support of
Though the result should not change in most cases, except for being more performant by eliminating the need to scan file entries in the bytes. |
- According to ZIP spec, both uncompressed and compressed size should be 0xffffffff when zip64 is used.
- The previous implementation might cause [archive decryption header] and/or [archive extra data record] preceeding [central directory] be stripped.
danny0838 commentedMay 31, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Is there still any problem about this PR? Here are something that may need further consideration/discussion: 1. Should |
Option | Pros ✅ | Cons ❌ |
---|---|---|
strict_descriptor=True | - Correctly strips any entry with an unsigned data descriptor - Better strict to ZIP spec | - ~150× slower in worst cases - Might open a hole for DoS (if attacker crafts offensive entries) - Slightly higher false-positive risk on random bytes |
strict_descriptor=False | - Much faster - Safer against DoS scenarios - Lower false-positive risk | - Cannot strip unsigned descriptors - Less strict to ZIP spec (but doesn't violate it) |
Background
When a local file entry has the flag bit indicating usage of data descriptor:
- This method first attempts to scan for a signed data descriptor.
- If no valid one is found:
- For supported compression methods (
ZIP_DEFLATED
,ZIP_BZIP2
, orZIP_ZSTANDARD
), it decompresses the data to find its end offset. - Otherwise it performs a byte-by-byte scan for an unsigned data descriptor.
- For supported compression methods (
This option only affects case2.2, which is used when neither signed descriptor nor decompression-based validation is applicable.
Performance comparison
Based on the benchmark (see tests intest_zipfile64.py
):
- 8 GiB
ZIP_STORED
file with signed data descriptor: ~56.7s - 400 MiB
ZIP_STORED
file with unsigned data descriptor: ~270s
The latter is over150× slower due to the byte-by-byte scanning for a valid data descriptor.
This may also raise a security concern sincestrict_descriptor=False
may open a path for a DoS Attack (if an attacker crafts a ZIP file with offensive entries).
False-positive risk
It's not possible to guarantee the "real" file size of a local entry with a data descriptor without the information from the central directory.
If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size. Checking for CRCs could reduce false positives but would significantly deteriate performance.) The byte-by-byte validation can increase the risk of a false positive compared to the signature or decompression based validation, which only checks for certain points and has more prerequisites.
A false positive should be unlikely to happen in practice. If it were to happen, a stale local file entry is stripped incompletely (e.g. a 30MiB entry be treated as 20MiB, leaving 10MiB random bytes over) and cause following entries not stripped (since the algorithm requires consecutive entries). However, the ZIP file remains uncorrupted.
Spec compliance
According to theZIP file format specification: Applications SHOULD write signed descriptors and SHOULD support both forms when reading.
Unsigned descriptors are thus consideredlegacy, but it is unclear whether they are still used widely.
strict_descriptor=True
adheres less strictly to the spec, but doesnot violate it — because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.
2. Should we also implementcopy()
forZipFile
?
Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:
- Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
- Use a temporary file for buffered copying.
Both approaches are more complex and less performant, due to the need to decompress and recompress data.
If would be much more performant and friendly by implementing acopy()
method, using the similar internal buffered copying technique that_ZipRepacker
has used.
Additionally, this also opens the door to support an efficientmove()
operation, composed ofcopy()
,remove()
, and optionallyrepack()
.
And an additional question: whether this should be included in the current PR, or proposed separately as a follow-up?
gpshead commentedMay 31, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
A higher level question: Why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features and their resulting trail of bugfixes to people who might want them. All features have an ongoing cost. Is there a realneed for these zipfile features to not simply be advanced ones available in a PyPI package?@jaraco FYI I appreciate the enthusiasm for implementing something interesting. I'm not sure we actually want to maintain that within the CPython project though. Are there compelling use cases for these features to be part of Python rather than external? (edit: posted this on the Issue, leaving here for posterity) |
danny0838 commentedMay 31, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠 |
good point, moved, thanks!(where to have what discussions is an area where I consider github's UX to be... not great) |
Uh oh!
There was an error while loading.Please reload this page.
This is a revised version ofPR #103033, implementing two new methods in
zipfile.ZipFile
:remove()
andrepack()
, as suggested inthis comment.Features
ZipFile.remove(zinfo_or_arcname)
str
path orZipInfo
) from the central directory.str
path is provided.ZipInfo
instance.'a'
,'w'
,'x'
.ZipFile.repack(removed=None)
removed
is passed (as a sequence of removedZipInfo
s), only their corresponding local file entry data are removed.'a'
.Rationales
Heuristics Used in
repack()
Since
repack()
does not immediately clean up removed entries at the time aremove()
is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.While local file entries begin with the magic signature
PK\x03\x04
, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.To safely reclaim space,
repack()
assumes that in a normal ZIP file, local file entries arestored consecutively:BadZipFile
error is raised and no changes are made.Check the doc in the source code of
_ZipRepacker.repack()
(which is internally called byZipFile.repack()
) for more details.Supported Modes
There has been opinions that a repacking should support mode
'w'
and'x'
(e. g.#51067 (comment)).This isNOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode
'w'
and'x'
may be used on an unseekable file and will be broken by such change. OTOH, mode'a'
is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.📚 Documentation preview 📚:https://cpython-previews--134627.org.readthedocs.build/