Removes a file entry (by providing a `str` path or `ZipInfo`) from the central directory.
If there are multiple file entries with the same path, only one is removed when a `str` path is provided.
Returns the removed `ZipInfo` instance.
Supported in modes: `'a'`, `'w'`, `'x'`.
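
The duplicate-path case can be reproduced with the existing `zipfile` API alone. The following is a minimal sketch on an in-memory archive; it does not call the proposed `remove()`, and only illustrates why a `str` path is ambiguous while a `ZipInfo` is not.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf, warnings.catch_warnings():
    # zipfile warns about duplicate names but still writes both entries.
    warnings.simplefilter("ignore", UserWarning)
    zf.writestr("a.txt", b"first")
    zf.writestr("a.txt", b"second")

with zipfile.ZipFile(buf) as zf:
    entries = zf.infolist()
    names = [zi.filename for zi in entries]
    # A name lookup resolves to a single entry (the last one listed).
    by_name = zf.read("a.txt")
    # A ZipInfo identifies each entry unambiguously.
    by_info = [zf.read(zi) for zi in entries]

print(names, by_name, by_info)
```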

`ZipFile.repack(removed=None)`

Physically removes stale local file entry data that is no longer referenced by the central directory.
Shrinks the archive file size.
If `removed` is passed (as a sequence of removed `ZipInfo`s), only their corresponding local file entry data are removed.
Only supported in mode `'a'`.

Rationales

Heuristics Used in `repack()`

Since `repack()` does not immediately clean up removed entries at the time a `remove()` is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.

While local file entries begin with the magic signature `PK\x03\x04`, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.
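
The same ambiguity arises whenever an archive stores another archive uncompressed. A small sketch using only the existing `zipfile` API (the helper `build_zip` is a name introduced here for illustration): counting signature hits gives more candidates than there are real entries.

```python
import io
import zipfile


def build_zip(entries):
    """Build an uncompressed in-memory ZIP from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


inner = build_zip({"inner.txt": b"nested"})
outer = build_zip({"payload.zip": inner, "readme.txt": b"hello"})

# Every occurrence of the local file header signature in the raw bytes.
signature_hits = outer.count(b"PK\x03\x04")

# Entries actually referenced by the central directory.
with zipfile.ZipFile(io.BytesIO(outer)) as zf:
    real_entries = len(zf.infolist())

print(signature_hits, real_entries)
```

The extra hit is the local header of `inner.txt`, which lives inside the stored data of `payload.zip` and must not be treated as a removable entry.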

To safely reclaim space, `repack()` assumes that in a normal ZIP file, local file entries are stored consecutively:

- File entries must not overlap.
  - If any entry's data overlaps with the next, a `BadZipFile` error is raised and no changes are made.
- There should be no extra bytes between entries (or between the last entry and the central directory):
  1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceding bytes are preserved.
  2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.
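
The layout assumptions above can be checked against the central directory of an existing archive. This is a rough sketch, not the PR's `_ZipRepacker` logic: `find_gaps` is a hypothetical helper, it assumes entries without data descriptors, and it relies on the undocumented `ZipFile.start_dir` attribute.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIZE = 30  # fixed-size part of a local file header


def find_gaps(data):
    """Return byte ranges not covered by referenced local file entries.

    Sketch only: assumes no data descriptor follows the entry data.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        spans = []
        for zi in zf.infolist():
            # Name/extra lengths are read from the local header itself.
            name_len, extra_len = struct.unpack_from(
                "<HH", data, zi.header_offset + 26)
            end = (zi.header_offset + LOCAL_HEADER_SIZE
                   + name_len + extra_len + zi.compress_size)
            spans.append((zi.header_offset, end, zi.filename))
        central_dir_start = zf.start_dir  # undocumented attribute

    gaps, pos = [], 0
    for start, end, name in sorted(spans):
        if start < pos:
            raise zipfile.BadZipFile(f"entry {name!r} overlaps the previous one")
        if start > pos:
            gaps.append((pos, start))
        pos = end
    if central_dir_start > pos:
        gaps.append((pos, central_dir_start))
    return gaps


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"beta")
archive = buf.getvalue()

clean_gaps = find_gaps(archive)                   # consecutive entries
prefixed_gaps = find_gaps(b"JUNKJUNK" + archive)  # 8 unreferenced bytes first
print(clean_gaps, prefixed_gaps)
```

Whether a reported gap is a removed entry or unrelated data (such as a self-extractor stub) is exactly what the heuristics have to decide.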

Check the doc in the source code of `_ZipRepacker.repack()` (which is internally called by `ZipFile.repack()`) for more details.

Supported Modes

There have been opinions that repacking should support modes `'w'` and `'x'` (e.g. #51067 (comment)).

This is NOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we can change the behavior of the existing API, further care has to be taken because modes `'w'` and `'x'` may be used on an unseekable file and would be broken by such a change. OTOH, mode `'a'` is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.
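
The difference between the modes can be observed with the current `zipfile` module. A minimal sketch, where `WriteOnly` is a hypothetical stand-in for an unseekable stream such as a pipe or socket:

```python
import io
import zipfile


class WriteOnly:
    """Stream exposing only write() and flush(): no seek(), no tell()."""

    def __init__(self):
        self.sink = io.BytesIO()

    def write(self, data):
        return self.sink.write(data)

    def flush(self):
        pass


# Mode 'w' works on an unseekable stream.
stream = WriteOnly()
with zipfile.ZipFile(stream, "w") as zf:
    zf.writestr("a.txt", b"hello")

with zipfile.ZipFile(io.BytesIO(stream.sink.getvalue())) as zf:
    round_trip = zf.read("a.txt")

# Mode 'a' needs to seek right away, so opening fails.
try:
    zipfile.ZipFile(WriteOnly(), "a")
    append_ok = True
except (AttributeError, OSError):
    append_ok = False

print(round_trip, append_ok)
```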

Issue: remove/delete method for zipfile objects #51067

📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

Add `remove()` and `repack()` to `ZipFile`

6aed859

bedevere-appbot mentioned this pull request

May 24, 2025

remove/delete method for zipfile objects #51067

Open

bedevere-appbot commented May 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-appbot added the awaiting review label

May 24, 2025

blurb-it bot and others added 4 commits

May 24, 2025 11:17

📜🤖 Added by blurb_it.

5453dbc

Fix and optimize test code

80ab2e2

Handle common setups with `setUpClass`

72c2a66

Add tests for mode `w` and `x` for `remove()`

a4b410b

This comment was marked as off-topic.

danny0838 requested a review from sharktide

May 24, 2025 17:29

sharktide suggested changes

May 24, 2025

Contributor

sharktide left a comment (edited)

It probably would be better to raise an `AttributeError` instead of a `ValueError` here since you are trying to access an attribute a closed zipfile doesn't have

bedevere-appbot added awaiting core review and removed awaiting review labels

May 24, 2025

Author

danny0838 commented May 24, 2025

> It probably would be better to raise an `AttributeError` instead of a `ValueError` here since you are trying to access an attribute a closed zipfile doesn't have

This behavior simply resembles `open()` and `write()`, which raise a `ValueError` in various cases. Furthermore, there has been a change from raising `RuntimeError` since Python 3.6:

> Changed in version 3.6: Calling `open()` on a closed ZipFile will raise a `ValueError`. Previously, a `RuntimeError` was raised.

> Changed in version 3.6: Calling `write()` on a ZipFile created with mode 'r' or a closed ZipFile will raise a `ValueError`. Previously, a `RuntimeError` was raised.
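
The existing behavior referred to here is easy to confirm. A short sketch on an in-memory archive, using `writestr()` in place of `write()` so that no file on disk is needed:

```python
import io
import zipfile

zf = zipfile.ZipFile(io.BytesIO(), "w")
zf.writestr("a.txt", b"hello")
zf.close()

# Record which exception type each call raises on the closed archive.
errors = {}
calls = [
    ("open", lambda: zf.open("a.txt")),
    ("writestr", lambda: zf.writestr("b.txt", b"x")),
]
for label, call in calls:
    try:
        call()
    except Exception as exc:
        errors[label] = type(exc).__name__

print(errors)
```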

danny0838 requested a review from sharktide

May 24, 2025 17:58

Author

danny0838 commented May 24, 2025 (edited)

Kindly informing @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken as there might be a potential risk in the algorithm for reclaiming space.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 added 7 commits

May 25, 2025 10:09

Introduce `_calc_initial_entry_offset` and refactor

a9e85c6

Optimize `_calc_initial_entry_offset` by introducing cache

236cd06

Introduce `_validate_local_file_entry` and refactor

bdc58c7

Introduce `_debug` and refactor

c3c8345

Introduce `_move_entry_data` and rework chunk_size passing

1b7d75a

Refactor `_validate_local_file_entry`

51c9254

Add `strict_descriptor` option

0d971d8

danny0838 force-pushed the gh-51067-2 branch from bc865be to 0d971d8

May 25, 2025 02:19

danny0838 added 6 commits

May 25, 2025 16:18

Fix and improve validation tests

8f0a504

- Separate individual validation tests.
- Check underlying repacker not called in validation.
- Use `unlink` to prevent FileNotFoundError.
- Fix mode 'x' test.

Remove obsolete NameToInfo updating

0cb8682

Use `zinfo` rather than `info`

a788a00

Raise on overlapping file blocks

ae01b8c

Rework writing protection

edee203

- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.

Update doc

555ac78

Contributor

sharktide commented May 26, 2025

Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails!

sharktide reviewed

May 26, 2025

Doc/library/zipfile.rst

emmatyping self-requested a review

May 26, 2025 05:12

Member

emmatyping commented May 26, 2025

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned about `repack` being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped; if it is in the central directory, it is kept.
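
The scanning scheme suggested here could look roughly like the sketch below. `classify_signatures` and its parameters are hypothetical names, and the "removed" offsets are simply taken from an intact archive, since the proposed `remove()` is not called.

```python
import io
import zipfile

SIGNATURE = b"PK\x03\x04"


def classify_signatures(data, kept_offsets, removed_offsets):
    """Map every signature offset to 'keep', 'drop', or 'leave as is'."""
    verdicts = {}
    pos = data.find(SIGNATURE)
    while pos >= 0:
        if pos in kept_offsets:
            verdicts[pos] = "keep"         # referenced by the central directory
        elif pos in removed_offsets:
            verdicts[pos] = "drop"         # known to have been removed
        else:
            verdicts[pos] = "leave as is"  # unknown bytes, not touched
        pos = data.find(SIGNATURE, pos + 1)
    return verdicts


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.bin", b"xxPK\x03\x04xx")  # stored data embeds a signature
    zf.writestr("b.txt", b"bravo")
    zf.writestr("c.txt", b"charlie")
    offsets = {zi.filename: zi.header_offset for zi in zf.infolist()}
data = buf.getvalue()

# Pretend b.txt was removed and the other two entries are still referenced.
kept = {offsets["a.bin"], offsets["c.txt"]}
removed = {offsets["b.txt"]}
verdicts = list(classify_signatures(data, kept, removed).values())
print(verdicts)
```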

sharktide approved these changes

May 26, 2025

Contributor

sharktide left a comment

> Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips?

Maybe, but I think that would just add extra complexity and a whole new plethora of issues. LGTM so far!

@danny0838 Thanks for the PR!

danny0838 added 5 commits

May 26, 2025 22:43

Fix typo

95fde31

Add test for bytes between file entries

8a448e4

Check `testzip()` after zip file closed

4c35eb2

Support `repack(removed)`

926338c

Fix bytes between entries be removed when `removed` is passed

e76f9a1

Author

danny0838 commented May 26, 2025 (edited)

@emmatyping

> @danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.
>
> I'm a bit concerned about `repack` being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped, if it is in the central directory, then it is kept.

Added support for `repack(removed)`, which takes a sequence of removed `ZipInfo`s and reclaims space only for their corresponding local file entries. So one can do something like:

    with zipfile.ZipFile('archive.zip', 'a') as zh:
        zinfos = [zh.remove(n) for n in zh.namelist() if n.startswith('folder/')]
        zh.repack(zinfos)

The result should not change in most cases, but this is more performant because it eliminates the need to scan the bytes for file entries.

danny0838 added 14 commits

May 27, 2025 07:57

Fix bad test code

93f4c25

Revise docstring

9e94209

Add `tearDown` for tests

3ef72c6

Rename methods and parameters

fbf7588

Adjust parameter order

81a419a

Optimize code and revise comment

c62a455

- According to ZIP spec, both uncompressed and compressed size should be 0xffffffff when zip64 is used.

Improve debug for `_ZipRepacker.repack()`

a05353c

Rework `_validate_local_file_entry_sequence` to return size or None

3d0240c

Rework `_validate_local_file_entry_sequence` to allow passing no `checked_offsets`

31c4c93

Introduce `_scan_data_descriptor_no_sig_by_decompression`

f8fade1

Strip only entries immediately following a referenced entry

c80d21b

- The previous implementation might cause [archive decryption header] and/or [archive extra data record] preceding [central directory] to be stripped.

Adjust method names

e1caea9

Add memory usage test

2b23d46

Fix rst

de4f15b

Author

danny0838 commented May 31, 2025 (edited)

Is there still any problem with this PR?

Here are some things that may need further consideration/discussion:

1. Should the `strict_descriptor` option for `repack()` default to True or False?

Summary of trade-offs

| Option | Pros ✅ | Cons ❌ |
| --- | --- | --- |
| `strict_descriptor=True` | Correctly strips any entry with an unsigned data descriptor; stricter adherence to the ZIP spec | ~150× slower in worst cases; might open a hole for DoS (if an attacker crafts offensive entries); slightly higher false-positive risk on random bytes |
| `strict_descriptor=False` | Much faster; safer against DoS scenarios; lower false-positive risk | Cannot strip unsigned descriptors; less strict to the ZIP spec (but doesn't violate it) |

Background

When a local file entry has the flag bit indicating usage of data descriptor:

1. This method first attempts to scan for a signed data descriptor.
2. If no valid one is found:
   1. For supported compression methods (`ZIP_DEFLATED`, `ZIP_BZIP2`, or `ZIP_ZSTANDARD`), it decompresses the data to find its end offset.
   2. Otherwise it performs a byte-by-byte scan for an unsigned data descriptor.

This option only affects case 2.2, which is used when neither signed descriptor nor decompression-based validation is applicable.
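
Step 1 can be illustrated on an archive produced by the current `zipfile` module, which writes a signed descriptor when the output stream is unseekable. This is a sketch under stated assumptions, not the PR's code: `find_signed_descriptor` and `WriteOnly` are hypothetical names, only 32-bit (non-zip64) descriptors are handled, and only the compressed size is validated.

```python
import io
import struct
import zipfile
import zlib

DD_SIGNATURE = b"PK\x07\x08"


class WriteOnly:
    """Unseekable stream, so zipfile emits data descriptors."""

    def __init__(self):
        self.sink = io.BytesIO()

    def write(self, data):
        return self.sink.write(data)

    def flush(self):
        pass


def find_signed_descriptor(buf, header_offset):
    """Return (data_start, descriptor_offset, crc) or None."""
    flags, = struct.unpack_from("<H", buf, header_offset + 6)
    if not flags & 0x08:  # bit 3: sizes and CRC follow the data
        return None
    name_len, extra_len = struct.unpack_from("<HH", buf, header_offset + 26)
    data_start = header_offset + 30 + name_len + extra_len
    pos = buf.find(DD_SIGNATURE, data_start)
    while pos >= 0 and pos + 16 <= len(buf):
        crc, csize, usize = struct.unpack_from("<LLL", buf, pos + 4)
        if csize == pos - data_start:  # size must match the distance scanned
            return data_start, pos, crc
        pos = buf.find(DD_SIGNATURE, pos + 1)
    return None


payload = b"hello world"
stream = WriteOnly()
with zipfile.ZipFile(stream, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", payload)
raw = stream.sink.getvalue()

found = find_signed_descriptor(raw, 0)
data_start, descriptor_offset, crc = found
print(descriptor_offset - data_start, crc == zlib.crc32(payload))
```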

Performance comparison

Based on the benchmark (see tests in `test_zipfile64.py`):

- 8 GiB `ZIP_STORED` file with signed data descriptor: ~56.7s
- 400 MiB `ZIP_STORED` file with unsigned data descriptor: ~270s

The latter is over 150× slower due to the byte-by-byte scanning for a valid data descriptor.
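
To compare the two runs on equal footing, the figures can be normalized to throughput. The snippet below is only arithmetic on the two quoted timings, assuming the quoted sizes are the number of bytes scanned; it yields a ratio of roughly 98×, so the "over 150×" figure is not reproduced by these two numbers alone.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

signed_rate = 8 * GIB / 56.7      # bytes per second, signed descriptor
unsigned_rate = 400 * MIB / 270   # bytes per second, unsigned descriptor
ratio = signed_rate / unsigned_rate

print(f"signed:   {signed_rate / MIB:.1f} MiB/s")
print(f"unsigned: {unsigned_rate / MIB:.2f} MiB/s")
print(f"ratio:    {ratio:.1f}x")
```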

This may also raise a security concern since `strict_descriptor=True` may open a path for a DoS attack (if an attacker crafts a ZIP file with offensive entries).

False-positive risk

It's not possible to guarantee the "real" file size of a local entry with a data descriptor without the information from the central directory.

If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size. Checking CRCs could reduce false positives but would significantly degrade performance.) The byte-by-byte validation can increase the risk of a false positive compared to the signature- or decompression-based validation, which only checks certain points and has more prerequisites.

A false positive should be unlikely to happen in practice. If it did happen, a stale local file entry would be stripped incompletely (e.g. a 30 MiB entry treated as 20 MiB, leaving 10 MiB of random bytes behind), and the following entries would not be stripped (since the algorithm requires consecutive entries). However, the ZIP file remains uncorrupted.
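
A contrived example makes the false-positive mechanism concrete. This is a simplified sketch, not the PR's scanner: `scan_unsigned` is a hypothetical helper that treats the buffer as uncompressed entry data followed by a 12-byte unsigned descriptor, with an optional CRC check.

```python
import struct
import zlib


def scan_unsigned(buf, data_start, check_crc=False):
    """Byte-by-byte scan for an unsigned descriptor (crc, csize, usize)."""
    for pos in range(data_start, len(buf) - 12 + 1):
        crc, csize, usize = struct.unpack_from("<LLL", buf, pos)
        if csize != pos - data_start:
            continue
        if check_crc and crc != zlib.crc32(buf[data_start:pos]):
            continue
        return pos
    return None


# Entry data that happens to contain bytes shaped like a descriptor
# claiming a 20-byte entry.
decoy = struct.pack("<LLL", 0xDEADBEEF, 20, 20)
data = b"A" * 20 + decoy + b"B" * 20
real = struct.pack("<LLL", zlib.crc32(data), len(data), len(data))
buf = data + real

size_only = scan_unsigned(buf, 0)                  # stops at the decoy
with_crc = scan_unsigned(buf, 0, check_crc=True)   # finds the real end
print(size_only, with_crc, len(data))
```

The size-only scan reports an entry of 20 bytes instead of 52, which mirrors the incomplete stripping described above; the CRC check rejects the decoy at the cost of hashing at every candidate offset.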

Spec compliance

According to the ZIP file format specification: Applications SHOULD write signed descriptors and SHOULD support both forms when reading.

Unsigned descriptors are thus considered legacy, but it is unclear whether they are still used widely.

`strict_descriptor=False` adheres less strictly to the spec, but does not violate it, because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.

2. Should we also implement `copy()` for `ZipFile`?

Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:

- Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
- Use a temporary file for buffered copying.

Both approaches are more complex and less performant, due to the need to decompress and recompress data.
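
For reference, the first approach looks like this with the current API. A minimal sketch on an in-memory archive; the entry names are arbitrary:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("src.txt", b"payload " * 1000)

with zipfile.ZipFile(buf, "a") as zf:
    info = zf.getinfo("src.txt")
    # Decompresses the whole entry into memory...
    data = zf.read(info)
    # ...and compresses it again while writing the copy.
    zf.writestr("dst.txt", data, compress_type=info.compress_type)

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    same_content = zf.read("src.txt") == zf.read("dst.txt")

print(names, same_content)
```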

It would be much more performant and friendly to implement a `copy()` method, using a similar internal buffered copying technique to the one `_ZipRepacker` uses.

Additionally, this also opens the door to supporting an efficient `move()` operation, composed of `copy()`, `remove()`, and optionally `repack()`.

And an additional question: whether this should be included in the current PR, or proposed separately as a follow-up?

Member

gpshead commented May 31, 2025 (edited)

A higher level question: Why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features and their resulting trail of bugfixes to people who might want them. All features have an ongoing cost.

Is there a real need for these zipfile features to not simply be advanced ones available in a PyPI package? @jaraco FYI

I appreciate the enthusiasm for implementing something interesting. I'm not sure we actually want to maintain that within the CPython project though. Are there compelling use cases for these features to be part of Python rather than external?

(edit: posted this on the Issue, leaving here for posterity)

Author

danny0838 commented May 31, 2025 (edited)

@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠

Member

gpshead commented May 31, 2025

good point, moved, thanks! (where to have what discussions is an area where I consider github's UX to be... not great)

mostafaammer added this to lavitaconnect@MOSTAFAAMMER

May 31, 2025

Labels

awaiting core review

5 participants


gh-51067: Add `remove()` and `repack()` to `ZipFile` #134627
