
Python Enhancement Proposals

PEP 706 – Filter for tarfile.extractall

Author:
Petr Viktorin <encukou at gmail.com>
Discussions-To:
Discourse thread
Status:
Final
Type:
Standards Track
Created:
09-Feb-2023
Python-Version:
3.12
Post-History:
25-Jan-2023, 15-Feb-2023
Resolution:
Discourse message

Important

This PEP is a historical document. The up-to-date, canonical documentation can now be found at tarfile documentation.

See PEP 1 for how to propose changes.

Abstract

The extraction methods in tarfile gain a filter argument, which allows rejecting files or modifying metadata as the archive is extracted. Three built-in named filters are provided, aimed at limiting features that might be surprising or dangerous. These can be used as-is, or serve as a base for custom filters.

After a deprecation period, a strict (but safer) filter will become the default.

Motivation

The tar format is used for several use cases, many of which have different needs. For example:

  • A backup of a UNIX workstation should faithfully preserve all kinds of details like file permissions, symlinks to system configuration, and various kinds of special files.
  • When unpacking a data bundle, it’s much more important that the unpacking will not have unintended consequences – like exposing a password file by symlinking it to a public place.

To support all its use cases, the tar format has many features. In many cases, it’s best to ignore or disallow some of them when extracting an archive.

Python allows extracting tar archives using tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind of inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly. As a result, many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. However, TarFile.extractall looks benign in a code review.

Tarfile extraction is also exposed via shutil.unpack_archive(), which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting for extracting archives without prior inspection, even though the docs again warn against it.

It has been argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall is what people reach for when they need to extract a tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So, we’ll add several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:

  • Current behavior: trusting the archive. Suitable e.g. as a building block for libraries that do the check themselves, or extracting an archive you just made yourself.
  • Unpacking a UNIX archive: roughly following GNU tar, e.g. stripping leading / from filenames.
  • Unpacking a general data archive: the shutil.unpack_archive() use case, where it’s not important to preserve details specific to tar or Unix-like filesystems.

After a deprecation period, the last option – the most limited but most secure one – will become the default.

Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:
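The snippet that belongs here did not survive in this copy of the text. A minimal sketch of the pattern being described, assuming a hypothetical is_safe() check and a hypothetical extract_checked() helper (neither name comes from the PEP), might look like this:

```python
import tarfile


def is_safe(member: tarfile.TarInfo) -> bool:
    """Hypothetical check: reject absolute paths and '..' components."""
    parts = member.name.replace("\\", "/").split("/")
    return not member.name.startswith("/") and ".." not in parts


def extract_checked(tar: tarfile.TarFile, path: str) -> None:
    """Verify every member up front, then extract the verified list."""
    members = tar.getmembers()  # reads the whole archive; needs a seekable file
    for member in members:
        if not is_safe(member):
            raise ValueError(f"refusing to extract {member.name!r}")
    tar.extractall(path, members=members)
```

The check here is deliberately naive: it does not track symlinks, for instance.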

However, there are some issues with this approach:

  • It’s possible to modify TarInfo objects, but the changes to them affect all subsequent operations on the same TarFile object. This behavior is fine for most uses, but despite that, it would be very surprising if TarFile.extractall did this by default.
  • Calling getmembers can be expensive and it requires a seekable archive.
  • When verifying members in advance, it may be necessary to track how each member would have changed the filesystem, e.g. how symlinks are being set up. This is hard. We can’t expect users to do it.

To solve these issues we’ll:

  • Provide a supported way to “clone” and modify TarInfo objects. A replace method, similar to dataclasses.replace() or namedtuple._replace should do the trick.
  • Provide a “filter” hook in extractall’s loop that can modify or discard members before they are processed.
  • Require that this hook is called just before extracting each member, so it can scan the current state of the disk. This will greatly simplify the implementation of policies (both in stdlib and user code), at the cost of not being able to do a precise “dry run”.

The hook API will be very similar to the existing filter argument for TarFile.add. We’ll also name it filter. (In some cases “policy” would be a more fitting name, but the API can be used for more than security policies.)

The built-in policies/filters described above will be implemented using the public filter API, so they can be used as building blocks or examples.

Setting a precedent

If and when other libraries for archive extraction, such as zipfile, gain similar functionality, they should mimic this API as much as it’s reasonable.

To enable this for simple cases, the built-in filters will have string names; e.g. users can pass filter='data' instead of a specific function that deals with TarInfo objects.

The shutil.unpack_archive() function will get a filter argument, which it will pass to extractall.

Adding function-based API that would work across archive formats is out of scope of this PEP.

Full disclosure & redistributor info

The PEP author works for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. Such redistributors may want to carry vendor patches to:

  • Allow configuring the defaults system-wide, and
  • Change the default as soon as possible, even in older Python versions.

The proposal makes this easy to do, and it allows users to query the settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will work similarly to dataclasses.replace. It will return a copy of the TarInfo object with attributes replaced as specified by keyword-only arguments:

  • name
  • mtime
  • mode
  • linkname
  • uid
  • gid
  • uname
  • gname

Any of these, except name and linkname, will be allowed to be set to None. When extract or extractall encounters such a None, it will not set that piece of metadata. (If uname or gname is None, it will fall back to uid or gid as if the name wasn’t found.) When addfile or tobuf encounters such a None, it will raise a ValueError. When list encounters such a None, it will print a placeholder string.

The documentation will mention why the method is there: TarInfo objects retrieved from TarFile.getmembers are “live”; modifying them directly will affect subsequent unrelated operations.

Filters

TarFile.extract and TarFile.extractall methods will grow a filter keyword-only parameter, which takes a callable that can be called as:

filter(/, member: TarInfo, path: str) -> TarInfo | None

where member is the member to be extracted, and path is the path to where the archive is extracted (i.e., it’ll be the same for every member).

When used it will be called on each member as it is extracted, and extraction will work with the result. If it returns None, the member will be skipped.

The function can also raise an exception. This can, depending on TarFile.errorlevel, abort the extraction or cause the member to be skipped.

Note

If extraction is aborted, the archive may be left partially extracted. It is the user’s responsibility to clean up.

We will also provide a set of defaults for common use cases. In addition to a function, the filter argument can be one of the following strings:

  • 'fully_trusted': Current behavior: honor the metadata as is. Should be used if the user trusts the archive completely, or implements their own complex verification.
  • 'tar': Roughly follow defaults of the GNU tar command (when run as a normal user):
    • Strip leading '/' and os.sep from filenames
    • Refuse to extract files with absolute paths (after the / stripping above, e.g. C:/foo on Windows).
    • Refuse to extract files whose absolute path (after following symlinks) would end up outside the destination. (Note that GNU tar instead delays creating some links.)
    • Clear high mode bits (setuid, setgid, sticky) and group/other write bits (S_IWGRP|S_IWOTH). (This is an approximation of GNU tar’s default, which limits the mode by the current umask setting.)
  • 'data': Extract a “data” archive, disallowing common attack vectors but limiting functionality. In particular, many features specific to UNIX-style filesystems (or equivalently, to the tar archive format) are ignored, making this a good filter for cross-platform archives. In addition to tar:
    • Refuse to extract links (hard or soft) that link to absolute paths.
    • Refuse to extract links (hard or soft) which end up linking to a path outside of the destination. (On systems that don’t support links, tarfile will, in most cases, fall back to creating regular files. This proposal doesn’t change that behaviour.)
    • Refuse to extract device files (including pipes).
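    • For regular files and hard links:
      • Set the owner read and write permissions (S_IRUSR|S_IWUSR).
      • Remove the group & other executable permission (S_IXGRP|S_IXOTH) if the owner doesn’t have it (S_IXUSR).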
    • For other files (directories), ignore mode entirely (set it to None).
    • Ignore user and group info (set uid, gid, uname, gname to None).

Any other string will cause a ValueError.

The corresponding filter functions will be available as tarfile.fully_trusted_filter(), tarfile.tar_filter(), etc., so they can be easily used in custom policies.

Note that these filters never return None. Skipping members this way is a feature for user-defined filters.

Defaults and their configuration

TarFile will gain a new attribute, extraction_filter, to allow configuring the default filter. By default it will be None, but users can set it to a callable that will be used if the filter argument is missing or None.

Note

String names won’t be accepted here. That would encourage code like my_tarfile.extraction_filter = 'data'. On Python versions without this feature, this would do nothing, silently ignoring a security-related request.

If both the argument and attribute are None:

  • In Python 3.12-3.13, a DeprecationWarning will be emitted and extraction will use the 'fully_trusted' filter.
  • In Python 3.14+, it will use the 'data' filter.

Applications and system integrators may wish to change extraction_filter of the TarFile class itself to set a global default. When using a function, they will generally want to wrap it in staticmethod() to prevent injection of a self argument.

Subclasses of TarFile can also override extraction_filter.

FilterError

A new exception, FilterError, will be added to the tarfile module. It’ll have several new subclasses, one for each of the refusal reasons above. FilterError’s member attribute will contain the relevant TarInfo.

In the lists above, “refusing” to extract a file means that a FilterError will be raised. As with other extraction errors, if the TarFile.errorlevel is 1 or more, this will abort the extraction; with errorlevel=0 the error will be logged and the member will be ignored, but extraction will continue. Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

Errorlevel, and fatal/non-fatal errors

Currently, TarFile has an errorlevel argument/attribute, which specifies how errors are handled:

  • With errorlevel=0, documentation says that “all errors are ignored when using extract() and extractall()”. The code only ignores non-fatal and fatal errors (see below), so, for example, you still get TypeError if you pass None as the destination path.
  • With errorlevel=1 (the default), all non-fatal errors are ignored. (They may be logged to sys.stderr by setting the debug argument/attribute.) Which errors are non-fatal is not defined in documentation, but code treats ExtractionError as such. Specifically, it’s these issues:
    • “unable to resolve link inside archive” (raised on systems that do not support symlinks)
    • “fifo/special devices not supported by system” (not used for failures if the system supports these, e.g. for a PermissionError)
    • “could not change owner/mode/modification time”

    Note that, for example, file name too long or out of disk space don’t qualify. The non-fatal errors are not very likely to appear on a Unix-like system.

  • With errorlevel=2, all errors are raised, including fatal ones. Which errors are fatal is, again, not defined; in practice it’s OSError.

A filter refusing to extract a member does not fit neatly into the fatal/non-fatal categories.

  • This PEP does not change existing behavior. (Ideas for improvements are welcome in Discourse topic 25970.)
  • When a filter refuses to extract a member, the error should not pass silently by default.

To satisfy this, FilterError will be considered a fatal error, that is, it’ll be ignored only with errorlevel=0.

Users that want to ignore FilterError but not other fatal errors should create a custom filter function, and call another filter in a try block.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for extracting untrusted files without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.

New docs will tell users to consider:

  • extracting to a new empty directory,
  • using external (e.g. OS-level) limits on disk, memory and CPU usage,
  • checking filenames against an allow-list of characters (to filter out control characters, confusables, etc.),
  • checking that filenames have expected extensions (discouraging files that execute when you “click on them”, or extension-less files like Windows special device names),
  • limiting the number of extracted files, total size of extracted data, and size of individual files,
  • checking for files that would be shadowed on case-insensitive filesystems.

Also, the docs will note that:

  • tar files commonly contain multiple versions of the same file: later ones are expected to overwrite earlier ones on extraction,
  • tarfile does not protect against issues with “live” data, e.g. an attacker tinkering with the destination directory while extracting (or adding) is going on (see the GNU tar manual for more info).

This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if it grows too long or if it needs to be consolidated with zipfile or shutil (which is out of scope for this proposal).

TarInfo identity, and offset

With filters that use replace(), the TarInfo objects handled by the extraction machinery will not necessarily be the same objects as those present in members. This may affect TarInfo subclasses that override methods like makelink and rely on object identity.

Such code can switch to comparing offset, the position of the member header inside the file.

Note that both the overridable methods and offset are only documented in source comments.

tarfile CLI

The CLI (python -m tarfile) will gain a --filter option that will take the name of one of the provided default filters. It won’t be possible to specify a custom filter function.

If --filter is not given, the CLI will use the default filter ('fully_trusted' with a deprecation warning now, and 'data' from Python 3.14 on).

There will be no short option. (-f would be confusingly similar to the filename option of GNU tar.)

Other archive libraries

If and when other archive libraries, such as zipfile, grow similar functionality, their extraction functions should use a filter argument that takes, at least, the strings 'fully_trusted' (which should disable any security precautions) and 'data' (which should avoid features that might surprise users).

Standardizing a function-based filter API is out of scope of this PEP.

Shutil

shutil.unpack_archive() will gain a filter argument. If it’s given, it will be passed to the underlying extraction function. Passing it for a zip archive will fail for now (until zipfile gains a filter argument, if it ever does).

If filter is not specified (or left as None), it won’t be passed on, so extracting a tarball will use the default filter ('fully_trusted' with a deprecation warning now, and 'data' from Python 3.14 on).

Complex filters

Note that some user-defined filters need, for example, to count extracted members or do post-processing. This requires a more complex API than a filter callable. However, that complex API need not be exposed to tarfile. For example, with a hypothetical StatefulFilter users would write:

    with StatefulFilter() as filter_func:
        my_tar.extract(path, filter=filter_func)

A simple StatefulFilter example will be added to the docs.

Note

The need for stateful filters is a reason against allowing registration of custom filter names in addition to 'fully_trusted', 'tar' and 'data'. With such a mechanism, API for (at least) set-up and tear-down would need to be set in stone.

Backwards Compatibility

The default behavior of TarFile.extract and TarFile.extractall will change, after raising DeprecationWarning for 2 releases (shortest deprecation period allowed in Python’s backwards compatibility policy).

Additionally, code that relies on tarfile.TarInfo object identity may break, see TarInfo identity, and offset.

Backporting & Forward Compatibility

This feature may be backported to older versions of Python.

In CPython, we don’t add warnings to patch releases, so the default filter should be changed to 'fully_trusted' in backports.

Other than that, all of the changes to tarfile should be backported, so hasattr(tarfile, 'data_filter') becomes a reliable check for all of the new functionality.

Note that CPython’s usual policy is to avoid adding new APIs in security backports. This feature does not make sense without a new API (TarFile.extraction_filter and the filter argument), so we’ll make an exception. (See Discourse comment 23149/16 for details.)

Here are examples of code that takes into account that tarfile may or may not have the proposed feature.

When copying these snippets, note that setting extraction_filter will affect subsequent operations.

  • Fully trusted archive:

    my_tarfile.extraction_filter = (lambda member, path: member)
    my_tarfile.extractall()

  • Use the 'data' filter if available, but revert to Python 3.11 behavior ('fully_trusted') if this feature is not available:

    my_tarfile.extraction_filter = getattr(tarfile, 'data_filter',
                                           (lambda member, path: member))
    my_tarfile.extractall()

    (This is an unsafe operation, so it should be spelled out explicitly, ideally with a comment.)

  • Use the 'data' filter; fail if it is not available:

    my_tarfile.extractall(filter=tarfile.data_filter)

    or:

    my_tarfile.extraction_filter = tarfile.data_filter
    my_tarfile.extractall()

  • Use the 'data' filter; warn if it is not available:

    if hasattr(tarfile, 'data_filter'):
        my_tarfile.extractall(filter='data')
    else:
        # remove this when no longer needed
        warn_the_user('Extracting may be unsafe; consider updating Python')
        my_tarfile.extractall()

Security Implications

This proposal improves security, at the expense of backwards compatibility. In particular, it will help users avoid CVE-2007-4559.

How to Teach This

The API, usage notes and tips for further verification will be added to the documentation. These should be usable for users who are familiar with archives in general, but not with the specifics of UNIX filesystems nor the related security issues.

Reference Implementation

See pull request #102953 on GitHub.

Rejected Ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (see gh-65308). There are two major issues with this approach:

  • The name is misleading. General archive operations can never be made “safe” from all kinds of unwanted behavior, without impacting legitimate use cases.
  • It does not solve the problem of unsafe defaults.

However, many of the ideas behind SafeTarFile were reused in this PEP.

Add absolute_path option to tarfile

Issue gh-73974 asks for adding an absolute_path option to extraction methods. This would be a minimal change to formally resolve CVE-2007-4559. It doesn’t go far enough to protect the unaware, nor to empower the diligent and curious.

Other names for the 'tar' filter

The 'tar' filter exposes features specific to UNIX-like filesystems, so it could be named 'unix'. Or 'unix-like', 'nix', '*nix', 'posix'?

Feature-wise, tar format and UNIX-like filesystem are essentially equivalent, so tar is a good name.

Possible Further Work

Adding filters to zipfile and shutil.unpack_archive

For consistency, zipfile and shutil.unpack_archive() could gain support for a filter argument. However, this would require research that this PEP’s author can’t promise for Python 3.12.

Filters for zipfile would probably not help security. Zip is used primarily for cross-platform data bundles, and correspondingly, ZipFile.extract’s defaults are already similar to what a 'data' filter would do. A 'fully_trusted' filter, which would newly allow absolute paths and .. path components, might not be useful for much except a unified unpack_archive API.

Filters should be useful for use cases other than security, but those would usually need custom filter functions, and those would need API that works with both TarInfo and ZipInfo. That is definitely out of scope of this PEP.

If only this PEP is implemented and nothing changes for zipfile, the effect for callers of unpack_archive is that the default for tar files is changing from 'fully_trusted' to the more appropriate 'data'. In the interim period, Python 3.12-3.13 will emit DeprecationWarning. That’s annoying, but there are several ways to handle it: e.g. add a filter argument conditionally, set TarFile.extraction_filter globally, or ignore/suppress the warning until Python 3.14.

Also, since many calls to unpack_archive are likely to be unsafe, there’s hope that the DeprecationWarning will often turn out to be a helpful hint to review affected code.

Thanks

This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Gregory P. Smith, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, Miro Hrončok, and many others.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


Source: https://github.com/python/peps/blob/main/peps/pep-0706.rst

Last modified: 2025-02-01 08:55:40 GMT