Python Enhancement Proposals

PEP 552 – Deterministic pycs

Author:
Benjamin Peterson <benjamin at python.org>
Status:
Final
Type:
Standards Track
Created:
04-Sep-2017
Python-Version:
3.7
Post-History:
07-Sep-2017
Resolution:
Python-Dev message

Abstract

This PEP proposes an extension to the pyc format to make it more deterministic.

Rationale

A reproducible build is one where the same byte-for-byte output is generated every time the same sources are built, even across different machines (naturally subject to the requirement that they have rather similar environments set up). Reproducibility is important for security. It is also a key concept in content-based build systems such as Bazel, which are most effective when the output files’ contents are a deterministic function of the input files’ contents.

The current Python pyc format is the marshaled code object of the module prefixed by a magic number, the source timestamp, and the source file size. The presence of a source timestamp means that a pyc is not a deterministic function of the input file’s contents: it also depends on volatile metadata, the mtime of the source. Thus, pycs are a barrier to proper reproducibility.

Distributors of Python code are currently stuck with the options of

  1. not distributing pycs and losing the caching advantages
  2. distributing pycs and losing reproducibility
  3. carefully giving all Python source files a deterministic timestamp (see, for example, https://github.com/python/cpython/pull/296)
  4. doing a complicated mixture of 1. and 2. like generating pycs at installation time

None of these options are very attractive. This PEP proposes allowing the timestamp to be replaced with a deterministic hash. The current timestamp invalidation method will remain the default, though. Despite its nondeterminism, timestamp invalidation works well for many workflows and use cases. The hash-based pyc format can impose the cost of reading and hashing every source file, which is more expensive than simply checking timestamps. Thus, for now, we expect it to be used mainly by distributors and power use cases.
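The difference between the two invalidation schemes can be observed with the `py_compile` API as it shipped in Python 3.7. The following is a minimal sketch, not part of the specification: the helper name `compile_header`, the file names, and the two mtime values are made up for illustration, and only the 16-byte header is compared because the marshaled body can be affected by the other sources of nondeterminism noted below.

```python
import os
import py_compile
import tempfile
from pathlib import Path


def compile_header(source_path, mode, mtime):
    """Set the source mtime, compile the file, and return the 16-byte pyc header.

    Hypothetical helper for illustration only.
    """
    os.utime(source_path, (mtime, mtime))
    pyc_path = str(source_path) + "c"  # illustrative output location
    py_compile.compile(
        str(source_path), cfile=pyc_path, doraise=True, invalidation_mode=mode
    )
    return Path(pyc_path).read_bytes()[:16]


Mode = py_compile.PycInvalidationMode

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.py"
    src.write_text("ANSWER = 42\n")

    # Same source contents, two different modification times.
    ts_a = compile_header(src, Mode.TIMESTAMP, 1_000_000_000)
    ts_b = compile_header(src, Mode.TIMESTAMP, 1_500_000_000)
    hash_a = compile_header(src, Mode.CHECKED_HASH, 1_000_000_000)
    hash_b = compile_header(src, Mode.CHECKED_HASH, 1_500_000_000)

# Timestamp-based headers embed the mtime, so they differ between the runs.
timestamp_headers_differ = ts_a != ts_b
# Hash-based headers depend only on the source contents, so they match.
hash_headers_identical = hash_a == hash_b
```

Comparing whole files instead of headers is the stricter reproducibility check a distributor would want, but it also exercises marshal behavior that this PEP does not address.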

(Note there are other problems [1] [2] we do not address here that can make pycs non-deterministic.)

Specification

The pyc header currently consists of 3 32-bit words. We will expand it to 4. The first word will continue to be the magic number, versioning the bytecode and pyc format. The second word, conceptually the new word, will be a bit field. The interpretation of the rest of the header and the invalidation behavior of the pyc depend on the contents of the bit field.

If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the third and fourth words will be the timestamp and file size respectively, and invalidation will be done by comparing the metadata of the source file with that in the header.

If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call the second lowest bit the check_source flag. Following the bit field is a 64-bit hash of the source file. We will compute a SipHash, with a hardcoded key, of the contents of the source file. Another fast hash like MD5 or BLAKE2 would also work. We choose SipHash because Python already has a builtin implementation of it from PEP 456, although an interface that allows picking the SipHash key must be exposed to Python. Security of the hash is not a concern, though we pass over completely-broken hashes like MD5 to ease auditing of Python in controlled environments.
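The two header layouts described above can be decoded with a few lines of `struct` code. This is a sketch under stated assumptions: the `PycHeader` type and `parse_pyc_header` function are hypothetical names, and the little-endian byte order reflects how CPython writes the header words rather than anything this text specifies.

```python
import struct
from typing import NamedTuple, Optional


class PycHeader(NamedTuple):
    """Decoded 16-byte pyc header (hypothetical container for illustration)."""
    magic: bytes
    flags: int
    hash_based: bool
    check_source: bool
    mtime: Optional[int]           # timestamp-based pycs only
    source_size: Optional[int]     # timestamp-based pycs only
    source_hash: Optional[bytes]   # hash-based pycs only


def parse_pyc_header(data: bytes) -> PycHeader:
    """Split the four 32-bit header words according to the bit field."""
    if len(data) < 16:
        raise ValueError("a pyc header is 16 bytes long")
    magic = data[:4]
    # Assumption: words are little-endian, as CPython writes them.
    (flags,) = struct.unpack("<I", data[4:8])
    hash_based = bool(flags & 0b01)    # lowest bit
    check_source = bool(flags & 0b10)  # second lowest bit
    if hash_based:
        # Words three and four together hold the 64-bit source hash.
        return PycHeader(magic, flags, True, check_source, None, None, data[8:16])
    mtime, size = struct.unpack("<II", data[8:16])
    return PycHeader(magic, flags, False, False, mtime, size, None)
```

A caller would typically pass the first 16 bytes of a `.pyc` file and branch on `hash_based` before interpreting the remaining fields.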

When Python encounters a hash-based pyc, its behavior depends on the setting of the check_source flag. If the check_source flag is set, Python will determine the validity of the pyc by hashing the source file and comparing the hash with the expected hash in the pyc. If the pyc needs to be regenerated, it will be regenerated as a hash-based pyc again with the check_source flag set.

For hash-based pycs with the check_source flag unset, Python will simply load the pyc without checking the hash of the source file. The expectation in this case is that some external system (e.g., the local Linux distribution’s package manager) is responsible for keeping pycs up to date, so Python itself doesn’t have to check. Even when validation is disabled, the hash field should be set correctly, so out-of-band consistency checkers can verify the up-to-dateness of the pyc. Note also that the PEP 3147 edict that pycs without corresponding source files not be loaded will still be enforced for hash-based pycs.
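The default validity rule for hash-based pycs can be sketched as a small function. The name `hash_pyc_is_valid` is hypothetical, the little-endian flag word is an assumption based on CPython's pyc writer, and `importlib.util.source_hash` is used as it exists in Python 3.7 and later. The real import machinery does considerably more than this.

```python
import importlib.util


def hash_pyc_is_valid(pyc_header: bytes, source_bytes: bytes) -> bool:
    """Apply the default invalidation rule to a hash-based pyc header.

    Hypothetical helper: returns True when the interpreter would accept
    the pyc without regenerating it.
    """
    flags = int.from_bytes(pyc_header[4:8], "little")  # assumed byte order
    if not flags & 0b01:
        raise ValueError("not a hash-based pyc")
    if not flags & 0b10:
        # check_source unset: trust the pyc, an external system keeps it fresh.
        return True
    # check_source set: rehash the source and compare with the stored hash.
    return pyc_header[8:16] == importlib.util.source_hash(source_bytes)
```

Note that an unchecked pyc is accepted even when the source has changed, which is exactly why the stored hash should still be correct for external checkers.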

The programmatic APIs of py_compile and compileall will support generation of hash-based pycs. Principally, py_compile will define a new enumeration corresponding to all the available pyc invalidation modes:

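    from enum import Enum, auto

    class PycInvalidationMode(Enum):
        # Member values are not specified here; auto() is illustrative.
        TIMESTAMP = auto()
        CHECKED_HASH = auto()
        UNCHECKED_HASH = auto()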

py_compile.compile, compileall.compile_dir, and compileall.compile_file will all gain an invalidation_mode parameter, which accepts a value of the PycInvalidationMode enumeration.
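As an illustration of the `invalidation_mode` parameter, the sketch below compiles one source file in each mode and reads back the bit field from the resulting header. The file names and the `flags_by_mode` dictionary are made up for the example, and the little-endian read is an assumption based on CPython's pyc writer.

```python
import py_compile
import tempfile
from pathlib import Path

flags_by_mode = {}

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mod.py"
    src.write_text("x = 1\n")

    for mode in py_compile.PycInvalidationMode:
        # Write each variant to its own (illustrative) file name.
        pyc_path = py_compile.compile(
            str(src),
            cfile=str(Path(tmp) / f"{mode.name}.pyc"),
            doraise=True,
            invalidation_mode=mode,
        )
        header = Path(pyc_path).read_bytes()[:16]
        # The second 32-bit word is the bit field.
        flags_by_mode[mode.name] = int.from_bytes(header[4:8], "little")

# TIMESTAMP leaves the bit field at 0; CHECKED_HASH sets both low bits;
# UNCHECKED_HASH sets only the lowest bit.
```

`compileall.compile_dir` and `compileall.compile_file` accept the same `invalidation_mode` argument.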

The compileall tool will be extended with a new command-line option, --invalidation-mode, to generate hash-based pycs with and without the check_source bit set. --invalidation-mode will be a tristate option taking values timestamp (the default), checked-hash, and unchecked-hash corresponding to the values of PycInvalidationMode.
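The command-line option can be exercised from a script by running `compileall` as a module. This sketch assumes `sys.executable` points at a usable interpreter and that pycs land in the default `__pycache__` directory next to the source; the file name `pkg.py` is arbitrary.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pkg.py").write_text("VALUE = 1\n")

    # Equivalent to: python -m compileall -q --invalidation-mode unchecked-hash <dir>
    subprocess.run(
        [sys.executable, "-m", "compileall", "-q",
         "--invalidation-mode", "unchecked-hash", tmp],
        check=True,
    )

    # Read the bit field (second header word) of every generated pyc.
    cli_flags = [
        int.from_bytes(pyc.read_bytes()[4:8], "little")
        for pyc in Path(tmp).glob("__pycache__/*.pyc")
    ]

# Each entry has only the lowest bit set: hash-based, check_source unset.
```

A packaging script would normally pass the installation directory instead of a temporary one.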

importlib.util will be extended with a source_hash(source) function that computes the hash used by the pyc writing code for a bytestring source.
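`source_hash` is what makes an out-of-band consistency check straightforward: recompute the hash from the source bytes and compare it with the third and fourth header words. A minimal sketch, with illustrative file names and assuming the pyc was written by `py_compile` in a hash-based mode:

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "checked.py"
    src.write_text("greeting = 'hello'\n")

    pyc_path = py_compile.compile(
        str(src),
        cfile=str(Path(tmp) / "checked.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )

    stored_hash = Path(pyc_path).read_bytes()[8:16]
    # source_hash takes the raw bytes of the source file and returns 8 bytes.
    recomputed_hash = importlib.util.source_hash(src.read_bytes())
    pyc_is_up_to_date = stored_hash == recomputed_hash

    # After an edit, the stored hash no longer matches the source.
    src.write_text("greeting = 'goodbye'\n")
    stale_after_edit = stored_hash != importlib.util.source_hash(src.read_bytes())
```

This is the kind of check a package manager could run, since the interpreter itself skips it for unchecked pycs.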

Runtime configuration of hash-based pyc invalidation will be facilitated by a new --check-hash-based-pycs interpreter option. This is a tristate option, which may take 3 values: default, always, and never. The default value, default, means the check_source flag in hash-based pycs determines invalidation as described above. always causes the interpreter to hash the source file for invalidation regardless of the value of the check_source bit. never causes the interpreter to always assume hash-based pycs are valid. When --check-hash-based-pycs=never is in effect, unchecked hash-based pycs will be regenerated as unchecked hash-based pycs. Timestamp-based pycs are unaffected by --check-hash-based-pycs.
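How the three option values interact with the check_source flag can be summarized as a decision function. This is only a sketch of the rule as stated above; `must_hash_source` is a hypothetical name and not an interpreter API.

```python
def must_hash_source(flags: int, policy: str = "default") -> bool:
    """Return True if the source must be hashed to validate the pyc.

    `flags` is the bit field from the pyc header; `policy` mirrors the
    value given to --check-hash-based-pycs. Hypothetical helper.
    """
    if policy not in ("default", "always", "never"):
        raise ValueError(f"unknown policy: {policy!r}")
    hash_based = bool(flags & 0b01)
    if not hash_based:
        # Timestamp-based pycs are unaffected by the option.
        return False
    if policy == "always":
        return True
    if policy == "never":
        return False
    # policy == "default": defer to the check_source flag in the pyc.
    return bool(flags & 0b10)
```

For example, `must_hash_source(0b01, "always")` is true even though the pyc itself is unchecked, while `must_hash_source(0b11, "never")` is false.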

References

[1] http://benno.id.au/blog/2013/01/15/python-determinism
[2] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186

Credits

The author would like to thank Gregory P. Smith, Christian Heimes, and Steve Dower for useful conversations on the topic of this PEP.

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0552.rst

Last modified: 2025-02-01 08:59:27 GMT

