Data serialized using the pickle module must be portable across Python versions. It should also support the latest language features as well as implementation-specific features. For this reason, the pickle module knows about several protocols (currently numbered from 0 to 3), each of which appeared in a different Python version. Using a low-numbered protocol version allows data to be exchanged with old Python versions, while using a high-numbered protocol allows access to newer features and sometimes more efficient resource use (both CPU time required for (de)serializing, and disk size / network bandwidth required for data transfer).
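As a rough illustration of the size trade-off between protocols, the following sketch pickles one object with every protocol the running interpreter offers. The `payload` value is a hypothetical sample, and the set of protocol numbers printed depends on the interpreter (recent versions report more than the 0 to 3 range described here).

```python
import pickle

# Hypothetical sample payload, chosen only for illustration.
payload = {"name": "spam", "values": list(range(100))}

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(payload, protocol=protocol)
    # Every protocol must round-trip to an equal object.
    assert pickle.loads(data) == payload
    sizes[protocol] = len(data)

for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```

The text-based protocol 0 is expected to produce the largest output for this kind of payload, since each small integer is written as a decimal line rather than as a compact binary opcode.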
The most recent protocol, coincidentally named protocol 3, appeared with Python 3.0 and supports the new incompatible features in the language (mainly, unicode strings by default and the new bytes object). The opportunity was not taken at the time to improve the protocol in other ways.
This PEP is an attempt to foster a number of incremental improvements in a new pickle protocol version. The PEP process is used in order to gather as many improvements as possible, because the introduction of a new pickle protocol should be a rare occurrence.
Traditionally, when unpickling an object from a stream (by calling load() rather than loads()), many small read() calls can be issued on the file-like object, with a potentially huge performance impact.
Protocol 4, by contrast, features binary framing. The general structure of a pickle is thus the following:
   +------+------+
   | 0x80 | 0x04 |              protocol header (2 bytes)
   +------+------+
   |  OP  |                     FRAME opcode (1 byte)
   +------+------+-----------+
   | MM MM MM MM MM MM MM MM |  frame size (8 bytes, little-endian)
   +------+------------------+
   | .... |                     first frame contents (M bytes)
   +------+
   |  OP  |                     FRAME opcode (1 byte)
   +------+------+-----------+
   | NN NN NN NN NN NN NN NN |  frame size (8 bytes, little-endian)
   +------+------------------+
   | .... |                     second frame contents (N bytes)
   +------+
     etc.
To keep the implementation simple, it is forbidden for a pickle opcode to straddle frame boundaries. The pickler takes care not to produce such pickles, and the unpickler refuses them. Also, there is no “last frame” marker. The last frame is simply the one which ends with a STOP opcode.
A well-written C implementation doesn’t need additional memory copies for the framing layer, preserving general (un)pickling efficiency.
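The effect of framing on the number of read() calls can be observed with a small experiment. In this sketch, `CountingReader` and `count_reads` are hypothetical helpers written for the illustration; the exact counts depend on the unpickler implementation, so only the relative difference between protocol 3 and protocol 4 is meaningful.

```python
import io
import pickle


class CountingReader:
    """Hypothetical file-like wrapper that counts read() and readline() calls."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)
        self.calls = 0

    def read(self, size=-1):
        self.calls += 1
        return self._buffer.read(size)

    def readline(self):
        self.calls += 1
        return self._buffer.readline()


def count_reads(obj, protocol):
    """Pickle obj, unpickle it through CountingReader, return the call count."""
    reader = CountingReader(pickle.dumps(obj, protocol=protocol))
    assert pickle.load(reader) == obj
    return reader.calls


sample = list(range(1000))
unframed_calls = count_reads(sample, protocol=3)
framed_calls = count_reads(sample, protocol=4)
print(f"protocol 3: {unframed_calls} calls, protocol 4: {framed_calls} calls")
```

The wrapper deliberately offers no peek() method, so the unpickler cannot fall back on its own prefetching and the difference comes from framing alone.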
Note
How the pickler decides to partition the pickle stream into frames is an implementation detail. For example, “closing” a frame as soon as it reaches ~64 KiB is a reasonable choice for both performance and pickle size overhead.
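The frame layout described above can be walked by hand. The following is a minimal sketch, assuming that every byte after the 2-byte protocol header belongs to a frame (which is what the pickler is expected to produce for a list of small integers); `iter_frames` is a hypothetical helper, not part of the pickle module, and it rejects pickles that contain unframed sections.

```python
import pickle
import struct

PROTO = 0x80   # protocol header opcode byte
FRAME = 0x95   # FRAME opcode byte


def iter_frames(data):
    """Yield (offset, size) for each frame of a fully framed pickle.

    Assumption: all data after the 2-byte header is framed. Anything else
    raises ValueError.
    """
    if data[0] != PROTO or data[1] < 4:
        raise ValueError("not a protocol 4+ pickle")
    offset = 2
    while offset < len(data):
        if data[offset] != FRAME:
            raise ValueError(f"expected FRAME opcode at offset {offset}")
        # The frame size is an 8-byte little-endian unsigned integer.
        (size,) = struct.unpack_from("<Q", data, offset + 1)
        yield offset, size
        offset += 1 + 8 + size


big = pickle.dumps(list(range(100_000)), protocol=4)
frames = list(iter_frames(big))
for offset, size in frames:
    print(f"frame at offset {offset}: {size} bytes")
```

How many frames are printed, and how large each one is, reflects the partitioning choice of the pickler in use rather than anything mandated by the protocol.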
The GLOBAL opcode, which is still used in protocol 3, uses the so-called “text” mode of the pickle protocol, which involves looking for newlines in the pickle stream. It also complicates the implementation of binary framing.
Protocol 4 forbids use of the GLOBAL opcode and replaces it with STACK_GLOBAL, a new opcode which takes its operand from the stack.
By default, pickle is only able to serialize module-global functions and classes. Supporting other kinds of objects, such as unbound methods [4], is a common request. In fact, third-party support for some of them, such as bound methods, is implemented in the multiprocessing module [5].
The __qualname__ attribute from PEP 3155 makes it possible to look up many more objects by name. Making the STACK_GLOBAL opcode accept dot-separated names would allow the standard pickle implementation to support all those kinds of objects.
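A dotted lookup through STACK_GLOBAL can be observed on an interpreter that implements protocol 4. The sketch below uses `json.JSONEncoder.default` purely as a convenient standard-library example of a function defined inside a class; the choice of that object is an assumption of the illustration, not something this PEP prescribes.

```python
import json
import pickle
import pickletools

# A plain function defined inside a class: its __qualname__ contains a dot.
target = json.JSONEncoder.default

data = pickle.dumps(target, protocol=4)
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]
names = [name for name, _arg in ops]
# The two strings pushed before STACK_GLOBAL: module name, then qualified name.
strings = [arg for name, arg in ops if name == "SHORT_BINUNICODE"]
restored = pickle.loads(data)

print(names)
print(strings)
```

The module name and the dotted qualified name travel as ordinary string objects on the stack, so no newline-terminated text operand is needed.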
Current protocol versions export object sizes for various built-in types (str, bytes) as 32-bit ints. This forbids serialization of large data [1]. New opcodes are required to support very large bytes and str objects.
Many common built-in types (such as str, bytes, dict, list, tuple) have dedicated opcodes to improve resource consumption when serializing and deserializing them; however, sets and frozensets don’t. Adding such opcodes would be an obvious improvement. Also, dedicated set support could help remove the current impossibility of pickling self-referential sets [2].
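The difference between the generic path and dedicated set opcodes can be inspected with pickletools on an interpreter that implements protocol 4. This is a minimal sketch; `opcode_names` is a hypothetical helper introduced for the illustration.

```python
import pickle
import pickletools


def opcode_names(obj, protocol):
    """Return the names of the opcodes used to pickle obj."""
    data = pickle.dumps(obj, protocol=protocol)
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


set_ops_v3 = opcode_names({1, 2, 3}, protocol=3)
set_ops_v4 = opcode_names({1, 2, 3}, protocol=4)
frozen_ops_v4 = opcode_names(frozenset({1, 2, 3}), protocol=4)

print("protocol 3 set:      ", set_ops_v3)
print("protocol 4 set:      ", set_ops_v4)
print("protocol 4 frozenset:", frozen_ops_v4)
```

Under protocol 3 the set is rebuilt by calling its constructor through REDUCE, whereas protocol 4 pushes an empty set and fills it in place.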
Currently, classes whose __new__ mandates the use of keyword-only arguments cannot be pickled (or, rather, unpickled) [3]. Both a new special method (__getnewargs_ex__) and a new opcode (NEWOBJ_EX) are needed. The __getnewargs_ex__ method, if it exists, must return a two-tuple (args, kwargs) where the first item is the tuple of positional arguments and the second item is the dict of keyword arguments for the class’s __new__ method.
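A class can opt in to this mechanism as sketched below. `Point` is a hypothetical class invented for the example. To stay independent of how the class is imported, the sketch inspects the value returned by `__reduce_ex__(4)` instead of pickling the instance; the assumption, based on CPython's behaviour, is that this reduce value is what a protocol 4 pickler translates into a NEWOBJ_EX opcode.

```python
import copyreg


class Point:
    """Hypothetical class whose __new__ only accepts keyword arguments."""

    def __new__(cls, *, x, y):
        self = super().__new__(cls)
        self.x = x
        self.y = y
        return self

    def __getnewargs_ex__(self):
        # (args, kwargs): no positional arguments, two keyword arguments.
        return (), {"x": self.x, "y": self.y}


p = Point(x=1, y=2)
func, args = p.__reduce_ex__(4)[:2]
# Rebuild the object the same way NEWOBJ_EX does: cls.__new__(cls, *args, **kwargs).
rebuilt = func(*args)
print(func.__name__, args[1:], (rebuilt.x, rebuilt.y))
```

When `Point` lives in an importable module, `pickle.loads(pickle.dumps(p, protocol=4))` is expected to go through the same reconstruction.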
Short str objects currently have their length coded as a 4-byte integer, which is wasteful. A specific opcode with a 1-byte length would make many pickles smaller.
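Which length prefix a given string receives can be checked with pickletools on an interpreter that implements protocol 4. In this sketch, `string_opcode` is a hypothetical helper that reports the opcode carrying the string.

```python
import pickle
import pickletools


def string_opcode(text, protocol=4):
    """Return the name of the opcode that carries text in the pickle."""
    data = pickle.dumps(text, protocol=protocol)
    for op, _arg, _pos in pickletools.genops(data):
        if "UNICODE" in op.name:
            return op.name
    raise ValueError("no unicode opcode found")


short_op = string_opcode("spam")              # fits in a 1-byte length
long_op = string_opcode("x" * 300)            # needs the 4-byte length
legacy_op = string_opcode("spam", protocol=3)

print(short_op, long_op, legacy_op)
```

Each short string saves three bytes of length prefix, which adds up in pickles dominated by attribute names and dictionary keys.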
The PUT opcodes all require an explicit index to select in which entry of the memo dictionary the top-of-stack is memoized. However, in practice those numbers are allocated in sequential order. A new opcode, MEMOIZE, will instead store the top-of-stack at the index equal to the current size of the memo dictionary. This allows for shorter pickles, since PUT opcodes are emitted for all non-atomic datatypes.
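The replacement of indexed PUT opcodes by MEMOIZE can be counted directly on an interpreter that implements protocol 4. This is a sketch only; `opcode_counts` is a hypothetical helper and `nested` is an arbitrary sample value.

```python
import pickle
import pickletools
from collections import Counter


def opcode_counts(obj, protocol):
    """Count how often each opcode appears in the pickle of obj."""
    data = pickle.dumps(obj, protocol=protocol)
    return Counter(op.name for op, _arg, _pos in pickletools.genops(data))


nested = [["a"], ["b"]]
v3 = opcode_counts(nested, protocol=3)
v4 = opcode_counts(nested, protocol=4)

print("protocol 3 BINPUT: ", v3["BINPUT"])
print("protocol 4 MEMOIZE:", v4["MEMOIZE"])
```

BINPUT occupies two bytes (opcode plus index) while MEMOIZE occupies one, so every memoized object becomes at least one byte cheaper.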
These reflect the state of the proposed implementation (thanks mostly to Alexandre Vassalotti’s work):
* FRAME: introduce a new frame (followed by the 8-byte frame size and the frame contents).
* SHORT_BINUNICODE: push a UTF-8-encoded str object with a one-byte size prefix (therefore less than 256 bytes long).
* BINUNICODE8: push a UTF-8-encoded str object with an eight-byte size prefix (for strings longer than 2**32 bytes, which therefore cannot be serialized using BINUNICODE).
* BINBYTES8: push a bytes object with an eight-byte size prefix (for bytes objects longer than 2**32 bytes, which therefore cannot be serialized using BINBYTES).
* EMPTY_SET: push a new empty set object on the stack.
* ADDITEMS: add the topmost stack items to the set (to be used with EMPTY_SET).
* FROZENSET: create a frozenset object from the topmost stack items, and push it on the stack.
* NEWOBJ_EX: take the three topmost stack items cls, args and kwargs, and push the result of calling cls.__new__(*args, **kwargs).
* STACK_GLOBAL: take the two topmost stack items module_name and qualname, and push the result of looking up the dotted qualname in the module named module_name.
* MEMOIZE: store the top-of-stack object in the memo dictionary with an index equal to the current size of the memo dictionary.

Serhiy Storchaka suggested replacing framing with a special PREFETCH opcode (with a 2- or 4-byte argument) to declare known pickle chunks explicitly. Large data may be pickled outside such chunks. A naïve unpickler should be able to skip the PREFETCH opcode and still decode pickles properly, but good error handling would require checking that the PREFETCH length falls on an opcode boundary.
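The ten new opcodes summarized earlier (FRAME through MEMOIZE) can be cross-checked against an interpreter that implements protocol 4. The sketch below assumes that pickletools records, in the `proto` attribute of each opcode description, the protocol that introduced the opcode.

```python
import pickletools

# Collect the opcodes whose description says they first appeared in protocol 4.
protocol4_opcodes = sorted(
    op.name for op in pickletools.opcodes if op.proto == 4
)
print(protocol4_opcodes)
```

PREFETCH does not appear in the result, since it was only an alternative idea.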
In alphabetic order:
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-3154.rst
Last modified: 2025-02-01 08:59:27 GMT