
In computer science, in the context of data storage and transmission, serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection) and "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of the associated methods with which they were previously inextricably linked.
This process of serializing an object is also called deflating or marshalling an object.[2] The opposite operation, extracting a data structure from a series of bytes, is deserialization (also called inflating or unmarshalling).
Serialization provides:
For some of these features to be useful, architecture independence must be maintained. For example, for maximal use of distribution, a computer running on a different hardware architecture should be able to reliably reconstruct a serialized data stream, regardless of endianness. This means that the simpler and faster procedure of directly copying the memory layout of the data structure cannot work reliably for all architectures. Serializing the data structure in an architecture-independent format means that we do not suffer from the problems of byte ordering, memory layout, or simply different ways of representing data structures in different programming languages.
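As a minimal sketch of an architecture-independent encoding, the following Python fragment fixes the byte order and field sizes explicitly instead of copying the in-memory layout. The record layout (a 32-bit identifier followed by a 64-bit float) and the names encode_record and decode_record are assumptions made for illustration only.

```python
import struct

# Hypothetical record layout: unsigned 32-bit id, then a 64-bit float.
# ">" selects big-endian byte order with standard sizes and no padding,
# so the bytes are the same on every host architecture.
RECORD_FORMAT = ">Id"

def encode_record(record_id, reading):
    """Serialize one record into 12 bytes with a fixed byte order."""
    return struct.pack(RECORD_FORMAT, record_id, reading)

def decode_record(data):
    """Rebuild the (id, reading) pair from bytes produced by encode_record."""
    record_id, reading = struct.unpack(RECORD_FORMAT, data)
    return record_id, reading

payload = encode_record(1, 2.5)
```

Because the format string names the byte order, a reader on a little-endian machine and a reader on a big-endian machine decode the same values from the same bytes.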
Inherent to any serialization scheme is that, because the encoding of the data is by definition serial, extracting one part of the serialized data structure requires that the entire object be read from start to end, and reconstructed. In many applications this linearity is an asset, because it enables simple, common I/O interfaces to be utilized to hold and pass on the state of an object. In applications where higher performance is an issue, it can make sense to expend more effort to deal with a more complex, non-linear storage organization.
Even on a single machine, primitive pointer objects are too fragile to save, because the objects to which they point may be reloaded to a different location in memory. To deal with this, the serialization process includes a step called unswizzling or pointer unswizzling, and the deserialization process includes a step called pointer swizzling.
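The idea can be sketched in Python, where object references play the role of pointers. In this illustrative example (the Node class and the functions unswizzle and swizzle are hypothetical names, not part of any library), references are replaced by positions in a table when saving and turned back into references when loading.

```python
class Node:
    """A singly linked node; next_node is a live object reference."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def unswizzle(head):
    """Replace references with table indexes so the records can be stored."""
    index_of = {}  # id(node) -> position in the table
    nodes = []
    node = head
    # Stop at the end of the chain or when a cycle leads back to a seen node.
    while node is not None and id(node) not in index_of:
        index_of[id(node)] = len(nodes)
        nodes.append(node)
        node = node.next_node
    return [
        {"value": n.value,
         "next": index_of[id(n.next_node)] if n.next_node is not None else None}
        for n in nodes
    ]

def swizzle(records):
    """Rebuild live references from the stored table indexes."""
    nodes = [Node(r["value"]) for r in records]
    for node, r in zip(nodes, records):
        if r["next"] is not None:
            node.next_node = nodes[r["next"]]
    return nodes[0] if nodes else None

# Two nodes that point at each other (a cycle).
a = Node("a")
b = Node("b", a)
a.next_node = b
records = unswizzle(a)
restored = swizzle(records)
```

The stored records contain only values and integers, so they remain valid no matter where the rebuilt objects end up in memory.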
Since both serializing and deserializing can be driven from common code (for example, the Serialize function in Microsoft Foundation Classes), it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution. It is useful in the programming of user interfaces whose contents are time-varying: graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.
Serialization, however, breaks the opacity of an abstract data type by potentially exposing private implementation details. Trivial implementations which serialize all data members may violate encapsulation.[3]
To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs' serialization formats a trade secret. Some deliberately obfuscate or even encrypt the serialized data. Yet, interoperability requires that applications be able to understand each other's serialization formats. Therefore, remote method call architectures such as CORBA define their serialization formats in detail.
The Xerox Network Systems Courier technology in the early 1980s influenced the first widely adopted standard. Sun Microsystems published the External Data Representation (XDR) in 1987.[4]
In the late 1990s, a push to provide an alternative to the standard serialization protocols started: XML was used to produce a human-readable text-based encoding. Such an encoding can be useful for persistent objects that may be read and understood by humans, or communicated to other systems regardless of programming language. It has the disadvantage of losing the more compact, byte-stream-based encoding, but by this point larger storage and transmission capacities made file size less of a concern than in the early days of computing. Binary XML has been proposed as a compromise which is not readable by plain-text editors, but is more compact than regular XML. In the 2000s, XML was often used for asynchronous transfer of structured data between client and server in Ajax web applications.
JSON is a more lightweight plain-text alternative to XML which is also commonly used for client-server communication in web applications. JSON is based on JavaScript syntax, but is supported in other programming languages as well.
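As a brief illustration of JSON support outside JavaScript, the following sketch uses Python's standard json module to turn a dictionary into JSON text and back. The order record is made-up sample data.

```python
import json

# Made-up sample data built only from types that JSON can represent.
order = {"id": 17, "items": ["pen", "ink"], "paid": True, "note": None}

# Serialize to plain text; sort_keys makes the key order deterministic.
text = json.dumps(order, sort_keys=True)

# Deserialize the text back into Python objects.
restored = json.loads(text)
```

Python's True and None are written as the JSON literals true and null, and are mapped back when the text is parsed.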
Another alternative, YAML, is effectively a superset of JSON and includes features that make it more powerful for serialization, more "human friendly," and potentially more compact. These features include a notion of tagging data types, support for non-hierarchical data structures, the option to structure data with indentation, and multiple forms of scalar data quoting.
Another human-readable serialization format is the property list format used in NeXTSTEP, GNUstep, and Mac OS X Cocoa.
For large-volume scientific datasets, such as satellite data and output of numerical climate, weather, or ocean models, specific binary serialization standards have been developed, e.g. HDF, netCDF and the older GRIB.
Several object-oriented programming languages directly support object serialization (or object archival), either by syntactic sugar elements or by providing a standard interface for doing so.
Some of these programming languages are Ruby, Smalltalk, Python, PHP, Objective-C, Java, and the .NET family of languages.
There are also libraries available that add serialization support to languages that lack native support for it.
In the .NET languages, classes can be serialized and deserialized by adding the Serializable attribute to the class.
If new members are added to a serializable class, they can be tagged with the OptionalField attribute to allow previous versions of the object to be deserialized without error. This attribute affects only deserialization, and prevents the runtime from throwing an exception if a member is missing from the serialized stream. A member can also be marked with the NonSerialized attribute to indicate that it should not be serialized. This will allow the details of those members to be kept secret.
To modify the default deserialization (for example, to automatically initialize a member marked NonSerialized), the class must implement the IDeserializationCallback interface and define the IDeserializationCallback.OnDeserialization method.
Objects may be serialized in binary format for deserialization by other .NET applications. There are also third-party binary serializers that are documented and portable, and that use less memory and CPU.
The framework also provides the SoapFormatter and XmlSerializer objects to support serialization in human-readable, cross-platform XML.
In the Objective-C programming language, serialization (more commonly known as archiving) is achieved by overriding the write: and read: methods in the Object root class. (NB This is in the GNU runtime variant of Objective-C. In the NeXT-style runtime, the implementation is very similar.)
Java provides automatic serialization which requires that the object be marked by implementing the java.io.Serializable interface. Implementing the interface marks the class as "okay to serialize," and Java then handles serialization internally. There are no serialization methods defined on the Serializable interface, but a serializable class can optionally define methods with certain special names and signatures that, if defined, will be called as part of the serialization/deserialization process. The language also allows the developer to override the serialization process more thoroughly by implementing another interface, the Externalizable interface, which includes two special methods that are used to save and restore the object's state.
There are three primary reasons why objects are not serializable by default and must implement the Serializable interface to access Java's serialization mechanism.
A Thread object is tied to the state of the current JVM. There is no context in which a deserialized Thread object would maintain useful semantics.
The standard encoding method uses a simple translation of the fields into a byte stream. Primitives as well as non-transient, non-static referenced objects are encoded into the stream. Each object that is referenced by the serialized object and not marked as transient must also be serialized; and if any object in the complete graph of non-transient object references is not serializable, then serialization will fail. The developer can influence this behavior by marking objects as transient, or by redefining the serialization for an object so that some portion of the reference graph is truncated and not serialized.
It is possible to serialize Java objects through JDBC and store them into a database.[5]
While Swing components do implement the Serializable interface, they are not portable between different versions of the Java Virtual Machine. As such, a Swing component, or any component which inherits it, may be serialized to an array of bytes, but it is not guaranteed that this storage will be readable on another machine.
ColdFusion allows data structures to be serialized to WDDX with the <cfwddx> tag and to JSON with the SerializeJSON() function.
OCaml's standard library provides marshalling through the Marshal module and the Pervasives functions output_value and input_value. While OCaml programming is statically type-checked, uses of the Marshal module may break type guarantees, as there is no way to check whether an unmarshalled stream represents objects of the expected type. In OCaml it is difficult to marshal a function or a data structure which contains a function (e.g. an object which contains a method), because executable code in functions cannot be transmitted across different programs. (There is a flag to marshal the code position of a function, but it can only be unmarshalled in exactly the same program.)
Several Perl modules available from CPAN provide serialization mechanisms, including Storable and FreezeThaw.
Storable includes functions to serialize and deserialize Perl data structures to and from files or Perl scalars.
In addition to serializing directly to files, Storable includes the freeze function to return a serialized copy of the data packed into a scalar, and thaw to deserialize such a scalar. This is useful for sending a complex data structure over a network socket or storing it in a database.
When serializing structures with Storable, there are network-safe functions that always store their data in a format that is readable on any computer, at a small cost of speed. These functions are named nstore, nfreeze, etc. There are no "n" functions for deserializing these structures; the regular thaw and retrieve deserialize structures serialized with the "n" functions and their machine-specific equivalents.
C and C++ do not provide direct support for serialization. It is, however, possible to write your own serialization functions, since both languages support writing binary data. In addition, compiler-based solutions, such as the ODB ORM system for C++, are capable of automatically producing serialization code with few or no modifications to class declarations. Another popular serialization framework is Boost.Serialization[6] from the Boost Framework.
Python implements serialization through the standard library module pickle and, to a lesser extent, the older marshal. Unlike pickle, marshal does offer the ability to serialize Python code objects. In addition, Python offers the cPickle module, which (as the name suggests) is a C implementation of the pickle module. It can be up to 1000 times faster than the pure Python pickle module, but has a few limitations. The shelve module is based on the pickle module and can be regarded as a serialized Python dictionary.
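A minimal sketch of a pickle round trip follows, written for the pickle module as it exists in Python 3 (where the separate cPickle module is no longer needed). The sample graph is made up; it contains a shared list and a reference back to itself to show that pickle preserves both. Note that unpickling data from an untrusted source is unsafe.

```python
import pickle

# Made-up object graph: one list reachable through two keys, plus a cycle.
shared = ["x"]
graph = {"first": shared, "second": shared}
graph["self"] = graph

# Serialize to a bytes object, then rebuild an independent copy from it.
data = pickle.dumps(graph)
clone = pickle.loads(data)

# Shared references and the cycle survive the round trip.
same_list_shared = clone["first"] is clone["second"]
cycle_preserved = clone["self"] is clone
```

The clone is a separate object from the original, but its internal structure of shared and cyclic references matches the original's.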
As of version 2.6, Python's standard library also includes support for JSON and for XML-encoded property lists. (See json and plistlib, respectively.) However, these modules only handle basic Python types like strings, integers, and collections of basic types, whereas pickle is intended for arbitrary objects.
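The difference in supported types can be illustrated with a short sketch. The sample record is made up and holds a set and a date, neither of which the json module encodes by default, while pickle handles both.

```python
import datetime
import json
import pickle

# Made-up record containing types outside JSON's basic set.
record = {"tags": {"a", "b"}, "when": datetime.date(2012, 3, 1)}

# json.dumps raises TypeError for values it does not know how to encode.
try:
    json.dumps(record)
    json_accepts_record = True
except TypeError:
    json_accepts_record = False

# pickle serializes the same record and restores the original types.
clone = pickle.loads(pickle.dumps(record))
```

To store such a record as JSON, the program would first have to convert the set and the date into basic types, for example a list and a string.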
PHP originally implemented serialization through the built-in serialize() and unserialize() functions.[7] PHP can serialize any of its data types except resources (file pointers, sockets, etc.). The built-in unserialize() function is often dangerous when used on completely untrusted data.[8]
For objects, there are two "magic methods" that can be implemented within a class, __sleep() and __wakeup(), which are called from within serialize() and unserialize(), respectively, and can clean up and restore an object. For example, it may be desirable to close a database connection on serialization and restore the connection on deserialization; this functionality would be handled in these two magic methods. They also permit the object to pick which properties are serialized.
Since PHP 5.1, there is an object-oriented serialization mechanism for objects, the Serializable interface.[9]
R has the function dput, which writes an ASCII text representation of an R object to a file or connection. A representation can be read from a file using dget.[10]
REBOL will serialize to file (save/all) or to a string! (mold/all). Strings and files can be deserialized using the polymorphic load function.
Ruby includes the standard module Marshal with two methods, dump and load, akin to the standard Unix utilities dump and restore. These methods serialize to the standard class String; that is, the serialized objects effectively become a sequence of bytes.
Some objects cannot be serialized (doing so would raise a TypeError exception):
If a class requires custom serialization (for example, it requires certain cleanup actions done on dumping or restoring), it can be done by implementing two methods: _dump and _load. The instance method _dump should return a String object containing all the information necessary to reconstitute objects of this class and all referenced objects up to a maximum depth given as an integer parameter (a value of -1 implies that depth checking should be disabled). The class method _load should take a String and return an object of this class.
In general, non-recursive and non-sharing objects can be stored and retrieved in a human-readable form using the storeOn:/readFrom: protocol. The storeOn: method generates the text of a Smalltalk expression which, when evaluated using readFrom:, recreates the original object. This scheme is special in that it uses a procedural description of the object, not the data itself. It is therefore very flexible, allowing for classes to define more compact representations. However, in its original form, it does not handle cyclic data structures or preserve the identity of shared references (i.e. two references to a single object will be restored as references to two equal, but not identical, copies). For this, various portable and non-portable alternatives exist. Some of them are specific to a particular Smalltalk implementation or class library.
There are several ways in Squeak Smalltalk to serialize and store objects. The easiest and most used are storeOn:/readFrom: and binary storage formats based on SmartRefStream serializers. In addition, bundled objects can be stored and retrieved using ImageSegments.
Both provide a so-called "binary-object storage framework", which supports serialization into and retrieval from a compact binary form. Both handle cyclic, recursive and shared structures, storage/retrieval of class and metaclass info, and include mechanisms for "on the fly" object migration (i.e. to convert instances which were written by an older version of a class with a different object layout). The APIs are similar (storeBinary/readBinary), but the encoding details are different, making these two formats incompatible. However, the Smalltalk/X code is open source and free and can be loaded into other Smalltalks to allow for cross-dialect object interchange.
Object serialization is not part of the ANSI Smalltalk specification. As a result, the code to serialize an object varies by Smalltalk implementation. The resulting binary data also varies. For instance, a serialized object created in Squeak Smalltalk cannot be restored in Ambrai Smalltalk. Consequently, applications that run on multiple Smalltalk implementations and rely on object serialization cannot share data between these implementations. These applications include the MinneStore object database[2] and some RPC packages. A solution to this problem is SIXX[3], which is a package for multiple Smalltalks that uses an XML-based format for serialization.
Generally a Lisp data structure can be serialized with the functions "read" and "print". A variable foo containing, for example, a list of arrays would be printed by (print foo). Similarly an object can be read from a stream named s by (read s). These two parts of the Lisp implementation are called the Printer and the Reader. The output of "print" is human readable; it uses lists delimited by parentheses, for example: (4 2.9 "x" y).
In many types of Lisp, including Common Lisp, the printer cannot represent every type of data because it is not clear how to do so. In Common Lisp, for example, the printer cannot print CLOS objects. Instead the programmer may write a method on the generic function print-object, which will be invoked when the object is printed. This is somewhat similar to the method used in Ruby.
Lisp code itself is written in the syntax of the reader, called read syntax. Most languages use separate and different parsers to deal with code and data; Lisp only uses one. A file containing Lisp code may be read into memory as a data structure, transformed by another program, then possibly executed or written out. See REPL.
Notice that not all readers/writers support cyclic, recursive or shared structures.
In Haskell, serialization is supported for types that are members of the Read and Show type classes. Every type that is a member of the Read type class defines a function that will extract the data from the string representation of the dumped data. The Show type class, in turn, contains the show function from which a string representation of the object can be generated.
The programmer need not define the functions explicitly: merely declaring a type to be deriving Read or deriving Show, or both, can make the compiler generate the appropriate functions for many cases (but not all: function types, for example, cannot automatically derive Show or Read).
Additionally, there are Haskell libraries that allow high-speed serialization in binary format, e.g. binary.
Windows PowerShell implements serialization through the built-in cmdlet Export-CliXML. Export-CliXML serializes .NET objects and stores the resulting XML in a file.
To reconstitute the objects, use the Import-CliXML cmdlet, which generates a deserialized object from the XML in the exported file. Deserialized objects, often known as "property bags", are not live objects; they are snapshots that have properties, but no methods.
Two-dimensional data structures can also be (de)serialized in CSV format using the built-in cmdlets Import-CSV and Export-CSV.