XZ data compression in Linux

Introduction

XZ is a general-purpose data compression format with a high compression ratio. The XZ decompressor in Linux is called XZ Embedded. It supports the LZMA2 filter and, optionally, Branch/Call/Jump (BCJ) filters for executable code. CRC32 is supported for integrity checking.

See the XZ Embedded home page for the latest version, which includes a few optional extra features that aren’t required in the Linux kernel and information about using the code outside the Linux kernel.

For userspace, XZ Utils provides a zlib-like compression library and a gzip-like command line tool.

XZ related components in the kernel

The xz_dec module provides an XZ decompressor with single-call (buffer to buffer) and multi-call (stateful) APIs, declared in include/linux/xz.h.

For decompressing the kernel image, initramfs, and initrd, there is a wrapper function in lib/decompress_unxz.c. Its API is the same as in the other decompress_*.c files and is defined in include/linux/decompress/generic.h.

For kernel makefiles, three commands are provided for use with $(call if_changed). They require the xz tool from XZ Utils.

$(call if_changed,xzkern) is for compressing the kernel image. It runs the script scripts/xz_wrap.sh, which uses arch-optimized options and a big LZMA2 dictionary.
$(call if_changed,xzkern_with_size) is like xzkern above, but it also appends a four-byte trailer containing the uncompressed size of the file. The trailer is needed by the boot code on some architectures.
Other things can be compressed with $(call if_changed,xzmisc), which uses no BCJ filter and a 1 MiB LZMA2 dictionary.
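
The four-byte size trailer appended by xzkern_with_size can be read back with a few lines of C. This is only a minimal sketch: it assumes the size is stored in little-endian byte order, which the text above does not state, and the helper name read_size_trailer is hypothetical rather than part of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read the uncompressed size from the last four
 * bytes of a compressed image. Little-endian order is an assumption.
 * Returns 0 on success, -1 if the buffer cannot hold a trailer.
 */
static int read_size_trailer(const uint8_t *buf, size_t len, uint32_t *size)
{
    const uint8_t *t;

    if (buf == NULL || size == NULL || len < 4)
        return -1;

    t = buf + len - 4;
    *size = (uint32_t)t[0] | ((uint32_t)t[1] << 8) |
            ((uint32_t)t[2] << 16) | ((uint32_t)t[3] << 24);
    return 0;
}
```

Boot code that uses such a trailer can size its output buffer before decompression starts, which is what single-call decoding needs.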

Notes on compression options

Since XZ Embedded supports only streams with CRC32 or no integrity check, make sure that you don’t use some other integrity check type when encoding files that are supposed to be decoded by the kernel. With liblzma from XZ Utils, you need to use either LZMA_CHECK_CRC32 or LZMA_CHECK_NONE when encoding. With the xz command line tool, use --check=crc32 or --check=none to override the default --check=crc64.

Using CRC32 is strongly recommended unless there is some other layer which will verify the integrity of the uncompressed data anyway. Double checking the integrity would probably be a waste of CPU cycles. Note that the headers will always have a CRC32 which will be validated by the decoder; you can only change the integrity check type (or disable it) for the actual uncompressed data.

In userspace, LZMA2 is typically used with dictionary sizes of several megabytes. The decoder needs to have the dictionary in RAM:

In multi-call mode the dictionary is allocated as part of the decoder state. The reasonable maximum dictionary size for in-kernel use will depend on the target hardware: a few megabytes is fine for desktop systems while 64 KiB to 1 MiB might be more appropriate on some embedded systems.
In single-call mode the output buffer is used as the dictionary buffer. That is, the size of the dictionary doesn’t affect the decompressor memory usage at all. Only the base data structures are allocated which take a little less than 30 KiB of memory. For the best compression, the dictionary should be at least as big as the uncompressed data. A notable example of single-call mode is decompressing the kernel itself (except on PowerPC).
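
The two memory rules above can be turned into a rough estimate. The sketch below rounds "a little less than 30 KiB" up to 30 KiB as an assumed upper bound for the base structures; the enum, macro, and function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for where the LZMA2 dictionary lives. */
enum dict_placement {
    DICT_IN_OUTPUT_BUFFER,  /* single-call: output buffer doubles as dictionary */
    DICT_IN_DECODER_STATE   /* multi-call: dictionary is part of the state */
};

/* Assumed upper bound for the base data structures. */
#define BASE_STATE_BYTES ((size_t)30 * 1024)

/* Estimate how much memory the decoder itself allocates. */
static size_t estimate_decoder_bytes(enum dict_placement p, uint32_t dict_size)
{
    if (p == DICT_IN_OUTPUT_BUFFER)
        return BASE_STATE_BYTES;            /* dictionary size is irrelevant */
    return BASE_STATE_BYTES + (size_t)dict_size;
}
```

With these assumptions, a stream encoded with a 64 MiB dictionary costs the single-call decoder no more than a stream encoded with a 64 KiB one, while a multi-call decoder pays for the dictionary in full.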

The compression presets in XZ Utils may not be optimal when creating files for the kernel, so don’t hesitate to use custom settings to, for example, set the dictionary size. Also, xz may produce a smaller file in single-threaded mode, so setting that explicitly is recommended. Example:

xz --threads=1 --check=crc32 --lzma2=dict=512KiB inputfile

xz_dec API

This is available with #include <linux/xz.h>.

enum xz_mode: Operation mode

Constants

XZ_SINGLE: Single-call mode. This uses less RAM than multi-call modes, because the LZMA2 dictionary doesn’t need to be allocated as part of the decoder state. All required data structures are allocated at initialization, so xz_dec_run() cannot return XZ_MEM_ERROR.
XZ_PREALLOC: Multi-call mode with preallocated LZMA2 dictionary buffer. All data structures are allocated at initialization, so xz_dec_run() cannot return XZ_MEM_ERROR.
XZ_DYNALLOC: Multi-call mode. The LZMA2 dictionary is allocated once the required size has been parsed from the stream headers. If the allocation fails, xz_dec_run() will return XZ_MEM_ERROR.

Description

It is possible to enable support only for a subset of the above modes at compile time by defining XZ_DEC_SINGLE, XZ_DEC_PREALLOC, or XZ_DEC_DYNALLOC. The xz_dec kernel module is always compiled with support for all operation modes, but the preboot code may be built with fewer features to minimize code size.

enum xz_ret: Return codes

Constants

XZ_OK: Everything is OK so far. More input or more output space is required to continue. This return code is possible only in multi-call mode (XZ_PREALLOC or XZ_DYNALLOC).
XZ_STREAM_END: Operation finished successfully.
XZ_UNSUPPORTED_CHECK: Integrity check type is not supported. Decoding is still possible in multi-call mode by simply calling xz_dec_run() again. Note that this return value is used only if XZ_DEC_ANY_CHECK was defined at build time, which is not used in the kernel. Unsupported check types return XZ_OPTIONS_ERROR if XZ_DEC_ANY_CHECK was not defined at build time.
XZ_MEM_ERROR: Allocating memory failed. This return code is possible only if the decoder was initialized with XZ_DYNALLOC. The amount of memory that the decoder tried to allocate was no more than the dict_max argument given to xz_dec_init().
XZ_MEMLIMIT_ERROR: A bigger LZMA2 dictionary would be needed than allowed by the dict_max argument given to xz_dec_init(). This return value is possible only in multi-call mode (XZ_PREALLOC or XZ_DYNALLOC); the single-call mode (XZ_SINGLE) ignores the dict_max argument.
XZ_FORMAT_ERROR: File format was not recognized (wrong magic bytes).
XZ_OPTIONS_ERROR: This implementation doesn’t support the requested compression options. In the decoder this means that the header CRC32 matches, but the header itself specifies something that we don’t support.
XZ_DATA_ERROR: Compressed data is corrupt.
XZ_BUF_ERROR: Cannot make any progress. Details are slightly different between multi-call and single-call mode; more information below.

Description

In multi-call mode, XZ_BUF_ERROR is returned when two consecutive calls to XZ code cannot consume any input and cannot produce any new output. This happens when there is no new input available, or the output buffer is full while at least one output byte is still pending. Assuming your code is not buggy, you can get this error only when decoding a compressed stream that is truncated or otherwise corrupt.

In single-call mode, XZ_BUF_ERROR is returned only when the output buffer is too small or the compressed input is corrupt in a way that makes the decoder produce more output than the caller expected. When it is (relatively) clear that the compressed input is truncated, XZ_DATA_ERROR is used instead of XZ_BUF_ERROR.
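
For a multi-call caller, the return codes above fall into three groups: keep going, finished, or failed. The sketch below shows one way to express that. The enum xz_ret here is a local stand-in that only mirrors the names listed above so the example compiles on its own (kernel code would include <linux/xz.h> instead), and enum next_step and xz_next_step() are hypothetical.

```c
/*
 * Local stand-in mirroring the return code names documented above.
 * Kernel code would use the real definition from <linux/xz.h>.
 */
enum xz_ret {
    XZ_OK, XZ_STREAM_END, XZ_UNSUPPORTED_CHECK, XZ_MEM_ERROR,
    XZ_MEMLIMIT_ERROR, XZ_FORMAT_ERROR, XZ_OPTIONS_ERROR,
    XZ_DATA_ERROR, XZ_BUF_ERROR
};

/* Hypothetical classification for a multi-call decoding loop. */
enum next_step { STEP_CONTINUE, STEP_DONE, STEP_FAIL };

static enum next_step xz_next_step(enum xz_ret ret)
{
    switch (ret) {
    case XZ_OK:                 /* more input or output space needed */
    case XZ_UNSUPPORTED_CHECK:  /* decoding may continue by calling again */
        return STEP_CONTINUE;
    case XZ_STREAM_END:
        return STEP_DONE;
    default:                    /* every remaining code is an error */
        return STEP_FAIL;
    }
}
```

Treating XZ_UNSUPPORTED_CHECK as "continue" is a design choice: a caller that insists on verified data could just as well treat it as a failure.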

struct xz_buf: Passing input and output buffers to XZ code

Definition:

struct xz_buf {
    const uint8_t *in;
    size_t in_pos;
    size_t in_size;
    uint8_t *out;
    size_t out_pos;
    size_t out_size;
};

Members

in: Beginning of the input buffer. This may be NULL if and only if in_pos is equal to in_size.
in_pos: Current position in the input buffer. This must not exceed in_size.
in_size: Size of the input buffer
out: Beginning of the output buffer. This may be NULL if and only if out_pos is equal to out_size.
out_pos: Current position in the output buffer. This must not exceed out_size.
out_size: Size of the output buffer

Description

Only the contents of the output buffer from out[out_pos] onward, and the variables in_pos and out_pos are modified by the XZ code.
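
The member rules above amount to a small set of invariants that a caller can check before handing a buffer to the decoder. The following is a sketch, not kernel code: struct xz_buf is copied locally from the definition above so the example stands alone, and xz_buf_is_valid() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of the definition shown above, so the sketch is standalone. */
struct xz_buf {
    const uint8_t *in;
    size_t in_pos;
    size_t in_size;
    uint8_t *out;
    size_t out_pos;
    size_t out_size;
};

/* Hypothetical check: returns 1 if b follows the documented rules. */
static int xz_buf_is_valid(const struct xz_buf *b)
{
    /* Positions must not run past the end of their buffers. */
    if (b->in_pos > b->in_size || b->out_pos > b->out_size)
        return 0;
    /* A NULL buffer is acceptable only when nothing is left in it. */
    if (b->in == NULL && b->in_pos != b->in_size)
        return 0;
    if (b->out == NULL && b->out_pos != b->out_size)
        return 0;
    return 1;
}
```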

struct xz_dec *xz_dec_init(enum xz_mode mode, uint32_t dict_max): Allocate and initialize an XZ decoder state

Parameters

enum xz_mode mode: Operation mode
uint32_t dict_max: Maximum size of the LZMA2 dictionary (history buffer) for multi-call decoding. This is ignored in single-call mode (mode == XZ_SINGLE). The LZMA2 dictionary is always 2^n bytes or 2^n + 2^(n-1) bytes (the latter sizes are less common in practice), so other values for dict_max don’t make sense. In the kernel, dictionary sizes of 64 KiB, 128 KiB, 256 KiB, 512 KiB, and 1 MiB are probably the only reasonable values, except for kernel and initramfs images where a bigger dictionary can be fine and useful.

Description

Single-call mode (XZ_SINGLE): xz_dec_run() decodes the whole stream at once. The caller must provide enough output space or the decoding will fail. The output space is used as the dictionary buffer, which is why there is no need to allocate the dictionary as part of the decoder’s internal state.

Because the output buffer is used as the workspace, streams encoded using a big dictionary are not a problem in single-call mode. It is enough that the output buffer is big enough to hold the actual uncompressed data; it can be smaller than the dictionary size stored in the stream headers.

Multi-call mode with preallocated dictionary (XZ_PREALLOC): dict_max bytes of memory is preallocated for the LZMA2 dictionary. This way there is no risk that xz_dec_run() could run out of memory, since xz_dec_run() will never allocate any memory. Instead, if the preallocated dictionary is too small for decoding the given input stream, xz_dec_run() will return XZ_MEMLIMIT_ERROR. Thus, it is important to know what kind of data will be decoded to avoid allocating an excessive amount of memory for the dictionary.

Multi-call mode with dynamically allocated dictionary (XZ_DYNALLOC): dict_max specifies the maximum allowed dictionary size that xz_dec_run() may allocate once it has parsed the dictionary size from the stream headers. This way excessive allocations can be avoided while still limiting the maximum memory usage to a sane value to prevent running the system out of memory when decompressing streams from untrusted sources.

On success, xz_dec_init() returns a pointer to struct xz_dec, which is ready to be used with xz_dec_run(). If memory allocation fails, xz_dec_init() returns NULL.
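
Since only sizes of the form 2^n or 2^n + 2^(n-1) make sense for dict_max, a caller may want to check a configured value before passing it to xz_dec_init(). The helper below is a hypothetical sketch of such a check; it tests only the shape of the number, not whether the size is reasonable for a given system.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if size has the form 2^n or
 * 2^n + 2^(n-1), i.e. one set bit or two adjacent set bits.
 */
static int is_lzma2_dict_size(uint32_t size)
{
    uint32_t low, rest;

    if (size == 0)
        return 0;

    rest = size & (size - 1);   /* size with its lowest set bit cleared */
    if (rest == 0)
        return 1;               /* exactly one bit set: 2^n */

    low = size ^ rest;          /* the lowest set bit on its own */

    /* 2^n + 2^(n-1): rest is a single bit and low sits right below it. */
    return (rest & (rest - 1)) == 0 && low == (rest >> 1);
}
```

For example, 1 MiB and 1.5 MiB pass this check, while 1000000 bytes or 5 KiB do not.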

enum xz_ret xz_dec_run(struct xz_dec *s, struct xz_buf *b): Run the XZ decoder

Parameters

struct xz_dec *s: Decoder state allocated using xz_dec_init()
struct xz_buf *b: Input and output buffers

Description

The possible return values depend on build options and operation mode. See enum xz_ret for details.

Note that if an error occurs in single-call mode (return value is not XZ_STREAM_END), b->in_pos and b->out_pos are not modified and the contents of the output buffer from b->out[b->out_pos] onward are undefined. This is true even after XZ_BUF_ERROR, because with some filter chains, there may be a second pass over the output buffer, and this pass cannot be properly done if the output buffer is truncated. Thus, you cannot give the single-call decoder a too small buffer and then expect to get that amount of valid data from the beginning of the stream. You must use the multi-call decoder if you don’t want to uncompress the whole stream.

void xz_dec_reset(struct xz_dec *s): Reset an already allocated decoder state

Parameters

struct xz_dec *s: Decoder state allocated using xz_dec_init()

Description

This function can be used to reset the multi-call decoder state without freeing and reallocating memory with xz_dec_end() and xz_dec_init().

In single-call mode, xz_dec_reset() is always called at the beginning of xz_dec_run(). Thus, an explicit call to xz_dec_reset() is useful only in multi-call mode.

void xz_dec_end(struct xz_dec *s): Free the memory allocated for the decoder state

Parameters

struct xz_dec *s: Decoder state allocated using xz_dec_init(). If s is NULL, this function does nothing.

MicroLZMA decompressor

This MicroLZMA header format was created for use in EROFS but may be used by others too. In most cases one needs the XZ APIs above instead.

The compressed format supported by this decoder is a raw LZMA stream whose first byte (always 0x00) has been replaced with bitwise-negation of the LZMA properties (lc/lp/pb) byte. For example, if lc/lp/pb is 3/0/2, the first byte is 0xA2. This way the first byte can never be 0x00. Just like with LZMA2, lc + lp <= 4 must be true. The LZMA end-of-stream marker must not be used. The unused values are reserved for future use.
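
The 0xA2 example above can be reproduced in a few lines. This sketch uses the usual LZMA packing of the properties byte, (pb * 5 + lp) * 9 + lc, and the usual LZMA limit pb <= 4; neither is spelled out in the text above, so treat both as assumptions. The function name microlzma_first_byte is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns the first byte of a MicroLZMA stream for
 * the given lc/lp/pb, or -1 if the combination is not allowed.
 * Assumes the standard LZMA properties packing and the limit pb <= 4.
 */
static int microlzma_first_byte(unsigned int lc, unsigned int lp,
                                unsigned int pb)
{
    uint8_t props;

    /* Check each value first so that lc + lp cannot wrap around. */
    if (lc > 4 || lp > 4 || lc + lp > 4 || pb > 4)
        return -1;

    props = (uint8_t)((pb * 5 + lp) * 9 + lc);

    /* MicroLZMA stores the bitwise negation of the properties byte. */
    return (uint8_t)~props;
}
```

For lc/lp/pb = 3/0/2 the properties byte is 0x5D, and its negation is 0xA2, matching the example in the text.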

struct xz_dec_microlzma *xz_dec_microlzma_alloc(enum xz_mode mode, uint32_t dict_size): Allocate memory for the MicroLZMA decoder

Parameters

enum xz_mode mode: XZ_SINGLE or XZ_PREALLOC
uint32_t dict_size: LZMA dictionary size. This must be at least 4 KiB and at most 3 GiB.

Description

In contrast to xz_dec_init(), this function only allocates the memory and remembers the dictionary size. xz_dec_microlzma_reset() must be used before calling xz_dec_microlzma_run().

The amount of allocated memory is a little less than 30 KiB with XZ_SINGLE. With XZ_PREALLOC also a dictionary buffer of dict_size bytes is allocated.

On success, xz_dec_microlzma_alloc() returns a pointer to struct xz_dec_microlzma. If memory allocation fails or dict_size is invalid, NULL is returned.

void xz_dec_microlzma_reset(struct xz_dec_microlzma *s, uint32_t comp_size, uint32_t uncomp_size, int uncomp_size_is_exact): Reset the MicroLZMA decoder state

Parameters

struct xz_dec_microlzma *s: Decoder state allocated using xz_dec_microlzma_alloc()
uint32_t comp_size: Compressed size of the input stream
uint32_t uncomp_size: Uncompressed size of the input stream. A value smaller than the real uncompressed size of the input stream can be specified if uncomp_size_is_exact is set to false. uncomp_size can never be set to a value larger than the expected real uncompressed size because it would eventually result in XZ_DATA_ERROR.
int uncomp_size_is_exact: This is an int instead of bool to avoid requiring stdbool.h. This should normally be set to true. When this is set to false, error detection is weaker.

enum xz_ret xz_dec_microlzma_run(struct xz_dec_microlzma *s, struct xz_buf *b): Run the MicroLZMA decoder

Parameters

struct xz_dec_microlzma *s: Decoder state initialized using xz_dec_microlzma_reset()
struct xz_buf *b: Input and output buffers

Description

This works similarly to xz_dec_run() with a few important differences. Only the differences are documented here.

The only possible return values are XZ_OK, XZ_STREAM_END, and XZ_DATA_ERROR. This function cannot return XZ_BUF_ERROR: if no progress is possible due to lack of input data or output space, this function will keep returning XZ_OK. Thus, the calling code must be written so that it will eventually provide input and output space matching (or exceeding) comp_size and uncomp_size arguments given to xz_dec_microlzma_reset(). If the caller cannot do this (for example, if the input file is truncated or otherwise corrupt), the caller must detect this error by itself to avoid an infinite loop.

If the compressed data seems to be corrupt, XZ_DATA_ERROR is returned. This can happen also when incorrect dictionary, uncompressed, or compressed sizes have been specified.

With XZ_PREALLOC only: As an extra feature, b->out may be NULL to skip over uncompressed data. This way the caller doesn’t need to provide a temporary output buffer for the bytes that will be ignored.

With XZ_SINGLE only: In contrast to xz_dec_run(), the return value XZ_OK is also possible and thus XZ_SINGLE is actually a limited multi-call mode. After XZ_OK the bytes decoded so far may be read from the output buffer. It is possible to continue decoding but the variables b->out and b->out_pos MUST NOT be changed by the caller. Increasing the value of b->out_size is allowed to make more output space available; one doesn’t need to provide space for the whole uncompressed data on the first call. The input buffer may be changed normally like with XZ_PREALLOC. This way input data can be provided from non-contiguous memory.
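
Because xz_dec_microlzma_run() keeps returning XZ_OK when it cannot make progress, the caller has to notice a stalled loop on its own. One possible approach, sketched below, is to snapshot in_pos and out_pos before each call and compare afterwards. struct xz_buf is copied locally so the example stands alone, and struct xz_progress, xz_progress_mark(), and xz_is_stalled() are hypothetical names, not kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of struct xz_buf so the sketch is standalone. */
struct xz_buf {
    const uint8_t *in;
    size_t in_pos;
    size_t in_size;
    uint8_t *out;
    size_t out_pos;
    size_t out_size;
};

/* Hypothetical snapshot of the two positions the decoder advances. */
struct xz_progress {
    size_t in_pos;
    size_t out_pos;
};

/* Record the positions before calling the decoder. */
static struct xz_progress xz_progress_mark(const struct xz_buf *b)
{
    struct xz_progress p = { b->in_pos, b->out_pos };
    return p;
}

/*
 * After a call that returned XZ_OK: returns 1 if neither position moved
 * and the caller has no further input or output space to offer, which
 * means another call could never succeed.
 */
static int xz_is_stalled(const struct xz_progress *before,
                         const struct xz_buf *after,
                         int caller_can_supply_more)
{
    int moved = after->in_pos != before->in_pos ||
                after->out_pos != before->out_pos;

    return !moved && !caller_can_supply_more;
}
```

A caller would take a mark before each decoder call and, on XZ_OK, stop with an error of its own choosing when xz_is_stalled() reports 1, for example when the input file turned out to be shorter than comp_size.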

void xz_dec_microlzma_end(struct xz_dec_microlzma *s): Free the memory allocated for the decoder state

Parameters

struct xz_dec_microlzma *s: Decoder state allocated using xz_dec_microlzma_alloc(). If s is NULL, this function does nothing.
