Compression Tools Compared
Data compression works so well that popular backup and networking tools have some built in. Linux offers more than a dozen compression tools to choose from, and most of them let you pick a compression level too. To find out which perform best, I benchmarked 87 combinations of tools and levels. Read this article to learn which compressor is a hundred times faster than the others and which ones compress the most.
The most popular data compression tool for Linux is gzip, which lets you choose a compression level from one to nine. One is fast, and nine compresses well. Choosing a good trade-off between speed and compression ratio becomes important when it takes hours to handle gigabytes of data. You can get a sense of what your choices are from the graph shown in Figure 1. The fastest choices are on the left, and the highest compressing ones are on the top. The best all-around performers are presented in the graph's upper left-hand corner.
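For example, to squeeze a file as hard as gzip can, or as quickly as it can (a minimal sketch; replace file with a real filename):

$ gzip -9 file
$ gzip -1 file

Either way, gzip replaces file with a compressed file.gz; -9 trades time for a better ratio, and -1 does the opposite.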

Figure 1. Increasing the compression level in gzip increases both compression ratio and time required to complete.
But many other data compression tools are available to choose from in Linux. See the comprehensive compression and decompression benchmarks in Figures 2 and 3. As with gzip, the best performers are in the upper left-hand corner, but these charts' time axes are scaled logarithmically to accommodate huge differences in how fast they work.
The Benchmarks
How compactly data can be compressed depends on what type of data it is. Don't expect big performance increases from data that's already compressed, such as files in Ogg Vorbis, MP3 or JPEG format. On the other hand, I've seen data that allows performance increases of 1,000%!
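You can see this for yourself by recompressing something that's already compressed, a hypothetical MP3 here, and comparing byte counts:

$ wc -c song.mp3
$ gzip -9c song.mp3 | wc -c

The second number usually is barely smaller than the first, and it sometimes even comes out a little bigger.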
All benchmarks in this article used the same 45MB of typical Linux data, containing:
24% ELF 32-bit LSB
15% ASCII C program
11% gzip compressed data
8% ASCII English text
7% binary package
4% directory
2% current ar archive
2% Texinfo source text
2% PostScript document text
2% Bourne shell script
2% ASCII text
21% various other data types
This data set was chosen because it is bigger and contains Linux binaries, making it more representative of the demands made on today's Linux systems than the traditional Canterbury and Calgary test data.
I used the same lightly loaded AMD Athlon XP 1700+ CPU with 1GB of RAM and version 2.4.27-1-k7 of the Linux kernel for all tests. Unpredictable disk drive delays were minimized by pre-loading data into RAM. Elapsed times were measured in thousandths of a second. I'm not affiliated with any of the tools, and I strove to be objective and accurate.
The tools that tend to compress more and faster are singled out in the graphs shown in Figures 4 and 5. Use these for backups to disk drives. Remember, their time axes are scaled logarithmically. The red lines show the top-performing ones, and the green lines show the top performers that also can act as filters.
Filters
Filters are tools that can be chained together at the command line so that the output of one is piped elegantly into the input of the next. A common example is:
$ ls | more
Filtering is crucial for speeding up network transfers. Without it, you have to wait for all the data to be compressed before transferring any of it, and you need to wait for the whole transfer to complete before starting to decompress. Filters speed up network transfers by allowing data to be simultaneously compressed, transferred and decompressed. This happens with negligible latency if you're sending enough data. Filters also eliminate the need for an intermediate archive of your files.
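For instance, a single pipeline like the following, a sketch with user@box.com standing in for a real account and host, streams a directory through gzip and ssh to another machine without ever writing an intermediate archive:

$ tar c a/dir | gzip | ssh user@box.com "gzip -d | tar x"

The data is compressed, transferred and decompressed at the same time, and the specific compressors recommended later in this article slot into the same pattern.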
Check whether the data compression tool that you want is installed on both computers. If it's not, you can see where to get it in the on-line Resources for this article. Remember to replace a/dir in the following examples with the real path of the data to back up.
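One quick way to check is to ask the shell where the tools live; a sketch:

$ which gzip lzop rzip lzma

Anything which doesn't print a path for isn't installed on that machine, or at least isn't on its PATH.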
Unless your data already is in one big file, be smart and consolidate it with a tool such as tar. Aggregated data has more redundancy to winnow out, so it's ultimately more compressible.
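To see why, compare compressing one consolidated archive with compressing every file separately; a rough sketch using gzip:

$ tar c a/dir | gzip > backup.tar.gz
$ gzip -r a/dir

The first command compresses a single consolidated stream; the second compresses every file under a/dir separately, in place, so try it on a scratch copy. The single archive usually comes out smaller than the sum of the individually compressed files, because the compressor isn't forced to start over for each one.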
But be aware that the redundancy that saps your performance also may make it easier to recover from corruption. If you're worried about corruption, you might consider testing for it with the cksum command or adding a limited amount of redundancy back into your compressed data with a tool such as parchive or ras.
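For example, once you have a compressed archive such as the backup.tar.lzo you're about to create, you might record a checksum now and recheck it later, or create recovery files; a sketch assuming the par2 command-line implementation of parchive:

$ cksum backup.tar.lzo
$ par2 create backup.tar.lzo

cksum prints a CRC and byte count you can compare against later, and par2 create writes .par2 recovery files alongside the archive that par2 repair can use to reconstruct a damaged copy.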
lzop often is the fastest tool. It finishes about three times faster than gzip but still compresses data almost as much. It finishes about a hundred times faster than lzma and 7za. Furthermore, lzop occasionally decompresses data even faster than simply copying it! Use lzop on the command line as a filter with the backup tool named tar:
$ tar c a/dir | lzop - > backup.tar.lzo
tar's c option tells it to create one big archive from the files in a/dir. The | is a shell operator that automatically pipes tar's output into lzop's input. The - tells lzop to read from its standard input, and the > is a shell redirection that sends lzop's output to a file named backup.tar.lzo.
You can restore with:
$ lzop -dc backup.tar.lzo | tar x
The d and c options tell lzop to decompress and write to standard output, respectively. tar's x option tells it to extract the original files from the archive.
Although lzop is impressive, you can get even higher compression ratios, much higher. Combine a little-known data compression tool named lzma with tar to increase storage space effectively by 400%. Here's how you would use it to back up:
$ tar c a/dir | lzma -x -s26 > backup.tar.lzma
lzma's -x option tells it to compress more, and its -s option tells it how big of a dictionary to use.
You can restore with:
$ cat backup.tar.lzma | lzma -d | tar x
The -d option tells lzma to decompress. You need patience to increase storage by 400%; lzma takes about 40 times as long as gzip. In other words, that one-hour gzip backup might take all day with lzma.
This version of lzma is the hardest compressor to find. Make sure you get the one that acts as a filter. See Resources for its two locations.
The data compression tool with the best trade-off between speed and compression ratio is rzip. With compression level 0, rzip finishes about 400% faster than gzip and compacts data 70% more. rzip accomplishes this feat by using more working memory. Whereas gzip uses only 32 kilobytes of working memory during compression, rzip can use up to 900 megabytes, but that's okay because memory is getting cheaper and cheaper.
Here's the big but: rzip doesn't work as a filter, at least not yet. Unless your data already is in one file, you temporarily need some extra disk space for a tar archive. If you want a good project to work on that would shake up the Linux world, enhance rzip to work as a filter. Until then, rzip is a particularly good option for squeezing a lot of data onto CDs or DVDs, because it performs well and you can use your hard drive for the temporary tar file.
Here's how to back up with rzip:
$ tar cf dir.tar a/dir
$ rzip -0 dir.tar
The -0 option says to use compression level 0. Unless you use rzip's -k option, it automatically deletes the input file, which in this case is the tar archive. Make sure you use -k if you want to keep the original file.
rzipped tar archives can be restored with:
$ rzip -d dir.tar.rz
$ tar xf dir.tar
rzip's default compression level is another top performer. It can increase your effective disk space by 375% but in only about a fifth of the time lzma can take. Using it is almost exactly the same as the example above; simply omit compression level -0.
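Spelled out, a default-level rzip backup looks like this:

$ tar cf dir.tar a/dir
$ rzip dir.tar

Restoring is the same as before: rzip -d dir.tar.rz followed by tar xf dir.tar.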
Data compression also can speed up network transfers. How much depends on how fast your CPU and network are. Slow networks with fast CPUs can be sped up the most by thoroughly compressing the data. Alternatively, slow CPUs with fast connections do best with no compression.
Find the best compressor and compression level for your hardware in the graph shown in Figure 6. This graph's CPU and network speed axes are scaled logarithmically too. Look where your CPU and network speeds intersect in the graph, and try the data compression tool and compression level at that point. It also should give you a sense of how much your bandwidth may increase.
Network Transfer Estimates
To find the best compressors for various CPU and network speeds, I considered how long it takes to compress data, send it and decompress it. I projected how long compression and decompression should take on computers of various speeds by simply scaling actual test results from my 1.7GHz CPU. For example, a 3.4GHz CPU should compress data about twice as fast. Likewise, I estimated transfer times by dividing the size of the compressed data by the network's real speed.
The overall transfer time for non-filtering data compression tools, such as rzip, simply should be about the sum of the estimated times to compress, send and decompress the data.
However, compressors that can act as filters, such as gzip, have an advantage. They simultaneously can compress, transfer and decompress. I assumed their overall transfer times are dominated by the slowest of the three steps. I verified some estimates by timing real transfers.
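As a purely hypothetical worked example of that arithmetic, suppose compressing takes 60 seconds, the compressed data is 15MB, the link moves 1MB per second and decompressing takes 20 seconds. A quick check with bc:

$ echo "60 + 15/1 + 20" | bc
95

A non-filtering tool would need roughly 95 seconds end to end, while a filter that overlaps the three steps would be gated by the slowest of them, about 60 seconds.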
For example, if you have a 56Kbps dial-up modem and a 3GHz CPU, their speeds intersect in the light-yellow region labeled lzma 26 at the top of the graph. This corresponds to using lzma with a 2^26-byte dictionary. The graph predicts a 430% increase in effective bandwidth.
On the other hand, if you have a 1Gbps network but only a 100MHz CPU, it should be faster simply to send the raw uncompressed data. This is depicted in the flat black region at the bottom of the graph.
Don't assume that lzma always increases performance the most, however. The best compression tool for data transfers depends on the ratio of your particular CPU's speed to your particular network's speed.
If the sending and receiving computers have different CPU speeds, try looking up the sending computer's speed in the graph, because compression can be much more CPU-intensive than decompression. Check whether the data compression tool and scp are installed on both computers. Remember to replace user@box.com and file with the real names.
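A quick sanity check from the sending box:

$ which lzma scp
$ ssh user@box.com "which lzma"

If either command comes back empty, install the missing tool on that end first.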
For the fastest CPUs and/or slowest network connections that fall in the graph's light-yellow region, speed up your network transfers like this:
$ cat file | lzma -x -s26 | ssh user@box.com "lzma -d > file"
ssh stands for secure shell. It's a safe way to execute commands on remote computers. This may speed up your network transfer by more than 400%.
For fast CPUs and/or slow networks that fall into the graph's dark-yellow zone, use rzip with a compression level of one. Because rzip doesn't work as a filter, you need temporary space for the compressed file on the originating box:
$ rzip -1 -k file
$ scp file.rz user@box.com:
$ ssh user@box.com "rzip -d file.rz"
The -1 tells rzip to use compression level 1, and the -k tells it to keep its input file. Remember to use a : at the end of the scp command.
rzipped network transfers can be 375% faster. That one-hour transfer might finish in only 16 minutes!
For slightly slower CPUs and/or faster networks that fall in the graph's orange region, try using gzip with compression level 1. Here's how:
$ gzip -1c file | ssh user@box.com "gzip -d > file"
It might double your effective bandwidth. -1c tells gzip to use compression level 1 and write to standard output, and -d tells it to decompress.
For fast network connections and slow CPUs falling in the graph's blue region, quickly compress a little with lzop at compression level 1:
$ lzop -1c file | ssh user@box.com "lzop -d > file"
The -1c tells lzop to use compression level 1 and to write to standard output. -d tells it to decompress. Even with this minimal compression, you still might increase your hardware's effective bandwidth by 75%.
For network connections and CPUs falling in the graph's black region, don't compress at all. Simply send the data as is.
C Libraries
If you want even more performance, you may want to try calling a C compression library from your own program.
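For example, zlib is the library behind the gzip format, and linking your own C program against it is a one-flag affair; a sketch of just the compile step, where myprog.c is a hypothetical program that calls zlib routines such as compress() or the gz* family:

$ gcc -O2 myprog.c -lz -o myprog

The LZO and bzip2 libraries offer similar C APIs for lzop- and bzip2-style compression.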
Resources for this article: /article/8403.
Kingsley G. Morse Jr. has been using computers for 29 years, and Debian GNU/Linux has been on his desktop for nine. He worked at Hewlett-Packard and advocates for men's reproductive rights. He can be reached at change@nas.com.