1 | 1 | /* ----------
2 | 2 | * pg_lzcompress.c -
3 | 3 | *
4 |   | - *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.6 2000/07/03 23:09:52 wieck Exp $
  | 4 | + *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.7 2000/07/06 21:02:07 wieck Exp $
5 | 5 | *
6 | 6 | *	This is an implementation of LZ compression for PostgreSQL.
7 | 7 | *	It uses a simple history table and generates 2-3 byte tags
46 | 46 | *	The return value is the number of bytes written to buff.
47 | 47 | *	Obviously the same as what PGLZ_RAW_SIZE() returns.
48 | 48 | *
49 |    | - *	The compression algorithm and internal data format:
   | 49 | + *	The decompression algorithm and internal data format:
50 | 50 | *
51 | 51 | *	PGLZ_Header is defined as
52 | 52 | *
57 | 57 | *
58 | 58 | *	The header is followed by the compressed data itself.
59 | 59 | *
60 |    | - *	The algorithm is easiest explained by describing the process
61 |    | - *	of decompression.
   | 60 | + *	The data representation is easiest explained by describing
   | 61 | + *	the process of decompression.
62 | 62 | *
63 | 63 | *	If varsize == rawsize + sizeof(PGLZ_Header), then the data
64 | 64 | *	is stored uncompressed as plain bytes. Thus, the decompressor
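A minimal sketch of that uncompressed-storage check, assuming a PGLZ_Header with int32 varsize and rawsize fields (the field names and layout here are assumptions for illustration, not the file's actual definition, which is elided from this hunk):

#include <stdint.h>

/* Assumed header layout; see the definition elided above. */
typedef struct PGLZ_Header
{
	int32_t		varsize;	/* total size, including this header */
	int32_t		rawsize;	/* size of the uncompressed data */
} PGLZ_Header;

/*
 * If the stored size equals the raw size plus the header, no
 * compression took place and the payload is the plain input,
 * so the decompressor can simply copy it.
 */
static int
pglz_is_stored_plain(const PGLZ_Header *hdr)
{
	return hdr->varsize == hdr->rawsize + (int32_t) sizeof(PGLZ_Header);
}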
108 | 108 | *	and end up with a total compression rate of 96%, which is
109 | 109 | *	still worth a wow.
110 | 110 | *
| 111 | + *	The compression algorithm
| 112 | + *
| 113 | + *	The following description uses the numbers of the default strategy.
| 114 | + *
| 115 | + *	The compressor works best for attributes of a size between
| 116 | + *	1K and 1M. For smaller items there's not that much chance of
| 117 | + *	redundancy in the character sequence (except for large areas
| 118 | + *	of identical bytes like trailing spaces), and for bigger ones
| 119 | + *	the allocation of the history table is expensive (it needs
| 120 | + *	8 times the size of the input!).
| 121 | + * |
| 122 | + *	The compressor creates a table for 8192 lists of positions.
| 123 | + *	For each input position (except the last 3), a hash key is
| 124 | + *	built from the 4 next input bytes and the position is
| 125 | + *	remembered in the appropriate list. Thus, the table points
| 126 | + *	to linked lists of strings that are likely to match in at
| 127 | + *	least their first 4 characters. This is done on the fly
| 128 | + *	while the input is compressed into the output area.
| 129 | + * |
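As an illustration, here is one way the history table and its 4-byte hash could look in C. The 8192 lists and the one-entry-per-position bookkeeping follow the text above; the mixing in hist_idx() and all names are assumptions, not the file's actual macros. Note that one entry of two 4-byte pointers per input byte is exactly the 8-times-the-input memory cost mentioned earlier (on a 32-bit machine).

#define HIST_SIZE 8192			/* number of position lists */

typedef struct HistEntry
{
	struct HistEntry *next;		/* older position, same hash key */
	const unsigned char *pos;	/* where these 4 bytes occurred */
} HistEntry;

static HistEntry *hist_start[HIST_SIZE];

/* Hash the 4 bytes at 's' into a table index (mixing assumed). */
static unsigned int
hist_idx(const unsigned char *s)
{
	return ((s[0] << 9) ^ (s[1] << 6) ^ (s[2] << 3) ^ s[3])
		   & (HIST_SIZE - 1);
}

/* Prepend the current position to its list, so a walk visits
 * entries in increasing backward offset order. */
static void
hist_add(HistEntry *e, const unsigned char *pos)
{
	unsigned int idx = hist_idx(pos);

	e->next = hist_start[idx];
	e->pos = pos;
	hist_start[idx] = e;
}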
| 130 | + *	For each byte in the input, its hash key (built from this
| 131 | + *	byte and the next 3) is used to find the appropriate list
| 132 | + *	in the table. The lists remember the positions of all bytes
| 133 | + *	that had the same hash key in the past, in increasing backward
| 134 | + *	offset order. Now, for all entries in that list, the match
| 135 | + *	length is computed by comparing the characters from the
| 136 | + *	entry's position with the characters from the current input
| 137 | + *	position.
| 138 | + * |
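That comparison is a plain byte-wise scan. A sketch (the function name and the bounds parameter are illustrative):

/* Count how many bytes match between a remembered history position
 * and the current input position, limited by the input that's left. */
static int
match_length(const unsigned char *hist_pos,
			 const unsigned char *input_pos, int remaining)
{
	int len = 0;

	while (len < remaining && hist_pos[len] == input_pos[len])
		len++;
	return len;
}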
| 139 | + *	The compressor starts with a so-called "good_match" of 128.
| 140 | + *	It is a "prefer speed over compression ratio" optimizer.
| 141 | + *	So if the first entry looked at already has 128 or more
| 142 | + *	matching characters, the lookup stops and that position is
| 143 | + *	used for the next tag in the output.
| 144 | + *
| 145 | + *	For each subsequent entry in the history list, the "good_match"
| 146 | + *	is lowered by 10%. So the compressor will be happier with
| 147 | + *	short matches the farther it has to go back in the history.
| 148 | + *	This is another "speed over ratio" preference characteristic
| 149 | + *	of the algorithm.
| 150 | + * |
| 151 | + *	Thus there are 3 stop conditions for the lookup of matches (sketched below):
| 152 | + *
| 153 | + *		- a match >= good_match is found
| 154 | + *		- there are no more history entries to look at
| 155 | + *		- the next history entry is already too far back
| 156 | + *		  to be coded into a tag
| 157 | + * |
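Combining the pieces, a sketch of the lookup loop, reusing the HistEntry and match_length sketches above. The 128 start value and the 10% decay come from the text; the 4095-byte maximum offset a tag can encode is an assumption here (the tag format is described in the elided part of this comment):

#define MAX_TAG_OFFSET 4095		/* assumed limit of a tag's offset field */

/* Find the best usable match for input position 'ip'; returns its
 * length (0 if none) and stores the backward offset in *off. */
static int
find_match(HistEntry *list, const unsigned char *ip, int remaining,
		   int *off)
{
	int good_match = 128;
	int best_len = 0;
	HistEntry *e;

	for (e = list; e != NULL; e = e->next)
	{
		long offset = ip - e->pos;
		int len;

		/* Stop: this (and every older) entry is too far back
		 * to be coded into a tag. */
		if (offset > MAX_TAG_OFFSET)
			break;

		len = match_length(e->pos, ip, remaining);
		if (len > best_len)
		{
			best_len = len;
			*off = (int) offset;
		}

		/* Stop: good enough; prefer speed over ratio. */
		if (len >= good_match)
			break;

		/* Accept shorter matches the farther back we go. */
		good_match -= good_match / 10;
	}
	/* Third stop condition: e == NULL, no more history entries. */
	return best_len;
}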
| 158 | + *	Finally, the match algorithm checks that a match of at
| 159 | + *	least 3 bytes has been found, because that's the smallest
| 160 | + *	amount of copy information to code into a tag. If so, a
| 161 | + *	tag is emitted and all the input bytes covered by it are
| 162 | + *	just scanned for the history adds; otherwise a literal
| 163 | + *	character is emitted and only its history entry is added.
| 164 | + * |
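A sketch of that decision in the main loop, reusing the sketches above. emit_tag() and emit_literal() stand in for the output coding, and next_free_entry() for the preallocated history entries; all three are placeholders, not functions from this file:

static void emit_tag(int offset, int length);	/* placeholder */
static void emit_literal(unsigned char c);		/* placeholder */
static HistEntry *next_free_entry(void);		/* placeholder */

#define MIN_MATCH 3		/* a tag costs 2-3 bytes, so shorter copies can't win */

/* Process one position: emit a tag or a literal, add history entries
 * for every byte consumed, and return the number of bytes consumed. */
static int
compress_step(const unsigned char *ip, int remaining)
{
	int off = 0;
	int len = find_match(hist_start[hist_idx(ip)], ip, remaining, &off);
	int i;

	if (len >= MIN_MATCH)
	{
		emit_tag(off, len);
		for (i = 0; i < len; i++)	/* covered bytes: history adds only */
			hist_add(next_free_entry(), ip + i);
		return len;
	}

	emit_literal(*ip);
	hist_add(next_free_entry(), ip);
	return 1;
}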
111 | 165 | *	Acknowledgements:
112 | 166 | *
113 | 167 | *	Many thanks to Adisak Pochanayon, whose article about SLZ