1 | 1 | /* ----------
2 | 2 | * pg_lzcompress.c -
3 | 3 | *
4 |   | - *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.6 2000/07/03 23:09:52 wieck Exp $
  | 4 | + *	$Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.7 2000/07/06 21:02:07 wieck Exp $
5 | 5 | *
6 | 6 | *	This is an implementation of LZ compression for PostgreSQL.
7 | 7 | *	It uses a simple history table and generates 2-3 byte tags
46 | 46 | *	The return value is the number of bytes written to buff.
47 | 47 | *	Obviously the same as what PGLZ_RAW_SIZE() returns.
48 | 48 | *
49 |    | - *	The compression algorithm and internal data format:
   | 49 | + *	The decompression algorithm and internal data format:
50 | 50 | *
51 | 51 | *	PGLZ_Header is defined as
52 | 52 | *
57 | 57 | *
58 | 58 | *	The header is followed by the compressed data itself.
59 | 59 | *
60 |    | - *	The algorithm is easiest explained by describing the process
61 |    | - *	of decompression.
   | 60 | + *	The data representation is easiest explained by describing
   | 61 | + *	the process of decompression.
62 | 62 | *
63 | 63 | *	If varsize == rawsize + sizeof(PGLZ_Header), then the data
64 | 64 | *	is stored uncompressed as plain bytes. Thus, the decompressor
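A minimal sketch of that uncompressed-storage check, assuming a PGLZ_Header with int32 varsize and rawsize fields (the field names and layout here are assumptions for illustration, not the file's actual definition, which is elided from this hunk):

#include <stdint.h>

/* Assumed header layout; see the definition elided above. */
typedef struct PGLZ_Header
{
	int32_t		varsize;	/* total size, including this header */
	int32_t		rawsize;	/* size of the uncompressed data */
} PGLZ_Header;

/*
 * If the stored size equals the raw size plus the header, no
 * compression took place and the payload is the plain input,
 * so the decompressor can simply copy it.
 */
static int
pglz_is_stored_plain(const PGLZ_Header *hdr)
{
	return hdr->varsize == hdr->rawsize + (int32_t) sizeof(PGLZ_Header);
}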
108 | 108 | *	and end up with a total compression rate of 96%, which is
109 | 109 | *	still worth a wow.
110 | 110 | *
| 111 | + *	The compression algorithm
| 112 | + *
| 113 | + *	The following description uses the numbers of the default strategy.
| 114 | + *
| 115 | + *	The compressor works best for attributes of a size between
| 116 | + *	1K and 1M. For smaller items there's not that much chance of
| 117 | + *	redundancy in the character sequence (except for large areas
| 118 | + *	of identical bytes like trailing spaces), and for bigger ones
| 119 | + *	the allocation of the history table is expensive (it needs
| 120 | + *	8 times the size of the input!).
| 121 | + * |
| 122 | + *	The compressor creates a table for 8192 lists of positions.
| 123 | + *	For each input position (except the last 3), a hash key is
| 124 | + *	built from the 4 next input bytes and the position is
| 125 | + *	remembered in the appropriate list. Thus, the table points
| 126 | + *	to linked lists of strings that are likely to match in at
| 127 | + *	least their first 4 characters. This is done on the fly
| 128 | + *	while the input is compressed into the output area.
| 129 | + * |
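As an illustration, here is one way the history table and its 4-byte hash could look in C. The 8192 lists and the one-entry-per-position bookkeeping follow the text above; the mixing in hist_idx() and all names are assumptions, not the file's actual macros. Note that one entry of two 4-byte pointers per input byte is exactly the 8-times-the-input memory cost mentioned earlier (on a 32-bit machine).

#define HIST_SIZE 8192			/* number of position lists */

typedef struct HistEntry
{
	struct HistEntry *next;		/* older position, same hash key */
	const unsigned char *pos;	/* where these 4 bytes occurred */
} HistEntry;

static HistEntry *hist_start[HIST_SIZE];

/* Hash the 4 bytes at 's' into a table index (mixing assumed). */
static unsigned int
hist_idx(const unsigned char *s)
{
	return ((s[0] << 9) ^ (s[1] << 6) ^ (s[2] << 3) ^ s[3])
		   & (HIST_SIZE - 1);
}

/* Prepend the current position to its list, so a walk visits
 * entries in increasing backward offset order. */
static void
hist_add(HistEntry *e, const unsigned char *pos)
{
	unsigned int idx = hist_idx(pos);

	e->next = hist_start[idx];
	e->pos = pos;
	hist_start[idx] = e;
}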
| 130 | + *	For each byte in the input, its hash key (built from this
| 131 | + *	byte and the next 3) is used to find the appropriate list
| 132 | + *	in the table. The lists remember the positions of all bytes
| 133 | + *	that had the same hash key in the past, in increasing backward
| 134 | + *	offset order. Now, for all entries in that list, the match
| 135 | + *	length is computed by comparing the characters from the
| 136 | + *	entry's position with the characters from the current input
| 137 | + *	position.
| 138 | + * |
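That comparison is a plain byte-wise scan. A sketch (the function name and the bounds parameter are illustrative):

/* Count how many bytes match between a remembered history position
 * and the current input position, limited by the input that's left. */
static int
match_length(const unsigned char *hist_pos,
			 const unsigned char *input_pos, int remaining)
{
	int len = 0;

	while (len < remaining && hist_pos[len] == input_pos[len])
		len++;
	return len;
}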
| 139 | + *	The compressor starts with a so-called "good_match" of 128.
| 140 | + *	It is a "prefer speed over compression ratio" optimizer.
| 141 | + *	So if the first entry looked at already has 128 or more
| 142 | + *	matching characters, the lookup stops and that position is
| 143 | + *	used for the next tag in the output.
| 144 | + *
| 145 | + *	For each subsequent entry in the history list, the "good_match"
| 146 | + *	is lowered by 10%. So the compressor will be happier with
| 147 | + *	short matches the farther it has to go back in the history.
| 148 | + *	This is another "speed over ratio" preference characteristic
| 149 | + *	of the algorithm.
| 150 | + * |
| 151 | + *	Thus there are 3 stop conditions for the lookup of matches (sketched below):
| 152 | + *
| 153 | + *		- a match >= good_match is found
| 154 | + *		- there are no more history entries to look at
| 155 | + *		- the next history entry is already too far back
| 156 | + *		  to be coded into a tag
| 157 | + * |
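Combining the pieces, a sketch of the lookup loop, reusing the HistEntry and match_length sketches above. The 128 start value and the 10% decay come from the text; the 4095-byte maximum offset a tag can encode is an assumption here (the tag format is described in the elided part of this comment):

#define MAX_TAG_OFFSET 4095		/* assumed limit of a tag's offset field */

/* Find the best usable match for input position 'ip'; returns its
 * length (0 if none) and stores the backward offset in *off. */
static int
find_match(HistEntry *list, const unsigned char *ip, int remaining,
		   int *off)
{
	int good_match = 128;
	int best_len = 0;
	HistEntry *e;

	for (e = list; e != NULL; e = e->next)
	{
		long offset = ip - e->pos;
		int len;

		/* Stop: this (and every older) entry is too far back
		 * to be coded into a tag. */
		if (offset > MAX_TAG_OFFSET)
			break;

		len = match_length(e->pos, ip, remaining);
		if (len > best_len)
		{
			best_len = len;
			*off = (int) offset;
		}

		/* Stop: good enough; prefer speed over ratio. */
		if (len >= good_match)
			break;

		/* Accept shorter matches the farther back we go. */
		good_match -= good_match / 10;
	}
	/* Third stop condition: e == NULL, no more history entries. */
	return best_len;
}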
| 158 | + *	Finally, the match algorithm checks that a match of at
| 159 | + *	least 3 bytes has been found, because that's the smallest
| 160 | + *	amount of copy information to code into a tag. If so, a
| 161 | + *	tag is emitted and all the input bytes covered by it are
| 162 | + *	just scanned for the history adds; otherwise a literal
| 163 | + *	character is emitted and only its history entry is added.
| 164 | + * |
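A sketch of that decision in the main loop, reusing the sketches above. emit_tag() and emit_literal() stand in for the output coding, and next_free_entry() for the preallocated history entries; all three are placeholders, not functions from this file:

static void emit_tag(int offset, int length);	/* placeholder */
static void emit_literal(unsigned char c);		/* placeholder */
static HistEntry *next_free_entry(void);		/* placeholder */

#define MIN_MATCH 3		/* a tag costs 2-3 bytes, so shorter copies can't win */

/* Process one position: emit a tag or a literal, add history entries
 * for every byte consumed, and return the number of bytes consumed. */
static int
compress_step(const unsigned char *ip, int remaining)
{
	int off = 0;
	int len = find_match(hist_start[hist_idx(ip)], ip, remaining, &off);
	int i;

	if (len >= MIN_MATCH)
	{
		emit_tag(off, len);
		for (i = 0; i < len; i++)	/* covered bytes: history adds only */
			hist_add(next_free_entry(), ip + i);
		return len;
	}

	emit_literal(*ip);
	hist_add(next_free_entry(), ip);
	return 1;
}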
111 | 165 | *	Acknowledgements:
112 | 166 | *
113 | 167 | *	Many thanks to Adisak Pochanayon, whose article about SLZ