bwesterb/go-ncrlite

Compress sets of integers efficiently

License

MIT license

go-ncrlite

ncrlite is a simple and fast compression format specifically designed to compress an unordered set of positive integers (below 2⁶⁴). This repository contains a Go package that implements it and a commandline tool.

Warning. The file format is not yet final.

Performance

ncrlite achieves smaller compressed sizes than general-purpose compressors.

Dataset	Description	CSV	ncrlite	`gzip -9`	`xz -9`
le.csv	Sequence numbers of Let's Encrypt certificates revoked on July 18th, 2024	4.8MB	706kB	1.7MB	900kB
primes.csv	First million prime numbers	8.2MB	674kB	2.4MB	941kB
sigs.csv	List of the 9 signature algorithms supported by Chrome 126	44B	16B	58B	96B
9900.csv	Numbers {9900, 9901, ..., 9999, 10000}	506B	24B	181B	200B

Compared to more specialized compressors, ncrlite outperforms Elias–Fano. ncrlite performs slightly worse than Rice coding on random sets, but is still close to the theoretical limit of lg N choose k. ncrlite does perform better than Rice coding on skewed sets like {9900, ..., 10000}.

Dataset	ncrlite	Rice	Elias–Fano	Limit for random sets
le.csv	706kB	707kB	734kB	704kB
primes.csv	674kB	669kB	742kB	668kB
sigs.csv	16B	11B	11B	11B
9900.csv	24B	108B	108B	101B

Theoretical limit for random sets

There are N choose k subsets of k positive integers below N. Thus there is a hard limit: no compression method can encode every such set in fewer than lg N choose k bits.

Of course a compression method can beat the limit for specific sets, but it will have to compensate by using more bits for others.
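To get a feel for the limit, here is a minimal sketch that evaluates lg N choose k numerically. It is not part of the package: the helper name lgBinomial is hypothetical, and it uses the log-gamma function from the Go standard library so that large N and k do not overflow.

```go
package main

import (
	"fmt"
	"math"
)

// lgBinomial returns log2(n choose k). It works with logarithms of
// factorials (via math.Lgamma) so that it stays finite for large n.
// Hypothetical helper for illustration only.
func lgBinomial(n, k float64) float64 {
	a, _ := math.Lgamma(n + 1)
	b, _ := math.Lgamma(k + 1)
	c, _ := math.Lgamma(n - k + 1)
	return (a - b - c) / math.Ln2
}

func main() {
	// Made-up example: 100 values chosen from 10000 candidates.
	bits := lgBinomial(10000, 100)
	fmt.Printf("%.1f bits = %.1f bytes\n", bits, bits/8)
}
```

Dividing the result by eight gives the size in bytes, which is the unit used in the tables above.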

Origin of the name

The name ncrlite is a pun on this theoretical limit and CRLite. Namely N choose k is sometimes written as N nCr k, including on my old TI 83+, and I studied this problem initially in the context of compressing certificate transparency index numbers of revoked certificates.

Commandline tool

Installation

Install Go and run

$ go install github.com/bwesterb/go-ncrlite/cmd/ncrlite@latest

Now you can use ncrlite.

Basic operation

ncrlite takes as input a textfile with a positive number on each line.

$ cat dunbar
5
15
35
150
500
1500

To compress simply run:

$ ncrlite dunbar

This will create dunbar.ncrlite and remove dunbar.

The input file does not have to be sorted (numerically). If it is not, ncrlite will sort the input first, which is slower.
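As an illustration of that input handling, the sketch below parses one number per line and sorts only when needed. It is not the tool's actual parser: parseSet is a hypothetical name, and skipping blank lines is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSet reads one unsigned integer per line and returns the values in
// increasing order. Blank lines are skipped (an assumption of this sketch).
func parseSet(text string) ([]uint64, error) {
	var set []uint64
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		v, err := strconv.ParseUint(line, 10, 64)
		if err != nil {
			return nil, err
		}
		set = append(set, v)
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	// Sorting is only done when the input is not already in order.
	less := func(i, j int) bool { return set[i] < set[j] }
	if !sort.SliceIsSorted(set, less) {
		sort.Slice(set, less)
	}
	return set, nil
}

func main() {
	set, err := parseSet("35\n5\n15\n")
	fmt.Println(set, err)
}
```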

To decompress, run:

$ ncrlite -d dunbar.ncrlite

This will create dunbar and remove dunbar.ncrlite. The output file is always sorted.

Other formats

At the moment, the ncrlite commandline tool only supports the simple text format. Reach out if another is useful.

Other flags

ncrlite supports several familiar flags.

  -f, --force     overwrite output
  -k, --keep      keep (don't delete) input file
  -c, --stdout    write to stdout; implies -k

Without specifying a filename (or using -), ncrlite will read from stdin and write to stdout.

Inspect compressed file

With -i we can inspect a compressed file:

$ ncrlite -i le.csv.ncrlite
max bitlength        14
codelength h[0]      9
dictionary size      56b
Codebook bitlengths:
 0 111111110
 1 11111110
 2 1111100
 3 1111101
 4 11110
 5 1100
 6 1101
 7 100
 8 00
 9 01
10 101
11 1110
12 1111110
13 1111111110
14 1111111111
Maximum value    (N)  382584265
Number of values (k)  512652
Theoretical best avg  703953.8B
Overhead              0.4%

Format

In short: we store the deltas (differences) which are each prefixed by a Huffman code for their bitlength. The Huffman code is stored using bzip2's method.

Now, in detail. The file starts with the size of the set as an unsigned varint.

There are two special cases.

If the size of the set is zero, the file ends immediately after the size (without endmarker).
If the size of the set is one, then the value of that element is encoded as an unsigned varint after it and the file ends (without endmarker).
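The start of the file, including both special cases, can be sketched as follows. This assumes the unsigned varint is the LEB128-style encoding provided by Go's encoding/binary package; header is a hypothetical helper, not a function of the package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// header returns the bytes at the start of the file: the size of the set
// as an unsigned varint and, only when the set has exactly one element,
// that element as a second unsigned varint.
// Assumption: the varint is the one implemented by encoding/binary.
func header(set []uint64) []byte {
	buf := binary.AppendUvarint(nil, uint64(len(set)))
	if len(set) == 1 {
		buf = binary.AppendUvarint(buf, set[0])
	}
	return buf
}

func main() {
	fmt.Printf("% x\n", header(nil))           // empty set: 00
	fmt.Printf("% x\n", header([]uint64{300})) // singleton: 01 ac 02
}
```

For sets with two or more elements, only the size is produced here; the rest of the format described below follows it.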

The values of the set are not encoded directly, but instead their deltas are encoded. The nth delta is the difference between the nth and the n-1th value, considering the set as a sorted list.

The first delta is special: it's the minimum value of the set plus one so that a delta is never zero.
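A minimal sketch of the delta computation, assuming the input is already sorted and free of duplicates (deltas is a hypothetical helper name):

```go
package main

import "fmt"

// deltas turns a sorted list of distinct values into the list of deltas.
// The first delta is the minimum value plus one; every later delta is the
// difference with the previous value.
func deltas(sorted []uint64) []uint64 {
	out := make([]uint64, len(sorted))
	for i, v := range sorted {
		if i == 0 {
			out[i] = v + 1
		} else {
			out[i] = v - sorted[i-1]
		}
	}
	return out
}

func main() {
	fmt.Println(deltas([]uint64{5, 15, 35, 150, 500, 1500}))
	// prints [6 10 20 115 350 1000]
}
```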

For each delta d, we consider its bitlength. That is the least l such that 2^(l+1) > d. Note that this is different from the typical definition of bitlength being one smaller: the length of 1 is 0 and of 4 is 2.
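In Go this definition is one less than what math/bits reports. The helper below is a sketch (bitlength is a hypothetical name) and assumes d is at least 1, which holds for every delta.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitlength returns the least l such that 2^(l+1) > d, for d >= 1.
// bits.Len64 gives the usual bitlength, which is one larger.
func bitlength(d uint64) int {
	return bits.Len64(d) - 1
}

func main() {
	fmt.Println(bitlength(1), bitlength(4), bitlength(1000)) // prints 0 2 9
}
```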

The six least significant bits of the next byte encode the largest bitlength of any delta that occurs. We assign to that bitlength, and each smaller bitlength, a canonical Huffman code by encoding the length of each of their codewords.

The next six bits (that is: the two most significant bits of the byte used for the largest bitlength, and the four least significant bits of the byte afterward) encode the length of the codeword for the zero bitlength.

We continue with the remaining bitlengths in order. If the next bitlength has a codeword of the same length as the previous codeword, we encode this with a single bit 1.

Instead, if the codeword is one larger we encode this as first a single bit 0 to say we're not done; then a single bit 1 to say the next is larger; and finally a 1 to say we're done. Together: 0b101. If it's two larger we repeat twice: 0b10101. And so on. If the codeword is one smaller we use 0b001, and repeat 00 if the difference is larger.
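Because the code is canonical, the codeword lengths are enough to rebuild the codewords. The sketch below uses the textbook canonical assignment (shorter codewords first, ties broken by symbol); canonicalCodes is a hypothetical helper, and it assumes every length is at least 1. Fed the codeword lengths from the codebook in the Inspect compressed file section, it reproduces the codewords listed there. Whether the package breaks ties this way in every case is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// canonicalCodes assigns canonical Huffman codewords, given the codeword
// length of each symbol (here a symbol is a bitlength). Codewords are
// returned as strings of '0' and '1'.
func canonicalCodes(lengths []int) []string {
	// Order symbols by codeword length; the stable sort keeps symbols
	// with equal lengths in increasing order.
	order := make([]int, len(lengths))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		return lengths[order[a]] < lengths[order[b]]
	})

	codes := make([]string, len(lengths))
	var code uint64
	prev := 0
	for i, sym := range order {
		l := lengths[sym]
		if i > 0 {
			// Next codeword: add one, then pad with zeros on the
			// right if the length grows.
			code = (code + 1) << uint(l-prev)
		}
		prev = l
		codes[sym] = fmt.Sprintf("%0*b", l, code)
	}
	return codes
}

func main() {
	// Codeword lengths for bitlengths 0..14 from the le.csv example.
	lengths := []int{9, 8, 7, 7, 5, 4, 4, 3, 2, 2, 3, 4, 7, 10, 10}
	for sym, c := range canonicalCodes(lengths) {
		fmt.Printf("%2d %s\n", sym, c)
	}
}
```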

After having encoded the Huffman code for the bitlengths, we encode the deltas themselves. First we write the Huffman code for the bitlength. Then we write the delta with that many bits, without its most significant bit as it's implied.
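The split of a delta into its bitlength and the remaining low bits can be sketched as follows. The helper names splitDelta and joinDelta are hypothetical, and the sketch deliberately says nothing about the order in which bits are written to the file.

```go
package main

import (
	"fmt"
	"math/bits"
)

// splitDelta returns the bitlength l of d (for d >= 1) and the l low bits
// that remain after dropping the implied most significant bit.
func splitDelta(d uint64) (l int, rest uint64) {
	l = bits.Len64(d) - 1
	rest = d &^ (uint64(1) << uint(l))
	return l, rest
}

// joinDelta is the decoder's side: it restores the implied top bit.
func joinDelta(l int, rest uint64) uint64 {
	return uint64(1)<<uint(l) | rest
}

func main() {
	l, rest := splitDelta(10) // 10 = 0b1010
	fmt.Printf("l=%d rest=%0*b\n", l, l, rest)
	fmt.Println(joinDelta(l, rest))
}
```

A delta of 1 has bitlength 0, so only its Huffman codeword is written and no further bits follow.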

Finally, we write the endmarker 0xaa = 0b10101010. This allows for simpler decompression using prefix tables. The remaining high bits in the final byte are set to zero.
