jermp/interpolative_codingPublic

NotificationsYou must be signed in to change notification settings
Fork4
Star30

A flexible and efficient C++ implementation of the Binary Interpolative Coding algorithm.

License

MIT license

30 stars 4 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
external		external
include		include
src		src
test		test
.clang-format		.clang-format
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Repository files navigation

Binary Interpolative Coding

A C++ library implementing theBinary Interpolative Coding compression algorithm invented by Alistair Moffat and Lang Stuiver [1].

The algorithm can be used to compress sorted integer sequences (here,assumed to be increasing).

The implementation comes in different flavours:it can be specified the use ofsimplebinary codes,left-most minimal codes andcentered minimal codes.Additionally, the implementation isrun-aware, i.e.,it optimizes encoding/decoding of runs of consecutive identifiers.

All details and experiments are provided in the followingtechnical report [2]

Compiling the code

The code is tested on Linux withgcc 7.3.0, 8.3.0, 9.2.1 and on Mac 10.14 withclang 10.0.0.To build the code,CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/interpolative_coding.git

If you have cloned the repository without--recursive, you will need to perform the following commands beforecompiling:

git submodule initgit submodule update

To compile the code for a release environmentand best performance (see fileCMakeLists.txt for the used compilation flags), do:

mkdir buildcd buildcmake .. -DRUNAWARE=Onmake

Hint: Usemake -j4 to compile the library in parallel using, e.g., 4 jobs.

For a testing environment, use the following instead:

mkdir debug_buildcd debug_buildcmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=Onmake

Quick Start

For a quick start, see the source filetest/example.cpp.After compilation, run this example with

./example

A simpler variation is shown below.

#include<iostream>#include"interpolative_coding.hpp"usingnamespacebic;template<typename BinaryCode>voidtest(std::vector<uint32_t>const& in) {    std::cout <<"to be encoded:\n";for (auto x : in) {        std::cout << x <<"";    }    std::cout << std::endl;uint32_t n = in.size();    encoder<typename BinaryCode::writer> enc;    enc.encode(in.data(), n);    std::vector<uint32_t>out(n);    decoder<typename BinaryCode::reader> dec;uint32_t m = dec.decode(enc.bits().data(), out.data());assert(m == n);    std::cout <<"decoded" << m <<" values" << std::endl;    std::cout <<"total bits" << enc.num_bits() << std::endl;    std::cout <<static_cast<double>(enc.num_bits()) / m <<" bits x key"              << std::endl;    std::cout <<"decoded:\n";for (auto x : out) {        std::cout << x <<"";    }    std::cout << std::endl;}intmain(int argc,char** argv) {if (argc <2) {        std::cerr << argv[0] <<" binary_code_type" << std::endl;return1;    }    std::vector<uint32_t> in = {3,4,7,13,14,15,21,25,36,38,54,62};    std::stringtype(argv[1]);if (type =="binary") {        test<binary>(in);    }elseif (type =="leftmost_minimal") {        test<leftmost_minimal>(in);    }elseif (type =="centered_minimal") {        test<centered_minimal>(in);    }else {        std::cerr <<"unknown type '" << type <<"'" << std::endl;return1;    }return0;}

Encoding/decoding a collection of sequences

Typically, we want to build all the sequences froma collection.In this case, we assume that the input collectionis a binary file with all the sequences being writtenas 32-bit integers. In this library, we follow theinput data format of theds2i library:each sequence is prefixed by an additional32-bit integer representing the size of the sequence.The collection file starts with a singleton sequencecontaining the universe of representation of the sequences, i.e., the maximum representable value.

We also assume all sequences areincreasing.

The filedata/test_collection.docs represents an example ofsuch organization.

To encode all the sequences from this file, do:

./encode leftmost_minimal ../data/test_collection.docs -o test.bin

To decode all the sequences from the encoded filetest.bin, do:

./decode leftmost_minimal test.bin

To check correctness of the implementation, use:

./check leftmost_minimal test.bin ../data/test_collection.docs

which will compare every decoded integer against the input collection.

Benchmark

For this benchmark we used the whole Gov2 datasets, containing5,742,630,292 integers in 35,636,425 sequences.

We report the average number of bits per integer (bpi)and nanoseconds spent per decoded integer (with and without therun-aware optimization).

We used two different Intel processors: i7-7700and i9-9900K, both clocked at 3.6 GHz and having 32K L1 caches forinstructions and data.Both systems run Linux 4.4.0 and have 64 GB on RAM.The code was compiled with gcc 7.3.0 on the firstsystem; with gcc 8.3.0 on the second.In both cases we used all optimizations(see alsoCMakeLists.txt).

Method	bpi	ns/int (run-aware) on i7-7700	ns/int (not run-aware) on i7-7700	ns/int (run-aware) on i9-9900K	ns/int (not run-aware) on i9-9900K
simple	3.532	3.45	4.65	2.52	3.37
left-most minimal	3.362	5.78	7.07	4.18	5.28
centered minimal	3.361	5.78	7.07	4.24	5.33

Author

Giulio Ermanno Pibiri,giulio.ermanno.pibiri@isti.cnr.it

References

[1] Alistair Moffat and Lang Stuiver. 2000.Binary Interpolative Coding for Effective Index Compression. Information Retrieval Journal 3, 1 (2000), 25 – 47.
[2] Giulio Ermanno Pibiri. 2019.On Implementing the Binary Interpolative Coding Algorithm. Technical report.http://pages.di.unipi.it/pibiri/papers/BIC.pdf

About

A flexible and efficient C++ implementation of the Binary Interpolative Coding algorithm.

pages.di.unipi.it/pibiri/papers/BIC.pdf

Releases1

Version 1.0.0 Latest

Sep 30, 2021

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Binary Interpolative Coding

Table of contents

Compiling the code

Quick Start

Encoding/decoding a collection of sequences

Benchmark

Author

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases1

Packages

Uh oh!

Languages

Movatterモバイル変換

License

jermp/interpolative_coding

Folders and files

Latest commit

History

Repository files navigation

Binary Interpolative Coding

Table of contents

Compiling the code

Quick Start

Encoding/decoding a collection of sequences

Benchmark

Author

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases1

Packages0

Uh oh!

Languages

Packages