- Notifications
You must be signed in to change notification settings - Fork64
A simple integer compression library in Java
License
fast-pack/JavaFastPFOR
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
It is a library to compress and uncompress arrays of integersvery fast. The assumption is that most (but not all) values inyour array use much less than 32 bits, or that the gaps betweenthe integers use much less than 32 bits. These sort of arrays often come upwhen using differential coding in databases and informationretrieval (e.g., in inverted indexes or column stores).
Please note that random integers are not compressible, by thislibrary or by any other means. If you ever had the means ofsystematically compressing random integers, you could compressany data source to nothing, by recursive application of your technique.
This library can decompress integers at a rate of over 1.2 billions per second(4.5 GB/s). It is significantly faster than generic codecs (suchas Snappy, LZ4 and so on) when compressing arrays of integers.
The library is used inLinkedIn Pinot, a realtime distributed OLAP datastore.Part of this library has been integrated in Parquet (http://parquet.io/).A modified version of the library is included in the search engineTerrier (http://terrier.org/). This libary is used by ClueWebTools (https://github.com/lintool/clueweb). It is also used byApache NiFi.
This library inspired a compression scheme used by Apache Lucene and Apache Lucene.NET (e.g., seehttp://lucene.apache.org/core/4_6_1/core/org/apache/lucene/util/PForDeltaDocIdSet.html ).
It is a java port of the fastpfor C++ library (https://github.com/lemire/FastPFor).There is also a Go port (https://github.com/reducedb/encoding). The C++library is used by the zsearch engine (http://victorparmar.github.com/zsearch/)as well as in GMAP and GSNAP (http://research-pub.gene.com/gmap/).
packageorg.example;importme.lemire.integercompression.FastPFOR128;importme.lemire.integercompression.IntWrapper;importjava.util.Arrays;publicclassMain {publicstaticvoidmain(String[]args) {FastPFOR128fastpfor =newFastPFOR128();intN =9984;int[]data =newint[N];for (vari =0;i <N;i +=150) {data[i] =i; }int[]compressedoutput1 =newint[N +1024];IntWrapperinputoffset1 =newIntWrapper(0);IntWrapperoutputoffset1 =newIntWrapper(0);fastpfor.compress(data,inputoffset1,N,compressedoutput1,outputoffset1);intcompressedsize1 =outputoffset1.get();int[]recovered1 =newint[N];inputoffset1 =newIntWrapper(0);outputoffset1 =newIntWrapper(0);fastpfor.uncompress(compressedoutput1,outputoffset1,compressedsize1,recovered1,inputoffset1);// quick verification: count mismatchesintmismatches =0;for (inti =0;i <N;i++) {if (data[i] !=recovered1[i])mismatches++; }System.out.println("N=" +N +" compressedSizeWords=" +compressedsize1 +" mismatches=" +mismatches);System.out.println("first 20 original: " +Arrays.toString(Arrays.copyOf(data,20)));System.out.println("first 20 recovered: " +Arrays.toString(Arrays.copyOf(recovered1,20))); }}
For more examples, see example.java or the examples folder.
JavaFastPFOR supports compressing and uncompressing data in chunks (e.g., seeadvancedExample
inhttps://github.com/lemire/JavaFastPFOR/blob/master/example.java).
Some CODECs ("integrated codecs") assume that the integers arein sorted orders and use differential coding (they compress deltas).They can be found in the package me.lemire.integercompression.differential.Most others do not.
The Java Team at Intel (R) introduced the vector implementation for FastPFORbased on the Java Vector API that showed significant gains over thenon-vectorized implementation. For an example usage, seeexamples/vector/Example.java. The feature requires JDK 19+ and is currently foradvanced users.
We have a demo project using JavaFastPFOR as a dependency (both Maven and Gradle). See...
https://github.com/fast-pack/JavaFastPFORDemo
- Maven
Using this code in your own project is easy with maven, just addthe following code in your pom.xml file:
<dependency> <groupId>com.github.fast-pack</groupId> <artifactId>JavaFastPFor</artifactId> <version>JavaFastPFOR-0.3.2</version></dependency>
as well as jitpack as a repository...
<repositories><repository> <id>jitpack.io</id> <url>https://jitpack.io</url></repository></repositories>
Naturally, you should replace "version" by the versionyou desire.
- Gradle (groovy)
Then all you need is to edit yourbuild.gradle
file like so:
plugins { id'java'}repositories { mavenCentral() maven { url'https://jitpack.io' }}dependencies { implementation'com.github.fast-pack:JavaFastPFor:JavaFastPFOR-0.3.2'}
Naturally, you should replace "version" by the versionyou desire.
Some codecs are thread-safe while others are not.For this reason, it is best to use one codec per thread.The memory usage of a codec instance is small in any case.
Nevertheless, if you want to reuse codec instances,note that by convention, unless the documentation of a codec specifythat it is not thread-safe, then it can be assumed to be thread-safe.
In our tests, Kamikaze PForDelta is slower than our implementations. Seethe benchmarkresults directory for some results.
Reference:http://sna-projects.com/kamikaze/
Releases up to 0.1.12 require Java 7 or better.
The current development versions assume JDK 21 or better.
Compile the code and executeme.lemire.integercompression.benchmarktools.Benchmark
.
Speed is always reported in millions of integers per second.
mvn compilemvn exec:java
You may run our examples as follows:
mvn packagejavac -cp target/classes/:. example.javajava -cp target/classes/:. example
If you use Apache ant, please try this:
$ ant Benchmark
or:
$ ant Benchmark -Dbenchmark.target=BenchmarkBitPacking
http://www.javadoc.io/doc/me.lemire.integercompression/JavaFastPFOR/
This library was a key ingredient in the best paper at ECIR 2014 :
Matteo Catena, Craig Macdonald, Iadh Ounis, On Inverted Index Compression for Search Engine Efficiency, Lecture Notes in Computer Science 8416 (ECIR 2014), 2014.http://dx.doi.org/10.1007/978-3-319-06028-6_30
We wrote several research papers documenting many of the CODECs implemented here:
- Daniel Lemire, Nathan Kurz, Christoph Rupp, Stream VByte: Faster Byte-Oriented Integer Compression, Information Processing Letters (to appear)https://arxiv.org/abs/1709.08990
- Daniel Lemire, Leonid Boytsov, Nathan Kurz, SIMD Compression and the Intersection of Sorted Integers, Software Practice & Experience Volume 46, Issue 6, pages 723-749, June 2016http://arxiv.org/abs/1401.6399
- Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software Practice & Experience 45 (1), 2015.http://arxiv.org/abs/1209.2137http://onlinelibrary.wiley.com/doi/10.1002/spe.2203/abstract
- Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.http://arxiv.org/abs/1503.07387
- Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen, A General SIMD-based Approach to Accelerating Compression Algorithms, ACM Transactions on Information Systems 33 (3), 2015.http://arxiv.org/abs/1502.01916
Ikhtear Sharif wrote his M.Sc. thesis on this library:
Ikhtear Sharif, Performance Evaluation of Fast Integer Compression Techniques Over Tables, M.Sc. thesis, UNB 2013.https://unbscholar.lib.unb.ca/islandora/object/unbscholar%3A9399/datastream/PDF/view
He also posted his slides online:http://www.slideshare.net/ikhtearSharif/ikhtear-defense
- Fast integer compression in Go:https://github.com/ronanh/intcomp
- Encoding: Integer Compression Libraries for Gohttps://github.com/zhenjl/encoding
- CSharpFastPFOR: A C# integer compression libraryhttps://github.com/Genbox/CSharpFastPFOR
- TurboPFor is a C library that offers lots of interesting optimizations and Java wrappers. Well worth checking! (Uses a GPL license.)https://github.com/powturbo/TurboPFor
This work was supported by NSERC grant number 26143.
About
A simple integer compression library in Java
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.