Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

SlimArray compresses uint32 into several bits, by using a polynomial to describe overall trend of an array.

License

NotificationsYou must be signed in to change notification settings

openacid/slimarray

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Travistest

Report cardCoverage Status

GoDocPkgGoDevSourcegraph

SlimArray is a space efficient, staticuint32 array.It uses polynomial to compress and store an array.With a SlimArray with a million sorted number in range[0, 1000*1000],

  • auint32 requires only5 bits (17% of original data);
  • compressing auint32 takes110 ns, e.g., 9 million insert per second;
  • reading auint32 withGet() takes7 ns.
  • batch reading withSlice() takes3.8 ns/elt.

SlimBytes is an array of var-length records(a record is a[]byte), which is indexed by SlimArray.Thus the memory overhead of storingoffset andlength of each record is very low, e.g., about8 bits/record,compared to a typical implementation that uses an offset of type int(32 to 64 bit / record).AnGet() takes15 ns.

中文介绍:https://blog.openacid.com/algo/slimarray/

Why

  • Space efficient: In a sorted array, an elt only takes about10 bits tostore a 32-bit int.
Data sizeData Setgzip sizeslimarry sizeavg sizeratio
1,000rand u32: [0, 1000]x824 byte6 bit/elt18%
1,000,000rand u32: [0, 1000,000]x702 KB5 bit/elt15%
1,000,000IPv4 DB2 MB2 MB16 bit/elt50%
600slim star count602 byte832 byte10 bit/elt26%
  • Fast:Get(): 7 ns/op. Building: 110 ns/elt. Run and see the benchmark:go test . -bench=..

  • Adaptive: It does not require the data to be totally sorted to compressit. E.g., SlimArray is perfect to store online user histogram data.

  • Ready for transport: slimarray is protobuf defined, and has the same structure in memory ason disk. No cost to load or dump.

What It Is And What It Is Not

Another space efficient data structure to store uint32 array is trie(Aka prefixtree or radix tree). It is possible to use bitmap-based btree like structureto reduce space(very likely in such case it provides higher compression rate).But it requires the array to besorted.

SlimArray does not have such restriction. It is more adaptive with datalayout. To achieve high compression rate, it only requires the data has aoverall trend, e.g.,roughly sorted.

Additionally, it also accept duplicated element in the array, whicha bitmap based or tree-like data structure does not allow.

In theipv4-list example, we feed 450,000 ipv4 to SlimArray.We see that SlimArray costs as small as gzip-ed data(2.1 MB vs 2.0 MB),while it provides instance access to the data without decompressing it.And in theslimstar example, SlimArray memory usage vs gzip-ed data is 832 bytes vs 602 bytes.

Limitation

  • Static: slimarray is a static data structure that can not be modifiedafter creation. Thus slimarray is ideal for a time-series-database, i.e., dataset is huge but never change.

  • 32 bits: currently slimarray supports only one element typeuint32.

Install

go get github.com/openacid/slimarray

Synopsis

Build a SlimArray

package slimarray_testimport ("fmt""github.com/openacid/slimarray")funcExampleSlimArray() {nums:= []uint32{0,16,32,48,64,79,95,111,126,142,158,174,190,206,222,236,252,268,275,278,281,283,285,289,296,301,304,307,311,313,318,321,325,328,335,339,344,348,353,357,360,364,369,372,377,383,387,393,399,404,407,410,415,418,420,422,426,430,434,439,444,446,448,451,456,459,462,465,470,473,479,482,488,490,494,500,506,509,513,519,521,528,530,534,537,540,544,546,551,556,560,566,568,572,574,576,580,585,588,592,594,600,603,606,608,610,614,620,623,628,630,632,638,644,647,653,658,660,662,665,670,672,676,681,683,687,689,691,693,695,697,703,706,710,715,719,722,726,731,735,737,741,748,750,753,757,763,766,768,775,777,782,785,791,795,798,800,806,811,815,818,821,824,829,832,836,838,842,846,850,855,860,865,870,875,878,882,886,890,895,900,906,910,913,916,921,925,929,932,937,940,942,944,946,952,954,956,958,962,966,968,971,975,979,983,987,989,994,997,1000,}a:=slimarray.NewU32(nums)fmt.Println("last elt is:",a.Get(int32(a.Len()-1)))st:=a.Stat()for_,k:=range []string{"elt_width","mem_elts","bits/elt"} {fmt.Printf("%10s : %d\n",k,st[k])}// Unordered output:// last elt is: 1000//  elt_width : 3//   mem_elts : 112//   bits/elt : 16}

Build a SlimBytes

package slimarray_testimport ("fmt""github.com/openacid/slimarray")funcExampleSlimBytes() {records:= [][]byte{[]byte("SlimBytes"),[]byte("is"),[]byte("an"),[]byte("array"),[]byte("of"),[]byte("var-length"),[]byte("records(a"),[]byte("record"),[]byte("is"),[]byte("a"),[]byte("[]byte"),[]byte("which"),[]byte("is"),[]byte("indexed"),[]byte("by"),[]byte("SlimArray"),}a,err:=slimarray.NewBytes(records)_=errfori:=0;i<16;i++ {fmt.Print(string(a.Get(int32(i)))," ")}fmt.Println()// Output:// SlimBytes is an array of var-length records(a record is a []byte which is indexed by SlimArray}

How it works

Package slimarray uses polynomial to compress and store an array of uint32. Auint32 costs only 5 bits in a sorted array of a million number in range [0,1000*1000].

The General Idea

We use a polynomial y = a + bx + cx² to describe the overall trend of thenumbers. And for every number i we add a residual to fit the gap between y(i)and nums[i]. E.g. If there are 4 numbers: 0, 15, 33, 50 The polynomial andresiduals are:

y = 16x0, -1, 1, 2

In this case the residuals require 3 bits for each of them. To retrieve thenumbers, we evaluate y(i) and add the residual to it:

get(0) = y(0) + 0 = 16 * 0 + 0 = 0get(1) = y(1) - 1 = 16 * 1 - 1 = 15get(2) = y(2) + 1 = 16 * 2 + 1 = 33get(3) = y(3) + 2 = 16 * 3 + 2 = 50

What It Is And What It Is Not

Another space efficient data structure to store uint32 array is trie or prefixtree or radix tree. It is possible to use bitmap-based btree like structure toreduce space(very likely in such case it provides higher compression rate). Butit requires the array to be sorted.

SlimArray does not have such restriction. It is more adaptive with data layout.To achieve high compression rate, it only requires the data has a overall trend,e.g., roughly sorted, as seen in the above 4 integers examples. Additionally, italso accept duplicated element in the array, which a bitmap based or tree-likedata structure does not allow.

Data Structure

SlimArray splits the entire array into segments(Seg), each of which has 1024numbers. And then it splits every segment into several spans. Every span has itsown polynomial. A span has 16*k numbers. A segment has at most 64 spans.

        seg[0]                      seg[1]        1024 nums                   1024 nums|-------+---------------+---|---------------------------|... span[0]    span[1] 16 nums    32 nums      ..

Uncompressed Data Structures

A SlimArray is a compacted data structure. The original data structures aredefined as follow(assumes original user data isnums []uint32):

Seg struct {  SpansBitmap   uint64      // describe span layout  Rank         uint64      // count `1` in preceding Seg.  Spans       []Span}Span struct {  width         int32       // is retrieved from SpansBitmap  Polynomial [3]double      //  Config struct {           //    Offset        int32     // residual offset    ResidualWidth int32     // number of bits a residual requires  }  Residuals  [width][ResidualWidth]bit // pack into SlimArray.Residuals}

A span stores 16*k int32 in it, where k ∈ [1, 64).

Seg.SpansBitmap describes the layout of Span-s in a Seg. The i-th "1"indicates where the last 16 numbers are in the i-th Span. e.g.:

001011110000......<-- least significant bit

In the above example:

span[0] has 16*3 nums in it.span[1] has 16*2 nums in it.span[2] has 16*1 nums in it.

Seg.Rank caches the total count of "1" in all preceding Seg.SpansBitmap. Thisaccelerate locating a Span in the packed field SlimArray.Polynomials .

Span.width is the count of numbers stored in this span. It does not need to bestored because it can be calculated by counting the "0" between two "1" inSeg.SpansBitmap.

Span.Polynomial stores 3 coefficients of the polynomial describing the overalltrend of this span. I.e. the[a, b, c] iny = a + bx + cx²

Span.Config.Offset adjust the offset to locate a residual. In a span we wantto have that:

residual position = Config.Offset + (i%1024) * Config.ResidualWidth

But if the preceding span has smaller residual width, the "offset" could benegative, e.g.: span[0] has residual of width 0 and 16 residuals, span[1] hasresidual of width 4. Then the "offset" of span[1] is-16*4 in order tosatisfy:(-16*4) + i * 4 is the correct residual position, for i in [16, 32).

Span.Config.ResidualWidth specifies the number of bits to store every residualin this span, it must be a power of 2:2^k.

Span.Residuals is an array of residuals of lengthSpan.width. Every elt init is aResidualWidth-bits integers.

Compact

SlimArray compactSeg into a dense format:

SlimArray.Bitmap = [  Seg[0].SpansBitmap,  Seg[1].SpansBitmap,  ... ]SlimArray.Polynomials = [  Seg[0].Spans[0].Polynomials,  Seg[0].Spans[1].Polynomials,  ...  Seg[1].Spans[0].Polynomials,  Seg[1].Spans[1].Polynomials,  ...]SlimArray.Configs = [  Seg[0].Spans[0].Config  Seg[0].Spans[1].Config  ...  Seg[1].Spans[0].Config  Seg[1].Spans[1].Config  ...]

SlimArray.Residuals simply packs the residuals of every nums[i] together.

About

SlimArray compresses uint32 into several bits, by using a polynomial to describe overall trend of an array.

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Languages


[8]ページ先頭

©2009-2025 Movatter.jp