Surprisingly space efficient trie in Golang (11 bits/key; 100 ns/get).
Slim is a collection of surprisingly space efficient data types, with corresponding serialization APIs for persisting them on disk or for transport.
- Why slim
- Performance and memory overhead
- Synopsis
- Filter mode and KV mode.
- Try it
- Who are using slim
- Feedback and contributions
- Authors
- License
As data on the internet keeps increasing exponentially, the capacity gap between memory and disk keeps growing.
Most of the time, the data itself does not need to be loaded into expensive main memory. Only the far more important information, WHERE-THE-DATA-IS, deserves a seat in main memory.
This is what slim does: it keeps as little information as possible in main memory, serving as a minimized index of a huge amount of external data.
- `SlimIndex`: a common index structure, built on top of `SlimTrie`.
- `SlimTrie`: the underlying index data structure, evolved from the trie.

Features:
- Minimized: 11 bits per key (far less than a 64-bit pointer!).
- Stable: memory consumption is stable across various scenarios. The worst case converges tightly to the average. See the benchmark.
- Loooong keys: you can have VERY long keys (16K bytes) without any waste of memory (or money). Do not waste your life writing another prefix compression :). (aws-s3 limits key length to 1024 bytes.) Memory consumption relates only to key count, not to key length.
- Ordered: like a btree, keys are stored in order. Range-scan will be ready in `0.6.0`.
- Fast: ~150 ns per `Get()`; time complexity for a get is O(log(n) + k), where n is the key count and k is the key length (see the sketch after this list).
- Ready for transport: a single `proto.Marshal()` is all it takes to serialize, transport, or persist it on disk.
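To make the basic workflow concrete, here is a minimal sketch of building a `SlimTrie` and querying it. It is based on the API shown in the examples below; the exact return type of `Get` is an assumption, and `trie.Opt` / `trie.Bool` are assumed to be exported exactly as they appear in those examples.

```go
package main

import (
	"fmt"

	"github.com/openacid/slim/encode"
	"github.com/openacid/slim/trie"
)

func main() {
	// Keys must be sorted; values are encoded with a codec from the encode package.
	keys := []string{"Aaron", "Agatha", "Al", "Albert"}
	values := []int32{1, 2, 3, 4}

	st, err := trie.NewSlimTrie(encode.I32{}, keys, values, trie.Opt{Complete: trie.Bool(true)})
	if err != nil {
		panic(err)
	}

	// Get is assumed to return the decoded value and a found flag.
	v, found := st.Get("Al")
	fmt.Println(v, found) // 3 true

	// "Ready for transport": per the text above, a single proto.Marshal(st)
	// serializes the trie; the protobuf package to import depends on the
	// slim release, so it is not shown here.
}
```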
Performance and memory overhead:

- 3.3 times faster than the btree.
- 2.3 times faster than binary search.
- Memory overhead is about 11 bits per key.
The data structure in this benchmark is a slice of key-value pairs with a `SlimTrie` serving as the index. The slim itself is built in filter mode, to maximize memory reduction and performance. The whole struct `slimKV` is a fully functional kv-store, just like a static btree:
```go
type slimKV struct {
	slim *trie.SlimTrie
	Elts []*KVElt
}

type KVElt struct {
	Key string
	Val int32
}
```
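For illustration, a lookup on this `slimKV` would go through the slim first and then verify against `Elts`, since filter mode can report false positives. The following is a hedged sketch, not the benchmark's actual code; it assumes the slim stores each element's index in `Elts` as its value and that `Get` returns `(interface{}, bool)`.

```go
// get is a hypothetical lookup for slimKV, shown only to illustrate
// how a filter-mode slim is combined with the stored elements.
func (kv *slimKV) get(key string) (int32, bool) {
	// Ask the in-memory slim for a candidate position.
	v, found := kv.slim.Get(key)
	if !found {
		return 0, false
	}

	// Assumed: the value stored in the slim is the element's index in Elts.
	i := v.(int32)
	elt := kv.Elts[i]

	// Filter mode may yield a false positive, so re-check the real key.
	if elt.Key != key {
		return 0, false
	}
	return elt.Val, true
}
```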
You can find the benchmark code in benchmark; read more about Performance.
One of the typical usages of slim is to index serialized data on disk (e.g., key-value records in an SSTable). By keeping a slim in memory, one can quickly find the on-disk offset of a record by its key.
Show me the code ......
```go
package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type Data string

func (d Data) Read(offset int64, key string) (string, bool) {
	kv := strings.Split(string(d)[offset:], ",")[0:2]
	if kv[0] == key {
		return kv[1], true
	}
	return "", false
}

func Example() {
	// Accelerate external data accessing (in memory or on disk) by indexing
	// them with a SlimTrie:

	// `data` is a sample of some unindexed data. In our example it is a comma
	// separated key value series.
	//
	// In order to let SlimTrie be able to read data, `data` should have
	// a `Read` method:
	//     Read(offset int64, key string) (string, bool)
	data := Data("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores a key and its offset in data accordingly.
	keyOffsets := []index.OffsetIndexItem{
		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 8},
		{Key: "Al", Offset: 17},
		{Key: "Albert", Offset: 22},
		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 43},
	}

	// `SlimIndex` is simply a container of SlimTrie and its data.
	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		fmt.Println(err)
	}

	// Lookup
	v, found := st.Get("Alison")
	fmt.Printf("key: %q\n found: %t\n value: %q\n", "Alison", found, v)

	v, found = st.Get("foo")
	fmt.Printf("key: %q\n found: %t\n value: %q\n", "foo", found, v)

	// Output:
	// key: "Alison"
	// found: true
	// value: "8"
	// key: "foo"
	// found: false
	// value: ""
}
```
Create an index item for every 4 (or more, as you wish) keys.

Letting several adjacent keys share one index item reduces memory cost a lot when there is a huge number of keys in the external data, e.g., when indexing billions of 4KB objects on a 4TB disk (because one disk IO costs about 20 ms whether it reads 4KB or 1MB).
Show me the code ......
```go
package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type RangeData string

func (d RangeData) Read(offset int64, key string) (string, bool) {
	for i := 0; i < 4; i++ {
		if int(offset) >= len(d) {
			break
		}
		kv := strings.Split(string(d)[offset:], ",")[0:2]
		if kv[0] == key {
			return kv[1], true
		}
		offset += int64(len(kv[0]) + len(kv[1]) + 2)
	}
	return "", false
}

func Example_indexRanges() {
	// Index ranges instead of keys:
	// In this example at most 4 keys share one index item.
	data := RangeData("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores range start, range end and its offset.
	keyOffsets := []index.OffsetIndexItem{
		// Aaron     +--> 0
		// Agatha    |
		// Al        |
		// Albert    |

		// Alexander +--> 31
		// Alison    |

		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 0},
		{Key: "Al", Offset: 0},
		{Key: "Albert", Offset: 0},

		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 31},
	}

	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		panic(err)
	}

	v, found := st.RangeGet("Aaron")
	fmt.Printf("key: %q\n found: %t\n value: %q\n", "Aaron", found, v)

	v, found = st.RangeGet("Al")
	fmt.Printf("key: %q\n found: %t\n value: %q\n", "Al", found, v)

	v, found = st.RangeGet("foo")
	fmt.Printf("key: %q\n found: %t\n value: %q\n", "foo", found, v)

	// Output:
	// key: "Aaron"
	// found: true
	// value: "1"
	// key: "Al"
	// found: true
	// value: "2"
	// key: "foo"
	// found: false
	// value: ""
}
```
Slim can also be used as a traditional in-memory kv-store: build a slim with `Opt{Complete: Bool(true)}` and it won't strip out any information (e.g., it won't eliminate single-branch labels), so it functions the same as a btree. This snippet shows how to iterate over key values.
Show me the code ......
```go
package trie

import (
	"fmt"

	"github.com/openacid/slim/encode"
)

func ExampleSlimTrie_ScanFrom() {
	var keys = []string{
		"",
		"`",
		"a",
		"ab",
		"abc",
		"abca",
		"abcd",
		"abcd1",
		"abce",
		"be",
		"c",
		"cde0",
		"d",
	}
	values := makeI32s(len(keys))

	codec := encode.I32{}
	st, _ := NewSlimTrie(codec, keys, values, Opt{
		Complete: Bool(true),
	})

	// untilD stops when encountering "d".
	untilD := func(k, v []byte) bool {
		if string(k) == "d" {
			return false
		}
		_, i32 := codec.Decode(v)
		fmt.Println(string(k), i32)
		return true
	}

	fmt.Println("scan (ab, +∞):")
	st.ScanFrom("ab", false, true, untilD)

	fmt.Println()
	fmt.Println("scan [be, +∞):")
	st.ScanFrom("be", true, true, untilD)

	fmt.Println()
	fmt.Println("scan (ab, be):")
	st.ScanFromTo("ab", false, "be", false, true, untilD)

	// Output:
	//
	// scan (ab, +∞):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
	// be 9
	// c 10
	// cde0 11
	//
	// scan [be, +∞):
	// be 9
	// c 10
	// cde0 11
	//
	// scan (ab, be):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
}
```
Slim can be built into either a filter (like a bloom filter, but with key order preserved) or a real kv-store (like a btree). There is an option in `NewSlimTrie(..., option)` to control the building behavior. Ref: `Opt`.

To use slim as a kv-store, set the option to `Complete`; then there won't be false positives.

To use it as a filter, set `InnerPrefix` and `LeafPrefix` to false (`Complete` implies `InnerPrefix==true` and `LeafPrefix==true`). Then slim won't store any single-branch label in the trie it builds.

With `InnerPrefix==true`, it does not reduce a single-label branch that leads to an inner node.

With `LeafPrefix==true`, it does not reduce a single-label branch that leads to a leaf node.

E.g.:
```
// Complete
InnerPrefix: true
LeafPrefix: true

^ -a-> 1 -b-> $
 `-c-> 2 -x-> 3 -y-> $
               `-z-> $

InnerPrefix: true
LeafPrefix: false

^ -a-> $
 `-c-> 2 -x-> 3 -y-> $
               `-z-> $

InnerPrefix: false
LeafPrefix: true

^ -a-> 1 -b-> $
 `-c-> 3 -y-> $
        `-z-> $

InnerPrefix: false
LeafPrefix: false

^ -a-> $
 `-c-> 3 -y-> $
        `-z-> $
```
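As a quick reference, the two modes described above might be selected like this when building a trie. This is a sketch only; it assumes `Opt`, `Bool`, `Complete`, `InnerPrefix` and `LeafPrefix` are exported from the `trie` package exactly as named in the text and examples above, and that `Get` returns a value and a found flag.

```go
package main

import (
	"fmt"

	"github.com/openacid/slim/encode"
	"github.com/openacid/slim/trie"
)

func main() {
	keys := []string{"ab", "cxy", "cxz"}
	values := []int32{1, 2, 3}

	// kv-store mode: nothing is stripped, so there are no false positives.
	kvMode := trie.Opt{Complete: trie.Bool(true)}

	// filter mode: single-branch labels are not stored, which minimizes
	// memory but means lookups can return false positives; callers then
	// verify against the external data (as the slimKV struct above does).
	filterMode := trie.Opt{
		InnerPrefix: trie.Bool(false),
		LeafPrefix:  trie.Bool(false),
	}

	stKV, _ := trie.NewSlimTrie(encode.I32{}, keys, values, kvMode)
	stFilter, _ := trie.NewSlimTrie(encode.I32{}, keys, values, filterMode)

	fmt.Println(stKV.Get("cxy"))
	fmt.Println(stFilter.Get("cxy"))
}
```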
The memory consumption in filter mode and kv mode differs significantly. The following chart shows the memory consumption of 1 million var-length strings (10 to 20 bytes each) in different modes:
| | size | gzip-size |
|---|---|---|
| sample data size | 15.0M | 14.0M |
| Complete:true | 14.0M | 10.0M |
| InnerPrefix:true | 1.3M | 0.9M |
| all false | 1.3M | 0.8M |
Install
go get github.com/openacid/slim/trie
Change-log: see the Change-log.
A newer version `y` being compatible with an older version `x` means `y` can load data serialized by `x`, but `x` should never try to load data serialized by a newer version `y`.
- `v0.5.*` is compatible with `0.2.*`, `0.3.*`, `0.4.*`, `0.5.*`.
- `v0.4.*` is compatible with `0.2.*`, `0.3.*`, `0.4.*`.
- `v0.3.*` is compatible with `0.2.*`, `0.3.*`.
- `v0.2.*` is compatible with `0.2.*`.
Feedback and Contributions are greatly appreciated.
At this stage, the maintainers are most interested in feedback centered on:
- Do you have a real-life scenario that `slim` supports well, or doesn't support at all?
- Do any of the APIs fulfill your needs well?
Let us know by filing an issue, describing what you did or wanted to do, what you expected to happen, and what actually happened. Or file any other type of issue.
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details.