hrbrmstr/tdigestPublic

NotificationsYou must be signed in to change notification settings
Fork8
Star36

Wicked Fast, Accurate Quantiles Using 't-Digests'

License

Unknown, MIT licenses found

Licenses found

36 stars 8 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
R		R
inst		inst
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.build.yml		.build.yml
.codecov.yml		.codecov.yml
.gitignore		.gitignore
.travis.yml		.travis.yml
CONDUCT.md		CONDUCT.md
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.md		README.md
README.qmd		README.qmd
appveyor.yml		appveyor.yml
tdigest.Rproj		tdigest.Rproj

Repository files navigation

tdigest

Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-Digest construction algorithm uses a variant of 1-dimensionalk-means clustering to produce a very compact data structure that allowsaccurate estimation of quantiles. This t-Digest data structure can beused to estimate quantiles, compute other rank statistics or even toestimate related measures like trimmed means. The advantage of thet-Digest over previous digests for this purpose is that the t-Digesthandles data with full floating point resolution. The accuracy ofquantile estimates produced by t-Digests can be orders of magnitude moreaccurate than those produced by previous digest algorithms. Methods areprovided to create and update t-Digests and retrieve quantiles from theaccumulated distributions.

Seethe original paper by Ted Dunning & OtmarErtl for more details on t-Digests.

What’s Inside The Tin

The following functions are implemented:

as.list.tdigest: Serialize a tdigest object to an R list orunserialize a serialized tdigest list back into a tdigest object
td_add: Add a value to the t-Digest with the specified count
td_create: Allocate a new histogram
td_merge: Merge one t-Digest into another
td_quantile_of: Return the quantile of the value
td_total_count: Total items contained in the t-Digest
td_value_at: Return the value at the specified quantile
tquantile: Calculate sample quantiles from a t-Digest

Installation

install.packages("tdigest")# NOTE: CRAN version is 0.4.1# orremotes::install_gitlab("hrbrmstr/tdigest")

NOTE: To use the ‘remotes’ install options you will need to have the{remotes} package installed.

Usage

library(tdigest)# current versionpackageVersion("tdigest")## [1] '0.4.2'

Basic (Low-level interface)

td<- td_create(10)td## <tdigest; size=0; compression=10; cap=70>td_total_count(td)## [1] 0td_add(td,0,1) %>%   td_add(10,1)## <tdigest; size=2; compression=10; cap=70>td_total_count(td)## [1] 2td_value_at(td,0.1)==0## [1] TRUEtd_value_at(td,0.5)==5## [1] TRUEquantile(td)## [1]  0  0  5 10 10

Bigger (and Vectorised)

td<- tdigest(c(0,10),10)is_tdigest(td)## [1] TRUEtd_value_at(td,0.1)==0## [1] TRUEtd_value_at(td,0.5)==5## [1] TRUEset.seed(1492)x<- sample(0:100,1000000,replace=TRUE)td<- tdigest(x,1000)td_total_count(td)## [1] 1e+06tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574## [10]  80.3090454  90.2594642  99.4269454 100.0000000quantile(td)## [1]   0.00000  24.74751  49.99666  75.24783 100.00000

Serialization

These [de]serialization functions make it possible to create &populate a tdigest, serialize it out, read it in at a later time andcontinue populating it enabling compact distribution accumulation &storage for large, “continuous” datasets.

set.seed(1492)x<- sample(0:100,1000000,replace=TRUE)td<- tdigest(x,1000)tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574## [10]  80.3090454  90.2594642  99.4269454 100.0000000str(in_r<- as.list(td),1)## List of 7##  $ compression   : num 1000##  $ cap           : int 6010##  $ merged_nodes  : int 226##  $ unmerged_nodes: int 0##  $ merged_count  : num 1e+06##  $ unmerged_count: num 0##  $ nodes         :List of 2##  - attr(*, "class")= chr [1:2] "tdigest_list" "list"td2<- as_tdigest(in_r)tquantile(td2, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574## [10]  80.3090454  90.2594642  99.4269454 100.0000000identical(in_r, as.list(td2))## [1] TRUE

ALTREP-aware

N<-1000000x.altrep<- seq_len(N)# this is an ALTREP in R version >= 3.5.0td<- tdigest(x.altrep)td[0.1]## [1] 93051td[0.5]## [1] 491472.5length(td)## [1] 1000000

Proof it’s faster

microbenchmark::microbenchmark(tdigest= tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1)),r_quantile= quantile(x, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1)))## Unit: microseconds##        expr       min        lq        mean     median        uq     max neval##     tdigest     3.198     3.731     7.79369     4.4895    12.792    16.4   100##  r_quantile 39197.353 39445.444 40069.38938 39584.8030 40062.945 43613.3   100

tdigest Metrics

Lang	# Files	(%)	LoC	(%)	Blank lines	(%)	# Lines	(%)
C	3	0.15	499	0.36	71	0.29	45	0.10
R	6	0.30	161	0.12	35	0.14	156	0.34
C/C++ Header	1	0.05	24	0.02	16	0.07	30	0.06
SUM	10	0.50	684	0.50	122	0.50	231	0.50

{cloc} 📦 metrics for tdigest

Code of Conduct

Please note that this project is released with a Contributor Code ofConduct. By participating in this project you agree to abide by itsterms.

About

Wicked Fast, Accurate Quantiles Using 't-Digests'

Topics

r quantile rstats t-digest

Resources

Readme

License

Unknown, MIT licenses found

Releases1

Stable Genius Latest

Aug 1, 2019

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Licenses found

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

tdigest

Description

What’s Inside The Tin

Installation

Usage

Basic (Low-level interface)

Bigger (and Vectorised)

Serialization

ALTREP-aware

Proof it’s faster

tdigest Metrics

Code of Conduct

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases1

Packages

Languages

Movatterモバイル変換

License

Licenses found

hrbrmstr/tdigest

Folders and files

Latest commit

History

Repository files navigation

tdigest

Description

What’s Inside The Tin

Installation

Usage

Basic (Low-level interface)

Bigger (and Vectorised)

Serialization

ALTREP-aware

Proof it’s faster

tdigest Metrics

Code of Conduct

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases1

Packages0

Languages

Packages