- Notifications
You must be signed in to change notification settings - Fork8
Wicked Fast, Accurate Quantiles Using 't-Digests'
License
Unknown, MIT licenses found
Licenses found
hrbrmstr/tdigest
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Wicked Fast, Accurate Quantiles Using ‘t-Digests’
The t-Digest construction algorithm uses a variant of 1-dimensionalk-means clustering to produce a very compact data structure that allowsaccurate estimation of quantiles. This t-Digest data structure can beused to estimate quantiles, compute other rank statistics or even toestimate related measures like trimmed means. The advantage of thet-Digest over previous digests for this purpose is that the t-Digesthandles data with full floating point resolution. The accuracy ofquantile estimates produced by t-Digests can be orders of magnitude moreaccurate than those produced by previous digest algorithms. Methods areprovided to create and update t-Digests and retrieve quantiles from theaccumulated distributions.
Seethe original paper by Ted Dunning & OtmarErtl for more details on t-Digests.
The following functions are implemented:
as.list.tdigest
: Serialize a tdigest object to an R list orunserialize a serialized tdigest list back into a tdigest objecttd_add
: Add a value to the t-Digest with the specified counttd_create
: Allocate a new histogramtd_merge
: Merge one t-Digest into anothertd_quantile_of
: Return the quantile of the valuetd_total_count
: Total items contained in the t-Digesttd_value_at
: Return the value at the specified quantiletquantile
: Calculate sample quantiles from a t-Digest
install.packages("tdigest")# NOTE: CRAN version is 0.4.1# orremotes::install_gitlab("hrbrmstr/tdigest")
NOTE: To use the ‘remotes’ install options you will need to have the{remotes} package installed.
library(tdigest)# current versionpackageVersion("tdigest")## [1] '0.4.2'
td<- td_create(10)td## <tdigest; size=0; compression=10; cap=70>td_total_count(td)## [1] 0td_add(td,0,1) %>% td_add(10,1)## <tdigest; size=2; compression=10; cap=70>td_total_count(td)## [1] 2td_value_at(td,0.1)==0## [1] TRUEtd_value_at(td,0.5)==5## [1] TRUEquantile(td)## [1] 0 0 5 10 10
td<- tdigest(c(0,10),10)is_tdigest(td)## [1] TRUEtd_value_at(td,0.1)==0## [1] TRUEtd_value_at(td,0.5)==5## [1] TRUEset.seed(1492)x<- sample(0:100,1000000,replace=TRUE)td<- tdigest(x,1000)td_total_count(td)## [1] 1e+06tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))## [1] 0.0000000 0.8099857 9.6725790 19.7533723 29.7448283 39.7544675 49.9966628 60.0235148 70.2067574## [10] 80.3090454 90.2594642 99.4269454 100.0000000quantile(td)## [1] 0.00000 24.74751 49.99666 75.24783 100.00000
These [de]serialization functions make it possible to create &populate a tdigest, serialize it out, read it in at a later time andcontinue populating it enabling compact distribution accumulation &storage for large, “continuous” datasets.
set.seed(1492)x<- sample(0:100,1000000,replace=TRUE)td<- tdigest(x,1000)tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))## [1] 0.0000000 0.8099857 9.6725790 19.7533723 29.7448283 39.7544675 49.9966628 60.0235148 70.2067574## [10] 80.3090454 90.2594642 99.4269454 100.0000000str(in_r<- as.list(td),1)## List of 7## $ compression : num 1000## $ cap : int 6010## $ merged_nodes : int 226## $ unmerged_nodes: int 0## $ merged_count : num 1e+06## $ unmerged_count: num 0## $ nodes :List of 2## - attr(*, "class")= chr [1:2] "tdigest_list" "list"td2<- as_tdigest(in_r)tquantile(td2, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1))## [1] 0.0000000 0.8099857 9.6725790 19.7533723 29.7448283 39.7544675 49.9966628 60.0235148 70.2067574## [10] 80.3090454 90.2594642 99.4269454 100.0000000identical(in_r, as.list(td2))## [1] TRUE
N<-1000000x.altrep<- seq_len(N)# this is an ALTREP in R version >= 3.5.0td<- tdigest(x.altrep)td[0.1]## [1] 93051td[0.5]## [1] 491472.5length(td)## [1] 1000000
microbenchmark::microbenchmark(tdigest= tquantile(td, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1)),r_quantile= quantile(x, c(0,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99,1)))## Unit: microseconds## expr min lq mean median uq max neval## tdigest 3.198 3.731 7.79369 4.4895 12.792 16.4 100## r_quantile 39197.353 39445.444 40069.38938 39584.8030 40062.945 43613.3 100
Lang | # Files | (%) | LoC | (%) | Blank lines | (%) | # Lines | (%) |
---|---|---|---|---|---|---|---|---|
C | 3 | 0.15 | 499 | 0.36 | 71 | 0.29 | 45 | 0.10 |
R | 6 | 0.30 | 161 | 0.12 | 35 | 0.14 | 156 | 0.34 |
C/C++ Header | 1 | 0.05 | 24 | 0.02 | 16 | 0.07 | 30 | 0.06 |
SUM | 10 | 0.50 | 684 | 0.50 | 122 | 0.50 | 231 | 0.50 |
{cloc} 📦 metrics for tdigest
Please note that this project is released with a Contributor Code ofConduct. By participating in this project you agree to abide by itsterms.
About
Wicked Fast, Accurate Quantiles Using 't-Digests'