
hrbrmstr/tdigest

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

tdigest

Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-Digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-Digest data structure can be used to estimate quantiles, compute other rank statistics, or even estimate related measures like trimmed means. The advantage of the t-Digest over previous digests for this purpose is that the t-Digest handles data with full floating point resolution. Quantile estimates produced by t-Digests can be orders of magnitude more accurate than those produced by previous digest algorithms. Methods are provided to create and update t-Digests and retrieve quantiles from the accumulated distributions.

See the original paper by Ted Dunning & Otmar Ertl for more details on t-Digests.
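The description mentions trimmed means as a related measure. The sketch below is one possible way to approximate a trimmed mean from a digest, not a method taken from the package or the paper: it averages estimated quantiles over an evenly spaced grid between the trim points. The exponential sample, the grid size, and the variable names are assumptions chosen for illustration.

```r
library(tdigest)

set.seed(42)
x <- rexp(100000)          # skewed sample, used only for illustration

td <- tdigest(x, 100)      # build a digest with compression = 100

# Approximate a 10% trimmed mean by averaging estimated quantiles on an
# evenly spaced grid between the 10th and 90th percentiles.
# (Illustrative approach; the grid size 801 is arbitrary.)
probs <- seq(0.10, 0.90, length.out = 801)
approx_trimmed <- mean(tquantile(td, probs))

# Exact trimmed mean from the raw data, for comparison
exact_trimmed <- mean(x, trim = 0.10)

c(approx = approx_trimmed, exact = exact_trimmed)
```

The approximation only needs the digest, so it still works after the raw data has been discarded.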

What’s Inside The Tin

The following functions are implemented:

  • as.list.tdigest: Serialize a tdigest object to an R list or unserialize a serialized tdigest list back into a tdigest object
  • td_add: Add a value to the t-Digest with the specified count
  • td_create: Allocate a new histogram
  • td_merge: Merge one t-Digest into another
  • td_quantile_of: Return the quantile of the value
  • td_total_count: Total items contained in the t-Digest
  • td_value_at: Return the value at the specified quantile
  • tquantile: Calculate sample quantiles from a t-Digest
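Two of the listed functions, td_merge and td_quantile_of, are not shown in the usage examples. The following is a minimal sketch of how they might be combined, assuming that td_merge takes the source digest first and the destination second and returns the destination. The two uniform samples stand in for data collected separately.

```r
library(tdigest)

set.seed(1)
part_a <- runif(5000)      # e.g. values seen by one worker (illustrative)
part_b <- runif(5000)      # e.g. values seen by another worker

td_a <- tdigest(part_a, 100)
td_b <- tdigest(part_b, 100)

# Assumption: the first digest is merged into the second and the
# combined digest is returned.
merged <- td_merge(td_a, td_b)

# Number of items represented by the combined digest
td_total_count(merged)

# Estimated median of the combined data
td_value_at(merged, 0.5)

# Inverse query: estimated quantile (rank fraction) of the value 0.25
td_quantile_of(merged, 0.25)
```

Merging digests this way avoids concatenating the raw vectors, which is the point when the parts live in different processes or files.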

Installation

    install.packages("tdigest") # NOTE: CRAN version is 0.4.1
    # or
    remotes::install_gitlab("hrbrmstr/tdigest")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

    library(tdigest)

    # current version
    packageVersion("tdigest")
    ## [1] '0.4.2'

Basic (Low-level interface)

    td <- td_create(10)

    td
    ## <tdigest; size=0; compression=10; cap=70>

    td_total_count(td)
    ## [1] 0

    td_add(td, 0, 1) %>%
      td_add(10, 1)
    ## <tdigest; size=2; compression=10; cap=70>

    td_total_count(td)
    ## [1] 2

    td_value_at(td, 0.1) == 0
    ## [1] TRUE

    td_value_at(td, 0.5) == 5
    ## [1] TRUE

    quantile(td)
    ## [1]  0  0  5 10 10

Bigger (and Vectorised)

    td <- tdigest(c(0, 10), 10)

    is_tdigest(td)
    ## [1] TRUE

    td_value_at(td, 0.1) == 0
    ## [1] TRUE

    td_value_at(td, 0.5) == 5
    ## [1] TRUE

    set.seed(1492)
    x <- sample(0:100, 1000000, replace = TRUE)
    td <- tdigest(x, 1000)

    td_total_count(td)
    ## [1] 1e+06

    tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
    ##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
    ## [10]  80.3090454  90.2594642  99.4269454 100.0000000

    quantile(td)
    ## [1]   0.00000  24.74751  49.99666  75.24783 100.00000

Serialization

These [de]serialization functions make it possible to create & populate a tdigest, serialize it out, read it in at a later time, and continue populating it, enabling compact distribution accumulation & storage for large, “continuous” datasets.

    set.seed(1492)
    x <- sample(0:100, 1000000, replace = TRUE)
    td <- tdigest(x, 1000)

    tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
    ##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
    ## [10]  80.3090454  90.2594642  99.4269454 100.0000000

    str(in_r <- as.list(td), 1)
    ## List of 7
    ##  $ compression   : num 1000
    ##  $ cap           : int 6010
    ##  $ merged_nodes  : int 226
    ##  $ unmerged_nodes: int 0
    ##  $ merged_count  : num 1e+06
    ##  $ unmerged_count: num 0
    ##  $ nodes         :List of 2
    ##  - attr(*, "class")= chr [1:2] "tdigest_list" "list"

    td2 <- as_tdigest(in_r)

    tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
    ##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
    ## [10]  80.3090454  90.2594642  99.4269454 100.0000000

    identical(in_r, as.list(td2))
    ## [1] TRUE
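The round trip above stays in memory. To illustrate the "read it in at a later time and continue populating it" part, here is a minimal sketch that writes the list form to disk with base R's saveRDS and restores it with readRDS. The temporary file path, the sample size, and the three appended values are assumptions made for the example.

```r
library(tdigest)

set.seed(1492)
td <- tdigest(sample(0:100, 10000, replace = TRUE), 100)

# Serialize the digest to a plain R list and write it to disk
path <- tempfile(fileext = ".rds")   # hypothetical storage location
saveRDS(as.list(td), path)

# Later (possibly in another session): read the list back and rebuild
td_restored <- as_tdigest(readRDS(path))

# Continue accumulating new observations, one item each
for (v in c(12, 55, 99)) td_add(td_restored, v, 1)

# Items represented after the additions
td_total_count(td_restored)
```

Storing the list form rather than the raw observations keeps the file small regardless of how many values have been accumulated.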

ALTREP-aware

    N <- 1000000
    x.altrep <- seq_len(N) # this is an ALTREP in R version >= 3.5.0
    td <- tdigest(x.altrep)

    td[0.1]
    ## [1] 93051

    td[0.5]
    ## [1] 491472.5

    length(td)
    ## [1] 1000000

Proof it’s faster

    microbenchmark::microbenchmark(
      tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
      r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
    )
    ## Unit: microseconds
    ##        expr       min        lq        mean     median        uq     max neval
    ##     tdigest     3.198     3.731     7.79369     4.4895    12.792    16.4   100
    ##  r_quantile 39197.353 39445.444 40069.38938 39584.8030 40062.945 43613.3   100
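Speed is only useful if the estimates stay close to the exact values. As a rough companion to the benchmark, the sketch below puts the digest estimates next to base R's quantile() for the same probabilities. The smaller sample size and the chosen probabilities are assumptions made to keep the example quick.

```r
library(tdigest)

set.seed(1492)
x <- sample(0:100, 100000, replace = TRUE)
td <- tdigest(x, 1000)

probs <- c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)

# Side-by-side estimates and their absolute differences
cmp <- data.frame(
  prob    = probs,
  tdigest = tquantile(td, probs),
  exact   = unname(quantile(x, probs))
)
cmp$abs_diff <- abs(cmp$tdigest - cmp$exact)

cmp
```

Because the sample consists of integers, the digest interpolates between neighbouring values, so small fractional differences from the exact quantiles are expected.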

tdigest Metrics

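| Lang         | # Files | (%)  | LoC | (%)  | Blank lines | (%)  | # Lines | (%)  |
|--------------|--------:|-----:|----:|-----:|------------:|-----:|--------:|-----:|
| C            |       3 | 0.15 | 499 | 0.36 |          71 | 0.29 |      45 | 0.10 |
| R            |       6 | 0.30 | 161 | 0.12 |          35 | 0.14 |     156 | 0.34 |
| C/C++ Header |       1 | 0.05 |  24 | 0.02 |          16 | 0.07 |      30 | 0.06 |
| SUM          |      10 | 0.50 | 684 | 0.50 |         122 | 0.50 |     231 | 0.50 |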

{cloc} 📦 metrics for tdigest

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

License

Two license files are present: LICENSE (type not recognized) and LICENSE.md (MIT).

Stars

Watchers

Forks

Packages

No packages published

Languages


[8]ページ先頭

©2009-2025 Movatter.jp