t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
CamDavidsonPilon/tdigest
This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates, such as percentiles, quantiles, and trimmed means, from either streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).
See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest
tdigest is compatible with both Python 2 and Python 3.
pip install tdigest
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

sum_digest = digest + another_digest
sum_digest.percentile(30)  # about 0.3
You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
Or you can get only a list of centroids with centroids_to_list().
digest.centroids_to_list()
Similarly, you can restore a Python dict of digest values with update_from_dict(). Centroids are merged with any existing ones in the digest. For example, make a fresh digest and restore values from a Python dictionary.
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
K and delta values are optional, or you can provide only a list of centroids with update_centroids_from_list().
digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
If you want to serialize with other tools like JSON, you can first convert the digest with to_dict().
import json
json.dumps(digest.to_dict())
Alternatively, make a custom encoder function to provide as default to the standard json module.
def encoder(digest_obj):
    return digest_obj.to_dict()
Then pass the encoder function as the default parameter.
json.dumps(digest, default=encoder)
TDigest.
- update(x, w=1): update the tdigest with value x and weight w.
- batch_update(x, w=1): update the tdigest with the values in array x, each with weight w.
- compress(): perform a compression on the underlying data structure that will shrink its memory footprint, without hurting accuracy. Good to perform after adding many values.
- percentile(p): return the p-th percentile. Example: p=50 is the median.
- cdf(x): return the value of the CDF at x.
- trimmed_mean(p1, p2): return the mean of the data set without the values below the p1 percentile and above the p2 percentile.
- to_dict(): return a Python dictionary of the TDigest and internal Centroid values.
- update_from_dict(dict_values): update the TDigest object from serialized dictionary values.
- centroids_to_list(): return a Python list of the TDigest object's internal Centroid values.
- update_centroids_from_list(list_values): update Centroids from a Python list.