CamDavidsonPilon/tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark.

License

MIT license

tdigest

Efficient percentile estimation of streaming or distributed data

This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates from either streaming or distributed data. These estimates include percentiles, quantiles, and trimmed means. Two t-digests can be added together, which makes the data structure ideal for map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest

Installation

tdigest is compatible with both Python 2 and Python 3.

    pip install tdigest

Usage

Update the digest sequentially

    from tdigest import TDigest
    from numpy.random import random

    digest = TDigest()
    for x in range(5000):
        digest.update(random())

    print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
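For comparison, the exact value that the digest approximates can be computed by keeping and sorting the whole sample, which is the cost the t-digest is meant to avoid. The sketch below uses only the standard library; `exact_percentile` is a hypothetical helper written for this illustration (nearest-rank definition) and is not part of tdigest.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible
samples = sorted(random.random() for _ in range(5000))


def exact_percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list, with p in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    # Nearest-rank: the smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


exact_p15 = exact_percentile(samples, 15)
print(exact_p15)  # close to 0.15 for Uniform(0, 1) samples
```

This approach needs the full list in memory, whereas the digest only keeps a compact summary.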

Update the digest in batches

    another_digest = TDigest()
    another_digest.batch_update(random(5000))
    print(another_digest.percentile(15))

Sum two digests to create a new digest

    sum_digest = digest + another_digest
    sum_digest.percentile(30)  # about 0.3

Converting to a dict or serializing a digest with JSON

You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.

    digest = TDigest()
    digest.update(1)
    digest.update(2)
    digest.update(3)
    print(digest.to_dict())

Or you can get only a list of Centroids with centroids_to_list().

    digest.centroids_to_list()

Similarly, you can restore a Python dict of digest values with update_from_dict(). Centroids are merged with any existing ones in the digest. For example, make a fresh digest and restore values from a Python dictionary.

    digest = TDigest()
    digest.update_from_dict({
        'K': 25,
        'delta': 0.01,
        'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}],
    })

K and delta values are optional. Alternatively, you can provide only a list of centroids with update_centroids_from_list().

    digest = TDigest()
    digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
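To give a feel for what merging centroids involves, here is a minimal standard-library sketch of the underlying arithmetic. It reads 'm' as a centroid's mean and 'c' as its count (weight), which matches the example values above. `merge_centroids` is a hypothetical helper for illustration only; the library's own logic also decides which centroids are allowed to merge, and that part is not shown here.

```python
def merge_centroids(a, b):
    """Combine two centroids given as {'m': mean, 'c': count} dicts.

    The merged centroid carries the total count and the count-weighted mean.
    """
    total = a['c'] + b['c']
    if total <= 0:
        raise ValueError("combined count must be positive")
    mean = (a['m'] * a['c'] + b['m'] * b['c']) / total
    return {'m': mean, 'c': total}


merged = merge_centroids({'c': 1.0, 'm': 1.0}, {'c': 3.0, 'm': 2.0})
print(merged)  # {'m': 1.75, 'c': 4.0}
```

Because only a mean and a count are stored per centroid, combining summaries never requires the original data points.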

If you want to serialize with other tools like JSON, you can first convert the digest with to_dict().

    import json

    json.dumps(digest.to_dict())

Alternatively, make a custom encoder function to provide as default to the standard json module.

    def encoder(digest_obj):
        return digest_obj.to_dict()

Then pass the encoder function as the default parameter.

    json.dumps(digest, default=encoder)
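The round trip through JSON can be checked without the library by using a dictionary in the same shape as the update_from_dict() example above. This is a sketch using only the standard json module; the dictionary literal is copied from that example rather than produced by a real digest.

```python
import json

# Same shape as the dictionary passed to update_from_dict() earlier.
digest_dict = {
    'K': 25,
    'delta': 0.01,
    'centroids': [
        {'c': 1.0, 'm': 1.0},
        {'c': 1.0, 'm': 2.0},
        {'c': 1.0, 'm': 3.0},
    ],
}

payload = json.dumps(digest_dict)  # text that can be stored or sent elsewhere
restored = json.loads(payload)     # back to a plain Python dictionary

print(restored == digest_dict)  # True: nothing is lost in the round trip
```

The restored dictionary is what you would hand to update_from_dict() on a fresh digest.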

API

Methods of TDigest:

update(x, w=1): update the tdigest with value x and weight w.
batch_update(x, w=1): update the tdigest with the values in array x and weight w.
compress(): perform a compression on the underlying data structure that shrinks its memory footprint without hurting accuracy. Good to perform after adding many values.
percentile(p): return the pth percentile. Example: p=50 is the median.
cdf(x): return the CDF evaluated at the value x.
trimmed_mean(p1, p2): return the mean of the data set without the values below the p1 percentile and above the p2 percentile.
to_dict(): return a Python dictionary of the TDigest and internal Centroid values.
update_from_dict(dict_values): update the TDigest object from serialized dictionary values.
centroids_to_list(): return a Python list of the TDigest object's internal Centroid values.
update_centroids_from_list(list_values): update Centroids from a Python list.
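The quantities that cdf(x) and trimmed_mean(p1, p2) estimate can be illustrated with exact versions computed on a small list. The helpers `exact_cdf` and `exact_trimmed_mean` below are hypothetical, standard-library sketches and not the library's implementation; in particular, their handling of values exactly at a boundary is an assumption and may differ from the digest's interpolated estimates.

```python
def exact_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    if not values:
        raise ValueError("need at least one value")
    return sum(1 for v in values if v <= x) / len(values)


def exact_trimmed_mean(values, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles (by rank)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = int(n * p1 / 100)  # number of values dropped from the low end
    hi = int(n * p2 / 100)  # index just past the last value kept
    kept = ordered[lo:hi]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)


data = list(range(1, 101))  # 1, 2, ..., 100

print(exact_cdf(data, 50))               # 0.5
print(exact_trimmed_mean(data, 10, 90))  # 50.5, the mean of 11..90
```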

Releases: 4. Latest: v0.6.0.1 (May 4, 2023).
