
Python Enhancement Proposals

PEP 450 – Adding A Statistics Module To The Standard Library


Author:
Steven D’Aprano <steve at pearwood.info>
Status:
Final
Type:
Standards Track
Created:
01-Aug-2013
Python-Version:
3.4
Post-History:
13-Sep-2013


Abstract

This PEP proposes the addition of a module for common statistics functions such as mean, median, variance and standard deviation to the Python standard library. See also http://bugs.python.org/issue18606

Rationale

The proposed statistics module is motivated by the “batteries included” philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code.[1] Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. Even cheap scientific calculators typically include multiple statistical functions such as:

  • mean
  • population and sample variance
  • population and sample standard deviation
  • linear regression
  • correlation coefficient

Graphing calculators aimed at Secondary School students typically include all of the above, plus some or all of:

  • median
  • mode
  • functions for calculating the probability of random variables from the normal, t, chi-squared, and F distributions
  • inference on the mean

and others[2]. Likewise, spreadsheet applications such as Microsoft Excel, LibreOffice and Gnumeric include rich collections of statistical functions[3].

In contrast, Python currently has no standard way to calculate even the simplest and most obvious statistical functions such as mean. For those who need statistical functions in Python, there are two obvious solutions:

  • install numpy and/or scipy[4];
  • or use a Do It Yourself solution.

Numpy is perhaps the most full-featured solution, but it has a few disadvantages:

  • It may be overkill for many purposes. The documentation for numpy even warns:

    “It can be hard to know what functions are available in numpy. This is not a complete list, but it does cover most of them.”[5]

    and then goes on to list over 270 functions, only a small number of which are related to statistics.

  • Numpy is aimed at those doing heavy numerical work, and may be intimidating to those who don’t have a background in computational mathematics and computer science. For example, numpy.mean takes four arguments:

    mean(a, axis=None, dtype=None, out=None)

    although fortunately for the beginner or casual numpy user, three are optional and numpy.mean does the right thing in simple cases:

    >>> numpy.mean([1, 2, 3, 4])
    2.5
  • For many people, installing numpy may be difficult or impossible. For example, people in corporate environments may have to go through a difficult, time-consuming process before being permitted to install third-party software. For the casual Python user, having to learn about installing third-party packages in order to average a list of numbers is unfortunate.

This leads to option number 2, DIY statistics functions. At first glance, this appears to be an attractive option, due to the apparent simplicity of common statistical functions. For example:

import math

def mean(data):
    return sum(data) / len(data)

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2) / n
    return ss / (n - 1)

def standard_deviation(data):
    return math.sqrt(variance(data))

The above appears to be correct with a casual test:

>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5

But adding a constant to every data point should not change the variance:

>>> data = [x + 1e12 for x in data]
>>> variance(data)
0.0

And variance should never be negative:

>>> variance(data * 100)
-1239429440.1282566

By contrast, the proposed reference implementation gets the exactly correct answer 7.5 for the first two examples, and a reasonably close answer for the third: 6.012. numpy does no better[6].

Even simple statistical calculations contain traps for the unwary, starting with the Computational Formula itself. Despite the name, it is numerically unstable and can be extremely inaccurate, as can be seen above. It is completely unsuitable for computation by computer[7]. This problem plagues users of many programming languages, not just Python[8], as coders reinvent the same numerically inaccurate code over and over again[9], or advise others to do so[10].
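The stable alternative is the textbook two-pass approach: compute the mean first, then sum squared deviations from it, so that shifting the data by a constant cannot cause catastrophic cancellation. A minimal sketch (the name variance_two_pass is mine, not part of the proposed API):

```python
def variance_two_pass(data):
    """Sample variance via the two-pass algorithm.

    Pass 1 computes the mean; pass 2 sums squared deviations from it.
    Unlike the Computational Formula, adding a constant to every data
    point leaves the result essentially unchanged.
    """
    n = len(data)
    xbar = sum(data) / n                       # pass 1: the mean
    ss = sum((x - xbar) ** 2 for x in data)    # pass 2: squared deviations
    return ss / (n - 1)

data = [1, 2, 4, 5, 8]
print(variance_two_pass(data))                      # 7.5
print(variance_two_pass([x + 1e12 for x in data]))  # still 7.5
```

This is why the PEP's design (see “Design Decisions Of The Module”) favours two passes over the data rather than a one-pass formula.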

It isn’t just the variance and standard deviation. Even the mean is not quite as straightforward as it might appear. The above implementation seems too simple to have problems, but it does:

  • The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this “torture test”:

    assert mean([1e30, 1, 3, -1e30]) == 1

    returning 0 instead of 1, a purely computational error of 100%.

  • Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.

While the above mean implementation does not fail quite as catastrophically as the naive variance does, a standard library function can do much better than the DIY versions.
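Both pitfalls are easy to reproduce. In this sketch the names mean_naive and mean_fsum are mine, for illustration only:

```python
import math
from fractions import Fraction

def mean_naive(data):
    # Built-in sum: small terms are swallowed by the huge ones.
    return sum(data) / len(data)

def mean_fsum(data):
    # math.fsum tracks exact partial sums, so the torture test passes...
    return math.fsum(data) / len(data)

torture = [1e30, 1, 3, -1e30]
print(mean_naive(torture))   # 0.0 -- the 1 and 3 vanish next to 1e30
print(mean_fsum(torture))    # 1.0

# ...but fsum coerces everything to float, so exact Fraction
# arithmetic is silently lost:
print(mean_fsum([Fraction(1, 3)] * 3))   # a float, not Fraction(1, 3)
```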

The example above involves an especially bad set of data, but even for more realistic data sets accuracy is important. The first step in interpreting variation in data (including dealing with ill-conditioned data) is often to standardize it to a series with variance 1 (and often mean 0). This standardization requires accurate computation of the mean and variance of the raw series. Naive computation of mean and variance can lose precision very quickly. Because precision bounds accuracy, it is important to use the most precise algorithms for computing mean and variance that are practical, or the results of standardization are themselves useless.

Comparison To Other Languages/Packages

The proposed statistics library is not intended to be a competitor to such third-party libraries as numpy/scipy, or to proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

Most programming languages have little or no built-in support for statistics functions. Some exceptions:

R

R (and its proprietary cousin, S) is a programming language designed for statistics work. It is extremely popular with statisticians and is extremely feature-rich[11].

C#

The C# LINQ package includes extension methods to calculate the average of enumerables[12].

Ruby

Ruby does not ship with a standard statistics module, despite some apparent demand[13]. Statsample appears to be a feature-rich third-party library, aiming to compete with R[14].

PHP

PHP has an extremely feature-rich (although mostly undocumented) set of advanced statistical functions[15].

Delphi

Delphi includes standard statistical functions including Mean, Sum, Variance, TotalVariance and MomentSkewKurtosis in its Math library[16].

GNU Scientific Library

The GNU Scientific Library includes standard statistical functions, percentiles, median and others[17]. One innovation I have borrowed from the GSL is to allow the caller to optionally specify the pre-calculated mean of the sample (or an a priori known population mean) when calculating the variance and standard deviation[18].

Design Decisions Of The Module

My intention is to start small and grow the library as needed, rather than try to include everything from the start. Consequently, the current reference implementation includes only a small number of functions: mean, variance, standard deviation, median, mode. (See the reference implementation for a full list.)

I have aimed for the following design features:

  • Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
  • Concentrate on data in sequences, allowing two passes over the data, rather than potentially compromise on accuracy for the sake of a one-pass algorithm. Functions expect data will be passed as a list or other sequence; if given an iterator, they may internally convert to a list.
  • Functions should, as much as possible, honour any type of numeric data. E.g. the mean of a list of Decimals should be a Decimal, not a float. When this is not possible, treat float as the “lowest common data type”.
  • Although functions support data sets of floats, Decimals or Fractions, there is no guarantee that mixed data sets will be supported. (But on the other hand, they aren’t explicitly rejected either.)
  • Plenty of documentation, aimed at readers who understand the basic concepts but may not know (for example) which variance they should use (population or sample?). Mathematicians and statisticians have a terrible habit of being inconsistent with both notation and terminology[19], and having spent many hours making sense of the contradictory/confusing definitions in use, it is only fair that I do my best to clarify rather than obfuscate the topic.
  • But avoid going into tedious[20] mathematical detail.

API

The initial version of the library will provide univariate (single variable) statistics functions. The general API will be based on a functional model function(data, ...) -> result, where data is a mandatory iterable of (usually) numeric data.

The author expects that lists will be the most common data type used, but any iterable type should be acceptable. Where necessary, functions may convert to lists internally. Where possible, functions are expected to conserve the type of the data values; for example, the mean of a list of Decimals should be a Decimal rather than a float.
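Since the module did ship as statistics in Python 3.4, this type-conservation behaviour can be checked directly:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# The type of the data values is conserved where possible:
print(mean([Decimal("1"), Decimal("2"), Decimal("3")]))   # Decimal('2')
print(mean([Fraction(1, 2), Fraction(1, 4)]))             # Fraction(3, 8)

# With plain ints, a non-integral result falls back to float:
print(mean([1, 2, 3, 4]))                                 # 2.5
```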

Calculating mean, median and mode

The mean, median* and mode functions take a single mandatory argument and return the appropriate statistic, e.g.:

>>> mean([1, 2, 3])
2.0

Functions provided are:

  • mean(data)
    arithmetic mean of data.
  • median(data)
    median (middle value) of data, taking the average of the two middle values when there are an even number of values.
  • median_high(data)
    high median of data, taking the larger of the two middle values when the number of items is even.
  • median_low(data)
    low median of data, taking the smaller of the two middle values when the number of items is even.
  • median_grouped(data, interval=1)
    50th percentile of grouped data, using interpolation.
  • mode(data)
    most common data point.

mode is the sole exception to the rule that the data argument must be numeric. It will also accept an iterable of nominal data, such as strings.
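A quick illustration of the median family on an even-length data set, and of mode with nominal data, using the module as shipped:

```python
from statistics import median, median_high, median_low, mode

data = [1, 3, 5, 7]          # even number of values: two middle values
print(median(data))          # 4.0 -- average of the two middle values
print(median_low(data))      # 3  -- the smaller middle value
print(median_high(data))     # 5  -- the larger middle value

# mode also accepts nominal (non-numeric) data such as strings:
print(mode(["red", "blue", "red", "green"]))   # 'red'
```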

Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics module will include separate functions for population and sample variance and standard deviation. All four functions have similar signatures, with a single mandatory argument, an iterable of numeric data, e.g.:

>>> variance([1, 2, 2, 2, 3])
0.5

All four functions also accept a second, optional, argument, the mean of the data. This is modelled on a similar API provided by the GNU Scientific Library[18]. There are three use-cases for using this argument, in no particular order:

  1. The value of the mean is known a priori.
  2. You have already calculated the mean, and wish to avoid calculating it again.
  3. You wish to (ab)use the variance functions to calculate the second moment about some given point other than the mean.

In each case, it is the caller’s responsibility to ensure that the given argument is meaningful.

Functions provided are:

  • variance(data, xbar=None)
    sample variance of data, optionally using xbar as the sample mean.
  • stdev(data, xbar=None)
    sample standard deviation of data, optionally using xbar as the sample mean.
  • pvariance(data, mu=None)
    population variance of data, optionally using mu as the population mean.
  • pstdev(data, mu=None)
    population standard deviation of data, optionally using mu as the population mean.
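For example, use-case 2 above (reusing an already-computed mean), with the module as shipped:

```python
from statistics import mean, pvariance, variance

data = [1, 2, 2, 2, 3]

# Compute the mean once and pass it in, saving a pass over the data:
xbar = mean(data)            # 2.0
print(variance(data, xbar))  # 0.5 -- same result as variance(data)
print(variance(data))        # 0.5

# The population variant divides by n instead of n - 1:
print(pvariance(data))       # 0.4
```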

Other functions

There is one other public function:

  • sum(data, start=0)
    high-precision sum of numeric data.

Specification

As the proposed reference implementation is in pure Python, other Python implementations can easily make use of the module unchanged, or adapt it as they see fit.

What Should Be The Name Of The Module?

This will be a top-level module, statistics.

There was some interest in turning math into a package, and making this a sub-module of math, but the general consensus eventually agreed on a top-level module. Other potential but rejected names included stats (too much risk of confusion with the existing stat module) and statslib (described as “too C-like”).

Discussion And Resolved Issues

This proposal has been previously discussed here[21].

A number of design issues were resolved during the discussion on Python-Ideas and the initial code review. There was a lot of concern about the addition of yet another sum function to the standard library; see the FAQs below for more details. In addition, the initial implementation of sum suffered from some rounding issues and other design problems when dealing with Decimals. Oscar Benjamin’s assistance in resolving this was invaluable.

Another issue was the handling of data in the form of iterators. The first implementation of variance silently swapped between a one- and two-pass algorithm, depending on whether the data was in the form of an iterator or sequence. This proved to be a design mistake, as the calculated variance could differ slightly depending on the algorithm used, and variance etc. were changed to internally generate a list and always use the more accurate two-pass implementation.

One controversial design involved the functions to calculate median, which were implemented as attributes on the median callable, e.g. median, median.low, median.high etc. Although there is at least one existing use of this style in the standard library, in unittest.mock, the code reviewers felt that this was too unusual for the standard library. Consequently, the design has been changed to a more traditional design of separate functions with a pseudo-namespace naming convention, median_low, median_high, etc.

Another issue that was of concern to code reviewers was the existence of a function calculating the sample mode of continuous data, with some people questioning the choice of algorithm, and whether it was a sufficiently common need to be included. So it was dropped from the API, and mode now implements only the basic schoolbook algorithm based on counting unique values.

Another significant point of discussion was calculating statistics of timedelta objects. Although the statistics module will not directly support timedelta objects, it is possible to support this use-case by converting them to numbers first using the timedelta.total_seconds method.
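A sketch of that workaround: convert each timedelta to seconds, take the statistic, then convert back.

```python
from datetime import timedelta
from statistics import mean

durations = [timedelta(minutes=5), timedelta(minutes=7), timedelta(minutes=9)]

# total_seconds() turns each timedelta into a float; the mean of
# those floats converts cleanly back into a timedelta.
avg = timedelta(seconds=mean(td.total_seconds() for td in durations))
print(avg)   # 0:07:00
```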

Frequently Asked Questions

Shouldn’t this module spend time on PyPI before being considered for the standard library?

Older versions of this module have been available on PyPI[22] since 2010. Being much simpler than numpy, it does not require many years of external development.

Does the standard library really need yet another version of sum?

This proved to be the most controversial part of the reference implementation. In one sense, clearly three sums is two too many. But in another sense, yes. The reasons why the two existing versions are unsuitable are described here[23], but the short summary is:

  • the built-in sum can lose precision with floats;
  • the built-in sum accepts any non-numeric data type that supports the + operator, apart from strings and bytes;
  • math.fsum is high-precision, but coerces all arguments to float.
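All three behaviours are easy to demonstrate:

```python
import math
from fractions import Fraction

# 1. Built-in sum accumulates float rounding error:
print(sum([0.1] * 10))        # 0.9999999999999999
print(math.fsum([0.1] * 10))  # 1.0

# 2. Built-in sum happily "adds" non-numeric types such as lists:
print(sum([[1], [2], [3]], []))   # [1, 2, 3]

# 3. math.fsum is precise but coerces everything to float:
print(math.fsum([Fraction(1, 2), Fraction(1, 2)]))   # 1.0, a float
```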

There was some interest in “fixing” one or the other of the existing sums. If this occurs before the 3.4 feature-freeze, the decision to keep statistics.sum can be re-considered.

Will this module be backported to older versions of Python?

The module currently targets 3.3, and I will make it available on PyPI for 3.3 for the foreseeable future. Backporting to older versions of the 3.x series is likely (but not yet decided). Backporting to 2.7 is less likely, but not ruled out.

Is this supposed to replace numpy?

No. While it is likely to grow over the years (see Future Work below), it is not aimed to replace, or even compete directly with, numpy. Numpy is a full-featured numeric library aimed at professionals, the nuclear reactor of numeric libraries in the Python ecosystem. This is just a battery, as in “batteries included”, and is aimed at an intermediate level somewhere between “use numpy” and “roll your own version”.

Future Work

  • At this stage, I am unsure of the best API for multivariate statistical functions such as linear regression, correlation coefficient, and covariance. Possible APIs include:
    • Separate arguments for x and y data:
      function([x0, x1, ...], [y0, y1, ...])
    • A single argument for (x, y) data:
      function([(x0, y0), (x1, y1), ...])

      This API is preferred by GvR[24].

    • Selecting arbitrary columns from a 2D array:
      function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
    • Some combination of the above.

    In the absence of a consensus on a preferred API for multivariate stats, I will defer including such multivariate functions until Python 3.5.

  • Likewise, functions for calculating the probability of random variables and inference testing (e.g. Student’s t-test) will be deferred until 3.5.
  • There is considerable interest in including one-pass functions that can calculate multiple statistics from data in iterator form, without having to convert to a list. The experimental stats package on PyPI includes co-routine versions of statistics functions. Including these will be deferred to 3.5.

References

[1]
https://mail.python.org/pipermail/python-dev/2010-October/104721.html
[2]
http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
[3]
Gnumeric: https://projects.gnome.org/gnumeric/functions.shtml

LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five

[4]
Scipy: http://scipy-central.org/
Numpy: http://www.numpy.org/
[5]
http://wiki.scipy.org/Numpy_Functions_by_Category
[6]
Tested with numpy 1.6.1 and Python 2.7.
[7]
http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
[8]
http://rosettacode.org/wiki/Standard_deviation
[9]
https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
[10]
http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
[11]
http://www.r-project.org/
[12]
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
[13]
https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
[14]
http://ruby-statsample.rubyforge.org/
[15]
http://www.php.net/manual/en/ref.stats.php
[16]
http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
[17]
http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
[18] (1,2)
http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
[19]
http://mathworld.wolfram.com/Skewness.html
[20]
At least, tedious to those who don’t like this sort of thing.
[21]
https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
[22]
https://pypi.python.org/pypi/stats/
[23]
https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
[24]
https://mail.python.org/pipermail/python-dev/2013-September/128429.html

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0450.rst

Last modified: 2025-02-01 08:59:27 GMT

