This PEP proposes the addition of a module for common statistics functions such as mean, median, variance and standard deviation to the Python standard library. See also http://bugs.python.org/issue18606
The proposed statistics module is motivated by the “batteries included” philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code.[1] Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. Even cheap scientific calculators typically include multiple statistical functions such as:
Graphing calculators aimed at Secondary School students typically include all of the above, plus some or all of:
and others[2]. Likewise spreadsheet applications such as Microsoft Excel, LibreOffice and Gnumeric include rich collections of statistical functions[3].
In contrast, Python currently has no standard way to calculate even the simplest and most obvious statistical functions such as mean. For those who need statistical functions in Python, there are two obvious solutions:
Numpy is perhaps the most full-featured solution, but it has a few disadvantages:
“It can be hard to know what functions are available in numpy. This is not a complete list, but it does cover most of them.”[5]
and then goes on to list over 270 functions, only a small number of which are related to statistics.
numpy.mean takes four arguments: mean(a, axis=None, dtype=None, out=None)
although fortunately for the beginner or casual numpy user, three are optional and numpy.mean does the right thing in simple cases:
>>> numpy.mean([1, 2, 3, 4])
2.5
This leads to option number 2, DIY statistics functions. At first glance, this appears to be an attractive option, due to the apparent simplicity of common statistical functions. For example:
import math

def mean(data):
    return sum(data) / len(data)

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2) / n
    return ss / (n - 1)

def standard_deviation(data):
    return math.sqrt(variance(data))
The above appears to be correct with a casual test:
>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5
But adding a constant to every data point should not change the variance:
>>> data = [x + 1e12 for x in data]
>>> variance(data)
0.0
And variance should never be negative:
>>> variance(data * 100)
-1239429440.1282566
By contrast, the proposed reference implementation gets the exactly correct answer 7.5 for the first two examples, and a reasonably close answer for the third: 6.012. numpy does no better[6].
Even simple statistical calculations contain traps for the unwary, starting with the Computational Formula itself. Despite the name, it is numerically unstable and can be extremely inaccurate, as can be seen above. It is completely unsuitable for computation by computer[7]. This problem plagues users of many programming languages, not just Python[8], as coders reinvent the same numerically inaccurate code over and over again[9], or advise others to do so[10].
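To illustrate the point, a numerically stable alternative is the straightforward two-pass algorithm: compute the mean first, then sum squared deviations from it. The sketch below is for illustration only, not the PEP's reference implementation, and the name variance_two_pass is hypothetical:

```python
def variance_two_pass(data):
    # Sketch of a numerically stable sample variance.
    # First pass: compute the mean.
    data = list(data)
    n = len(data)
    xbar = sum(data) / n
    # Second pass: sum squared deviations from the mean.
    # Subtracting the mean before squaring avoids the catastrophic
    # cancellation exhibited by the Computational Formula.
    ss = sum((x - xbar) ** 2 for x in data)
    return ss / (n - 1)

# Shifting the data by a huge constant no longer destroys the result:
data = [1, 2, 4, 5, 8]
assert variance_two_pass(data) == 7.5
assert abs(variance_two_pass([x + 1e12 for x in data]) - 7.5) < 1e-6
```

Note that the two-pass approach must realise the data as a list, which is one reason the handling of iterators matters for the design (see the discussion of iterators below).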
It isn’t just the variance and standard deviation. Even the mean is not quite as straightforward as it might appear. The above implementation seems too simple to have problems, but it does:
sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this “torture test”:

assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float. While the above mean implementation does not fail quite as catastrophically as the naive variance does, a standard library function can do much better than the DIY versions.
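Both halves of that trade-off can be seen in a short sketch (fsum_mean is a hypothetical name used here for illustration): math.fsum passes the torture test, but silently coerces exact types to float:

```python
import math
from fractions import Fraction

def fsum_mean(data):
    # Hypothetical mean built on math.fsum.
    data = list(data)
    return math.fsum(data) / len(data)

# Accurate for floats of wildly differing magnitude:
assert fsum_mean([1e30, 1, 3, -1e30]) == 1.0

# But the exact Fraction type is lost: the result is a float.
result = fsum_mean([Fraction(1, 4), Fraction(3, 4)])
assert isinstance(result, float)
assert result == 0.5
```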
The example above involves an especially bad set of data, but even for more realistic data sets accuracy is important. The first step in interpreting variation in data (including dealing with ill-conditioned data) is often to standardize it to a series with variance 1 (and often mean 0). This standardization requires accurate computation of the mean and variance of the raw series. Naive computation of mean and variance can lose precision very quickly. Because precision bounds accuracy, it is important to use the most precise algorithms for computing mean and variance that are practical, or the results of standardization are themselves useless.
The proposed statistics library is not intended to be a competitor to such third-party libraries as numpy/scipy, or of proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Most programming languages have little or no built-in support for statistics functions. Some exceptions:
R (and its proprietary cousin, S) is a programming language designed for statistics work. It is extremely popular with statisticians and is extremely feature-rich[11].
The C# LINQ package includes extension methods to calculate the average of enumerables[12].
Ruby does not ship with a standard statistics module, despite some apparent demand[13]. Statsample appears to be a feature-rich third-party library, aiming to compete with R[14].
PHP has an extremely feature-rich (although mostly undocumented) set of advanced statistical functions[15].
Delphi includes standard statistical functions including Mean, Sum, Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
The GNU Scientific Library includes standard statistical functions, percentiles, median and others[17]. One innovation I have borrowed from the GSL is to allow the caller to optionally specify the pre-calculated mean of the sample (or an a priori known population mean) when calculating the variance and standard deviation[18].
My intention is to start small and grow the library as needed, rather than try to include everything from the start. Consequently, the current reference implementation includes only a small number of functions: mean, variance, standard deviation, median, mode. (See the reference implementation for a full list.)
I have aimed for the following design features:
The initial version of the library will provide univariate (single variable) statistics functions. The general API will be based on a functional model function(data, ...) -> result, where data is a mandatory iterable of (usually) numeric data.
The author expects that lists will be the most common data type used, but any iterable type should be acceptable. Where necessary, functions may convert to lists internally. Where possible, functions are expected to conserve the type of the data values, for example, the mean of a list of Decimals should be a Decimal rather than float.
The mean, median* and mode functions take a single mandatory argument and return the appropriate statistic, e.g.:
>>> mean([1, 2, 3])
2.0
Functions provided are:
mean(data)
median(data)
median_high(data)
median_low(data)
median_grouped(data, interval=1)
mode(data)

mode is the sole exception to the rule that the data argument must be numeric. It will also accept an iterable of nominal data, such as strings.
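For instance, the median variants differ only in how they treat an even number of data points; the behaviour sketched below follows the proposed API (shown here using the statistics module as it was ultimately shipped):

```python
import statistics

data = [1, 3, 5, 7]
# With an even number of points, median interpolates between the
# two middle values, while median_low and median_high each return
# an actual data point.
assert statistics.median(data) == 4.0
assert statistics.median_low(data) == 3
assert statistics.median_high(data) == 5

# mode also accepts nominal (non-numeric) data such as strings:
assert statistics.mode(["red", "blue", "red"]) == "red"
```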
In order to be similar to scientific calculators, the statistics module will include separate functions for population and sample variance and standard deviation. All four functions have similar signatures, with a single mandatory argument, an iterable of numeric data, e.g.:
>>> variance([1, 2, 2, 2, 3])
0.5
All four functions also accept a second, optional, argument, the mean of the data. This is modelled on a similar API provided by the GNU Scientific Library[18]. There are three use-cases for using this argument, in no particular order:
In each case, it is the caller’s responsibility to ensure that the given argument is meaningful.
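A sketch of the optional-mean argument in use, following the proposed signatures (shown here with the statistics module as it was ultimately shipped):

```python
import statistics

data = [1, 2, 2, 2, 3]
# Pre-compute the mean once, then pass it to variance, avoiding
# a redundant pass over the data.
xbar = statistics.mean(data)
assert statistics.variance(data, xbar) == 0.5

# The result is the same as letting variance compute the mean itself:
assert statistics.variance(data) == 0.5
```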
Functions provided are:
variance(data, xbar=None)
stdev(data, xbar=None)
pvariance(data, mu=None)
pstdev(data, mu=None)

There is one other public function:
sum(data, start=0)

As the proposed reference implementation is in pure Python, other Python implementations can easily make use of the module unchanged, or adapt it as they see fit.
This will be a top-level module statistics.
There was some interest in turning math into a package, and making this a sub-module of math, but the general consensus eventually agreed on a top-level module. Other potential but rejected names included stats (too much risk of confusion with the existing stat module), and statslib (described as “too C-like”).
This proposal has been previously discussed here[21].
A number of design issues were resolved during the discussion on Python-Ideas and the initial code review. There was a lot of concern about the addition of yet another sum function to the standard library, see the FAQs below for more details. In addition, the initial implementation of sum suffered from some rounding issues and other design problems when dealing with Decimals. Oscar Benjamin’s assistance in resolving this was invaluable.
Another issue was the handling of data in the form of iterators. The first implementation of variance silently swapped between a one- and two-pass algorithm, depending on whether the data was in the form of an iterator or sequence. This proved to be a design mistake, as the calculated variance could differ slightly depending on the algorithm used, and variance etc. were changed to internally generate a list and always use the more accurate two-pass implementation.
One controversial design involved the functions to calculate median, which were implemented as attributes on the median callable, e.g. median, median.low, median.high etc. Although there is at least one existing use of this style in the standard library, in unittest.mock, the code reviewers felt that this was too unusual for the standard library. Consequently, the design has been changed to a more traditional design of separate functions with a pseudo-namespace naming convention, median_low, median_high, etc.
Another issue that was of concern to code reviewers was the existence of a function calculating the sample mode of continuous data, with some people questioning the choice of algorithm, and whether it was a sufficiently common need to be included. So it was dropped from the API, and mode now implements only the basic schoolbook algorithm based on counting unique values.
Another significant point of discussion was calculating statistics of timedelta objects. Although the statistics module will not directly support timedelta objects, it is possible to support this use-case by converting them to numbers first using the timedelta.total_seconds method.
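A sketch of the suggested workaround: convert the timedelta objects to seconds, compute the statistic on the numbers, and convert the result back if needed.

```python
from datetime import timedelta
import statistics

deltas = [timedelta(seconds=30), timedelta(minutes=1, seconds=30)]

# Convert each timedelta to a plain number of seconds first:
avg_seconds = statistics.mean(d.total_seconds() for d in deltas)
assert avg_seconds == 60.0

# ...and convert the result back to a timedelta if desired:
assert timedelta(seconds=avg_seconds) == timedelta(minutes=1)
```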
Older versions of this module have been available on PyPI[22] since 2010. Being much simpler than numpy, it does not require many years of external development.
Yet another sum?

This proved to be the most controversial part of the reference implementation. In one sense, clearly three sums is two too many. But in another sense, yes. The reasons why the two existing versions are unsuitable are described here[23] but the short summary is:
the built-in sum accepts any non-numeric data that supports the + operator, apart from strings and bytes;
math.fsum is high-precision, but coerces all arguments to float.

There was some interest in “fixing” one or the other of the existing sums. If this occurs before 3.4 feature-freeze, the decision to keep statistics.sum can be re-considered.
The module currently targets 3.3, and I will make it available on PyPI for 3.3 for the foreseeable future. Backporting to older versions of the 3.x series is likely (but not yet decided). Backporting to 2.7 is less likely but not ruled out.
No. While it is likely to grow over the years (see open issues below) it is not aimed to replace, or even compete directly with, numpy. Numpy is a full-featured numeric library aimed at professionals, the nuclear reactor of numeric libraries in the Python ecosystem. This is just a battery, as in “batteries included”, and is aimed at an intermediate level somewhere between “use numpy” and “roll your own version”.
function([x0, x1, ...], [y0, y1, ...])
function([(x0, y0), (x1, y1), ...])
This API is preferred by GvR[24].
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
In the absence of a consensus of preferred API for multivariate stats, I will defer including such multivariate functions until Python 3.5.
The stats package on PyPI includes co-routine versions of statistics functions. Including these will be deferred to 3.5.

LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-0450.rst
Last modified: 2025-02-01 08:59:27 GMT