Commit 6d34eb0

gh-115532: Add kernel density estimation to the statistics module (gh-115863)

1 parent 6a3236f · commit 6d34eb0

5 files changed: +285 −41 lines

‎Doc/library/statistics.rst

Lines changed: 49 additions & 40 deletions
@@ -76,6 +76,7 @@ or sample.
 :func:`fmean`  Fast, floating point arithmetic mean, with optional weighting.
 :func:`geometric_mean`  Geometric mean of data.
 :func:`harmonic_mean`  Harmonic mean of data.
+:func:`kde`  Estimate the probability density distribution of the data.
 :func:`median`  Median (middle value) of data.
 :func:`median_low`  Low median of data.
 :func:`median_high`  High median of data.
@@ -259,6 +260,54 @@ However, for reading convenience, most of the examples show sorted sequences.
    .. versionchanged:: 3.10
       Added support for *weights*.

+
+.. function:: kde(data, h, kernel='normal')
+
+   `Kernel Density Estimation (KDE)
+   <https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
+   Create a continuous probability density function from discrete samples.
+
+   The basic idea is to smooth the data using `a kernel function
+   <https://en.wikipedia.org/wiki/Kernel_(statistics)>`_
+   to help draw inferences about a population from a sample.
+
+   The degree of smoothing is controlled by the scaling parameter *h*
+   which is called the bandwidth.  Smaller values emphasize local
+   features while larger values give smoother results.
+
+   The *kernel* determines the relative weights of the sample data
+   points.  Generally, the choice of kernel shape does not matter
+   as much as the more influential bandwidth smoothing parameter.
+
+   Kernels that give some weight to every sample point include
+   *normal* or *gauss*, *logistic*, and *sigmoid*.
+
+   Kernels that only give weight to sample points within the bandwidth
+   include *rectangular* or *uniform*, *triangular*, *parabolic* or
+   *epanechnikov*, *quartic* or *biweight*, *triweight*, and *cosine*.
+
+   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.
+
+   `Wikipedia has an example
+   <https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
+   where we can use :func:`kde` to generate and plot a probability
+   density function estimated from a small sample:
+
+   .. doctest::
+
+      >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+      >>> f_hat = kde(sample, h=1.5)
+      >>> xarr = [i/100 for i in range(-750, 1100)]
+      >>> yarr = [f_hat(x) for x in xarr]
+
+   The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
+
+   .. image:: kde_example.png
+      :alt: Scatter plot of the estimated probability density function.
+
+   .. versionadded:: 3.13
+
+
 .. function:: median(data)

    Return the median (middle value) of numeric data, using the common "mean of
@@ -1095,46 +1144,6 @@ The final prediction goes to the largest posterior. This is known as the
 'female'


-Kernel density estimation
-*************************
-
-It is possible to estimate a continuous probability density function
-from a fixed number of discrete samples.
-
-The basic idea is to smooth the data using `a kernel function such as a
-normal distribution, triangular distribution, or uniform distribution
-<https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use>`_.
-The degree of smoothing is controlled by a scaling parameter, ``h``,
-which is called the *bandwidth*.
-
-.. testcode::
-
-   def kde_normal(sample, h):
-       "Create a continuous probability density function from a sample."
-       # Smooth the sample with a normal distribution kernel scaled by h.
-       kernel_h = NormalDist(0.0, h).pdf
-       n = len(sample)
-       def pdf(x):
-           return sum(kernel_h(x - x_i) for x_i in sample) / n
-       return pdf
-
-`Wikipedia has an example
-<https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
-where we can use the ``kde_normal()`` recipe to generate and plot
-a probability density function estimated from a small sample:
-
-.. doctest::
-
-   >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
-   >>> f_hat = kde_normal(sample, h=1.5)
-   >>> xarr = [i/100 for i in range(-750, 1100)]
-   >>> yarr = [f_hat(x) for x in xarr]
-
-The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
-
-.. image:: kde_example.png
-   :alt: Scatter plot of the estimated probability density function.
-
 ..
    # This modelines must appear within the last ten lines of the file.
    kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
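
The documentation added above explains that the bandwidth h controls the degree of smoothing. A minimal usage sketch, assuming a Python 3.13+ interpreter where statistics.kde is available, that builds two estimates from the same sample with a narrow and a wide bandwidth:

    from statistics import kde

    sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

    # Small h: sharp local bumps around each sample point.
    narrow = kde(sample, h=0.5)

    # Large h: the bumps merge into a single smooth curve.
    wide = kde(sample, h=3.0)

    for x in range(-4, 9, 2):
        print(f'{x:3}: narrow={narrow(x):.3f}  wide={wide(x):.3f}')

With the narrow bandwidth the printed densities rise and fall around the clustered points near -1 and again near 5 and 6; with the wide bandwidth the values vary much more gradually.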

‎Doc/whatsnew/3.13.rst

Lines changed: 8 additions & 0 deletions
@@ -476,6 +476,14 @@ sqlite3
   for filtering database objects to dump.
   (Contributed by Mariusz Felisiak in :gh:`91602`.)

+statistics
+----------
+
+* Add :func:`statistics.kde` for kernel density estimation.
+  This makes it possible to estimate a continuous probability density function
+  from a fixed number of discrete samples.
+  (Contributed by Raymond Hettinger in :gh:`115863`.)
+
 subprocess
 ----------

‎Lib/statistics.py

Lines changed: 167 additions & 1 deletion
@@ -112,6 +112,7 @@
     'fmean',
     'geometric_mean',
     'harmonic_mean',
+    'kde',
     'linear_regression',
     'mean',
     'median',

@@ -137,7 +138,7 @@
 from itertools import count, groupby, repeat
 from bisect import bisect_left, bisect_right
 from math import hypot, sqrt, fabs, exp, erf, tau, log, fsum, sumprod
-from math import isfinite, isinf
+from math import isfinite, isinf, pi, cos, cosh
 from functools import reduce
 from operator import itemgetter
 from collections import Counter, namedtuple, defaultdict
@@ -802,6 +803,171 @@ def multimode(data):
     return [value for value, count in counts.items() if count == maxcount]


+def kde(data, h, kernel='normal'):
+    """Kernel Density Estimation: Create a continuous probability
+    density function from discrete samples.
+
+    The basic idea is to smooth the data using a kernel function
+    to help draw inferences about a population from a sample.
+
+    The degree of smoothing is controlled by the scaling parameter h
+    which is called the bandwidth.  Smaller values emphasize local
+    features while larger values give smoother results.
+
+    The kernel determines the relative weights of the sample data
+    points.  Generally, the choice of kernel shape does not matter
+    as much as the more influential bandwidth smoothing parameter.
+
+    Kernels that give some weight to every sample point:
+
+       normal or gauss
+       logistic
+       sigmoid
+
+    Kernels that only give weight to sample points within
+    the bandwidth:
+
+       rectangular or uniform
+       triangular
+       parabolic or epanechnikov
+       quartic or biweight
+       triweight
+       cosine
+
+    A StatisticsError will be raised if the data sequence is empty.
+
+    Example
+    -------
+
+    Given a sample of six data points, construct a continuous
+    function that estimates the underlying probability density:
+
+        >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+        >>> f_hat = kde(sample, h=1.5)
+
+    Compute the area under the curve:
+
+        >>> sum(f_hat(x) for x in range(-20, 20))
+        1.0
+
+    Plot the estimated probability density function at
+    evenly spaced points from -6 to 10:
+
+        >>> for x in range(-6, 11):
+        ...     density = f_hat(x)
+        ...     plot = ' ' * int(density * 400) + 'x'
+        ...     print(f'{x:2}: {density:.3f} {plot}')
+        ...
+        -6: 0.002 x
+        -5: 0.009 x
+        -4: 0.031 x
+        -3: 0.070 x
+        -2: 0.111 x
+        -1: 0.125 x
+         0: 0.110 x
+         1: 0.086 x
+         2: 0.068 x
+         3: 0.059 x
+         4: 0.066 x
+         5: 0.082 x
+         6: 0.082 x
+         7: 0.058 x
+         8: 0.028 x
+         9: 0.009 x
+        10: 0.002 x
+
+    References
+    ----------
+
+    Kernel density estimation and its application:
+    https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf
+
+    Kernel functions in common use:
+    https://en.wikipedia.org/wiki/Kernel_(statistics)#kernel_functions_in_common_use
+
+    Interactive graphical demonstration and exploration:
+    https://demonstrations.wolfram.com/KernelDensityEstimation/
+
+    """
+
+    n = len(data)
+    if not n:
+        raise StatisticsError('Empty data sequence')
+
+    if not isinstance(data[0], (int, float)):
+        raise TypeError('Data sequence must contain ints or floats')
+
+    if h <= 0.0:
+        raise StatisticsError(f'Bandwidth h must be positive, not {h=!r}')
+
+    match kernel:
+
+        case 'normal' | 'gauss':
+            c = 1 / sqrt(2 * pi)
+            K = lambda t: c * exp(-1/2 * t * t)
+            support = None
+
+        case 'logistic':
+            # 1.0 / (exp(t) + 2.0 + exp(-t))
+            K = lambda t: 1/2 / (1.0 + cosh(t))
+            support = None
+
+        case 'sigmoid':
+            # (2/pi) / (exp(t) + exp(-t))
+            c = 1 / pi
+            K = lambda t: c / cosh(t)
+            support = None
+
+        case 'rectangular' | 'uniform':
+            K = lambda t: 1/2
+            support = 1.0
+
+        case 'triangular':
+            K = lambda t: 1.0 - abs(t)
+            support = 1.0
+
+        case 'parabolic' | 'epanechnikov':
+            K = lambda t: 3/4 * (1.0 - t * t)
+            support = 1.0
+
+        case 'quartic' | 'biweight':
+            K = lambda t: 15/16 * (1.0 - t * t) ** 2
+            support = 1.0
+
+        case 'triweight':
+            K = lambda t: 35/32 * (1.0 - t * t) ** 3
+            support = 1.0
+
+        case 'cosine':
+            c1 = pi / 4
+            c2 = pi / 2
+            K = lambda t: c1 * cos(c2 * t)
+            support = 1.0
+
+        case _:
+            raise StatisticsError(f'Unknown kernel name: {kernel!r}')
+
+    if support is None:
+
+        def pdf(x):
+            return sum(K((x - x_i) / h) for x_i in data) / (n * h)
+
+    else:
+
+        sample = sorted(data)
+        bandwidth = h * support
+
+        def pdf(x):
+            i = bisect_left(sample, x - bandwidth)
+            j = bisect_right(sample, x + bandwidth)
+            supported = sample[i : j]
+            return sum(K((x - x_i) / h) for x_i in supported) / (n * h)
+
+    pdf.__doc__ = f'PDF estimate with {kernel=!r} and {h=!r}'
+
+    return pdf
+
+
 # Notes on methods for computing quantiles
 # ----------------------------------------
 #
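
For the compact-support kernels above, the patch restricts the inner sum to sample points within h * support of x by bisecting into a sorted copy of the data. A standalone sketch of that windowing idea (not the module code itself), specialized to the rectangular kernel where support is 1.0 and K(t) = 1/2:

    from bisect import bisect_left, bisect_right

    def rect_kde(data, h):
        "Sketch: rectangular-kernel KDE that only visits nearby points."
        sample = sorted(data)
        n = len(sample)
        def pdf(x):
            # Only samples in the closed interval [x - h, x + h] contribute.
            i = bisect_left(sample, x - h)
            j = bisect_right(sample, x + h)
            return sum(0.5 for _ in sample[i:j]) / (n * h)
        return pdf

    f_hat = rect_kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], h=1.5)
    print(round(f_hat(0.0), 4))   # 0.1111 -- two of the six points lie in [-1.5, 1.5]

For kernels with infinite support (normal, logistic, sigmoid) no such window exists, which is why the patch keeps a separate branch that sums over every data point.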

‎Lib/test/test_statistics.py

Lines changed: 60 additions & 0 deletions
@@ -2353,6 +2353,66 @@ def test_mixed_int_and_float(self):
         self.assertAlmostEqual(actual_mean, expected_mean, places=5)


+class TestKDE(unittest.TestCase):
+
+    def test_kde(self):
+        kde = statistics.kde
+        StatisticsError = statistics.StatisticsError
+
+        kernels = ['normal', 'gauss', 'logistic', 'sigmoid', 'rectangular',
+                   'uniform', 'triangular', 'parabolic', 'epanechnikov',
+                   'quartic', 'biweight', 'triweight', 'cosine']
+
+        sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+
+        # The approximate integral of a PDF should be close to 1.0
+
+        def integrate(func, low, high, steps=10_000):
+            "Numeric approximation of a definite function integral."
+            dx = (high - low) / steps
+            midpoints = (low + (i + 1/2) * dx for i in range(steps))
+            return sum(map(func, midpoints)) * dx
+
+        for kernel in kernels:
+            with self.subTest(kernel=kernel):
+                f_hat = kde(sample, h=1.5, kernel=kernel)
+                area = integrate(f_hat, -20, 20)
+                self.assertAlmostEqual(area, 1.0, places=4)
+
+        # Check error cases
+
+        with self.assertRaises(StatisticsError):
+            kde([], h=1.0)                        # Empty dataset
+        with self.assertRaises(TypeError):
+            kde(['abc', 'def'], 1.5)              # Non-numeric data
+        with self.assertRaises(TypeError):
+            kde(iter(sample), 1.5)                # Data is not a sequence
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=0.0)                    # Zero bandwidth
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=-1.0)                   # Negative bandwidth
+        with self.assertRaises(TypeError):
+            kde(sample, h='str')                  # Wrong bandwidth type
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=1.0, kernel='bogus')    # Invalid kernel
+
+        # Test name and docstring of the generated function
+
+        h = 1.5
+        kernel = 'cosine'
+        f_hat = kde(sample, h, kernel)
+        self.assertEqual(f_hat.__name__, 'pdf')
+        self.assertIn(kernel, f_hat.__doc__)
+        self.assertIn(str(h), f_hat.__doc__)
+
+        # Test closed interval for the support boundaries.
+        # In particular, 'uniform' should be non-zero at the boundaries.
+
+        f_hat = kde([0], 1.0, 'uniform')
+        self.assertEqual(f_hat(-1.0), 1/2)
+        self.assertEqual(f_hat(1.0), 1/2)
+
+
 class TestQuantiles(unittest.TestCase):

     def test_specific_cases(self):
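
The tests above check normalization and error handling. A further hedged check, not part of the commit, assuming Python 3.13+: the new kde with the default normal kernel should reproduce the kde_normal recipe that this commit removes from the docs, since K((x - x_i)/h)/h equals the pdf of NormalDist(0, h) evaluated at x - x_i:

    from statistics import NormalDist, kde

    def kde_normal(sample, h):
        "The recipe removed from the docs: a normal-kernel KDE."
        kernel_h = NormalDist(0.0, h).pdf
        n = len(sample)
        def pdf(x):
            return sum(kernel_h(x - x_i) for x_i in sample) / n
        return pdf

    sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
    old, new = kde_normal(sample, h=1.5), kde(sample, h=1.5)   # kernel='normal' is the default
    assert all(abs(old(x) - new(x)) < 1e-12 for x in range(-10, 11))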
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add kernel density estimation to the statistics module.
