Commit 6d34eb0

gh-115532: Add kernel density estimation to the statistics module (gh-115863)

1 parent 6a3236f · commit 6d34eb0

5 files changed: +285 −41 lines

‎Doc/library/statistics.rst

Lines changed: 49 additions & 40 deletions
@@ -76,6 +76,7 @@ or sample.
 :func:`fmean`  Fast, floating point arithmetic mean, with optional weighting.
 :func:`geometric_mean`  Geometric mean of data.
 :func:`harmonic_mean`  Harmonic mean of data.
+:func:`kde`  Estimate the probability density distribution of the data.
 :func:`median`  Median (middle value) of data.
 :func:`median_low`  Low median of data.
 :func:`median_high`  High median of data.
@@ -259,6 +260,54 @@ However, for reading convenience, most of the examples show sorted sequences.
    .. versionchanged:: 3.10
       Added support for *weights*.

+
+.. function:: kde(data, h, kernel='normal')
+
+   `Kernel Density Estimation (KDE)
+   <https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
+   Create a continuous probability density function from discrete samples.
+
+   The basic idea is to smooth the data using `a kernel function
+   <https://en.wikipedia.org/wiki/Kernel_(statistics)>`_
+   to help draw inferences about a population from a sample.
+
+   The degree of smoothing is controlled by the scaling parameter *h*
+   which is called the bandwidth.  Smaller values emphasize local
+   features while larger values give smoother results.
+
+   The *kernel* determines the relative weights of the sample data
+   points.  Generally, the choice of kernel shape does not matter
+   as much as the more influential bandwidth smoothing parameter.
+
+   Kernels that give some weight to every sample point include
+   *normal* or *gauss*, *logistic*, and *sigmoid*.
+
+   Kernels that only give weight to sample points within the bandwidth
+   include *rectangular* or *uniform*, *triangular*, *parabolic* or
+   *epanechnikov*, *quartic* or *biweight*, *triweight*, and *cosine*.
+
+   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.
+
+   `Wikipedia has an example
+   <https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
+   where we can use :func:`kde` to generate and plot a probability
+   density function estimated from a small sample:
+
+   .. doctest::
+
+      >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+      >>> f_hat = kde(sample, h=1.5)
+      >>> xarr = [i/100 for i in range(-750, 1100)]
+      >>> yarr = [f_hat(x) for x in xarr]
+
+   The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
+
+   .. image:: kde_example.png
+      :alt: Scatter plot of the estimated probability density function.
+
+   .. versionadded:: 3.13
+
+
 .. function:: median(data)

    Return the median (middle value) of numeric data, using the common "mean of
@@ -1095,46 +1144,6 @@ The final prediction goes to the largest posterior. This is known as the
 'female'


-Kernel density estimation
-*************************
-
-It is possible to estimate a continuous probability density function
-from a fixed number of discrete samples.
-
-The basic idea is to smooth the data using `a kernel function such as a
-normal distribution, triangular distribution, or uniform distribution
-<https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use>`_.
-The degree of smoothing is controlled by a scaling parameter, ``h``,
-which is called the *bandwidth*.
-
-.. testcode::
-
-   def kde_normal(sample, h):
-       "Create a continuous probability density function from a sample."
-       # Smooth the sample with a normal distribution kernel scaled by h.
-       kernel_h = NormalDist(0.0, h).pdf
-       n = len(sample)
-       def pdf(x):
-           return sum(kernel_h(x - x_i) for x_i in sample) / n
-       return pdf
-
-`Wikipedia has an example
-<https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
-where we can use the ``kde_normal()`` recipe to generate and plot
-a probability density function estimated from a small sample:
-
-.. doctest::
-
-   >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
-   >>> f_hat = kde_normal(sample, h=1.5)
-   >>> xarr = [i/100 for i in range(-750, 1100)]
-   >>> yarr = [f_hat(x) for x in xarr]
-
-The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
-
-.. image:: kde_example.png
-   :alt: Scatter plot of the estimated probability density function.
-
 ..
    # This modelines must appear within the last ten lines of the file.
    kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
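
The documentation added above explains that the bandwidth h controls the degree of smoothing. A minimal usage sketch, assuming a Python 3.13+ interpreter where statistics.kde is available, that builds two estimates from the same sample with a narrow and a wide bandwidth:

    from statistics import kde

    sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

    # Small h: sharp local bumps around each sample point.
    narrow = kde(sample, h=0.5)

    # Large h: the bumps merge into a single smooth curve.
    wide = kde(sample, h=3.0)

    for x in range(-4, 9, 2):
        print(f'{x:3}: narrow={narrow(x):.3f}  wide={wide(x):.3f}')

With the narrow bandwidth the printed densities rise and fall around the clustered points near -1 and again near 5 and 6; with the wide bandwidth the values vary much more gradually.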

‎Doc/whatsnew/3.13.rst

Lines changed: 8 additions & 0 deletions
@@ -476,6 +476,14 @@ sqlite3
   for filtering database objects to dump.
   (Contributed by Mariusz Felisiak in :gh:`91602`.)

+statistics
+----------
+
+* Add :func:`statistics.kde` for kernel density estimation.
+  This makes it possible to estimate a continuous probability density function
+  from a fixed number of discrete samples.
+  (Contributed by Raymond Hettinger in :gh:`115863`.)
+
 subprocess
 ----------

‎Lib/statistics.py

Lines changed: 167 additions & 1 deletion
@@ -112,6 +112,7 @@
     'fmean',
     'geometric_mean',
     'harmonic_mean',
+    'kde',
     'linear_regression',
     'mean',
     'median',

@@ -137,7 +138,7 @@
 from itertools import count, groupby, repeat
 from bisect import bisect_left, bisect_right
 from math import hypot, sqrt, fabs, exp, erf, tau, log, fsum, sumprod
-from math import isfinite, isinf
+from math import isfinite, isinf, pi, cos, cosh
 from functools import reduce
 from operator import itemgetter
 from collections import Counter, namedtuple, defaultdict
@@ -802,6 +803,171 @@ def multimode(data):
     return [value for value, count in counts.items() if count == maxcount]


+def kde(data, h, kernel='normal'):
+    """Kernel Density Estimation: Create a continuous probability
+    density function from discrete samples.
+
+    The basic idea is to smooth the data using a kernel function
+    to help draw inferences about a population from a sample.
+
+    The degree of smoothing is controlled by the scaling parameter h
+    which is called the bandwidth.  Smaller values emphasize local
+    features while larger values give smoother results.
+
+    The kernel determines the relative weights of the sample data
+    points.  Generally, the choice of kernel shape does not matter
+    as much as the more influential bandwidth smoothing parameter.
+
+    Kernels that give some weight to every sample point:
+
+       normal or gauss
+       logistic
+       sigmoid
+
+    Kernels that only give weight to sample points within
+    the bandwidth:
+
+       rectangular or uniform
+       triangular
+       parabolic or epanechnikov
+       quartic or biweight
+       triweight
+       cosine
+
+    A StatisticsError will be raised if the data sequence is empty.
+
+    Example
+    -------
+
+    Given a sample of six data points, construct a continuous
+    function that estimates the underlying probability density:
+
+        >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+        >>> f_hat = kde(sample, h=1.5)
+
+    Compute the area under the curve:
+
+        >>> sum(f_hat(x) for x in range(-20, 20))
+        1.0
+
+    Plot the estimated probability density function at
+    evenly spaced points from -6 to 10:
+
+        >>> for x in range(-6, 11):
+        ...     density = f_hat(x)
+        ...     plot = ' ' * int(density * 400) + 'x'
+        ...     print(f'{x:2}: {density:.3f} {plot}')
+        ...
+        -6: 0.002 x
+        -5: 0.009 x
+        -4: 0.031 x
+        -3: 0.070 x
+        -2: 0.111 x
+        -1: 0.125 x
+         0: 0.110 x
+         1: 0.086 x
+         2: 0.068 x
+         3: 0.059 x
+         4: 0.066 x
+         5: 0.082 x
+         6: 0.082 x
+         7: 0.058 x
+         8: 0.028 x
+         9: 0.009 x
+        10: 0.002 x
+
+    References
+    ----------
+
+    Kernel density estimation and its application:
+    https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf
+
+    Kernel functions in common use:
+    https://en.wikipedia.org/wiki/Kernel_(statistics)#kernel_functions_in_common_use
+
+    Interactive graphical demonstration and exploration:
+    https://demonstrations.wolfram.com/KernelDensityEstimation/
+
+    """
+
+    n = len(data)
+    if not n:
+        raise StatisticsError('Empty data sequence')
+
+    if not isinstance(data[0], (int, float)):
+        raise TypeError('Data sequence must contain ints or floats')
+
+    if h <= 0.0:
+        raise StatisticsError(f'Bandwidth h must be positive, not {h=!r}')
+
+    match kernel:
+
+        case 'normal' | 'gauss':
+            c = 1 / sqrt(2 * pi)
+            K = lambda t: c * exp(-1/2 * t * t)
+            support = None
+
+        case 'logistic':
+            # 1.0 / (exp(t) + 2.0 + exp(-t))
+            K = lambda t: 1/2 / (1.0 + cosh(t))
+            support = None
+
+        case 'sigmoid':
+            # (2/pi) / (exp(t) + exp(-t))
+            c = 1 / pi
+            K = lambda t: c / cosh(t)
+            support = None
+
+        case 'rectangular' | 'uniform':
+            K = lambda t: 1/2
+            support = 1.0
+
+        case 'triangular':
+            K = lambda t: 1.0 - abs(t)
+            support = 1.0
+
+        case 'parabolic' | 'epanechnikov':
+            K = lambda t: 3/4 * (1.0 - t * t)
+            support = 1.0
+
+        case 'quartic' | 'biweight':
+            K = lambda t: 15/16 * (1.0 - t * t) ** 2
+            support = 1.0
+
+        case 'triweight':
+            K = lambda t: 35/32 * (1.0 - t * t) ** 3
+            support = 1.0
+
+        case 'cosine':
+            c1 = pi / 4
+            c2 = pi / 2
+            K = lambda t: c1 * cos(c2 * t)
+            support = 1.0
+
+        case _:
+            raise StatisticsError(f'Unknown kernel name: {kernel!r}')
+
+    if support is None:
+
+        def pdf(x):
+            return sum(K((x - x_i) / h) for x_i in data) / (n * h)
+
+    else:
+
+        sample = sorted(data)
+        bandwidth = h * support
+
+        def pdf(x):
+            i = bisect_left(sample, x - bandwidth)
+            j = bisect_right(sample, x + bandwidth)
+            supported = sample[i : j]
+            return sum(K((x - x_i) / h) for x_i in supported) / (n * h)
+
+    pdf.__doc__ = f'PDF estimate with {kernel=!r} and {h=!r}'
+
+    return pdf
+
+
 # Notes on methods for computing quantiles
 # ----------------------------------------
 #
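
For the compact-support kernels above, the patch restricts the inner sum to sample points within h * support of x by bisecting into a sorted copy of the data. A standalone sketch of that windowing idea (not the module code itself), specialized to the rectangular kernel where support is 1.0 and K(t) = 1/2:

    from bisect import bisect_left, bisect_right

    def rect_kde(data, h):
        "Sketch: rectangular-kernel KDE that only visits nearby points."
        sample = sorted(data)
        n = len(sample)
        def pdf(x):
            # Only samples in the closed interval [x - h, x + h] contribute.
            i = bisect_left(sample, x - h)
            j = bisect_right(sample, x + h)
            return sum(0.5 for _ in sample[i:j]) / (n * h)
        return pdf

    f_hat = rect_kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], h=1.5)
    print(round(f_hat(0.0), 4))   # 0.1111 -- two of the six points lie in [-1.5, 1.5]

For kernels with infinite support (normal, logistic, sigmoid) no such window exists, which is why the patch keeps a separate branch that sums over every data point.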

‎Lib/test/test_statistics.py

Lines changed: 60 additions & 0 deletions
@@ -2353,6 +2353,66 @@ def test_mixed_int_and_float(self):
         self.assertAlmostEqual(actual_mean, expected_mean, places=5)


+class TestKDE(unittest.TestCase):
+
+    def test_kde(self):
+        kde = statistics.kde
+        StatisticsError = statistics.StatisticsError
+
+        kernels = ['normal', 'gauss', 'logistic', 'sigmoid', 'rectangular',
+                   'uniform', 'triangular', 'parabolic', 'epanechnikov',
+                   'quartic', 'biweight', 'triweight', 'cosine']
+
+        sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+
+        # The approximate integral of a PDF should be close to 1.0
+
+        def integrate(func, low, high, steps=10_000):
+            "Numeric approximation of a definite function integral."
+            dx = (high - low) / steps
+            midpoints = (low + (i + 1/2) * dx for i in range(steps))
+            return sum(map(func, midpoints)) * dx
+
+        for kernel in kernels:
+            with self.subTest(kernel=kernel):
+                f_hat = kde(sample, h=1.5, kernel=kernel)
+                area = integrate(f_hat, -20, 20)
+                self.assertAlmostEqual(area, 1.0, places=4)
+
+        # Check error cases
+
+        with self.assertRaises(StatisticsError):
+            kde([], h=1.0)                        # Empty dataset
+        with self.assertRaises(TypeError):
+            kde(['abc', 'def'], 1.5)              # Non-numeric data
+        with self.assertRaises(TypeError):
+            kde(iter(sample), 1.5)                # Data is not a sequence
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=0.0)                    # Zero bandwidth
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=-1.0)                   # Negative bandwidth
+        with self.assertRaises(TypeError):
+            kde(sample, h='str')                  # Wrong bandwidth type
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=1.0, kernel='bogus')    # Invalid kernel
+
+        # Test name and docstring of the generated function
+
+        h = 1.5
+        kernel = 'cosine'
+        f_hat = kde(sample, h, kernel)
+        self.assertEqual(f_hat.__name__, 'pdf')
+        self.assertIn(kernel, f_hat.__doc__)
+        self.assertIn(str(h), f_hat.__doc__)
+
+        # Test closed interval for the support boundaries.
+        # In particular, 'uniform' should be non-zero at the boundaries.
+
+        f_hat = kde([0], 1.0, 'uniform')
+        self.assertEqual(f_hat(-1.0), 1/2)
+        self.assertEqual(f_hat(1.0), 1/2)
+
+
 class TestQuantiles(unittest.TestCase):

     def test_specific_cases(self):
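
The tests above check normalization and error handling. A further hedged check, not part of the commit, assuming Python 3.13+: the new kde with the default normal kernel should reproduce the kde_normal recipe that this commit removes from the docs, since K((x - x_i)/h)/h equals the pdf of NormalDist(0, h) evaluated at x - x_i:

    from statistics import NormalDist, kde

    def kde_normal(sample, h):
        "The recipe removed from the docs: a normal-kernel KDE."
        kernel_h = NormalDist(0.0, h).pdf
        n = len(sample)
        def pdf(x):
            return sum(kernel_h(x - x_i) for x_i in sample) / n
        return pdf

    sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
    old, new = kde_normal(sample, h=1.5), kde(sample, h=1.5)   # kernel='normal' is the default
    assert all(abs(old(x) - new(x)) < 1e-12 for x in range(-10, 11))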
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add kernel density estimation to the statistics module.
