
Commit c48fdb3

Pushing the docs to dev/ for branch: master, commit b4db36d337a4ff83f1bcb37c5a8c615d3134d372

1 parent 69acbb1, commit c48fdb3

File tree

1,238 files changed: +4184 / -3859 lines changed

Binary file not shown.

dev/_downloads/a2486a67d0a96c8526fd62fbb80c78ba/plot_mahalanobis_distances.py

Lines changed: 137 additions & 88 deletions
@@ -3,67 +3,86 @@
 Robust covariance estimation and Mahalanobis distances relevance
 ================================================================
 
-An example to show covariance estimation with the Mahalanobis
+This example shows covariance estimation with Mahalanobis
 distances on Gaussian distributed data.
 
 For Gaussian distributed data, the distance of an observation
 :math:`x_i` to the mode of the distribution can be computed using its
-Mahalanobis distance: :math:`d_{(\mu,\Sigma)}(x_i)^2 = (x_i -
-\mu)'\Sigma^{-1}(x_i - \mu)` where :math:`\mu` and :math:`\Sigma` are
-the location and the covariance of the underlying Gaussian
-distribution.
+Mahalanobis distance:
+
+.. math::
+
+    d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)
+
+where :math:`\mu` and :math:`\Sigma` are the location and the covariance of
+the underlying Gaussian distributions.
 
 In practice, :math:`\mu` and :math:`\Sigma` are replaced by some
-estimates. The usual covariance maximum likelihood estimate is very
-sensitive to the presence of outliers in the data set and therefor,
-the corresponding Mahalanobis distances are. One would better have to
+estimates. The standard covariance maximum likelihood estimate (MLE) is very
+sensitive to the presence of outliers in the data set and therefore,
+the downstream Mahalanobis distances also are. It would be better to
 use a robust estimator of covariance to guarantee that the estimation is
-resistant to "erroneous" observations in the data set and that the
-associated Mahalanobis distances accurately reflect the true
-organisation of the observations.
+resistant to "erroneous" observations in the dataset and that the
+calculated Mahalanobis distances accurately reflect the true
+organization of the observations.
 
-The Minimum Covariance Determinant estimator is a robust,
+The Minimum Covariance Determinant estimator (MCD) is a robust,
 high-breakdown point (i.e. it can be used to estimate the covariance
 matrix of highly contaminated datasets, up to
 :math:`\frac{n_\text{samples}-n_\text{features}-1}{2}` outliers)
-estimator of covariance. The idea is to find
+estimator of covariance. The idea behind the MCD is to find
 :math:`\frac{n_\text{samples}+n_\text{features}+1}{2}`
 observations whose empirical covariance has the smallest determinant,
 yielding a "pure" subset of observations from which to compute
-standards estimates of location and covariance.
-
-The Minimum Covariance Determinant estimator (MCD) has been introduced
-by P.J.Rousseuw in [1].
+standard estimates of location and covariance. The MCD was introduced by
+P. J. Rousseeuw in [1]_.
 
 This example illustrates how the Mahalanobis distances are affected by
-outlying data: observations drawn from a contaminating distribution
+outlying data. Observations drawn from a contaminating distribution
 are not distinguishable from the observations coming from the real,
-Gaussian distribution that one may want to work with. Using MCD-based
+Gaussian distribution when using standard covariance MLE based Mahalanobis
+distances. Using MCD-based
 Mahalanobis distances, the two populations become
-distinguishable. Associated applications are outliers detection,
-observations ranking, clustering, ...
-For visualization purpose, the cubic root of the Mahalanobis distances
-are represented in the boxplot, as Wilson and Hilferty suggest [2]
+distinguishable. Associated applications include outlier detection,
+observation ranking and clustering.
+
+.. note::
+
+    See also :ref:`sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`
 
-[1] P. J. Rousseeuw. Least median of squares regression. J. Am
-    Stat Ass, 79:871, 1984.
-[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.
-    Proceedings of the National Academy of Sciences of the United States
-    of America, 17, 684-688.
+.. topic:: References:
 
-"""
-print(__doc__)
+    .. [1] P. J. Rousseeuw. `Least median of squares regression
+        <http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf>`_. J. Am
+        Stat Ass, 79:871, 1984.
+    .. [2] Wilson, E. B., & Hilferty, M. M. (1931). `The distribution of chi-square.
+        <https://water.usgs.gov/osw/bulletin17b/Wilson_Hilferty_1931.pdf>`_
+        Proceedings of the National Academy of Sciences of the United States
+        of America, 17, 684-688.
+
+"""  # noqa: E501
+
+# %%
+# Generate data
+# --------------
+#
+# First, we generate a dataset of 125 samples and 2 features. Both features
+# are Gaussian distributed with mean of 0 but feature 1 has a standard
+# deviation equal to 2 and feature 2 has a standard deviation equal to 1. Next,
+# 25 samples are replaced with Gaussian outlier samples where feature 1 has
+# a standard deviation equal to 1 and feature 2 has a standard deviation equal
+# to 7.
 
 import numpy as np
-import matplotlib.pyplot as plt
 
-from sklearn.covariance import EmpiricalCovariance, MinCovDet
+# for consistent results
+np.random.seed(7)
 
 n_samples = 125
 n_outliers = 25
 n_features = 2
 
-# generate data
+# generate Gaussian data of shape (125, 2)
 gen_cov = np.eye(n_features)
 gen_cov[0, 0] = 2.
 X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
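
The two MCD bounds quoted in the revised docstring are easy to sanity-check against this example's data; a minimal sketch, with variable names chosen here for illustration:

# With n_samples=125 and n_features=2, the MCD tolerates up to 61
# outliers and fits on a "pure" subset of 64 observations --
# comfortably above the 25 outliers injected by this example.
n_samples, n_features = 125, 2
max_outliers = (n_samples - n_features - 1) // 2   # -> 61
support_size = (n_samples + n_features + 1) // 2   # -> 64
print(max_outliers, support_size)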
@@ -72,73 +91,103 @@
 outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
 X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
 
-# fit a Minimum Covariance Determinant (MCD) robust estimator to data
-robust_cov = MinCovDet().fit(X)
+# %%
+# Comparison of results
+# ---------------------
+#
+# Below, we fit MCD and MLE based covariance estimators to our data and print
+# the estimated covariance matrices. Note that the estimated variance of
+# feature 2 is much higher with the MLE based estimator (7.5) than
+# that of the MCD robust estimator (1.2). This shows that the MCD based
+# robust estimator is much more resistant to the outlier samples, which were
+# designed to have a much larger variance in feature 2.
 
-# compare estimators learnt from the full data set with true parameters
-emp_cov = EmpiricalCovariance().fit(X)
+import matplotlib.pyplot as plt
+from sklearn.covariance import EmpiricalCovariance, MinCovDet
 
-# #############################################################################
-# Display results
-fig = plt.figure()
-plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)
-
-# Show data set
-subfig1 = plt.subplot(3, 1, 1)
-inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
-                              color='black', label='inliers')
-outlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
-                               color='red', label='outliers')
-subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
-subfig1.set_title("Mahalanobis distances of a contaminated data set:")
-
-# Show contours of the distance functions
+# fit an MCD robust estimator to data
+robust_cov = MinCovDet().fit(X)
+# fit an MLE estimator to data
+emp_cov = EmpiricalCovariance().fit(X)
+print('Estimated covariance matrix:\n'
+      'MCD (Robust):\n{}\n'
+      'MLE:\n{}'.format(robust_cov.covariance_, emp_cov.covariance_))
+
+# %%
+# To better visualize the difference, we plot contours of the
+# Mahalanobis distances calculated by both methods. Notice that the robust
+# MCD based Mahalanobis distances fit the inlier black points much better,
+# whereas the MLE based distances are more influenced by the outlier
+# red points.
+
+fig, ax = plt.subplots(figsize=(10, 5))
+# Plot data set
+inlier_plot = ax.scatter(X[:, 0], X[:, 1],
+                         color='black', label='inliers')
+outlier_plot = ax.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
+                          color='red', label='outliers')
+ax.set_xlim(ax.get_xlim()[0], 10.)
+ax.set_title("Mahalanobis distances of a contaminated data set")
+
+# Create meshgrid of feature 1 and feature 2 values
 xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                      np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
 zz = np.c_[xx.ravel(), yy.ravel()]
-
+# Calculate the MLE based Mahalanobis distances of the meshgrid
 mahal_emp_cov = emp_cov.mahalanobis(zz)
 mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
-emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
-                                  cmap=plt.cm.PuBu_r,
-                                  linestyles='dashed')
-
+emp_cov_contour = plt.contour(xx, yy, np.sqrt(mahal_emp_cov),
+                              cmap=plt.cm.PuBu_r, linestyles='dashed')
+# Calculate the MCD based Mahalanobis distances
 mahal_robust_cov = robust_cov.mahalanobis(zz)
 mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
-robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
-                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')
+robust_contour = ax.contour(xx, yy, np.sqrt(mahal_robust_cov),
+                            cmap=plt.cm.YlOrBr_r, linestyles='dotted')
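
One detail worth noting in the contour code above: the mahalanobis method of scikit-learn's covariance estimators returns squared distances, which is why both contour sets are drawn over np.sqrt(...). A minimal check against the docstring's formula, reusing the fitted emp_cov and the grid zz from this example:

# Squared Mahalanobis distances by hand, from the fitted location and
# precision (inverse covariance); should match the estimator's output.
diff = zz - emp_cov.location_
d2_by_hand = np.einsum('ij,jk,ik->i', diff, emp_cov.get_precision(), diff)
assert np.allclose(d2_by_hand, emp_cov.mahalanobis(zz))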
110146

111-
subfig1.legend([emp_cov_contour.collections[1],robust_contour.collections[1],
112-
inlier_plot,outlier_plot],
113-
['MLE dist','robust dist','inliers','outliers'],
114-
loc="upper right",borderaxespad=0)
115-
plt.xticks(())
116-
plt.yticks(())
147+
# Add legend
148+
ax.legend([emp_cov_contour.collections[1],robust_contour.collections[1],
149+
inlier_plot,outlier_plot],
150+
['MLE dist','MCD dist','inliers','outliers'],
151+
loc="upper right",borderaxespad=0)
117152

118-
# Plot the scores for each point
119-
emp_mahal=emp_cov.mahalanobis(X-np.mean(X,0))** (0.33)
120-
subfig2=plt.subplot(2,2,3)
121-
subfig2.boxplot([emp_mahal[:-n_outliers],emp_mahal[-n_outliers:]],widths=.25)
122-
subfig2.plot(np.full(n_samples-n_outliers,1.26),
123-
emp_mahal[:-n_outliers],'+k',markeredgewidth=1)
124-
subfig2.plot(np.full(n_outliers,2.26),
125-
emp_mahal[-n_outliers:],'+k',markeredgewidth=1)
126-
subfig2.axes.set_xticklabels(('inliers','outliers'),size=15)
127-
subfig2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
128-
subfig2.set_title("1. from non-robust estimates\n(Maximum Likelihood)")
129-
plt.yticks(())
153+
plt.show()
154+
155+
# %%
156+
# Finally, we highlight the ability of MCD based Mahalanobis distances to
157+
# distinguish outliers. We take the cubic root of the Mahalanobis distances,
158+
# yielding approximately normal distributions (as suggested by Wilson and
159+
# Hilferty [2]_), then plot the values of inlier and outlier samples with
160+
# boxplots. The distribution of outlier samples is more separated from the
161+
# distribution of inlier samples for robust MCD based Mahalanobis distances.
130162

163+
fig, (ax1,ax2)=plt.subplots(1,2)
164+
plt.subplots_adjust(wspace=.6)
165+
166+
# Calculate cubic root of MLE Mahalanobis distances for samples
167+
emp_mahal=emp_cov.mahalanobis(X-np.mean(X,0))** (0.33)
168+
# Plot boxplots
169+
ax1.boxplot([emp_mahal[:-n_outliers],emp_mahal[-n_outliers:]],widths=.25)
170+
# Plot individual samples
171+
ax1.plot(np.full(n_samples-n_outliers,1.26),emp_mahal[:-n_outliers],
172+
'+k',markeredgewidth=1)
173+
ax1.plot(np.full(n_outliers,2.26),emp_mahal[-n_outliers:],
174+
'+k',markeredgewidth=1)
175+
ax1.axes.set_xticklabels(('inliers','outliers'),size=15)
176+
ax1.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
177+
ax1.set_title("Using non-robust estimates\n(Maximum Likelihood)")
178+
179+
# Calculate cubic root of MCD Mahalanobis distances for samples
131180
robust_mahal=robust_cov.mahalanobis(X-robust_cov.location_)** (0.33)
132-
subfig3=plt.subplot(2,2,4)
133-
subfig3.boxplot([robust_mahal[:-n_outliers],robust_mahal[-n_outliers:]],
134-
widths=.25)
135-
subfig3.plot(np.full(n_samples-n_outliers,1.26),
136-
robust_mahal[:-n_outliers],'+k',markeredgewidth=1)
137-
subfig3.plot(np.full(n_outliers,2.26),
138-
robust_mahal[-n_outliers:],'+k',markeredgewidth=1)
139-
subfig3.axes.set_xticklabels(('inliers','outliers'),size=15)
140-
subfig3.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
141-
subfig3.set_title("2. from robust estimates\n(Minimum Covariance Determinant)")
142-
plt.yticks(())
181+
# Plot boxplots
182+
ax2.boxplot([robust_mahal[:-n_outliers],robust_mahal[-n_outliers:]],
183+
widths=.25)
184+
# Plot individual samples
185+
ax2.plot(np.full(n_samples-n_outliers,1.26),robust_mahal[:-n_outliers],
186+
'+k',markeredgewidth=1)
187+
ax2.plot(np.full(n_outliers,2.26),robust_mahal[-n_outliers:],
188+
'+k',markeredgewidth=1)
189+
ax2.axes.set_xticklabels(('inliers','outliers'),size=15)
190+
ax2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
191+
ax2.set_title("Using robust estimates\n(Minimum Covariance Determinant)")
143192

144193
plt.show()
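
The cube-root transform in the boxplot code follows Wilson and Hilferty [2]_: for Gaussian data, squared Mahalanobis distances are chi-square distributed (here with 2 degrees of freedom), and the cube root of a chi-square variable is approximately normal. A quick illustration of that claim, assuming scipy is available for the skewness statistic:

# Skewness before and after the Wilson-Hilferty cube-root transform.
import numpy as np
from scipy import stats

d2 = np.random.default_rng(0).chisquare(df=2, size=100_000)
print(stats.skew(d2))           # ~2: strongly right-skewed
print(stats.skew(np.cbrt(d2)))  # ~0.2: approximately normal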

