
Commit c48fdb3

Pushing the docs to dev/ for branch: master, commit b4db36d337a4ff83f1bcb37c5a8c615d3134d372

1 parent 69acbb1, commit c48fdb3

File tree

1,238 files changed: +4184 / -3859 lines changed

Binary file not shown.

dev/_downloads/a2486a67d0a96c8526fd62fbb80c78ba/plot_mahalanobis_distances.py

Lines changed: 137 additions & 88 deletions
@@ -3,67 +3,86 @@
 Robust covariance estimation and Mahalanobis distances relevance
 ================================================================
 
-An example to show covariance estimation with the Mahalanobis
+This example shows covariance estimation with Mahalanobis
 distances on Gaussian distributed data.
 
 For Gaussian distributed data, the distance of an observation
 :math:`x_i` to the mode of the distribution can be computed using its
-Mahalanobis distance: :math:`d_{(\mu,\Sigma)}(x_i)^2 = (x_i -
-\mu)'\Sigma^{-1}(x_i - \mu)` where :math:`\mu` and :math:`\Sigma` are
-the location and the covariance of the underlying Gaussian
-distribution.
+Mahalanobis distance:
+
+.. math::
+
+    d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)
+
+where :math:`\mu` and :math:`\Sigma` are the location and the covariance of
+the underlying Gaussian distributions.
 
 In practice, :math:`\mu` and :math:`\Sigma` are replaced by some
-estimates. The usual covariance maximum likelihood estimate is very
-sensitive to the presence of outliers in the data set and therefor,
-the corresponding Mahalanobis distances are. One would better have to
+estimates. The standard covariance maximum likelihood estimate (MLE) is very
+sensitive to the presence of outliers in the data set and therefore,
+the downstream Mahalanobis distances also are. It would be better to
 use a robust estimator of covariance to guarantee that the estimation is
-resistant to "erroneous" observations in the data set and that the
-associated Mahalanobis distances accurately reflect the true
-organisation of the observations.
+resistant to "erroneous" observations in the dataset and that the
+calculated Mahalanobis distances accurately reflect the true
+organization of the observations.
 
-The Minimum Covariance Determinant estimator is a robust,
+The Minimum Covariance Determinant estimator (MCD) is a robust,
 high-breakdown point (i.e. it can be used to estimate the covariance
 matrix of highly contaminated datasets, up to
 :math:`\frac{n_\text{samples}-n_\text{features}-1}{2}` outliers)
-estimator of covariance. The idea is to find
+estimator of covariance. The idea behind the MCD is to find
 :math:`\frac{n_\text{samples}+n_\text{features}+1}{2}`
 observations whose empirical covariance has the smallest determinant,
 yielding a "pure" subset of observations from which to compute
-standards estimates of location and covariance.
-
-The Minimum Covariance Determinant estimator (MCD) has been introduced
-by P.J.Rousseuw in [1].
+standard estimates of location and covariance. The MCD was introduced by
+P. J. Rousseeuw in [1]_.
 
 This example illustrates how the Mahalanobis distances are affected by
-outlying data: observations drawn from a contaminating distribution
+outlying data. Observations drawn from a contaminating distribution
 are not distinguishable from the observations coming from the real,
-Gaussian distribution that one may want to work with. Using MCD-based
+Gaussian distribution when using standard covariance MLE based Mahalanobis
+distances. Using MCD-based
 Mahalanobis distances, the two populations become
-distinguishable. Associated applications are outliers detection,
-observations ranking, clustering, ...
-For visualization purpose, the cubic root of the Mahalanobis distances
-are represented in the boxplot, as Wilson and Hilferty suggest [2]
+distinguishable. Associated applications include outlier detection,
+observation ranking and clustering.
+
+.. note::
+
+    See also :ref:`sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`
 
-[1] P. J. Rousseeuw. Least median of squares regression. J. Am
-    Stat Ass, 79:871, 1984.
-[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.
-    Proceedings of the National Academy of Sciences of the United States
-    of America, 17, 684-688.
+.. topic:: References:
 
-"""
-print(__doc__)
+    .. [1] P. J. Rousseeuw. `Least median of squares regression
+        <http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf>`_. J. Am
+        Stat Ass, 79:871, 1984.
+    .. [2] Wilson, E. B., & Hilferty, M. M. (1931). `The distribution of chi-square.
+        <https://water.usgs.gov/osw/bulletin17b/Wilson_Hilferty_1931.pdf>`_
+        Proceedings of the National Academy of Sciences of the United States
+        of America, 17, 684-688.
+
+"""  # noqa: E501
+
+# %%
+# Generate data
+# --------------
+#
+# First, we generate a dataset of 125 samples and 2 features. Both features
+# are Gaussian distributed with mean of 0 but feature 1 has a standard
+# deviation equal to 2 and feature 2 has a standard deviation equal to 1. Next,
+# 25 samples are replaced with Gaussian outlier samples where feature 1 has
+# a standard deviation equal to 1 and feature 2 has a standard deviation equal
+# to 7.
 
 import numpy as np
-import matplotlib.pyplot as plt
 
-from sklearn.covariance import EmpiricalCovariance, MinCovDet
+# for consistent results
+np.random.seed(7)
 
 n_samples = 125
 n_outliers = 25
 n_features = 2
 
-# generate data
+# generate Gaussian data of shape (125, 2)
 gen_cov = np.eye(n_features)
 gen_cov[0, 0] = 2.
 X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
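
The two MCD bounds quoted in the revised docstring are easy to sanity-check against this example's data; a minimal sketch, with variable names chosen here for illustration:

# With n_samples=125 and n_features=2, the MCD tolerates up to 61
# outliers and fits on a "pure" subset of 64 observations --
# comfortably above the 25 outliers injected by this example.
n_samples, n_features = 125, 2
max_outliers = (n_samples - n_features - 1) // 2   # -> 61
support_size = (n_samples + n_features + 1) // 2   # -> 64
print(max_outliers, support_size)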
@@ -72,73 +91,103 @@
 outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
 X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
 
-# fit a Minimum Covariance Determinant (MCD) robust estimator to data
-robust_cov = MinCovDet().fit(X)
+# %%
+# Comparison of results
+# ---------------------
+#
+# Below, we fit MCD and MLE based covariance estimators to our data and print
+# the estimated covariance matrices. Note that the estimated variance of
+# feature 2 is much higher with the MLE based estimator (7.5) than
+# that of the MCD robust estimator (1.2). This shows that the MCD based
+# robust estimator is much more resistant to the outlier samples, which were
+# designed to have a much larger variance in feature 2.
 
-# compare estimators learnt from the full data set with true parameters
-emp_cov = EmpiricalCovariance().fit(X)
+import matplotlib.pyplot as plt
+from sklearn.covariance import EmpiricalCovariance, MinCovDet
 
-# #############################################################################
-# Display results
-fig = plt.figure()
-plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)
-
-# Show data set
-subfig1 = plt.subplot(3, 1, 1)
-inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
-                              color='black', label='inliers')
-outlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
-                               color='red', label='outliers')
-subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
-subfig1.set_title("Mahalanobis distances of a contaminated data set:")
-
-# Show contours of the distance functions
+# fit an MCD robust estimator to data
+robust_cov = MinCovDet().fit(X)
+# fit an MLE estimator to data
+emp_cov = EmpiricalCovariance().fit(X)
+print('Estimated covariance matrix:\n'
+      'MCD (Robust):\n{}\n'
+      'MLE:\n{}'.format(robust_cov.covariance_, emp_cov.covariance_))
+
+# %%
+# To better visualize the difference, we plot contours of the
+# Mahalanobis distances calculated by both methods. Notice that the robust
+# MCD based Mahalanobis distances fit the inlier black points much better,
+# whereas the MLE based distances are more influenced by the outlier
+# red points.
+
+fig, ax = plt.subplots(figsize=(10, 5))
+# Plot data set
+inlier_plot = ax.scatter(X[:, 0], X[:, 1],
+                         color='black', label='inliers')
+outlier_plot = ax.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
+                          color='red', label='outliers')
+ax.set_xlim(ax.get_xlim()[0], 10.)
+ax.set_title("Mahalanobis distances of a contaminated data set")
+
+# Create meshgrid of feature 1 and feature 2 values
 xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                      np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
 zz = np.c_[xx.ravel(), yy.ravel()]
-
+# Calculate the MLE based Mahalanobis distances of the meshgrid
 mahal_emp_cov = emp_cov.mahalanobis(zz)
 mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
-emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
-                                  cmap=plt.cm.PuBu_r,
-                                  linestyles='dashed')
-
+emp_cov_contour = plt.contour(xx, yy, np.sqrt(mahal_emp_cov),
+                              cmap=plt.cm.PuBu_r, linestyles='dashed')
+# Calculate the MCD based Mahalanobis distances
 mahal_robust_cov = robust_cov.mahalanobis(zz)
 mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
-robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
-                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')
+robust_contour = ax.contour(xx, yy, np.sqrt(mahal_robust_cov),
+                            cmap=plt.cm.YlOrBr_r, linestyles='dotted')
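
One detail worth noting in the contour code above: the mahalanobis method of scikit-learn's covariance estimators returns squared distances, which is why both contour sets are drawn over np.sqrt(...). A minimal check against the docstring's formula, reusing the fitted emp_cov and the grid zz from this example:

# Squared Mahalanobis distances by hand, from the fitted location and
# precision (inverse covariance); should match the estimator's output.
diff = zz - emp_cov.location_
d2_by_hand = np.einsum('ij,jk,ik->i', diff, emp_cov.get_precision(), diff)
assert np.allclose(d2_by_hand, emp_cov.mahalanobis(zz))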
110146

111-
subfig1.legend([emp_cov_contour.collections[1],robust_contour.collections[1],
112-
inlier_plot,outlier_plot],
113-
['MLE dist','robust dist','inliers','outliers'],
114-
loc="upper right",borderaxespad=0)
115-
plt.xticks(())
116-
plt.yticks(())
147+
# Add legend
148+
ax.legend([emp_cov_contour.collections[1],robust_contour.collections[1],
149+
inlier_plot,outlier_plot],
150+
['MLE dist','MCD dist','inliers','outliers'],
151+
loc="upper right",borderaxespad=0)
117152

118-
# Plot the scores for each point
119-
emp_mahal=emp_cov.mahalanobis(X-np.mean(X,0))** (0.33)
120-
subfig2=plt.subplot(2,2,3)
121-
subfig2.boxplot([emp_mahal[:-n_outliers],emp_mahal[-n_outliers:]],widths=.25)
122-
subfig2.plot(np.full(n_samples-n_outliers,1.26),
123-
emp_mahal[:-n_outliers],'+k',markeredgewidth=1)
124-
subfig2.plot(np.full(n_outliers,2.26),
125-
emp_mahal[-n_outliers:],'+k',markeredgewidth=1)
126-
subfig2.axes.set_xticklabels(('inliers','outliers'),size=15)
127-
subfig2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
128-
subfig2.set_title("1. from non-robust estimates\n(Maximum Likelihood)")
129-
plt.yticks(())
153+
plt.show()
154+
155+
# %%
156+
# Finally, we highlight the ability of MCD based Mahalanobis distances to
157+
# distinguish outliers. We take the cubic root of the Mahalanobis distances,
158+
# yielding approximately normal distributions (as suggested by Wilson and
159+
# Hilferty [2]_), then plot the values of inlier and outlier samples with
160+
# boxplots. The distribution of outlier samples is more separated from the
161+
# distribution of inlier samples for robust MCD based Mahalanobis distances.
130162

163+
fig, (ax1,ax2)=plt.subplots(1,2)
164+
plt.subplots_adjust(wspace=.6)
165+
166+
# Calculate cubic root of MLE Mahalanobis distances for samples
167+
emp_mahal=emp_cov.mahalanobis(X-np.mean(X,0))** (0.33)
168+
# Plot boxplots
169+
ax1.boxplot([emp_mahal[:-n_outliers],emp_mahal[-n_outliers:]],widths=.25)
170+
# Plot individual samples
171+
ax1.plot(np.full(n_samples-n_outliers,1.26),emp_mahal[:-n_outliers],
172+
'+k',markeredgewidth=1)
173+
ax1.plot(np.full(n_outliers,2.26),emp_mahal[-n_outliers:],
174+
'+k',markeredgewidth=1)
175+
ax1.axes.set_xticklabels(('inliers','outliers'),size=15)
176+
ax1.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
177+
ax1.set_title("Using non-robust estimates\n(Maximum Likelihood)")
178+
179+
# Calculate cubic root of MCD Mahalanobis distances for samples
131180
robust_mahal=robust_cov.mahalanobis(X-robust_cov.location_)** (0.33)
132-
subfig3=plt.subplot(2,2,4)
133-
subfig3.boxplot([robust_mahal[:-n_outliers],robust_mahal[-n_outliers:]],
134-
widths=.25)
135-
subfig3.plot(np.full(n_samples-n_outliers,1.26),
136-
robust_mahal[:-n_outliers],'+k',markeredgewidth=1)
137-
subfig3.plot(np.full(n_outliers,2.26),
138-
robust_mahal[-n_outliers:],'+k',markeredgewidth=1)
139-
subfig3.axes.set_xticklabels(('inliers','outliers'),size=15)
140-
subfig3.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
141-
subfig3.set_title("2. from robust estimates\n(Minimum Covariance Determinant)")
142-
plt.yticks(())
181+
# Plot boxplots
182+
ax2.boxplot([robust_mahal[:-n_outliers],robust_mahal[-n_outliers:]],
183+
widths=.25)
184+
# Plot individual samples
185+
ax2.plot(np.full(n_samples-n_outliers,1.26),robust_mahal[:-n_outliers],
186+
'+k',markeredgewidth=1)
187+
ax2.plot(np.full(n_outliers,2.26),robust_mahal[-n_outliers:],
188+
'+k',markeredgewidth=1)
189+
ax2.axes.set_xticklabels(('inliers','outliers'),size=15)
190+
ax2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$",size=16)
191+
ax2.set_title("Using robust estimates\n(Minimum Covariance Determinant)")
143192

144193
plt.show()
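
The cube-root transform in the boxplot code follows Wilson and Hilferty [2]_: for Gaussian data, squared Mahalanobis distances are chi-square distributed (here with 2 degrees of freedom), and the cube root of a chi-square variable is approximately normal. A quick illustration of that claim, assuming scipy is available for the skewness statistic:

# Skewness before and after the Wilson-Hilferty cube-root transform.
import numpy as np
from scipy import stats

d2 = np.random.default_rng(0).chisquare(df=2, size=100_000)
print(stats.skew(d2))           # ~2: strongly right-skewed
print(stats.skew(np.cbrt(d2)))  # ~0.2: approximately normal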

