|
15 | 15 | "cell_type":"markdown",
|
16 | 16 | "metadata": {},
|
17 | 17 | "source": [
|
18 |
| - "\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides a non-linear transformation in which distances\nbetween marginal outliers and inliers are shrunk.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n" |
| 18 | + "\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides non-linear transformations in which distances\nbetween marginal outliers and inliers are shrunk. ``PowerTransformer`` provides\nnon-linear transformations in which data is mapped to a normal distribution to\nstabilize variance and minimize skewness.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n" |
19 | 19 | ]
|
20 | 20 | },
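The distinction drawn in that cell is easy to check numerically. The following is a minimal sketch, not part of the notebook, assuming a scikit-learn version that ships ``PowerTransformer`` (0.20+) plus SciPy: an affine scaler leaves the skew of a feature untouched, while both non-linear transformers push it toward zero.

    import numpy as np
    from scipy.stats import skew
    from sklearn.preprocessing import (PowerTransformer, QuantileTransformer,
                                       StandardScaler)

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(1000, 1))  # strictly positive, heavily right-skewed

    # Affine rescaling: the shape of the marginal distribution is unchanged.
    X_std = StandardScaler().fit_transform(X)
    # Non-linear maps: both reshape the marginal toward a Gaussian.
    X_qt = QuantileTransformer(output_distribution='normal',
                               n_quantiles=1000).fit_transform(X)
    X_pt = PowerTransformer(method='box-cox').fit_transform(X)

    print(skew(X_std), skew(X_qt), skew(X_pt))  # large, then ~0 and ~0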
|
21 | 21 | {
|
|
26 | 26 | },
|
27 | 27 | "outputs": [],
|
28 | 28 | "source": [
|
29 |
| - "# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing.data import QuantileTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X))\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 
10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')" |
| 29 | + "# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing import QuantileTransformer\nfrom sklearn.preprocessing import PowerTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after power transformation (Box-Cox)',\n PowerTransformer(method='box-cox').fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X)),\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n 
ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')" |
30 | 30 | ]
|
31 | 31 | },
|
32 | 32 | {
|
|
141 | 141 | "cell_type":"markdown",
|
142 | 142 | "metadata": {},
|
143 | 143 | "source": [
|
144 |
| -"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies anon-linear transformationsuch that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped inthe range [0, 1],\neven theoutliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n" |
| 144 | +"PowerTransformer (Box-Cox)\n--------------------------\n\n``PowerTransformer`` applies apower transformationto each\nfeature to make the data more Gaussian-like. Currently,\n``PowerTransformer`` implements the Box-Cox transform. It differs from\nQuantileTransformer (Gaussian output) in that it does not mapthe\ndata to a zero-mean, unit-variance Gaussian distribution. Instead, Box-Cox\nfinds theoptimal scaling factor to stabilize variance and mimimize skewness\nthrough maximum likelihood estimation. Note that Box-Cox can only be applied\nto positive, non-zero data. Income and number of households happen to be\nstrictly positive, but if negative values are present, a constant can be\nadded to each feature to shift it into the positive range - this is known as\nthe two-parameter Box-Cox transform.\n\n" |
145 | 145 | ]
|
146 | 146 | },
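The two-parameter shift mentioned at the end of that cell is not something ``PowerTransformer`` performs automatically. Here is a minimal sketch of applying it by hand; the toy data and the choice of making the minimum equal to 1.0 are illustrative assumptions, not part of the example.

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    X = np.array([[-3.0], [0.0], [1.5], [8.0]])  # contains non-positive values

    # Box-Cox alone would raise a ValueError here; shift the feature so its
    # minimum becomes 1.0 (the second parameter of two-parameter Box-Cox).
    shift = 1.0 - X.min()
    X_bc = PowerTransformer(method='box-cox').fit_transform(X + shift)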
|
147 | 147 | {
|
|
173 | 173 | "make_plot(6)"
|
174 | 174 | ]
|
175 | 175 | },
|
| 176 | + { |
| 177 | +"cell_type":"markdown", |
| 178 | +"metadata": {}, |
| 179 | +"source": [ |
| 180 | +"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies a non-linear transformation such that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped in the range [0, 1],\neven the outliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n" |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | +"cell_type":"code", |
| 185 | +"execution_count":null, |
| 186 | +"metadata": { |
| 187 | +"collapsed":false |
| 188 | + }, |
| 189 | +"outputs": [], |
| 190 | +"source": [ |
| 191 | +"make_plot(7)" |
| 192 | + ] |
| 193 | + }, |
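A minimal sketch of the boundary-collapsing behaviour described in that cell, using synthetic training data chosen only for illustration: held-out values far outside the training range are mapped onto the edges of [0, 1].

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.RandomState(0)
    qt = QuantileTransformer(output_distribution='uniform',
                             n_quantiles=1000).fit(rng.normal(size=(1000, 1)))

    # Extreme unseen values collapse onto the range boundaries.
    print(qt.transform([[-100.0], [0.0], [100.0]]))  # approx [[0.], [0.5], [1.]]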
176 | 194 | {
|
177 | 195 | "cell_type":"markdown",
|
178 | 196 | "metadata": {},
|
|
188 | 206 | },
|
189 | 207 | "outputs": [],
|
190 | 208 | "source": [
|
191 |
| -"make_plot(7)\nplt.show()" |
| 209 | +"make_plot(8)\n\nplt.show()" |
192 | 210 | ]
|
193 | 211 | }
|
194 | 212 | ],
|
|