|
15 | 15 | "cell_type":"markdown",
|
16 | 16 | "metadata": {},
|
17 | 17 | "source": [
|
18 |
| - "\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides a non-linear transformation in which distances\nbetween marginal outliers and inliers are shrunk.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n" |
| 18 | + "\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides non-linear transformations in which distances\nbetween marginal outliers and inliers are shrunk. ``PowerTransformer`` provides\nnon-linear transformations in which data is mapped to a normal distribution to\nstabilize variance and minimize skewness.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n" |
19 | 19 | ]
|
20 | 20 | },
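The distinction drawn in that cell is easy to check numerically. The following is a minimal sketch, not part of the notebook, assuming a scikit-learn version that ships ``PowerTransformer`` (0.20+) plus SciPy: an affine scaler leaves the skew of a feature untouched, while both non-linear transformers push it toward zero.

    import numpy as np
    from scipy.stats import skew
    from sklearn.preprocessing import (PowerTransformer, QuantileTransformer,
                                       StandardScaler)

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(1000, 1))  # strictly positive, heavily right-skewed

    # Affine rescaling: the shape of the marginal distribution is unchanged.
    X_std = StandardScaler().fit_transform(X)
    # Non-linear maps: both reshape the marginal toward a Gaussian.
    X_qt = QuantileTransformer(output_distribution='normal',
                               n_quantiles=1000).fit_transform(X)
    X_pt = PowerTransformer(method='box-cox').fit_transform(X)

    print(skew(X_std), skew(X_qt), skew(X_pt))  # large, then ~0 and ~0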
|
21 | 21 | {
|
|
26 | 26 | },
|
27 | 27 | "outputs": [],
|
28 | 28 | "source": [
|
29 |
| - "# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing.data import QuantileTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X))\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 
10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')" |
| 29 | + "# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing import QuantileTransformer\nfrom sklearn.preprocessing import PowerTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after power transformation (Box-Cox)',\n PowerTransformer(method='box-cox').fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X)),\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n 
ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')" |
30 | 30 | ]
|
31 | 31 | },
|
32 | 32 | {
|
|
141 | 141 | "cell_type":"markdown",
|
142 | 142 | "metadata": {},
|
143 | 143 | "source": [
|
144 |
| -"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies anon-linear transformationsuch that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped inthe range [0, 1],\neven theoutliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n" |
| 144 | +"PowerTransformer (Box-Cox)\n--------------------------\n\n``PowerTransformer`` applies apower transformationto each\nfeature to make the data more Gaussian-like. Currently,\n``PowerTransformer`` implements the Box-Cox transform. It differs from\nQuantileTransformer (Gaussian output) in that it does not mapthe\ndata to a zero-mean, unit-variance Gaussian distribution. Instead, Box-Cox\nfinds theoptimal scaling factor to stabilize variance and mimimize skewness\nthrough maximum likelihood estimation. Note that Box-Cox can only be applied\nto positive, non-zero data. Income and number of households happen to be\nstrictly positive, but if negative values are present, a constant can be\nadded to each feature to shift it into the positive range - this is known as\nthe two-parameter Box-Cox transform.\n\n" |
145 | 145 | ]
|
146 | 146 | },
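The two-parameter shift mentioned at the end of that cell is not something ``PowerTransformer`` performs automatically. Here is a minimal sketch of applying it by hand; the toy data and the choice of making the minimum equal to 1.0 are illustrative assumptions, not part of the example.

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    X = np.array([[-3.0], [0.0], [1.5], [8.0]])  # contains non-positive values

    # Box-Cox alone would raise a ValueError here; shift the feature so its
    # minimum becomes 1.0 (the second parameter of two-parameter Box-Cox).
    shift = 1.0 - X.min()
    X_bc = PowerTransformer(method='box-cox').fit_transform(X + shift)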
|
147 | 147 | {
|
|
173 | 173 | "make_plot(6)"
|
174 | 174 | ]
|
175 | 175 | },
|
| 176 | + { |
| 177 | +"cell_type":"markdown", |
| 178 | +"metadata": {}, |
| 179 | +"source": [ |
| 180 | +"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies a non-linear transformation such that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped in the range [0, 1],\neven the outliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n" |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | +"cell_type":"code", |
| 185 | +"execution_count":null, |
| 186 | +"metadata": { |
| 187 | +"collapsed":false |
| 188 | + }, |
| 189 | +"outputs": [], |
| 190 | +"source": [ |
| 191 | +"make_plot(7)" |
| 192 | + ] |
| 193 | + }, |
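A minimal sketch of the boundary-collapsing behaviour described in that cell, using synthetic training data chosen only for illustration: held-out values far outside the training range are mapped onto the edges of [0, 1].

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.RandomState(0)
    qt = QuantileTransformer(output_distribution='uniform',
                             n_quantiles=1000).fit(rng.normal(size=(1000, 1)))

    # Extreme unseen values collapse onto the range boundaries.
    print(qt.transform([[-100.0], [0.0], [100.0]]))  # approx [[0.], [0.5], [1.]]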
176 | 194 | {
|
177 | 195 | "cell_type":"markdown",
|
178 | 196 | "metadata": {},
|
|
188 | 206 | },
|
189 | 207 | "outputs": [],
|
190 | 208 | "source": [
|
191 |
| -"make_plot(7)\nplt.show()" |
| 209 | +"make_plot(8)\n\nplt.show()" |
192 | 210 | ]
|
193 | 211 | }
|
194 | 212 | ],
|
|