
Commit c4a897f

Pushing the docs to dev/ for branch: master, commit 62e9bb8deff5bb96606d766a17ca88331cb0756a

1 parent 1d5b525 · commit c4a897f

File tree: 1,034 files changed, +4,977 −3,168 lines

(Two binary files changed, 5.46 KB and 4.32 KB; contents not shown.)

dev/_downloads/plot_all_scaling.ipynb

Lines changed: 22 additions & 4 deletions
@@ -15,7 +15,7 @@
 "cell_type":"markdown",
 "metadata": {},
 "source": [
-"\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides a non-linear transformation in which distances\nbetween marginal outliers and inliers are shrunk.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n"
+"\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics lead to difficulties to visualize the data and, more\nimportantly, they can degrade the predictive performance of many machine\nlearning algorithms. Unscaled data can also slow down or even prevent the\nconvergence of many gradient-based estimators.\n\nIndeed many estimators are designed with the assumption that each feature takes\nvalues close to zero or more importantly that all features vary on comparable\nscales. In particular, metric-based and gradient-based estimators often assume\napproximately standardized data (centered features with unit variances). A\nnotable exception are decision tree-based estimators that are robust to\narbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring the\ndata within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers and differ from each\nother in the way to estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides non-linear transformations in which distances\nbetween marginal outliers and inliers are shrunk. ``PowerTransformer`` provides\nnon-linear transformations in which data is mapped to a normal distribution to\nstabilize variance and minimize skewness.\n\nUnlike the previous transformations, normalization refers to a per sample\ntransformation instead of a per feature transformation.\n\nThe following code is a bit verbose, feel free to jump directly to the analysis\nof the results_.\n\n\n"
 ]
 },
 {
@@ -26,7 +26,7 @@
 },
 "outputs": [],
 "source": [
-"# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing.data import QuantileTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X))\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')"
+"# Author: Raghav RV <rvraghav93@gmail.com>\n# Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing import QuantileTransformer\nfrom sklearn.preprocessing import PowerTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature of 0 has a long tail distribution.\n# Feature 5 has a few but very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n ('Unscaled data', X),\n ('Data after standard scaling',\n StandardScaler().fit_transform(X)),\n ('Data after min-max scaling',\n MinMaxScaler().fit_transform(X)),\n ('Data after max-abs scaling',\n MaxAbsScaler().fit_transform(X)),\n ('Data after robust scaling',\n RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n ('Data after power transformation (Box-Cox)',\n PowerTransformer(method='box-cox').fit_transform(X)),\n ('Data after quantile transformation (gaussian pdf)',\n QuantileTransformer(output_distribution='normal')\n .fit_transform(X)),\n ('Data after quantile transformation (uniform pdf)',\n QuantileTransformer(output_distribution='uniform')\n .fit_transform(X)),\n ('Data after sample-wise L2 normalizing',\n Normalizer().fit_transform(X)),\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n fig = plt.figure(figsize=figsize)\n fig.suptitle(title)\n\n # define the axis for the first plot\n left, width = 0.1, 0.22\n bottom, height = 0.1, 0.7\n bottom_h = height + 0.15\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter = plt.axes(rect_scatter)\n ax_histx = plt.axes(rect_histx)\n ax_histy = plt.axes(rect_histy)\n\n # define the axis for the zoomed-in plot\n left = width + left + 0.2\n left_h = left + width + 0.02\n\n rect_scatter = [left, bottom, width, height]\n rect_histx = [left, bottom_h, width, 0.1]\n rect_histy = [left_h, bottom, 0.05, height]\n\n ax_scatter_zoom = plt.axes(rect_scatter)\n ax_histx_zoom = plt.axes(rect_histx)\n ax_histy_zoom = plt.axes(rect_histy)\n\n # define the axis for the colorbar\n left, width = width + left + 0.13, 0.01\n\n rect_colorbar = [left, bottom, width, height]\n ax_colorbar = plt.axes(rect_colorbar)\n\n return ((ax_scatter, ax_histy, ax_histx),\n (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n x0_label=\"\", x1_label=\"\"):\n ax, hist_X1, hist_X0 = axes\n\n ax.set_title(title)\n ax.set_xlabel(x0_label)\n ax.set_ylabel(x1_label)\n\n # The scatter plot\n colors = cm.plasma_r(y)\n ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n # Removing the top and the right spine for aesthetics\n # make nice axis layout\n ax.spines['top'].set_visible(False)\n ax.spines['right'].set_visible(False)\n ax.get_xaxis().tick_bottom()\n ax.get_yaxis().tick_left()\n ax.spines['left'].set_position(('outward', 10))\n ax.spines['bottom'].set_position(('outward', 10))\n\n # Histogram for axis X1 (feature 5)\n hist_X1.set_ylim(ax.get_ylim())\n hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n color='grey', ec='grey')\n hist_X1.axis('off')\n\n # Histogram for axis X0 (feature 0)\n hist_X0.set_xlim(ax.get_xlim())\n hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n color='grey', ec='grey')\n hist_X0.axis('off')"
 ]
 },
 {
@@ -141,7 +141,7 @@
 "cell_type":"markdown",
 "metadata": {},
 "source": [
-"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies a non-linear transformation such that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped in the range [0, 1],\neven the outliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n"
+"PowerTransformer (Box-Cox)\n--------------------------\n\n``PowerTransformer`` applies a power transformation to each\nfeature to make the data more Gaussian-like. Currently,\n``PowerTransformer`` implements the Box-Cox transform. It differs from\nQuantileTransformer (Gaussian output) in that it does not map the\ndata to a zero-mean, unit-variance Gaussian distribution. Instead, Box-Cox\nfinds the optimal scaling factor to stabilize variance and minimize skewness\nthrough maximum likelihood estimation. Note that Box-Cox can only be applied\nto positive, non-zero data. Income and number of households happen to be\nstrictly positive, but if negative values are present, a constant can be\nadded to each feature to shift it into the positive range - this is known as\nthe two-parameter Box-Cox transform.\n\n"
 ]
 },
 {
@@ -173,6 +173,24 @@
 "make_plot(6)"
 ]
 },
+{
+"cell_type":"markdown",
+"metadata": {},
+"source": [
+"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies a non-linear transformation such that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data will be mapped in the range [0, 1],\neven the outliers which cannot be distinguished anymore from the inliers.\n\nAs ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting them to the a priori defined range boundaries (0 and\n1).\n\n"
+]
+},
+{
+"cell_type":"code",
+"execution_count":null,
+"metadata": {
+"collapsed":false
+},
+"outputs": [],
+"source": [
+"make_plot(7)"
+]
+},
 {
 "cell_type":"markdown",
 "metadata": {},
@@ -188,7 +206,7 @@
 },
 "outputs": [],
 "source": [
-"make_plot(7)\nplt.show()"
+"make_plot(8)\n\nplt.show()"
 ]
 }
 ],

dev/_downloads/plot_all_scaling.py

Lines changed: 44 additions & 20 deletions
@@ -29,8 +29,10 @@
 other in the way to estimate the parameters used to shift and scale each
 feature.
 
-``QuantileTransformer`` provides a non-linear transformation in which distances
-between marginal outliers and inliers are shrunk.
+``QuantileTransformer`` provides non-linear transformations in which distances
+between marginal outliers and inliers are shrunk. ``PowerTransformer`` provides
+non-linear transformations in which data is mapped to a normal distribution to
+stabilize variance and minimize skewness.
 
 Unlike the previous transformations, normalization refers to a per sample
 transformation instead of a per feature transformation.
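Editor's note: the contrast drawn in this hunk is easy to check by hand. Below is a minimal sketch, not part of the commit, that applies both transformers to a tiny synthetic feature with one large outlier; it assumes a scikit-learn version with ``PowerTransformer`` (0.20 or later), and the array values are invented for illustration.

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Five samples of a single feature; the last value is a deliberate outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# QuantileTransformer shrinks the distance between the outlier and the inliers.
qt = QuantileTransformer(n_quantiles=5, output_distribution='uniform')
print(qt.fit_transform(X).ravel())  # the outlier collapses onto the boundary 1.0

# PowerTransformer maps the data toward a normal distribution (Box-Cox needs
# strictly positive input, which holds for this toy array).
pt = PowerTransformer(method='box-cox')
print(pt.fit_transform(X).ravel())  # skewness is reduced, ordering preserved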
@@ -59,7 +61,8 @@
 from sklearn.preprocessing import StandardScaler
 from sklearn.preprocessing import RobustScaler
 from sklearn.preprocessing import Normalizer
-from sklearn.preprocessing.data import QuantileTransformer
+from sklearn.preprocessing import QuantileTransformer
+from sklearn.preprocessing import PowerTransformer
 
 from sklearn.datasets import fetch_california_housing
 
@@ -84,14 +87,16 @@
         MaxAbsScaler().fit_transform(X)),
     ('Data after robust scaling',
         RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
-    ('Data after quantile transformation (uniform pdf)',
-        QuantileTransformer(output_distribution='uniform')
-        .fit_transform(X)),
+    ('Data after power transformation (Box-Cox)',
+        PowerTransformer(method='box-cox').fit_transform(X)),
     ('Data after quantile transformation (gaussian pdf)',
         QuantileTransformer(output_distribution='normal')
         .fit_transform(X)),
+    ('Data after quantile transformation (uniform pdf)',
+        QuantileTransformer(output_distribution='uniform')
+        .fit_transform(X)),
     ('Data after sample-wise L2 normalizing',
-        Normalizer().fit_transform(X))
+        Normalizer().fit_transform(X)),
 ]
 
 # scale the output between 0 and 1 for the colorbar
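Editor's note: a hedged smoke test, not from the commit, for a distributions list shaped like the one above. It runs on random strictly positive toy data instead of the housing features, lowers ``n_quantiles`` to match the small sample size, and simply confirms that every entry fits while reporting each method's output range.

import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   PowerTransformer, QuantileTransformer,
                                   RobustScaler, StandardScaler)

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))  # strictly positive, long-tailed toy data

distributions = [
    ('standard scaling', StandardScaler().fit_transform(X)),
    ('min-max scaling', MinMaxScaler().fit_transform(X)),
    ('max-abs scaling', MaxAbsScaler().fit_transform(X)),
    ('robust scaling', RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
    ('power transformation (Box-Cox)',
     PowerTransformer(method='box-cox').fit_transform(X)),
    ('quantile transformation (gaussian pdf)',
     QuantileTransformer(n_quantiles=100,
                         output_distribution='normal').fit_transform(X)),
    ('quantile transformation (uniform pdf)',
     QuantileTransformer(n_quantiles=100,
                         output_distribution='uniform').fit_transform(X)),
    ('sample-wise L2 normalizing', Normalizer().fit_transform(X)),
]

for name, T in distributions:
    print('%-40s min=%8.3f max=%8.3f' % (name, T.min(), T.max()))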
@@ -286,6 +291,35 @@ def make_plot(item_idx):
 
 make_plot(4)
 
+##############################################################################
+# PowerTransformer (Box-Cox)
+# --------------------------
+#
+# ``PowerTransformer`` applies a power transformation to each
+# feature to make the data more Gaussian-like. Currently,
+# ``PowerTransformer`` implements the Box-Cox transform. It differs from
+# QuantileTransformer (Gaussian output) in that it does not map the
+# data to a zero-mean, unit-variance Gaussian distribution. Instead, Box-Cox
+# finds the optimal scaling factor to stabilize variance and minimize skewness
+# through maximum likelihood estimation. Note that Box-Cox can only be applied
+# to positive, non-zero data. Income and number of households happen to be
+# strictly positive, but if negative values are present, a constant can be
+# added to each feature to shift it into the positive range - this is known as
+# the two-parameter Box-Cox transform.
+
+make_plot(5)
+
+##############################################################################
+# QuantileTransformer (Gaussian output)
+# -------------------------------------
+#
+# ``QuantileTransformer`` has an additional ``output_distribution`` parameter
+# allowing to match a Gaussian distribution instead of a uniform distribution.
+# Note that this non-parametric transformer introduces saturation artifacts
+# for extreme values.
+
+make_plot(6)
+
 ###################################################################
 # QuantileTransformer (uniform output)
 # ------------------------------------
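Editor's note: the "shift into the positive range" trick mentioned at the end of the Box-Cox comment block can be sketched as below. This is not part of the commit; the helper name ``shifted_box_cox`` is invented here, and the offset is estimated from the training data itself.

import numpy as np
from sklearn.preprocessing import PowerTransformer

def shifted_box_cox(X, eps=1e-6):
    # Shift each column so its minimum lands slightly above zero, then apply
    # the ordinary one-parameter Box-Cox transform to the shifted data
    # (the "two-parameter" variant described in the comments above).
    shift = np.maximum(0.0, -X.min(axis=0)) + eps
    return PowerTransformer(method='box-cox').fit_transform(X + shift)

# Column 0 contains non-positive values, so plain Box-Cox would raise an error.
X = np.array([[-3.0, 10.0], [0.0, 20.0], [2.0, 40.0], [5.0, 80.0]])
print(shifted_box_cox(X))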
@@ -302,18 +336,7 @@ def make_plot(item_idx):
 # any outlier by setting them to the a priori defined range boundaries (0 and
 # 1).
 
-make_plot(5)
-
-##############################################################################
-# QuantileTransformer (Gaussian output)
-# -------------------------------------
-#
-# ``QuantileTransformer`` has an additional ``output_distribution`` parameter
-# allowing to match a Gaussian distribution instead of a uniform distribution.
-# Note that this non-parametric transformer introduces saturation artifacts
-# for extreme values.
-
-make_plot(6)
+make_plot(7)
 
 ##############################################################################
 # Normalizer
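Editor's note: the boundary-collapsing behaviour described in the uniform-output context above can be seen directly on held-out data. A quick sketch with invented toy values, not part of the commit:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
qt = QuantileTransformer(n_quantiles=5, output_distribution='uniform')
qt.fit(X_train)

# Held-out values beyond the training range are collapsed onto the a priori
# defined boundaries 0 and 1; in-range values are interpolated.
X_new = np.array([[-100.0], [2.5], [1000.0]])
print(qt.transform(X_new).ravel())  # approximately [0.0, 0.375, 1.0]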
@@ -326,5 +349,6 @@ def make_plot(item_idx):
 # transformed data only lie in the positive quadrant. This would not be the
 # case if some original features had a mix of positive and negative values.
 
-make_plot(7)
+make_plot(8)
+
 plt.show()
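Editor's note: the per-sample behaviour of ``Normalizer`` referenced in the last hunk's context, sketched on toy rows (not from the commit) to show that every sample ends up with unit L2 norm; since these rows are non-negative, the result lies in the positive quadrant, as the comment above notes.

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)

print(X_norm)                          # rows: [0.6 0.8], [1. 0.], [0.6 0.8]
print(np.linalg.norm(X_norm, axis=1))  # every sample now has unit L2 norm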
