quantile_transform
- sklearn.preprocessing.quantile_transform(X, *, axis=0, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)
Transform features using quantile information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First, an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale, but it renders variables measured at different scales more directly comparable.
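The two-step mapping described above can be sketched with NumPy and SciPy. This is a simplified illustration, not the library's implementation: it ignores ties, sparse input, and subsampling, and the names `references`, `quantiles`, and `eps` as well as the clipping bound are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=1000)  # a single skewed feature

# Step 1: estimate the CDF at a fixed number of landmarks (quantiles).
n_quantiles = 100
references = np.linspace(0, 1, n_quantiles)   # target probabilities
quantiles = np.quantile(x, references)        # feature values at those probabilities

# Interpolating the empirical CDF maps the feature to [0, 1].
u = np.interp(x, quantiles, references)

# Values outside the fitted range are mapped to the bounds of the output.
clamped = np.interp([x.min() - 1.0, x.max() + 1.0], quantiles, references)

# Step 2: apply the quantile function (inverse CDF) of the target distribution.
# Clipping away from 0 and 1 (bound chosen for this sketch) keeps the result finite.
eps = 1e-7
z = stats.norm.ppf(np.clip(u, eps, 1 - eps))
```

Because `np.interp` is monotonic, the rank order of the feature is preserved while its marginal distribution changes.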
Read more in the User Guide.
- Parameters:
- X {array-like, sparse matrix} of shape (n_samples, n_features)
The data to transform.
- axis int, default=0
Axis along which the quantiles are computed and the transformation is applied. If 0, transform each feature, otherwise (if 1) transform each sample.
- n_quantiles int, default=1000 or n_samples
Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples, as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
- output_distribution {‘uniform’, ‘normal’}, default=‘uniform’
Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.
- ignore_implicit_zeros bool, default=False
Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
- subsample int or None, default=1e5
Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices. Disable subsampling by setting subsample=None.
Added in version 1.5: The option None to disable subsampling was added.
- random_state int, RandomState instance or None, default=None
Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
- copy bool, default=True
If False, try to avoid a copy and transform in place. This is not guaranteed to always work in place; e.g. if the data is a numpy array with an int dtype, a copy will be returned even with copy=False.
Changed in version 0.23: The default value of copy changed from False to True in 0.23.
- Returns:
- Xt {ndarray, sparse matrix} of shape (n_samples, n_features)
The transformed data.
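As a hedged illustration of the parameters above, the following sketch transforms the same synthetic data to both supported output distributions. The data and the choice of `n_quantiles=100` are arbitrary assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

rng = np.random.RandomState(0)
X = rng.exponential(size=(500, 2))  # two skewed features, 500 samples

# Map each feature (axis=0) to a uniform distribution on [0, 1].
X_uniform = quantile_transform(
    X, n_quantiles=100, output_distribution="uniform", random_state=0
)

# Map each feature to an approximately standard normal distribution.
X_normal = quantile_transform(
    X, n_quantiles=100, output_distribution="normal", random_state=0
)
```

Keeping `n_quantiles` at or below the number of samples avoids relying on the automatic reduction described for that parameter.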
See also
- QuantileTransformer
Performs quantile-based scaling using the Transformer API (e.g. as part of a preprocessing Pipeline).
- power_transform
Maps data to a normal distribution using a power transformation.
- scale
Performs standardization that is faster, but less robust to outliers.
- robust_scale
Performs robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
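A small sketch of the NaN behavior noted above, using a made-up single-feature array: the missing entry is expected to stay NaN in the output, while the remaining values are transformed.

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

# One feature with a missing value in the third row.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])

Xt = quantile_transform(X, n_quantiles=4, random_state=0)

nan_mask = np.isnan(Xt[:, 0])  # positions that are still missing after the transform
```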
Warning
Risk of data leak
Do not use quantile_transform unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend using QuantileTransformer within a Pipeline in order to prevent most risks of data leaking: pipe = make_pipeline(QuantileTransformer(), LogisticRegression()).
For a comparison of the different scalers, transformers, and normalizers, see: Compare the effect of different scalers on data with outliers.
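The pipeline recommended in the warning above could look like the following minimal sketch. The synthetic dataset, the split, and `n_quantiles=100` are assumptions chosen for illustration. The quantiles are estimated from the training split only, so nothing from the test split influences the fitted transform.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic data, split before any preprocessing is fitted.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The transformer is fitted inside the pipeline, on the training data only.
pipe = make_pipeline(
    QuantileTransformer(n_quantiles=100, random_state=0),
    LogisticRegression(),
)
pipe.fit(X_train, y_train)

# The test data is transformed with the quantiles learned from the training data.
score = pipe.score(X_test, y_test)
```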
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import quantile_transform
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> quantile_transform(X, n_quantiles=10, random_state=0, copy=True)
array([...])
Gallery examples
Effect of transforming the targets in regression model
