f_regression#

sklearn.feature_selection.f_regression(X, y, *, center=True, force_finite=True)

Univariate linear regression tests returning F-statistic and p-values.

Quick linear model for testing the effect of a single regressor, sequentially for many regressors.

This is done in 2 steps:

  1. The cross correlation between each regressor and the target is computed using r_regression as:

    E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y))
  2. It is converted to an F score and then to a p-value.
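The two steps above can be reproduced by hand. This is a sketch, not the library's internal code: it assumes the default centered case, where the F score is r² / (1 - r²) scaled by n_samples - 2 degrees of freedom, and the p-value comes from the survival function of the F(1, n_samples - 2) distribution.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Build a small regression problem.
X, y = make_regression(n_samples=50, n_features=3, random_state=0)
n_samples = X.shape[0]

# Step 1: Pearson correlation between each feature and the target.
r = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# Step 2: convert each correlation to an F score with
# (1, n_samples - 2) degrees of freedom, then to a p-value.
dof = n_samples - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

# The manual values match f_regression's output.
f_stat, p_val = f_regression(X, y)
assert np.allclose(f_manual, f_stat)
assert np.allclose(p_manual, p_val)
```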

f_regression is derived from r_regression and will rank features in the same order if all the features are positively correlated with the target.

Note however that contrary to f_regression, r_regression values lie in [-1, 1] and can thus be negative. f_regression is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.

Furthermore, f_regression returns p-values while r_regression does not.

Read more in the User Guide.

Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)

The data matrix.

y : array-like of shape (n_samples,)

The target vector.

center : bool, default=True

Whether or not to center the data matrix X and the target vector y. By default, X and y will be centered.

force_finite : bool, default=True

Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected not to be finite:

  • when the target y or some features in X are constant. In this case, the Pearson's R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

  • when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max and the associated p-value is set to 0.0.

Added in version 1.1.

Returns:
f_statistic : ndarray of shape (n_features,)

F-statistic for each feature.

p_values : ndarray of shape (n_features,)

P-values associated with the F-statistic.

See also

r_regression

Pearson’s R between label/feature for regression tasks.

f_classif

ANOVA F-value between label/feature for classification tasks.

chi2

Chi-squared stats of non-negative features for classification tasks.

SelectKBest

Select features based on the k highest scores.

SelectFpr

Select features based on a false positive rate test.

SelectFdr

Select features based on an estimated false discovery rate.

SelectFwe

Select features based on family-wise error rate.

SelectPercentile

Select features based on percentile of the highest scores.

Examples

>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import f_regression
>>> X, y = make_regression(
...     n_samples=50, n_features=3, n_informative=1, noise=1e-4, random_state=42
... )
>>> f_statistic, p_values = f_regression(X, y)
>>> f_statistic
array([1.21, 2.67e13, 2.66])
>>> p_values
array([0.276, 1.54e-283, 0.11])
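In practice, f_regression is typically passed as the score_func of a selector such as SelectKBest rather than called directly. A minimal sketch (the dataset parameters here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(
    n_samples=100, n_features=10, n_informative=2, noise=0.1, random_state=0
)

# Keep the two features with the highest F-statistics.
selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (100, 2)
```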

Gallery examples#

Feature agglomeration vs. univariate selection

Comparison of F-test and mutual information