Mean or median imputation consists of replacing missing data with the variable’s mean or median value. To avoid data leakage, we determine the mean or median using the train set, and then use these values to impute the train and test sets, and all future data.
Scikit-learn and Feature-engine learn the mean or median from the train set and store these parameters for future use out of the box.
In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.
Note
Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the variable distribution if there is a high percentage of missing data.
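To see why the choice between mean and median matters, here is a minimal sketch on a made-up, right-skewed variable. The income series and its values are assumptions for illustration only and are not part of the recipe’s dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable with two missing values.
income = pd.Series([20, 22, 25, 27, 30, 400, np.nan, np.nan], name="income")

mean_value = income.mean()      # pulled upwards by the extreme value (400)
median_value = income.median()  # robust to the extreme value

mean_imputed = income.fillna(mean_value)
median_imputed = income.fillna(median_value)

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
# mean=87.33, median=26.00
```

With the mean, the imputed observations land far from where most of the data sits, whereas the median keeps them close to the bulk of the values.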
How to do it...
Let’s begin this recipe:
- First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from feature_engine.imputation import MeanMedianImputer
- Let’s load the dataset that we prepared in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
- Let’s split the data into train and test sets with their respective targets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s make a list with the numerical variables by excluding variables of type object:
numeric_vars = X_train.select_dtypes(exclude="O").columns.to_list()
If you execute numeric_vars, you will see the names of the numerical variables: ['A2', 'A3', 'A8', 'A11', 'A14', 'A15'].
- Let’s capture the variables’ median values in a dictionary:
median_values = X_train[numeric_vars].median().to_dict()
Tip
Note how we calculate the median using the train set. We will use these values to replace missing data in the train and test sets. To calculate the mean, use pandas mean() instead of median().
If you execute median_values, you will see a dictionary with the median value per variable: {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}.
- Let’s replace missing data with the median:
X_train_t = X_train.fillna(value=median_values)
X_test_t = X_test.fillna(value=median_values)
If you execute X_train_t[numeric_vars].isnull().sum() after the imputation, the number of missing values in the numerical variables should be 0.
Note
pandas fillna() returns a new dataset with imputed values by default. To replace missing data in the original DataFrame, set the inplace parameter to True: X_train.fillna(value=median_values, inplace=True).
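The pandas steps above can be tried on a small, made-up dataset without loading the CSV file. This is a sketch under assumptions: the toy train and test DataFrames and their values are invented for illustration, and the numerical columns are selected with include="number" rather than by excluding the object type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the train and test sets.
train = pd.DataFrame({
    "A1": ["a", None, "b", "a"],       # categorical, left untouched
    "A2": [20.0, np.nan, 40.0, 30.0],
    "A3": [1.0, 2.0, np.nan, 3.0],
})
test = pd.DataFrame({
    "A1": ["b", None],
    "A2": [np.nan, 50.0],
    "A3": [np.nan, 4.0],
})

# Select the numerical columns and learn the medians from the train set only.
numeric_vars = train.select_dtypes(include="number").columns.to_list()
median_values = train[numeric_vars].median().to_dict()

# fillna() with a dictionary only fills the columns named in the dictionary
# and returns a new DataFrame; the original is left unchanged.
train_t = train.fillna(value=median_values)
test_t = test.fillna(value=median_values)

print(median_values)
print(test_t)
```

Because the dictionary only contains the numerical variables, the missing values in the categorical column A1 remain in place.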
Now, let’s impute missing values with the median using scikit-learn.
- Let’s set up the imputer to replace missing data with the median:
imputer = SimpleImputer(strategy="median")
Note
To perform mean imputation, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="mean").
- We restrict the imputation to the numerical variables by using ColumnTransformer():
ct = ColumnTransformer(
    [("imputer", imputer, numeric_vars)],
    remainder="passthrough",
    force_int_remainder_cols=False,
).set_output(transform="pandas")
Note
Scikit-learn can return numpy arrays, pandas DataFrames, or polars frames, depending on how we set the transform output. By default, it returns numpy arrays.
- Let’s fit the imputer to the train set so that it learns the median values:
ct.fit(X_train)
- Let’s check out the learned median values:
ct.named_transformers_.imputer.statistics_
The previous command returns the median values per variable:
array([ 28.835, 2.75, 1., 0., 160., 6.])
- Let’s replace missing values with the median:
X_train_t = ct.transform(X_train)
X_test_t = ct.transform(X_test)
- Let’s display the resulting training set:
print(X_train_t.head())
We see the resulting DataFrame in the following image:
Figure 1.3 – Training set after the imputation. The imputed variables are marked by the imputer prefix; the untransformed variables show the prefix remainder
Finally, let’s perform median imputation using feature-engine.
- Let’s set up the imputer to replace missing data in numerical variables with the median:
imputer = MeanMedianImputer(
    imputation_method="median",
    variables=numeric_vars,
)
Note
To perform mean imputation, change imputation_method to "mean". By default, MeanMedianImputer() will impute all numerical variables in the DataFrame, ignoring categorical variables. Use the variables argument to restrict the imputation to a subset of numerical variables.
- Fit the imputer so that it learns the median values:
imputer.fit(X_train)
- Inspect the learned medians:
imputer.imputer_dict_
The previous command returns the median values in a dictionary:
{'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}
- Finally, let’s replace the missing values with the median:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Feature-engine’s MeanMedianImputer() returns a DataFrame. You can check that the imputed variables do not contain missing values using X_train[numeric_vars].isnull().mean().
How it works...
In this recipe, we replaced missing data with the variable’s median values using pandas, scikit-learn, and feature-engine.
We divided the dataset into train and test sets using scikit-learn’s train_test_split() function. The function takes the predictor variables, the target, the fraction of observations to retain in the test set, and a random_state value for reproducibility, as arguments. It returned a train set with 70% of the original observations and a test set with 30% of the original observations. The 70:30 split was done at random.
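The behavior of the split can be checked on a small example. This is a minimal sketch on a made-up, 10-row DataFrame, so the column values are assumptions and not the credit approval data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row dataset with two predictors and a target.
data = pd.DataFrame({
    "A2": range(10),
    "A3": range(10, 20),
    "target": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Repeating the split with the same random_state selects the same rows.
X_train_2, X_test_2, _, _ = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
print(X_test.index.equals(X_test_2.index))  # True
```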
To impute missing data with pandas, in step 5, we created a dictionary with the numerical variable names as keys and their medians as values. The median values were learned from the training set to avoid data leakage. To replace missing data, we applied pandas’ fillna() to train and test sets, passing the dictionary with the median values per variable as a parameter.
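To make the data leakage point concrete, here is a small sketch with invented numbers that compares the median learned from the train set with the median we would obtain if the test set were included in the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets with one numerical variable.
train = pd.DataFrame({"A2": [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({"A2": [100.0, np.nan, 200.0]})

# Median learned from the train set only (no leakage).
train_median = train["A2"].median()

# Median computed on train and test together: information from the
# test set leaks into the imputation value.
leaky_median = pd.concat([train, test])["A2"].median()

print(train_median, leaky_median)  # 20.0 30.0

# The test set is imputed with the value learned from the train set.
test_t = test.fillna(value={"A2": train_median})
```

The two medians differ, which shows that including the test set would change the value used for imputation.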
To replace the missing values with the median using scikit-learn, we used SimpleImputer() with the strategy set to "median". To restrict the imputation to numerical variables, we used ColumnTransformer(). With the remainder argument set to "passthrough", we made ColumnTransformer() return all the variables seen in the training set in the transformed output: the imputed ones followed by those that were not transformed.
Note
ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.
In step 8, we set the output of the column transformer to pandas to obtain a DataFrame as a result. By default, ColumnTransformer() returns numpy arrays.
Note
From version 1.4.0, scikit-learn transformers can return numpy arrays, pandas DataFrames, or polars frames as a result of the transform() method.
With fit(), SimpleImputer() learned the median of each numerical variable in the train set and stored them in its statistics_ attribute. With transform(), it replaced the missing values with the medians.
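The fit() and transform() steps can be followed on a small array. The values below are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train and test arrays with two numerical variables.
X_train = np.array([
    [20.0, 1.0],
    [np.nan, 2.0],
    [40.0, np.nan],
    [30.0, 3.0],
])
X_test = np.array([
    [np.nan, np.nan],
    [50.0, 4.0],
])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)        # learns one median per column
print(imputer.statistics_)  # [30.  2.]

# transform() fills the missing values with the learned medians.
X_test_t = imputer.transform(X_test)
print(X_test_t[0])          # [30.  2.]
```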
To replace missing values with the median using Feature-engine, we used the MeanMedianImputer() with the imputation_method set to median. To restrict the imputation to a subset of variables, we passed the variable names in a list to the variables parameter. With fit(), the transformer learned and stored the median values per variable in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values, returning a pandas DataFrame.
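The fit and transform pattern described above can be mimicked with a few lines of pandas. The ToyMedianImputer class below is a hypothetical stand-in written for illustration. It is not Feature-engine’s implementation, and it only borrows the imputer_dict_ attribute name to mirror the description.

```python
import numpy as np
import pandas as pd


class ToyMedianImputer:
    """Hypothetical stand-in that illustrates the fit/transform pattern."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X):
        # Learn and store the median of each selected variable.
        self.imputer_dict_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Return a new DataFrame with the missing values replaced.
        return X.fillna(value=self.imputer_dict_)


# Made-up data for illustration.
X_train = pd.DataFrame({"A2": [20.0, np.nan, 40.0, 30.0], "A3": [1.0, 2.0, np.nan, 3.0]})
X_test = pd.DataFrame({"A2": [np.nan, 50.0], "A3": [np.nan, 4.0]})

toy_imputer = ToyMedianImputer(variables=["A2"]).fit(X_train)
print(toy_imputer.imputer_dict_)  # {'A2': 30.0}

X_test_t = toy_imputer.transform(X_test)
```

Because only A2 was passed in the variables list, the missing value in A3 is left as it is.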