Python Feature Engineering Cookbook

Imputing Missing Data

Missing data—meaning the absence of values for certain observations—is an unavoidable problem in most data sources. Some machine learning model implementations can handle missing data out of the box. To train other models, we must remove observations with missing data or transform them into permitted values.

The act of replacing missing data with their statistical estimates is called imputation. The goal of any imputation technique is to produce a complete dataset. There are multiple imputation methods. We select which one to use depending on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several imputation methods.

This chapter will cover the following recipes:

  • Removing observations with missing data
  • Performing mean or median imputation
  • Imputing categorical variables
  • Replacing missing values with an arbitrary number
  • Finding extreme values for imputation
  • Marking imputed values
  • Implementing forward and backward fill
  • Carrying out interpolation
  • Performing multivariate imputation by chained equations
  • Estimating missing data with nearest neighbors

Technical requirements

In this chapter, we will use the Python libraries Matplotlib, pandas, NumPy, scikit-learn, and Feature-engine. If you need to install Python, the free Anaconda Python distribution (https://www.anaconda.com/) includes most numerical computing libraries.

feature-engine can be installed with pip as follows:

pip install feature-engine

If you use Anaconda, you can install feature-engine with conda:

conda install -c conda-forge feature_engine

Note

The recipes from this chapter were created using the latest versions of the Python libraries at the time of publishing. You can check the versions in the requirements.txt file in the accompanying GitHub repository, at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/requirements.txt.

We will use the Credit Approval dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), licensed under the CC BY 4.0 Creative Commons Attribution license: https://creativecommons.org/licenses/by/4.0/legalcode. You’ll find the dataset at this link: http://archive.ics.uci.edu/dataset/27/credit+approval.

I downloaded and modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/credit-approval-dataset.ipynb

We will also use the air passengers dataset located in Facebook’s Prophet GitHub repository (https://github.com/facebook/prophet/blob/main/examples/example_air_passengers.csv), licensed under the MIT license: https://github.com/facebook/prophet/blob/main/LICENSE

I modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/air-passengers-dataset.ipynb

You’ll find a copy of the modified datasets in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. CCA preserves the distribution of the variables, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.

Note

Use CCA only when a small number of observations are missing and you have good reasons to believe that they are not important to your model.
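Before committing to CCA, it helps to estimate how much data you would discard. The following snippet is a minimal sketch using pandas; it simply computes the fraction of rows containing at least one missing value in the dataset used throughout this chapter:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")

    # fraction of rows with at least one missing value;
    # these are the rows that CCA would discard
    rows_with_nan = data.isnull().any(axis=1).mean()
    print(f"Fraction of rows with missing data: {rows_with_nan:.2%}")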

How to do it...

Let’s begin by making some imports and loading the dataset:

  1. Let’s import pandas, matplotlib, and the train/test split function from scikit-learn:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load and display the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    data.head()

    In the following image, we see the first 5 rows of data:

Figure 1.1 – First 5 rows of the dataset

  3. Let’s proceed as we normally would if we were preparing the data to train machine learning models, by splitting the data into a training and a test set:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.30,
        random_state=42,
    )
  4. Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
    fig, axes = plt.subplots(
        2, 1, figsize=(15, 10), squeeze=False)
    X_train.isnull().mean().plot(
        kind='bar', color='grey', ax=axes[0, 0], title="train")
    X_test.isnull().mean().plot(
        kind='bar', color='black', ax=axes[1, 0], title="test")
    axes[0, 0].set_ylabel('Fraction of NAN')
    axes[1, 0].set_ylabel('Fraction of NAN')
    plt.show()

    The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test (bottom) sets:

Figure 1.2 – Proportion of missing data per variable

  5. Now, we’ll remove observations if they have missing values in any variable:
    train_cca = X_train.dropna()
    test_cca = X_test.dropna()

Note

pandas’ dropna() drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: data.dropna(subset=["A3", "A4"]).

  6. Let’s print and compare the size of the original and complete case datasets:
    print(f"Total observations: {len(X_train)}")
    print(f"Observations without NAN: {len(train_cca)}")

    We removed more than 200 observations with missing data from the training set, as shown in the following output:

    Total observations: 483
    Observations without NAN: 264
  7. After removing observations from the training and test sets, we need to align the target variables:
    y_train_cca = y_train.loc[train_cca.index]
    y_test_cca = y_test.loc[test_cca.index]

    Now, the datasets and target variables contain the rows without missing data.

  8. To drop observations with missing data utilizing feature-engine, let’s import the required transformer:
    from feature_engine.imputation import DropMissingData
  9. Let’s set up the imputer to automatically find the variables with missing data:
    cca = DropMissingData(variables=None, missing_only=True)
  10. Let’s fit the transformer so that it finds the variables with missing data:
    cca.fit(X_train)
  11. Let’s inspect the variables with NAN that the transformer found:
    cca.variables_

    The previous command returns the names of the variables with missing data:

    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
  12. Let’s remove the rows with missing data in the training and test sets:
    train_cca = cca.transform(X_train)
    test_cca = cca.transform(X_test)

    Use train_cca.isnull().sum() to corroborate the absence of missing data in the complete case dataset.

  13. DropMissingData can automatically adjust the target after removing missing data from the training set:
    train_c, y_train_c = cca.transform_x_y(X_train, y_train)
    test_c, y_test_c = cca.transform_x_y(X_test, y_test)

The previous code removed rows with nan from the training and test sets and then re-aligned the target variables.

Note

To remove observations with missing data in a subset of variables, use DropMissingData(variables=['A3', 'A4']). To remove rows with nan in at least 5% of the variables, use DropMissingData(threshold=0.95).

How it works...

In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.

We used pandas isnull() and mean() methods to determine the proportion of missing observations in each variable. The isnull() method created a Boolean vector per variable, with True and False values indicating whether a value was missing. The mean() method took the average of these values and returned the proportion of missing data.

We used pandas plot.bar() to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of nan per variable in the training and test sets.

To remove observations with missing values in any variable, we used pandas’ dropna(), thereby obtaining a complete case dataset.

Finally, we removed missing data using Feature-engine’s DropMissingData(). This imputer automatically identified and stored the variables with missing data from the train set when we called the fit() method. With the transform() method, the imputer removed observations with nan in those variables. With transform_x_y(), the imputer removed rows with nan from the data sets and then realigned the target variable.

See also

If you want to use DropMissingData() within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s Pipeline: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.
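The following sketch illustrates how such a pipeline could be put together. The import path feature_engine.pipeline and the availability of transform_x_y() on the pipeline are assumptions based on the documentation linked above; check the user guide for the exact API of your Feature-engine version:

    from feature_engine.imputation import DropMissingData, MeanMedianImputer
    from feature_engine.pipeline import Pipeline  # assumed import path; see the docs above

    pipe = Pipeline([
        # drop rows with nan in a subset of variables (illustrative choice)
        ("drop_na", DropMissingData(variables=["A3", "A8"])),
        # median-impute the remaining numerical variables with nan
        ("impute", MeanMedianImputer(imputation_method="median")),
    ])

    pipe.fit(X_train, y_train)
    # transform_x_y is assumed to drop the same rows from X and realign y
    X_train_t, y_train_t = pipe.transform_x_y(X_train, y_train)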

Performing mean or median imputation

Mean or median imputation consists of replacing missing data with the variable’s mean or median value. To avoid data leakage, we determine the mean or median using the train set, and then use these values to impute the train and test sets, and all future data.

Scikit-learn and Feature-engine learn the mean or median from the train set and store these parameters for future use out of the box.

In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.

Note

Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the variable distribution if there is a high percentage of missing data.
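A quick way to decide between the two is to inspect how skewed each variable is. The snippet below is a minimal sketch using pandas; the 1.0 skewness cut-off is an arbitrary illustrative threshold, not a rule from this recipe:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")
    numeric_vars = data.drop(columns="target").select_dtypes(
        exclude="O").columns

    # strongly skewed variables are better candidates for median imputation
    skew = data[numeric_vars].skew()
    print(skew.sort_values())
    print("Suggest median for:", list(skew[skew.abs() > 1.0].index))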

How to do it...

Let’s begin this recipe:

  1. First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import MeanMedianImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets with their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s make a list with the numerical variables by excluding variables of type object:
    numeric_vars = X_train.select_dtypes(
        exclude="O").columns.to_list()

    If you execute numeric_vars, you will see the names of the numerical variables: ['A2', 'A3', 'A8', 'A11', 'A14', 'A15'].

  5. Let’s capture the variables’ median values in a dictionary:
    median_values = X_train[
        numeric_vars].median().to_dict()

Tip

Note how we calculate the median using the train set. We will use these values to replace missing data in the train and test sets. To calculate the mean, use pandas mean() instead of median().

If you execute median_values, you will see a dictionary with the median value per variable: {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}.

  6. Let’s replace missing data with the median:
    X_train_t = X_train.fillna(value=median_values)
    X_test_t = X_test.fillna(value=median_values)

    If you execute X_train_t[numeric_vars].isnull().sum() after the imputation, the number of missing values in the numerical variables should be 0.

Note

pandas fillna() returns a new dataset with imputed values by default. To replace missing data in the original DataFrame, set the inplace parameter to True: X_train.fillna(value=median_values, inplace=True).

Now, let’s impute missing values with the median using scikit-learn.

  7. Let’s set up the imputer to replace missing data with the median:
    imputer = SimpleImputer(strategy="median")

Note

To perform mean imputation, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="mean").

  8. We restrict the imputation to the numerical variables by using ColumnTransformer():
    ct = ColumnTransformer(
        [("imputer", imputer, numeric_vars)],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ).set_output(transform="pandas")

Note

Scikit-learn can return numpy arrays, pandas DataFrames, or polars frames, depending on how we set the transform output. By default, it returns numpy arrays.

  9. Let’s fit the imputer to the train set so that it learns the median values:
    ct.fit(X_train)
  10. Let’s check out the learned median values:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the median values per variable:

    array([ 28.835,   2.75,   1.,   0., 160.,   6.])
  11. Let’s replace missing values with the median:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)
  12. Let’s display the resulting training set:
    print(X_train_t.head())

    We see the resulting DataFrame in the following image:

Figure 1.3 – Training set after the imputation. The imputed variables are marked by the imputer prefix; the untransformed variables show the prefix remainder

Finally, let’s perform median imputation using feature-engine.

  13. Let’s set up the imputer to replace missing data in numerical variables with the median:
    imputer = MeanMedianImputer(
        imputation_method="median",
        variables=numeric_vars,
    )

Note

To perform mean imputation, change imputation_method to "mean". By default, MeanMedianImputer() will impute all numerical variables in the DataFrame, ignoring categorical variables. Use the variables argument to restrict the imputation to a subset of numerical variables.

  14. Fit the imputer so that it learns the median values:
    imputer.fit(X_train)
  15. Inspect the learned medians:
    imputer.imputer_dict_

    The previous command returns the median values in a dictionary:

    {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}
  16. Finally, let’s replace the missing values with the median:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Feature-engine’s MeanMedianImputer() returns a DataFrame. You can check that the imputed variables do not contain missing values using X_train[numeric_vars].isnull().mean().

How it works...

In this recipe, we replaced missing data with the variable’s median values using pandas, scikit-learn, and feature-engine.

We divided the dataset into train and test sets using scikit-learn’s train_test_split() function. The function takes the predictor variables, the target, the fraction of observations to retain in the test set, and a random_state value for reproducibility as arguments. It returned a train set with 70% of the original observations and a test set with 30% of the original observations. The 70:30 split was done at random.

To impute missing data with pandas, in step 5, we created a dictionary with the numerical variable names as keys and their medians as values. The median values were learned from the training set to avoid data leakage. To replace missing data, we applied pandas fillna() to the train and test sets, passing the dictionary with the median values per variable as a parameter.

To replace the missing values with the median using scikit-learn, we used SimpleImputer() with the strategy set to "median". To restrict the imputation to numerical variables, we used ColumnTransformer(). With the remainder argument set to passthrough, we made ColumnTransformer() return all the variables seen in the training set in the transformed output; the imputed ones followed by those that were not transformed.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

In step 8, we set the output of the column transformer to pandas to obtain a DataFrame as a result. By default, ColumnTransformer() returns numpy arrays.

Note

From version 1.4.0, scikit-learn transformers can return numpy arrays, pandas DataFrames, or polars frames as a result of the transform() method.
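As an illustration of this note, the output container is selected with the set_output() method; the polars option requires scikit-learn 1.4 or later and the polars library installed:

    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="median")

    imputer.set_output(transform="pandas")    # transform() returns a pandas DataFrame
    # imputer.set_output(transform="polars")  # returns a polars frame (scikit-learn >= 1.4)
    # imputer.set_output(transform="default") # returns a numpy array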

With fit(), SimpleImputer() learned the median of each numerical variable in the train set and stored them in its statistics_ attribute. With transform(), it replaced the missing values with the medians.

To replace missing values with the median using Feature-engine, we used MeanMedianImputer() with the imputation_method set to median. To restrict the imputation to a subset of variables, we passed the variable names in a list to the variables parameter. With fit(), the transformer learned and stored the median values per variable in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values, returning a pandas DataFrame.

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we find the frequent categories from the train set. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.

How to do it...

To begin, let’s make a few imports and prepare the data:

  1. Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import CategoricalImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets and their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s capture the categorical variables in a list:
    categorical_vars = X_train.select_dtypes(
        include="O").columns.to_list()
  5. Let’s store the variables’ most frequent categories in a dictionary:
    frequent_values = X_train[
        categorical_vars].mode().iloc[0].to_dict()
  6. Let’s replace missing values with the frequent categories:
    X_train_t = X_train.fillna(value=frequent_values)
    X_test_t = X_test.fillna(value=frequent_values)

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

  7. To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
    imputation_dict = {var: "no_data" for var in categorical_vars}

    Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas value_counts() we can see the string added by the imputation. Try executing, for example, X_train["A1"].value_counts().

Now, let’s impute missing values with the most frequent category using scikit-learn.

  8. Let’s set up the imputer to find the most frequent category per variable:
    imputer = SimpleImputer(strategy='most_frequent')

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.

  9. Let’s restrict the imputation to the categorical variables:
    ct = ColumnTransformer(
        [("imputer", imputer, categorical_vars)],
        remainder="passthrough"
    ).set_output(transform="pandas")

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

  10. Fit the imputer to the train set so that it learns the most frequent values:
    ct.fit(X_train)
  11. Let’s take a look at the most frequent values learned by the imputer:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the most frequent values per variable:

    array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
  12. Finally, let’s replace missing values with the frequent categories:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)

    Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

The ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.

Finally, let’s impute missing values using feature-engine.

  13. Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
    imputer = CategoricalImputer(
        imputation_method="frequent",
        variables=categorical_vars,
    )

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

  14. Fit the imputer to the train set so that it learns the most frequent categories:
    imputer.fit(X_train)

Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.

  15. Let’s check out the learned categories:
    imputer.imputer_dict_

    We can see the dictionary with the most frequent values in the following output:

    {'A1': 'b', 'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v', 'A9': 't', 'A10': 'f', 'A12': 'f', 'A13': 'g'}
  16. Finally, let’s replace the missing values with the frequent categories:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as parameters. A variable can have more than one mode, which is why we made sure to capture only one of those values by using .iloc[0].

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method.

To replace missing values with feature-engine, we used CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless we tell it otherwise by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.

Note

We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

  1. Import pandas and the required functions and classes:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from feature_engine.imputation import ArbitraryNumberImputer
  2. Let’s load the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

    We will select arbitrary values greater than the maximum value of the distribution.

  4. Let’s find the maximum value of four numerical variables:
    X_train[['A2', 'A3', 'A8', 'A11']].max()

    The previous command returns the following output:

    A2     76.750
    A3     26.335
    A8     28.500
    A11    67.000
    dtype: float64

    We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Now, we replace the missing values with 99:
    X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)
    X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

  7. Let’s set up the imputer to replace missing values with 99:
    imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will add 99 to those variables as well if any values are missing.

  8. Let’s fit the imputer to a slice of the train set containing the variables to impute:
    vars = ["A2", "A3", "A8", "A11"]
    imputer.fit(X_train[vars])
  9. Replace the missing values with 99 in the desired variables:
    X_train_t[vars] = imputer.transform(X_train[vars])
    X_test_t[vars] = imputer.transform(X_test[vars])

    Go ahead and check the lack of missing values by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().

    To finish, let’s impute missing values using feature-engine.

  10. Let’s set up the imputer to replace missing values with 99 in 4 specific variables:
    imputer = ArbitraryNumberImputer(
        arbitrary_number=99,
        variables=["A2", "A3", "A8", "A11"],
    )

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

  11. Finally, let’s replace the missing values with 99:
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict={"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas fillna() to replace the missing data.

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().

To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.

Finding extreme values for imputation

Replacing missing values with a value at the end of the variable distribution (extreme values) is like replacing them with an arbitrary value, but instead of setting the arbitrary values manually, the values are automatically selected from the end of the variable distribution.

We can replace missing data with a value that is greater or smaller than most values in the variable. To select a value that is greater, we can use the mean plus a factor of the standard deviation. Alternatively, we can set it to the 75th quantile + IQR × 1.5, where IQR stands for inter-quartile range and is the difference between the 75th and 25th quantiles. To replace missing data with values that are smaller than the remaining values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – IQR × 1.5.

Note

End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.

In this recipe, we will implement end-of-tail or extreme value imputation using pandas and feature-engine.

How to do it...

To begin this recipe, let’s import the necessary tools and load the data:

  1. Let’s import pandas and the required function and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.imputation import EndTailImputer
  2. Let’s load the dataset we described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s capture the numerical variables in a list, excluding the target:
    numeric_vars = [
        var for var in data.select_dtypes(
            exclude="O").columns.to_list()
        if var != "target"
    ]
  4. Let’s split the data into train and test sets, keeping only the numerical variables:
    X_train, X_test, y_train, y_test = train_test_split(
        data[numeric_vars],
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  5. We’ll now determine the IQR:
    IQR = X_train.quantile(0.75) - X_train.quantile(0.25)

    We can visualize the IQR values by executing IQR or print(IQR):

    A2      16.4200
    A3       6.5825
    A8       2.8350
    A11      3.0000
    A14    212.0000
    A15    450.0000
    dtype: float64
  6. Let’s create a dictionary with the variable names and the imputation values:
    imputation_dict = (
        X_train.quantile(0.75) + 1.5 * IQR).to_dict()

Note

If we use the inter-quartile range proximity rule, we determine the imputation values by adding 1.5 times the IQR to the 75th quantile. If variables are normally distributed, we can calculate the imputation values as the mean plus a factor of the standard deviation, imputation_dict = (X_train.mean() + 3 * X_train.std()).to_dict().

  7. Finally, let’s replace the missing data:
    X_train_t = X_train.fillna(value=imputation_dict)
    X_test_t = X_test.fillna(value=imputation_dict)

Note

We can also replace missing data with values at the left tail of the distribution using value = X_train[var].quantile(0.25) - 1.5 * IQR or value = X_train[var].mean() - 3 * X_train[var].std().

To finish, let’s impute missing values using feature-engine.

  8. Let’s set up the imputer to estimate a value at the right of the distribution using the IQR proximity rule:
    imputer = EndTailImputer(
        imputation_method="iqr",
        tail="right",
        fold=3,
        variables=None,
    )

Note

To use the mean and standard deviation to calculate the replacement values, set imputation_method="gaussian". Use left or right in the tail argument to specify the side of the distribution to consider when finding the values for the imputation.

  9. Let’s fit EndTailImputer() to the train set so that it learns the values for the imputation:
    imputer.fit(X_train)
  10. Let’s inspect the learned values:
    imputer.imputer_dict_

    The previous command returns a dictionary with the values to use to impute each variable:

    {'A2': 88.18, 'A3': 27.31, 'A8': 11.504999999999999, 'A11': 12.0, 'A14': 908.0, 'A15': 1800.0}
  11. Finally, let’s replace the missing values:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Remember that you can corroborate that the missing values were replaced by using X_train[['A2', 'A3', 'A8', 'A11', 'A14', 'A15']].isnull().mean().

How it works...

In this recipe, we replaced missing values in numerical variables with a number at the end of the distribution using pandas and feature-engine.

We determined the imputation values according to the formulas described in the introduction to this recipe. We used pandas quantile() to find specific quantile values, or pandas mean() and std() for the mean and standard deviation. With pandas fillna() we replaced the missing values.

To replace missing values with EndTailImputer() from feature-engine, we set imputation_method to iqr to calculate the values based on the IQR proximity rule. With tail set to right, the transformer found the imputation values from the right of the distribution. With fit(), the imputer learned and stored the values for the imputation in a dictionary in the imputer_dict_ attribute. With transform(), we replaced the missing values, returning DataFrames.

Marking imputed values

In the previous recipes, we focused on replacing missing data with estimates of their values. In addition, we can add missing indicators to mark observations where values were missing.

A missing indicator is a binary variable that takes the value 1 or True to indicate whether a value was missing, and 0 or False otherwise. It is common practice to replace missing observations with the mean, median, or most frequent category while simultaneously marking those missing observations with missing indicators. In this recipe, we will learn how to add missing indicators using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by making some imports and loading the data:

  1. Let’s import the required libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from feature_engine.imputation import (
        AddMissingIndicator,
        CategoricalImputer,
        MeanMedianImputer
    )
  2. Let’s load and split the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the variable names in a list:
    varnames = ["A1", "A3", "A4", "A5", "A6", "A7", "A8"]
  4. Let’s create names for the missing indicators and store them in a list:
    indicators = [f"{var}_na" for var in varnames]

    If we execute indicators, we will see the names we will use for the new variables: ['A1_na', 'A3_na', 'A4_na', 'A5_na', 'A6_na', 'A7_na', 'A8_na'].

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Let’s add the missing indicators:
    X_train_t[indicators] = X_train[
        varnames].isna().astype(int)
    X_test_t[indicators] = X_test[
        varnames].isna().astype(int)

Note

If you want the indicators to have True and False as values instead of 0 and 1, remove astype(int) in step 6.

  7. Let’s inspect the resulting DataFrame:
    X_train_t.head()

    We can see the newly added variables at the right of the DataFrame in the following image:

Figure 1.4 – DataFrame with the missing indicators

Now, let’s add missing indicators using Feature-engine instead.

  8. Set up the imputer to add binary indicators to every variable with missing data:
    imputer = AddMissingIndicator(
        variables=None, missing_only=True
    )
  9. Fit the imputer to the train set so that it finds the variables with missing data:
    imputer.fit(X_train)

Note

If we execute imputer.variables_, we will find the variables for which missing indicators will be added.

  10. Finally, let’s add the missing indicators:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    So far, we just added missing indicators. But we still have the missing data in our variables. We need to replace them with numbers. In the rest of this recipe, we will combine the use of missing indicators with mean and mode imputation.

  11. Let’s create a pipeline to add missing indicators to categorical and numerical variables, then impute categorical variables with the most frequent category, and numerical variables with the mean:
    pipe = Pipeline([
        ("indicators",
            AddMissingIndicator(missing_only=True)),
        ("categorical", CategoricalImputer(
            imputation_method="frequent")),
        ("numerical", MeanMedianImputer()),
    ])

Note

feature-engine imputers automatically identify numerical or categorical variables. So there is no need to slice the data or pass the variable names as arguments to the transformers in this case.

  12. Let’s add the indicators and impute missing values:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Note

Use X_train_t.isnull().sum() to corroborate that there is no data missing. Execute X_train_t.head() to get a view of the resulting DataFrame.

Finally, let’s add missing indicators and simultaneously impute numerical and categorical variables with the mean and most frequent categories respectively, utilizing scikit-learn.

  13. Let’s make a list with the names of the numerical and categorical variables:
    numvars = X_train.select_dtypes(
        exclude="O").columns.to_list()
    catvars = X_train.select_dtypes(
        include="O").columns.to_list()
  14. Let’s set up a pipeline to perform mean and frequent category imputation while marking the missing data:
    pipe = ColumnTransformer([
        ("num_imputer", SimpleImputer(
            strategy="mean",
            add_indicator=True),
        numvars),
        ("cat_imputer", SimpleImputer(
            strategy="most_frequent",
            add_indicator=True),
        catvars),
    ]).set_output(transform="pandas")
  15. Now, let’s carry out the imputation:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Make sure to explore X_train_t.head() to get familiar with the pipeline’s output.

How it works...

To add missing indicators using pandas, we used isna(), which created a new vector assigning the value of True if there was a missing value or False otherwise. We used astype(int) to convert the Boolean vectors into binary vectors with values 1 and 0.

To add a missing indicator with feature-engine, we used AddMissingIndicator(). With fit(), the transformer found the variables with missing data. With transform(), it appended the missing indicators to the right of the train and test sets.

To sequentially add missing indicators and then replace the nan values with the most frequent category or the mean, we lined up Feature-engine’s AddMissingIndicator(), CategoricalImputer(), and MeanMedianImputer() within a pipeline. The fit() method from the pipeline made the transformers find the variables with nan and calculate the mean of the numerical variables and the mode of the categorical variables. The transform() method from the pipeline made the transformers add the missing indicators and then replace the missing values with their estimates.

Note

Feature-engine transformations return DataFrames respecting the original names and order of the variables. Scikit-learn’s ColumnTransformer(), on the other hand, changes the variables’ names and order in the resulting data.

Finally, we added missing indicators and replaced missing data with the mean and most frequent category using scikit-learn. We lined up two instances of SimpleImputer(), the first to impute data with the mean and the second to impute data with the most frequent category. In both cases, we set the add_indicator parameter to True to add the missing indicators. We wrapped SimpleImputer() with ColumnTransformer() to specifically modify numerical or categorical variables. Then we used the fit() and transform() methods from the pipeline to train the transformers and then add the indicators and replace the missing data.

When returning DataFrames, ColumnTransformer() changes the names of the variables and their order. Take a look at the result from step 15 by executing X_train_t.head(). You’ll see that the name given to each step of the pipeline is added as a prefix to the variables to flag which variable was modified with each transformer. For example, num_imputer__A2 was returned by the first step of the pipeline, while cat_imputer__A12 was returned by the second step of the pipeline.

There’s more…

Scikit-learn has the MissingIndicator() transformer that just adds missing indicators. Check it out in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html and find an example in the accompanying GitHub repository at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/Recipe-06-Marking-imputed-values.ipynb.
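As a minimal sketch of that transformer (not the notebook linked above), MissingIndicator() can be used on its own to generate the binary indicator columns, reusing the X_train and X_test sets from this recipe:

    from sklearn.impute import MissingIndicator

    indicator = MissingIndicator(features="missing-only")

    # one boolean column per variable that had nan in the train set
    train_ind = indicator.fit_transform(X_train)
    test_ind = indicator.transform(X_test)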

Implementing forward and backward fill

Time series data also show missing values. To impute missing data in time series, we use specific methods. Forward fill imputation involves filling missing values in a dataset with the most recent non-missing value that precedes it in the data sequence. In other words, we carry forward the last seen value to the next valid value. Backward fill imputation involves filling missing values with the next non-missing value that follows it in the data sequence. In other words, we carry the next valid value backward to the preceding gap.

In this recipe, we will replace missing data in a time series with forward and backward fills.

How to do it...

Let’s begin by importing the required libraries and time series dataset:

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the air passengers dataset that we described in the Technical requirements section and display the first five rows of the time series:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )
    print(df.head())

    We see the time series in the following output:

                    y
    ds
    1949-01-01  112.0
    1949-02-01  118.0
    1949-03-01  132.0
    1949-04-01  129.0
    1949-05-01  121.0

Note

You can determine the percentage of missing data by executing df.isnull().mean().

  3. Let’s plot the time series to spot any obvious data gaps:
    ax = df.plot(marker=".", figsize=[10, 5], legend=None)
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see intervals of time where data is missing:

Figure 1.5 – Time series data showing missing values

  4. Let’s impute missing data by carrying the last observed value in any interval to the next valid value:
    df_imputed = df.ffill()

    You can verify the absence of missing data by executing df_imputed.isnull().sum().

  5. Let’s now plot the complete dataset and overlay, as a dotted line, the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines overlaid in between the continuous time series lines:

Figure 1.6 – Time series data where missing values were replaced by the last seen observations (dotted line)

  6. Alternatively, we can impute missing data using backward fill:
    df_imputed = df.bfill()

    If we plot the imputed dataset and overlay the imputation values as we did in step 5, we’ll see the following plot:

Figure 1.7 – Time series data where missing values were replaced by backward fill (dotted line)

Note

The heights of the values used in the imputation are different in Figures 1.6 and 1.7. In Figure 1.6, we carry the last value forward, hence the height is lower. In Figure 1.7, we carry the next value backward, hence the height is higher.

We’ve now obtained complete datasets that we can use for time series analysis and modeling.

How it works...

pandas ffill() takes the last seen value in any temporal gap in a time series and propagates it forward to the next observed value. Hence, in Figure 1.6 we see the dotted overlay corresponding to the imputation values at the height of the last seen observation.

pandas bfill() takes the next valid value in any temporal gap in a time series and propagates it backward to the previously observed value. Hence, in Figure 1.7 we see the dotted overlay corresponding to the imputation values at the height of the next observation in the gap.

By default, ffill() and bfill() will impute all values between valid observations. We can restrict the imputation to a maximum number of data points in any interval by setting a limit, using the limit parameter in both methods. For example, ffill(limit=10) will only replace the first 10 data points in any gap.
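For instance, a minimal sketch of limiting the fill length on the air passengers data could look like this; the limit of 2 is an arbitrary illustrative choice:

    import pandas as pd

    df = pd.read_csv(
        "air_passengers.csv", parse_dates=["ds"], index_col=["ds"])

    # fill at most 2 consecutive missing points after each valid observation;
    # longer gaps keep nan beyond that limit
    df_limited = df.ffill(limit=2)
    print(df_limited.isnull().sum())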

Carrying out interpolation

We can impute missing data in time series by using interpolation between two non-missing data points. Interpolation is the estimation of one or more values in a range by means of a function. In linear interpolation, we fit a linear function between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The idea of using interpolation is to obtain better estimates of the missing data.

In this recipe, we’ll carry out linear and spline interpolation in a time series.

How to do it...

Let’s begin by importing the required libraries and time series dataset.

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the time series data described in the Technical requirements section:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )

Note

You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.

  3. Let’s impute the missing data by linear interpolation:
    df_imputed = df.interpolate(method="linear")

Note

If the time intervals between rows are not uniform, then the method should be set to time to achieve a linear fit.

You can verify the absence of missing data by executing df_imputed.isnull().sum().

  4. Let’s now plot the complete dataset and overlay, as a dotted line, the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines in between the continuous line of the time series:

Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)

  5. Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
    df_imputed = df.interpolate(method="spline", order=2)

    If we plot the imputed dataset and overlay the imputation values as we did in step 4, we’ll see the following plot:

Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)

Note

Change the degree of the polynomial used in the interpolation to see how the replacement values vary.

We’ve now obtained complete datasets that we can use for analysis and modeling.

How it works...

pandas interpolate() fills missing values in a range by using an interpolation method. When we set the method to linear, interpolate() treats all data points as equidistant and fits a line between the last and next valid points in an interval with missing data.

Note

If you want to perform linear interpolation, but your data points are not equally distanced, set method to time.

We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.

pandas interpolate() uses scipy.interpolate.interp1d and scipy.interpolate.UnivariateSpline under the hood, and can therefore implement other interpolation methods. Check out the pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.

See also

While interpolation aims to get better estimates of the missing data compared to forward and backward fill, these estimates may still not be accurate if the time series shows strong trend and seasonality. To obtain better estimates of the missing data in these types of time series, check out time series decomposition followed by interpolation in the Feature Engineering for Time Series course at https://www.trainindata.com/p/feature-engineering-for-forecasting.

Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.

MICE involves the following steps:

  1. First, it performs a simple univariate imputation to every variable with missing data. For example, median imputation.
  2. Next, it selects one specific variable, say, var_1, and sets the missing values back to missing.
  3. It trains a model to predict var_1 using the other variables as input features.
  4. Finally, it replaces the missing values of var_1 with the output of the model.

MICE repeats steps 2 to 4 for each of the remaining variables.

An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.
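To make the cycle concrete, here is a conceptual sketch of a single imputation cycle written with pandas and a generic scikit-learn regressor class (called Estimator here as a stand-in); it only illustrates the steps above, it is not how IterativeImputer is implemented:

    def mice_cycle(X, Estimator):
        # step 1: simple univariate imputation for every variable with nan
        X_imp = X.fillna(X.median())
        cols_with_nan = [c for c in X.columns if X[c].isna().any()]
        for col in cols_with_nan:
            # step 2: work with the rows where this variable was originally missing
            missing = X[col].isna()
            # step 3: model the variable from the other (already imputed) variables,
            # training only on rows where the variable was observed
            model = Estimator()
            model.fit(
                X_imp.loc[~missing].drop(columns=col),
                X.loc[~missing, col],
            )
            # step 4: replace the missing values with the model's output
            X_imp.loc[missing, col] = model.predict(
                X_imp.loc[missing].drop(columns=col))
        return X_imp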

Note

Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.

In this recipe, we will implement MICE using scikit-learn.

How to do it...

To begin the recipe, let’s import the required libraries and load the data:

  1. Let’s import the required Python libraries, classes, and functions:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import BayesianRidge
    from sklearn.experimental import (
        enable_iterative_imputer
    )
    from sklearn.impute import (
        IterativeImputer,
        SimpleImputer
    )
  2. Let’s load some numerical variables from the dataset described in the Technical requirements section:
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables)
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s create a MICE imputer using Bayes regression, specifying the number of iteration cycles and setting random_state for reproducibility:
    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        random_state=0,
    ).set_output(transform="pandas")

Note

IterativeImputer() contains other useful arguments. For example, we can specify the first imputation strategy using the initial_strategy parameter. We can choose from the mean, median, mode, or arbitrary imputation. We can also specify how we want to cycle over the variables, either randomly or from the one with the fewest missing values to the one with the most.
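As an illustration of those arguments, a configuration along these lines starts from median imputation and cycles from the variable with the fewest missing values to the one with the most; the specific choices are illustrative, not part of this recipe:

    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        initial_strategy="median",     # first, univariate median imputation
        imputation_order="ascending",  # variables with fewest missing values first
        max_iter=10,
        random_state=0,
    )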

  5. Let’s fit IterativeImputer() so that it trains the estimators to predict the missing values in each variable:
    imputer.fit(X_train)

Note

We can use any regression model to estimate the missing data with IterativeImputer().

  6. Finally, let’s fill in the missing values in both the train and test sets:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

Note

To corroborate the lack of missing data, we can execute X_train_t.isnull().sum().

To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distribution.

  7. Let’s set up scikit-learn’s SimpleImputer() to perform mean imputation, and then transform the datasets:
    imputer_simple = SimpleImputer(
        strategy="mean"
    ).set_output(transform="pandas")
    X_train_s = imputer_simple.fit_transform(X_train)
    X_test_s = imputer_simple.transform(X_test)
  8. Let’s now make a histogram of the A3 variable after MICE imputation, followed by a histogram of the same variable after mean imputation:
    fig, axes = plt.subplots(
        2, 1, figsize=(10, 10), squeeze=False
    )
    X_test_t["A3"].hist(
        bins=50, ax=axes[0, 0], color="blue")
    X_test_s["A3"].hist(
        bins=50, ax=axes[1, 0], color="green")
    axes[0, 0].set_ylabel('Number of observations')
    axes[1, 0].set_ylabel('Number of observations')
    axes[0, 0].set_xlabel('A3')
    axes[1, 0].set_xlabel('A3')
    axes[0, 0].set_title('MICE')
    axes[1, 0].set_title('Mean imputation')
    plt.show()

    In the following plot, we see that mean imputation distorts the variable distribution, with more observations toward the mean value:

Figure 1.10 – Histogram of variable A3 after MICE imputation (top) or mean imputation (bottom), showing the distortion in the variable distribution caused by the latter
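If you prefer to quantify the distortion rather than inspect it visually, a quick supplementary check (not part of the original recipe) is to compare the standard deviation of A3 before and after each imputation; mean imputation typically shrinks it the most:

    print(X_test["A3"].std())    # original values, ignoring the missing data
    print(X_test_t["A3"].std())  # after MICE imputation
    print(X_test_s["A3"].std())  # after mean imputation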

How it works...

In this recipe, we performed multivariate imputation using IterativeImputer() from scikit-learn. When we fit the model, IterativeImputer() carried out the steps that we described in the introduction of the recipe. That is, it imputed all variables with the mean. Then it selected one variable and set its missing values back to missing. And finally, it fitted a Bayes regressor to estimate that variable based on the others. It repeated this procedure for each variable. That was one cycle of imputation. We set it to repeat this process 10 times. By the end of this procedure, IterativeImputer() had one Bayes regressor trained to predict the values of each variable based on the other variables in the dataset. With transform(), it uses the predictions of these Bayes models to impute the missing data.
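If you want to inspect the models fitted during the cycles, scikit-learn stores them, at the time of writing, in the imputation_sequence_ attribute of the fitted imputer; each entry records the index of the imputed feature, the predictor features, and the trained estimator:

    for triplet in imputer.imputation_sequence_[:5]:
        print(triplet.feat_idx, triplet.estimator)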

IterativeImputer() can only impute missing data in numerical variables based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it will only carry out regression; hence, it is not suitable for estimating missing data in discrete or categorical variables.
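If your dataset contains categorical variables that you want to use as predictors, a minimal sketch of that workflow could look as follows, assuming a hypothetical DataFrame df whose categorical column sector is complete and whose numerical column income has missing values:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    df = pd.DataFrame({
        "sector": ["retail", "tech", "tech", "retail", "finance"],
        "income": [30.0, np.nan, 55.0, 28.0, np.nan],
        "age": [25, 40, 38, 30, 50],
    })

    # Encode the categorical column as numbers so it can act as an input feature.
    df["sector"] = OrdinalEncoder().fit_transform(df[["sector"]]).ravel()

    # Impute the numerical gaps using all (now numerical) columns as predictors.
    df_imputed = IterativeImputer(random_state=0).fit_transform(df)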

See also

To learn more about MICE, take a look at the following resources:

Estimating missing data with nearest neighbors

Imputation with K-Nearest Neighbors (KNN) involves estimating missing values in a dataset by considering the values of their nearest neighbors, where the similarity between data points is determined by a distance metric, such as the Euclidean distance. The missing value is replaced with an average of the nearest neighbors’ values, which can be weighted by their distance to the incomplete observation.

Consider the following dataset containing 4 variables (columns) and 11 observations (rows). We want to impute the dark value in the fifth row of the second variable. First, we find the row’s k-nearest neighbors, where k=3 in our example, and they are highlighted by the rectangular boxes (middle panel). Next, we take the average of the values shown by the closest neighbors for variable 2.

Figure 1.11 – Diagram showing a value to impute (dark box), the three closest rows to the value to impute (square boxes), and the values considered to take the average for the imputation

The value for the imputation is given by (value1 × w1 + value2 × w2 + value3 × w3) / (w1 + w2 + w3), where w1, w2, and w3 are inversely proportional to the distance of each neighbor to the data to impute, so that closer neighbors contribute more to the replacement value.
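As a small worked example (the values and distances below are made up purely for illustration), the inverse-distance weighted mean can be computed as follows:

    values = [4.0, 6.0, 5.0]      # values of variable 2 for the three neighbors
    distances = [1.0, 2.0, 4.0]   # distance of each neighbor to the incomplete row

    weights = [1 / d for d in distances]
    imputed = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    print(imputed)  # the closest neighbor influences the result the most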

In this recipe, we will perform KNN imputation using scikit-learn.

How to do it...

To proceed with the recipe, let’s import the required libraries and prepare the data:

  1. Let’s import the required libraries, classes, and functions:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import KNNImputer
  2. Let’s load the dataset described in the Technical requirements section (only some numerical variables):
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables,
    )
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s set up the imputer to replace missing data with the weighted mean of its closest five neighbors:
    imputer = KNNImputer(
        n_neighbors=5, weights="distance",
    ).set_output(transform="pandas")

Note

The replacement values can be calculated as the uniform mean of the k-nearest neighbors, by setting weights to uniform, or as the weighted average, as we do in the recipe. The weight is based on the distance of the neighbor to the observation to impute. The nearest neighbors carry more weight.
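For comparison, here is a sketch of the unweighted variant, in which every neighbor contributes equally to the replacement value:

    imputer_uniform = KNNImputer(
        n_neighbors=5, weights="uniform",
    ).set_output(transform="pandas")
    X_train_u = imputer_uniform.fit_transform(X_train)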

  5. Find the nearest neighbors:
    imputer.fit(X_train)
  6. Replace the missing values with the weighted mean of the values shown by the neighbors:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

The result is a pandas DataFrame with the missing data replaced.

How it works...

In this recipe, we replaced missing data with the average value shown by each observation’s k-nearest neighbors. We set up KNNImputer() to find each observation’s five closest neighbors based on the Euclidean distance. The replacement values were estimated as the weighted average of the values shown by the five closest neighbors for the variable to impute. With transform(), the imputer calculated the replacement values and replaced the missing data.


Key benefits

  • Craft powerful features from tabular, transactional, and time-series data
  • Develop efficient and reproducible real-world feature engineering pipelines
  • Optimize data transformation and save valuable time
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Streamline data preprocessing and feature engineering in your machine learning project with this third edition of the Python Feature Engineering Cookbook to make your data preparation more efficient. This guide addresses common challenges, such as imputing missing values and encoding categorical variables, using practical solutions and open source Python libraries. You’ll learn advanced techniques for transforming numerical variables, discretizing variables, and dealing with outliers. Each chapter offers step-by-step instructions and real-world examples, helping you understand when and how to apply various transformations for well-prepared data.

The book explores feature extraction from complex data types such as dates, times, and text. You’ll see how to create new features through mathematical operations and decision trees and use advanced tools like Featuretools and tsfresh to extract features from relational data and time series.

By the end, you’ll be ready to build reproducible feature engineering pipelines that can be easily deployed into production, optimizing data preprocessing workflows and enhancing machine learning model performance.

Who is this book for?

If you're a machine learning or data science enthusiast who wants to learn more about feature engineering, data preprocessing, and how to optimize these tasks, this book is for you. If you already know the basics of feature engineering and are looking to learn more advanced methods to craft powerful features, this book will help you. You should have basic knowledge of Python programming and machine learning to get started.

What you will learn

  • Discover multiple methods to impute missing data effectively
  • Encode categorical variables while tackling high cardinality
  • Find out how to properly transform, discretize, and scale your variables
  • Automate feature extraction from date and time data
  • Combine variables strategically to create new and powerful features
  • Extract features from transactional data and time series
  • Learn methods to extract meaningful features from text data

Product Details

Publication date: Aug 30, 2024
Length: 396 pages
Edition: 3rd
Language: English
ISBN-13: 9781835883594


Table of Contents

13 Chapters
Chapter 1: Imputing Missing Data
Technical requirements
Removing observations with missing data
Performing mean or median imputation
Imputing categorical variables
Replacing missing values with an arbitrary number
Finding extreme values for imputation
Marking imputed values
Implementing forward and backward fill
Carrying out interpolation
Performing multivariate imputation by chained equations
Estimating missing data with nearest neighbors
Chapter 2: Encoding Categorical Variables
Technical requirements
Creating binary variables through one-hot encoding
Performing one-hot encoding of frequent categories
Replacing categories with counts or the frequency of observations
Replacing categories with ordinal numbers
Performing ordinal encoding based on the target value
Implementing target mean encoding
Encoding with Weight of Evidence
Grouping rare or infrequent categories
Performing binary encoding
Chapter 3: Transforming Numerical Variables
Transforming variables with the logarithm function
Transforming variables with the reciprocal function
Using the square root to transform variables
Using power transformations
Performing Box-Cox transformations
Performing Yeo-Johnson transformations
Chapter 4: Performing Variable Discretization
Technical requirements
Performing equal-width discretization
Implementing equal-frequency discretization
Discretizing the variable into arbitrary intervals
Performing discretization with k-means clustering
Implementing feature binarization
Using decision trees for discretization
Chapter 5: Working with Outliers
Technical requirements
Visualizing outliers with boxplots and the inter-quartile proximity rule
Finding outliers using the mean and standard deviation
Using the median absolute deviation to find outliers
Removing outliers
Bringing outliers back within acceptable limits
Applying winsorization
Chapter 6: Extracting Features from Date and Time Variables
Technical requirements
Extracting features from dates with pandas
Extracting features from time with pandas
Capturing the elapsed time between datetime variables
Working with time in different time zones
Automating the datetime feature extraction with Feature-engine
Chapter 7: Performing Feature Scaling
Technical requirements
Standardizing the features
Scaling to the maximum and minimum values
Scaling with the median and quantiles
Performing mean normalization
Implementing maximum absolute scaling
Scaling to vector unit length
Chapter 8: Creating New Features
Technical requirements
Combining features with mathematical functions
Comparing features to reference variables
Performing polynomial expansion
Combining features with decision trees
Creating periodic features from cyclical variables
Creating spline features
Chapter 9: Extracting Features from Relational Data with Featuretools
Technical requirements
Setting up an entity set and creating features automatically
Creating features with general and cumulative operations
Combining numerical features
Extracting features from date and time
Extracting features from text
Creating features with aggregation primitives
Chapter 10: Creating Features from a Time Series with tsfresh
Technical requirements
Extracting hundreds of features automatically from a time series
Automatically creating and selecting predictive features from time-series data
Extracting different features from different time series
Creating a subset of features identified through feature selection
Embedding feature creation into a scikit-learn pipeline
Chapter 11: Extracting Features from Text Variables
Technical requirements
Counting characters, words, and vocabulary
Estimating text complexity by counting sentences
Creating features with bag-of-words and n-grams
Implementing term frequency-inverse document frequency
Cleaning and stemming text variables
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book


About the author

Soledad Galli is a bestselling data science instructor, author, and open-source Python developer. As the leading instructor at Train in Data, she teaches intermediate and advanced courses in machine learning that have enrolled over 64,000 students worldwide and continue to receive positive reviews. Sole is also the developer and maintainer of the Python open-source library Feature-engine, which provides an extensive array of methods for feature engineering and selection. With extensive experience as a data scientist in finance and insurance sectors, Sole has developed and deployed machine learning models for assessing insurance claims, evaluating credit risk, and preventing fraud. She is a frequent speaker at podcasts, meetups, and webinars, sharing her expertise with the broader data science community.


