How to Use the ColumnTransformer for Data Preparation

By Jason Brownlee on December 31, 2020 in Data Preparation

You must prepare your raw data using data transforms prior to fitting a machine learning model.

This is required to ensure that you best expose the structure of your predictive modeling problem to the learning algorithms.

Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. It can be challenging when you have a dataset with mixed types and you want to selectively apply data transforms to some, but not all, input features.

Thankfully, the scikit-learn Python machine learning library provides the ColumnTransformer that allows you to selectively apply data transforms to different columns in your dataset.

In this tutorial, you will discover how to use the ColumnTransformer to selectively apply data transforms to columns in a dataset with mixed data types.

After completing this tutorial, you will know:

The challenge of using data transformations with datasets that have mixed data types.
How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.
How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Dec/2020: Fixed small typo in API example.

Use the ColumnTransformer for Numerical and Categorical Data in Python
Photo by Kari, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Challenge of Transforming Different Data Types
How to use the ColumnTransformer
Data Preparation for the Abalone Regression Dataset

Challenge of Transforming Different Data Types

It is important to prepare data prior to modeling.

This may involve replacing missing values, scaling numerical values, and one hot encoding categorical data.

Data transforms can be performed using the scikit-learn library; for example, the SimpleImputer class can be used to replace missing values, the MinMaxScaler class can be used to scale numerical values, and the OneHotEncoder can be used to encode categorical variables.

For example:

...
# prepare transform
scaler = MinMaxScaler()
# fit transform on training data
scaler.fit(train_X)
# transform training data
train_X = scaler.transform(train_X)

Sequences of different transforms can also be chained together using the Pipeline, such as imputing missing values, then scaling numerical values.

For example:

...
# define pipeline
pipeline = Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])
# transform training data
train_X = pipeline.fit_transform(train_X)

It is very common to want to perform different data preparation techniques on different columns in your input data.

For example, you may want to impute missing numerical values with the median and then scale them, while imputing missing categorical values with the most frequent value and then one hot encoding the categories.

Traditionally, this would require you to separate the numerical and categorical data and then manually apply the transforms on those groups of features before combining the columns back together in order to fit and evaluate a model.
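To make the manual approach concrete, here is a minimal sketch of splitting, transforming, and recombining the columns by hand. The toy array and the variable names (X_cat, X_num, X_prepared) are assumptions made for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy data (hypothetical): column 0 is categorical, columns 1-2 are numerical
X = np.array([['a', 1.0, 10.0],
              ['b', 2.0, 20.0],
              ['a', 3.0, 30.0]], dtype=object)

# manually separate the two groups of columns
X_cat = X[:, [0]]
X_num = X[:, [1, 2]].astype(float)

# transform each group on its own
cat_encoded = OneHotEncoder().fit_transform(X_cat).toarray()
num_scaled = MinMaxScaler().fit_transform(X_num)

# stitch the columns back together before modeling
X_prepared = np.hstack([cat_encoded, num_scaled])
print(X_prepared.shape)
```

Every step here would have to be repeated, in the same order, for any test or new data, which is the bookkeeping that becomes error-prone as the number of column groups grows.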

Now, you can use the ColumnTransformer to perform this operation for you.

How to use the ColumnTransformer

The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.

For example, it allows you to apply a specific transform or sequence of transforms to just the numerical columns, and a separate sequence of transforms to just the categorical columns.

To use the ColumnTransformer, you must specify a list of transformers.

Each transformer is a three-element tuple that defines the name of the transformer, the transform to apply, and the column indices to apply it to. For example:

(Name, Object, Columns)

For example, the ColumnTransformer below applies a OneHotEncoder to columns 0 and 1.

...
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])

The example below applies a SimpleImputer with median imputing for numerical columns 0 and 1, and SimpleImputer with most frequent imputing to categorical columns 2 and 3.

...
t = [('num', SimpleImputer(strategy='median'), [0, 1]), ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]
transformer = ColumnTransformer(transformers=t)
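The transform in each tuple can itself be a Pipeline, which is one way to apply a sequence of transforms (for example, impute and then scale) to a group of columns. The sketch below assumes a small made-up array with missing values; the step names ('i', 's', 'e') and the variable names are illustrative only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy data (hypothetical): columns 0-1 numerical, column 2 categorical, with gaps
X = np.array([[1.0, 10.0, 'red'],
              [np.nan, 20.0, 'blue'],
              [3.0, np.nan, 'red'],
              [5.0, 40.0, np.nan]], dtype=object)

# numerical columns: impute the median, then scale to [0, 1]
num_steps = Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])
# categorical columns: impute the most frequent value, then one hot encode
cat_steps = Pipeline(steps=[('i', SimpleImputer(strategy='most_frequent')), ('e', OneHotEncoder())])

t = [('num', num_steps, [0, 1]), ('cat', cat_steps, [2])]
transformer = ColumnTransformer(transformers=t)
prepared = transformer.fit_transform(X)
print(prepared.shape)
```

Listing the imputer and the encoder as two separate tuples over the same columns would not chain them, because each tuple receives the original columns independently; wrapping the steps in a Pipeline is what makes them run in sequence.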

Any columns not specified in the list of “transformers” are dropped from the dataset by default; this can be changed by setting the “remainder” argument.

Setting remainder='passthrough' will mean that all columns not specified in the list of “transformers” will be passed through without transformation, instead of being dropped.

For example, if columns 0 and 1 were numerical and columns 2 and 3 were categorical and we wanted to just transform the categorical data and pass through the numerical columns unchanged, we could define the ColumnTransformer as follows:

...
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])], remainder='passthrough')
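The effect of the “remainder” argument is easiest to see by comparing output shapes. The following sketch uses a made-up four-column array (an assumption for illustration) and encodes only the two categorical columns, once with the default behavior and once with passthrough.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# toy data (hypothetical): columns 0-1 numerical, columns 2-3 categorical
X = np.array([[0.1, 5.0, 'a', 'x'],
              [0.2, 6.0, 'b', 'y'],
              [0.3, 7.0, 'a', 'y']], dtype=object)

encode_cat = [('cat', OneHotEncoder(), [2, 3])]

# default remainder='drop': only the encoded columns survive
dropped = ColumnTransformer(transformers=encode_cat).fit_transform(X)

# remainder='passthrough': the untouched numerical columns are kept
kept = ColumnTransformer(transformers=encode_cat, remainder='passthrough').fit_transform(X)

print(dropped.shape)
print(kept.shape)
```

Note that the passed-through columns are appended after the transformed ones, so the column order of the output differs from the input.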

Once the transformer is defined, it can be used to transform a dataset.

For example:

...
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])
# transform training data
train_X = transformer.fit_transform(train_X)
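Once fit on the training data, the same transformer object can be reused to prepare other data with transform(). Below is a small sketch under the assumption of two tiny made-up train and test arrays; train_enc and test_enc are illustrative names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# toy train/test splits (hypothetical): two categorical columns
train_X = np.array([['a', 'x'], ['b', 'y'], ['a', 'y']], dtype=object)
test_X = np.array([['b', 'x']], dtype=object)

transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])
# learn the categories from the training data and encode it
train_enc = transformer.fit_transform(train_X)
# reuse the categories learned above to encode the test data
test_enc = transformer.transform(test_X)
print(train_enc.shape, test_enc.shape)
```

Calling transform() rather than fit_transform() on the test data keeps the encoding consistent with what was learned from the training data.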

A ColumnTransformer can also be used in a Pipeline to selectively prepare the columns of your dataset before fitting a model on the transformed data.

This is the most likely use case as it ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future.

For example:

...
# define model
model = LogisticRegression()
# define transform
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])
# define pipeline
pipeline = Pipeline(steps=[('t', transformer), ('m', model)])
# fit the pipeline on the raw training data
pipeline.fit(train_X, train_y)
# make predictions
yhat = pipeline.predict(test_X)
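For readers who want to run the pattern end to end, here is a self-contained sketch of the same pipeline on a tiny made-up classification dataset. The data, in which the label simply follows the first column, is an assumption chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy data (hypothetical): two categorical inputs, label follows column 0
train_X = np.array([['a', 'x'], ['b', 'y'], ['a', 'y'], ['b', 'x']] * 5, dtype=object)
train_y = np.array([0, 1, 0, 1] * 5)
test_X = np.array([['a', 'x'], ['b', 'y']], dtype=object)

# define transform and model, then chain them
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])
pipeline = Pipeline(steps=[('t', transformer), ('m', LogisticRegression())])

# fit the whole pipeline on the raw training data
pipeline.fit(train_X, train_y)
# the raw test data is encoded automatically before prediction
yhat = pipeline.predict(test_X)
print(yhat)
```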

Now that we are familiar with how to configure and use the ColumnTransformer in general, let’s look at a worked example.

Data Preparation for the Abalone Regression Dataset

The abalone dataset is a standard machine learning problem that involves predicting the age of an abalone given measurements of an abalone.

You can download the dataset and learn more about it here:

The dataset has 4,177 examples, 8 input variables, and the target variable is an integer.

A naive model can achieve a mean absolute error (MAE) of about 2.363 (std 0.092) by predicting the mean value, evaluated via 10-fold cross-validation.
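A baseline of this kind can be sketched with scikit-learn's DummyRegressor. The helper name baseline_mae and the synthetic stand-in data below are assumptions for illustration; the numbers it prints are not the abalone results quoted above.

```python
from numpy import absolute, mean, std
from numpy.random import default_rng
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

def baseline_mae(X, y):
    """Return (mean, std) of the MAE for a predict-the-mean model under 10-fold CV."""
    model = DummyRegressor(strategy='mean')
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
    # scikit-learn reports negated MAE, so convert to positive values
    scores = absolute(scores)
    return mean(scores), std(scores)

# synthetic stand-in data (hypothetical), not the abalone dataset
rng = default_rng(1)
X = rng.random((200, 3))
y = rng.integers(1, 30, size=200)
print('MAE: %.3f (%.3f)' % baseline_mae(X, y))
```

Any model worth keeping should achieve a lower MAE than this baseline on the same data and evaluation procedure.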

We can model this as a regression predictive modeling problem with a support vector machine model (SVR).

Reviewing the data, you can see the first few rows as follows:

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
...

We can see that the first column is categorical and the remainder of the columns are numerical.

We may want to one hot encode the first column and normalize the remaining numerical columns, and this can be achieved using the ColumnTransformer.

First, we need to load the dataset. We can load the dataset directly from the URL using the read_csv() Pandas function, then split the data into two data frames: one for input and one for the output.

The complete example of loading the dataset is listed below.

# load the dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
print(X.shape, y.shape)

Note: if you have trouble loading the dataset from a URL, you can download the CSV file with the name 'abalone.csv', place it in the same directory as your Python file, and change the call to read_csv() as follows:

...
dataframe = read_csv('abalone.csv', header=None)

Running the example, we can see that the dataset is loaded correctly and split into eight input columns and one target column.

(4177, 8) (4177,)

Next, we can use the select_dtypes() function to select the column indexes that match different data types.

We are interested in a list of columns that are numerical columns marked as 'float64' or 'int64' in Pandas, and a list of categorical columns, marked as 'object' or 'bool' type in Pandas.

...
# determine categorical and numerical features
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
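To see what these two calls return, here is a small sketch on a made-up data frame with integer column labels, similar to what read_csv() produces with header=None. The frame and its values are assumptions for illustration, and the first column is given the object type explicitly.

```python
import pandas as pd

# toy frame (hypothetical) with integer column labels
X = pd.DataFrame({
    0: pd.Series(['M', 'F', 'I'], dtype=object),
    1: [0.455, 0.53, 0.33],
    2: [15, 9, 7],
    3: [True, False, True],
})

# determine categorical and numerical features
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
print(list(numerical_ix))
print(list(categorical_ix))
```

Each result is a Pandas Index of column labels, which can be passed directly as the third element of a ColumnTransformer tuple.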

We can then use these lists in the ColumnTransformer to one hot encode the categorical variables, which should just be the first column.

We can also use the list of numerical columns to normalize the remaining data.

...
# define the data preparation for the columns
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)

Next, we can define our SVR model and define a Pipeline that first uses the ColumnTransformer, then fits the model on the prepared dataset.

...
# define the model
model = SVR(kernel='rbf', gamma='scale', C=100)
# define the data preparation and modeling pipeline
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])

Finally, we can evaluate the model using 10-fold cross-validation and calculate the mean absolute error, averaged across all 10 evaluations of the pipeline.

...
# define the model cross-validation configuration
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# evaluate the pipeline using cross validation and calculate MAE
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# convert MAE scores to positive values
scores = absolute(scores)
# summarize the model performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this all together, the complete example is listed below.

# example of using the ColumnTransformer for the Abalone dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
print(X.shape, y.shape)
# determine categorical and numerical features
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
# define the data preparation for the columns
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)
# define the model
model = SVR(kernel='rbf', gamma='scale', C=100)
# define the data preparation and modeling pipeline
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])
# define the model cross-validation configuration
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# evaluate the pipeline using cross validation and calculate MAE
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# convert MAE scores to positive values
scores = absolute(scores)
# summarize the model performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the data preparation pipeline using 10-fold cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we achieve an average MAE of about 1.465, which is better than the baseline score of 2.363.

(4177, 8) (4177,)
MAE: 1.465 (0.047)

You now have a template for using the ColumnTransformer on a dataset with mixed data types that you can use and adapt for your own projects in the future.

Summary

In this tutorial, you discovered how to use the ColumnTransformer to selectively apply data transforms to columns in datasets with mixed data types.

Specifically, you learned:

The challenge of using data transformations with datasets that have mixed data types.
How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.
How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

69 Responses to How to Use the ColumnTransformer for Data Preparation

Venkatesh Gandi January 20, 2020 at 7:06 am
Wow this function with the pipeline seems to be magical. I really wanted to know what pipelines can do(the power of pipelines). I have seen the documentation of pipeline(), it’s not as simple as you explain 🙂 I really likes your way of explanation. Have you written any blogs on introduction to pipelines which can start with simple and explain complicated pipelines. if so, please share the links.
if not, Can you please share some references where we can learn more about the sklearn pipeline?
- Jason Brownlee January 20, 2020 at 8:47 am
  Thanks.
  Yes, maybe here:
  https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
rahul malik April 22, 2020 at 8:18 am
I have followed your example , I am having a dataset with 3 columns as feature and 1 as label and all are categorical, no numeric values. I have printed X,y and seems they are correct but out is coming as MAE: nan (nan). Can you please suggest what wrong I am doing.
- Jason Brownlee April 22, 2020 at 10:12 am
  Perhaps check if you have a nan in your input data?
Grzegorz Kępisty April 30, 2020 at 4:47 pm
Very useful utility from sklearn!
I was wondering how to get the full list of transformations that can be applied to ColumnTransformer and the best reference I found is below:
https://scikit-learn.org/stable/modules/preprocessing.html#
Do you know maybe some broader source for this topic?
Regards!
- Jason Brownlee May 1, 2020 at 6:32 am
  That is a great start.
  No, but I have a ton of tutorials on this theme written and scheduled. Get ready!
ashar138 May 19, 2020 at 2:57 pm
Love this. Makes you realize how ColumnTransformer can make your life radically easy.
But what I didn’t get is are the transformations applied in series ?
Cause I want to use an Imputer first and then normalize/onehot it.
transformers =[ (imputer cat) , (imputer num) , (normalize num), (onehot cat)]
I assume this can be done ?
- Jason Brownlee May 20, 2020 at 6:19 am
  It sure does!
  Yes, you can apply different sequences to different subsets of features.
  - ashar138 May 20, 2020 at 5:58 pm
    Turns out you can't :\
    it throws error when dealing with NaN values, even though the imputer is the first transformation.
    Need to use Imputer first and then Minmax AGAIN
    - Jason Brownlee May 21, 2020 at 6:12 am
      I give an example of imputing missing values and transforming sequentially using the columntransformer here:
      https://machinelearningmastery.com/results-for-standard-classification-and-regression-machine-learning-datasets/
      Specifically the Auto Imports example.
      - ashar138 May 26, 2020 at 1:12 pm
        Okay I am a little confused now..
        So what we do is we create a Pipeline, feed it to a ColumnTransformer and we feed that to yet another Pipeline.
        But then why do we even need Pipeline ? For example,
        trans1 = ColumnTransformer([('catimp', SimpleImputer(strategy='most_frequent'), cat_var), ('enc', OneHotEncoder(handle_unknown='ignore'), cat_var), ('imp', SimpleImputer(strategy= 'median'),num_var )], remainder='passthrough')
        steps = [('c', Pipeline(steps=[('s',SimpleImputer(strategy='most_frequent')), ('oe',OneHotEncoder(handle_unknown='ignore'))]), cat_var), ('n', SimpleImputer(strategy='median'), num_var)]
        trans2 = ColumnTransformer(transformers=steps, remainder='passthrough')
        Why are trans1 and trans2 different ?
      - Jason Brownlee May 26, 2020 at 1:24 pm
        Good question. You don’t have to use it, it’s just another tool we have available.
        Each group of features can be prepared with a pipeline.
        We can also have a master pipeline with data prep and modeling – then use cross-validation to evaluate it.
      - ashar138 May 26, 2020 at 1:45 pm
        I see.. Think I need more time with it..
        Actually the example I shared below produces 2 different results. The one without the pipeline gives a NaN found error. Perhaps the Pipeline is needed to execute in a sequence while ColumnTransformer doesn’t do that ?
        Another issue I observed was while doing transformations of train and valid datasets.
        The resultant train dataset returned a scipy.sparse.csr.csr_matrix while the valid data just returned an ndarray.
        I reduced the sparse_threshold but that results in a feature mismatch while predicting on the validation dataset.
        Anyways I have bothered you enough already, I will figure it out somehow 🙂
      - Jason Brownlee May 27, 2020 at 7:41 am
        Transforms like one hot encoding will create a sparse matrix by default, you can set an argument to force them to create a dense matrix instead.
Manideep May 27, 2020 at 4:45 am
if i have to use simple imputer and onehotencoder both for a set of categorical columns.could u please tell me what should i do?
- Jason Brownlee May 27, 2020 at 8:02 am
  Impute then encode in a pipeline.
MS July 1, 2020 at 1:47 am
numerical_x = reduced_df.select_dtypes(include=['int64', 'float64']).columns
categorical_x = reduced_df.select_dtypes(include=['object', 'bool']).columns
t = [(('le',LabelEncoder(),categorical_x),('ohe',OneHotEncoder(),categorical_x),
('catimp', SimpleImputer(strategy='most_frequent'),categorical_x)),
(('num',SimpleImputer(strategy='median'),numerical_x),('sts',StandardScaler(), numerical_x))]
col_transform = ColumnTransformer(transformers=t)
dt= DecisionTreeClassifier()
pl= Pipeline(steps=[('prep',col_transform), ('dt', dt)])
pl.fit(reduced_df,y_train)
pl.score(reduced_df,y_train)
the above code gives this error=>
names, transformers, _ = zip(*self.transformers)
ValueError: not enough values to unpack (expected 3, got 2)
can u please help me
- Jason Brownlee July 1, 2020 at 5:54 am
  I’m sorry to hear that, this may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
Volkan Yurtseven July 1, 2020 at 5:50 am
In the example just after the paragraph “This is the most likely use case as it ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future..” you say
model.fit(train_X, train_y),
but i think it should be:
pipeline.fit(train_X, train_y)
- Jason Brownlee July 1, 2020 at 5:59 am
  Agreed. Fixed, thank you!
Navish July 11, 2020 at 12:04 am
Hi,
Thanks for the excellent article.
Had three clarifications:
1. So if I ever have to do more than just transform values in preprocessing (like impute & then transform), I need to put each individual combination in a Pipeline? These cannot directly go into the column transformer?
2. I assume we can do multiple splits of the dataframe too? Not just 2? Like if we want to LabelEncode some cat columns, ‘most_frequent’ impute & OneHotEncode other cat columns, impute ‘median’ for a few num columns and impute ‘mean’ for others followed by scaling them?
3. Lastly, if you want to drop certain columns – is it recommended to do so at first outside of any pipeline/transformer?
- Jason Brownlee July 11, 2020 at 6:17 am
  Correct, a pipeline.
  Yes, any subsets you like.
  Hmmm, good question. Probably – off the cuff that makes sense.
Navish July 12, 2020 at 4:06 am
Thanks for the reply.
I was reading the entire syntax of Pipeline & ColumnTransformer.
There is an explicit drop option in there too, where you can specify specific columns to be dropped. Might actually be better to use that, just to make it easier for validation/test/production data preprocessing.
This way you can use the 'passthrough' option while dropping columns you want.
- Jason Brownlee July 12, 2020 at 6:00 am
  Agreed!
Nalla July 27, 2020 at 2:13 am
Hi Jason,
I have some clarifications.
1. If I am using SMOTE and if i prefer using imblearn pipeline.
Is there a way in imblearn pipeline to do a imputer or should we rely on sklearn pipeline
2. How to integrate imblearn pipeline and sklearn or is it advisable to do so?
Thanks,
Nalla
- Jason Brownlee July 27, 2020 at 5:50 am
  Yes, you can use any data prep you like in the imbalanced learn pipeline.
  Use the imbalanced learn pipeline just like you would an sklearn pipeline.
  - Nalla July 29, 2020 at 1:56 am
    Thanks much for the clarification but I would also like to know something in detail
    What is the difference or similarity between the sklearn and imblearn pipeline then?
    As far as I know – SMOTE can be performed only in imblearn pipeline it cannot be done in sklearn
    But how do other functionality work?
    Could you give an example to illustrate both of them (similarly and dissimilarity )- if you have any.
    And also the possibility of integrating the above two pipelines (i.e)
    imblearn and sklearn?
    Is the above possible? Like say if I notice an imbalance in dataset so I have decided to go with imblearn pipeline – any means of adding up sklearn pipeline for Imputer or scaling after doing SMOTE in imblearn.
    - Jason Brownlee July 29, 2020 at 5:54 am
      The imblearn pipeline allows you to use data sampling methods as well – e.g. methods that change the number of rows.
      That is the only notable difference.
Binyamin August 23, 2020 at 11:39 pm
Hi
I was looking to apply a scaler and one hot encoding in place so that I don’t lose the index of my data frame . I came across your blog demonstrating the column transformer and, this was exactly what i needed. however there seems to be one drawback to this method, you lose the names of the features. Column transformer returns an sparse matrix scaled and encoded.
Is there a way to retrieve to column names and recreate a data frame? This would help when it comes to model interpretation
Thanks for your time, your blogs are amazing
- Jason Brownlee August 24, 2020 at 6:25 am
  Thanks!
  Hold the names of the features in a separate array. No need to have that information in your model.
Diego September 13, 2020 at 2:34 am
Dear Jason,
Let’s say we have a dataset with 5 features.
We apply column_transformer only to 2nd a 3rd columns, and passthrough the rest of them.
After applying the pipeline, the new order of columns will be: 2,3,1,4,5 (first the ones that have been transformed and then the rest of them)
At the end, the dataset ends up having a different column order, why ? Is there any way to avoid this ?
Thanks a lot
- Jason Brownlee September 13, 2020 at 6:09 am
  Does it matter? You only need the prediction output from the model. The row order is unchanged.
  - Diego September 14, 2020 at 12:42 am
    Good point actually 🙂
    I think it only matters if one of the steps is RFE or RFECV and we want to know what features were selected.
    Thanks!!
    - Jason Brownlee September 14, 2020 at 6:50 am
      Sorry Diego, perhaps I’m missing something. How would it matter?
      - Diego September 15, 2020 at 4:40 am
        Hi Jason,
        Sorry may be I was not clear.
        Let’s say we have features 1,2,3,4 and 5 as part of a Pipeline
        Step 1 is MinMaxScaler for features 4 and 5 only (using column_transformer)
        Step 2 is RFE or RFECV
        Step 3 is a model
        After step 1, new column order is 4,5,1,2,3
        If step 2 (RFE) says that it kept only first two columns: which two columns were selected ? 4 and 5 or 1 and 2?
        I think 4 and 5, since at that point columns were re-ordered.
        It happened to me and I couldn’t believe RFE had selected those features, but when I realized they were re-ordered, then it made sense.
        I hope it is more clear now.
        Thanks a lot for your help
      - Jason Brownlee September 15, 2020 at 5:29 am
        I don’t see any problem.
        It does not matter what RFE selected, as you are evaluating the modeling pipeline. You are answering the question “what is the performance of this pipeline” not “what features would RFE select”. If you want to know the latter, you can study that in isolation.
        You allow RFE to choose features based on training data, just like you allow the model to fit internal parameters based on the training data. You don’t ask “what are the values of the coefficients internal to the model” because it doesn’t really matter when you’re focused on model skill.
        That all being said, you can access the RFE model in the pipeline or keep a reference to it and report the features that were selected from a single run.
  - Richard March 10, 2021 at 5:22 am
    Hi, one occasion where it matters is with LightGBM and categorical features. Using the sklearn API with LightGBM, the categorical features are specified as a parameter to .fit(). Since the DataFrame is casted to a numpy array during transformation (with for instance StandardScaler()), it is practical to specify categorical features with a list of int. Reordering of columns then makes for a “hard to find” bug.
Diego September 16, 2020 at 11:33 pm
Hi Jason,
Yes. You’re right. I should compare an entire pipeline Vs. another one, no matter what variables were selected.
From that perspective it makes sense.
Thanks for your explanation and dedication!!
Reply
- Jason BrownleeSeptember 17, 2020 at 6:47 am#
  You’re welcome.
  Reply
KeoneNovember 18, 2020 at 11:37 am#
Is there a way to inverse_transform via ColumnTransformer?
Reply
- Jason BrownleeNovember 18, 2020 at 1:08 pm#
  Good question.
  It does not look like it:
  https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
  Reply
ZinebDecember 30, 2020 at 8:54 pm#
Hi Jason,
In the second example of the section, I think you want pipeline.fit_transform instead of scaler.fit_transform!
Reply
- ZinebDecember 30, 2020 at 8:56 pm#
  the section “Challenge of Transforming Different Data Types”
  Reply
- Jason BrownleeDecember 31, 2020 at 5:25 am#
  Thanks! Fixed.
  Reply
SULAIMAN KHANFebruary 15, 2021 at 10:38 pm#
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)
#############
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: indices[55,1] = 170355 is not in [0, 5000)
[[node sequential_3/embedding_3/embedding_lookup (defined at :9) ]]
(1) Invalid argument: indices[55,1] = 170355 is not in [0, 5000)
[[node sequential_3/embedding_3/embedding_lookup (defined at :9) ]]
[[Adam/Adam/update/AssignSubVariableOp/_35]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_12567]
Errors may have originated from an input operation.
Input Source operations connected to node sequential_3/embedding_3/embedding_lookup:
sequential_3/embedding_3/embedding_lookup/11365 (defined at /usr/lib/python3.6/contextlib.py:81)
Input Source operations connected to node sequential_3/embedding_3/embedding_lookup:
sequential_3/embedding_3/embedding_lookup/11365 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function -> train_function
How to fix this error?
Reply
- Jason BrownleeFebruary 16, 2021 at 6:06 am#
  Sorry to hear that, these tips may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  Reply
SULAIMAN KHANFebruary 15, 2021 at 10:41 pm#
I combined all my numeric and categorical columns. They are giving an error in the loss function.
Reply
- Jason BrownleeFebruary 16, 2021 at 6:06 am#
  Perhaps inspect the output of your data preparation step to confirm the data has the shape and values that you expect?
  Reply
LeenApril 17, 2021 at 2:07 pm#
Hello Jason,
Great tutorial. Can we make the ColumnTransformer part of param_grid? I want to transform some columns using categorical encoding mechanisms, but there are several of interest: Target Encoding, Weight of Evidence encoding, CatBoost encoding, etc. Therefore, I want each encoding mechanism to be a hyperparameter inside the grid search param_grid. Do you think it’s possible?
Reply
- Jason BrownleeApril 18, 2021 at 5:51 am#
  Hmm, you might have to run your grid search for loop manually.
  Reply
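As an alternative to a manual loop, scikit-learn's nested parameter names may be enough here: a named transformer inside a ColumnTransformer can be replaced through `set_params`, so it can appear as a key in `param_grid`. The sketch below is an assumption-laden illustration: the step names `prep` and `enc`, the toy data, and the use of `OneHotEncoder` and `OrdinalEncoder` as stand-ins for the encoders named in the question are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
levels = np.array(["a", "b", "c"])
df = pd.DataFrame({
    "cat": levels[np.arange(60) % 3],
    "num": rng.normal(size=60),
})
# Hypothetical target: only level "b" shifts the outcome, so an ordinal
# encoding (a=0, b=1, c=2) cannot represent the effect with a linear model.
y = df["num"] + (df["cat"] == "b") * 2.0 + rng.normal(scale=0.1, size=60)

pipeline = Pipeline([
    ("prep", ColumnTransformer([("enc", OneHotEncoder(), ["cat"])],
                               remainder="passthrough")),
    ("model", Ridge()),
])

# "prep__enc" addresses the transformer named "enc" inside the step "prep";
# each candidate replaces that transformer as a whole.
param_grid = {
    "prep__enc": [
        OneHotEncoder(handle_unknown="ignore"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    ],
}
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(df, y)
print(type(search.best_params_["prep__enc"]).__name__)
```

Encoders from other libraries should slot into the same list as long as they follow the scikit-learn transformer interface.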
JGApril 19, 2021 at 6:05 pm#
Hi Jason,
I see ColumnTransformer() as a very powerful module that is applied globally while still applying a distinct transformation to each feature of the dataset.
I think it is useful not only inside Pipeline steps but also stand-alone, to get an inside analysis of the features.
Thank you very much for this tutorial!
regards,
Reply
- Jason BrownleeApril 20, 2021 at 5:55 am#
  Agreed!
  Reply
LeoMay 7, 2021 at 11:48 am#
Thanks Jason!
I’m having a hard time applying columnTransform (or pipelines) to a group of: (i) one NLP-based feature (that require count vectorizer or tf-idf fit-transform); and (ii) other simpler features that require standard scaling – any chance you’ve ever faced a similar problem and know how to solve it?
The count vectorizer (or tf-idf) outputs a sparse matrix (while the rest remain as pandas series/df)… should I use numpy arrays for all of them?
Thanks again!
Reply
- Jason BrownleeMay 8, 2021 at 6:31 am#
  Perhaps get your columntransformer working with a minimal dataset or pipeline, then add elements back until you discover the cause of the issue.
  Reply
LeoMay 9, 2021 at 6:09 am#
Sorted w/ FeatureUnion and ColumnSelector (from mlxtend)!
cat_pipe = Pipeline([('selector', ColumnSelector(cat_features)),
                     ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                     ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))])
num_pipe = Pipeline([('selector', ColumnSelector(num_features)),
                     ('imputer', SimpleImputer(strategy='mean')),
                     ('scaler', MinMaxScaler())])
text_pipe = Pipeline([('selector', ColumnSelector(text_features, drop_axis=True)),
                      ('tf', TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x))])
preprocessor = FeatureUnion(transformer_list=[('cat', cat_pipe),
                                              ('num', num_pipe),
                                              ('nlp', text_pipe)])
Reply
- Jason BrownleeMay 10, 2021 at 6:18 am#
  Well done!
  Reply
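The same mix of text and numeric features can also be handled by a ColumnTransformer alone. A minimal sketch, assuming a toy DataFrame with hypothetical columns `review` and `length`; the key detail is that the text column is selected with a plain string, while the numeric block is selected with a list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "review": ["good product", "bad product", "good value", "bad service"],
    "length": [12.0, 11.0, 10.0, 11.0],
})

ct = ColumnTransformer(
    [
        # A plain string selects a 1D column, which text vectorizers expect.
        ("text", TfidfVectorizer(), "review"),
        # A list selects a 2D block, which scalers expect.
        ("num", StandardScaler(), ["length"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
Xt = ct.fit_transform(df)
print(Xt.shape)
```

Leaving `sparse_threshold` at its default lets the ColumnTransformer return a sparse matrix when the stacked result is sparse enough, which is usually preferable for a large vocabulary.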
JGJuly 3, 2021 at 9:43 pm#
Hi Jason,
thanks for the tutorial!
Always a valuable piece of code to perform other experiments!
my experiments:
It is true that the ColumnTransformer() API applies a different transformation per column or feature (great!), and that the remainder= argument can be set to ‘passthrough’. OK.
But my complaint about the sklearn API is that the output columns follow the order in which the transformations are listed, so if you decide, e.g., to list the column 3 transformation before the column 2 one, the transformed column 3 takes the place of column 2.
In addition, even if you ‘passthrough’ an untransformed remainder column that sits between two transformed columns, the ColumnTransformer() moves the transformed columns up and leaves the untransformed column at the end of the column series.
This behaviour adds a lot of confusion about how we would expect the ColumnTransformer to work, and it makes you waste time on problems that a good API design could avoid :-((
unless there are other arguments that can be set to avoid this unexpected reordering of columns.
Anyway, as I said before, thanks to your piece of code this behaviour can be foreseen.
regards,
Reply
- Jason BrownleeJuly 4, 2021 at 6:03 am#
  Agreed.
  It might be easier to stack ColumnTransformers into a pipeline and perform one subset/operation at a time in sequence.
  Reply
Vishwanath reddyJuly 22, 2021 at 4:43 am#
Thank you very much
It was very helpful
Reply
- Jason BrownleeJuly 22, 2021 at 5:38 am#
  You’re welcome.
  Reply
Cyanide LancerDecember 22, 2021 at 9:48 pm#
Just a basic question, why haven’t you considered splitting the dataset into train and test for features and labels?
Reply
Bobby KMay 8, 2022 at 1:13 pm#
Good stuff!
You manage to strike a nice balance between complexity and usability, and the result is an easy-to-read, easy-to-follow example that opens the door to a larger field to be explored at one’s own pace.
Reply
- James CarmichaelMay 9, 2022 at 11:03 am#
  Thank you for the feedback and support Bobby!
  Reply
GiovannaApril 30, 2023 at 1:33 am#
Hi,
I’ve been researching high and low for an answer and haven’t been lucky yet. I’m training a pipeline that consists of data transformation pipelines for both numerical and categorical, added to a column transformer pipeline:
num_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('imputer', IterativeImputer(random_state=seed, max_iter=100,
                                 estimator=DecisionTreeRegressor(max_features='sqrt',
                                                                 random_state=seed)))])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
data_pipeline = ColumnTransformer([
    ('numerical', num_pipeline, num_feats),
    ('categorical', cat_pipeline, cat_feats)])
The issue that I’m facing is that I will fit_transform this data_pipeline to my training data and save this trained pipeline with joblib dump to use it for transforming with .transform() the validation data and also for inference in production. So far so good, however, when my training data is large-ish the saved file is very large and takes a long time to load (example: trained with 1.4M rows, saved pipeline is 49GB).
I’m wondering what I could be doing to make this saved trained data transformation pipeline lighter. Any tips?
Reply
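One possible contributor, offered as an assumption since the data is not available: a fitted IterativeImputer keeps a fitted copy of its estimator for every imputed feature and every iteration (the `imputation_sequence_` attribute), so 100 iterations of unbounded decision trees on a large training set can add up to a very large object. The sketch below uses small synthetic data and a hypothetical helper `pickled_size` to compare the pickled size with and without a depth limit:

```python
import pickle

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values


def pickled_size(max_depth):
    """Fit an imputer with the given tree depth and return its pickled size in bytes."""
    imputer = IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=max_depth, random_state=0),
        max_iter=3,
        random_state=0,
    )
    imputer.fit(X)
    return len(pickle.dumps(imputer))


size_unbounded = pickled_size(None)  # trees grow until the leaves are pure
size_shallow = pickled_size(4)       # trees are capped at depth 4
print(size_unbounded, size_shallow)
```

Limiting tree size (for example with `max_depth` or `min_samples_leaf`), lowering `max_iter`, and passing the `compress` argument to `joblib.dump` are options worth testing, at the cost of checking that imputation quality stays acceptable.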
PascalAugust 7, 2023 at 10:20 pm#
Hi, thanks for writing these examples down.
In the last line of your code example: I am not 100% sure, but is it OK to take
std(scores)
here?
This is because the values were previously set to absolute. The distances between the values can therefore be different due to the “zero border”. So if I had a value of -0.1 before and the mean is 0.1, then the distance = 0.2; but after I apply absolute(), then the value is now 0.1 and has a distance of 0 to 0.1. This affects the standard deviation, doesn’t it?
So you would have to take the standard deviation from the scores BEFORE applying absolute(), or am I wrong?
Reply
- James CarmichaelAugust 8, 2023 at 10:22 am#
  Hi Pascal…The following may help clarify how to use std():
  https://numpy.org/doc/stable/reference/generated/numpy.std.html
  Reply
PascalAugust 10, 2023 at 7:14 pm#
This does not answer my question… see this:
Series of values before taking the absolute values:
0.1, -0.1, 0.5, -0.2, -0.5, -0.7, 0.8
std here: 0.530498418
Series of values after taking the absolute values:
0.1, 0.1, 0.5, 0.2, 0.5, 0.7, 0.8
std here: 0.285356919
Reply
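The figures in this comment can be reproduced with NumPy using the sample standard deviation (`ddof=1`). The sketch also checks the case where every value has the same sign, as happens when the scores are negated errors; there, taking the absolute value only flips the sign and leaves the spread unchanged.

```python
import numpy as np

# Mixed-sign values: absolute() folds the negatives onto the positives.
signed = np.array([0.1, -0.1, 0.5, -0.2, -0.5, -0.7, 0.8])
std_signed = np.std(signed, ddof=1)
std_absolute = np.std(np.abs(signed), ddof=1)
print(round(std_signed, 6), round(std_absolute, 6))

# Same-sign values (all negative): absolute() is just a sign flip.
same_sign = -np.abs(signed)
std_same_sign = np.std(same_sign, ddof=1)
std_same_sign_abs = np.std(np.abs(same_sign), ddof=1)
print(np.isclose(std_same_sign, std_same_sign_abs))
```

So the concern is valid for mixed-sign scores, while for scores that are all negative (or all positive) the standard deviation is the same before and after `absolute()`.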
neelanshuniMarch 3, 2024 at 9:57 pm#
Hi, I’m facing an error regarding fit_transform. Here is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
import category_encoders as ce
df = pd.read_csv('gurgaon_properties_post_feature_selection_v2.csv')
X = df.drop(columns=['price'])
Y_t = np.log1p(Y)
t = [
    ('num', StandardScaler(), ['bedRoom', 'bathroom', 'built_up_area', 'servant room', 'store room']),
    ('cat', OrdinalEncoder(), columns_to_encode),
    ('cat1', OneHotEncoder(drop='first', sparse_output=False), ['agePossession']),
    ('target_enc', ce.TargetEncoder(), ['sector'])
]
transformer = ColumnTransformer(transformers=t, remainder='passthrough')
transformer.fit_transform(X)
Getting below error :
—————————————————————————
TypeError Traceback (most recent call last)
Cell In[40], line 1
—-> 1 transformer.fit_transform(X)
File ~\anaconda3\lib\site-packages\sklearn\utils\_set_output.py:142, in _wrap_method_output..wrapped(self, X, *args, **kwargs)
140 @wraps(f)
141 def wrapped(self, X, *args, **kwargs):
–> 142 data_to_wrap = f(self, X, *args, **kwargs)
143 if isinstance(data_to_wrap, tuple):
144 # only wrap the first output for cross decomposition
145 return (
146 _wrap_data_with_container(method, data_to_wrap[0], X, self),
147 *data_to_wrap[1:],
148 )
File ~\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:727, in ColumnTransformer.fit_transform(self, X, y)
724 self._validate_column_callables(X)
725 self._validate_remainder(X)
–> 727 result = self._fit_transform(X, y, _fit_transform_one)
729 if not result:
730 self._update_fitted_transformers([])
File ~\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:658, in ColumnTransformer._fit_transform(self, X, y, func, fitted, column_as_strings)
652 transformers = list(
653 self._iter(
654 fitted=fitted, replace_strings=True, column_as_strings=column_as_strings
655 )
656 )
657 try:
–> 658 return Parallel(n_jobs=self.n_jobs)(
659 delayed(func)(
660 transformer=clone(trans) if not fitted else trans,
661 X=_safe_indexing(X, column, axis=1),
662 y=y,
663 weight=weight,
664 message_clsname=”ColumnTransformer”,
665 message=self._log_message(name, idx, len(transformers)),
666 )
667 for idx, (name, trans, column, weight) in enumerate(transformers, 1)
668 )
669 except ValueError as e:
670 if “Expected 2D array, got 1D array instead” in str(e):
File ~\anaconda3\lib\site-packages\sklearn\utils\parallel.py:63, in Parallel.__call__(self, iterable)
58 config = get_config()
59 iterable_with_config = (
60 (_with_config(delayed_func, config), args, kwargs)
61 for delayed_func, args, kwargs in iterable
62 )
—> 63 return super().__call__(iterable_with_config)
File ~\anaconda3\lib\site-packages\joblib\parallel.py:1051, in Parallel.__call__(self, iterable)
1048 if self.dispatch_one_batch(iterator):
1049 self._iterating = self._original_iterator is not None
-> 1051 while self.dispatch_one_batch(iterator):
1052 pass
1054 if pre_dispatch == “all” or n_jobs == 1:
1055 # The iterable was consumed all at once by the above for loop.
1056 # No need to wait for async callbacks to trigger to
1057 # consumption.
File ~\anaconda3\lib\site-packages\joblib\parallel.py:864, in Parallel.dispatch_one_batch(self, iterator)
862 return False
863 else:
–> 864 self._dispatch(tasks)
865 return True
File ~\anaconda3\lib\site-packages\joblib\parallel.py:782, in Parallel._dispatch(self, batch)
780 with self._lock:
781 job_idx = len(self._jobs)
–> 782 job = self._backend.apply_async(batch, callback=cb)
783 # A job can complete so quickly than its callback is
784 # called before we get here, causing self._jobs to
785 # grow. To ensure correct results ordering, .insert is
786 # used (rather than .append) in the following line
787 self._jobs.insert(job_idx, job)
File ~\anaconda3\lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 “””Schedule a func to be run”””
–> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
File ~\anaconda3\lib\site-packages\joblib\_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
569 def __init__(self, batch):
570 # Don’t delay the application, to avoid keeping the input
571 # arguments in memory
–> 572 self.results = batch()
File ~\anaconda3\lib\site-packages\joblib\parallel.py:263, in BatchedCalls.__call__(self)
259 def __call__(self):
260 # Set the default nested backend to self._backend but do not set the
261 # change the default number of processes to -1
262 with parallel_backend(self._backend, n_jobs=self._n_jobs):
–> 263 return [func(*args, **kwargs)
264 for func, args, kwargs in self.items]
File ~\anaconda3\lib\site-packages\joblib\parallel.py:263, in (.0)
259 def __call__(self):
260 # Set the default nested backend to self._backend but do not set the
261 # change the default number of processes to -1
262 with parallel_backend(self._backend, n_jobs=self._n_jobs):
–> 263 return [func(*args, **kwargs)
264 for func, args, kwargs in self.items]
File ~\anaconda3\lib\site-packages\sklearn\utils\parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
121 config = {}
122 with config_context(**config):
–> 123 return self.function(*args, **kwargs)
File ~\anaconda3\lib\site-packages\sklearn\pipeline.py:893, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, “fit_transform”):
–> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
File ~\anaconda3\lib\site-packages\sklearn\utils\_set_output.py:142, in _wrap_method_output..wrapped(self, X, *args, **kwargs)
140 @wraps(f)
141 def wrapped(self, X, *args, **kwargs):
–> 142 data_to_wrap = f(self, X, *args, **kwargs)
143 if isinstance(data_to_wrap, tuple):
144 # only wrap the first output for cross decomposition
145 return (
146 _wrap_data_with_container(method, data_to_wrap[0], X, self),
147 *data_to_wrap[1:],
148 )
File ~\anaconda3\lib\site-packages\category_encoders\utils.py:458, in SupervisedTransformerMixin.fit_transform(self, X, y, **fit_params)
451 “””
452 Encoders that utilize the target must make sure that the training data are transformed with:
453 transform(X, y)
454 and not with:
455 transform(X)
456 “””
457 if y is None:
–> 458 raise TypeError(‘fit_transform() missing argument: ”y”’)
459 return self.fit(X, y, **fit_params).transform(X, y)
TypeError: fit_transform() missing argument: y
Reply
- James CarmichaelMarch 4, 2024 at 1:26 am#
  Hi neelanshuni…You may wish to try your implementation in Google Colab to determine if there may be an issue with your local Python environment. Otherwise, ensure that there are no issues that may have resulted from copy and paste of code.
  Reply
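The last frames of the traceback show the target encoder raising because `y` is `None`: the call was `fit_transform(X)` with no target. `ColumnTransformer.fit_transform` accepts `y` and forwards it to each transformer, so passing the target as the second argument is likely what is missing. The sketch below illustrates this with a small hypothetical encoder, `MeanTargetEncoder`, standing in for a supervised encoder; it is not the category_encoders implementation, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class MeanTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical supervised encoder: replaces each category with the mean target."""

    def fit(self, X, y=None):
        if y is None:
            raise TypeError("fit() missing argument: y")
        frame = pd.DataFrame(X).reset_index(drop=True)
        target = pd.Series(np.asarray(y, dtype=float))
        self.global_mean_ = float(target.mean())
        # One category -> mean(target) lookup per input column.
        self.maps_ = [target.groupby(frame.iloc[:, i]).mean()
                      for i in range(frame.shape[1])]
        return self

    def transform(self, X):
        frame = pd.DataFrame(X).reset_index(drop=True)
        # Unseen categories fall back to the global mean.
        encoded = [frame.iloc[:, i].map(self.maps_[i]).fillna(self.global_mean_)
                   for i in range(frame.shape[1])]
        return np.column_stack([col.to_numpy(dtype=float) for col in encoded])


X = pd.DataFrame({"sector": ["a", "a", "b", "b"], "area": [1.0, 2.0, 3.0, 4.0]})
y = [10.0, 20.0, 30.0, 50.0]

ct = ColumnTransformer([
    ("target_enc", MeanTargetEncoder(), ["sector"]),
    ("num", StandardScaler(), ["area"]),
])

# ct.fit_transform(X) would raise TypeError here, as in the traceback above.
Xt = ct.fit_transform(X, y)  # y is forwarded to every transformer
print(Xt[:, 0])
```

In the code from the comment, the equivalent change would be `transformer.fit_transform(X, Y_t)`, assuming `Y_t` holds the transformed target.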
