7.1. Pipelines and composite estimators

To build a composite estimator, transformers are usually combined with other transformers or with predictors (such as classifiers or regressors). The most common tool used for composing estimators is a Pipeline. Pipelines require all steps except the last to be a transformer. The last step can be anything: a transformer, a predictor, or a clustering estimator which may or may not have a .predict(...) method. A pipeline exposes all methods provided by the last estimator: if the last step provides a transform method, then the pipeline has a transform method and behaves like a transformer. If the last step provides a predict method, then the pipeline exposes that method and, given data X, uses all steps except the last to transform the data, and then passes the transformed data to the predict method of the last step of the pipeline.

The class Pipeline is often used in combination with ColumnTransformer or FeatureUnion, which concatenate the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (e.g. log-transform y).

7.1.1. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

Convenience and encapsulation

You only have to call fit and predict once on your data to fit a whole sequence of estimators.

Joint parameter selection

You can grid search over parameters of all estimators in the pipeline at once.

Safety

Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).

Note

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
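The note above can be checked by hand. The following is a minimal sketch, assuming the iris dataset and a PCA + LogisticRegression pipeline chosen purely for illustration (the max_iter value is also an assumption, set only to avoid convergence warnings). It fits the pipeline once, then repeats the same steps manually and compares the predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# One call fits the transformer, transforms X, and fits the classifier.
pipe = Pipeline([
    ("reduce_dim", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# The same sequence written out step by step.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_reduced, y)
manual_pred = clf.predict(pca.transform(X))

# Both routes should give the same predictions.
same = np.array_equal(pipe_pred, manual_pred)
```

Because the pipeline's last step is a classifier, pipe.predict is available and applies the PCA transform before delegating to the classifier.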

7.1.1.1. Usage

7.1.1.1.1. Build a pipeline

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])

Shorthand version using make_pipeline

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

>>> from sklearn.pipeline import make_pipeline
>>> make_pipeline(PCA(), SVC())
Pipeline(steps=[('pca', PCA()), ('svc', SVC())])
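As a small sketch of how the automatically generated names can be used, the example below builds a three-step pipeline (the extra StandardScaler step is an assumption added for illustration), lists the generated step names, and uses one of them to set a parameter on a step:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), PCA(), SVC())

# Names are the lowercased class names of the estimators.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'pca', 'svc']

# The generated names can be used like hand-written ones.
pipe.set_params(svc__C=10)
```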

7.1.1.1.2. Access pipeline steps

The estimators of a pipeline are stored as a list in the steps attribute. A sub-pipeline can be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

>>> pipe[:1]
Pipeline(steps=[('reduce_dim', PCA())])
>>> pipe[-1:]
Pipeline(steps=[('clf', SVC())])

Accessing a step by name or position

A specific step can also be accessed by index or name by indexing (with [idx]) the pipeline:

>>> pipe.steps[0]
('reduce_dim', PCA())
>>> pipe[0]
PCA()
>>> pipe['reduce_dim']
PCA()

Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments:

>>> pipe.named_steps.reduce_dim is pipe['reduce_dim']
True
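The slicing notation described in this section is mostly useful on a fitted pipeline. The sketch below, which assumes the iris dataset and a two-component PCA purely for illustration, fits a pipeline and then uses a slice to run only the preprocessing part, forwards and backwards:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])
pipe.fit(X, y)

# Everything except the final classifier. The slice is a new Pipeline
# that reuses the already fitted steps of the original one.
preprocessing = pipe[:-1]

X_reduced = preprocessing.transform(X)                   # 4 features -> 2
X_restored = preprocessing.inverse_transform(X_reduced)  # back to 4 features
```

X_restored is only an approximation of X here, since PCA with two components discards part of the variance.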

7.1.1.1.3. Tracking feature names in a pipeline

To enable model inspection, Pipeline has a get_feature_names_out() method, just like all transformers. You can use pipeline slicing to get the feature names going into each step:

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> pipe = Pipeline(steps=[
...     ('select', SelectKBest(k=2)),
...     ('clf', LogisticRegression())])
>>> pipe.fit(iris.data, iris.target)
Pipeline(steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe[:-1].get_feature_names_out()
array(['x2', 'x3'], ...)

Customize feature names

You can also provide custom feature names for the input data using get_feature_names_out:

>>> pipe[:-1].get_feature_names_out(iris.feature_names)
array(['petal length (cm)', 'petal width (cm)'], ...)

7.1.1.1.4. Access to nested parameters

It is common to adjust the parameters of an estimator within a pipeline. This parameter is therefore nested because it belongs to a particular sub-step. Parameters of the estimators in the pipeline are accessible using the <estimator>__<parameter> syntax:

>>> pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])

When does it matter?

This is particularly important for doing grid searches:

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough':

>>> param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
...                   clf=[SVC(), LogisticRegression()],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
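The grids above are only constructed, not fitted. As a minimal sketch of what running such a search looks like, the example below fits a small grid on the iris dataset. The dataset, the grid values, and cv=3 are assumptions chosen to keep the run short:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

param_grid = {
    # Either skip dimensionality reduction or keep two components.
    "reduce_dim": ["passthrough", PCA(n_components=2)],
    # Nested parameter of the 'clf' step.
    "clf__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_["params"])  # 2 * 3 combinations
best = grid_search.best_params_  # keys match those of param_grid
```

Since the whole pipeline is refitted inside each cross-validation split, the PCA step never sees the held-out samples of that split.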


7.1.1.2. Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid refitting the transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration. The last step will never be cached, even if it is a transformer.

The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a joblib.Memory object:

>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> pipe
Pipeline(memory=...,
         steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)

Side effect of caching transformers

Using a Pipeline without cache enabled, it is possible to inspect the original instance such as:

>>> from sklearn.datasets import load_digits
>>> X_digits, y_digits = load_digits(return_X_y=True)
>>> pca1 = PCA(n_components=10)
>>> svm1 = SVC()
>>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
>>> pipe.fit(X_digits, y_digits)
Pipeline(steps=[('reduce_dim', PCA(n_components=10)), ('clf', SVC())])
>>> # The pca instance can be inspected directly
>>> pca1.components_.shape
(10, 64)

Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to inspect estimators within the pipeline:

>>> cachedir = mkdtemp()
>>> pca2 = PCA(n_components=10)
>>> svm2 = SVC()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
...                        memory=cachedir)
>>> cached_pipe.fit(X_digits, y_digits)
Pipeline(memory=...,
         steps=[('reduce_dim', PCA(n_components=10)), ('clf', SVC())])
>>> cached_pipe.named_steps['reduce_dim'].components_.shape
(10, 64)
>>> # Remove the cache directory
>>> rmtree(cachedir)
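The side effect described above can be made explicit with a small check. This sketch assumes the iris dataset and uses tempfile.TemporaryDirectory (rather than mkdtemp and rmtree) so that the cache directory is cleaned up automatically:

```python
from tempfile import TemporaryDirectory

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)

with TemporaryDirectory() as cachedir:
    cached_pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())],
                           memory=cachedir)
    cached_pipe.fit(X, y)

    # The instance passed in was cloned before fitting, so it stays unfitted.
    original_is_fitted = hasattr(pca, "components_")

    # The fitted clone lives inside the pipeline.
    fitted_shape = cached_pipe.named_steps["reduce_dim"].components_.shape
```

Here original_is_fitted is False, while the step stored in the pipeline exposes components_ with one row per retained component.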


7.1.2. Transforming target in regression

TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for prediction, and the transformer that will be applied to the target variable:

>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.compose import TransformedTargetRegressor
>>> from sklearn.preprocessing import QuantileTransformer
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> # create a synthetic dataset
>>> X, y = make_regression(n_samples=20640,
...                        n_features=8,
...                        noise=100.0,
...                        random_state=0)
>>> y = np.exp(1 + (y - y.min()) * (4 / (y.max() - y.min())))
>>> X, y = X[:2000, :], y[:2000]  # select a subset of data
>>> transformer = QuantileTransformer(output_distribution='normal')
>>> regressor = LinearRegression()
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   transformer=transformer)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print(f"R2 score: {regr.score(X_test, y_test):.2f}")
R2 score: 0.67
>>> raw_target_regr = LinearRegression().fit(X_train, y_train)
>>> print(f"R2 score: {raw_target_regr.score(X_test, y_test):.2f}")
R2 score: 0.64

For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping:

>>> def func(x):
...     return np.log(x)
>>> def inverse_func(x):
...     return np.exp(x)

Subsequently, the object is created as:

>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print(f"R2 score: {regr.score(X_test, y_test):.2f}")
R2 score: 0.67
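To make the "transform, fit, inverse-transform" behaviour concrete, the following sketch compares TransformedTargetRegressor with the same steps done by hand. The small synthetic dataset and the rescaling of y to a strictly positive range are assumptions made only so that the logarithm is defined:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, noise=10.0,
                       random_state=0)
# Rescale the target so that it is strictly positive (values in [1, e]).
y = np.exp((y - y.min()) / (y.max() - y.min()))

# Wrapped version: log before fitting, exp after predicting.
regr = TransformedTargetRegressor(regressor=LinearRegression(),
                                  func=np.log, inverse_func=np.exp)
regr.fit(X, y)

# Manual version of the same procedure.
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

matches = np.allclose(regr.predict(X), manual_pred)
```

The wrapper saves you from remembering to apply the inverse mapping at prediction time, which matters once the regressor is used inside cross-validation or a grid search.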

By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to bypass this checking by setting check_inverse to False:

>>> def inverse_func(x):
...     return x
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func,
...                                   check_inverse=False)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print(f"R2 score: {regr.score(X_test, y_test):.2f}")
R2 score: -3.02

Note

The transformation can be triggered by setting either transformer or the pair of functions func and inverse_func. However, setting both options will raise an error.
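The conflict mentioned in the note can be reproduced with a few lines. This is a sketch on made-up data. The text above only says that an error is raised, so catching ValueError specifically, and the error surfacing at fit time rather than at construction, are assumptions about the current behaviour:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.linspace(1.0, 10.0, 10)

# Both a transformer and a func/inverse_func pair are given.
conflicting = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
    func=np.log,
    inverse_func=np.exp,
)

try:
    conflicting.fit(X, y)
    raised = False
except ValueError as exc:  # assumed exception type, see the lead-in
    raised = True
    message = str(exc)
```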


7.1.3. FeatureUnion: composite feature spaces

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply different transformations to each field of the data, see the related class ColumnTransformer (see user guide).

FeatureUnion serves the same purposes as Pipeline: convenience and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

7.1.3.1. Usage

A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', KernelPCA())])

Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.

Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to 'drop':

>>> combined.set_params(kernel_pca='drop')
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', 'drop')])
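The side-by-side concatenation is easiest to see in the output shapes. The sketch below assumes the iris dataset and arbitrary component counts (2 for PCA, 3 for an RBF KernelPCA). It also shows the names that make_union generates:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion, make_union

X, _ = load_iris(return_X_y=True)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])
# 2 columns from PCA followed by 3 columns from KernelPCA.
full_shape = combined.fit_transform(X).shape

# Dropping a transformer removes its columns from the output.
combined.set_params(kernel_pca="drop")
reduced_shape = combined.fit_transform(X).shape

# make_union fills in names from the lowercased class names.
auto_names = [name for name, _ in make_union(PCA(), KernelPCA()).transformer_list]
```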


7.1.4. ColumnTransformer for heterogeneous data

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

  1. Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.

  2. You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.
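As a rough sketch of the leakage-safe setup described above, the example below places a ColumnTransformer inside a pipeline and cross-validates the whole thing. The toy DataFrame, its column names, the labels, and handle_unknown="ignore" are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numeric column.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "city": rng.choice(["London", "Paris", "Sallisaw"], size=40),
    "rating": rng.randint(1, 6, size=40).astype(float),
})
y = np.tile([0, 1], 20)  # arbitrary binary labels

preprocess = ColumnTransformer([
    # handle_unknown="ignore" guards against categories missing from a fold.
    ("city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("rating", StandardScaler(), ["rating"]),
])
model = make_pipeline(preprocess, LogisticRegression())

# The encoder and scaler are refitted on the training part of each split,
# so no statistics from the held-out part reach the model.
scores = cross_val_score(model, df, y, cv=4)
```

The scores themselves are meaningless on random labels; the point is only that preprocessing happens inside each cross-validation split.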

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

>>> import pandas as pd
>>> X = pd.DataFrame(
...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
...      'title': ["His Last Bow", "How Watson Learned the Trick",
...                "A Moveable Feast", "The Grapes of Wrath"],
...      'expert_rating': [5, 3, 4, 5],
...      'user_rating': [4, 5, 4, 3]})

For this data, we might want to encode the 'city' column as a categorical variable using OneHotEncoder but apply a CountVectorizer to the 'title' column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'categories' and 'title_bow'. By default, the remaining rating columns are ignored (remainder='drop'):

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
...     [('categories', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='drop', verbose_feature_names_out=False)
>>> column_trans.fit(X)
ColumnTransformer(transformers=[('categories', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')],
                  verbose_feature_names_out=False)
>>> column_trans.get_feature_names_out()
array(['city_London', 'city_Paris', 'city_Sallisaw', 'bow', 'feast',
       'grapes', 'his', 'how', 'last', 'learned', 'moveable', 'of', 'the',
       'trick', 'watson', 'wrath'], ...)
>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)

In the above example, the CountVectorizer expects a 1D array as input and therefore the column was specified as a string ('title'). However, OneHotEncoder, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (['city']).

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector. The make_column_selector is used to select columns based on data type or column name:

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = ColumnTransformer([
...     ('scale', StandardScaler(),
...      make_column_selector(dtype_include=np.number)),
...     ('onehot',
...      OneHotEncoder(),
...      make_column_selector(pattern='city', dtype_include=object))])
>>> ct.fit_transform(X)
array([[ 0.904,  0.   ,  1. ,  0. ,  0. ],
       [-1.507,  1.414,  1. ,  0. ,  0. ],
       [-0.301,  0.   ,  0. ,  1. ,  0. ],
       [ 0.904, -1.414,  0. ,  0. ,  1. ]])

Strings can reference columns if the input is a DataFrame; integers are always interpreted as positional columns.
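Positional selection also works when the input is a plain NumPy array, where no column names exist. A minimal sketch, using a made-up 3x3 array, a list of integer positions, and a slice:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

ct = ColumnTransformer([
    # Standardize the first two columns, selected by position.
    ("scale", StandardScaler(), [0, 1]),
    # Keep the third column unchanged, selected with a slice.
    ("keep", "passthrough", slice(2, 3)),
])
out = ct.fit_transform(X)
```

The output columns follow the order of the transformers: the two scaled columns first, then the untouched third column.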

We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to the end of the transformation:

>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='passthrough')
>>> column_trans.fit_transform(X)
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]]...)

The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:

>>> from sklearn.preprocessing import MinMaxScaler
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder=MinMaxScaler())
>>> column_trans.fit_transform(X)[:, -2:]
array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])

The make_column_transformer function is available to more easily create a ColumnTransformer object. Specifically, the names will be given automatically. The equivalent for the above example would be:

>>> from sklearn.compose import make_column_transformer
>>> column_trans = make_column_transformer(
...     (OneHotEncoder(), ['city']),
...     (CountVectorizer(), 'title'),
...     remainder=MinMaxScaler())
>>> column_trans
ColumnTransformer(remainder=MinMaxScaler(),
                  transformers=[('onehotencoder', OneHotEncoder(), ['city']),
                                ('countvectorizer', CountVectorizer(),
                                 'title')])

If ColumnTransformer is fitted with a dataframe and the dataframe only has string column names, then transforming a dataframe will use the column names to select the columns:

>>> ct = ColumnTransformer(
...     [("scale", StandardScaler(), ["expert_rating"])]).fit(X)
>>> X_new = pd.DataFrame({"expert_rating": [5, 6, 1],
...                       "ignored_new_col": [1.2, 0.3, -0.1]})
>>> ct.transform(X_new)
array([[ 0.9],
       [ 2.1],
       [-3.9]])

7.1.5. Visualizing Composite Estimators

Estimators are displayed with an HTML representation when shown in a Jupyter notebook. This is useful to diagnose or visualize a Pipeline with many estimators. This visualization is activated by default:

>>> column_trans

It can be deactivated by setting the display option in set_config to ‘text’:

>>> from sklearn import set_config
>>> set_config(display='text')
>>> # displays text representation in a jupyter context
>>> column_trans

An example of the HTML output can be seen in the HTML representation of Pipeline section of Column Transformer with Mixed Types. As an alternative, the HTML can be written to a file using estimator_html_repr:

>>> from sklearn.utils import estimator_html_repr
>>> with open('my_estimator.html', 'w') as f:
...     f.write(estimator_html_repr(clf))
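The snippet above refers to an estimator named clf that is not defined in this section. A self-contained variant might look like the sketch below, where the PCA + SVC pipeline and the temporary output directory are assumptions chosen for illustration:

```python
import os
from tempfile import TemporaryDirectory

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils import estimator_html_repr

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# The HTML representation is returned as a plain string.
html = estimator_html_repr(pipe)

with TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "my_estimator.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    file_size = os.path.getsize(path)  # non-zero once the HTML is written
```

The estimator does not need to be fitted for the representation to be generated.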
