Using Scikit-learn Pipelines with Pandas Dataframes
Scikit-learn is a popular Python library for training machine learning models, and Pandas is a popular Python library for manipulating tabular data. They work well together: when building a machine learning model, you typically start by cleaning and shaping the data in Pandas, then train the model in scikit-learn.
But one hard part is making sure you apply exactly the same data manipulation steps to the training set, the test set, and the live data once the model is deployed. It is very easy to leak data or to forget a step, either of which can ruin your model.
To help solve this problem, Scikit-learn developed Pipelines. Pipelines allow you to define a sequence of transforms, including a model training step, that is easy to apply consistently. This post will go over how to use Pipelines with Pandas Dataframes.
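As a rough sketch of the idea (the scaler and logistic regression here are stand-ins for whatever steps your model actually needs):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline bundles preprocessing and the model into one estimator:
# fit() runs the scaler and then trains the model; predict() re-applies
# the exact same scaling before predicting.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
```

Because the whole sequence lives in one object, you fit it once on the training data and reuse it unchanged on test and live data.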
Pandas and Pipelines: Formerly Not So Simple
It used to be tough to use Pandas DataFrames and scikit-learn pipelines together. There was a `ColumnTransformer` to work with dataframes, but it had a major limitation: its output was a numpy array. This meant that if you used a second `ColumnTransformer` in your pipeline, you would get the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
But scikit-learn version 1.2 updated the transformer API to fix this! There is now an option to output Pandas DataFrames.
A working pipeline
Now that the `set_output` API exists, we can chain `ColumnTransformer`s without error!
For example, we can impute one column, and then scale it along with a few others. First we set up the two `ColumnTransformer`s, one to impute and one to scale:
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Apply each feature transform using a ColumnTransformer
imputer = (
    "imputer",
    ColumnTransformer(
        [("col_impute", SimpleImputer(), ["x1"])],
        remainder="passthrough",
    ),
)
scaler = (
    "scaler",
    ColumnTransformer(
        [
            (
                "col_scale",
                StandardScaler(),
                ["col_impute__x1", "remainder__x2", "remainder__x3"],
            )
        ],
        remainder="passthrough",
    ),
)
```
Then we combine them in a pipeline:
```python
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        imputer,
        scaler,
    ]
).set_output(transform="pandas")
```
And it works! There are two tricks; we have to:
- Make the output of each step a dataframe with `set_output(transform="pandas")`.
- Adjust the column names in the downstream steps, because they get prefixed with the names of the previous steps they’ve passed through.
Here is a Jupyter notebook (rendered on Github) with a toy dataset and a full Pandas pipeline example. Hope it helps!