Using Scikit-learn Pipelines with Pandas Dataframes
Scikit-learn is a popular Python library for training machine learning models, and Pandas is a popular Python library for manipulating tabular data. They work well together: when building a machine learning model, you typically start by cleaning and shaping the data in Pandas, then train the model in scikit-learn.
But one hard part is making sure you apply exactly the same data manipulation steps to the training set, the test set, and the live data once the model is deployed. It is very easy to leak data or to forget a step, either of which can ruin your model.
To help solve this problem, Scikit-learn developed Pipelines. Pipelines allow you to define a sequence of transforms, including a model training step, that is easy to apply consistently. This post will go over how to use Pipelines with Pandas Dataframes.
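As a rough sketch of the idea (the scaler and logistic regression here are stand-ins for whatever steps your model actually needs):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline bundles preprocessing and the model into one estimator:
# fit() runs the scaler and then trains the model; predict() re-applies
# the exact same scaling before predicting.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
```

Because the whole sequence lives in one object, you fit it once on the training data and reuse it unchanged on test and live data.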
Pandas and Pipelines: Formerly Not So Simple
It used to be tough to use Pandas DataFrames and scikit-learn pipelines together. There was a `ColumnTransformer` to work with dataframes, but it had a major limitation: its output was a numpy array. This meant that if you used a second `ColumnTransformer` in your pipeline, you would get the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
But scikit-learn version 1.2 updated the transformer API to fix this! There is now an option to output Pandas DataFrames.
A working pipeline
Now that the `set_output` API exists, we can chain `ColumnTransformer`s without error!
For example, we can impute one column, and then scale it along with a few others. First we set up the two `ColumnTransformer`s, one to impute and one to scale:
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Apply each feature transform using a ColumnTransformer
imputer = (
    "imputer",
    ColumnTransformer(
        [("col_impute", SimpleImputer(), ["x1"])],
        remainder="passthrough",
    ),
)
scaler = (
    "scaler",
    ColumnTransformer(
        [
            (
                "col_scale",
                StandardScaler(),
                ["col_impute__x1", "remainder__x2", "remainder__x3"],
            )
        ],
        remainder="passthrough",
    ),
)
```
Then we combine them in a pipeline:
```python
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        imputer,
        scaler,
    ]
).set_output(transform="pandas")
```
And it works! There are two tricks; we have to:
- Make the output of each step a dataframe with `set_output(transform="pandas")`.
- Adjust the column names in the downstream steps, because they get prefixed with the names of the previous steps they’ve passed through.
Here is a Jupyter notebook (rendered on Github) with a toy dataset and a full Pandas pipeline example. Hope it helps!