LSE DS202 – Data Science for Social Scientists
Pipelines tutorial

Author: Dr. Ghita Berrada
Published: 03 Mar 2025

You can download a .qmd or .ipynb version of this tutorial below to try out the code for yourselves.

Until W05, we used the scikit-learn library to carry out our data pre-processing, model training and model prediction as separate, manual steps. What if we could chain all of these steps together?

That’s exactly what scikit-learn Pipelines allow us to do!

Why would you want to use scikit-learn Pipelines?

  • Simplicity: combining preprocessing and model training into a single step simplifies your code.
  • Reusability: the same pipeline can easily be reused on different datasets, improving maintainability and reproducibility.
  • Reduced error: pipelines help you avoid common mistakes, such as forgetting to apply transformations to the test data (see the sketch below).
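To make the last point concrete, here is a sketch contrasting the manual workflow with the pipeline workflow. It reuses the imports and the X_train/X_test split defined later in this tutorial, so treat it as a preview rather than code to run right now:

# --- Manual approach: you must remember to transform the test set yourself ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on the training data
X_test_scaled = scaler.transform(X_test)        # transform ONLY on the test data
model = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)  # easy to accidentally pass the unscaled X_test here!

# --- Pipeline approach: the fitted scaler is applied for you, consistently ---
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)     # the scaler is fitted on the training data only
y_pred = pipe.predict(X_test)  # the same fitted scaler is applied to X_test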

Do you need additional library installs?

No! As long as you have pandas and scikit-learn installed, you’re good to go (numpy, which we also use below, is installed automatically as a dependency of both).

⚙️ Setup

Loading libraries

We start, as usual, by loading all the libraries we’ll need in this file (note that we also import numpy, which we’ll use for a quick class-balance check later on):

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Loading the Iris dataset

The example dataset we’ll use to showcase Pipelines’ functionality is the famous Iris dataset. It is available from various sources; we’ll load it from sklearn.datasets.

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.sample(5)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
63                 6.1               2.9                4.7               1.4       1
3                  4.6               3.1                1.5               0.2       0
60                 5.0               2.0                3.5               1.0       1
107                7.3               2.9                6.3               1.8       2
4                  5.0               3.6                1.4               0.2       0

Note that this is actually a multi-class classification problem as target has three values:

df.target.unique()
array([0, 1, 2])

Let’s get a quick idea of class imbalance:

class_distribution = np.bincount(df.target) / len(df.target)
print("Class Imbalance (Proportions):")
print(f"Class 0: {class_distribution[0]:.2f}\nClass 1: {class_distribution[1]:.2f}\nClass 2: {class_distribution[2]:.2f}")
Class Imbalance (Proportions):
Class 0: 0.33
Class 1: 0.33
Class 2: 0.33

There is no class imbalance in this case. All classes are perfectly balanced.

Training/test split

Let’s split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
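As an aside, train_test_split also accepts a stratify argument for classification problems, which preserves the class proportions in both splits. With Iris’s perfectly balanced classes it makes little difference (the results below were produced with the split above), but it is a good habit:

# Optional alternative: a stratified split keeps the class proportions
# identical in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)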

Defining the Pipeline

The dataset is fairly simple, but our features are numeric and vary in range, so we need to standardize them before training scikit-learn’s LogisticRegression model¹. The test set will also need to be standardized.

We create a Pipeline that will handle this for us automatically.

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize features
    ('model', LogisticRegression())  # Step 2: Logistic Regression model
])
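A handy property of naming the steps (this is standard Pipeline behaviour, not specific to our example) is that each step can be inspected after fitting, and its hyperparameters can be addressed with the step__parameter convention:

# Each named step is accessible after fitting:
# pipeline.named_steps['scaler'].mean_  -> per-feature means learned from the training data
# pipeline.named_steps['model'].coef_   -> the fitted regression coefficients

# The 'step__parameter' syntax changes a step's hyperparameters in place,
# e.g. the regularization strength C of the logistic regression:
pipeline.set_params(model__C=0.5)

The same step__parameter naming is what you would use in a param_grid if you later tuned the whole pipeline with GridSearchCV.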

Training and evaluating the model

Now, it’s time to train and evaluate² the model using the pipeline we’ve created.

pipeline.fit(X_train, y_train)  # Model training

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")
Model Accuracy: 100.00%

We didn’t need to manually process the training or testing datasets: our pipeline handled it automatically for us.
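This also pays off when cross-validating: if you pass the whole pipeline to cross_val_score, the scaler is re-fitted on each training fold only, so no information from the held-out fold ever leaks into the preprocessing. A minimal sketch:

from sklearn.model_selection import cross_val_score

# The scaler is re-fitted inside every fold, so the held-out fold never
# influences the standardization parameters (no data leakage)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")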

This can be particularly useful in production ML workflows, especially when:

  • there are multiple features that require different handling (see the sketch below)
  • there are multiple pre-processing steps

Without pipelines, it quickly becomes difficult to apply this processing consistently at every stage and to maintain it over time.
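To give a flavour of the first point: the Iris features are all numeric, but a dataset mixing numeric and categorical columns can be handled by placing a ColumnTransformer inside the pipeline. Here is a sketch, assuming a hypothetical DataFrame with numeric columns 'age' and 'income' and a categorical column 'region':

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Each group of columns gets its own transformer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),  # standardize numeric columns
    ('cat', OneHotEncoder(), ['region'])           # one-hot encode the categorical column
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
# pipeline.fit(df_train, y_train) would then apply the right transformation
# to each group of columns automatically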

Thanks to scikit-learn’s Pipelines, we can aggregate all processing and modeling steps in one place, which makes it much easier to tweak a portion of the workflow without having to manage it separately for the training and evaluation stages.

To read further about Pipelines

To read further about the functionality of Pipelines, you could have a look at:

  • this Medium post
  • this tutorial on Python-Bloggers
  • or look at the Pipeline documentation and examples that come with it.

Footnotes

  1. Which is a multinomial logistic regression model under the hood, since we have a multiclass problem.

  2. Note that accuracy is an acceptable metric here because the classes are balanced.


Copyright 2024, LSE Data Science Institute