Pipelines tutorial
You can download a .qmd or .ipynb version of this tutorial below to try out the code for yourselves.
Until W05, we've used the scikit-learn library to perform our data pre-processing, model training and model prediction step by step. What if we could aggregate all these steps? That's exactly what scikit-learn Pipelines allow us to do!
Why would you want to use scikit-learn Pipelines?
- Simplicity: Combining preprocessing and model training in one step simplifies your code.
- Reusability: The same pipeline can easily be reused with different datasets, improving maintainability and reproducibility.
- Reduced error: Pipelines help you avoid common mistakes, like forgetting to apply transformations to the test data.
Do you need additional library installs?
No! As long as you have pandas and scikit-learn installed, you're good to go!
⚙️ Setup
Loading libraries
We start, as usual, by loading all the libraries we'll need in this file:
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Loading the Iris dataset
The example dataset we'll use to showcase the Pipelines functionality is the famous Iris dataset. You can find this dataset in various sources; we'll load it from sklearn.datasets.
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.sample(5)
|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 63  | 6.1               | 2.9              | 4.7               | 1.4              | 1      |
| 3   | 4.6               | 3.1              | 1.5               | 0.2              | 0      |
| 60  | 5.0               | 2.0              | 3.5               | 1.0              | 1      |
| 107 | 7.3               | 2.9              | 6.3               | 1.8              | 2      |
| 4   | 5.0               | 3.6              | 1.4               | 0.2              | 0      |
Note that this is actually a multi-class classification problem, as target has three values:
df.target.unique()

array([0, 1, 2])
Let’s get a quick idea of class imbalance:
class_distribution = np.bincount(df.target) / len(df.target)
print("Class Imbalance (Proportions):")
print(f"Class 0: {class_distribution[0]:.2f}\nClass 1: {class_distribution[1]:.2f}\nClass 2: {class_distribution[2]:.2f}")
Class Imbalance (Proportions):
Class 0: 0.33
Class 1: 0.33
Class 2: 0.33
There is no class imbalance in this case. All classes are perfectly balanced.
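If you prefer to stay in pandas, the same proportions can be read off with value_counts; this is just an equivalent alternative to the numpy approach above:

```python
# Equivalent class-proportion check using pandas
print(df['target'].value_counts(normalize=True).sort_index())
```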
Training/test split
Let's split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
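A side note: for classification problems it is often worth stratifying the split so that each class keeps its proportion in both sets. A minimal variant using train_test_split's standard stratify argument (not used in this tutorial, where the classes are balanced anyway):

```python
# Stratified variant: class proportions are preserved in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```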
Defining the Pipeline
The dataset is fairly simple, but our features are numeric and vary in range, so we need to standardize them before training scikit-learn's LogisticRegression model. The test set will also need standardizing.
We create a Pipeline that will handle this for us automatically.
pipeline = Pipeline([
    ('scaler', StandardScaler()),    # Step 1: Standardize features
    ('model', LogisticRegression())  # Step 2: Logistic Regression model
])
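As an aside, if you prefer not to name the steps yourself, make_pipeline builds an equivalent pipeline with auto-generated, lowercased step names; fitted steps stay accessible through named_steps. A small sketch:

```python
from sklearn.pipeline import make_pipeline

# Equivalent pipeline; steps are auto-named 'standardscaler' and 'logisticregression'
pipeline_alt = make_pipeline(StandardScaler(), LogisticRegression())

# After fitting, individual steps remain inspectable, e.g.:
# pipeline_alt.named_steps['standardscaler'].mean_
```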
Training and evaluating the model
Now, it's time to train and evaluate the model using the pipeline we've created.
# Model training
pipeline.fit(X_train, y_train)

# Model evaluation
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 100.00%
We didn’t need to manually process the training or testing datasets: our pipeline handled it automatically for us.
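One concrete benefit worth noting: because the scaler lives inside the pipeline, tools like cross_val_score re-fit it on each training fold, so no information from the held-out fold leaks into the scaling. A quick sketch (not part of the original workflow above):

```python
from sklearn.model_selection import cross_val_score

# The whole pipeline is refitted on each training fold,
# so the scaler never sees the held-out fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```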
This can be particularly useful in production ML workflows, especially when:
- there are multiple features that require different handling (see the ColumnTransformer sketch below)
- there are multiple pre-processing steps
It can quickly become difficult to handle the processing at multiple pipeline stages and to maintain it over time.
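When features genuinely require different handling, a Pipeline composes naturally with ColumnTransformer. Below is a minimal sketch on hypothetical column names ('age', 'income' and 'city' are illustrative only, not columns of the Iris data):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example: scale the numeric columns, one-hot encode the categorical one
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

mixed_pipeline = Pipeline([
    ('preprocess', preprocessor),    # Step 1: column-specific pre-processing
    ('model', LogisticRegression())  # Step 2: classifier
])
```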
Thanks to scikit-learn's Pipelines, we can aggregate all processing and modeling steps in one place, making it much easier to tweak a portion of the workflow without having to manage it separately for the training and evaluation stages.
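Because the whole workflow is a single estimator, hyperparameter tuning can also target any step through the `<step name>__<parameter>` naming convention. A small sketch with GridSearchCV (the grid values here are arbitrary, for illustration only):

```python
from sklearn.model_selection import GridSearchCV

# 'model__C' addresses the C parameter of the 'model' step (LogisticRegression)
param_grid = {'model__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
```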
To read further about Pipelines' functionalities, you could have a look at:
- this Medium post
- this tutorial on Python-Bloggers
- or look at the Pipeline documentation and the examples that come with it.