Pipelines tutorial
You can download a .qmd or .ipynb version of this tutorial below to try out the code for yourselves.
Until W05, we’ve used the scikit-learn library to perform our data pre-processing steps, our model training and our model predictions step-by-step. What if we could aggregate all these steps?
That’s exactly what scikit-learn Pipelines allows us to do!
Why would you want to use scikit-learn Pipelines?
- Simplicity: Combining preprocessing and model training in one step simplifies your code.
- Reusability: Improving maintainability and reproducibility. Easily reusing the same pipeline with different datasets.
- Reduced Error: Avoiding common mistakes like forgetting to apply transformations to test data.
Do you need additional library installs?
No! As long as you have pandas and scikit-learn installed, you’re good to go!
⚙️ Setup
Loading libraries
We start, as usual, by loading all the libraries we’ll need in this file.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Loading the Iris dataset
The example dataset we’ll use to showcase what Pipelines can do is the famous Iris dataset. It is available from various sources; we’ll load it from sklearn.datasets.
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.sample(5)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 63 | 6.1 | 2.9 | 4.7 | 1.4 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 60 | 5.0 | 2.0 | 3.5 | 1.0 | 1 |
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | 2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Note that this is actually a multi-class classification problem as target has three values:
df.target.unique()
array([0, 1, 2])
Let’s get a quick idea of class imbalance:
class_distribution = np.bincount(df.target) / len(df.target)
print("Class Imbalance (Proportions):")
print(f"Class 0: {class_distribution[0]:.2f}\nClass 1: {class_distribution[1]:.2f}\nClass 2: {class_distribution[2]:.2f}")
Class Imbalance (Proportions):
Class 0: 0.33
Class 1: 0.33
Class 2: 0.33
There is no class imbalance in this case. All classes are perfectly balanced.
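As an aside, the same proportions can be read directly from pandas. The snippet below is a minimal sketch that rebuilds the DataFrame so it runs on its own; the variable name class_proportions is ours and not part of the tutorial.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame used above so this snippet is self-contained
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# normalize=True returns proportions instead of raw counts
class_proportions = df["target"].value_counts(normalize=True).sort_index()
print(class_proportions.round(2))
```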
Training/test split
Let’s split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Defining the Pipeline
The dataset is fairly simple, but our features are numeric and vary in range, so we need to standardize them before training scikit-learn’s LogisticRegression model. The test set will also need standardizing.
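For comparison, here is a minimal sketch of what doing this by hand could look like. It is not part of the original workflow; the names scaler, X_train_scaled, X_test_scaled, manual_model and manual_pred are introduced here purely for illustration. The key point is that the scaler is fitted on the training data only and then reused, unchanged, on the test data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the per-feature mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the fitted scaler on the test data: transform, never fit again
X_test_scaled = scaler.transform(X_test)

# Train and predict on the scaled data
manual_model = LogisticRegression()
manual_model.fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)
```

Every one of these steps is a place where the test data could accidentally be left unscaled or scaled with the wrong statistics.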
We create a Pipeline that will handle this for us automatically.
pipeline = Pipeline([
('scaler', StandardScaler()), # Step 1: Standardize features
('model', LogisticRegression()) # Step 2: Logistic Regression model
])
Training and evaluating the model
Now, it’s time to train and evaluate the model using the pipeline we’ve created.
pipeline.fit(X_train, y_train) # Model training
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Output -> Model Accuracy: 100.00%
We didn’t need to manually process the training or testing datasets: our pipeline handled it automatically for us.
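To see what “handled it automatically” means, the sketch below rebuilds the same pipeline and looks inside it through the named_steps attribute. The check at the end is our own illustration: it applies the fitted scaler and model by hand and compares the result with pipeline.predict.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Each fitted step can be retrieved by the name given in the Pipeline definition
fitted_scaler = pipeline.named_steps["scaler"]
fitted_model = pipeline.named_steps["model"]

# The scaler statistics were learned from the training data only
print(fitted_scaler.mean_)

# predict() on the pipeline is equivalent to transforming, then predicting
manual_pred = fitted_model.predict(fitted_scaler.transform(X_test))
pipeline_pred = pipeline.predict(X_test)
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(same_predictions)
```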
This can be particularly useful in production ML workflows, especially when:
- there are multiple features that require different handling
- there are multiple pre-processing steps
It can quickly become difficult to handle the processing at multiple pipeline stages and maintain them over time.
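As an illustration of features that need different handling, here is a minimal sketch that combines a Pipeline with scikit-learn’s ColumnTransformer. The toy DataFrame, its column names (height_cm, colour, label) and the step names are all hypothetical and have nothing to do with the Iris data: the numeric column is standardized while the categorical column is one-hot encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 155.0, 175.0],
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "label": [0, 0, 1, 1, 0, 1],
})
X_toy = toy[["height_cm", "colour"]]
y_toy = toy["label"]

# Route each column to the transformer that suits its type
preprocessor = ColumnTransformer([
    ("numeric", StandardScaler(), ["height_cm"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
])

# The preprocessor becomes the first step of an ordinary Pipeline
mixed_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression()),
])
mixed_pipeline.fit(X_toy, y_toy)
toy_pred = mixed_pipeline.predict(X_toy)

# One scaled column plus three one-hot columns
transformed_shape = mixed_pipeline.named_steps["preprocess"].transform(X_toy).shape
print(transformed_shape)
```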
Thanks to scikit-learn’s Pipelines, we can aggregate all processing and modeling steps in one place so it is much easier to tweak a portion of the workflow without having to manage it separately for training and evaluation stages.
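A minimal sketch of such a tweak, assuming the same two-step pipeline as above: set_params can change a single hyperparameter with the step__parameter syntax, or replace a whole step by name. The value C=0.1 and the switch to MinMaxScaler are arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Change one hyperparameter of the "model" step (illustrative value)
pipeline.set_params(model__C=0.1)

# Replace the whole "scaler" step with a different transformer (illustrative choice)
pipeline.set_params(scaler=MinMaxScaler())

# Training and evaluation code stays exactly the same after the tweak
pipeline.fit(X_train, y_train)
tweaked_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Accuracy after tweaks: {tweaked_accuracy * 100:.2f}%")
```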
To read further about Pipelines
To learn more about what Pipelines can do, you could have a look at:
- this Medium post
- this tutorial on Python-Bloggers
- or look at the Pipeline documentation and the examples that come with it.
