LSE DS202 – Data Science for Social Scientists
Pipelines tutorial

Author: Dr. Ghita Berrada
Published: 03 Mar 2025

You can download a .qmd or .ipynb version of this tutorial below to try out the code for yourselves.

Until W05, we used the scikit-learn library to carry out our data pre-processing, model training and model prediction as separate, manual steps. What if we could chain all of these steps together?

That’s exactly what scikit-learn Pipelines allow us to do!

Why would you want to use scikit-learn Pipelines?

  • Simplicity: combining preprocessing and model training into a single step simplifies your code.
  • Reusability: the same pipeline can easily be reused on different datasets, improving maintainability and reproducibility.
  • Reduced error: pipelines help you avoid common mistakes, such as forgetting to apply transformations to the test data (see the sketch below).
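To make the last point concrete, here is a sketch contrasting the manual workflow with the pipeline workflow. It reuses the imports and the X_train/X_test split defined later in this tutorial, so treat it as a preview rather than code to run right now:

# --- Manual approach: you must remember to transform the test set yourself ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on the training data
X_test_scaled = scaler.transform(X_test)        # transform ONLY on the test data
model = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)  # easy to accidentally pass the unscaled X_test here!

# --- Pipeline approach: the fitted scaler is applied for you, consistently ---
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)     # the scaler is fitted on the training data only
y_pred = pipe.predict(X_test)  # the same fitted scaler is applied to X_test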

Do you need additional library installs?

No! As long as you have pandas and scikit-learn installed, you’re good to go (numpy, which we also use below, is installed automatically as a dependency of both).

⚙️ Setup

Loading libraries

We start, as usual, by loading all the libraries we’ll need in this file (note that we also import numpy, which we’ll use for a quick class-balance check later on):

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Loading the Iris dataset

The example dataset we’ll use to showcase Pipelines’ functionality is the famous Iris dataset. It is available from various sources; we’ll load it from sklearn.datasets.

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.sample(5)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
63                 6.1               2.9                4.7               1.4       1
3                  4.6               3.1                1.5               0.2       0
60                 5.0               2.0                3.5               1.0       1
107                7.3               2.9                6.3               1.8       2
4                  5.0               3.6                1.4               0.2       0

Note that this is actually a multi-class classification problem as target has three values:

df.target.unique()
array([0, 1, 2])

Let’s get a quick idea of class imbalance:

class_distribution = np.bincount(df.target) / len(df.target)
print("Class Imbalance (Proportions):")
print(f"Class 0: {class_distribution[0]:.2f}\nClass 1: {class_distribution[1]:.2f}\nClass 2: {class_distribution[2]:.2f}")
Class Imbalance (Proportions):
Class 0: 0.33
Class 1: 0.33
Class 2: 0.33

There is no class imbalance in this case. All classes are perfectly balanced.

Training/test split

Let’s split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
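As an aside, train_test_split also accepts a stratify argument for classification problems, which preserves the class proportions in both splits. With Iris’s perfectly balanced classes it makes little difference (the results below were produced with the split above), but it is a good habit:

# Optional alternative: a stratified split keeps the class proportions
# identical in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)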

Defining the Pipeline

The dataset is fairly simple, but our features are numeric and vary in range, so we need to standardize them before training scikit-learn’s LogisticRegression model¹. The test set will also need to be standardized.

We create a Pipeline that will handle this for us automatically.

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize features
    ('model', LogisticRegression())  # Step 2: Logistic Regression model
])
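A handy property of naming the steps (this is standard Pipeline behaviour, not specific to our example) is that each step can be inspected after fitting, and its hyperparameters can be addressed with the step__parameter convention:

# Each named step is accessible after fitting:
# pipeline.named_steps['scaler'].mean_  -> per-feature means learned from the training data
# pipeline.named_steps['model'].coef_   -> the fitted regression coefficients

# The 'step__parameter' syntax changes a step's hyperparameters in place,
# e.g. the regularization strength C of the logistic regression:
pipeline.set_params(model__C=0.5)

The same step__parameter naming is what you would use in a param_grid if you later tuned the whole pipeline with GridSearchCV.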

Training and evaluating the model

Now, it’s time to train and evaluate² the model using the pipeline we’ve created.

pipeline.fit(X_train, y_train)  # Model training

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")
Model Accuracy: 100.00%

We didn’t need to manually process the training or testing datasets: our pipeline handled it automatically for us.
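This also pays off when cross-validating: if you pass the whole pipeline to cross_val_score, the scaler is re-fitted on each training fold only, so no information from the held-out fold ever leaks into the preprocessing. A minimal sketch:

from sklearn.model_selection import cross_val_score

# The scaler is re-fitted inside every fold, so the held-out fold never
# influences the standardization parameters (no data leakage)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")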

This can be particularly useful in production ML workflows, especially when:

  • there are multiple features that require different handling (see the sketch below)
  • there are multiple pre-processing steps

Without pipelines, it quickly becomes difficult to apply this processing consistently at every stage and to maintain it over time.
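To give a flavour of the first point: the Iris features are all numeric, but a dataset mixing numeric and categorical columns can be handled by placing a ColumnTransformer inside the pipeline. Here is a sketch, assuming a hypothetical DataFrame with numeric columns 'age' and 'income' and a categorical column 'region':

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Each group of columns gets its own transformer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),  # standardize numeric columns
    ('cat', OneHotEncoder(), ['region'])           # one-hot encode the categorical column
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
# pipeline.fit(df_train, y_train) would then apply the right transformation
# to each group of columns automatically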

Thanks to scikit-learn’s Pipelines, we can aggregate all processing and modeling steps in one place, which makes it much easier to tweak a portion of the workflow without having to manage it separately for the training and evaluation stages.

To read further about Pipelines

To read further about the functionality of Pipelines, you could have a look at:

  • this Medium post
  • this tutorial on Python-Bloggers
  • or look at the Pipeline documentation and examples that come with it.

Footnotes

  1. Which is a multinomial logistic regression model under the hood, since we have a multiclass problem.

  2. Note that accuracy is an acceptable metric here because the classes are balanced.


Copyright 2024, LSE Data Science Institute