🗓️ Week 02 | Day 01
From Loops to Vectorisation

ME204 – Data Engineering for the Social World

Dr Jon Cardoso-Silva

21 July 2025

Where We Are

You’ve already experienced the entire data science workflow.

Workflow diagram: Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine Learning → Obtain insights → Communicate results → End

What You Already Know

Let’s annotate that workflow with the skills you learned last week.

Workflow diagram (annotated): Start → Gather data (requests.get) → Store it somewhere (save JSON/CSV) → Clean & pre-process (parse the 'daily' key) → Build a dataset (pd.DataFrame) → Exploratory data analysis (.describe(), .plot()) → Machine Learning (we skip this!) → Obtain insights (Matplotlib plot) → Communicate results (Jupyter Notebook) → End

Today’s Goal: Upgrade each step of this workflow with a more powerful technique: vectorisation.

The Problem: Getting Data for Many Cities

It’s easy to query for just one city:

import requests

API_URL = "https://archive-api.open-meteo.com/v1/archive"

# Get data for London
params = {
    "latitude": 51.5,
    "longitude": -0.1,
    "daily": "temperature_2m_max",
    "start_date": "2023-06-01",
    "end_date": "2023-08-31"
}
response_london = requests.get(API_URL, params=params)
data_london = response_london.json()

But then we faced the question: how do we replicate this process to get data for more cities?

The “Beginner” Way: Copy, Paste, Repeat

We could just… copy and paste the code. This is very repetitive!

# Get data for London
params_london = { "latitude": 51.5, "longitude": -0.1, ... }
response_london = requests.get(API_URL, params=params_london)

# Get data for Paris
params_paris = { "latitude": 48.9, "longitude": 2.3, ... }
response_paris = requests.get(API_URL, params=params_paris)

# ... and again for Berlin, etc.

This works, but it’s messy and doesn’t scale. 🤔 Think about it: what if we had 100 cities?

The “Beginner” Way: A for loop

We can automate this repetition using lists and a for loop.

API_URL = "https://archive-api.open-meteo.com/v1/archive"
cities_names = ["London", "Paris", "Berlin"]
cities_latitudes = [51.5, 48.9, 52.5]
cities_longitudes = [-0.1, 2.3, 13.4]
all_data = {} # An empty dictionary to store results
for i in range(len(cities_names)):
    city = cities_names[i]
    params = {
        "latitude": cities_latitudes[i],
        "longitude": cities_longitudes[i],
        ...
    }
    response = requests.get(API_URL, params=params)
    all_data[city] = response.json()

This is better, but it’s still slow. We make one network request for every single city.

The API allows us to gather more in one go

A much better way is to send all the locations in a single API call.

# By learning more about the API,
# we learned we can make a single, more efficient request
params = {
    "latitude": [51.5, 48.9],  # A list of latitudes
    "longitude": [-0.1, 2.3],  # A list of longitudes
    # ... other params ...
}
response = requests.get(API_URL, params=params)
data = response.json() # One request, all data back at once

Key Idea: We replaced a Python for loop with a single, powerful operation that works on the whole collection of cities at once.

This was enabled by the API itself, but we want to make similar replacements of for loops elsewhere in our code as well.

From Loops to List Comprehensions
(a new trick)

When we gathered data for multiple cities, the API returned a list of dictionaries.

[
  { "latitude": 51.5, "longitude": -0.1, "daily": { ... } },
  { "latitude": 48.9, "longitude": 2.3,  "daily": { ... } },
  ...
]

Instead of a simple pd.DataFrame(data['daily']), we now need to combine the daily data from each of the cities into a single DataFrame.

The for loop way

A common way to solve this is to create an empty list, loop through the results, and append a new DataFrame for each city.

import pandas as pd

# Assume 'data' is the list of dicts from the API
# and 'cities_names' is our list of city names
list_of_dfs = [] 
for i, city_data in enumerate(data):
    city_name = cities_names[i]

    # Create a DataFrame from the 'daily' data
    temp_df = pd.DataFrame(city_data['daily'])
    
    # Add the city name as a new column
    temp_df['city'] = city_name
    
    # Append it to our list
    list_of_dfs.append(temp_df)

Now list_of_dfs is a list of separate DataFrames: [df_london, df_paris, ...]

Combining with pd.concat

Pandas has the perfect tool for combining all of these separate DataFrames:

df_final = pd.concat(list_of_dfs)
  • pd.concat() takes a list of DataFrames and intelligently stitches them together into one.
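One detail worth knowing: by default, pd.concat() keeps each DataFrame's original row index, so index labels repeat across cities. A minimal tweak, using pandas' ignore_index parameter, rebuilds a fresh 0, 1, 2, … index:

df_final = pd.concat(list_of_dfs, ignore_index=True)  # reset the row index after stitching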

We now have one single, tidy DataFrame to work with. But can we make the code that creates the list more elegant?

A Better Way: List Comprehensions

The for loop we wrote follows a very common Python pattern.

# Our loop

list_of_dfs = []
for i, city_data in enumerate(data):

    # ... create temp_df ...

    # then eventually append it
    list_of_dfs.append(temp_df)

This entire pattern can be replaced by a list comprehension.

# The list comprehension pattern
[<expression> for <item> in <iterable>]

It’s a compact, readable way to create a list from a loop.
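For example, here is the same pattern on a toy task (squaring numbers), first as a loop and then as a list comprehension:

# The loop version
squares = []
for n in range(5):
    squares.append(n ** 2)

# The list comprehension version
squares = [n ** 2 for n in range(5)]  # [0, 1, 4, 9, 16]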

Building a compact solution

Let’s convert our loop. We’ll also use .assign() which is a convenient way to add new columns.

  • The use of assign (🆕):
# I can assign a new column to a dataframe like this
df_london = df_london.assign(city="London")

# This is equivalent to
df_london['city'] = 'London'
  • Inside a loop:
    (Imagine we have cities = ['London', 'Paris', 'Berlin'] and we have already collected the data.)
for i, city in enumerate(cities):
    # Create temp_df

    # then assign a new column
    temp_df = temp_df.assign(city=city)
    

Building a compact solution (cont.)

Instead of saving each individual dataframe, we now save them to a list of dataframes:

# 🆕 A list of dataframes
list_of_dfs = []

for i, city in enumerate(cities):
    temp_df = pd.DataFrame(data[i]['daily'])
    temp_df = temp_df.assign(city=city)
    # Then add to the list:
    list_of_dfs.append(temp_df)
    

✨ PRO: replace for loops + list.append() with list comprehension for a much more compact solution:

list_of_dfs = [pd.DataFrame(data[i]['daily']).assign(city=city) 
               for i, city in enumerate(cities)]

Build the final DataFrame in one go

Now, we can place the list comprehension directly inside the pd.concat call. The result is a single, expressive line of code that is both efficient and highly readable once you know the pattern.

df_final = pd.concat([
    pd.DataFrame(data[i]['daily']).assign(city=city) 
    for i, city in enumerate(cities)
])

👉 This is the kind of elegant, pandas-friendly code we should aim for.

Thinking bigger

We just saw how a Python feature (list comprehensions) helps us write shorter code to prepare data for a pandas function. (You might not think of it as “simpler” yet, but it grows on you.)

The pandas library itself has many powerful, built-in tools that can replace loops entirely. That is what we explore next.

Vectorisation in Pandas

Our goal today is to learn to spot opportunities where we can simplify our code and get rid of for loops in favour of pandas vectorised operations.
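As a first taste, here is a small sketch (using the same temperatures as the example below) that contrasts a loop with the equivalent vectorised comparison on a pandas Series:

import pandas as pd

temps = pd.Series([25, 26, 29, 30, 31, 26, 32, 33])

# Loop-style: compare one value at a time
is_hot_loop = [t >= 28 for t in temps]

# Vectorised: compare the whole Series at once
is_hot_vec = temps >= 28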

Detecting Hot Days: The Loop Approach

Let’s solve the heatwave problem with a for loop first.

Goal: Find periods of 3 or more consecutive days where temperature is >= 28°C.

Our data:

25, 26, 29, 30, 31, 26, 32, 33

Our logic:

temperatures = [25, 26, 29, 30, 31, 26, 32, 33]
is_hot_day = []
for temp in temperatures:
    is_hot_day.append(temp >= 28)

# is_hot_day is now:
# [False, False, True, True, True, False, True, True]

Detect hot days (the for loop way)

Loop 1: temp is 25. 25 >= 28 is False.

Temperatures: 25, 26, 29, 30, 31, 26, 32, 33

Our new list:

is_hot_day = [False]

Detect hot days (the for loop way)

Loop 2: temp is 26. 26 >= 28 is False.

Temperatures: 25, 26, 29, 30, 31, 26, 32, 33

Our new list:

is_hot_day = [False, False]

Detect hot days (the for loop way)

Loop 3: temp is 29. 29 >= 28 is True.

Temperatures: 25, 26, 29, 30, 31, 26, 32, 33

Our new list:

is_hot_day = [False, False, True]

Detect hot days (the for loop way)

…and so on, until the list is complete.

is_hot_day = [False, False, True, True, True, False, True, True]

Now we have a boolean list. The next step is to find the streaks of True values. We are particularly interested in streaks that last at least 3 consecutive days.

Finding Heatwave Streaks (The Loop Way)

Here is the for loop logic to find streaks from our boolean list.

streak_counter = 0     # how many hot days in a row we have seen so far
heatwaves_found = []   # the heatwaves we detect

for i, is_hot in enumerate(is_hot_day):
    if is_hot:
        streak_counter += 1
    else:
        # When a streak of 'True's is broken by a 'False'
        if streak_counter >= 3:
            # We record the heatwave we just passed
            start_index = i - streak_counter
            heatwaves_found.append({'start': start_index, 'length': streak_counter})
        # Reset counter for the next potential streak
        streak_counter = 0

# Final check needed for streaks that go to the very end of the list!
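That final check is easy to forget. A minimal sketch of it, reusing the same variables, could look like this:

# After the loop: record a streak that was still running at the end of the list
if streak_counter >= 3:
    start_index = len(is_hot_day) - streak_counter
    heatwaves_found.append({'start': start_index, 'length': streak_counter})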

Finding Heatwave Streaks (The Loop Way)

Let’s trace the loop. We start at index 0.

Input list:

[False, False, True, True, True, False, True, True]

  • is_hot is False.
  • streak_counter was 0, so we don’t record anything.
  • We reset streak_counter to 0.

streak_counter = 0
heatwaves_found = [ ]

Finding Heatwave Streaks (The Loop Way)

Index 1: is_hot is False. Same story.

Input list:

[False, False, True, True, True, False, True, True]

  • is_hot is False.
  • streak_counter was 0, so we don’t record anything.
  • We reset streak_counter to 0.

streak_counter = 0
heatwaves_found = [ ]

Finding Heatwave Streaks (The Loop Way)

Index 2: is_hot is True. We increment the counter.

Input list:

[False, False, True, True, True, False, True, True]

  • is_hot is True.
  • streak_counter is incremented.

streak_counter = 1
heatwaves_found = [ ]

Finding Heatwave Streaks (The Loop Way)

Index 3: is_hot is True. Increment again. Index 4: is_hot is True. Increment again.

Input list:

[False, False, True, True, True, False, True, True]

streak_counter = 3
heatwaves_found = [ ]

Finding Heatwave Streaks (The Loop Way)

Index 5: is_hot is False. The streak is broken!

Input list:

[False, False, True, True, True, False, True, True]

  • Now we check: was streak_counter >= 3? Yes, it was 3.
  • We record the heatwave: start is 5 - 3 = 2, length is 3.
  • Reset streak_counter to 0.

streak_counter = 0
heatwaves_found = [ {'start': 2, 'length': 3} ]

Finding Heatwave Streaks (The Loop Way)

We continue to the end. The final streak of True is only 2 long, so it is not recorded.

Input list:

[False, False, True, True, True, False, True, True]

This logic is getting complicated. There must be a better way…

The Pandas Way

Instead of creating separate lists, we keep the data in context: we add it to the DataFrame itself as a temporary column.

df_summer['hot_day'] = df_summer['max_temp_c'] >= 28

Date        Temp (°C)  hot_day
2023-07-01  25         False
2023-07-02  26         False
2023-07-03  29         True
2023-07-04  30         True
2023-07-05  31         True
2023-07-06  26         False
2023-07-07  32         True
2023-07-08  33         True

This is much neater than using list.append() and for loops

Find Where Streaks Begin/End

How can we tell when a block of True values starts? A block starts when the value changes from the previous one. We use .diff() to find these changes.

# Convert boolean to integer (0 or 1) before calculating the difference
df_summer['hot_day'].astype(int).diff()

hot_day  astype(int)  .diff()  Change?
False    0            NaN
False    0            0        No change
True     1            1        Streak starts
True     1            0        No change
True     1            0        No change
False    0            -1       Streak ends
True     1            1        Streak starts
True     1            0        No change

Give Each Streak a Unique ID

This is the clever part. If we take the cumulative sum (.cumsum()) of the “change detected” column, every time a change happens, the sum increases. This gives each consecutive block of days a unique ID!

df_summer['hot_day_block'] = (df_summer['hot_day'].astype(int).diff() != 0).cumsum()

hot_day  .diff() != 0  .cumsum()  Block ID
False    True          1          1
False    False         1          1
True     True          2          2
True     False         2          2
True     False         2          2
False    True          3          3
True     True          4          4
True     False         4          4

Now all the days in the first heatwave have ID 2, and the days in the second have ID 4.

Count the Length of Each Streak

Now we can use groupby() to count how many days are in each block.

df_summer['streak_length'] = df_summer.groupby('hot_day_block')['hot_day'].transform('size')

.transform('size') is powerful: it calculates the size of each group and then “broadcasts” that result back to every row of the original DataFrame.

Block ID  streak_length
1         2
1         2
2         3
2         3
2         3
3         1
4         2
4         2
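To see why .transform() (rather than a plain aggregation) is the right tool here, compare the two on a tiny toy DataFrame (made-up values, not our weather data):

import pandas as pd

toy = pd.DataFrame({'block': [1, 1, 2, 2, 2], 'hot': [False, False, True, True, True]})

# Aggregation: one result per group (the size of each block)
toy.groupby('block')['hot'].size()                                    # block 1 -> 2, block 2 -> 3

# Transform: the same sizes, broadcast back to every original row
toy['streak_length'] = toy.groupby('block')['hot'].transform('size')  # [2, 2, 3, 3, 3]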

A visual depiction of groupby

The groupby operation follows a “split-apply-combine” strategy. It splits the data into groups based on your criteria, applies a function to each group, and then combines the results.
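For instance, applied to our multi-city weather data, a groupby splits df_final by city, applies a mean to each group, and combines the per-city results into one table (a sketch assuming df_final has the 'city' and 'temperature_2m_max' columns built earlier):

# Average of the daily maximum temperature, one row per city
mean_max_temp = df_final.groupby('city')['temperature_2m_max'].mean()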

Filter for Heatwaves

Now we have everything we need. A heatwave is any day that is:

  1. A hot day (hot_day == True)
  2. Part of a streak of 3 or more (streak_length >= 3)
heatwave_days = df_summer[(df_summer['hot_day']) & (df_summer['streak_length'] >= 3)]

This gives us exactly the days that are part of a qualifying heatwave, all without a single Python for loop. This is the power of vectorisation.
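Putting the whole pipeline together, the vectorised heatwave detection fits in a handful of lines (the same steps as above, assuming df_summer with its max_temp_c column):

# 1. Flag hot days
df_summer['hot_day'] = df_summer['max_temp_c'] >= 28

# 2. Give each consecutive block of identical hot_day values a unique ID
df_summer['hot_day_block'] = (df_summer['hot_day'].astype(int).diff() != 0).cumsum()

# 3. Broadcast each block's length back to all of its rows
df_summer['streak_length'] = df_summer.groupby('hot_day_block')['hot_day'].transform('size')

# 4. Keep only hot days that belong to a streak of 3 or more
heatwave_days = df_summer[(df_summer['hot_day']) & (df_summer['streak_length'] >= 3)]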

Spot the opportunities

  • Let’s look at the code you are writing to spot places where you could benefit from:
    • list comprehensions
    • pandas vectorised operations

📚 Resources we will check

🔗 Pandas Cookbook

Practical recipes for common data manipulation tasks. A great place to find examples.

➡️ Official Pandas Cookbook

🔗 Python for Data Analysis (3E)

The book by Wes McKinney, the creator of pandas. It’s an essential, freely available resource.

➡️ Read the book online

🔗 Seaborn: Declarative Visualisation

An alternative to Matplotlib. Seaborn’s functions work directly with DataFrames, which is much closer to the vectorised style we’re learning.

➡️ Explore the Gallery

LSE Summer School 2025 | ME204 Week 02 Day 01