ME204 – Data Engineering for the Social World
21 July 2025
You’ve already experienced the entire data science workflow.
Let’s annotate that workflow with the skills you learned last week.
Today’s Goal: Upgrade each step of this workflow with a more powerful technique: vectorisation.
It’s easy to query for just one city:
# Get data for London
params = {
    "latitude": 51.5,
    "longitude": -0.1,
    "daily": "temperature_2m_max",
    "start_date": "2023-06-01",
    "end_date": "2023-08-31"
}
response_london = requests.get(API_URL, params=params)
data_london = response_london.json()But then we faced the question: how do we replicate this process to get data for more cities?
We could just… copy and paste the code. This is very repetitive!
# Get data for London
params_london = { "latitude": 51.5, "longitude": -0.1, ... }
response_london = requests.get(API_URL, params=params_london)
# Get data for Paris
params_paris = { "latitude": 48.9, "longitude": 2.3, ... }
response_paris = requests.get(API_URL, params=params_paris)
# ... and again for Berlin, etc.This works, but it’s messy and doesn’t scale. 🤔 Think about it: what if we had 100 cities?
for loopWe can automate this repetition using lists and a for loop.
API_URL = "https://archive-api.open-meteo.com/v1/archive"
cities_names = ["London", "Paris", "Berlin"]
cities_latitudes = [51.5, 48.9, 52.5]
cities_longitudes = [-0.1, 2.3, 13.4]
all_data = {} # An empty dictionary to store resultsfor i in range(len(cities_names)):
    city = cities_names[i]
    params = {
        "latitude": cities_latitudes[i],
        "longitude": cities_longitudes[i],
        ...
    }
    response = requests.get(API_URL, params=params)
    all_data[city] = response.json()This is better, but it’s still slow. We make one network request for every single city.
A much better way is to send all the locations in a single API call.
# By learning more about the API,
# we learned we can make more efficient 
params = {
    "latitude": [51.5, 48.9],  # A list of latitudes
    "longitude": [-0.1, 2.3],  # A list of longitudes
    # ... other params ...
}
response = requests.get(API_URL, params=params)
data = response.json() # One request, all data back at onceKey Idea: We replaced a Python for loop with a single, powerful operation that works on the whole collection of cities at once.
This was enabled by the API itself but we want to make similar replacements of for loops for other stuff as well.
When we gathered data for multiple cities, the API returned a list of dictionaries.
[
  { "latitude": 51.5, "longitude": -0.1, "daily": { ... } },
  { "latitude": 48.9, "longitude": 2.3,  "daily": { ... } },
  ...
]Instead of a simple pd.DataFrame(data['daily']), we now need to combine the daily data from each of the cities into a single DataFrame.
for loop wayA common way to solve this is to create an empty list, loop through the results, and append a new DataFrame for each city.
# Assume 'data' is the list of dicts from the API
# and 'cities_names' is our list of city names
list_of_dfs = [] for i, city_data in enumerate(data):
    city_name = cities_names[i]
    # Create a DataFrame from the 'daily' data
    temp_df = pd.DataFrame(city_data['daily'])
    
    # Add the city name as a new column
    temp_df['city'] = city_name
    
    # Append it to our list
    list_of_dfs.append(temp_df)Now list_of_dfs is a list of separate DataFrames: [df_london, df_paris, ...]
pd.concatPandas has the perfect tool for combining all of these separate Dataframes:
pd.concat() takes a list of DataFrames and intelligently stitches them together into one.We now have one single, tidy DataFrame to work with. But can we make the code that creates the list more elegant?
The for loop we wrote follows a very common Python pattern.
# Our loop
list_of_dfs = []
for i, city_data in enumerate(data):
    # ... create temp_df ...
    # then eventually append it
    list_of_dfs.append(temp_df)This entire pattern can be replaced by a list comprehension.
It’s a compact, readable way to create a list from a loop.
Let’s convert our loop. We’ll also use .assign() which is a convenient way to add new columns.
assign (🆕):# I can assign a new column to a dataframe like this
df_london = df_london.assign(city="London")
# This is equivalent to
df_london['city'] = 'London'cities = ['London', 'Paris', 'Berlin'] and we already collected data)Instead of saving each individual dataframe, we now save them to a list of dataframes:
# 🆕 A list of dataframes
list_of_dfs = []
for i, city in enumerate(cities)
    temp_df = pd.DataFrame(data[i]['daily'])
    temp_df = temp_df.assign(city=city)
    # Then add to the list:
    list_of_dfs.append(temp_df)
    ✨ PRO: replace for loops + list.append() with list comprehension for a much more compact solution:
Now, we can place the list comprehension directly inside the pd.concat call. The result is a single, expressive line of code that is both efficient and highly readable once you know the pattern.
df_final = pd.concat([
    pd.DataFrame(data[i]['daily']).assign(city=city) 
    for i, city in enumerate(cities)
])👉 This is the kind of elegant, pandas-friendly code we should aim for.
We just saw how a Python feature (list comprehensions) helps us write shorter code to prepare data for a pandas function. (You might not think of it as “simpler” yet but it grows on you..)
The pandas library itself has many powerful, built-in tools that can replace loops entirely. That is what we explore next.
Our goal today is to learn to spot similar opportunities of where we can simplify our code and get rid of for loops in favour of pandas vectorised operations.
Let’s solve the heatwave problem with a for loop first.
Goal: Find periods of 3 or more consecutive days where temperature is >= 28°C.
Our data:
Our logic:
for loops way)Loop 1: temp is 25. 25 >= 28 is False.
is_hot_day = [False]
for loops way)Loop 2: temp is 26. 26 >= 28 is False.
is_hot_day = [False, False]
for loops way)Loop 3: temp is 29. 29 >= 28 is True.
is_hot_day = [False, False, True]
for loops way)…and so on, until the list is complete.
is_hot_day = [False, False, True, True, True, False, True, True]
Now we have a boolean list. The next step is to find the streaks of True values. We are particularly instered in those streaks that achieve at least 3 consecutive days.
Here is the for loop logic to find streaks from our boolean list.
for i, is_hot in enumerate(is_hot_day):
    if is_hot:
        streak_counter += 1
    else:
        # When a streak of 'True's is broken by a 'False'
        if streak_counter >= 3:
            # We record the heatwave we just passed
            start_index = i - streak_counter
            heatwaves_found.append({'start': start_index, 'length': streak_counter})
        # Reset counter for the next potential streak
        streak_counter = 0
# Final check needed for streaks that go to the very end of the list!Let’s trace the loop. We start at index 0.
[False, False, True, True, True, False, True, True]
is_hot is False.streak_counter was 0, so we don’t record anything.streak_counter to 0.streak_counter = 0
 heatwaves_found = [ ]
Index 1: is_hot is False. Same story.
[False, False, True, True, True, False, True, True]
is_hot is False.streak_counter was 0, so we don’t record anything.streak_counter to 0.streak_counter = 0
 heatwaves_found = [ ]
Index 2: is_hot is True. We increment the counter.
[False, False, True, True, True, False, True, True]
is_hot is True.streak_counter is incremented.streak_counter = 1
 heatwaves_found = [ ]
Index 3: is_hot is True. Increment again. Index 4: is_hot is True. Increment again.
[False, False, True, True, True, False, True, True]
streak_counter = 3
 heatwaves_found = [ ]
Index 5: is_hot is False. The streak is broken!
[False, False, True, True, True, False, True, True]
streak_counter >= 3? Yes, it was 3.start is 5 - 3 = 2, length is 3.streak_counter to 0.streak_counter = 0
 heatwaves_found = [ {'start': 2, 'length': 3} ]
We continue to the end. The final streak of True is only 2 long, so it is not recorded.
[False, False, True, True, True, False, True, True]
This logic is getting complicated. There must be a better way…
Instead of creating separate lists, we add the data in-context, inside the DataFrame, as a separate temporary column instead.
| Date | Temp | hot_day | 
|---|---|---|
| 2023-07-01 | 25 | False | 
| 2023-07-02 | 26 | False | 
| 2023-07-03 | 29 | True | 
| 2023-07-04 | 30 | True | 
| 2023-07-05 | 31 | True | 
| 2023-07-06 | 26 | False | 
| 2023-07-07 | 32 | True | 
| 2023-07-08 | 33 | True | 
This is much neater than using list.append() and for loops
How can we tell when a block of True values starts? A block starts when the value changes from the previous one. We use .diff() to find these changes.
# Convert boolean to integer (0 or 1) before calculating the difference
df_summer['hot_day'].astype(int).diff()| hot_day | astype(int) | .diff() | Change? | 
|---|---|---|---|
| False | 0 | NaN | |
| False | 0 | 0 | No change | 
| True | 1 | 1 | Streak starts | 
| True | 1 | 0 | No change | 
| True | 1 | 0 | No change | 
| False | 0 | -1 | Streak ends | 
| True | 1 | 1 | Streak starts | 
| True | 1 | 0 | No change | 
This is the clever part. If we take the cumulative sum (.cumsum()) of the “change detected” column, every time a change happens, the sum increases. This gives each consecutive block of days a unique ID!
| hot_day | .diff() != 0 | .cumsum() | Block ID | 
|---|---|---|---|
| False | True | 1 | 1 | 
| False | False | 1 | 1 | 
| True | True | 2 | 2 | 
| True | False | 2 | 2 | 
| True | False | 2 | 2 | 
| False | True | 3 | 3 | 
| True | True | 4 | 4 | 
| True | False | 4 | 4 | 
Now all the days in the first heatwave have ID 2, and the days in the second have ID 4.
Now we can use groupby() to count how many days are in each block.
.transform('size') is powerful: it calculates the size of each group and then “broadcasts” that result back to every row of the original DataFrame.
| Block ID | streak_length | 
|---|---|
| 1 | 2 | 
| 1 | 2 | 
| 2 | 3 | 
| 2 | 3 | 
| 2 | 3 | 
| 3 | 1 | 
| 4 | 2 | 
| 4 | 2 | 
groupbyThe groupby operation follows a “split-apply-combine” strategy. It splits the data into groups based on your criteria, applies a function to each group, and then combines the results.
Now we have everything we need. A heatwave is any day that is:
hot_day == True)streak_length >= 3)This gives us exactly the days that are part of a qualifying heatwave, all without a single Python for loop. This is the power of vectorisation.
📚 Resources we will check
🔗 Pandas Cookbook
🔗 Python for Data Analysis (3E)
🔗 Seaborn: Declarative Visualisation
LSE Summer School 2025 | ME204 Week 02 Day 01