ME204 – Data Engineering for the Social World
21 July 2025
You’ve already experienced the entire data science workflow.
Let’s annotate that workflow with the skills you learned last week.
Today’s Goal: Upgrade each step of this workflow with a more powerful technique: vectorisation.
It’s easy to query for just one city:
# Get data for London
import requests

API_URL = "https://archive-api.open-meteo.com/v1/archive"

params = {
    "latitude": 51.5,
    "longitude": -0.1,
    "daily": "temperature_2m_max",
    "start_date": "2023-06-01",
    "end_date": "2023-08-31"
}
response_london = requests.get(API_URL, params=params)
data_london = response_london.json()
But then we faced the question: how do we replicate this process to get data for more cities?
We could just… copy and paste the code. This is very repetitive!
# Get data for London
params_london = { "latitude": 51.5, "longitude": -0.1, ... }
response_london = requests.get(API_URL, params=params_london)
# Get data for Paris
params_paris = { "latitude": 48.9, "longitude": 2.3, ... }
response_paris = requests.get(API_URL, params=params_paris)
# ... and again for Berlin, etc.
This works, but it’s messy and doesn’t scale. 🤔 Think about it: what if we had 100 cities?
We can automate this repetition using lists and a for loop.
API_URL = "https://archive-api.open-meteo.com/v1/archive"

cities_names = ["London", "Paris", "Berlin"]
cities_latitudes = [51.5, 48.9, 52.5]
cities_longitudes = [-0.1, 2.3, 13.4]

all_data = {}  # An empty dictionary to store results

for i in range(len(cities_names)):
    city = cities_names[i]
    params = {
        "latitude": cities_latitudes[i],
        "longitude": cities_longitudes[i],
        # ... other params ...
    }
    response = requests.get(API_URL, params=params)
    all_data[city] = response.json()
This is better, but it’s still slow. We make one network request for every single city.
A much better way is to send all the locations in a single API call.
# By learning more about the API, we found
# we can make a single, more efficient request:
params = {
    "latitude": [51.5, 48.9],   # A list of latitudes
    "longitude": [-0.1, 2.3],   # A list of longitudes
    # ... other params ...
}
response = requests.get(API_URL, params=params)
data = response.json()  # One request, all data back at once
Key Idea: We replaced a Python for loop with a single, powerful operation that works on the whole collection of cities at once.
This was enabled by the API itself, but we want to make similar replacements of for loops in other places too.
When we gathered data for multiple cities, the API returned a list of dictionaries.
[
    { "latitude": 51.5, "longitude": -0.1, "daily": { ... } },
    { "latitude": 48.9, "longitude": 2.3, "daily": { ... } },
    ...
]
Instead of a simple pd.DataFrame(data['daily']), we now need to combine the daily data from each of the cities into a single DataFrame.
A common way to solve this is to create an empty list, loop through the results, and append a new DataFrame for each city.
# Assume 'data' is the list of dicts from the API
# and 'cities_names' is our list of city names
import pandas as pd

list_of_dfs = []
for i, city_data in enumerate(data):
    city_name = cities_names[i]
    # Create a DataFrame from the 'daily' data
    temp_df = pd.DataFrame(city_data['daily'])
    # Add the city name as a new column
    temp_df['city'] = city_name
    # Append it to our list
    list_of_dfs.append(temp_df)
Now list_of_dfs is a list of separate DataFrames: [df_london, df_paris, ...]
Pandas has the perfect tool for combining all of these separate DataFrames: pd.concat() takes a list of DataFrames and intelligently stitches them together into one. We now have one single, tidy DataFrame to work with. But can we make the code that creates the list more elegant?
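Before we do, the concat step itself is one line — a minimal sketch, reusing list_of_dfs from the loop above (ignore_index=True simply renumbers the rows of the combined DataFrame):

# One DataFrame, rows renumbered 0..n-1
df_all = pd.concat(list_of_dfs, ignore_index=True)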
The for loop we wrote follows a very common Python pattern.
# Our loop
list_of_dfs = []
for i, city_data in enumerate(data):
    # ... create temp_df ...
    # then eventually append it
    list_of_dfs.append(temp_df)
This entire pattern can be replaced by a list comprehension.
It’s a compact, readable way to create a list from a loop.
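Here is the general shape of the pattern on a toy example (hypothetical values, purely for illustration):

# The build-and-append loop pattern...
squares = []
for x in range(5):
    squares.append(x ** 2)

# ...collapses into a single list comprehension
squares = [x ** 2 for x in range(5)]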
Let's convert our loop. We'll also use .assign() (🆕), which is a convenient way to add new columns:
# I can assign a new column to a dataframe like this
df_london = df_london.assign(city="London")
# This is equivalent to
df_london['city'] = 'London'
(Assume cities = ['London', 'Paris', 'Berlin'] and we have already collected data.) Instead of saving each individual DataFrame to its own variable, we now save them all to a list of DataFrames:
# 🆕 A list of dataframes
list_of_dfs = []
for i, city in enumerate(cities):
    temp_df = pd.DataFrame(data[i]['daily'])
    temp_df = temp_df.assign(city=city)
    # Then add to the list:
    list_of_dfs.append(temp_df)
✨ PRO: replace for loops + list.append() with a list comprehension for a much more compact solution:
Now, we can place the list comprehension directly inside the pd.concat call. The result is a single, expressive line of code that is both efficient and highly readable once you know the pattern.
df_final = pd.concat([
    pd.DataFrame(data[i]['daily']).assign(city=city)
    for i, city in enumerate(cities)
])
👉 This is the kind of elegant, pandas-friendly code we should aim for.
We just saw how a Python feature (list comprehensions) helps us write shorter code to prepare data for a pandas function. (You might not think of it as "simpler" yet, but it grows on you.)
The pandas library itself has many powerful, built-in tools that can replace loops entirely. That is what we explore next.
Our goal today is to learn to spot similar opportunities where we can simplify our code and get rid of for loops in favour of pandas vectorised operations.
Let's solve the heatwave problem with a for loop first.
Goal: Find periods of 3 or more consecutive days where temperature is >= 28°C.
Our data: a list of daily maximum temperatures, e.g. [25, 26, 29, 30, 31, 26, 32, 33].
Our logic: check each day against the 28°C threshold, then look for streaks of at least 3 consecutive hot days.
Loop 1: temp is 25. 25 >= 28 is False.
is_hot_day = [False]
Loop 2: temp is 26. 26 >= 28 is False.
is_hot_day = [False, False]
Loop 3: temp is 29. 29 >= 28 is True.
is_hot_day = [False, False, True]
…and so on, until the list is complete:
is_hot_day = [False, False, True, True, True, False, True, True]
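As code, this first loop could look like the following minimal sketch, assuming the example temperatures above:

temps = [25, 26, 29, 30, 31, 26, 32, 33]  # daily max temperatures (from the example)
is_hot_day = []
for temp in temps:
    is_hot_day.append(temp >= 28)
# is_hot_day == [False, False, True, True, True, False, True, True]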
Now we have a boolean list. The next step is to find the streaks of True values. We are particularly interested in those streaks that last at least 3 consecutive days.
Here is the for loop logic to find streaks from our boolean list.
streak_counter = 0
heatwaves_found = []

for i, is_hot in enumerate(is_hot_day):
    if is_hot:
        streak_counter += 1
    else:
        # When a streak of 'True's is broken by a 'False'
        if streak_counter >= 3:
            # We record the heatwave we just passed
            start_index = i - streak_counter
            heatwaves_found.append({'start': start_index, 'length': streak_counter})
        # Reset counter for the next potential streak
        streak_counter = 0

# Final check needed for streaks that go to the very end of the list!
if streak_counter >= 3:
    start_index = len(is_hot_day) - streak_counter
    heatwaves_found.append({'start': start_index, 'length': streak_counter})
Let's trace the loop. We start at index 0.
[False, False, True, True, True, False, True, True]
is_hot is False. streak_counter was 0, so we don't record anything. We reset streak_counter to 0.
streak_counter = 0
heatwaves_found = [ ]
Index 1: is_hot is False. Same story.
[False, False, True, True, True, False, True, True]
is_hot is False. streak_counter was 0, so we don't record anything. We reset streak_counter to 0.
streak_counter = 0
heatwaves_found = [ ]
Index 2: is_hot is True. We increment the counter.
[False, False, True, True, True, False, True, True]
streak_counter = 1
heatwaves_found = [ ]
Index 3: is_hot is True. Increment again. Index 4: is_hot is True. Increment again.
[False, False, True, True, True, False, True, True]
streak_counter = 3
heatwaves_found = [ ]
Index 5: is_hot is False. The streak is broken!
[False, False, True, True, True, False, True, True]
Was streak_counter >= 3? Yes, it was 3. We record the heatwave: start is 5 - 3 = 2, length is 3. We reset streak_counter to 0.
streak_counter = 0
heatwaves_found = [ {'start': 2, 'length': 3} ]
We continue to the end. The final streak of True values is only 2 long, so it is not recorded.
[False, False, True, True, True, False, True, True]
This logic is getting complicated. There must be a better way…
Instead of creating separate lists, we add the data in context, inside the DataFrame, as a temporary column.
| Date | Temp | hot_day |
|---|---|---|
| 2023-07-01 | 25 | False |
| 2023-07-02 | 26 | False |
| 2023-07-03 | 29 | True |
| 2023-07-04 | 30 | True |
| 2023-07-05 | 31 | True |
| 2023-07-06 | 26 | False |
| 2023-07-07 | 32 | True |
| 2023-07-08 | 33 | True |
This is much neater than using list.append() and for loops.
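The whole column comes from a single vectorised comparison — a minimal sketch, assuming the daily max temperatures live in a column called temp:

# Compare the entire column to the threshold at once; no loop needed
# ('temp' is an assumed column name for the daily max temperature)
df_summer['hot_day'] = df_summer['temp'] >= 28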
How can we tell when a block of True values starts? A block starts when the value changes from the previous one. We use .diff() to find these changes.
# Convert boolean to integer (0 or 1) before calculating the difference
df_summer['hot_day'].astype(int).diff()
| hot_day | astype(int) | .diff() | Change? |
|---|---|---|---|
| False | 0 | NaN | |
| False | 0 | 0 | No change |
| True | 1 | 1 | Streak starts |
| True | 1 | 0 | No change |
| True | 1 | 0 | No change |
| False | 0 | -1 | Streak ends |
| True | 1 | 1 | Streak starts |
| True | 1 | 0 | No change |
This is the clever part. If we take the cumulative sum (.cumsum()) of the "change detected" column, every time a change happens, the sum increases. This gives each consecutive block of days a unique ID!
| hot_day | .diff() != 0 | .cumsum() | Block ID |
|---|---|---|---|
| False | True | 1 | 1 |
| False | False | 1 | 1 |
| True | True | 2 | 2 |
| True | False | 2 | 2 |
| True | False | 2 | 2 |
| False | True | 3 | 3 |
| True | True | 4 | 4 |
| True | False | 4 | 4 |
Now all the days in the first heatwave have ID 2, and the days in the second have ID 4.
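Putting the two steps together — a minimal sketch (block_id is a hypothetical column name we choose here):

# NaN != 0 evaluates to True, so the very first row correctly starts block 1
change = df_summer['hot_day'].astype(int).diff() != 0
df_summer['block_id'] = change.cumsum()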
Now we can use groupby() to count how many days are in each block. .transform('size') is powerful: it calculates the size of each group and then "broadcasts" that result back to every row of the original DataFrame.
| Block ID | streak_length |
|---|---|
| 1 | 2 |
| 1 | 2 |
| 2 | 3 |
| 2 | 3 |
| 2 | 3 |
| 3 | 1 |
| 4 | 2 |
| 4 | 2 |
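In code, this step could look like the following sketch (again assuming the block_id column from the previous sketch):

# Each row receives the size of the block it belongs to
df_summer['streak_length'] = df_summer.groupby('block_id')['hot_day'].transform('size')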
The groupby operation follows a "split-apply-combine" strategy: it splits the data into groups based on your criteria, applies a function to each group, and then combines the results.
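For instance, a split-apply-combine one-liner on our earlier multi-city DataFrame might look like this (df_final and the temperature_2m_max column come from the API example above):

# Split by city, apply the mean to each group, combine into one Series
df_final.groupby('city')['temperature_2m_max'].mean()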
Now we have everything we need. A heatwave is any day that is:
- hot (hot_day == True)
- part of a long enough streak (streak_length >= 3)
This gives us exactly the days that are part of a qualifying heatwave, all without a single Python for loop. This is the power of vectorisation.
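Combining both conditions with a boolean mask — a minimal sketch using the columns built above:

# Keep only hot days that belong to a streak of 3 or more
heatwave_days = df_summer[df_summer['hot_day'] & (df_summer['streak_length'] >= 3)]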
📚 Resources we will check
🔗 Pandas Cookbook
🔗 Python for Data Analysis (3E)
🔗 Seaborn: Declarative Visualisation
LSE Summer School 2025 | ME204 Week 02 Day 01