DS105W – Data for Data Science
🗓️ 05 Mar 2026
NB01)Let’s talk about the new challenge ahead: ✍️ Mini-Project 2.
How does travelling between Inner London and Outer London change when it happens during peak hours versus off-peak hours?
What’s already set in the question:
How does travelling between Inner London and Outer London change when it happens during peak hours versus off-peak hours?
What you decide:
In this section, we will:
json_normalize() turns that structure into a flat tablegroupby() with a dataframeImagine you have this dataframe:
BEFORE GROUPING
| route_id | time_band | duration_min |
|---|---|---|
| R1 | peak | 51 |
| R2 | off-peak | 43 |
| R3 | peak | 56 |
| R4 | off-peak | 40 |
| R5 | peak | 49 |
AFTER GROUPING
| group label | rows inside |
|---|---|
| peak | R1, R3, R5 |
| off-peak | R2, R4 |
The same rows are now split by time_band.
Now I need to do something with each group 👉
groupby().agg() doesWe can summarize each group using built-in aggregation functions. agg() lets us do this for specific columns:
BEFORE summarizing
| group | values |
|---|---|
| peak | 51, 56, 49 |
| off-peak | 43, 40 |
AFTER summarizing
| time_band | mean_ duration |
median_ duration |
n_ routes |
|---|---|---|---|
| peak | 52.0 | 51.0 | 3 |
| off-peak | 41.5 | 41.5 | 2 |
groupby().apply() lets you doUse pd.Series.apply() when you want to decide exactly how to calculate something for each group. groupby().apply() lets you run your own function on every group:
BEFORE summarizing
| group | values |
|---|---|
| peak | 51, 56, 49 |
| off-peak | 43, 40 |
AFTER groupby().apply()
| time_band | spread_ratio |
|---|---|
| peak | 1.14 |
| off-peak | 1.08 |
pd.DataFrame.apply(..., axis=1)Apply a custom function to each row by using pd.DataFrame.apply(..., axis=1). Here, you can combine columns into something new for every row. The axis=1 argument means: “run the function once per row”.
BEFORE
| from_ zone |
to_ zone |
time_ band |
|---|---|---|
| Inner | Outer | peak |
| Inner | Outer | off-peak |
AFTER row-wise apply
| from_ zone |
to_ zone |
time_ band |
route_signature |
|---|---|---|---|
| Inner | Outer | peak | Inner -> Outer (peak) |
| Inner | Outer | off-peak | Inner -> Outer (off-peak) |
groupby().apply(...)Here, we compare each route pair (from_zone, to_zone) by passing each group to our function:
BEFORE
| from_ zone |
to_ zone |
time_ band |
time |
|---|---|---|---|
| Inner | Outer | peak | 51 |
| Inner | Outer | off-peak | 43 |
| Inner | Outer | peak | 56 |
| Inner | Outer | off-peak | 40 |
| ... | ... | ... | ... |
AFTER grouped apply
| from_ zone |
to_ zone |
peak_ mean |
off_peak_ mean |
delta_ min |
|---|---|---|---|---|
| Inner | Outer | 52.3 | 41.5 | 10.8 |
| Inner | Suburban | 47.0 | 39.2 | 7.8 |
Note: The values here are made up so you can see that in the end, you produce just one row per group.
Here’s something new. Most APIs send nested data structures, but to analyse the data, you usually want a flat table.
NESTED JSON
TABULAR TARGET
| duration | fare_totalCost | from | to |
|---|---|---|---|
| 42 | 2.90 | WC2A 2AE | SE1 9JA |
| 37 | 2.40 | WC2A 2AE | SE1 9JA |
| 39 | 2.60 | WC2A 2AE | SW11 1AA |
| 45 | 3.10 | WC2A 2AE | E15 2EE |
| … | … | … | … |
You probably faced this in ✍️ Mini Project 1 when unpacking nested dictionaries into columns.
json_normalize()This route works, but it is fragile when nested structure changes.
BEFORE manual extraction
| response object |
|---|
{"requestContext": {...}, "journeys": [...]} |
AFTER manual extraction
| duration | fare | from | to |
|---|---|---|---|
| 42 | {...} |
WC2A 2AE | SE1 9JA |
| 37 | {...} |
WC2A 2AE | SE1 9JA |
| … | … | … | … |
We still have some nested structures in the fare column.
json_normalize()Use json_normalize() when your main task is flattening nested response structures into analysis-ready tables.
You could use it alone: pd.json_normalize(journey_response) or you could make use of its parameters:
record_path: the path to the nested list to expandmeta: the path to the parent object to keepsep: the separator to use for the column namesrecord_pathrecord_path choice |
What happens | Typical row unit |
|---|---|---|
| not set | nested journeys stay in one cell | one response object |
"journeys" |
journeys are expanded to multiple rows | one journey |
record_path outputWITHOUT record_path
| requestContext.from | requestContext.to | journeys |
|---|---|---|
| WC2A 2AE | SE1 9JA | [{...}, {...}, ...] |
All the data is here, but most of it is packed inside one cell.
WITH record_path="journeys"
| duration | fare.totalCost |
|---|---|
| 42 | 2.90 |
| 37 | 2.40 |
| 39 | 2.60 |
| 45 | 3.10 |
| … | … |
We lost the data outside the journeys key but at least we have one row per journey!
metaOnce rows are expanded from journeys, we often need route context copied onto each row. meta does that.
meta choice |
Effect on flattened rows |
|---|---|
| not set | journey rows lose route context columns |
| include origin/destination paths | each row keeps route context and journey values |
meta context preservationWITH record_path
BUT WITHOUT meta
| duration | fare.totalCost |
|---|---|
| 42 | 2.90 |
| 37 | 2.40 |
| 39 | 2.60 |
| 45 | 3.10 |
| … | … |
WITH record_path
AND WITH meta
| duration | fare.totalCost | requestContext.from | requestContext.to |
|---|---|---|---|
| 42 | 2.90 | WC2A 2AE | SE1 9JA |
| 37 | 2.40 | WC2A 2AE | SE1 9JA |
| 39 | 2.60 | WC2A 2AE | SW11 1AA |
| 45 | 3.10 | WC2A 2AE | E15 2EE |
| … | … | … | … |
Both pieces of information is preserved!! Useful for a groupby() operation later if I have multiple routes in the same DataFrame.
sepThis does not change the data values. It changes column naming grammar for readability and downstream coding.
sep value |
Example output column |
|---|---|
default "." |
fare.totalCost |
"__" |
fare__totalCost |
In the lab, your class teacher will walk you through calling the TfL API with a working notebook, then guide you through flattening the nested JSON response with json_normalize().
By the end of lab, you should have your ✍️ Mini Project 2 repository cloned, at least one real API call in your own NB01, and the raw JSON saved to data/raw/.
Before W08 Lecture, have a look at the ONS Postcode Directory data dictionary in the ✍️ Mini Project 2 spec. W08 is about databases and merging data from multiple sources, and you will want to know what columns are available before you get there.
![]()
LSE DS105W (2025/26)