Week 07
✍️ Mini Project 2 Launch: Focus on Methodology, Then Set Up the Tech

DS105W – Data for Data Science

🗓️ 05 Mar 2026

Hour 1: Methodology clinic (NB01)

Let’s talk about the new challenge ahead: ✍️ Mini-Project 2.

  1. Discuss what makes a strong approach to your mini-project
  2. Draft your own plan
  3. Give a peer some feedback
  4. Submit a single Mentimeter entry (Scope, Confidence, Risk)

Your next challenge

How does travelling between Inner London and Outer London change when it happens during peak hours versus off-peak hours?

What’s already set in the question:

  • We focus on the journey between two points: Inner London and Outer London.
  • We care about the time of day: peak and off-peak.
  • We want to compare how journey times differ between these two parts of the day.

Decisions you must make

How does travelling between Inner London and Outer London change when it happens during peak hours versus off-peak hours?

What you decide:

  • Which time period will you study? (Next week? Next month? Only weekdays? Only weekends?)
  • How will you define Inner London?
  • Which Outer London destinations will you choose, and why?
  • How will you define peak and off-peak travel?
  • What aspect will you measure? Time in minutes? Walking distance in meters? Cost in pounds? Number of transfers?
  • How will you compare values numerically?

Groupby recap

In this section, we will:

  1. Look at how to Group, Apply, and Combine data (with clear visuals)
  2. Explore the shape of a nested JSON
  3. See how json_normalize() turns that structure into a flat table

What happens when you use groupby() with a dataframe

Imagine you have this dataframe:

df = pd.DataFrame({
    "route_id": ["R1", "R2", "R3", "R4", "R5"],
    "time_band": ["peak", "off-peak", "peak", "off-peak", "peak"],
    "duration_min": [51, 43, 56, 40, 49],
})

df

If you group data by time_band, you produce this:

df.groupby("time_band")

BEFORE GROUPING

route_id  time_band  duration_min
R1        peak       51
R2        off-peak   43
R3        peak       56
R4        off-peak   40
R5        peak       49

AFTER GROUPING

group label  rows inside
peak         R1, R3, R5
off-peak     R2, R4

The same rows are now split by time_band.

Now I need to do something with each group 👉
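To see the split concretely, you can loop over the GroupBy object itself. A minimal sketch using the same df as above:

```python
import pandas as pd

# The same dataframe as above
df = pd.DataFrame({
    "route_id": ["R1", "R2", "R3", "R4", "R5"],
    "time_band": ["peak", "off-peak", "peak", "off-peak", "peak"],
    "duration_min": [51, 43, 56, 40, 49],
})

# Iterating over a GroupBy object yields (group label, sub-dataframe) pairs,
# with labels sorted alphabetically by default
for label, group in df.groupby("time_band"):
    print(label, "->", list(group["route_id"]))
# off-peak -> ['R2', 'R4']
# peak -> ['R1', 'R3', 'R5']
```

Nothing has been summarised yet: each group is just a smaller dataframe waiting for a function.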

What groupby().agg() does

We can summarize each group using built-in aggregation functions. agg() lets us do this for specific columns:

df.groupby("time_band").agg(
    mean_duration=("duration_min", "mean"),
    median_duration=("duration_min", "median"),
    n_routes=("route_id", "count"),
)

BEFORE summarizing

group     values
peak      51, 56, 49
off-peak  43, 40

AFTER summarizing

time_band  mean_duration  median_duration  n_routes
peak       52.0           51.0             3
off-peak   41.5           41.5             2

What groupby().apply() lets you do

Use groupby().apply() when you want to decide exactly how to calculate something for each group: it runs your own function on every group:

def spread_ratio(g):
    return g["duration_min"].max() / g["duration_min"].min()

df.groupby("time_band").apply(spread_ratio)

BEFORE summarizing

group     values
peak      51, 56, 49
off-peak  43, 40

AFTER groupby().apply()

time_band spread_ratio
peak 1.14
off-peak 1.08

Using pd.DataFrame.apply(..., axis=1)

Apply a custom function to each row by using pd.DataFrame.apply(..., axis=1). Here, you can combine columns into something new for every row. The axis=1 argument means: “run the function once per row”. (This snippet assumes df also has from_zone and to_zone columns.)

def route_signature(row):
    return f"{row['from_zone']} -> {row['to_zone']} ({row['time_band']})"

df["route_signature"] = df.apply(route_signature, axis=1)
df[["from_zone", "to_zone", "time_band", "route_signature"]].head()

BEFORE

from_zone  to_zone  time_band
Inner      Outer    peak
Inner      Outer    off-peak

AFTER row-wise apply

from_zone  to_zone  time_band  route_signature
Inner      Outer    peak       Inner -> Outer (peak)
Inner      Outer    off-peak   Inner -> Outer (off-peak)
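The snippet above relies on zone columns defined elsewhere; here is a self-contained sketch with hypothetical rows so you can run it end to end:

```python
import pandas as pd

# Hypothetical journeys table with the zone columns the slide assumes
df = pd.DataFrame({
    "from_zone": ["Inner", "Inner"],
    "to_zone": ["Outer", "Outer"],
    "time_band": ["peak", "off-peak"],
})

def route_signature(row):
    # row is a Series holding one row's values
    return f"{row['from_zone']} -> {row['to_zone']} ({row['time_band']})"

# axis=1 runs the function once per row
df["route_signature"] = df.apply(route_signature, axis=1)
print(df["route_signature"].tolist())
# ['Inner -> Outer (peak)', 'Inner -> Outer (off-peak)']
```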

Comparing groups with groupby().apply(...)

Here, we compare each route pair (from_zone, to_zone) by passing each group to our function:

grouped_signature = (
  df
  .groupby(["from_zone", "to_zone"])
  .apply(compare_peak_vs_off_peak)
)

grouped_signature.head()

BEFORE

from_zone  to_zone  time_band  time
Inner      Outer    peak       51
Inner      Outer    off-peak   43
Inner      Outer    peak       56
Inner      Outer    off-peak   40
...        ...      ...        ...

AFTER grouped apply

from_zone  to_zone   peak_mean  off_peak_mean  delta_min
Inner      Outer     52.3       41.5           10.8
Inner      Suburban  47.0       39.2           7.8

Note: The values here are made up so you can see that in the end, you produce just one row per group.
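The function compare_peak_vs_off_peak is not defined in the slides; one possible sketch, with hypothetical data for a single route pair, could look like this:

```python
import pandas as pd

# Hypothetical observations: two per time band for one route pair
df = pd.DataFrame({
    "from_zone": ["Inner"] * 4,
    "to_zone": ["Outer"] * 4,
    "time_band": ["peak", "off-peak", "peak", "off-peak"],
    "duration_min": [51, 43, 56, 40],
})

def compare_peak_vs_off_peak(g):
    # g is the sub-dataframe for one (from_zone, to_zone) pair
    peak_mean = g.loc[g["time_band"] == "peak", "duration_min"].mean()
    off_peak_mean = g.loc[g["time_band"] == "off-peak", "duration_min"].mean()
    # Returning a Series gives one labelled row per group
    return pd.Series({
        "peak_mean": peak_mean,
        "off_peak_mean": off_peak_mean,
        "delta_min": peak_mean - off_peak_mean,
    })

result = (
    df
    .groupby(["from_zone", "to_zone"])
    .apply(compare_peak_vs_off_peak)
)
print(result)
```

Because the function returns a pd.Series, each group collapses to exactly one row, which is the shape the AFTER table above illustrates.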

Nested JSON data

Here’s something new. Most APIs send nested data structures, but to analyse the data, you usually want a flat table.

NESTED JSON

{
  "requestContext": {
    "from": "WC2A 2AE",
    "to": "SE1 9JA"
  },
  "journeys": [
    {"duration": 42, "fare": {"totalCost": 2.90}},
    {"duration": 37, "fare": {"totalCost": 2.40}}
  ]
}

TABULAR TARGET

duration  fare_totalCost  from      to
42        2.90            WC2A 2AE  SE1 9JA
37        2.40            WC2A 2AE  SE1 9JA
39        2.60            WC2A 2AE  SW11 1AA
45        3.10            WC2A 2AE  E15 2EE

You probably faced this in ✍️ Mini Project 1 when unpacking nested dictionaries into columns.

Manual route first: what we would do without json_normalize()

This route works, but it is fragile when the nested structure changes.

journeys_df = pd.DataFrame(journey_response["journeys"])
journeys_df["from"] = journey_response["requestContext"]["from"]
journeys_df["to"] = journey_response["requestContext"]["to"]
journeys_df.head()

BEFORE manual extraction

response object
{"requestContext": {...}, "journeys": [...]}

AFTER manual extraction

duration  fare   from      to
42        {...}  WC2A 2AE  SE1 9JA
37        {...}  WC2A 2AE  SE1 9JA

We still have some nested structures in the fare column.
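One way to finish the manual route is to pull totalCost out of each fare dict, row by row. A sketch, using a hypothetical response shaped like the example above:

```python
import pandas as pd

# Hypothetical response mirroring the nested JSON shown earlier
journey_response = {
    "requestContext": {"from": "WC2A 2AE", "to": "SE1 9JA"},
    "journeys": [
        {"duration": 42, "fare": {"totalCost": 2.90}},
        {"duration": 37, "fare": {"totalCost": 2.40}},
    ],
}

journeys_df = pd.DataFrame(journey_response["journeys"])

# Extract the nested value from each fare dict, then drop the dict column
journeys_df["fare_totalCost"] = journeys_df["fare"].apply(lambda f: f["totalCost"])
journeys_df = journeys_df.drop(columns="fare")
```

This works, but every extra level of nesting needs another hand-written extraction step, which is exactly the fragility json_normalize() avoids.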

Pandas has a function that does this in one consistent pattern: json_normalize()

Use json_normalize() when your main task is flattening nested response structures into analysis-ready tables.

You could use it alone: pd.json_normalize(journey_response) or you could make use of its parameters:

  • record_path: the path to the nested list to expand (one row per list item)
  • meta: the paths to parent-level values to copy onto each row
  • sep: the separator used when building flattened column names
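Putting record_path and meta together on a hypothetical response shaped like the example above:

```python
import pandas as pd

# Hypothetical response mirroring the nested JSON shown earlier
journey_response = {
    "requestContext": {"from": "WC2A 2AE", "to": "SE1 9JA"},
    "journeys": [
        {"duration": 42, "fare": {"totalCost": 2.90}},
        {"duration": 37, "fare": {"totalCost": 2.40}},
    ],
}

flat = pd.json_normalize(
    journey_response,
    record_path="journeys",  # one row per journey
    meta=[["requestContext", "from"], ["requestContext", "to"]],  # keep route context
)
print(flat)
```

Each journey becomes a row, the nested fare dict is flattened to fare.totalCost, and the origin/destination are repeated onto every row.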

Parameter effect 1: record_path

# option A: keep journeys nested in one row
pd.json_normalize(journey_response)

# option B: one row per journey
pd.json_normalize(journey_response, record_path="journeys")

record_path choice  What happens                            Typical row unit
not set             nested journeys stay in one cell        one response object
"journeys"          journeys are expanded to multiple rows  one journey

Visual comparison: record_path output

WITHOUT record_path

requestContext.from  requestContext.to  journeys
WC2A 2AE             SE1 9JA            [{...}, {...}, ...]

All the data is here, but most of it is packed inside one cell.

WITH record_path="journeys"

duration fare.totalCost
42 2.90
37 2.40
39 2.60
45 3.10

We lost the data outside the journeys key but at least we have one row per journey!

Parameter effect 2: meta

Once rows are expanded from journeys, we often need route context copied onto each row. meta does that.

pd.json_normalize(
    journey_response,
    record_path="journeys",
    meta=[["requestContext", "from"], ["requestContext", "to"]],
)

meta choice                       Effect on flattened rows
not set                           journey rows lose route context columns
include origin/destination paths  each row keeps route context and journey values

Visual comparison: meta context preservation

WITH record_path
BUT WITHOUT meta

duration fare.totalCost
42 2.90
37 2.40
39 2.60
45 3.10

WITH record_path
AND WITH meta

duration  fare.totalCost  requestContext.from  requestContext.to
42        2.90            WC2A 2AE             SE1 9JA
37        2.40            WC2A 2AE             SE1 9JA
39        2.60            WC2A 2AE             SW11 1AA
45        3.10            WC2A 2AE             E15 2EE

Both pieces of information are preserved! Useful for a groupby() operation later if I have multiple routes in the same DataFrame.

Parameter effect 3: sep

This does not change the data values. It changes column naming grammar for readability and downstream coding.

df = pd.json_normalize(
    journey_response,
    record_path="journeys",
    sep="__",
)

sep value    Example output column
default "."  fare.totalCost
"__"         fare__totalCost

What’s Next?

  • In the lab, your class teacher will walk you through calling the TfL API with a working notebook, then guide you through flattening the nested JSON response with json_normalize().

  • By the end of lab, you should have your ✍️ Mini Project 2 repository cloned, at least one real API call in your own NB01, and the raw JSON saved to data/raw/.

  • Before W08 Lecture, have a look at the ONS Postcode Directory data dictionary in the ✍️ Mini Project 2 spec. W08 is about databases and merging data from multiple sources, and you will want to know what columns are available before you get there.