DS105 2025-2026 Winter Term

💻 Week 08 Lab

Reshaping and merging workshop: build your NB02

Author

Dr Jon Cardoso-Silva

Published

19 March 2026

🥅 Learning Goals

By the end of this lab, you should be able to: i) merge your TfL journey data with the ONS Postcode Directory using clean postcode keys, ii) reshape your merged data for analysis using melt and pivot_table, and iii) run a working NB02 pipeline in your MP2 repo that transforms raw data into a processed DataFrame ready for EDA.

This lab builds directly on the 🖥️ W08 Lecture, where you learned pd.concat(), .melt(), .pivot_table(), and pd.merge() using synthetic data. Today you practise those tools on shared exercises, then connect them to your own ✍️ Mini-Project 2 data.

📋 Preparation

  • Attend or watch the 🖥️ W08 Lecture
  • Have your MP2 repository open in Nuvolos. Your data/raw/ folder must contain london_postcodes-ons-postcodes-directory-feb22.csv; if it does not, follow the setup instructions on the ✍️ Mini-Project 2 page
  • Open the lab notebook from the Nuvolos shared folder (/files/week08/)
  • If you are not working on Nuvolos, download the lab notebook below:

🛣️ Lab Roadmap

Part | Activity Type | Focus | Time | Outcome
Part 1 | 🎯 ACTION POINTS | Path challenge: locate your ONS file | 15 min | You load the ONS CSV from your MP2 repo using a relative path
Part 2 | 🗣️ TEACHING MOMENT | Guided merge with dirty postcodes | 25 min | Everyone merges a practice DataFrame with the real ONS data
Part 3 | 🗣️ TEACHING MOMENT | Reshaping and visualisation | 20 min | Everyone produces a melt → strip+box+mean plot
Part 4 | 🎯 ACTION POINTS | Work on your MP2 | 30 min | Your NB01 and NB02 are done before W09 Lecture
Wrap-Up | 🗣️ TEACHING MOMENT | By-Thursday checklist | 10 min | Everyone knows what to finish before W09

💡 Note: Parts 1–3 use shared data and a shared lab notebook. Part 4 is in your own MP2 repository.

Part 1: Check if you truly understand relative paths (15 min)

The lab notebook lives at /files/week08/ on Nuvolos. Your MP2 repository lives at /files/<your-github-repo-folder>/. Your ONS CSV sits at:

/files/<your-github-repo-folder>/data/raw/london_postcodes-ons-postcodes-directory-feb22.csv

🎯 ACTION POINTS

  1. Open the lab notebook from /files/week08/

  2. Define ONS_PATH as a relative path from the notebook to your ONS CSV.

    That is, without copying the ONS file or hardcoding an absolute path! From inside the W08 NB02 notebook, you should be able to load the ONS CSV using pd.read_csv(ONS_PATH) and see the expected output.

  3. Run the check cell: you should see ONS Postcode Directory loaded: (326214, ...) or similar

  4. If you get a FileNotFoundError, check: is your repo folder name spelled correctly? Does your path have one ../ too many or too few?

Tip

Think back to 🖥️ W03. The ../ pattern moves you up one directory level. From /files/week08/, one ../ takes you to /files/; from there, you can navigate into your repo folder.
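
If you want to sanity-check a relative path before handing it to pd.read_csv(), you can resolve it by hand. The sketch below is illustrative only: my-mp2-repo is a hypothetical folder name standing in for <your-github-repo-folder>, and posixpath.normpath only manipulates the string, so it does not confirm that the file actually exists.

```python
import posixpath

# Folder that contains the lab notebook on Nuvolos.
NOTEBOOK_DIR = "/files/week08"

# Hypothetical repo folder name; replace it with your own.
REPO_FOLDER = "my-mp2-repo"

ONS_FILE = "london_postcodes-ons-postcodes-directory-feb22.csv"

# One ../ climbs from /files/week08/ to /files/, then we descend into the repo.
ONS_PATH = f"../{REPO_FOLDER}/data/raw/{ONS_FILE}"

# Join the notebook folder and the relative path, then collapse the ../ step.
resolved = posixpath.normpath(posixpath.join(NOTEBOOK_DIR, ONS_PATH))
print(resolved)
```

If the printed path does not match where your CSV really lives, the relative path is what needs fixing, not the file location.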

Part 2: The merge challenge (25 min)

Note to class teachers: Keep this synchronised. The goal is for everyone to produce the same df_merged. Do not hint at the solution; let students work through it. Walk through the instructor notebook solution at the end of this part.

Your lab notebook defines this practice DataFrame for you:

import pandas as pd

# Practice journeys: the postcodes are deliberately inconsistent
# (mixed case, stray leading and trailing spaces).
df_practice = pd.DataFrame({
    "destination_name": [
        "Barking", "Barking",
        "Richmond", "Richmond",
        "Old Marylebone Rd",
    ],
    "destination_postcode": [
        "ig11 0ab",
        " IG11 0AB ",
        "TW9 1dn ",
        " tw9 1dn",
        "NW8 9JW",
    ],
    "duration_min": [62, 58, 44, 47, 38],
    "time_band": ["peak", "off-peak", "peak", "off-peak", "peak"],
})

🎯 ACTION POINTS

Merge df_practice with df_ons_full (keeping only pcds, oslaua, lsoa11, imd from the ONS side) so that df_merged looks exactly like this:

   destination_name   destination_postcode  duration_min  time_band  pcds      oslaua     lsoa11     imd
0  Barking            IG11 0AB              62            peak       IG11 0AB  E09000002  E01000092  6348.0
1  Barking            IG11 0AB              58            off-peak   IG11 0AB  E09000002  E01000092  6348.0
2  Richmond           TW9 1DN               44            peak       TW9 1DN   E09000027  E01003876  28654.0
3  Richmond           TW9 1DN               47            off-peak   TW9 1DN   E09000027  E01003876  28654.0
4  Old Marylebone Rd  NW8 9JW               38            peak       NaN       NaN        NaN        NaN

💡 Note: The class teacher will walk through a solution at the end of this part.
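
One possible approach, sketched on a cut-down example rather than the real data: normalise the postcode key first, then do a left merge so that every journey row survives. Here df_journeys and df_ons_toy are made-up names, and df_ons_toy is a two-row stand-in for df_ons_full whose values are copied from the expected output in this part. This is not necessarily the instructor's solution.

```python
import pandas as pd

# Cut-down journeys with the same kind of messy postcodes as df_practice.
df_journeys = pd.DataFrame({
    "destination_postcode": ["ig11 0ab", " IG11 0AB ", "TW9 1dn ", "NW8 9JW"],
    "duration_min": [62, 58, 44, 38],
})

# Two-row stand-in for df_ons_full (assumption: the real table is far larger).
df_ons_toy = pd.DataFrame({
    "pcds": ["IG11 0AB", "TW9 1DN"],
    "oslaua": ["E09000002", "E09000027"],
    "lsoa11": ["E01000092", "E01003876"],
    "imd": [6348.0, 28654.0],
})

# Normalise the key: trim surrounding whitespace, then upper-case.
df_journeys["destination_postcode"] = (
    df_journeys["destination_postcode"].str.strip().str.upper()
)

# Left merge: keep every journey, attach ONS columns where the key matches.
df_merged = df_journeys.merge(
    df_ons_toy[["pcds", "oslaua", "lsoa11", "imd"]],
    how="left",
    left_on="destination_postcode",
    right_on="pcds",
)
print(df_merged)
```

Without the cleaning step, "ig11 0ab" and " IG11 0AB " would not equal "IG11 0AB", so those rows would come back with NaN in every ONS column.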

Part 3: Reshaping and visualisation (20 min)

Note to class teachers: Use df_all from the shared lecture data (data/tfl_journeys_all.csv). The groupby step is provided in the notebook; the challenge is the melt and the plotting. Before students start the plot, run the barplot discussion: “We have plot_df ready. I know we’ll cover visualisation properly next week, but Jon made one point about barplots today. What was it? What should we use instead when n is small?”

Starting from the 40-row synthetic dataset in the shared data/ folder, practise the full pipeline.

🎯 ACTION POINTS

  1. Run the provided groupby cell to produce summary (mean duration per destination × time band)

  2. Use .melt() on summary to create plot_df with columns destination, time_band, mean_duration_min

  3. Produce a strip + box + mean overlay plot of duration_min from df_all, split by destination and time band.

    Consider swapping the x axis with the hue. Does it make a difference if destination is on the x axis or if time band is on the x axis? Which one do you prefer for this dataset, and why?
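
As a rough illustration of steps 1 and 2, the sketch below builds a wide summary and melts it back to long form. It rests on assumptions: the rows are made up, the column names destination, time_band and duration_min are guessed from the target plot_df, and the groupby shown here may differ from the cell provided in the notebook.

```python
import pandas as pd

# Made-up stand-in for df_all; column names are assumptions.
df_all = pd.DataFrame({
    "destination": ["Barking"] * 4 + ["Richmond"] * 4,
    "time_band": ["peak", "peak", "off-peak", "off-peak"] * 2,
    "duration_min": [62, 60, 58, 56, 44, 46, 47, 45],
})

# Wide summary: one row per destination, one column per time band.
summary = (
    df_all.groupby(["destination", "time_band"])["duration_min"]
    .mean()
    .unstack("time_band")
)

# Long form again: one row per destination and time band.
plot_df = summary.reset_index().melt(
    id_vars="destination",
    var_name="time_band",
    value_name="mean_duration_min",
)
print(plot_df)
```

Here plot_df holds one mean per destination and time band, which suits the mean overlay; the strip and box layers need the individual duration_min rows from df_all.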



Tell the LSE about your experience in this course!
(6 out of 103 of you have completed the course survey)

While you settle into Part 4, could we ask a small but important favour? The LSE runs a course survey every term, and your feedback genuinely shapes how this module is taught next year. It takes about 3 minutes. 🐼

💡 Note: Please assess all the instructors you have interacted with (Jon counts as a teacher too!).

Last updated: 12 March 2026


Part 4: Work on your Mini-Project 2 (30 min)

Note to class teachers: Students now work in their own repos. Circulate and prompt each student to articulate their ONS decision out loud before they start coding. The common mistake is merging ONS mechanically without knowing why. Also check for ../data/raw/ paths in NB02.

You have loaded the ONS data, merged on a cleaned postcode key, and practised the reshape-to-plot pipeline. Before you open your own NB02, think through how the ONS dataset fits into your specific project.

A decision to make:

How could you use the ONS data in a way that demonstrates you’ve genuinely understood today’s material, rather than just run the code?

  • Would you use it to select destination postcodes in NB01, browsing oslaua, imd, or lsoa11 to justify where you looked?
  • Or would you use it after collection in NB02, merging ONS attributes into your journey data so you can group or filter by geography in NB03?
  • Could you do both, and does combining both actually strengthen your analysis or just add noise?
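
To make the two options concrete, here is a minimal sketch of each, using values copied from the Part 2 expected output. The names df_ons_toy, candidates and by_area are hypothetical, and which local authority or grouping makes sense depends entirely on your own project.

```python
import pandas as pd

# Stand-in for the ONS table: two rows copied from the Part 2 expected output.
df_ons_toy = pd.DataFrame({
    "pcds": ["IG11 0AB", "TW9 1DN"],
    "oslaua": ["E09000002", "E09000027"],
    "imd": [6348.0, 28654.0],
})

# NB01-style use: browse candidate postcodes inside one local authority code.
candidates = df_ons_toy.loc[df_ons_toy["oslaua"] == "E09000027", ["pcds", "imd"]]

# NB02-style use: journeys that already carry ONS attributes after a merge.
df_merged = pd.DataFrame({
    "duration_min": [62, 58, 44, 47],
    "time_band": ["peak", "off-peak", "peak", "off-peak"],
    "oslaua": ["E09000002", "E09000002", "E09000027", "E09000027"],
})

# Mean duration per local authority and time band, ready for NB03.
by_area = df_merged.pivot_table(
    index="oslaua", columns="time_band", values="duration_min", aggfunc="mean"
)
print(candidates)
print(by_area)
```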

There is no single right answer. The key is that your decision is documented in REPORT.md and visible in your code.

Wrap-Up & Next Steps (10 min)

Note to class teachers: Close by running through the by-Thursday checklist out loud. Ask two or three students to share their ONS decision and why. Students should leave knowing what is still outstanding, not just what they ran today.

Before You Leave:

  • You successfully loaded your ONS CSV using a relative path from a different folder
  • You can explain why the row at index 4 (NW8 9JW) in the merge exercise produced NaN and what that means for your own data
  • You know whether you are going to use ONS in NB01, NB02, or both, and you can say why
  • Your by-Thursday checklist is realistic given where you are right now
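
For the NaN question, one way to see which keys found no partner is the indicator option of a left merge, which adds a _merge column. The sketch below uses made-up frame names (journeys, ons_toy) and a two-row stand-in for the ONS table, so it shows the mechanism only; it says nothing about why a given postcode is absent from the real directory.

```python
import pandas as pd

# Already-cleaned journey keys (values taken from the merge exercise).
journeys = pd.DataFrame({
    "destination_postcode": ["IG11 0AB", "TW9 1DN", "NW8 9JW"],
    "duration_min": [62, 44, 38],
})

# Two-row stand-in for the ONS table (assumption: NW8 9JW is not in it).
ons_toy = pd.DataFrame({
    "pcds": ["IG11 0AB", "TW9 1DN"],
    "oslaua": ["E09000002", "E09000027"],
})

# indicator=True labels each row as "both", "left_only" or "right_only".
checked = journeys.merge(
    ons_toy,
    how="left",
    left_on="destination_postcode",
    right_on="pcds",
    indicator=True,
)

# Rows that exist only on the journey side are the ones filled with NaN.
unmatched = checked.loc[checked["_merge"] == "left_only", "destination_postcode"]
print(unmatched.tolist())
```

Running the same check on your own journey data tells you how many destinations would silently lose their ONS attributes.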

Looking Ahead:

  • Week 09 Lecture: EDA quality checks, mean vs median, correlation traps, and an introduction to closeread
  • Week 09 Lab: refine your EDA, work on REPORT.md, peer review
  • Week 10: ✍️ Mini-Project 2 deadline is Monday 23 March, 8 pm

🔗 Useful Resources

💻 Course Materials

🆘 Getting Help

  • Slack: Post questions to the #help channel
  • Office Hours: Book via StudentHub
  • Check staff availability on ✋ Contact Hours