DS105 2025-2026 Winter Term

🖥️ Week 09 Lecture

Exploratory Data Analysis & Data Visualisation

Author

Dr Jon Cardoso-Silva

Published

19 March 2026

🥅 Learning Goals

By the end of this lecture, you should be able to:

  • explain why summary statistics alone can mislead (Anscombe’s Quartet, Datasaurus Dozen)
  • check data completeness and identify systematic missingness
  • choose between mean and median based on distribution shape and justify your choice
  • identify and investigate outliers using domain reasoning
  • recognise common data visualisation sins and apply evidence-based alternatives
  • use sns.FacetGrid to compare distributions across groups
  • distinguish correlation from causation in your own analysis
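
To make the first goal concrete, here is a minimal sketch that hard-codes Anscombe’s Quartet (the values published by Anscombe in 1973) and computes a few summary statistics for each of its four datasets. It uses only the Python standard library. The `quartet` dictionary and the `pearson_r` helper are names chosen for this illustration and are not taken from the lecture notebook.

```python
from statistics import mean, pstdev

# Anscombe's Quartet: datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation from population covariance and standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))


# One row of rounded summary statistics per dataset.
summary = {
    name: {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, stats)
```

All four datasets report the same rounded mean of x, mean of y and correlation. A scatter plot of each one tells a different story: a roughly linear cloud, a curve, a line with a single outlier, and a vertical stack with one high-leverage point. That is why plotting comes before trusting a summary table.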

📌 Update announced in this lecture: Mini-Project 2 is now due on Wednesday 1 April 2026 at 8 pm UK time. That gives you an extra week to apply today’s EDA and visualisation work to NB03 and REPORT.md. The W11 group pitch is now formative: you will still receive feedback on the day, but it will no longer count towards your grade.

Logistics

📍 Location: Thursday, 19 March 2026, 4-6 pm at CKK.LG.03

Today’s lecture covers two connected topics. We’ll talk about how to freely (but not really) explore your data using EDA techniques. Then we will cover how to communicate your insights effectively with data visualisation. The second part includes a live critique activity on Slack, so make sure you have it open and notifications on!

Note: This week’s Friday lab has three parts: open EDA work, group formation for the 📦 Group Project, and GitHub Pages setup. If you already have friends you’d like to work with, let them know. Groups should ideally have 3 people, with up to 4 if needed.

📋 Preparation

  • You went to the 🖥️ W08 Lecture and 💻 W08 Lab
  • Your NB01 and NB02 for ✍️ Mini-Project 2 should be complete (or very close), so you can focus on analysis and NB03 this week


Tell the LSE about your experience in this course!
(only 11 out of 103 of you, about 11%, have completed the course survey)

The LSE runs a course survey every term, and your feedback genuinely shapes how this module is taught next year. It takes about 3 minutes. 🐼

💡 Note: Please assess all the instructors you have interacted with (Jon counts as a teacher too!).

Last updated: 18 March 2026

🗣️ Lecture Overview

Hour 1: Exploratory Data Analysis
  • Why summary statistics alone can mislead: Anscombe’s Quartet and the Datasaurus Dozen
  • Introducing the IMDb dataset: multiple tables connected by shared keys (same pd.merge() logic as W08)
  • Checking data completeness and systematic missingness with .notna() and .groupby()
  • Understanding distributions: .describe(), sns.histplot(), skewness
  • Mean vs median: why they disagree, when each is appropriate, and what the gap tells you
  • Investigating outliers with domain reasoning
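
The Hour 1 steps can be sketched on a toy example. Everything in the block below is an assumption made for illustration: the two tiny tables are invented, the column names (`tconst`, `titleType`, `runtimeMinutes`, `averageRating`) are modelled on the public IMDb files and may not match the lecture notebook, and the 1.5 × IQR rule is one common convention for flagging values to investigate, not a rule stated in the lecture.

```python
import pandas as pd

# Toy stand-ins for two IMDb-style tables; all values are made up.
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6", "tt7", "tt8"],
    "titleType": ["movie"] * 5 + ["tvEpisode"] * 3,
    "runtimeMinutes": [90, 95, 100, 110, 600, 22, None, None],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt6"],
    "averageRating": [7.1, 6.4, 8.0, 7.5, 6.9],
})

# A left merge on the shared key keeps every title, rated or not.
df = titles.merge(ratings, on="tconst", how="left")

# Completeness: share of non-missing values per column, split by title type.
# Large gaps between groups hint at systematic (not random) missingness.
completeness = (
    df[["runtimeMinutes", "averageRating"]]
    .notna()
    .groupby(df["titleType"])
    .mean()
)
print(completeness)

# Mean vs median for movie runtimes: a large gap signals skew or outliers.
movies = df[df["titleType"] == "movie"]
runtime = movies["runtimeMinutes"]
mean_runtime, median_runtime = runtime.mean(), runtime.median()
print(f"mean={mean_runtime}, median={median_runtime}")

# Flag values outside 1.5 x IQR as candidates for investigation, not deletion.
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = movies[(runtime < q1 - 1.5 * iqr) | (runtime > q3 + 1.5 * iqr)]
print(flagged[["tconst", "runtimeMinutes"]])
```

In this toy data the single 600-minute title pulls the mean far above the median. Whether such a value is a data-entry error or a genuinely long title is a question for domain reasoning, so the flag is a prompt to look closer and not a reason to drop the row.
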
Hour 2: Data Visualisation & Communication
  • The Seven Sins of data visualisation (truncated axes, pie chart overload, bar plots hiding distributions, and more)
  • Weissgerber et al. (2015): the research behind “show the data, not just the summary”
  • Good dataviz examples: CJR, The Pudding, Visual Cinnamon, Closeread Prize winners
  • 🏆 Hall of Fame / 🗑️ Hall of Shame: live critique activity on Slack
  • Static insights vs interactive dashboards (and what your REPORT.md should do)
  • sns.FacetGrid for multi-group comparison
  • Correlation vs causation: what language to use in your REPORT.md
  • Closeread: an optional scrollytelling upgrade for REPORT.md

📓 Lecture Materials

Today we use facilitation slides plus one lecture notebook demonstrating the EDA workflow on IMDb data. The notebook will be shared on Nuvolos, and you can also download a zip bundle with all Week 09 files.

🎬 Facilitation Slides

🔖 Appendix

Looking Ahead

  • Friday W09 Lab: NB03 working session + group formation for the 📦 Group Project
  • Week 10 Lecture: Git collaboration for teams (git fetch, git pull, merge conflicts) and introduction to SQL with the same IMDb database
  • Monday W11: Group pitch presentations for the 📦 Group Project