DS105A – Data for Data Science
🗓️ 27 Nov 2025
These activities prepare you for your ✍️ Mini-Project 2 exploration week.
Your ✍️ Mini-Project 2 is an exploratory study.
In our context, Exploratory Data Analysis (EDA) means:
📋 Key point: You do not need to prove anything in your Mini-Project 2. You are just trying to find interesting patterns and relationships in the data.
Therefore…
If you don’t even know where to start, pick one of these strategies:
| Strategy | Description | Examples of narrower questions |
|---|---|---|
| Localised comparisons | Compare journeys for postcodes within a single area (same borough/LSOA/MSOA/OA) | “Does everyone who lives in Newham have the same level of transport connectivity?”<br>“Does it matter if you live north of Victoria Park or south of it?” |
| Cross-borough comparisons | Compare journeys for postcodes across neighbouring (or geographically opposite) boroughs | “How are people in Tower Hamlets better connected to the rest of London than people in Barking and Dagenham?” |
| Socio-economic comparisons | Pick postcodes at opposite ends of the deprivation spectrum according to the Index of Multiple Deprivation (IMD) | “Are people in the most deprived areas of London more likely to have poor transport connectivity than people in the least deprived areas?” |
| Time-based comparisons | Compare journeys for postcodes at different times of the day/week/month/year | “How does transport connectivity change throughout the day?”<br>“How does transport connectivity change throughout the week?”<br>“How does transport connectivity vary between peak and off-peak hours?” |

This is what we will do in the first part of the lecture.
Structure:
You will now listen to your colleagues explain the methodology they are using (or will use) for their ✍️ Mini-Project 2.
I want you to help them by evaluating their methodology along two dimensions: Impact and Feasibility.
You will then use the Impact/Feasibility Framework (next slide) to help them refine their methodology. You will use a physical card to do so.
A speech script for your pitch to your colleagues:
FEEDBACK CARD
Impact Score (1-5): _______
Feasibility Score (1-5): ______
A physical card for the peer evaluation will be provided during the lecture.
Poll 1: Where is your methodology now?
Your colleagues gave you feedback on your methodology, now let’s document this using Mentimeter.
Submit the average Impact and Feasibility scores from the feedback you received.
Mentimeter access code will be provided during the lecture.
The Impact/Feasibility Framework:

|  | LOW FEASIBILITY | HIGH FEASIBILITY |
|---|---|---|
| **HIGH IMPACT** | **REFINE ZONE** (High Impact + Low Feasibility): Valuable direction but you’re being a bit too ambitious. | **PURSUE ZONE** (High Impact + High Feasibility): Strong approach that’s achievable. |
| **LOW IMPACT** | **AVOID ZONE** (Low Impact + Low Feasibility): Weak approach with high difficulty. | **SAFE ZONE** (Low Impact + High Feasibility): Easy to execute, just not super original. This is perfectly fine! |
Now, take some time alone to reflect on your methodology and the feedback you received. You will use the SMART-C Criteria Checklist to help you do this (next slide).
🕰️ 15 minutes (Jon will walk around helping you reflect on your methodology)
| ✓ Criterion | Check |
|---|---|
| ✓ SPECIFIC | - How are you defining ‘poor transport connectivity’? - What comparison strategy are you going for? |
| ✓ MEASURABLE | - How are you quantifying ‘poor transport connectivity’? - Does the TfL API (and/or the ONS dataset) allow you to do this? |
| ✓ ACHIEVABLE | - How much free time do you have to complete this? - Can you implement this easily in NB03? Or would you need to re-do NB01 or NB02 to do this? |
| ✓ RELEVANT | - Does it answer the research question? (you don’t need to cover ALL of London, you can focus on a specific area) |
| ✓ TESTABLE | - How will you play “devil’s advocate” and test your methodology? (how would you prove beyond a shadow of a doubt that your insights are valid?) |
| ✓ CLEAR | - Would a “no-coder” (someone who did not take DS105) understand what you did if they read the REPORT.md? (how clear is your methodology and your writing about the insights?) |
A speech script for your pitch to your colleagues:
FEEDBACK CARD
Impact Score (1-5): _______
Feasibility Score (1-5): ______
A physical card for the peer evaluation will be provided during the lecture.
Poll 2: Where is your methodology after refinement?
Submit your refined methodology position after applying SMART-C criteria.
Compare to Round 1 distribution.
Discussion: What patterns do we see? How did methodologies evolve?

Important considerations in Exploratory Data Analysis (EDA).
Result:
tconst 100.000000
title_type 100.000000
primary_title 99.999809
original_title 99.999809
is_adult 100.000000
start_year 88.028503
end_year 1.238095
runtime_minutes 35.460020
genres 95.623222
dtype: float64
That is:
- tconst, title_type, and is_adult are fully complete (100%)
- end_year is almost entirely empty (only ~1.2% of rows have a value)
- and start_year is missing for roughly 12% of titles?!

This should prompt you to do further checks, say:
That is, to investigate further: which ones are empty? Why does it seem like they are empty?
Is this a data collection failure? → try to fix it → document if unable to
Is it just how the data is structured? → acknowledge how it might impact your analysis
Don’t try to ‘impute’ missing data. It’s too advanced for this course.
You never know, sometimes the existence of missing data IS the insight. To check, see if the missingness is systematic (it appears more often for certain categories):
| title_type | start_year | end_year | runtime | genres | total |
|---|---|---|---|---|---|
| tvPilot | 100.00 % | 0.00 % | 0.00 % | 0.00 % | 1 |
| tvShort | 99.07 % | 0.00 % | 87.65 % | 100.00 % | 10,810 |
| videoGame | 97.43 % | 0.00 % | 1.06 % | 85.28 % | 45,838 |
| tvSpecial | 99.26 % | 0.00 % | 48.10 % | 87.25 % | 55,604 |
| tvMiniSeries | 93.26 % | 57.00 % | 35.06 % | 95.24 % | 66,293 |
| tvMovie | 97.20 % | 0.00 % | 68.70 % | 91.24 % | 152,979 |
| tvSeries | 91.65 % | 38.40 % | 37.36 % | 92.06 % | 290,570 |
| video | 99.42 % | 0.00 % | 68.01 % | 97.35 % | 318,364 |
| movie | 85.22 % | 0.00 % | 63.15 % | 89.41 % | 731,704 |
| short | 96.10 % | 0.00 % | 64.11 % | 100.00 % | 1,094,537 |
| tvEpisode | 86.48 % | 0.00 % | 28.22 % | 95.82 % | 9,296,511 |
The following slides contain the code to reproduce the table above.
Step 1: Define a helper function that computes completeness percentages per group:
def count_completeness(group):
total = len(group)
return pd.Series({
'total': total,
'start_year': group['start_year'].notna().sum() / total * 100,
'end_year': group['end_year'].notna().sum() / total * 100,
'runtime': group['runtime_minutes'].notna().sum() / total * 100,
'genres': group['genres'].notna().sum() / total * 100
    })

Step 2: Apply it to each title_type group:
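A minimal sketch of Step 2, assuming the table has been loaded as df_title_basics (the name used in the later slides):

plot_df = (
    df_title_basics
    .groupby('title_type')
    .apply(count_completeness)
)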
Step 3: Style it with bars and format percentages:
(
plot_df.sort_values('total')
.style
.bar(vmin=0, vmax=100, height=100, width=100,
props="border: 1px solid #212121;",
subset=['start_year', 'end_year', 'runtime', 'genres'])
.format('{:,.2f} %', subset=['start_year', 'end_year', 'runtime', 'genres'])
.format('{:,.0f}', subset=['total'])
.set_caption('Table 1. How complete is table title_basics?<br>(breakdown per field and title_type)')
)

In SQL, COUNT(column) counts non-NULL values automatically, so you can compute the same completeness breakdown directly in the query:
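A sketch of that query (the column aliases are my choice, picked to match the pandas code below):

SELECT
    title_type,
    COUNT(*)               AS total,
    COUNT(start_year)      AS has_start_year,
    COUNT(end_year)        AS has_end_year,
    COUNT(runtime_minutes) AS has_runtime,
    COUNT(genres)          AS has_genres
FROM title_basics
GROUP BY title_type;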
Then compute percentages and style in pandas:
raw_df = pd.read_sql(query, conn)
plot_df = raw_df.assign(
start_year=lambda df: df['has_start_year'] / df['total'] * 100,
end_year=lambda df: df['has_end_year'] / df['total'] * 100,
runtime=lambda df: df['has_runtime'] / df['total'] * 100,
genres=lambda df: df['has_genres'] / df['total'] * 100
)[['title_type', 'start_year', 'end_year', 'runtime', 'genres', 'total']]

(Then the styler is the same as in the previous slide.)
I have this table of all directors listed on IMDb:
- total_movies: how many movies they directed overall in their entire career
- top1000_movies: how many of those fall in the top 1000 most popular and highly-rated movies

How could I make sense of the distribution of these two columns?
plot_df.head(10)
| primary_name | total_movies | top1000_movies |
|---|---|---|
| Akira Kurosawa | 31 | 13 |
| Ingmar Bergman | 41 | 10 |
| Alfred Hitchcock | 57 | 9 |
| Martin Scorsese | 53 | 9 |
| Quentin Tarantino | 15 | 9 |
| Christopher Nolan | 14 | 9 |
| Stanley Kubrick | 13 | 9 |
| Ertem Egilmez | 44 | 8 |
| Steven Spielberg | 37 | 8 |
| Billy Wilder | 26 | 8 |
You can use describe():
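For example (assuming the per-director table is called df_top_directors, as built in the later slides):

df_top_directors['total_movies'].describe()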
Which produces:
count 266466.000000
mean 2.662099
std 5.968662
min 1.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 438.000000
Name: total_movies, dtype: float64
How to read this output?
- count is the total number of directors
- mean is the statistical mean number of movies directed
- std is the standard deviation of the number of movies directed
- min is the minimum number of movies directed
- 25% is the 25th percentile of the number of movies directed
- 50% is the median number of movies directed
- 75% is the 75th percentile of the number of movies directed
- max is the maximum number of movies directed

Histogram approach:
A histogram allows you to see the full distribution of the data.
In this case, look at how crazy skewed the distribution is!
By the way, you don’t need a title here…
- While exploring in NB03, you don’t need to have a polished title yet.
- Once a plot makes it into REPORT.md, then add a narrative title to your plot.

If you further process the data carefully, you can get a more informative histogram 👉
Assuming you have loaded each of the IMDB tables into pandas DataFrames, this is how you would create the df_top_directors DataFrame.
Step 1: Identify the top-rated popular movies:
We need to do several merges because the information we need is spread across multiple tables.
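A sketch of Step 1 under these assumptions: the ratings table is loaded as df_title_ratings, and “top” means the 1,000 highest-rated movies with more than 10,000 votes (mirroring the SQL version a few slides ahead):

top_movies = (
    df_title_basics[df_title_basics['title_type'] == 'movie']
    .merge(df_title_ratings, on='tconst')            # brings in average_rating and num_votes
    .query('num_votes > 10000')                      # keep only popular movies
    .sort_values(['average_rating', 'num_votes'], ascending=False)
    .head(1000)
)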
Step 2: Count total movies and top movies per director:
director_movies = (
df_title_principals[df_title_principals['category'] == 'director']
.merge(df_title_basics[df_title_basics['title_type'] == 'movie'], on='tconst')
.merge(df_name_basics[['nconst', 'primary_name']], on='nconst')
)
director_movies['is_top'] = director_movies['tconst'].isin(top_movies['tconst'])
df_top_directors = (
director_movies
.groupby(['nconst', 'primary_name'])
# .agg() is like .apply() but for common aggregation functions
# the column 'total_movies' is the count of tconst values
# the column 'top1000_movies' is the sum of is_top values (True/False)
.agg(
total_movies=('tconst', 'count'),
top1000_movies=('is_top', 'sum')
)
.reset_index()
.sort_values(['top1000_movies', 'total_movies'], ascending=False)
)

Here is the big SQL query to achieve the same result as the previous slide.
SELECT
nb.nconst,
nb.primary_name,
COUNT(*) AS total_movies,
COUNT(CASE WHEN top.tconst IS NOT NULL THEN 1 END) AS top1000_movies
FROM title_principals AS tp
JOIN name_basics AS nb
ON tp.nconst = nb.nconst
JOIN title_basics AS tb
ON tp.tconst = tb.tconst
LEFT JOIN (
SELECT tb2.tconst
FROM title_basics AS tb2
JOIN title_ratings AS tr2
ON tb2.tconst = tr2.tconst
WHERE tb2.title_type = 'movie'
AND tr2.num_votes > 10000
ORDER BY tr2.average_rating DESC, tr2.num_votes DESC
LIMIT 1000
) AS top
ON tb.tconst = top.tconst
WHERE tp.category = 'director'
AND tb.title_type = 'movie'
GROUP BY nb.nconst, nb.primary_name
ORDER BY top1000_movies DESC, total_movies DESC;

which I then load into pandas like this:
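A minimal sketch, assuming the query above is stored as a string called query and conn is an open connection to the IMDb database (the same pattern used earlier):

df_top_directors = pd.read_sql(query, conn)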
The following slides contain the code to replicate the histogram plots.
First, a word about formatting the y-axis labels: In these examples, I used a feature from matplotlib called FuncFormatter to format the y-axis labels with the K suffix for thousands (e.g. 1000 -> 1K). Read the documentation for more details.
Then later in the code, you can use it like this:
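A sketch of that call (assuming ax is the Axes object of your plot and mtick is the usual alias for matplotlib.ticker):

import matplotlib.ticker as mtick

# Turn 1000 into '1K', 2000 into '2K', and so on
ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f"{x/1000:.0f}K"))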
Go to the next slide to see the code for the first plot.
Here is the code to replicate the first histogram plot.
We use the sns.histplot() function to create the histogram.
fig, ax = plt.subplots(figsize=(7, 4)) # Fix figure size here
sns.histplot(
data=df_top_directors,
x='total_movies',
binwidth=1, # How to group the data into bins
color='#2d8659',
ax=ax
)
# I manually set the limits to 0-500,
# but a robust approach would be to use `ax.set_xlim(0, df['total_movies'].max())`
ax.set_xlim(0, 500)

# Format the y-axis labels with the `K` suffix for thousands (e.g. 1000 -> 1K)
ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f"{x/1000:.0f}K"))
ax.set_xlabel('Total movies directed', fontsize=18)
ax.set_ylabel('Number of directors', fontsize=18)
ax.tick_params(axis='both', labelsize=15)
# Wrap up and save
fig.tight_layout()
fig.savefig('./figures/directors-total-movies-hist.svg', format='svg')

For the second plot, I did the aggregation “manually” by creating two DataFrames: one for the first 10 bins and one for the “10+” bin.
count_df = (
df_top_directors
.groupby('total_movies')
.size()
.reset_index(name='director_count')
.sort_values('total_movies')
)
head_df = (
count_df[count_df['total_movies'] <= 10]
.assign(total_movies=lambda x: x['total_movies'].astype(int).astype(str))
)
tail_df = (
count_df[count_df['total_movies'] > 10]
.assign(total_movies=lambda x: '10+')
.assign(director_count=lambda x: x['director_count'].sum())
)

Then we can concatenate the two DataFrames and plot the result:
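A sketch of the concatenation (plot_df is the name used on the next slide). Since every row of tail_df carries the same summed ‘10+’ count, keeping a single row is enough:

plot_df = pd.concat([head_df, tail_df.head(1)], ignore_index=True)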
Then, it’s a matter of calling sns.barplot() to create the bar chart. I like them horizontal because it’s easier to read the labels.
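A sketch of that call (the figure setup and the column-to-axis mapping are my assumptions, based on the description of a horizontal bar chart):

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(
    data=plot_df,
    y='total_movies',      # categories: '1' to '10', then '10+'
    x='director_count',    # number of directors per bin
    color='#2d8659',
    ax=ax
)
ax.set_xlabel('Number of directors', fontsize=18)
ax.set_ylabel('Total movies directed', fontsize=18)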
But I also add a horizontal line at the mean and median values to help guide the eye.
# Mean and median number of movies per director
# (computed from the per-director table, not from the binned counts)
mean_val = df_top_directors['total_movies'].mean()
median_val = df_top_directors['total_movies'].median()
ax.axhline(mean_val - 1, # I subtract 1 to align with the bars
color='#e4002b',
linestyle='--',
linewidth=2,
label=f"Mean ≈ {mean_val:.1f}")
ax.axhline(median_val - 1, # I subtract 1 to align with the bars
color='#47315E',
linestyle='--',
linewidth=2,
label=f"Median = {median_val:.0f}")
ax.legend(frameon=False, loc='lower right')

Now, let’s look at a distribution that is not skewed.
I selected only movies (title_type = ‘movie’) that have more than 10,000 votes.
Here are the top 10 highest-rated popular movies of all time:
| primary_title | year | average_rating | num_votes |
|---|---|---|---|
| The Shawshank Redemption | 1994 | 9.30 | 3120155 |
| The Godfather | 1972 | 9.20 | 2175958 |
| The Chaos Class | 1975 | 9.20 | 45308 |
| Attack on Titan the Movie: The Last Attack | 2024 | 9.20 | 21568 |
| Ramayana: The Legend of Prince Rama | 1993 | 9.10 | 17401 |
| The Dark Knight | 2008 | 9.10 | 3095865 |
| The Lord of the Rings: The Return of the King | 2003 | 9.00 | 2119280 |
| Schindler’s List | 1993 | 9.00 | 1555434 |
| 12 Angry Men | 1957 | 9.00 | 956229 |
| The Godfather Part II | 1974 | 9.00 | 1462039 |
Assuming you have loaded the IMDB tables into pandas DataFrames, here’s how to build the df_top_movies dataset.
Filter for popular movies (more than 10,000 votes):
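A sketch under the same naming assumptions as before (df_title_basics and df_title_ratings):

df_top_movies = (
    df_title_basics[df_title_basics['title_type'] == 'movie']
    .merge(df_title_ratings, on='tconst')
    .query('num_votes > 10000')
)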
Here is the SQL query to achieve the same result as the previous slide.
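A sketch of that query (the SELECT list is my choice; it keeps the columns shown in the table above):

SELECT
    tb.primary_title,
    tb.start_year AS year,
    tr.average_rating,
    tr.num_votes
FROM title_basics AS tb
JOIN title_ratings AS tr
    ON tb.tconst = tr.tconst
WHERE tb.title_type = 'movie'
  AND tr.num_votes > 10000;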
which I then load into pandas like this:
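Again assuming the query string is stored in query and conn is an open connection:

df_top_movies = pd.read_sql(query, conn)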
Here’s how to create the box plot with quartile annotations.
We use sns.boxplot() and then add vertical lines for Q1, median, and Q3:
fig, ax = plt.subplots(figsize=(9, 2.6))
sns.boxplot(
data=df_top_movies,
x='average_rating',
color='#6BB8B7',
ax=ax
)
ax.set_xlabel('Average rating (popular films)', fontsize=18)
ax.set_ylabel('')
ax.tick_params(axis='both', labelsize=16)
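# (Sketch, not from the original slide) The quartile lines mentioned above:
# vertical markers at Q1, the median, and Q3 of the ratings
for q in df_top_movies['average_rating'].quantile([0.25, 0.5, 0.75]):
    ax.axvline(q, color='#212121', linestyle='--', linewidth=1)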
fig.tight_layout()
fig.savefig('./figures/top-movies-rating-boxplot.svg', format='svg')

For the histogram, we use sns.histplot() and add mean and ±1 standard deviation markers.
fig, ax = plt.subplots(figsize=(9, 4))
sns.histplot(
data=df_top_movies,
x='average_rating',
binwidth=0.2,
color='#2d8659',
ax=ax
)
ax.set_xlabel('Average rating (popular films)', fontsize=18)
ax.set_ylabel('Number of movies', fontsize=18)
ax.tick_params(axis='both', labelsize=15)
ax.set_xlim(0, 10)
ax.xaxis.set_major_locator(mtick.MultipleLocator(1))
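# (Sketch, not from the original slide) The mean and ±1 standard deviation
# markers mentioned in the slide text:
mean_rating = df_top_movies['average_rating'].mean()
std_rating = df_top_movies['average_rating'].std()
ax.axvline(mean_rating, color='#e4002b', linestyle='--', linewidth=2)
ax.axvline(mean_rating - std_rating, color='#47315E', linestyle=':', linewidth=1.5)
ax.axvline(mean_rating + std_rating, color='#47315E', linestyle=':', linewidth=1.5)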
# Clear up the plot and save
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()
fig.savefig('./figures/top-movies-rating-hist.svg', format='svg')

Once again, you can use describe():
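For example:

df_top_movies['average_rating'].describe()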
Which produces:
count 12118.000000
mean 6.584832
std 1.016968
min 1.000000
25% 6.000000
50% 6.700000
75% 7.300000
max 9.300000
Name: average_rating, dtype: float64
Alternatively, you can view this as a boxplot:
The way I like to describe this is:
Because the distribution is not skewed, the data is roughly symmetric, similar to a normal distribution.
When the data looks like a bell curve, the mean and the median are very similar and it makes sense to use mean and standard deviation to describe the distribution.
The way I like to describe this, using the mean and standard deviation as estimators: popular movies are rated around 6.6 on average, and most ratings fall within one standard deviation of that (roughly 5.6 to 7.6).
Just who are these highly prolific directors?
|  | nconst | primary_name | total_movies | top1000_movies |
|---|---|---|---|---|
| 710 | nm0644554 | Kinya Ogawa | 438 | 0 |
| 711 | nm0183659 | Gérard Courant | 402 | 0 |
Do a bit of research to find out who these directors are and to confirm that this is not an error in your data collection process but rather a reflection of very unique individual careers.
What are these exceptionally highly rated movies?
| primary_title | year | avg_rating | num_votes |
|---|---|---|---|
| The Shawshank Redemption | 1994 | 9.30 | 3120155 |
| The Godfather | 1972 | 9.20 | 2175958 |
| The Chaos Class | 1975 | 9.20 | 45308 |
| Attack on Titan the Movie: The Last Attack | 2024 | 9.20 | 21568 |
| Ramayana: The Legend of Prince Rama | 1993 | 9.10 | 17401 |
| The Dark Knight | 2008 | 9.10 | 3095865 |
Is this an insight? Well, it depends on what story you are trying to tell.
Repeat after me: “correlation does not imply causation” (see fun examples here)
In this lecture, we have covered:
⌛ Deadline: Week 10, Wednesday 3 December 2025, 8pm UK time
🆘 Support Sessions: Drop-in sessions next week
LSE DS105A (2025/26)