✅ (Solutions) Lab 10

Author
Published

25 July 2023

The solution

library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

# Adjust the path accordingly
con <- dbConnect(RSQLite::SQLite(), "../../data/imdb.db")

# Load the tables
title_basics_tbl <- tbl(con, "title_basics")
title_ratings_tbl <- tbl(con, "title_ratings")
title_episode_tbl <- tbl(con, "title_episode")

Then this is THE solution:

(title_basics_tbl |> filter(titleType == 'tvSeries') |> 
    inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |> 
    select(parentTconst=tconst, tvSeries=originalTitle)) |> 
    inner_join(title_episode_tbl) |> 
    inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating)) |> 
    inner_join(title_basics_tbl |> select(tconst, primaryTitle)) |> 
    arrange(tvSeries, seasonNumber, episodeNumber)

The explanation

Too complex? Here is a step-by-step explanation:

First join

First, you must spot that you have to get titleType == 'tvSeries' records from the title_basics table and that the info about the number of votes is in the other table, title_ratings. That is, you need an inner join between the two tables.

Eventually, you should reach:

top_tv_series <- (title_basics_tbl |> filter(titleType == 'tvSeries') |> 
    inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |> 
    select(parentTconst=tconst, tvSeries=originalTitle))

which returns:

parentTconst tvSeries
tt0108778 Friends
tt0903747 Breaking Bad
tt0944947 Game of Thrones
tt1520211 The Walking Dead
tt4574334 Stranger Things

Note: I have renamed the columns because I need to refer to the TV shows as tvSeries in the next step. It’s ok if, at this stage, you didn’t really feel the need to rename columns. But later, you should come back and edit this join to rename the columns accordingly.

Then you need a second join

Now you need to collect the episodes, that exist in a separate table:

top_tv_series |>  inner_join(title_episode_tbl)

This ONLY works because I’ve renamed the columns in the previous step. This inner join will add three columns tconst, seasonNumber, and episodeNumber to the table.

parentTconst tvSeries tconst seasonNumber episodeNumber
tt0108778 Friends tt0583431 7 8
tt0108778 Friends tt0583432 10 9
tt0108778 Friends tt0583433 10 17
tt0108778 Friends tt0583435 8 1
tt0108778 Friends tt0583436 10 1
tt0108778 Friends tt0583437 5 1
tt0108778 Friends tt0583438 6 1
tt0108778 Friends tt0583439 2 12
tt0108778 Friends tt0583440 5 3
tt0108778 Friends tt0583441 6 15

Yet another join to get the ratings

Now, you need to get the ratings for each episode (another join by tconst):

previous_step |>
    inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating))

This will add two columns to the table: numVotes and averageRating.

The fourth and final join!

To get the title of each episode:

previous_step |>
inner_join(title_basics_tbl |> select(tconst, primaryTitle)) 

📣 Found a neater solution? Post it on Slack!