✅ (Solutions) Lab 10
The solution
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)
# Adjust the path accordingly
con <- dbConnect(RSQLite::SQLite(), "../../data/imdb.db")
# Load the tables
title_basics_tbl <- tbl(con, "title_basics")
title_ratings_tbl <- tbl(con, "title_ratings")
title_episode_tbl <- tbl(con, "title_episode")Then this is THE solution:
(title_basics_tbl |> filter(titleType == 'tvSeries') |>
inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |>
select(parentTconst=tconst, tvSeries=originalTitle)) |>
inner_join(title_episode_tbl) |>
inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating)) |>
inner_join(title_basics_tbl |> select(tconst, primaryTitle)) |>
arrange(tvSeries, seasonNumber, episodeNumber)The explanation
Too complex? Here is a step-by-step explanation:
First join
First, you must spot that you have to get titleType == 'tvSeries' records from the title_basics table and that the info about the number of votes is in the other table, title_ratings. That is, you need an inner join between the two tables.
Eventually, you should reach:
top_tv_series <- (title_basics_tbl |> filter(titleType == 'tvSeries') |>
inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |>
select(parentTconst=tconst, tvSeries=originalTitle))which returns:
| parentTconst | tvSeries |
|---|---|
| tt0108778 | Friends |
| tt0903747 | Breaking Bad |
| tt0944947 | Game of Thrones |
| tt1520211 | The Walking Dead |
| tt4574334 | Stranger Things |
Note: I have renamed the columns because I need to refer to the TV shows as tvSeries in the next step. It’s ok if, at this stage, you didn’t really feel the need to rename columns. But later, you should come back and edit this join to rename the columns accordingly.
Then you need a second join
Now you need to collect the episodes, that exist in a separate table:
top_tv_series |> inner_join(title_episode_tbl)This ONLY works because I’ve renamed the columns in the previous step. This inner join will add three columns tconst, seasonNumber, and episodeNumber to the table.
| parentTconst | tvSeries | tconst | seasonNumber | episodeNumber |
|---|---|---|---|---|
| tt0108778 | Friends | tt0583431 | 7 | 8 |
| tt0108778 | Friends | tt0583432 | 10 | 9 |
| tt0108778 | Friends | tt0583433 | 10 | 17 |
| tt0108778 | Friends | tt0583435 | 8 | 1 |
| tt0108778 | Friends | tt0583436 | 10 | 1 |
| tt0108778 | Friends | tt0583437 | 5 | 1 |
| tt0108778 | Friends | tt0583438 | 6 | 1 |
| tt0108778 | Friends | tt0583439 | 2 | 12 |
| tt0108778 | Friends | tt0583440 | 5 | 3 |
| tt0108778 | Friends | tt0583441 | 6 | 15 |
Yet another join to get the ratings
Now, you need to get the ratings for each episode (another join by tconst):
previous_step |>
inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating))This will add two columns to the table: numVotes and averageRating.
The fourth and final join!
To get the title of each episode:
previous_step |>
inner_join(title_basics_tbl |> select(tconst, primaryTitle)) 📣 Found a neater solution? Post it on Slack!