✅ (Solutions) Lab 10
The solution
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)
# Adjust the path accordingly
<- dbConnect(RSQLite::SQLite(), "../../data/imdb.db")
con
# Load the tables
<- tbl(con, "title_basics")
title_basics_tbl <- tbl(con, "title_ratings")
title_ratings_tbl <- tbl(con, "title_episode") title_episode_tbl
Then this is THE solution:
|> filter(titleType == 'tvSeries') |>
(title_basics_tbl inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |>
select(parentTconst=tconst, tvSeries=originalTitle)) |>
inner_join(title_episode_tbl) |>
inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating)) |>
inner_join(title_basics_tbl |> select(tconst, primaryTitle)) |>
arrange(tvSeries, seasonNumber, episodeNumber)
The explanation
Too complex? Here is a step-by-step explanation:
First join
First, you must spot that you have to get titleType == 'tvSeries'
records from the title_basics table and that the info about the number of votes is in the other table, title_ratings
. That is, you need an inner join between the two tables.
Eventually, you should reach:
<- (title_basics_tbl |> filter(titleType == 'tvSeries') |>
top_tv_series inner_join(title_ratings_tbl |> filter(numVotes >= 1e6)) |>
select(parentTconst=tconst, tvSeries=originalTitle))
which returns:
parentTconst | tvSeries |
---|---|
tt0108778 | Friends |
tt0903747 | Breaking Bad |
tt0944947 | Game of Thrones |
tt1520211 | The Walking Dead |
tt4574334 | Stranger Things |
Note: I have renamed the columns because I need to refer to the TV shows as tvSeries
in the next step. It’s ok if, at this stage, you didn’t really feel the need to rename columns. But later, you should come back and edit this join to rename the columns accordingly.
Then you need a second join
Now you need to collect the episodes, that exist in a separate table:
|> inner_join(title_episode_tbl) top_tv_series
This ONLY works because I’ve renamed the columns in the previous step. This inner join will add three columns tconst
, seasonNumber
, and episodeNumber
to the table.
parentTconst | tvSeries | tconst | seasonNumber | episodeNumber |
---|---|---|---|---|
tt0108778 | Friends | tt0583431 | 7 | 8 |
tt0108778 | Friends | tt0583432 | 10 | 9 |
tt0108778 | Friends | tt0583433 | 10 | 17 |
tt0108778 | Friends | tt0583435 | 8 | 1 |
tt0108778 | Friends | tt0583436 | 10 | 1 |
tt0108778 | Friends | tt0583437 | 5 | 1 |
tt0108778 | Friends | tt0583438 | 6 | 1 |
tt0108778 | Friends | tt0583439 | 2 | 12 |
tt0108778 | Friends | tt0583440 | 5 | 3 |
tt0108778 | Friends | tt0583441 | 6 | 15 |
Yet another join to get the ratings
Now, you need to get the ratings for each episode (another join by tconst
):
|>
previous_step inner_join(title_ratings_tbl |> select(tconst, numVotes, averageRating))
This will add two columns to the table: numVotes
and averageRating
.
The fourth and final join!
To get the title of each episode:
|>
previous_step inner_join(title_basics_tbl |> select(tconst, primaryTitle))
📣 Found a neater solution? Post it on Slack!