π» Lab 10 β Data reshaping (joins and pivots)
Week 03 β Day 02 - Lab Roadmap (90 min)
π₯ Objectives
- Exercise your
join
andpivot
skills with a good challenge!
ποΈ Links to documentation
Packages weβre using today
π Lab Task(s)
Using the IMDb dataset as the data source (You should have received an e-mail with a link to the dataset.), you are tasked with re-creating a tibble summary_df
in a single pipe chain.
The dataset has the following features:
- it combines data from the
title_basics
,title_ratings
, andtitle_episodes
tables - it contains only the five TV series which have been rated by at least 1 million users
- it contains ALL episodes of the five TV series
- it contains the ratings and number of votes for each episode
- it contains the titles of each episode
- it is sorted by
parentTconst
,seasonNumber
, andepisodeNumber
Click on the button below if you want to see how your data frame should look like:
|> glimpse() summary_df
Rows: 580
Columns: 8
$ parentTconst <chr> "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747β¦
$ tvSeries <chr> "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaβ¦
$ tconst <chr> "tt0959621", "tt1054724", "tt1054725", "tt1054726", "tt1054727", "tt1054728", "tt1054729", "tt1232244β¦
$ seasonNumber <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4,β¦
$ episodeNumber <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12β¦
$ numVotes <int> 39948, 29132, 28224, 27198, 26737, 31457, 27252, 24733, 28405, 23801, 23910, 23246, 25540, 23379, 279β¦
$ averageRating <dbl> 9.0, 8.6, 8.7, 8.2, 8.3, 9.3, 8.8, 8.6, 9.3, 8.3, 8.2, 8.3, 8.8, 8.6, 9.2, 9.1, 8.4, 8.9, 9.3, 9.2, 8β¦
$ primaryTitle <chr> "Pilot", "Cat's in the Bag...", "...And the Bag's in the River", "Cancer Man", "Gray Matter", "Crazy β¦
These are the first few lines of the summary_df
tibble:
head(summary_df)
parentTconst | tvSeries | tconst | seasonNumber | episodeNumber | numVotes | averageRating | primaryTitle |
---|---|---|---|---|---|---|---|
tt0903747 | Breaking Bad | tt0959621 | 1 | 1 | 39948 | 9.0 | Pilot |
tt0903747 | Breaking Bad | tt1054724 | 1 | 2 | 29132 | 8.6 | Catβs in the Bagβ¦ |
tt0903747 | Breaking Bad | tt1054725 | 1 | 3 | 28224 | 8.7 | β¦And the Bagβs in the River |
tt0903747 | Breaking Bad | tt1054726 | 1 | 4 | 27198 | 8.2 | Cancer Man |
tt0903747 | Breaking Bad | tt1054727 | 1 | 5 | 26737 | 8.3 | Gray Matter |
tt0903747 | Breaking Bad | tt1054728 | 1 | 6 | 31457 | 9.3 | Crazy Handful of Nothinβ |
tt0903747 | Breaking Bad | tt1054729 | 1 | 7 | 27252 | 8.8 | A No-Rough-Stuff-Type Deal |
tt0903747 | Breaking Bad | tt1232244 | 2 | 1 | 24733 | 8.6 | Seven Thirty-Seven |
tt0903747 | Breaking Bad | tt1232249 | 2 | 2 | 28405 | 9.3 | Grilled |
tt0903747 | Breaking Bad | tt1232250 | 2 | 3 | 23801 | 8.3 | Bit by a Dead Bee |
These are the last few lines of the summary_df
tibble:
tail(summary)
parentTconst | tvSeries | tconst | seasonNumber | episodeNumber | numVotes | averageRating | primaryTitle |
---|---|---|---|---|---|---|---|
tt1520211 | The Walking Dead | tt12772908 | 11 | 15 | 4909 | 7.4 | Trust |
tt1520211 | The Walking Dead | tt12772916 | 11 | 16 | 5559 | 7.8 | Acts of God |
tt1520211 | The Walking Dead | tt13060666 | 11 | 17 | 5074 | 7.4 | Lockdown |
tt1520211 | The Walking Dead | tt13062124 | 11 | 18 | 4892 | 7.5 | A New Deal |
tt1520211 | The Walking Dead | tt13062126 | 11 | 19 | 4475 | 7.2 | Variant |
tt1520211 | The Walking Dead | tt13062128 | 11 | 20 | 4625 | 7.3 | Whatβs Been Lost |
tt1520211 | The Walking Dead | tt13062130 | 11 | 21 | 4514 | 7.2 | Outpost 22 |
tt1520211 | The Walking Dead | tt13062132 | 11 | 22 | 5048 | 7.9 | Faith |
tt1520211 | The Walking Dead | tt13062134 | 11 | 23 | 6657 | 8.5 | Family |
tt1520211 | The Walking Dead | tt13062136 | 11 | 24 | 11477 | 8.3 | Rest in Peace |
π‘ Tip: Take baby steps. Start by thinking of how you will collect the top 5 tv Series first, then how to collect the tconst
of their episodes, etc.