πŸ’» Lab 10 – Data reshaping (joins and pivots)

Week 03 – Day 02 - Lab Roadmap (90 min)

Author
Published

25 July 2001

πŸ₯… Objectives

  • Exercise your join and pivot skills with a good challenge!

πŸ“‹ Lab Task(s)

Using the IMDb dataset as the data source (You should have received an e-mail with a link to the dataset.), you are tasked with re-creating a tibble summary_df in a single pipe chain.

The dataset has the following features:

  • it combines data from the title_basics, title_ratings, and title_episodes tables
  • it contains only the five TV series which have been rated by at least 1 million users
  • it contains ALL episodes of the five TV series
  • it contains the ratings and number of votes for each episode
  • it contains the titles of each episode
  • it is sorted by parentTconst, seasonNumber, and episodeNumber

Click on the button below if you want to see how your data frame should look like:

summary_df |> glimpse()
Rows: 580
Columns: 8
$ parentTconst  <chr> "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747", "tt0903747…
$ tvSeries      <chr> "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Breaking Bad", "Brea…
$ tconst        <chr> "tt0959621", "tt1054724", "tt1054725", "tt1054726", "tt1054727", "tt1054728", "tt1054729", "tt1232244…
$ seasonNumber  <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4,…
$ episodeNumber <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ numVotes      <int> 39948, 29132, 28224, 27198, 26737, 31457, 27252, 24733, 28405, 23801, 23910, 23246, 25540, 23379, 279…
$ averageRating <dbl> 9.0, 8.6, 8.7, 8.2, 8.3, 9.3, 8.8, 8.6, 9.3, 8.3, 8.2, 8.3, 8.8, 8.6, 9.2, 9.1, 8.4, 8.9, 9.3, 9.2, 8…
$ primaryTitle  <chr> "Pilot", "Cat's in the Bag...", "...And the Bag's in the River", "Cancer Man", "Gray Matter", "Crazy …

These are the first few lines of the summary_df tibble:

head(summary_df)
parentTconst tvSeries tconst seasonNumber episodeNumber numVotes averageRating primaryTitle
tt0903747 Breaking Bad tt0959621 1 1 39948 9.0 Pilot
tt0903747 Breaking Bad tt1054724 1 2 29132 8.6 Cat’s in the Bag…
tt0903747 Breaking Bad tt1054725 1 3 28224 8.7 …And the Bag’s in the River
tt0903747 Breaking Bad tt1054726 1 4 27198 8.2 Cancer Man
tt0903747 Breaking Bad tt1054727 1 5 26737 8.3 Gray Matter
tt0903747 Breaking Bad tt1054728 1 6 31457 9.3 Crazy Handful of Nothin’
tt0903747 Breaking Bad tt1054729 1 7 27252 8.8 A No-Rough-Stuff-Type Deal
tt0903747 Breaking Bad tt1232244 2 1 24733 8.6 Seven Thirty-Seven
tt0903747 Breaking Bad tt1232249 2 2 28405 9.3 Grilled
tt0903747 Breaking Bad tt1232250 2 3 23801 8.3 Bit by a Dead Bee

These are the last few lines of the summary_df tibble:

tail(summary)
parentTconst tvSeries tconst seasonNumber episodeNumber numVotes averageRating primaryTitle
tt1520211 The Walking Dead tt12772908 11 15 4909 7.4 Trust
tt1520211 The Walking Dead tt12772916 11 16 5559 7.8 Acts of God
tt1520211 The Walking Dead tt13060666 11 17 5074 7.4 Lockdown
tt1520211 The Walking Dead tt13062124 11 18 4892 7.5 A New Deal
tt1520211 The Walking Dead tt13062126 11 19 4475 7.2 Variant
tt1520211 The Walking Dead tt13062128 11 20 4625 7.3 What’s Been Lost
tt1520211 The Walking Dead tt13062130 11 21 4514 7.2 Outpost 22
tt1520211 The Walking Dead tt13062132 11 22 5048 7.9 Faith
tt1520211 The Walking Dead tt13062134 11 23 6657 8.5 Family
tt1520211 The Walking Dead tt13062136 11 24 11477 8.3 Rest in Peace

πŸ’‘ Tip: Take baby steps. Start by thinking of how you will collect the top 5 tv Series first, then how to collect the tconst of their episodes, etc.