πŸ’» Lab 09 – Introduction to SQLite and dbplyr

Week 03 – Day 01 - Lab Roadmap (90 min)

Author
Published

24 July 2023

πŸ₯… Objectives

  • Practice creating SQLite databases and tables
    • From SQLiteBrowser
    • As well as from R, with the dplyr::copy_to() command
  • Practice sending simple SQL queries to SQLite databases using the DBI package
  • Practice with the powerful dbplyr package

πŸ“‹ Lab Tasks

Welcome to this lab! The main objectives are to strengthen your skills in using the DBI and dbplyr packages and to get hands-on practice with SQL.

Your exciting mission is to create a database containing all the data from the IMDb Non-Commercial Dataset. Once you’ve set up the database, we’ll use dbplyr to perform queries and answer intriguing questions.

Lab Format

This lab is designed for you to work independently or in groups. There won’t be any planned TEACHING MOMENTS during this session. However, if you encounter any questions or need assistance, don’t hesitate to reach out to your instructor. They are here to help!

Let’s embark on this adventure of exploring IMDb data and honing our database skills. Let the fun begin! πŸš€

Part 1: βš™οΈ Setup (10 min)

  1. Create a new R script or, preferably, a new Quarto markdown document to host your code and answers.

  2. Load the following packages:

    library(DBI)
    library(dbplyr)
    library(RSQLite)
    library(readr)
    library(dplyr)
  3. Create a new SQLite database called imdb.db using the DBI::dbConnect() function.

    imdb_db <- dbConnect(RSQLite::SQLite(), "imdb.db")
Note

If you created a SQLite database in the morning, create a new one for this lab. You can even use the same name, but make sure you create a new database. We will change a few things in the database structure.

Part 2: πŸ“₯ Download the data (10 min)

  1. Create a folder called data in your project directory.

  2. Download the following files from the IMDb Non-Commercial Dataset website, and place them in the data folder:

  • name.basics.tsv.gz
  • title.akas.tsv.gz
  • title.basics.tsv.gz
  • title.crew.tsv.gz
  • title.episode.tsv.gz
  • title.principals.tsv.gz
  • title.ratings.tsv.gz

πŸ’‘ These files are zipped, but readr can handle them just fine. If you need to open the raw data in your own computer (outside of R), you can use 7-Zip to unzip them.

Part 3: πŸ“Š Create the tables (40 min)

Here you have two options: create the tables using SQLiteBrowser or using R directly. The major challenge here is to figure out the data types for each field and figuring out which column must be the primary key.

I choose Option 1: SQLiteBrowser
  1. Open the SQLiteBrowser application.

  2. Click on the Open Database button and select the imdb.db file you created in Part 1.

  3. Click on the Create Table button and create the following tables:

    • name_basics
    • title_akas
    • title_basics
    • title_crew
    • title_episode
    • title_principals
    • title_ratings
  4. On the Fields tab, add the fields for each table. You can find the field names in the IMDb Non-Commercial Dataset website.

    • Choose the appropriate data type for each field.
    • Make sure to select the Primary Key checkbox for the primary key field (each table has one).
  5. Alternatively, write the CREATE_TABLE command for each table using SQL. You can use the following template:

    CREATE TABLE table_name (
        column_1_name column_1_type PRIMARY KEY,
        column_2_name column_2_type,
        column_3_name column_3_type,
        ...
    );
  6. Once the tables have been created, click on the Import button and import the data from the data folder. Make sure to select the correct table name for each file.

  7. (Optional) Add indexes to the tables. Which columns of this table are you most likely to use in queries? You can index them using the following template:

    <!-- Choose indexes that will be useful for your possible future queries -->
    CREATE INDEX index_name ON table_name (column_1_name, column_2_name, ...);
I choose Option 2: R
  1. Read each file into R using the readr::read_tsv() function. For example:

    name_basics_df <- read_tsv("data/name.basics.tsv.gz")
  2. Change data types as needed. Use the dplyr::mutate() function along with the across() helper function to change the data types of multiple columns at once.

  3. For some weird reason, the developers of the IMDb database chose to represent null values with the \N string. We need to replace those values with NA using the dplyr::na_if() function or the dplyr::if_else() function.

  4. Use the dplyr::copy_to() function to copy the data frames into the SQLite database. Make sure to specify the imdb_db connection object as the second argument.

    # The variables temporary and overwrite are important!
    # Otherwise, your changes won't be persisted in the database.
    copy_to(imdb_db, name_basics, "name_basics", 
            types=c(nconst="varchar(9)", ...), # Complete this part
            unique=list(c("nconst")), # Add unique constraints (kind of PRIMARY KEYs)
            indexes=list(c("nconst")), # (optional) Add indexes
            temporary=FALSE, overwrite = TRUE)
  5. When moving to another table, delete the big data frame from memory using the rm() function. For example:

    rm(name_basics)

    This will help you avoid memory issues.

Part 4: πŸ“ Explore the data (30 min)

Now that we have the IMDb database set up, query the data using dbplyr and the usual dplyr functions to create table summaries and/or plots. If you need some initial ideas, here are some questions you can answer:

  • How many title types are there in the database? And how many titles of each type?
  • How many movies are there in the database?
  • How many movies are represented per year?
  • What is the average number of actors per movie? Over time.

(We won’t offer solutions to those questions. This is meant to be an open-ended exploration.)