π» Lab 09 β Introduction to SQLite and dbplyr
Week 03 β Day 01 - Lab Roadmap (90 min)
π₯ Objectives
- Practice creating SQLite databases and tables
- From SQLiteBrowser
- As well as from R, with the
dplyr::copy_to()
command
- Practice sending simple SQL queries to SQLite databases using the DBI package
- Practice with the powerful
dbplyr
package
ποΈ Links to documentation
New packages weβre using today
Very useful links
- SQL Datatypes (Read it to learn about the different types of INTEGER, for example)
- Datatypes in SQLite
Old packages weβre using today
π Lab Tasks
Welcome to this lab! The main objectives are to strengthen your skills in using the DBI
and dbplyr
packages and to get hands-on practice with SQL.
Your exciting mission is to create a database containing all the data from the IMDb Non-Commercial Dataset. Once youβve set up the database, weβll use dbplyr
to perform queries and answer intriguing questions.
Lab Format
This lab is designed for you to work independently or in groups. There wonβt be any planned TEACHING MOMENTS during this session. However, if you encounter any questions or need assistance, donβt hesitate to reach out to your instructor. They are here to help!
Letβs embark on this adventure of exploring IMDb data and honing our database skills. Let the fun begin! π
Part 1: βοΈ Setup (10 min)
Create a new R script or, preferably, a new Quarto markdown document to host your code and answers.
Load the following packages:
library(DBI) library(dbplyr) library(RSQLite) library(readr) library(dplyr)
Create a new SQLite database called
imdb.db
using theDBI::dbConnect()
function.<- dbConnect(RSQLite::SQLite(), "imdb.db") imdb_db
If you created a SQLite database in the morning, create a new one for this lab. You can even use the same name, but make sure you create a new database. We will change a few things in the database structure.
Part 2: π₯ Download the data (10 min)
Create a folder called
data
in your project directory.Download the following files from the IMDb Non-Commercial Dataset website, and place them in the
data
folder:
- name.basics.tsv.gz
- title.akas.tsv.gz
- title.basics.tsv.gz
- title.crew.tsv.gz
- title.episode.tsv.gz
- title.principals.tsv.gz
- title.ratings.tsv.gz
π‘ These files are zipped, but readr
can handle them just fine. If you need to open the raw data in your own computer (outside of R), you can use 7-Zip to unzip them.
Part 3: π Create the tables (40 min)
Here you have two options: create the tables using SQLiteBrowser or using R directly. The major challenge here is to figure out the data types for each field and figuring out which column must be the primary key.
I choose Option 1: SQLiteBrowser
Open the SQLiteBrowser application.
Click on the
Open Database
button and select theimdb.db
file you created in Part 1.Click on the
Create Table
button and create the following tables:name_basics
title_akas
title_basics
title_crew
title_episode
title_principals
title_ratings
On the
Fields
tab, add the fields for each table. You can find the field names in the IMDb Non-Commercial Dataset website.- Choose the appropriate data type for each field.
- Make sure to select the
Primary Key
checkbox for the primary key field (each table has one).
Alternatively, write the
CREATE_TABLE
command for each table using SQL. You can use the following template:CREATE TABLE table_name ( PRIMARY KEY, column_1_name column_1_type column_2_name column_2_type, column_3_name column_3_type,... );
Once the tables have been created, click on the
Import
button and import the data from thedata
folder. Make sure to select the correct table name for each file.(Optional) Add indexes to the tables. Which columns of this table are you most likely to use in queries? You can index them using the following template:
<!-- Choose indexes that will be useful for your possible future queries --> CREATE INDEX index_name ON table_name (column_1_name, column_2_name, ...);
I choose Option 2: R
Read each file into R using the
readr::read_tsv()
function. For example:<- read_tsv("data/name.basics.tsv.gz") name_basics_df
Change data types as needed. Use the
dplyr::mutate()
function along with theacross()
helper function to change the data types of multiple columns at once.For some weird reason, the developers of the IMDb database chose to represent null values with the
\N
string. We need to replace those values withNA
using thedplyr::na_if()
function or thedplyr::if_else()
function.Use the
dplyr::copy_to()
function to copy the data frames into the SQLite database. Make sure to specify theimdb_db
connection object as the second argument.# The variables temporary and overwrite are important! # Otherwise, your changes won't be persisted in the database. copy_to(imdb_db, name_basics, "name_basics", types=c(nconst="varchar(9)", ...), # Complete this part unique=list(c("nconst")), # Add unique constraints (kind of PRIMARY KEYs) indexes=list(c("nconst")), # (optional) Add indexes temporary=FALSE, overwrite = TRUE)
When moving to another table, delete the big data frame from memory using the
rm()
function. For example:rm(name_basics)
This will help you avoid memory issues.
Part 4: π Explore the data (30 min)
Now that we have the IMDb database set up, query the data using dbplyr
and the usual dplyr
functions to create table summaries and/or plots. If you need some initial ideas, here are some questions you can answer:
- How many title types are there in the database? And how many titles of each type?
- How many movies are there in the database?
- How many movies are represented per year?
- What is the average number of actors per movie? Over time.
(We wonβt offer solutions to those questions. This is meant to be an open-ended exploration.)