✅ (Solutions) Lab 09

Published

24 July 2023

Part 1: ⚙️ Setup

Load required libraries

library(DBI)        # generic database interface
library(RSQLite)    # driver for SQLite
library(readr)      # for read_tsv
library(dplyr)      # for tbl, copy_to, and other dplyr verbs
library(dbplyr)     # to enable dplyr verbs on database tables

Then connect to the local database file imdb.db (if the file doesn’t exist yet, it will be created):

con <- dbConnect(RSQLite::SQLite(), "imdb.db")
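
A quick sanity check that the connection works is to list the tables in the database (for a brand-new file, the list will be empty):

dbListTables(con)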

Part 2: 📥 Download the data

Just follow the instructions.

Part 3: 📊 Create the tables

Step-by-step reasoning process

My first recommendation after reading the data is to run glimpse to see what you’re dealing with:

# Change path to your own
title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz")
title_basics_df %>% glimpse()

Did you see the following warning?

Warning message:                                                                                                                                                       
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 

This is because readr might have encountered some inconsistencies in the data. You can check what these inconsistencies are by running problems(title_basics_df). This function tells you which rows were problematic and why, and you can then inspect those rows to see what’s going on. In this case, it looks like readr managed to fix the problems on its own, so we can ignore this warning.
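
A minimal sketch of how that inspection might look (the exact columns of the problems() output depend on your readr/vroom version):

# Collect the parsing issues reported by readr
probs <- problems(title_basics_df)
probs                      # one row per issue: which row, which column, what was expected
probs %>% count(expected)  # summarise the kinds of issues that occurred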

Then, the big question is: how many characters should I allow when creating my string columns? That is, should I use TEXT for all string columns? VARCHAR(255)? Or should I aim to be more efficient in my choices (yes)?

Here’s one way to capture the maximum length of each string column:

title_basics_df %>% 
    summarise(across(where(~!is.numeric(.)), ~ max(stringr::str_length(.), na.rm=T)))

(The tilde ~ is a shorthand for function(x), where the dot . is the argument of the function. That is, in practice, ~!is.numeric(.) is the same as function(x) {!is.numeric(x)}.)

which returns

# A tibble: 1 x 8
  tconst titleType primaryTitle originalTitle startYear endYear runtimeMinutes genres
   <int>     <int>        <int>         <int>     <int>   <int>          <int>  <int>
1     10        12          419           419         4       4             26     32

Ignoring the output for startYear, endYear and runtimeMinutes (they only show up here because the \N placeholder made readr parse them as character; they are really numeric columns), this tells us that the following columns MUST be of the type TEXT:

  • primaryTitle
  • originalTitle

These columns CAN’T use VARCHAR(255), the conventional maximum for VARCHAR(N), because values longer than that would be cropped.

The other columns can have varying VARCHAR lengths:

  • tconst can be VARCHAR(10)
  • titleType can be VARCHAR(12)
  • genres can be VARCHAR(32)

When it comes to the genres column, it’s OK if you chose a larger number here, say VARCHAR(100) or even VARCHAR(255). Thinking long term, new genres with longer names might be added to the database in the future.
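
If you want to double-check these choices against the data, a quick count of values that would not fit is cheap (a sketch; the thresholds mirror the sizes discussed above):

# How many values would be cropped by each candidate VARCHAR size?
title_basics_df %>%
  summarise(
    primaryTitle_over_255  = sum(stringr::str_length(primaryTitle)  > 255, na.rm = TRUE),
    originalTitle_over_255 = sum(stringr::str_length(originalTitle) > 255, na.rm = TRUE),
    genres_over_32         = sum(stringr::str_length(genres)        > 32,  na.rm = TRUE)
  )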

Recoding NA values (before modifying data types)

Having decided on the data types, it’s time to properly encode the NA values:

title_basics_df <- 
  title_basics_df %>% 
  mutate(across(where(~!is.numeric(.)), ~ if_else(. == "\\N", NA, .)))

If you look at the raw data, you will only encounter \N (single backslash followed by capital N), but when representing this as a string in R we have to escape the backslash, hence \\N.
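
A quick way to convince yourself of the escaping:

writeLines("\\N")  # prints \N, i.e. what the raw file actually contains
nchar("\\N")       # returns 2: a backslash and a capital N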

🤫 Click here to read a secret

You don’t even need to treat the NULL cases like this! You can simply inform the read_tsv function that \N is the representation of NULL values when reading the data:

title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N")

This will already solve some data types for you:

title_basics_df %>% spec()
cols(
  tconst = col_character(),
  titleType = col_character(),
  primaryTitle = col_character(),
  originalTitle = col_character(),
  isAdult = col_double(),
  startYear = col_double(), # should be integer but better than character
  endYear = col_double(), # should be integer but better than character
  runtimeMinutes = col_double(), # should be integer but better than character
  genres = col_character()
)

Finally, modify data types

You only need to convert the year and runtime columns (read as character because of the \N placeholder) to integer, and isAdult to logical:

title_basics_df <-
    title_basics_df %>%
    mutate(across(c(startYear, endYear, runtimeMinutes), as.integer),
           across(c(isAdult), as.logical))
🤫 Click here to read another secret

You can also specify the types when reading the data, using the column parsers from the readr package, so there is no need to modify the data types afterwards:

title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              titleType = col_character(),
                              primaryTitle = col_character(),
                              originalTitle = col_character(),
                              isAdult = col_logical(),
                              startYear = col_integer(),
                              endYear = col_integer(),
                              runtimeMinutes = col_integer(),
                              genres = col_character()
                            ))

How does that impact memory usage? Let’s check:

format(object.size(title_basics_df), units = "auto")

returns

[1] "1.8 Gb"

Whereas, when I ran the code without specifying the column types, I got:

[1] "2 Gb"

So, it’s more efficient to specify the column types when reading the data.

Create the table in the database

Now, you can create the table:


copy_to(con, title_basics_df, "title_basics", 
        types=c(tconst="VARCHAR(10)", titleType="VARCHAR(12)", 
                primaryTitle="TEXT", originalTitle="TEXT", 
                startYear="SMALLINT", endYear="SMALLINT", 
                runtimeMinutes="SMALLINT", genres="VARCHAR(50)"), 
        indexes=list(c("titleType", "genres")),
        unique=list(c("tconst")), # tconst must be unique
        temporary=FALSE, overwrite = TRUE)

I chose to index titleType and genres because I think these are the most likely columns to be used in queries. You might want to add more indexes, but be careful not to over-index. Choosing indices is a delicate art.
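
If you are unsure whether an index pays off, SQLite’s EXPLAIN QUERY PLAN tells you whether a given query would use it. A small sketch (run after the table has been created and populated):

# Does SQLite use the (titleType, genres) index for this query?
dbGetQuery(con, "
  EXPLAIN QUERY PLAN
  SELECT COUNT(*) FROM title_basics WHERE titleType = 'movie';
")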

Table: title_basics

Read and parse it with:

title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              titleType = col_character(),
                              primaryTitle = col_character(),
                              originalTitle = col_character(),
                              isAdult = col_logical(),
                              startYear = col_integer(),
                              endYear = col_integer(),
                              runtimeMinutes = col_integer(),
                              genres = col_character()
                            ))

Then, there are two ways to upload the data to the database.

The first is copy_to (this is how we taught you in the lecture):

copy_to(con, title_basics_df, "title_basics", 
        types=c(tconst="VARCHAR(10)", titleType="VARCHAR(12)", 
                primaryTitle="TEXT", originalTitle="TEXT", 
                startYear="SMALLINT", endYear="SMALLINT", 
                runtimeMinutes="SMALLINT", genres="VARCHAR(50)"), 
        indexes=list(c("titleType", "genres")),
        unique=list(c("tconst")), # tconst must be unique
        temporary=FALSE, overwrite = TRUE)

This will create the table in the database from scratch, using the data types, indices and constraints (unique) that you specified, and upload the data in one go.

HOWEVER, there is one downside to the copy_to function: it doesn’t allow you to specify PRIMARY KEYs. We have set tconst to be unique, but this is not the same as a primary key. To set a primary key, we need to create the table manually, as shown below.
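
You can see the difference yourself with SQLite’s PRAGMA table_info, which reports a pk column for every table column; for a table created by copy_to it should be 0 everywhere (a sketch):

# Inspect the schema that copy_to produced; the pk column shows primary-key membership
dbGetQuery(con, "PRAGMA table_info(title_basics);")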

If you have already created the table in the database, you will need to delete it and recreate it to follow the instructions below.

You can delete the table with the following SQL (the dbExecute call is how you run it from R):

DROP TABLE IF EXISTS title_basics;

dbExecute(con, "DROP TABLE IF EXISTS title_basics;")

Then, you can create the table with:

CREATE TABLE title_basics (
  tconst VARCHAR(10) PRIMARY KEY,
  titleType VARCHAR(12),
  primaryTitle TEXT,
  originalTitle TEXT,
  isAdult BOOLEAN,
  startYear SMALLINT,
  endYear SMALLINT,
  runtimeMinutes SMALLINT,
  genres VARCHAR(50)
);
# Create the table
dbExecute(con, 
"CREATE TABLE title_basics (
  tconst VARCHAR(10) PRIMARY KEY,
  titleType VARCHAR(12),
  primaryTitle TEXT,
  originalTitle TEXT,
  isAdult BOOLEAN,
  startYear SMALLINT,
  endYear SMALLINT,
  runtimeMinutes SMALLINT,
  genres VARCHAR(50));")

Then add the indices:

(Some RDBMS allow you to specify the INDEX alongside the column definition, but SQLite doesn’t.)

CREATE INDEX titleType ON title_basics(titleType);
CREATE INDEX genres ON title_basics(genres);

# Add the indices
dbExecute(con, "CREATE INDEX titleType ON title_basics(titleType);")
dbExecute(con, "CREATE INDEX genres ON title_basics(genres);")

After that, you can upload the data with dbAppendTable:

dbAppendTable(con, "title_basics", title_basics_df)

You can’t use copy_to here; it would overwrite the table you just created with the CREATE TABLE statement, discarding the primary key.

Delete the data frame from memory:

rm(title_basics_df)

Table: title_crew

To keep the primary keys, I’m gonna follow the SQL way of doing things for the remaining tables. If you want to use the copy_to function, you can do so too.

Read and parse it with:

title_crew_df <- read_tsv("data/IMDB/title.crew.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              directors = col_character(),
                              writers = col_character()
                            ))

Then, create the table (as before, run the SQL via dbExecute):

CREATE TABLE title_crew (
  tconst VARCHAR(10) PRIMARY KEY,
  directors TEXT, -- it's a list of directors that can be huge
  writers TEXT -- it's a list of writers that can be huge
);

I didn’t add any indices. So let me just upload the data:

dbAppendTable(con, "title_crew", title_crew_df)

Delete the data frame from memory:

rm(title_crew_df)

Table: title_principals

Read and parse it with:

title_principals_df <- read_tsv("data/IMDB/title.principals.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              ordering = col_integer(),
                              nconst = col_character(),
                              category = col_factor(),
                              job = col_character(),
                              characters = col_character()
                            ))

The unique identifier of each row is NOT a single column, but the combination of tconst, ordering and nconst. So, when creating the table, we need to declare these three columns together as the primary key:

CREATE TABLE title_principals (
  tconst VARCHAR(10),
  ordering SMALLINT, -- it's a number between 1 and 10
  nconst VARCHAR(10),
  category VARCHAR(50),
  job VARCHAR(50),
  characters VARCHAR(255),
  PRIMARY KEY (tconst, ordering, nconst)
);

SQLite doesn’t support categorical data types, so I’m gonna leave category as a VARCHAR(50) and add an index to it instead:

CREATE INDEX category ON title_principals(category);
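
Because category was parsed with col_factor(), its distinct values are easy to inspect, which also lets you confirm that VARCHAR(50) is wide enough:

levels(title_principals_df$category)              # the distinct categories
max(nchar(levels(title_principals_df$category)))  # longest category name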

Then, we can upload the data:

# This one is huge, so it might take a while
dbAppendTable(con, "title_principals", title_principals_df)
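
Alternatively, if reading the whole file into memory is a stretch on your machine, you can stream it into the database in chunks with readr’s chunked reader. A sketch (the chunk size is arbitrary; category is read as plain character here so the chunks can be appended directly):

# Stream the file into the database without holding it all in memory
read_tsv_chunked(
  "data/IMDB/title.principals.tsv.gz",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbAppendTable(con, "title_principals", chunk)
  }),
  chunk_size = 1e6,
  na = "\\N",
  col_types = cols(
    tconst = col_character(),
    ordering = col_integer(),
    nconst = col_character(),
    category = col_character(),
    job = col_character(),
    characters = col_character()
  )
)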

Delete the data frame from memory:

rm(title_principals_df)

Table: title_ratings

Read and parse it with:

title_ratings_df <- read_tsv("data/IMDB/title.ratings.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              averageRating = col_double(),
                              numVotes = col_integer()
                            ))

Then:

CREATE TABLE title_ratings (
  tconst VARCHAR(10) PRIMARY KEY,
  averageRating DOUBLE,
  numVotes INTEGER
);

No need for indexing. So, let’s upload the data:

dbAppendTable(con, "title_ratings", title_ratings_df)

Delete the data frame from memory:

rm(title_ratings_df)

Table: title_episode

Read and parse it with:

title_episode_df <- read_tsv("data/IMDB/title.episode.tsv.gz", na = "\\N",
                            col_types = cols(
                              tconst = col_character(),
                              parentTconst = col_character(),
                              seasonNumber = col_integer(),
                              episodeNumber = col_integer()
                            ))

The primary key is tconst, but I feel it is useful to add an index to parentTconst:

CREATE TABLE title_episode (
  tconst VARCHAR(10) PRIMARY KEY,
  parentTconst VARCHAR(10),
  seasonNumber SMALLINT, -- maximum is 2021
  episodeNumber INTEGER -- some episode numbers go up to 97251!
);

The indexing:

CREATE INDEX parentTconst ON title_episode(parentTconst);

Then, we can upload the data:

dbAppendTable(con, "title_episode", title_episode_df)
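
Once uploaded, the parentTconst index helps with queries like “all episodes of a given series” (the tconst below is just an illustrative placeholder):

dbGetQuery(con, "
  SELECT tconst, seasonNumber, episodeNumber
  FROM title_episode
  WHERE parentTconst = 'tt0944947'
  ORDER BY seasonNumber, episodeNumber;
")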

Delete the data frame from memory:

rm(title_episode_df)

Table: title_akas

Read and parse it with:

title_akas_df <- read_tsv("data/IMDB/title.akas.tsv.gz", na = "\\N",
                            col_types = cols(
                              titleId = col_character(),
                              ordering = col_integer(),
                              title = col_character(),
                              region = col_character(),
                              language = col_character(),
                              types = col_character(),
                              attributes = col_character(),
                              isOriginalTitle = col_logical()
                            ))

Calculating the maximum length of each column:

title_akas_df %>% 
  summarise(across(where(~!is.numeric(.)), ~ max(stringr::str_length(.), na.rm=T)))

yields:

  titleId    title region language types attributes isOriginalTitle
    <int>    <int>  <int>    <int> <int>      <int>           <int>
1      10 16250429      4        3    20         62               5

The two columns titleId and ordering together act as the primary key, so we need to declare them as such:

CREATE TABLE title_akas (
  titleId VARCHAR(10),
  ordering SMALLINT,
  title TEXT,
  region VARCHAR(4),
  language VARCHAR(3),
  types VARCHAR(20),
  attributes VARCHAR(62),
  isOriginalTitle BOOLEAN,
  PRIMARY KEY (titleId, ordering)
);
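
If you want to confirm that this pair really is unique before declaring it the primary key (otherwise the insert below would fail), a quick check in R (it takes a moment on a table this size):

# Any duplicated (titleId, ordering) pairs? Expecting zero rows.
title_akas_df %>%
  count(titleId, ordering) %>%
  filter(n > 1)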

And I will also add region, language and types as indices:

CREATE INDEX region ON title_akas(region);
CREATE INDEX language ON title_akas(language);
CREATE INDEX types ON title_akas(types);

Then, we can upload the data:

dbAppendTable(con, "title_akas", title_akas_df)

Delete the data frame from memory:

rm(title_akas_df)

Table: name_basics

Read and parse it with:

name_basics_df <- read_tsv("data/IMDB/name.basics.tsv.gz", na = "\\N",
                            col_types = cols(
                              nconst = col_character(),
                              primaryName = col_character(),
                              birthYear = col_integer(),
                              deathYear = col_integer(),
                              primaryProfession = col_character(),
                              knownForTitles = col_character()
                            ))

The primary key is nconst. Let’s create the table:

CREATE TABLE name_basics (
  nconst VARCHAR(10) PRIMARY KEY,
  primaryName TEXT,
  birthYear SMALLINT,
  deathYear SMALLINT,
  primaryProfession TEXT,
  knownForTitles TEXT
);

I might want to run analysis on the primaryName column, so I will add an index to it:

CREATE INDEX primaryName ON name_basics(primaryName);

Then, we can upload the data:

dbAppendTable(con, "name_basics", name_basics_df)
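
A query this index helps with (the name is just an example):

dbGetQuery(con, "
  SELECT nconst, primaryName, birthYear, deathYear
  FROM name_basics
  WHERE primaryName = 'Alfred Hitchcock';
")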

Delete the data frame from memory:

rm(name_basics_df)

Part 4: 📊 Queries

I ended up with a huge database file: imdb.db is almost 15 GB. It is unlikely that I would be able to load all the data into memory at once, so I will use the database to answer the questions.
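
For the queries, I can either write SQL directly with dbGetQuery or lean on dbplyr (loaded in the setup), which translates dplyr verbs into SQL so that only the small result is pulled into R. A sketch:

# Lazily reference a table in the database; nothing is loaded into memory yet
title_ratings_tbl <- tbl(con, "title_ratings")

# The filter/summarise run inside SQLite; collect() brings back a single row
title_ratings_tbl %>%
  filter(numVotes > 100000) %>%
  summarise(mean_rating = mean(averageRating, na.rm = TRUE)) %>%
  collect()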

How much data is there in each table?

library(purrr) # for map_dfr; not loaded in the setup above

dbListTables(con) %>% 
  map_dfr(~dbGetQuery(con, paste0("SELECT COUNT(*) AS n FROM ", .x))) %>% 
  mutate(table = dbListTables(con)) %>% 
  select(table, n) %>%
  arrange(desc(n))

returns

            table        n
 title_principals 56721158
       title_akas 36144552
      name_basics 12709955
     title_basics 10025512
       title_crew 10025512
    title_episode  7627595
    title_ratings  1332367
     sqlite_stat1        0
     sqlite_stat4        0

(Ignore the internal SQLite tables.)
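
When you are done querying, it’s good practice to close the connection:

dbDisconnect(con)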