✅ (Solutions) Lab 09
Part 1: ⚙️ Setup
Load required libraries
library(DBI) # generic database interface
library(RSQLite) # driver for SQLite
library(readr) # for read_tsv
library(dplyr) # for tbl, copy_to, and other dplyr verbs
library(dbplyr) # to enable dplyr verbs on database tables
Then create the local database file imdb.db
and connect to it (if this file doesn’t exist, it will be created):
con <- dbConnect(RSQLite::SQLite(), "imdb.db")
Part 2: 📥 Download the data
Just follow the instructions
Part 3: 📊 Create the tables
Step-by-step reasoning process
My first recommendation after reading the data is to run glimpse
to see what you’re dealing with:
# Change path to your own
title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz")
title_basics_df %>% glimpse()
Did you see the following warning?
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
This is because readr might have encountered some inconsistencies in the data. You can check what these inconsistencies are by running problems(title_basics_df). This function will tell you which rows were problematic and why, and you can later inspect those rows to see what's going on. In this case, it looks like readr has managed to fix the problems, so we can ignore this warning.
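If you are curious, here is a quick sketch of my own for summarising those issues (problems() returns a tibble with row, col, expected and actual columns):

# Sketch: summarise the parsing issues reported by readr/vroom
parsing_issues <- problems(title_basics_df)
parsing_issues                                                # one row per issue
parsing_issues %>% count(col, expected, actual, sort = TRUE)  # most common issues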
Then, the big issue is: how many characters should I use when creating my string columns? That is, should I use TEXT in all string columns? VARCHAR(255)? Or should I aim to be more efficient in my choices (yes)?
Here’s one way to capture the maximum length of each string column:
title_basics_df %>%
  summarise(across(where(~ !is.numeric(.)), ~ max(stringr::str_length(.), na.rm = TRUE)))
(The tilde ~ is a shorthand for function(x), where the dot . is the argument of the function. That is, in practice, ~ !is.numeric(.) is the same as function(x) { !is.numeric(x) }.)
which returns
# A tibble: 1 x 8
tconst titleType primaryTitle originalTitle startYear endYear runtimeMinutes genres
<int> <int> <int> <int> <int> <int> <int> <int>
1 10 12 419 419 4 4 26 32
Ignoring the output of startYear, endYear, runtimeMinutes – after all, these are numeric columns – this tells us that the following columns MUST be of the type TEXT:
primaryTitle
originalTitle
These columns can't even be VARCHAR(255) (the largest length we considered), because titles up to 419 characters long would be cropped.
The other columns can have varying VARCHAR lengths:

tconst can be VARCHAR(10)
titleType can be VARCHAR(12)
genres can be VARCHAR(32)
When it comes to the column genres, it's OK if you chose a larger number here, say VARCHAR(100) or even VARCHAR(255). Thinking long-term, new genres with longer names might be added to the database in the future.
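If you want to automate this decision, here is a small sketch of my own (the 255-character cutoff and the suggest_type helper are my choices, not part of the lab):

# Sketch: turn the measured maximum lengths into column type suggestions
max_lens <- title_basics_df %>%
  summarise(across(where(~ !is.numeric(.)), ~ max(stringr::str_length(.), na.rm = TRUE)))
suggest_type <- function(n) if (n > 255) "TEXT" else paste0("VARCHAR(", n, ")")
sapply(max_lens, suggest_type)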
Recoding NA values (before modifying data types)
Having decided on the data types, it’s time to properly encode the NA values:
title_basics_df <-
  title_basics_df %>%
  mutate(across(where(~ !is.numeric(.)), ~ if_else(. == "\\N", NA, .)))
If you look at the raw data, you will only encounter \N (a single backslash followed by a capital N), but when representing this as a string in R we have to escape the backslash, hence \\N.
🤫 Click here to read a secret
You don’t even need to treat the NULL cases like this! You can simply inform the read_tsv
function that \N
is the representation of NULL values when reading the data:
title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N")
This will already solve some data types for you:
title_basics_df %>% spec()
cols(
tconst = col_character(),
titleType = col_character(),
primaryTitle = col_character(),
originalTitle = col_character(),
isAdult = col_double(),
startYear = col_double(), # should be integer but better than character
endYear = col_double(), # should be integer but better than character
runtimeMinutes = col_double(), # should be integer but better than character
genres = col_character()
)
Finally, modify data types
You only need to modify the integer columns, plus the isAdult column, which are incorrectly classified as character columns:
title_basics_df <-
  title_basics_df %>%
  mutate(across(c(startYear, endYear, runtimeMinutes), as.integer),
         across(c(isAdult), as.logical))
🤫 Click here to read another secret
You can also specify the types when reading the data, using the column parsers that exist in the readr
package without the need to modify the data types afterwards:
title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
titleType = col_character(),
primaryTitle = col_character(),
originalTitle = col_character(),
isAdult = col_logical(),
startYear = col_integer(),
endYear = col_integer(),
runtimeMinutes = col_integer(),
genres = col_character()
))
How does that impact memory usage? Let’s check:
format(object.size(title_basics_df), units = "auto")
returns
[1] "1.8 Gb"
Whereas, when I ran the code without specifying the column types, I got:
[1] "2 Gb"
So, it’s more efficient to specify the column types when reading the data.
Create the table in the database
Now, you can create the table:
copy_to(con, title_basics_df, "title_basics",
types=c(tconst="VARCHAR(10)", titleType="VARCHAR(12)",
primaryTitle="TEXT", originalTitle="TEXT",
startYear="SMALLINT", endYear="SMALLINT",
runtimeMinutes="SMALLINT", genres="VARCHAR(50)"),
indexes=list(c("titleType", "genres")),
unique=list(c("tconst")), # tconst must be unique
temporary=FALSE, overwrite = TRUE)
I chose to index titleType
and genres
because I think these are the most likely columns to be used in queries. You might want to add more indexes, but be careful not to over-index. Choosing indices is a delicate art.
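One way to sanity-check your choices (my own addition, using SQLite's EXPLAIN QUERY PLAN) is to ask the engine whether a query would actually use an index:

# Does SQLite use the index when filtering on titleType?
dbGetQuery(con, "EXPLAIN QUERY PLAN
                 SELECT * FROM title_basics WHERE titleType = 'movie';")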
Table: title_basics
Read and parse it with:
title_basics_df <- read_tsv("data/IMDB/title.basics.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
titleType = col_character(),
primaryTitle = col_character(),
originalTitle = col_character(),
isAdult = col_logical(),
startYear = col_integer(),
endYear = col_integer(),
runtimeMinutes = col_integer(),
genres = col_character()
))
Then, there are two ways to upload the data to the database. The first, using copy_to, is how we taught you in the lecture:
copy_to(con, title_basics_df, "title_basics",
types=c(tconst="VARCHAR(10)", titleType="VARCHAR(12)",
primaryTitle="TEXT", originalTitle="TEXT",
startYear="SMALLINT", endYear="SMALLINT",
runtimeMinutes="SMALLINT", genres="VARCHAR(50)"),
indexes=list(c("titleType", "genres")),
unique=list(c("tconst")), # tconst must be unique
temporary=FALSE, overwrite = TRUE)
This will create the table in the database from scratch, using the data types, indices and constraints (unique) that you specified, and it will upload the data in one go. HOWEVER, there is one downside to the copy_to function: it doesn't allow you to specify PRIMARY KEYs. We have set tconst to be unique, but this is not the same as a primary key. To set a primary key, we need to create the table manually. See the next tab for more details.
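You can see for yourself what copy_to created by looking at the table definition SQLite stored (a quick check of my own):

# The CREATE TABLE statement SQLite recorded for the table
dbGetQuery(con, "SELECT sql FROM sqlite_master WHERE name = 'title_basics';")

The returned statement should not contain a PRIMARY KEY clause, which is exactly the limitation described above.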
If you have already created the table in the database, you might need to delete and recreate it again to follow the instructions below.
You can delete the table with:
DROP TABLE IF EXISTS title_basics;
dbExecute(con, "DROP TABLE IF EXISTS title_basics;")
Then, you can create the table with:
CREATE TABLE title_basics (
    tconst VARCHAR(10) PRIMARY KEY,
    titleType VARCHAR(12),
    primaryTitle TEXT,
    originalTitle TEXT,
    isAdult BOOLEAN,
    startYear SMALLINT,
    endYear SMALLINT,
    runtimeMinutes SMALLINT,
    genres VARCHAR(50)
);
# Create the table
dbExecute(con,
"CREATE TABLE title_basics (
tconst VARCHAR(10) PRIMARY KEY,
titleType VARCHAR(12),
primaryTitle TEXT,
originalTitle TEXT,
isAdult BOOLEAN,
startYear SMALLINT,
endYear SMALLINT,
runtimeMinutes SMALLINT,
genres VARCHAR(50));")
Then add the indices:
(Some RDBMS allow you to specify the INDEX alongside the column definition, but SQLite doesn’t.)
CREATE INDEX titleType ON title_basics(titleType);
CREATE INDEX genres ON title_basics(genres);
# Add the indices
dbExecute(con, "CREATE INDEX titleType ON title_basics(titleType);")
dbExecute(con, "CREATE INDEX genres ON title_basics(genres);")
After which, you can now upload the data with dbAppendTable
:
dbAppendTable(con, "title_basics", title_basics_df)
You can't use copy_to here; otherwise, you would overwrite the table you just created with the CREATE TABLE statement.
Delete the data frame from memory:
rm(title_basics_df)
Table: title_crew
To maintain the primary keys, I’m gonna follow the SQL way of doing things. If you want to use the copy_to
function, you can do so too.
Read and parse it with:
title_crew_df <- read_tsv("data/IMDB/title.crew.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
directors = col_character(),
writers = col_character()
))
Then:
CREATE TABLE title_crew (
    tconst VARCHAR(10) PRIMARY KEY,
    directors TEXT, -- it's a list of directors that can be huge
    writers TEXT    -- it's a list of writers that can be huge
);
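As with title_basics, you can run this statement from R with dbExecute; a minimal sketch:

# Create title_crew using the SQL above
dbExecute(con,
  "CREATE TABLE title_crew (
     tconst VARCHAR(10) PRIMARY KEY,
     directors TEXT,
     writers TEXT);")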
I didn’t add any indices. So let me just upload the data:
dbAppendTable(con, "title_crew", title_crew_df)
Delete the data frame from memory:
rm(title_crew_df)
Table: title_principals
Read and parse it with:
title_principals_df <- read_tsv("data/IMDB/title.principals.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
ordering = col_integer(),
nconst = col_character(),
category = col_factor(),
job = col_character(),
characters = col_character()
))
The unique identifier of each row is NOT a single column, but a combination of tconst
, ordering
and nconst
. So, when creating the table, we need to specify the three columns as the primary key:
CREATE TABLE title_principals (
    tconst VARCHAR(10),
    ordering SMALLINT, -- it's a number between 1 and 10
    nconst VARCHAR(10),
    category VARCHAR(50),
    job VARCHAR(50),
    characters VARCHAR(255),
    PRIMARY KEY (tconst, ordering, nconst)
);
SQLite doesn’t support categorical data types, so I’m gonna leave category
as a VARCHAR(50), but then, I’m gonna add an index to it:
CREATE INDEX category ON title_principals(category);
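Once the data is uploaded, this index speeds up queries that filter on category; for example (an illustration of my own):

# Count the principals credited as directors (benefits from the category index)
dbGetQuery(con, "SELECT COUNT(*) AS n
                 FROM title_principals
                 WHERE category = 'director';")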
Then, we can upload the data:
# This one is huge, so it might take a while
dbAppendTable(con, "title_principals", title_principals_df)
Delete the data frame from memory:
rm(title_principals_df)
Table title_ratings
Read and parse it with:
title_ratings_df <- read_tsv("data/IMDB/title.ratings.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
averageRating = col_double(),
numVotes = col_integer()
))
Then:
CREATE TABLE title_ratings (
    tconst VARCHAR(10) PRIMARY KEY,
    averageRating DOUBLE,
    numVotes INTEGER
);
No need for indexing. So, let’s upload the data:
dbAppendTable(con, "title_ratings", title_ratings_df)
Delete the data frame from memory:
rm(title_ratings_df)
Table title_episode
Read and parse it with:
title_episode_df <- read_tsv("data/IMDB/title.episode.tsv.gz", na = "\\N",
  col_types = cols(
tconst = col_character(),
parentTconst = col_character(),
seasonNumber = col_integer(),
episodeNumber = col_integer()
))
The primary key is tconst
, but I feel it is useful to add an index to parentTconst
:
CREATE TABLE title_episode (
    tconst VARCHAR(10) PRIMARY KEY,
    parentTconst VARCHAR(10),
    seasonNumber SMALLINT, -- maximum is 2021
    episodeNumber INTEGER  -- some episode numbers go up to 97251!
);
The indexing:
CREATE INDEX parentTconst ON title_episode(parentTconst);
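To illustrate why (my own example), this is the kind of lookup that the parentTconst index makes fast once the table is populated; the tconst value used here is just an example:

# All episodes of one parent series, ordered by season and episode
dbGetQuery(con, "SELECT tconst, seasonNumber, episodeNumber
                 FROM title_episode
                 WHERE parentTconst = 'tt0944947'
                 ORDER BY seasonNumber, episodeNumber;")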
Then, we can upload the data:
dbAppendTable(con, "title_episode", title_episode_df)
Delete the data frame from memory:
rm(title_episode_df)
Table title_akas
Read and parse it with:
title_akas_df <- read_tsv("data/IMDB/title.akas.tsv.gz", na = "\\N",
  col_types = cols(
titleId = col_character(),
ordering = col_integer(),
title = col_character(),
region = col_character(),
language = col_character(),
types = col_character(),
attributes = col_character(),
isOriginalTitle = col_logical()
))
Calculating the maximum length of each column:
title_akas_df %>%
  summarise(across(where(~ !is.numeric(.)), ~ max(stringr::str_length(.), na.rm = TRUE)))
yields:
titleId title region language types attributes isOriginalTitle
<int> <int> <int> <int> <int> <int> <int>
1 10 16250429 4 3 20 62 5
The two columns titleId and ordering together form the primary key, so we need to specify them as such:
CREATE TABLE title_akas (
    titleId VARCHAR(10),
    ordering SMALLINT,
    title TEXT,
    region VARCHAR(4),
    language VARCHAR(3),
    types VARCHAR(20),
    attributes VARCHAR(62),
    isOriginalTitle BOOLEAN,
    PRIMARY KEY (titleId, ordering)
);
And I will also add region
, language
and types
as indices:
CREATE INDEX region ON title_akas(region);
CREATE INDEX language ON title_akas(language);
CREATE INDEX types ON title_akas(types);
Then, we can upload the data:
dbAppendTable(con, "title_akas", title_akas_df)
Delete the data frame from memory:
rm(title_akas_df)
Table name_basics
Read and parse it with:
name_basics_df <- read_tsv("data/IMDB/name.basics.tsv.gz", na = "\\N",
  col_types = cols(
nconst = col_character(),
primaryName = col_character(),
birthYear = col_integer(),
deathYear = col_integer(),
primaryProfession = col_character(),
knownForTitles = col_character()
))
The primary key is nconst
. Let’s create the table:
CREATE TABLE name_basics (
    nconst VARCHAR(10) PRIMARY KEY,
    primaryName TEXT,
    birthYear SMALLINT,
    deathYear SMALLINT,
    primaryProfession TEXT,
    knownForTitles TEXT
);
I might want to run analysis on the primaryName
column, so I will add an index to it:
CREATE INDEX primaryName ON name_basics(primaryName);
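For instance (my own example), a lookup like this one benefits from the index once the data is uploaded:

# Find a person by exact name (uses the primaryName index)
dbGetQuery(con, "SELECT nconst, primaryName, birthYear, primaryProfession
                 FROM name_basics
                 WHERE primaryName = 'Stanley Kubrick';")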
Then, we can upload the data:
dbAppendTable(con, "name_basics", name_basics_df)
Delete the data frame from memory:
rm(name_basics_df)
Part 4: 📊 Queries
I ended up with a huge database file: imdb.db is almost 15 GB. It is unlikely that I would be able to load all the data into memory, so I will use the database to answer the questions.
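Since dbplyr is loaded, one convenient way to do that (a sketch, not tied to any particular lab question) is to build lazy queries with tbl() and only collect() the final result:

# dplyr verbs on a lazy table are translated to SQL and run inside SQLite;
# only the (small) result of collect() is pulled into R
title_ratings_db <- tbl(con, "title_ratings")
title_ratings_db %>%
  filter(numVotes > 100000) %>%
  arrange(desc(averageRating)) %>%
  head(10) %>%
  collect()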
How much data is there in each table?
library(purrr) # for map_dfr (not loaded in the setup above)

dbListTables(con) %>%
  map_dfr(~ dbGetQuery(con, paste0("SELECT COUNT(*) AS n FROM ", .x))) %>%
  mutate(table = dbListTables(con)) %>%
  select(table, n) %>%
  arrange(desc(n))
returns
table | n |
---|---|
title_principals | 56721158 |
title_akas | 36144552 |
name_basics | 12709955 |
title_basics | 10025512 |
title_crew | 10025512 |
title_episode | 7627595 |
title_ratings | 1332367 |
sqlite_stat1 | 0 |
sqlite_stat4 | 0 |
(Ignore the internal SQLite tables.)