DS202A Week 02 - Source Code Part II

Published: 10 Oct 2023

This .qmd is an expanded version of what I taught in the Week 02 lecture. It contains some of the code I used in the lecture, plus some extra explanations and examples.

Context

In Week 02’s labs, you tackled a problem set centred around the tidyverse in R, where the goal was to transform a given dataset into a desired output. It made me happy to see how much you all engaged with those exercises; they prompted many of you to realise you had some knowledge gaps about R programming in general that needed to be addressed. This is a great way of learning, and I am truly glad you took the time to ask us questions about what was puzzling you in that lab.

This is Part II of the source code for the Week 02 lecture. You can read Part I here. The point of this document is to talk more about tidyverse.

FAQ

Q: What really is the tidyverse?

The tidyverse is not one package but a set of packages that, together, help you tackle the most common data manipulation tasks. You can see the names of the packages involved on their website.

Q: Where do I get a list of all the functions available in tidyverse?

You have to go to the documentation of the specific package you are interested in. For example, if you want to see all the functions available in dplyr, you can go to its documentation. Click on the name of a function to see an explanation as well as examples of how to use it.
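You can also browse documentation without leaving R, using the built-in help system. A quick sketch:

```r
?dplyr::select            # open the help page for a specific function
help(package = "dplyr")   # list everything the package exports
```

The `?` help pages contain the same explanations and examples as the website, so they are handy when you are offline.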

💡 You might also want to download some cheatsheets. Here are the links to those I shared in the lecture:

Q: How do I install tidyverse?

Just run install.packages("tidyverse") in your R console. This will install all the packages that are part of the tidyverse ecosystem.
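For example, a typical first-time setup looks like this (run in the R Console, not inside a .qmd chunk):

```r
install.packages("tidyverse")  # one-off installation, in the Console only
library(tidyverse)             # load the packages in the current session
```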

DO NOT leave an install.packages() command in your .qmd file. Always run it in the R Console instead. Otherwise, you won’t be able to render your markdown file as HTML later.

How to write code from scratch

I am using the 💻 Week 02 Lab as an example.

1. Dedicate the first chunk to loading packages

In your .qmd file, make it a habit to dedicate the first chunk of code to loading all the packages you’ll use, via the library() function. True, you may not know all the packages you will need when starting a new file. But making the first chunk a reserved space for this task creates a neat, centralised hub. In my case, I often begin with just library(dplyr). I keep writing in my notebook, adding markdown text and code as I go along, and as soon as I feel the need for an additional R package like readr, I simply circle back to that first chunk to add it. (You have to rerun the chunk for the new packages to be loaded.)

Following this iterative process, this is the list of packages I ended up using in the Week 02 lab:

library(dplyr)       # for data manipulation
library(tidyr)       # also for data manipulation
library(readr)       # for reading data
library(lubridate)   # for working with dates
library(tidyselect)  # for selecting columns

Why bother?

Best practices exist for a reason. The main benefit of keeping a tidy list of packages at the start of your file is so that, when rendering your .qmd file to a nice and beautiful HTML webpage, the program will run it from scratch, from top to bottom. If you reference a function in the middle of your file but you haven’t loaded the package that contains that function, the program will throw an error.

Q: Do I need to include every single package in there?

Not really. As mentioned in the lecture, if you use a particular package only once or twice, you don’t need to load the whole package. Instead, you can call the function directly, prefixed with the package name. For example, suppose you want to use lubridate’s dmy() function but won’t need any other date-related functions. You can just call lubridate::dmy() directly (library(lubridate) is not necessary).
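Here is a minimal sketch of the two equivalent ways of calling that function:

```r
# Option 1: load the whole package, then call the function by name
library(lubridate)
dmy("01/04/1969")

# Option 2: call the function directly with the package prefix,
# without loading the package
lubridate::dmy("01/04/1969")

# Both return the same Date object: 1969-04-01
```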

2. Read your data early on

This one was easy in W02 labs because we gave you the code:

filepath <- "data/UK-HPI-full-file-2023-06.csv"
uk_hpi <- read_csv(filepath)

But what if we hadn’t? You would have to locate your data file yourself and then read it. Even with the code above, it is possible you would get an error message. A No such file error (or similar) should prompt you to think about relative paths. Is the data truly inside the data/ folder? Or, more fundamentally, is there a data/ folder at all? Then, go about fixing these problems one by one.
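A few base R functions can help you diagnose these path problems (a sketch; adjust the paths to your own project):

```r
getwd()        # where is R currently resolving relative paths from?
list.files()   # is there a data/ folder here at all?
list.files("data")  # what is actually inside the data/ folder?

# does the file exist where the code expects it?
file.exists("data/UK-HPI-full-file-2023-06.csv")
```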

3. Do things step-by-step

When you saw the final solution to Part III, you probably understood what was going on but didn’t feel confident enough (yet) to write this pipe chain yourself from scratch:

selected_cols    <- c("Date", "RegionName", "12m%Change")
selected_regions <- c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland") 

df <-
    uk_hpi %>%
    select(all_of(selected_cols)) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region %in% selected_regions) %>%
    drop_na(yearly_change) %>% 
    mutate(date = dmy(date)) %>% 
    arrange(desc(date))

df %>% head(8)

Some of it is my fault! By showing you just the result, I didn’t help you reason about it step-by-step. Here is how to do it:

3.1 The select() line

The end goal is to arrive at a data frame with the three columns:

|date |region | yearly_change|

This should prompt you to glimpse at the original data frame to see if you can find these columns there:

glimpse(uk_hpi)
Rows: 136,935
Columns: 54
$ Date                     <chr> "01/01/2004", "01/02/2004", "01/03/2004", "01…
$ RegionName               <chr> "Aberdeenshire", "Aberdeenshire", "Aberdeensh…
$ AreaCode                 <chr> "S12000034", "S12000034", "S12000034", "S1200…
$ AveragePrice             <dbl> 81693.67, 81678.76, 83525.10, 84333.68, 86379…
$ Index                    <dbl> 40.86421, 40.85676, 41.78032, 42.18478, 43.20…
$ IndexSA                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1m%Change`              <dbl> NA, -0.01824784, 2.26048321, 0.96807069, 2.42…
$ `12m%Change`             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ AveragePriceSA           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SalesVolume              <dbl> 388, 326, 453, 571, 502, 525, 652, 512, 497, …
$ DetachedPrice            <dbl> 122490.1, 121280.9, 123395.4, 122334.0, 12449…
$ DetachedIndex            <dbl> 43.61098, 43.18047, 43.93332, 43.55543, 44.32…
$ `Detached1m%Change`      <dbl> NA, -0.9871659, 1.7435088, -0.8601624, 1.7696…
$ `Detached12m%Change`     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SemiDetachedPrice        <dbl> 70563.16, 70804.42, 72689.07, 74484.23, 76637…
$ SemiDetachedIndex        <dbl> 40.82189, 40.96146, 42.05176, 43.09029, 44.33…
$ `SemiDetached1m%Change`  <dbl> NA, 0.3419153, 2.6617665, 2.4696403, 2.891219…
$ `SemiDetached12m%Change` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TerracedPrice            <dbl> 55319.63, 55720.08, 57362.86, 59193.40, 61202…
$ TerracedIndex            <dbl> 38.30567, 38.58295, 39.72049, 40.98803, 42.37…
$ `Terraced1m%Change`      <dbl> NA, 0.7238695, 2.9482802, 3.1911554, 3.394633…
$ `Terraced12m%Change`     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FlatPrice                <dbl> 48016.07, 49030.18, 50349.45, 51736.22, 53230…
$ FlatIndex                <dbl> 42.43355, 43.32975, 44.49564, 45.72118, 47.04…
$ `Flat1m%Change`          <dbl> NA, 2.1120161, 2.6907230, 2.7543015, 2.887739…
$ `Flat12m%Change`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CashPrice                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CashIndex                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Cash1m%Change`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Cash12m%Change`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CashSalesVolume          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MortgagePrice            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MortgageIndex            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Mortgage1m%Change`      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Mortgage12m%Change`     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MortgageSalesVolume      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FTBPrice                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FTBIndex                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `FTB1m%Change`           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `FTB12m%Change`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FOOPrice                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ FOOIndex                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `FOO1m%Change`           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `FOO12m%Change`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ NewPrice                 <dbl> 88436.14, 88606.45, 90296.91, 90319.88, 91989…
$ NewIndex                 <dbl> 40.26725, 40.34479, 41.11451, 41.12496, 41.88…
$ `New1m%Change`           <dbl> NA, 0.19257621, 1.90783778, 0.02543242, 1.848…
$ `New12m%Change`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ NewSalesVolume           <dbl> 103, 107, 140, 180, 167, 164, 163, 130, 142, …
$ OldPrice                 <dbl> 81043.95, 80965.30, 82903.24, 84003.99, 86222…
$ OldIndex                 <dbl> 40.88337, 40.84369, 41.82130, 42.37659, 43.49…
$ `Old1m%Change`           <dbl> NA, -0.0970528, 2.3935490, 1.3277553, 2.64123…
$ `Old12m%Change`          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ OldSalesVolume           <dbl> 285, 219, 313, 391, 335, 361, 489, 382, 355, …

OK, I can’t find these exact columns, but I can see that there are columns called Date, RegionName and 12m%Change.

Only now would I start thinking about the select() function. If I were new to dplyr, I would go to the website, more specifically to the select() documentation. There, I would see that the function takes two arguments: .data and .... The first one is the data frame you want to select columns from. The second one is the set of column names you want to select. This might not be super clear at first, but I could then scroll down to the examples.

Putting it all together, I would arrive at:

select(uk_hpi, Date, RegionName, `12m%Change`)
# A tibble: 136,935 × 3
   Date       RegionName    `12m%Change`
   <chr>      <chr>                <dbl>
 1 01/01/2004 Aberdeenshire           NA
 2 01/02/2004 Aberdeenshire           NA
 3 01/03/2004 Aberdeenshire           NA
 4 01/04/2004 Aberdeenshire           NA
 5 01/05/2004 Aberdeenshire           NA
 6 01/06/2004 Aberdeenshire           NA
 7 01/07/2004 Aberdeenshire           NA
 8 01/08/2004 Aberdeenshire           NA
 9 01/09/2004 Aberdeenshire           NA
10 01/10/2004 Aberdeenshire           NA
# ℹ 136,925 more rows

Yes! That’s exactly what I need. This step works, so I could stop here and move on. Let’s do that.

3.2 The rename() line

We are asked to use dplyr’s rename() function. Again, the examples are perhaps the best way to get a sense of how to use it. I could conclude that what I need is something like:

rename(uk_hpi, region = RegionName, date = Date, yearly_change = `12m%Change`)
# A tibble: 136,935 × 54
   date     region AreaCode AveragePrice Index IndexSA `1m%Change` yearly_change
   <chr>    <chr>  <chr>           <dbl> <dbl>   <dbl>       <dbl>         <dbl>
 1 01/01/2… Aberd… S120000…       81694.  40.9      NA     NA                 NA
 2 01/02/2… Aberd… S120000…       81679.  40.9      NA     -0.0182            NA
 3 01/03/2… Aberd… S120000…       83525.  41.8      NA      2.26              NA
 4 01/04/2… Aberd… S120000…       84334.  42.2      NA      0.968             NA
 5 01/05/2… Aberd… S120000…       86380.  43.2      NA      2.43              NA
 6 01/06/2… Aberd… S120000…       89268.  44.7      NA      3.34              NA
 7 01/07/2… Aberd… S120000…       93048.  46.5      NA      4.23              NA
 8 01/08/2… Aberd… S120000…       97791.  48.9      NA      5.10              NA
 9 01/09/2… Aberd… S120000…       98664.  49.4      NA      0.893             NA
10 01/10/2… Aberd… S120000…       99084.  49.6      NA      0.426             NA
# ℹ 136,925 more rows
# ℹ 46 more variables: AveragePriceSA <dbl>, SalesVolume <dbl>,
#   DetachedPrice <dbl>, DetachedIndex <dbl>, `Detached1m%Change` <dbl>,
#   `Detached12m%Change` <dbl>, SemiDetachedPrice <dbl>,
#   SemiDetachedIndex <dbl>, `SemiDetached1m%Change` <dbl>,
#   `SemiDetached12m%Change` <dbl>, TerracedPrice <dbl>, TerracedIndex <dbl>,
#   `Terraced1m%Change` <dbl>, `Terraced12m%Change` <dbl>, FlatPrice <dbl>, …

This is OK. But wait! What’s with all these additional columns I don’t need?

This would help me realise that I need to connect this with the previous step. By running rename on the original uk_hpi data frame, I have not made use at all of the previous select() line.

Putting the two things together, I could arrive at the following:

rename(select(uk_hpi, Date, RegionName, `12m%Change`), region = RegionName, date = Date, yearly_change = `12m%Change`)
# A tibble: 136,935 × 3
   date       region        yearly_change
   <chr>      <chr>                 <dbl>
 1 01/01/2004 Aberdeenshire            NA
 2 01/02/2004 Aberdeenshire            NA
 3 01/03/2004 Aberdeenshire            NA
 4 01/04/2004 Aberdeenshire            NA
 5 01/05/2004 Aberdeenshire            NA
 6 01/06/2004 Aberdeenshire            NA
 7 01/07/2004 Aberdeenshire            NA
 8 01/08/2004 Aberdeenshire            NA
 9 01/09/2004 Aberdeenshire            NA
10 01/10/2004 Aberdeenshire            NA
# ℹ 136,925 more rows

While the above works, it is not very pleasant to read. Alternatively, the following would also work:

temporary_df <- select(uk_hpi, Date, RegionName, `12m%Change`)
rename(temporary_df, region = RegionName, date = Date, yearly_change = `12m%Change`)
# A tibble: 136,935 × 3
   date       region        yearly_change
   <chr>      <chr>                 <dbl>
 1 01/01/2004 Aberdeenshire            NA
 2 01/02/2004 Aberdeenshire            NA
 3 01/03/2004 Aberdeenshire            NA
 4 01/04/2004 Aberdeenshire            NA
 5 01/05/2004 Aberdeenshire            NA
 6 01/06/2004 Aberdeenshire            NA
 7 01/07/2004 Aberdeenshire            NA
 8 01/08/2004 Aberdeenshire            NA
 9 01/09/2004 Aberdeenshire            NA
10 01/10/2004 Aberdeenshire            NA
# ℹ 136,925 more rows

But why create a temporary data frame if I don’t need to?

Here is where I start to see the need for a pipe operator, %>%. What I want is to take the output of the select() line and feed it to the rename() line. This is precisely what the pipe is for. I can rewrite the above as:

# It is good to have one pipe per line
uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`)
# A tibble: 136,935 × 3
   date       region        yearly_change
   <chr>      <chr>                 <dbl>
 1 01/01/2004 Aberdeenshire            NA
 2 01/02/2004 Aberdeenshire            NA
 3 01/03/2004 Aberdeenshire            NA
 4 01/04/2004 Aberdeenshire            NA
 5 01/05/2004 Aberdeenshire            NA
 6 01/06/2004 Aberdeenshire            NA
 7 01/07/2004 Aberdeenshire            NA
 8 01/08/2004 Aberdeenshire            NA
 9 01/09/2004 Aberdeenshire            NA
10 01/10/2004 Aberdeenshire            NA
# ℹ 136,925 more rows

Note that I am not saving the output of this whole pipe chain anywhere. That is because I am still in prototype mode. I am still building my code step-by-step, and I don’t want to save anything until I am sure it works.

Let’s go to the next step:

3.3 The filter() line

Remember: the filter() function restricts the rows of a data frame to the subset of observations that satisfy a particular condition. By browsing the documentation and looking at the examples, I realised I had to combine several conditions.

I need to keep all rows for which the region contains the word “United Kingdom” OR “England” OR “Scotland” OR “Wales” OR “Northern Ireland”.

Some programming languages use the keyword or for this, but R uses the | symbol.

💡 Read more about logical operators in R here.

💡 Read more about relational operators in R here.
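To see how these operators behave outside of filter(), you can try them on a simple vector in the console:

```r
region <- c("England", "Essex", "Scotland")

# Each relational test returns a logical vector, element by element...
region == "England"
#> [1]  TRUE FALSE FALSE

# ...and | combines two logical vectors into one
region == "England" | region == "Scotland"
#> [1]  TRUE FALSE  TRUE
```

Inside filter(), the rows kept are exactly those where the combined logical vector is TRUE.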

After battling with the syntax, I arrived at:

uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region == "United Kingdom" | region == "England" | region == "Scotland" | region == "Wales" | region == "Northern Ireland")
# A tibble: 3,315 × 3
   date       region  yearly_change
   <chr>      <chr>           <dbl>
 1 01/04/1968 England            NA
 2 01/05/1968 England            NA
 3 01/06/1968 England            NA
 4 01/07/1968 England            NA
 5 01/08/1968 England            NA
 6 01/09/1968 England            NA
 7 01/10/1968 England            NA
 8 01/11/1968 England            NA
 9 01/12/1968 England            NA
10 01/01/1969 England            NA
# ℹ 3,305 more rows

This one works! If I want to be extra concise, I could use the %in% operator instead (you would probably only discover it with practice):

uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region %in% c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland"))
# A tibble: 3,315 × 3
   date       region  yearly_change
   <chr>      <chr>           <dbl>
 1 01/04/1968 England            NA
 2 01/05/1968 England            NA
 3 01/06/1968 England            NA
 4 01/07/1968 England            NA
 5 01/08/1968 England            NA
 6 01/09/1968 England            NA
 7 01/10/1968 England            NA
 8 01/11/1968 England            NA
 9 01/12/1968 England            NA
10 01/01/1969 England            NA
# ℹ 3,305 more rows

Nice!

3.4 The drop_na() line

You probably noticed the number of NA (not available) values we got in the previous step. We need to get rid of them. The tidyr::drop_na() function does exactly that.

You would probably get stuck here until you discovered that drop_na() exists, and that it is not part of dplyr but rather of tidyr. (This is why I loaded tidyr at the start of the file.) How would you arrive at this discovery? This one requires some googling skills! You could search for something like “how to remove NA values using tidyverse” and you would probably find the answer. Searching online is perhaps the most useful skill professional programmers have; try practising it too!
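If you only knew dplyr, you could achieve the same thing with filter() and is.na(); drop_na() is simply a more direct way of expressing it. A sketch, assuming df holds the output of the previous step:

```r
# These two lines are equivalent ways of dropping rows where
# yearly_change is missing:
df %>% drop_na(yearly_change)          # tidyr
df %>% filter(!is.na(yearly_change))   # dplyr + base R
```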

Eventually, you would arrive at:

uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region %in% c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland")) %>%
    drop_na(yearly_change)
# A tibble: 3,243 × 3
   date       region  yearly_change
   <chr>      <chr>           <dbl>
 1 01/04/1969 England          6.25
 2 01/05/1969 England          6.25
 3 01/06/1969 England          6.25
 4 01/07/1969 England          4.40
 5 01/08/1969 England          4.40
 6 01/09/1969 England          4.40
 7 01/10/1969 England          5.52
 8 01/11/1969 England          5.52
 9 01/12/1969 England          5.52
10 01/01/1970 England          4.35
# ℹ 3,233 more rows

3.5 Dealing with dates

The last step is to convert the date column to a proper date format. This is where lubridate comes in. I would google something like “how to convert a string to a date in tidyverse” and I would probably find the answer.

Without going too much into the details, I would arrive at:

uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region %in% c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland")) %>%
    drop_na(yearly_change) %>% 
    mutate(date = dmy(date))
# A tibble: 3,243 × 3
   date       region  yearly_change
   <date>     <chr>           <dbl>
 1 1969-04-01 England          6.25
 2 1969-05-01 England          6.25
 3 1969-06-01 England          6.25
 4 1969-07-01 England          4.40
 5 1969-08-01 England          4.40
 6 1969-09-01 England          4.40
 7 1969-10-01 England          5.52
 8 1969-11-01 England          5.52
 9 1969-12-01 England          5.52
10 1970-01-01 England          4.35
# ℹ 3,233 more rows

I can use the dmy() function directly because I had already loaded lubridate at the start of the file. Otherwise, since I’m only using it once, I could have written mutate(date = lubridate::dmy(date)) to the same effect.

The final step would be to sort the data frame by date, in descending order:

uk_hpi %>%
    select(Date, RegionName, `12m%Change`) %>%
    rename(region = RegionName, date = Date, yearly_change = `12m%Change`) %>%
    filter(region %in% c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland")) %>%
    drop_na(yearly_change) %>% 
    mutate(date = dmy(date)) %>% 
    arrange(desc(date))
# A tibble: 3,243 × 3
   date       region           yearly_change
   <date>     <chr>                    <dbl>
 1 2023-06-01 England                    1.9
 2 2023-06-01 Northern Ireland           2.7
 3 2023-06-01 Scotland                   0  
 4 2023-06-01 United Kingdom             1.7
 5 2023-06-01 Wales                      0.6
 6 2023-05-01 England                    1.7
 7 2023-05-01 Northern Ireland           2.7
 8 2023-05-01 Scotland                   1.9
 9 2023-05-01 United Kingdom             1.8
10 2023-05-01 Wales                      1.6
# ℹ 3,233 more rows

💡 Pro-tip: I could only trust that my sorting was correct because I had first used the dmy() function to convert the date column from a string to a date-type column. If I hadn’t converted it, the sorting could have been wrong. This is because strings are sorted character by character, which for dd/mm/yyyy strings means the day is compared first: “02/01/2004” sorts after “01/06/2023”, even though 2 January 2004 is a much earlier date than 1 June 2023.
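You can see the difference for yourself with sort() in the console:

```r
dates_as_strings <- c("01/06/2023", "02/01/2004")

# String sort compares character by character, so the day dominates
sort(dates_as_strings, decreasing = TRUE)
#> [1] "02/01/2004" "01/06/2023"

# After converting to Date, the sort is chronological
sort(lubridate::dmy(dates_as_strings), decreasing = TRUE)
#> [1] "2023-06-01" "2004-01-02"
```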

What’s Next?

Did I miss any details? Send me a message so I can amend and release an updated version of this file/webpage!

I hope this helps you in your learning journey. See you in the next lecture!