💻 Week 03 - Lab Roadmap (90 min)
Linear regression, a tidymodels tutorial
Welcome to the third DS202A lab.
This week, you will explore linear regression models, and there's something new to learn even if you are already familiar with the `lm` command in R. We will be working with the tidymodels framework, a suite of packages that provides a cohesive interface for modelling and machine learning in R.
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Fit a linear regression model using the `lm` command in R
- Fit the same linear regression model using the `tidymodels` framework
- Evaluate the performance of a linear regression model using the `yardstick` package
Preparation
Click here to read about how to prepare for this lab.
Then use the link below to download the lab materials:
Lab Tasks
No need to wait! Start reading the tasks and tackling the action points below when you come to the classroom.
Part 0: Export your chat logs (~ 3 min)
As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:
🎯 ACTION POINTS
CLICK HERE to export your chat log.
Thanks for being GENIAL! You are now one step closer to earning some prizes!
NOTE: You MUST complete the initial form.
If you really don't want to participate in GENIAL¹, just answer "No" to the Terms & Conditions question - your e-mail address will be deleted from GENIAL's database the following week.
Part I - the `lm` command (25 min)
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to run the `lm` function for the UK House Price Index data.
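If you want something to experiment with before or after the demonstration, here is a minimal sketch of an `lm` call on a made-up data set. The columns `x` and `y` are purely illustrative (this roadmap does not list the columns of the UK House Price Index data), so treat it as a pattern rather than the model your class teacher will fit.

```r
# Synthetic stand-in for the lab data. The names `x` and `y` are assumptions
# made for illustration; they are NOT the columns of the UK HPI data set.
set.seed(42)
toy <- data.frame(x = 1:50)
toy$y <- 2 + 0.5 * toy$x + rnorm(50, sd = 1)

# Fit y = intercept + slope * x by ordinary least squares
model_toy <- lm(y ~ x, data = toy)

# Named vector holding the estimated intercept and slope
coef(model_toy)

# Coefficient table, R-squared and residual standard error
summary(model_toy)
```

The slope reported by `coef()` is the expected change in `y` when `x` increases by one unit, which is the same reading the discussion questions below ask you to apply to the real data.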
🗣️ CLASSROOM DISCUSSION:
What is the value of the slope, and what does it represent?
What is the value of the intercept?
Replace the missing values in the following sentence:
This model suggests that whenever _____(what)_____ increases by 1 _____(unit)_____ over a period of _____(period)_____, then _____(what)_____ increases by _____(how much)_____.
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to make a plot about this model.
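One common way to look at a fitted line against the data is a scatter plot with the regression line drawn on top. The sketch below assumes the `ggplot2` package and uses the same made-up `x` and `y` columns as a stand-in for the real data; the plot shown in class may well look different.

```r
library(ggplot2)

# Synthetic stand-in data; `x` and `y` are illustrative names only
set.seed(42)
toy <- data.frame(x = 1:50)
toy$y <- 2 + 0.5 * toy$x + rnorm(50, sd = 1)

# Points show the observations; geom_smooth(method = "lm") overlays the
# least-squares line for the formula y ~ x (no confidence band)
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Toy data with fitted regression line", x = "x", y = "y")

p
```

Points that sit far from the line, or a visible pattern in how they scatter around it, are the kind of evidence the discussion question below is asking about.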
🗣️ DISCUSSION:
Does this model fit the data well? Why? Why not?
Part II - the `tidymodels` way (20 min)
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to get the same thing done using the `tidymodels` framework.
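For reference, a typical tidymodels pattern is to first declare a model specification and then fit it to data. The sketch below uses `linear_reg()` and `set_engine("lm")` from `parsnip` on made-up data; `x`, `y`, `lm_spec` and `model_tidy` are illustrative names, not the ones used in the lab materials.

```r
library(tidymodels)

# Synthetic stand-in data; `x` and `y` are illustrative names only
set.seed(42)
toy <- tibble(x = 1:50, y = 2 + 0.5 * x + rnorm(50, sd = 1))

# 1. Describe the model: a linear regression estimated with base R's lm()
lm_spec <- linear_reg() %>%
  set_engine("lm")

# 2. Fit the specification to the data
model_tidy <- lm_spec %>%
  fit(y ~ x, data = toy)

# Coefficients as a tibble (term, estimate, std.error, statistic, p.value)
tidy(model_tidy)

# One-row summary of the fit, including r.squared
glance(model_tidy)
```

Because the engine is `lm`, the estimates should match what a direct `lm(y ~ x, data = toy)` call returns; what changes is the interface, which stays the same when you later swap in other model types.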
Part III - evaluating the model (45 min)
👥 PAIR UP
The questions in this section are more challenging, and the primary skill we want you to practice is to consult the documentation of the packages we are using. Some of the questions are open-ended on purpose.
Unlike other weeks, solutions will be posted at the end of the day on Friday, so you have a chance to practice solving these questions again during the rest of the week. It would make us absolutely happy if you paired up with other colleagues (during the lab and outside of it!) and exchanged ideas for potential solutions with each other on Slack! You're also free to use AI help if you so wish (just keep your chat logs if you do that and share them with us at the next lab on Monday!).
Take note of the things you don't understand so you can bring them to the lecture later this week.
Here are some (potentially) useful resources for this section:
⚠️ You are unlikely to finish all of the questions in this section during this lab. That's perfectly fine. Several of the concepts here will be new to you, and you will learn about them later in the lecture. Just see how far you can get by searching the documentation of the packages!
🎯 ACTION POINTS:
- Separate the data into training and testing sets. The training set must contain data up until Dec 2020. The testing set must contain data from Jan 2021 onwards.
- Fit a linear regression model with `tidymodels` using just the training set. Call it `model3`. How does this model compare to the one you fitted in Part II?
- Calculate the Mean Absolute Error (MAE) for the training set. That is, on average, how far off is the model from the actual data?
- Calculate the Mean Absolute Error (MAE) for the testing set. That is, on average, how far off is the model from the actual data?
- Now, add a second independent variable, call it `monthlyChangeHPI_lag1`, which is essentially a lagged variable of the `monthlyChangeHPI` variable you already have. Build a new model, call it `model4`, and compare it to `model3`. Which one is better?
- GO DEEPER: Now create a `df_scotland` with the same variables as `df`, but only for Scotland (no need to separate into training and testing sets).
- How well can `model3` and `model4` predict monthly changes in house prices in Scotland? How do the results compare to those you got for England?
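If you get stuck on the mechanics, the sketch below shows the general shape of three of the steps above on a made-up monthly series: creating a lagged column with `dplyr::lag()`, splitting by date with `filter()`, and computing MAE with `yardstick::mae()`. It is not a solution to the action points. The names `toy`, `date`, `target`, `target_lag1` and `fit_toy` are assumptions for illustration, and the toy model uses the lag as its only predictor, which is not how `model3` or `model4` are defined.

```r
library(tidymodels)  # attaches dplyr, parsnip, yardstick and friends

# Synthetic monthly series standing in for the real data.
# All column names here are hypothetical.
set.seed(42)
dates <- seq(as.Date("2015-01-01"), as.Date("2022-12-01"), by = "month")

toy <- tibble(
  date   = dates,
  target = as.numeric(arima.sim(model = list(ar = 0.6), n = length(dates)))
) %>%
  mutate(target_lag1 = dplyr::lag(target)) %>%  # previous month's value
  filter(!is.na(target_lag1))                   # the first month has no lag

# Time-based split: everything up to Dec 2020 is training data,
# everything from Jan 2021 onwards is testing data
train <- toy %>% filter(date <= as.Date("2020-12-31"))
test  <- toy %>% filter(date >= as.Date("2021-01-01"))

# Fit on the training set only
fit_toy <- linear_reg() %>%
  set_engine("lm") %>%
  fit(target ~ target_lag1, data = train)

# augment() appends a .pred column; mae() then averages |truth - estimate|
mae_train <- augment(fit_toy, new_data = train) %>%
  mae(truth = target, estimate = .pred)
mae_test <- augment(fit_toy, new_data = test) %>%
  mae(truth = target, estimate = .pred)

# Side-by-side comparison of the two errors
bind_rows(list(train = mae_train, test = mae_test), .id = "split")
```

A testing MAE that is much larger than the training MAE suggests the model does not generalise well to months it has not seen. The same `augment()` plus `mae()` pattern accepts any data frame that has the columns the model was fitted on, which is useful for the Scotland question.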
Footnotes
We're gonna cry a little bit, not gonna lie. But no hard feelings. We'll get over it.