💻 Week 03 - Lab Roadmap (90 min)
Linear regression, a tidymodels tutorial
Welcome to our second lab.
This week, you will explore linear regression models, and there's something new to learn even if you are already familiar with the `lm` command in R. We will be working with the `tidymodels` framework, a suite of packages that provides a cohesive interface for modelling and machine learning in R.
If you have never used ChatGPT, you must create an account. Go to chat.openai.com and sign up with your email address (it doesn't have to be your LSE email address).
When you reach Part III of this lab, read the specific instructions for GENIAL participants.
🥅 Learning Objectives
By the end of this lab, you will be able to:
- Fit a linear regression model using the `lm` command in R
- Fit the same linear regression model using the `tidymodels` framework
- Evaluate the performance of a linear regression model using the `yardstick` package
Preparation
Click here to read about how to prepare for this lab.
Then use the link below to download the lab materials:
Lab Tasks
Here are the instructions for this lab:
Part I - the `lm` command (25 min)
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will explain, in simple terms, the goal of a linear regression. Then, they will demonstrate how to run the `lm` function.
🗣️ DISCUSSION:
- What is the value of the intercept?
- What is the value of the slope?
- How should you interpret this model?
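To make these questions concrete, here is a minimal sketch on R's built-in `cars` dataset, not the lab data. The object name `model_toy` is made up for illustration, and your class teacher's demonstration may look different.

```r
# Toy example only: `cars` ships with base R and has two columns,
# `speed` (mph) and `dist` (stopping distance in feet).
model_toy <- lm(dist ~ speed, data = cars)

# Named vector holding the intercept and the slope on `speed`
coef(model_toy)

# Fuller report: coefficient table, residual summary, R-squared
summary(model_toy)
```

In the `coef()` output, `(Intercept)` is the predicted `dist` when `speed` is zero, and the `speed` entry is the change in predicted `dist` for a one-unit increase in `speed`.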
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to plot this model.
🗣️ DISCUSSION:
Does this model fit the data well? Why or why not?
Part II - the `tidymodels` way (20 min)
👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to get the same thing done using the `tidymodels` framework.
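As a rough sketch of the `tidymodels` style, a toy regression on R's built-in `cars` dataset (not the lab data) could be written as below. The names `lm_spec` and `fit_toy` are illustrative assumptions, not the names used in the lab materials.

```r
library(tidymodels)

# 1. Specify the kind of model and the engine that will fit it
lm_spec <- linear_reg() %>%
  set_engine("lm")

# 2. Fit the specification to data using a formula
fit_toy <- lm_spec %>%
  fit(dist ~ speed, data = cars)

# 3. Get the coefficients back as a tibble (one row per term)
tidy(fit_toy)
```

Separating the model specification from the fitting step is what lets you later swap the engine or the data without rewriting the rest of the code.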
Solutions to Parts I & II will be posted to Moodle on Tuesday afternoon.
Part III - evaluating the model (45 min)
The questions in this section are more challenging, and the primary skill we want you to practise is to consult the documentation of the packages we are using. Some of the questions are open-ended on purpose.
Solutions will be posted at the end of the day on Friday, so you have a chance to practise solving these questions again during the rest of the week. We would be delighted if you paired up with colleagues and exchanged ideas for potential solutions on Slack!
Take note of the things you don't understand so you can bring them to the lecture later this week.
If you are participating in the GENIAL project, you are asked to:
- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
- Have only the following tabs open in your browser:
  - These lab instructions
  - The ChatGPT website (open a new chat window and name it “DS202A - Week 03”)
  - The `tidymodels` documentation page (use the search bar)
  - The `dplyr` documentation page (use the search bar)
- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.
- Fill out this brief survey at the end of the lab: link (requires LSE login)
If you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.
We suggest you have these tabs open in your browser:
- These lab instructions
- The `tidymodels` documentation page (use the search bar)
- The `dplyr` documentation page (use the search bar)
You are unlikely to finish all of these questions in this lab. Several of the concepts here will be new to you, and you will learn about them later in the lecture. Just see how far you can get by searching the documentation of the packages!
🎯 ACTION POINTS:
1. Separate the data into training and testing sets. The training set must contain data up until Dec 2020. The testing set must contain data from Jan 2021 onwards.
2. Fit a linear regression model with `tidymodels` using just the training set. Call it `model3`. How does this model compare to the one you fitted in Part II?
3. Calculate the Mean Absolute Error (MAE) for the training set. That is, on average, how far off is the model from the actual data?
4. Calculate the Mean Absolute Error (MAE) for the testing set. That is, on average, how far off is the model from the actual data?
5. Now create a `df_scotland` with the same variables as `df`, but only for Scotland (no need to separate into training and testing sets).
6. How well can `model3` predict monthly changes in house prices in Scotland?
7. (GENIAL) Fill out this brief survey at the end of the lab: link (requires LSE login)
Again, solutions for Part III will be posted only much later, on Friday, after the lecture. Just see how far you can get and try to understand the documentation of the packages.
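If you are unsure where to start, the sketch below shows the general shape of a date-based split followed by an MAE calculation. It uses simulated data, so it is not a solution to the action points. The data frame `toy` and its columns `date`, `x` and `y` are hypothetical and do not match the lab dataset.

```r
library(tidymodels)

set.seed(202)

# Hypothetical monthly data: 48 months starting in Jan 2018
toy <- tibble(
  date = seq(as.Date("2018-01-01"), by = "month", length.out = 48),
  x    = rnorm(48),
  y    = 2 + 0.5 * x + rnorm(48, sd = 0.3)
)

# Split by date rather than at random
toy_train <- toy %>% filter(date <= as.Date("2020-12-31"))
toy_test  <- toy %>% filter(date >= as.Date("2021-01-01"))

# Fit on the training rows only
toy_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = toy_train)

# augment() adds a .pred column; mae() then averages |y - .pred|
toy_mae <- toy_fit %>%
  augment(new_data = toy_test) %>%
  mae(truth = y, estimate = .pred)

toy_mae
```

MAE is the mean of the absolute differences between the observed and predicted values, so it is expressed in the same units as the outcome variable.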