💻 Week 03 - Lab Roadmap (90 min)

Linear regression, a tidymodels tutorial

Author

Dr. Jon Cardoso-Silva

Welcome to the third DS202W lab.

This week, you will explore linear regression models, and there’s something new to learn even if you are already familiar with the lm command in R. We will be working with the tidymodels framework, a suite of packages that provides a cohesive interface for modelling and machine learning in R.

🥅 Learning Objectives

By the end of this lab, you will be able to:

Fit a linear regression model using the lm command in R
Fit the same linear regression model using the tidymodels framework
Evaluate the performance of a linear regression model using the yardstick package

📚 Preparation

Click here to read about how to prepare for this lab.

Then use the link below to download the lab materials:

📋 Lab Tasks

No need to wait! Start reading the tasks and tackling the action points below as soon as you arrive in the classroom.

Part 0 - Export your chat logs (~3 min)

As part of the GENIAL project, we ask that you fill out the following form as soon as you come to the lab:

🎯 ACTION POINTS

🔗 CLICK HERE to export your chat log.

Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️

👉 NOTE: You MUST complete the initial form.

If you really don’t want to participate in GENIAL¹, just answer ‘No’ to the Terms & Conditions question - your e-mail address will be deleted from GENIAL’s database the following week.

Part I - the `lm` command (25 min)

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to run the lm function for the UK House Price Index data.
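If you want something to experiment with before or after the demonstration, here is a minimal sketch of a call to lm. It runs on made-up data, not on the UK House Price Index: the data frame toy and the columns x and y are placeholders we invented for illustration.

```r
# Synthetic data standing in for the real table.
# `toy`, `x` and `y` are placeholder names, not the lab's actual columns.
set.seed(202)
toy <- data.frame(x = seq(1, 50))
toy$y <- 2 + 0.5 * toy$x + rnorm(50, sd = 1)

# Fit y = intercept + slope * x by ordinary least squares
model1 <- lm(y ~ x, data = toy)

# Named vector holding "(Intercept)" and "x" (the slope)
coef(model1)

# Full table: estimates, standard errors, R-squared, ...
summary(model1)
```

Because the data were simulated with an intercept of 2 and a slope of 0.5, the estimates should land close to those values.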

🗣️ CLASSROOM DISCUSSION:

What is the value of the slope, and what does it represent?
What is the value of the intercept?
Replace the missing values in the following sentence:

This model suggests that whenever _____(what)______ increases by 1 ____(unit)_____ over a period of _____(period)_____, then _____(what)______ increases by _____(how much)______.
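As a reminder while you discuss, a linear regression with a single predictor has the generic form below. The symbols are generic textbook notation, not the names of the columns used in this lab.

```latex
\[
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]
```

Here \(\beta_0\) is the intercept (the expected value of \(y\) when \(x = 0\)), \(\beta_1\) is the slope (the expected change in \(y\) when \(x\) increases by one unit), and \(\varepsilon_i\) is the error term.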

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to make a plot about this model.
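One possible way to draw such a plot is sketched below, again on made-up data (toy, x and y are placeholders, and your class teacher may well take a different approach). It overlays the fitted line on a scatter plot using ggplot2.

```r
library(ggplot2)

# Synthetic data; `toy`, `x` and `y` are placeholder names.
set.seed(202)
toy <- data.frame(x = seq(1, 50))
toy$y <- 2 + 0.5 * toy$x + rnorm(50, sd = 1)

model1 <- lm(y ~ x, data = toy)

# Scatter plot of the observations plus the fitted regression line
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.6) +
  geom_abline(
    intercept = coef(model1)[["(Intercept)"]],
    slope = coef(model1)[["x"]],
    colour = "red"
  ) +
  labs(title = "Observed data with fitted regression line")

print(p)
```

Points scattered evenly around the line, with no visible pattern, suggest that a straight line is a reasonable description of the data.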

🗣️ DISCUSSION:

Does this model fit the data well? Why or why not?

Part II - the `tidymodels` way (20 min)

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to get the same thing done using the tidymodels framework.
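For reference, a minimal sketch of the usual tidymodels pattern looks like the code below: specify the model, choose an engine, then fit. It runs on made-up data (toy, x and y are placeholders), so treat it as an illustration rather than the exact code shown in class.

```r
library(tidymodels)

# Synthetic data; `toy`, `x` and `y` are placeholder names.
set.seed(202)
toy <- tibble(x = 1:50, y = 2 + 0.5 * x + rnorm(50, sd = 1))

# 1. Specify the model type and the engine that will fit it
lm_spec <- linear_reg() %>%
  set_engine("lm")

# 2. Fit the specification to the data
model2 <- lm_spec %>%
  fit(y ~ x, data = toy)

# 3. Coefficients as a tidy tibble (term, estimate, std.error, ...)
tidy(model2)

# The underlying lm object is stored in model2$fit,
# so its coefficients can be compared with a direct lm() call
all.equal(coef(model2$fit), coef(lm(y ~ x, data = toy)))
```

Separating the specification from the fit is what lets you swap engines or reuse the same specification on a different data set later on.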

Part III - evaluating the model (45 min)

👥 PAIR UP

The questions in this section are more challenging, and the primary skill we want you to practice is to consult the documentation of the packages we are using. Some of the questions are open-ended on purpose.

Unlike other weeks, solutions will be posted at the end of the day on Friday, so you have a chance to practise solving these questions again during the rest of the week. We would be delighted if you paired up with colleagues (during the lab and outside of it!) and exchanged ideas about potential solutions on Slack! You’re also free to use AI help if you wish (just keep your chat logs if you do, and share them with us at the next lab on Monday!).

Take note of the things you don’t understand so you can bring them to the lecture later this week.

Here are some (potentially) useful resources for this section 😉:

The tidymodels documentation
The dplyr documentation

⚠️ You are unlikely to finish all of the questions in this section during the lab. That’s perfectly fine. Several of the concepts here will be new to you, and you will learn about them later in the lecture. Just see how far you can get by searching the documentation of the packages!

🎯 ACTION POINTS:

Separate the data into training and testing sets. The training set must contain data up until Dec 2020. The testing set must contain data from Jan 2021 onwards.
Fit a linear regression model with tidymodels using just the training set. Call it model3. How does this model compare to the one you fitted in Part II?
Calculate the Mean Absolute Error (MAE) for the training set. That is, on average, how far off is the model from the actual data?
Calculate the Mean Absolute Error (MAE) for the testing set. That is, on average, how far off is the model from the actual data?
Now add a second independent variable, monthlyChangeHPI_lag1, which is a lagged version of the monthlyChangeHPI variable you already have. Build a new model, call it model4, and compare it to model3. Which one is better?
GO DEEPER: Now create a df_scotland with the same variables as df, but only for Scotland (no need to separate into training and testing sets).
How well can model3 and model4 predict monthly changes in house prices in Scotland? How do these results compare with the ones you got for England?
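If you get stuck on the mechanics, the sketch below shows one possible shape for a date-based split, a lagged variable and an MAE calculation. Everything in it is an assumption made for illustration: the data are simulated, and the names toy, x, y, y_lag1, fit_a, fit_b and mae_for are hypothetical placeholders, not the lab's variables or the official solution.

```r
library(tidymodels)

# Synthetic monthly series; the values are made up for illustration only.
set.seed(202)
dates <- seq(as.Date("2018-01-01"), as.Date("2022-12-01"), by = "month")
n <- length(dates)

toy <- tibble(date = dates, x = rnorm(n)) %>%
  mutate(
    y = 1 + 0.8 * x + rnorm(n, sd = 0.3),
    y_lag1 = dplyr::lag(y, 1)      # previous month's value of y
  ) %>%
  filter(!is.na(y_lag1))           # the first month has no previous value

# Time-based split: up to Dec 2020 for training, Jan 2021 onwards for testing
train <- toy %>% filter(date <= as.Date("2020-12-31"))
test  <- toy %>% filter(date >= as.Date("2021-01-01"))

# Two candidate models, both fitted on the training set only
fit_a <- linear_reg() %>% set_engine("lm") %>% fit(y ~ x, data = train)
fit_b <- linear_reg() %>% set_engine("lm") %>% fit(y ~ x + y_lag1, data = train)

# augment() adds a .pred column; mae() averages |truth - estimate|
mae_for <- function(model, data) {
  augment(model, new_data = data) %>%
    mae(truth = y, estimate = .pred) %>%
    pull(.estimate)
}

# Side-by-side comparison of training and testing error
tibble(
  model = c("one predictor", "with lagged term"),
  train_mae = c(mae_for(fit_a, train), mae_for(fit_b, train)),
  test_mae  = c(mae_for(fit_a, test),  mae_for(fit_b, test))
)
```

The split is done by date rather than at random because, with time series, the testing set should only contain observations that come after everything the model was trained on.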

Footnotes

¹ We’re gonna cry a little bit, not gonna lie. But no hard feelings. We’ll get over it.