💻 Week 03 - Lab Roadmap (90 min)

Linear regression, a tidymodels tutorial

Author

Dr. Jon Cardoso-Silva

Welcome to our second lab.

This week, you will explore linear regression models, and there’s something new to learn even if you are already familiar with the lm command in R. We will be working with the tidymodels framework, a suite of packages that provides a cohesive interface for modelling and machine learning in R.

This lab is part of the GENIAL project.

If you have never used ChatGPT, you must create an account. Go to chat.openai.com and sign up with your email address (it doesn’t have to be your LSE email address).

When you reach Part 3 of this lab, read the specific instructions for GENIAL participants.

🥅 Learning Objectives

By the end of this lab, you will be able to:

  • Fit a linear regression model using the lm command in R
  • Fit the same linear regression model using the tidymodels framework
  • Evaluate the performance of a linear regression model using the yardstick package

📚 Preparation

Click here to read about how to prepare for this lab.

Then use the link below to download the lab materials:

📋 Lab Tasks

Here are the instructions for this lab:

Part I - the lm command (25 min)

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will explain, in simple terms, what the goal of linear regression is. Then, they will demonstrate how to run the lm function.

🗣️ DISCUSSION:

  • What is the value of the intercept?
  • What is the value of the slope?
  • How should you interpret this model?
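As a rough illustration of what reading the intercept and slope from an lm fit looks like, here is a minimal sketch on made-up data. The data frame toy, its columns x and y, and the name model1 are assumptions for illustration only; they are not the lab's dataset or variable names.

```r
# Toy data standing in for the lab's dataset (hypothetical values and names)
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20, sd = 0.5)

# Fit y as a linear function of x
model1 <- lm(y ~ x, data = toy)

# Named vector: "(Intercept)" is the intercept, "x" is the slope
coef(model1)

# Fuller report: standard errors, p-values, R-squared
summary(model1)
```

Because the toy data were generated with an intercept of 3 and a slope of 2, the estimates should land close to those values. The slope is read as the expected change in y for a one-unit increase in x.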

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to plot this model.

🗣️ DISCUSSION:

Does this model fit the data well? Why or why not?

Part II - the tidymodels way (20 min)

👨🏻‍🏫 TEACHING MOMENT: Your class teacher will demonstrate how to fit the same model using the tidymodels framework.
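For orientation, a tidymodels fit typically separates the model specification from the fitting step. The sketch below uses made-up data; toy, lm_spec and model2 are hypothetical names, not the ones used in the lab materials. It assumes R 4.1 or later for the native pipe.

```r
library(tidymodels)

# Toy data standing in for the lab's dataset (hypothetical values and names)
set.seed(42)
toy <- tibble(x = 1:20, y = 3 + 2 * x + rnorm(20, sd = 0.5))

# 1. Specify the model: a linear regression computed by base R's lm()
lm_spec <- linear_reg() |> set_engine("lm")

# 2. Fit the specification to the data
model2 <- lm_spec |> fit(y ~ x, data = toy)

# Coefficients as a tibble (term, estimate, std.error, ...)
tidy(model2)
```

Since the engine is lm, the underlying fitted object (available as model2$fit) should carry the same coefficients as a direct call to lm on the same data.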

Solutions to Parts I & II will be posted to Moodle on Tuesday afternoon.

Part III - evaluating the model (45 min)

The questions in this section are more challenging, and the primary skill we want you to practise is consulting the documentation of the packages we are using. Some of the questions are open-ended on purpose.

Solutions will be posted at the end of the day on Friday, so you have a chance to practise solving these questions again during the rest of the week. We would be delighted if you paired up with colleagues and exchanged ideas for potential solutions on Slack!

Take note of the things you don’t understand so you can bring them to the lecture later this week.

If you are participating in the GENIAL project, you are asked to:

  • Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.
  • Have only the following tabs open in your browser:
    1. These lab instructions
    2. The ChatGPT website (open a new chat window and name it ‘DS202A - Week 03’)
    3. The tidymodels documentation page (use the search bar)
    4. The dplyr documentation page (use the search bar)
  • Take note of how useful (or not) ChatGPT was in helping you answer the questions in this section.
  • Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)

If you are not participating in the GENIAL project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

  1. These lab instructions
  2. The tidymodels documentation page (use the search bar)
  3. The dplyr documentation page (use the search bar)

You are unlikely to finish all of these questions in this lab. Several of the concepts here will be new to you, and you will learn about them later in the lecture. Just see how far you can get by searching the documentation of the packages!

🎯 ACTION POINTS:

  1. Separate the data into training and testing sets. The training set must contain data up to and including December 2020. The testing set must contain data from January 2021 onwards.

  2. Fit a linear regression model with tidymodels using just the training set. Call it model3. How does this model compare to the one you fitted in Part II?

  3. Calculate the Mean Absolute Error (MAE) for the training set. That is, on average, how far off is the model from the actual data?

  4. Calculate the Mean Absolute Error (MAE) for the testing set. That is, on average, how far off is the model from the actual data?

  5. Now create a df_scotland with the same variables as df, but only for Scotland (no need to separate into training and testing sets).

  6. How well can model3 predict monthly changes in house prices in Scotland?

  7. (GENIAL) Fill out this brief survey at the end of the lab: 🔗 link (requires LSE login)
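To make the two general ideas behind these action points concrete (a date-based split and the MAE, which is the mean of |actual - predicted|), here is a minimal sketch on made-up data. It is not the lab solution: the tibble toy, its columns date, x and y, and all object names are assumptions, and the lab's df will have different variables.

```r
library(tidymodels)

# Hypothetical monthly series, Jan 2020 to Dec 2021 (placeholder names)
set.seed(1)
toy <- tibble(
  date = seq(as.Date("2020-01-01"), by = "month", length.out = 24),
  x    = 1:24,
  y    = 3 + 2 * x + rnorm(24, sd = 0.5)
)

# Split by date rather than at random, so the test set lies in the "future"
cutoff    <- as.Date("2021-01-01")
toy_train <- toy |> filter(date < cutoff)
toy_test  <- toy |> filter(date >= cutoff)

# Fit on the training rows only
toy_fit <- linear_reg() |> set_engine("lm") |> fit(y ~ x, data = toy_train)

# augment() adds a .pred column; mae() averages |y - .pred|
test_mae <- toy_fit |>
  augment(new_data = toy_test) |>
  mae(truth = y, estimate = .pred)

test_mae
```

The same augment-then-mae pattern can be pointed at the training rows, or at any other data frame that has the columns the model expects, which is one way to compare in-sample and out-of-sample error.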

Again, solutions for Part III will be posted only much later, on Friday, after the lecture. Just see how far you can get and try to understand the documentation of the packages.