✅ Week 03 - Lab Solutions
Linear regression as a machine learning algorithm
This solution file follows the format of the Jupyter Notebook (.ipynb) file you had to fill in during the lab session.
Downloading the student solutions
Click on the button below to download the student notebook.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, root_mean_squared_error
import statsmodels.api as sm
from lets_plot import *
LetsPlot.setup_html()
Before we do anything more
Please create a folder called data to store all the different data sets used in this course.
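If you prefer to create the folder from Python rather than your file explorer, here is a minimal sketch using only the standard library (the folder name data matches the lab convention):

```python
from pathlib import Path

# Create the data/ folder next to this notebook if it does not already exist;
# exist_ok=True makes the call safe to re-run
Path("data").mkdir(exist_ok=True)
```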
Student Performance Dataset
students = pd.read_csv("data/student-data.csv")
students = students.dropna()
print(students.shape)
(637, 31)
We start our machine learning journey with a student performance dataset (students), which contains information on students from two Portuguese schools. We have cleaned the data to only include statistically significant predictors. The columns include:
- final_grade: final grade from 0 to 20 (the outcome)
- school: student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira, reference: GP)
- sex: student’s sex (binary: ‘F’ - female or ‘M’ - male, reference: female)
- age: student’s age (numeric: from 15 to 22 years)
- studytime: weekly study time (categorical: ‘<2hrs’, ‘2-5hrs’, ‘5-10hrs’, ‘>10hrs’, reference: <2hrs)
- failures: number of past class failures (categorical: ‘0’, ‘1’, ‘2’, ‘3’, ‘4+’, reference: 0 failures)
- schoolsup: extra educational support (binary: yes or no, reference: no)
- higher: wants to take higher education (binary: yes or no, reference: no)
- goout: going out with friends frequency (categorical: ‘VeryLow’, ‘Low’, ‘Average’, ‘High’, ‘VeryHigh’, reference: VeryLow)
- dalc: workday alcohol consumption (categorical: ‘VeryLow’, ‘Low’, ‘Average’, ‘High’, ‘VeryHigh’, reference: VeryLow)
- health: current health status (categorical: ‘VeryBad’, ‘Bad’, ‘Average’, ‘Good’, ‘VeryGood’, reference: VeryBad)
- romantic: in a romantic relationship (binary: yes or no, reference: no)
Understanding student performance: some exploratory data analysis (EDA) (10 minutes)
Now it’s your turn to explore the students dataset! Use this time to create visualizations and discover patterns in the data that might help explain what drives student success.
Some ideas to get you started:
- How are final grades distributed? Are they normally distributed or skewed?
- Do students who want higher education perform better than those who don’t?
- Is there a relationship between study time and final grades?
- How does past academic failure affect current performance?
- Are there differences in performance between the two schools?
- Does going out with friends impact academic performance?
- What’s the relationship between health status and grades?
- Do students receiving extra educational support perform differently?
- How does alcohol consumption relate to academic performance?
- Are there gender differences in academic achievement?
Challenge yourself:
- Can you find any surprising relationships in the data?
- What patterns emerge when you look at combinations of variables?
- Are there any outliers or interesting edge cases?
Share your most interesting findings on Slack! We’d love to see what patterns you discover and which visualizations tell the most compelling stories about student performance.
Distribution of final grades
ggplot(students) + \
geom_histogram(aes(x="final_grade"), bins=20, fill="#4C72B0", alpha=0.8) + \
labs(
title="Distribution of Final Grades",
x="Final Grade",
y="Count"
)
Most grades are higher than 10!
Grades by desire for higher education
ggplot(students) + \
geom_boxplot(aes(x="higher", y="final_grade", fill="higher")) + \
labs(
title="Final Grades by Desire for Higher Education",
x="Wants Higher Education",
y="Final Grade"
)
Students who want higher education tend to have higher grades.
Study time and academic performance
ggplot(students) + \
geom_boxplot(aes(x="studytime", y="final_grade", fill="studytime")) + \
labs(
title="Final Grades by Weekly Study Time",
x="Study Time",
y="Final Grade"
)
What we see:
- Students studying less than 2 hours per week have the lowest median final grades.
- Median grades increase for 2–5 hours and 5–10 hours, suggesting a positive association between study time and performance.
- The >10 hours group does not show a dramatic further increase relative to 5–10 hours.
Key insight:
More study time is associated with better performance, but the relationship appears non-linear, with possible diminishing returns beyond moderate study levels.
⚠️Warning:
Correlation does not imply causation: students struggling academically may study more in response.
Past failures and performance
ggplot(students) + \
geom_boxplot(aes(x="failures", y="final_grade", fill="failures")) + \
labs(
title="Final Grades by Number of Past Failures",
x="Past Failures",
y="Final Grade"
)
What we see:
- Students with zero past failures clearly outperform all other groups.
- Each additional failure is associated with a lower median final grade.
- The spread narrows as failures increase, suggesting persistently lower outcomes.
Key insight:
Past failure is a strong negative predictor of current academic performance.
This variable often dominates linear models and is a good example of a feature that captures structural disadvantage rather than short-term behaviour.
Alcohol consumption and grades
ggplot(students) + \
geom_boxplot(aes(x="dalc", y="final_grade", fill="dalc")) + \
labs(
title="Final Grades by Workday Alcohol Consumption",
x="Alcohol Consumption",
y="Final Grade"
)
What we see:
- Students with VeryLow or Low alcohol consumption tend to have higher median grades.
- Higher consumption categories show lower medians and greater variability.
- There is overlap between groups — the effect is not deterministic.
Key insight:
Higher workday alcohol consumption is associated with lower academic performance, but the relationship is noisy and far from absolute.
⚠️Important caution: Alcohol consumption may act as a proxy for other factors (routine, sleep, social environment), which the model cannot fully disentangle.
Understanding student performance: the hypothesis-testing approach (5 minutes)
Why do some students perform better than others? This is one question that a quantitative social scientist might answer by exploring the magnitude and precision of a series of variables. Suppose we hypothesised that students who want to pursue higher education have better academic performance. We can estimate a linear regression model by using final_grade as the dependent variable and higher as the independent variable.
To estimate a linear regression with interpretable output in Python, we can use the sm.OLS function, which requires two things:
- One or more features plus a constant
- An outcome
Let’s do this now. We can print the summary method to get information on the coefficient estimate for higher.
# Convert higher to float (assuming it's binary: yes=1, no=0)
X = students["higher"].map({'yes': 1, 'no': 0}).astype(float)
# Add a constant
X = sm.add_constant(X)
# Isolate the outcome column
y = students["final_grade"]
# Build an OLS and print its output
model_univ = sm.OLS(y, X).fit()
print(model_univ.summary())
OLS Regression Results
==============================================================================
Dep. Variable: final_grade R-squared: 0.112
Model: OLS Adj. R-squared: 0.111
Method: Least Squares F-statistic: 80.07
Date: Mon, 09 Feb 2026 Prob (F-statistic): 3.94e-18
Time: 12:08:33 Log-Likelihood: -1615.6
No. Observations: 637 AIC: 3235.
Df Residuals: 635 BIC: 3244.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8.7647 0.371 23.610 0.000 8.036 9.494
higher 3.5147 0.393 8.948 0.000 2.743 4.286
==============================================================================
Omnibus: 130.457 Durbin-Watson: 1.737
Prob(Omnibus): 0.000 Jarque-Bera (JB): 359.764
Skew: -1.013 Prob(JB): 7.55e-79
Kurtosis: 6.074 Cond. No. 5.96
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We see that students who want to pursue higher education have a positive and statistically significant (p < 0.001) increase in final grades of about 3.5 points.
👉 NOTE: The process of hypothesis testing is obviously more involved when using observational data than is portrayed by this simple example. Control variables will almost always be incorporated and, increasingly, identification strategies will be used to uncover causal effects. The end result, however, will involve as rigorous an attempt at falsifying a hypothesis as can be provided with the data.
For an example of how multivariate regression is used, we can run the following code.
# Create a function that standardises variables
def standardise(var):
    return (var - var.mean()) / var.std()
# Create a data frame of features
X = students.drop(["final_grade"], axis=1)
# Identify which features are numeric, categorical and boolean
X_numeric = X.filter(items=["age"], axis=1)
X_categorical = X.filter(items=["school", "sex", "studytime", "failures", "schoolsup",
"higher", "goout", "dalc", "health", "romantic"], axis=1)
# standardise numeric features
X_numeric = X_numeric.apply(lambda x: standardise(x), axis=0)
# Get dummies from categorical features
X_categorical = pd.get_dummies(X_categorical, drop_first=True, dtype=int)
# Concatenate to final data frame
X = pd.concat([X_numeric, X_categorical], axis=1)
# Add a constant
X = sm.add_constant(X)
# Isolate the outcome column
y = students["final_grade"]
# Build and summarise the model
model_multiv = sm.OLS(y, X).fit()
print(model_multiv.summary())
OLS Regression Results
==============================================================================
Dep. Variable: final_grade R-squared: 0.352
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 15.16
Date: Mon, 09 Feb 2026 Prob (F-statistic): 1.12e-44
Time: 12:10:20 Log-Likelihood: -1515.2
No. Observations: 637 AIC: 3076.
Df Residuals: 614 BIC: 3179.
Df Model: 22
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 11.5166 0.692 16.635 0.000 10.157 12.876
age 0.1970 0.120 1.635 0.102 -0.040 0.434
failures -1.5883 0.196 -8.089 0.000 -1.974 -1.203
school_MS -1.5576 0.232 -6.720 0.000 -2.013 -1.102
sex_M -0.4953 0.239 -2.074 0.039 -0.964 -0.026
studytime_5-10hrs 0.6758 0.323 2.093 0.037 0.042 1.310
studytime_<2hrs -0.3999 0.260 -1.538 0.125 -0.911 0.111
studytime_>10hrs 0.7369 0.491 1.501 0.134 -0.227 1.701
schoolsup_yes -1.3656 0.364 -3.752 0.000 -2.080 -0.651
higher_yes 1.9432 0.381 5.094 0.000 1.194 2.692
goout_High -0.1126 0.302 -0.373 0.709 -0.705 0.480
goout_Low 0.2938 0.297 0.990 0.323 -0.289 0.877
goout_VeryHigh -0.5365 0.328 -1.636 0.102 -1.181 0.108
goout_VeryLow -0.9465 0.437 -2.165 0.031 -1.805 -0.088
dalc_High -2.5590 0.779 -3.287 0.001 -4.088 -1.030
dalc_Low -0.5344 0.488 -1.094 0.274 -1.493 0.424
dalc_VeryHigh -0.3631 0.784 -0.463 0.644 -1.904 1.177
dalc_VeryLow 0.0538 0.454 0.119 0.906 -0.837 0.945
health_Bad 0.6405 0.393 1.630 0.104 -0.131 1.412
health_Good 0.5861 0.356 1.645 0.100 -0.114 1.286
health_VeryBad 0.8574 0.378 2.269 0.024 0.115 1.599
health_VeryGood -0.0464 0.302 -0.154 0.878 -0.639 0.546
romantic_yes -0.3876 0.229 -1.694 0.091 -0.837 0.062
==============================================================================
Omnibus: 76.020 Durbin-Watson: 1.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 177.796
Skew: -0.651 Prob(JB): 2.47e-39
Kurtosis: 5.237 Cond. No. 19.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interestingly, we can see that the coefficient estimate for higher_yes remains positive and highly significant (p < 0.001), suggesting this relationship holds even when controlling for other factors like study time, past failures, and demographic characteristics.
👉 NOTE: p-values are useful to machine learning scientists as they indicate which variables may yield a significant increase in model performance. However, p-hacking, where researchers manipulate data or analyses to find results that support their hypothesis, makes it hard to tell whether a relationship held up after honest attempts at falsification. This can range from reporting only the specific modelling approach that produces statistically significant findings (while failing to report others that do not) to outright manipulation of the data. For a recent egregious case of the latter, we recommend the Data Falsificada series.
Predicting student grades: the machine learning approach (30 minutes)
Machine learning scientists take a different approach. Our aim, in this context, is to build a model that can be used to accurately predict student performance using a mixture of features and, for some models, hyperparameters (which we will address in Lab 5).
Thus, rather than attempting to falsify the effects of causes, we are more concerned about the fit of the model in the aggregate when applied to unforeseen data.
To achieve this, we do the following:
- Split the data into training and test sets
- Build a model using the training set
- Evaluate the model on the test set
Let’s look at each of these in turn.
Split the data into training and test sets
It is worth considering what a training and test set is and why we might split the data this way.
A training set is data that we use to build (or “train”) a model. In the case of multivariate linear regression, we are using the training data to estimate a series of coefficients. Here is a made-up multivariate linear model with three coefficients derived from (non-existent) data to illustrate things.
def sim_model_preds(x1, x2, x3):
    y = 1.1 * x1 + 2.2 * x2 + 3.3 * x3
    return y
A test set is data that the model has not yet seen. We then apply the model to this data set and use an evaluation metric to find out how accurate our predictions are. For example, suppose we had a new observation where x1 = 10, x2 = 20 and x3 = 30 and y = 150. We can use the above model to develop a prediction.
sim_model_preds(10, 20, 30)
154.0
We get a prediction of 154 points!
We can also calculate the amount of error we make by calculating residuals (actual value - predicted value).
150 - sim_model_preds(10, 20, 30)
-4.0
We can see that our model is 4 points off the real answer!
Why do we evaluate our models using different data? Because, as stated earlier, machine learning scientists care about the applicability of a model to unforeseen data. If we were to evaluate the model using the training data, we obviously cannot do this to begin with. Furthermore, we cannot ascertain whether the model we have built can generalise to other data sets or if the model has simply learned the idiosyncrasies of the data it was used to train on. We will discuss the concept of overfitting throughout this course.
We can use train_test_split in sklearn.model_selection to split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
👉 NOTE: Our data are purely cross-sectional, so we can use this approach. However, when working with more complex data structures (e.g. time-series cross-sectional data), different approaches to splitting the data will need to be used.
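As a brief illustration of one such alternative for ordered data, here is a hedged sketch using scikit-learn's TimeSeriesSplit, which keeps every test fold strictly after its training fold (the tiny array below is made up purely for demonstration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten made-up sequential observations standing in for ordered data
X_ts = np.arange(10).reshape(-1, 1)

# Each fold trains only on observations that come before the test fold,
# so the model never "sees the future"
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_ts):
    print(train_idx, test_idx)
```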
Build a model using the training set
We will now switch our focus from statsmodels to scikit-learn, the latter being Python’s most comprehensive and popular machine learning library.
Below, we will see how simple it is to run a model in this library. We have done all the data cleaning needed, so we can just get to it!
# Create a model instance of a linear regression
linear_model = LinearRegression()
# Fit this instance to the training set
linear_model.fit(X_train, y_train)
Evaluate the model using the test set
Now that we have trained a model, we can then evaluate its performance on the test set. We will look at two evaluation metrics:
- R-squared: the proportion of variance in the outcome explained by the model.
- Root mean squared error (RMSE): the typical size of a prediction error, expressed in the units of the original measurement.
# Create predictions for the test set
linear_preds = linear_model.predict(X_test)
# Calculate performance metrics
r2 = r2_score(y_test, linear_preds)
rmse = root_mean_squared_error(y_test, linear_preds)
# Print results
print(np.round(r2, 2), np.round(rmse, 2))
0.27 2.47
🗣️ CLASSROOM DISCUSSION:
How can we interpret these results?
We find that the model explains ~27% of the test set variance in final grades. We also find that our model predictions are off by ~2.5 points on the 0-20 grade scale.
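A useful companion check is to compare training-set and test-set R² for the same model: a large gap suggests overfitting. Here is a minimal self-contained sketch on synthetic data (everything below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)

# Made-up data: 100 observations, 30 features, only the first carries signal
X_sim = rng.normal(size=(100, 30))
y_sim = X_sim[:, 0] + rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, random_state=123)

model = LinearRegression().fit(X_tr, y_tr)

# The gap between the two scores is a quick diagnostic for overfitting:
# the training R-squared is flattered by the 29 noise features
train_r2 = r2_score(y_tr, model.predict(X_tr))
test_r2 = r2_score(y_te, model.predict(X_te))
print(np.round(train_r2, 2), np.round(test_r2, 2))
```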
Graphically exploring where we make errors
We are going to build some residual scatter plots which look at the relationship between the values fitted by the model for each observation and the residuals (actual - predicted values). Before we do this for our data, let’s take a look at an example where there is a near perfect relationship between two variables. As this very rarely exists in the social world, we will rely upon simulated data.
We translated and adapted this code from here.
# Set a seed for reproducibility
np.random.seed(123)
# Create the variance covariance matrix
sigma = [[1, 0.99], [0.99, 1]]
# Create the mean vector
mu = [10, 5]
# Generate a multivariate normal distribution using 1,000 samples
v1, v2 = np.random.multivariate_normal(mu, sigma, 1000).T
# Combine to a data frame
sim_data = pd.DataFrame({"V1": v1, "V2": v2})
Plot the correlation
(
ggplot(sim_data, aes("V1", "V2")) +
geom_point() +
theme_minimal() +
theme(panel_grid_minor = element_blank()) +
labs(x = "Variable 1", y = "Variable 2")
)
Residual plots
# Build a linear model (a fresh instance, so we do not overwrite the model trained on the student data)
sim_model = LinearRegression()
sim_fit = sim_model.fit(v1.reshape(-1, 1), v2.reshape(-1, 1))
sim_preds = sim_model.predict(v1.reshape(-1, 1))
# Combine predictions / residuals to a data frame
sim_residuals_toplot = pd.DataFrame({"predictions": sim_preds.reshape(-1),
"residuals": v2 - sim_preds.reshape(-1)})
# Plot the results
(
ggplot(sim_residuals_toplot, aes("predictions", "residuals")) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_point() +
theme_minimal() +
theme(panel_grid_minor = element_blank()) +
scale_y_continuous(limits=[-7, 7]) +
labs(x = "Fitted values", y = "Residuals")
)
Now let’s run this code for our model.
residuals_toplot = pd.DataFrame({"predictions": linear_preds,
"residuals": np.array(y_test) - linear_preds})
(
ggplot(residuals_toplot, aes("predictions", "residuals")) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_point() +
theme(panel_grid_minor = element_blank()) +
labs(x = "Fitted values", y = "Residuals")
)
🎯 ACTION POINTS Why does the graph of the simulated data illustrate a better-fitting model when compared to our actual data?
The spread of our values in the actual data is relatively large. Furthermore, we see that as our fitted values become larger, we go from underpredicting to overpredicting student performance. This suggests there may be non-linear relationships or missing variables that could improve our model.
Challenge: Running multiple univariate regressions efficiently (30 minutes)
🎯 YOUR CHALLENGE: Can you figure out how to run a univariate model for all features in the student dataset without creating 11 separate model objects?
Remember our univariate model earlier where we looked at final_grade ~ higher? Now we want to do the same thing for ALL features to see which individual variables are the strongest predictors of student performance.
The naive approach would be to copy-paste code 11 times and create 11 different model objects, but that’s inefficient and error-prone. Your job is to find a more elegant solution!
Introduction to using list comprehensions to aid feature selection (30 minutes)
👨🏻🏫 TEACHING MOMENT: Your tutor will take you through the code, so sit back, relax and enjoy!
Remember our univariate model earlier? We are going to do the same for all features to see which ones show the best improvements in predictive power.
We could build 11 different model objects, but this would be very inefficient. Instead, we are going to build a function that calculates the R² for a given feature and apply it to all features using a list comprehension.
Create a list of feature names
feature_names = X_train.columns
Define a function to get r-squared values for each feature
This gets a little tricky! Remember that we have transformed all our categorical features into one-hot encoded dummy variables. We therefore need to make sure that all the dummies belonging to one feature are included. The trick we’ve opted for is to loop over our feature names. In get_r2, we use the logical condition X_train.columns.str.contains(feature), which simply asks “does a given column name contain the string value of a given feature?”
def get_r2(feature):
    """
    Runs a linear model using a subset of features in a training set and calculates
    the r-squared for each model, using a test set. The output is a single value.
    """
    cols = X_train.columns[X_train.columns.str.contains(feature)]
    if len(cols) > 0:
        train = X_train[cols]
        test = X_test[cols]
        linear_model = LinearRegression()
        linear_model.fit(train, y_train)
        preds = linear_model.predict(test)
        r2 = r2_score(y_test, preds)
        return r2
    else:
        return np.nan
We can check that the function works by trying it on a feature or two. Let’s try this for higher education aspiration.
get_r2("higher")
0.07466305856424227
Use a list comprehension to loop get_r2 over all features
r2s = [get_r2(feat) for feat in feature_names]
Create a data frame showing the r-squared for each feature
linear_output = pd.DataFrame({"feature": feature_names, "r2": r2s})
linear_output = linear_output.sort_values(by="r2")
Plot the results!
(
ggplot(linear_output, aes("r2", "feature")) +
geom_bar(stat="identity") +
theme_minimal() +
theme(panel_grid_minor = element_blank(),
panel_grid_major_y = element_blank()) +
labs(x = "R-squared value", y = "")
)
Using penalised linear regression to perform feature selection (20 minutes)
We are now going to experiment with a lasso regression, which, in this case, is a linear regression that uses a so-called hyperparameter: a “dial” built into a given model that can be adjusted to improve model performance. The hyperparameter here is a regularisation penalty, which takes a non-negative value. This penalty can shrink the magnitude of coefficients all the way to zero; the larger the penalty, the more shrinkage occurs.
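To see the shrinkage mechanic in isolation before applying it to the student data, here is a minimal sketch on synthetic data (entirely made up) where only the first of five features carries signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Made-up data: five features, but only the first truly drives the outcome
X_sim = rng.normal(size=(200, 5))
y_sim = 3 * X_sim[:, 0] + rng.normal(size=200)

# As the penalty (alpha) grows, more coefficients are shrunk exactly to zero,
# and even the genuine coefficient is pulled towards zero
for alpha in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=alpha).fit(X_sim, y_sim).coef_
    print(alpha, np.round(coefs, 2))
```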
Step 1: Create a lasso model
Run the following code. This builds a lasso model with the penalty parameter set to 0.01.
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)
Step 2: Extract lasso coefficients
# Create a data frame with feature columns and Lasso coefficients
lasso_output = pd.DataFrame({"feature": X_train.columns, "coefficient": lasso_model.coef_})
# Code a positive / negative vector
lasso_output["positive"] = np.where(lasso_output["coefficient"] >= 0, True, False)
# Take the absolute value of the coefficients
lasso_output["coefficient"] = np.abs(lasso_output["coefficient"])
# Remove the constant and sort the data frame by (absolute) coefficient magnitude
lasso_output = lasso_output.query("feature != 'const'").sort_values("coefficient")
lasso_output
|   | feature | coefficient | positive |
|---|---|---|---|
| 16 | dalc_VeryHigh | 0.000000 | True |
| 10 | goout_High | 0.000000 | True |
| 17 | dalc_VeryLow | 0.020745 | True |
| 1 | age | 0.098501 | True |
| 21 | health_VeryGood | 0.175866 | False |
| 18 | health_Bad | 0.211407 | True |
| 11 | goout_Low | 0.340762 | True |
| 22 | romantic_yes | 0.347256 | False |
| 5 | studytime_5-10hrs | 0.378414 | True |
| 19 | health_Good | 0.422875 | True |
| 12 | goout_VeryHigh | 0.448012 | False |
| 4 | sex_M | 0.548476 | False |
| 7 | studytime_>10hrs | 0.587786 | True |
| 15 | dalc_Low | 0.590139 | False |
| 13 | goout_VeryLow | 0.619138 | False |
| 6 | studytime_<2hrs | 0.680878 | False |
| 20 | health_VeryBad | 0.794242 | True |
| 8 | schoolsup_yes | 1.197482 | False |
| 3 | school_MS | 1.362506 | False |
| 2 | failures | 1.777782 | False |
| 9 | higher_yes | 1.785505 | True |
| 14 | dalc_High | 2.649506 | False |
🎯 ACTION POINTS What is the output? Which coefficients have been shrunk to zero? What is the most important feature?
Look at which features have coefficient values of 0 and which have the largest absolute coefficients. The most important features are those with the largest non-zero coefficients.
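These questions can also be answered programmatically by filtering the coefficient table. A minimal sketch using a toy stand-in for lasso_output (the values below are made up, so check your answers against the real table above):

```python
import pandas as pd

# Toy stand-in for the lasso_output data frame built above (values made up)
lasso_output_demo = pd.DataFrame({
    "feature": ["dalc_VeryHigh", "goout_High", "failures", "higher_yes"],
    "coefficient": [0.00, 0.00, 1.78, 1.79],
})

# Features the penalty shrank exactly to zero
zeroed = lasso_output_demo.query("coefficient == 0")["feature"].tolist()
print(zeroed)  # ['dalc_VeryHigh', 'goout_High']

# Most important feature by absolute coefficient size
top = lasso_output_demo.loc[lasso_output_demo["coefficient"].abs().idxmax(), "feature"]
print(top)  # higher_yes
```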
Step 3: Create a bar plot
(
ggplot(lasso_output, aes("coefficient", "feature", fill="positive")) +
geom_bar(stat="identity") +
theme(panel_grid_major_y = element_blank()) +
labs(x = "Lasso coefficient", y = "Feature",
fill = "Positive?")
)
Step 4: Evaluate on the test set
Although a different model is used, the code for evaluating the model on the test set is exactly the same as earlier.
# Apply model to test set
lasso_preds = lasso_model.predict(X_test)
# Calculate performance metrics
lasso_r2 = r2_score(y_test, lasso_preds)
lasso_rmse = root_mean_squared_error(y_test, lasso_preds)
# Print rounded PMs
print(np.round(lasso_r2, 2), np.round(lasso_rmse, 2))
0.27 2.46
🗣️ CLASSROOM DISCUSSION:
What feature is the largest positive / negative predictor of student performance? Does this model represent an improvement on the linear model?
Compare the R-squared and RMSE values to our earlier multivariate model. Does the lasso provide better predictive performance or feature selection benefits?
(Bonus) Step 5: Experiment with different penalties
This is your chance to try out different penalties. Can you find a penalty that improves test set performance?
Let’s try a lower penalty value of 0.001.
# Instantiate a lasso model
lasso_model = Lasso(alpha=0.001)
# Fit the model to the training data
lasso_model.fit(X_train, y_train)
# Apply the model to the test set
lasso_preds = lasso_model.predict(X_test)
# Calculate performance metrics
lasso_r2 = r2_score(y_test, lasso_preds)
lasso_rmse = root_mean_squared_error(y_test, lasso_preds)
# Print the rounded metrics
print(np.round(lasso_r2, 2), np.round(lasso_rmse, 2))
0.27 2.47
Try different penalty values to see which gives you the lowest RMSE. How do the coefficients change as you increase the penalty?
👉 NOTE: In labs 4 and 5, we are going to use a method called k-fold cross validation to systematically test different combinations of hyperparameters for models such as the lasso.
