❌ Common W04 Formative issues
1 Issue 1: Not submitting an HTML file
🧩 The problem:
Many of you submitted a `.qmd` file and/or an `.ipynb` file but forgot to submit a (self-contained) HTML file alongside it. Without it, we can't visualise the outputs of your code correctly, particularly if you used an absolute path to load the data (see Section 2).
💡 The solution:
If you have included the following lines in your `.qmd` header:

```yaml
editor:
  render-on-save: true
  preview: true
```

then you will generate an HTML file every time you press the preview button at the top right corner of the VSCode menu bar (this assumes that you have the Quarto extension installed and that you have your `.qmd` file open).
If not, you can either:
- add these lines to the `.qmd` header, save the `.qmd` and press the preview button,
- or open a new terminal in VSCode, make sure you are in the same folder as your `.qmd` (use the `pwd` and `cd` commands as needed), and then run `quarto render name_of_your_qmd.qmd` in the terminal, where `name_of_your_qmd.qmd` is the name of your `.qmd` file.
This should produce an HTML file, provided your `.qmd` file is syntactically correct.
1.1 Submitting `.qmd` files with incorrect headers
🧩 The problem:
Many submitted `.qmd` files with incomplete headers, which would have resulted in incorrectly rendered HTML files had HTML files been generated.
💡 The solution:
Here is what a correct header looks like:
```yaml
---
title: "✏️ W04 Formative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
```
You need to replace `<CANDIDATE_NUMBER>` with your candidate number. The line `self-contained: true` is particularly important, as it is what allows you to produce a self-contained HTML file: if you forget it and include figures in your `.qmd`, your figures won't show!
2 Issue 2: Loading the data using an absolute path
🧩 The problem:
When loading the dataset, many used absolute paths, i.e. the exact path to the data as it appears on their own machine. Why is that not a good idea?
If someone else wanted to reproduce your code, they would have to have the exact same path to the data on their machines!
This means that specifying absolute paths when loading datasets (or in any part of your code really) makes your code lose its reproducibility.
💡 The solution:
You should always use relative paths (paths relative to your project folder) in your code, so that it remains fully reproducible.
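For illustration, here is a minimal sketch of the difference (the folder and file names below are hypothetical, not the actual dataset):

```python
import pandas as pd

# ❌ Absolute path: only works on the author's machine
# df = pd.read_csv("/Users/me/Documents/DS105/data/life_expectancy.csv")

# ✅ Relative path: works for anyone who has the project folder,
#    as long as the data sits in a data/ subfolder next to the .qmd file
df = pd.read_csv("data/life_expectancy.csv")
```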
3 Issue 3: Recommending dropping missing values for the wrong reasons
🧩 The problem:
Many recommended dropping missing values for the wrong reasons.
They had computed the percentage of missing values and:
- they had either determined that the percentage of missing values was too high, so the missing values needed to be dropped,
- or they had determined that the missing values could not be obtained from an external source, so they needed to be dropped.
Both these reasons are wrong.
💡 The solution:
In this particular case, the column with the most missing values (`population`) only had a share of missing values in the low 20%s: this hardly qualifies as a significant amount of missing data and can usually be handled with imputation. Most importantly, the percentage of missing values on its own is not sufficient to decide whether to drop missing values: you can only make that determination after having evaluated the missingness patterns (and the only case where dropping values is valid is the MCAR case).
Regarding having to drop values because external sources can't provide the missing information, this is also wrong because imputation does not necessarily rely on external sources to fill in the gaps: it uses the information already available in the dataset to estimate the missing values as well as possible.
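As a rough sketch of what this looks like in practice (assuming the dataset is loaded as `df` and the missing column is `population`, as above), you would first quantify missingness and then impute from the data you already have:

```python
# Share of missing values per column, as a percentage
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)

# Imputation uses information already present in the dataset --
# no external source required. Median imputation is just one
# possible choice, shown here purely as an example.
df["population"] = df["population"].fillna(df["population"].median())
```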
4 Issue 4: Only plotting distribution plots without interpreting them
🧩 The problem:
Many times, in both Questions 3 and 4, distribution plots were drawn (more or less correctly), but the plots were not accompanied by any interpretation.
💡 The solution:
A plot on its own does not tell a story! It only tells one if it is interpreted. To see an example of the type of interpretation that was expected for Questions 3 and 4, see the model solution for this formative.
5 Issue 5: Not labeling the plots correctly
🧩 The problem:
GDP per capita, used in Question 3, is a variable that has a unit, but many plots did not show that unit anywhere.
💡 The solution:
All distribution plots that included GDP per capita needed to show the unit of GDP per capita on the relevant axis too!
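Here is a minimal sketch with matplotlib/seaborn (the column name `gdp_per_capita` and the US dollar unit are assumptions for illustration; use whatever your dataset actually contains):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of GDP per capita -- the key point is the axis label spelling out the unit
sns.histplot(data=df, x="gdp_per_capita")
plt.xlabel("GDP per capita (US$)")
plt.ylabel("Number of countries")
plt.title("Distribution of GDP per capita")
plt.show()
```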
6 Issue 6: Not explaining the modeling choices and/or the metrics
🧩 The problem:
Many only wrote the code for linear and LASSO regressions and output some metrics, but didn't explain how they selected the features for their models or why they made certain other choices (e.g. choice of imputation technique, choice to drop missing values, etc.), and didn't interpret their metrics in the context of the life expectancy prediction problem.
💡 The solution:
Code is only a means to an end; it does not stand on its own.
Unless you justify your modelling choices, i.e. your choice of features and your pre-processing steps (e.g. imputation), we can't ascertain that your model is correct. You also need to interpret the model metrics: they have no intrinsic meaning and only mean something in the context of the problem being solved.
Explanations and interpretations are much more crucial than the code itself!
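To make this concrete, here is a purely illustrative sketch (the feature names, the median imputation and the LASSO penalty are assumptions, not the model solution); the point is that every choice in it should be justified in your write-up and the metric interpreted in terms of years of life expectancy:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical feature choice -- explain *why* you chose these columns
features = ["gdp_per_capita", "population", "schooling"]
X, y = df[features], df["life_expectancy"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Median imputation and alpha=0.1 are example choices; justify your own
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("lasso", Lasso(alpha=0.1)),
])
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
# Interpret the metric in context, e.g. "on average, the model's predictions
# are off by this many years of life expectancy" -- and say whether that is acceptable
print(f"MAE: {mae:.2f} years")
```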