❌ Common W04 Formative issues
1 Issue 1: Not submitting an HTML file
🧩 The problem:
Many of you submitted a `.qmd` file and/or an `.ipynb` file but forgot to submit a (self-contained) HTML file alongside it. Without it, we can't visualise the outputs of your code correctly, particularly if you used an absolute path to load the data (see Section 2).
💡 The solution:
If you have included the following lines in your `.qmd` header:

```yaml
editor:
  render-on-save: true
  preview: true
```

then you will generate an HTML file every time you press the preview button at the top right corner of the VSCode menu bar (this assumes that you have the Quarto extension installed and that you have your `.qmd` file open).
If not, you can either:
- add these lines to the `.qmd` header, save the `.qmd` and press the preview button,
- or open a new terminal in VSCode, make sure you are in the same folder as your `.qmd` (use the `pwd` and `cd` commands as needed), and then run `quarto render name_of_your_qmd.qmd` in the terminal, where `name_of_your_qmd.qmd` is the name of your `.qmd` file.
This should produce an HTML file, provided your `.qmd` file is syntactically correct.
1.1 Submitting `.qmd` files with incorrect headers
🧩 The problem:
Many submitted `.qmd` files with incomplete headers, which would have resulted in incorrectly rendered HTML files had HTML files been generated.
💡 The solution:
Here is what a correct header looks like:
```yaml
---
title: "✏️ W04 Formative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
```
You need to replace `<CANDIDATE_NUMBER>` with your candidate number. The line `self-contained: true` is particularly important, as it is what allows you to produce a self-contained HTML file: if you forget it and include figures in your `.qmd`, your figures won't show!
2 Issue 2: Loading the data using an absolute path
🧩 The problem:
When loading the dataset, many used absolute paths, i.e. the exact path to the data as it appears on their own machine. Why is that not a good idea?
If someone else wanted to reproduce your code, they would have to have the exact same path to the data on their machines!
This means that specifying absolute paths when loading datasets (or in any part of your code really) makes your code lose its reproducibility.
💡 The solution:
You should always use relative paths (paths relative to your project folder) in your code, so that it remains fully reproducible.
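For illustration, here is a minimal sketch of the difference (the folder and file names below are hypothetical, not the actual dataset):

```python
import pandas as pd

# ❌ Absolute path: only works on the author's machine
# df = pd.read_csv("/Users/me/Documents/DS105/data/life_expectancy.csv")

# ✅ Relative path: works for anyone who has the project folder,
#    as long as the data sits in a data/ subfolder next to the .qmd file
df = pd.read_csv("data/life_expectancy.csv")
```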
3 Issue 3: Recommending dropping missing values for the wrong reasons
🧩 The problem:
Many recommended dropping missing values for the wrong reasons.
They had computed the percentage of missing values and:
- they had either determined that the percentage of missing values was too high, so the missing values needed to be dropped,
- or they had determined that the missing values could not be obtained from an external source, so they needed to be dropped.
Both these reasons are wrong.
💡 The solution:
In this particular case, the column with the most missing values (`population`) only had a share of missing values in the low 20%s: this hardly qualifies as a significant amount of missing data and can usually be handled with imputation. Most importantly, the percentage of missing values on its own is not sufficient to decide whether to drop missing values: you can only make that determination after having evaluated the missingness patterns (and the only case where dropping values is valid is the MCAR case).
Regarding having to drop values because external sources can't provide the missing information, this is also wrong because imputation does not necessarily rely on external sources to fill in the gaps: it uses the information already available in the dataset to estimate the missing values as well as possible.
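As a rough sketch of what this looks like in practice (assuming the dataset is loaded as `df` and the missing column is `population`, as above), you would first quantify missingness and then impute from the data you already have:

```python
# Share of missing values per column, as a percentage
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)

# Imputation uses information already present in the dataset --
# no external source required. Median imputation is just one
# possible choice, shown here purely as an example.
df["population"] = df["population"].fillna(df["population"].median())
```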
4 Issue 4: Only plotting distribution plots without interpreting them
🧩 The problem:
Many times, in both Questions 3 and 4, distribution plots were drawn (more or less correctly), but the plots were not accompanied by any interpretation.
💡 The solution:
A plot on its own does not tell a story! It only tells one if it is interpreted. To see an example of the type of interpretation that was expected for Questions 3 and 4, see the model solution for this formative.
5 Issue 5: Not labeling the plots correctly
🧩 The problem:
GDP per capita, used in Question 3, is a variable that has a unit, but many plots did not show that unit anywhere.
💡 The solution:
All distribution plots that included GDP per capita needed to show the unit of GDP per capita on the relevant axis too!
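Here is a minimal sketch with matplotlib/seaborn (the column name `gdp_per_capita` and the US dollar unit are assumptions for illustration; use whatever your dataset actually contains):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of GDP per capita -- the key point is the axis label spelling out the unit
sns.histplot(data=df, x="gdp_per_capita")
plt.xlabel("GDP per capita (US$)")
plt.ylabel("Number of countries")
plt.title("Distribution of GDP per capita")
plt.show()
```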
6 Issue 6: Not explaining the modeling choices and/or the metrics
🧩 The problem:
Many only wrote the code for linear and LASSO regressions and output some metrics, but didn't explain how they selected the features for their models or why they made certain other choices (e.g. choice of imputation technique, choice to drop missing values, etc.), and didn't interpret their metrics in the context of the life expectancy prediction problem.
💡 The solution:
Code is only a means to an end; it does not stand on its own.
Unless you justify your modelling choices, i.e. your choice of features and your pre-processing steps (e.g. imputation), we can't ascertain that your model is correct. You also need to interpret the model metrics: they have no intrinsic meaning and only mean something in the context of the problem being solved.
Explanations and interpretations are much more crucial than the code itself!
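To make this concrete, here is a purely illustrative sketch (the feature names, the median imputation and the LASSO penalty are assumptions, not the model solution); the point is that every choice in it should be justified in your write-up and the metric interpreted in terms of years of life expectancy:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical feature choice -- explain *why* you chose these columns
features = ["gdp_per_capita", "population", "schooling"]
X, y = df[features], df["life_expectancy"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Median imputation and alpha=0.1 are example choices; justify your own
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("lasso", Lasso(alpha=0.1)),
])
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
# Interpret the metric in context, e.g. "on average, the model's predictions
# are off by this many years of life expectancy" -- and say whether that is acceptable
print(f"MAE: {mae:.2f} years")
```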