✏️ W04 Formative

2024/25 Winter Term

Author

Dr Ghita Berrada

⏲️ Due Date:

13 February 2025 at 5pm

🎯 Main Objectives:

To practice using GitHub Classroom
To practice creating and styling your own Quarto documents
To practice writing Python code of your own
To practice linear and LASSO regression and their evaluation

Please submit your work even if you didn’t manage to go very far with the Python code. As this is a formative assignment, it won’t be graded, and the main point is for you to get used to submitting your work through GitHub Classroom.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Preparation (if you are new to GitHub)

You will use GitHub Classroom ¹ to submit your work. You will need to have a GitHub account to do this.

Create an account on GitHub.

Never heard of GitHub²? Or maybe you have heard of it but never used it? Then, follow the instructions below to get started.

Go to our Slack workspace’s #announcements channel to find the link to ‘Intro to Git and GitHub’. You will be taken to a page with instructions on how to get started with Git and GitHub.
Read the instructions in the README.md and complete the exercises.
Ask any questions about the exercise above on the #help channel on Slack.

📝 Instructions

Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!
Click on the link, sign in to GitHub and then click on the green button Accept this assignment.
You will be redirected to a new private repository created just for you. The repository will be named ds202w-2025-2024-w04-formative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
Many of you might still be catching up with Python and GitHub, so it’s okay if you can only complete a few questions. You will still get feedback on your answers, which will still count as completed (important for General Course and Exchange students).
Create your own .qmd file with your answers.

You can create a .qmd file from a Jupyter notebook (i.e. an .ipynb file). Open the VSCode Terminal, make sure you are in the same directory as your Jupyter notebook (use the pwd command to check which directory you’re in and the cd command to change directory if needed), and then type the following command:

quarto convert <name_of_notebook>.ipynb

where <name_of_notebook>.ipynb is the name of the Jupyter notebook you want to convert to .qmd.

Also check out the Quarto documentation to better understand the conversion from ipynb to qmd.

And check out this tutorial if you want to better understand the commands you can run in your VSCode terminal (e.g. to change the current directory).

You can also use the .qmd file you used in the W01 lab as a template. Just remove anything that is not relevant to this assignment.

Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
Use the #help channel on Slack liberally if you get stuck.

“What do I submit?”

⚠️ Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
An HTML file render of the Quarto markdown file. To generate a render, the easiest way is to include these lines

editor:
  render-on-save: true
  preview: true

in your .qmd header so that an HTML file is generated each time you preview your document. Make sure you also have the Quarto extension installed in VSCode, so that you can preview by clicking on a button at the top right corner of the VSCode menu bar without having to use the Terminal. Also, don’t forget to add the line self-contained: true to your .qmd header; otherwise none of your plots will show!

Your .qmd header should look something like this:

---
title: "✏️ W04 Formative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

✔️ How we will grade your work

We won’t! This is formative. But you will get feedback on your answers. It won’t be super detailed at this stage, but it should give you an idea of how you are doing.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Tasks

The questions below will build on your code from 💻 W02 Lab and 💻 W03 Lab.

About the Data

We will use the WHO Life Expectancy data set from Kaggle for this assignment. The data set is designed for studying which factors affect life expectancy. You can download it by clicking on the button below:

See here to find out more about the data.

Question 1

Load the data and look through the documentation in the link above. Rename the variable thinness 1-19 years to thinness_10_19_years and rename all columns such that:

special characters (e.g. -) are replaced by _
spaces are replaced by _
there are no duplicate _
all trailing and leading spaces are removed
and all characters are lowercase
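
The renaming rules above can be sketched as a small helper function. This is only one possible approach, not the expected answer: the function name `clean_column_name` and the sample column names are hypothetical and chosen for illustration.

```python
import re

def clean_column_name(name: str) -> str:
    """Normalise one column name to lowercase snake_case."""
    name = name.strip().lower()              # remove leading/trailing spaces, lowercase
    name = re.sub(r"[^0-9a-z]+", "_", name)  # spaces and special characters -> "_"
    name = re.sub(r"_+", "_", name)          # collapse any repeated "_" (defensive)
    return name.strip("_")                   # no "_" at the start or end

# Hypothetical raw names in the style of the data set (assumed, not read from the file)
raw_names = ["Life expectancy ", " BMI ", "under-five deaths ", " HIV/AIDS"]
cleaned = [clean_column_name(n) for n in raw_names]
print(cleaned)
```

With pandas, a function like this can be applied to every column at once via `df.rename(columns=clean_column_name)`. The thinness column would still need its own explicit rename, since the generic rule would turn thinness 1-19 years into thinness_1_19_years rather than thinness_10_19_years.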

Question 2

Explore the missingness patterns in the data. Which variables have the most missing values? And what data processing do you recommend in view of these observations?
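
As a starting point for exploring missingness, the sketch below counts missing values per column and per row. It uses a toy data frame with made-up values (an assumption for illustration); the column names only mimic the cleaned names from Question 1.

```python
import numpy as np
import pandas as pd

# Toy data frame with made-up values, standing in for the real data set
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [1.0e6, np.nan, np.nan, 4.0e6],
    "gdp": [500.0, np.nan, 700.0, 800.0],
    "life_expectancy": [70.1, 65.3, 80.2, 75.0],
})

# Share of missing values per column, largest first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows with at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)
```

Looking at both views helps when deciding between dropping columns, dropping rows, or imputing: a column that is mostly empty calls for a different treatment than a handful of incomplete rows.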

Question 3

Rename the current column gdp as gdp_per_capita.

Create a column gdp based on the columns gdp_per_capita and population (Hint: GDP per capita is GDP divided by population).
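
The hint can be rearranged: if GDP per capita is GDP divided by population, then GDP is GDP per capita multiplied by population. A minimal sketch on toy data (the numbers are made up for illustration):

```python
import pandas as pd

# Toy data: the original "gdp" column actually holds per-capita values
df = pd.DataFrame({
    "gdp": [500.0, 2000.0],
    "population": [1_000_000, 50_000],
})

# Step 1: give the per-capita column an accurate name
df = df.rename(columns={"gdp": "gdp_per_capita"})

# Step 2: total GDP = GDP per capita * population
df["gdp"] = df["gdp_per_capita"] * df["population"]
print(df)
```

Any row where either input is missing will also have a missing gdp value, which ties back to the missingness patterns from Question 2.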

Can you draw the overall distribution of GDP per capita? What about the distribution of GDP per capita for developing countries and for developed countries? Any observations?

Question 4

Can you draw the overall distribution of income_composition_of_resources? What about the distribution of income_composition_of_resources for developing countries and for developed countries? Any observations?

Question 5

Split your data frame from Q3 into training and test sets: use 30% of the data as the test set and keep the original proportion of developing and developed countries in both the training and test sets, i.e. stratify the split (see the scikit-learn documentation).
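
A stratified 70/30 split can be sketched with scikit-learn's `train_test_split`. The toy data frame, the column name `status`, and the seed value are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 40 rows, 25% "Developed" and 75% "Developing" (made-up values)
df = pd.DataFrame({
    "status": ["Developed"] * 10 + ["Developing"] * 30,
    "life_expectancy": [float(50 + i) for i in range(40)],
})

train, test = train_test_split(
    df,
    test_size=0.3,           # 30% of the rows go to the test set
    stratify=df["status"],   # keep the status proportions in both sets
    random_state=42,         # arbitrary seed, only for reproducibility
)

print(len(train), len(test))
print(train["status"].value_counts(normalize=True))
print(test["status"].value_counts(normalize=True))
```

Comparing the two `value_counts` outputs is a quick way to confirm that both sets preserve the original mix of developed and developing countries.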

First, build a multivariate linear model on the training set in order to predict life expectancy, and evaluate the model on the test set using suitable metrics. How does this model perform at predicting life expectancy?

Then, build a LASSO model on the training set in order to predict life expectancy, and evaluate the model on the test set using suitable metrics. How does this model perform at predicting life expectancy? And how does it compare to the previous model?
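
To illustrate how the two models can be fitted and compared on the same test set, here is a sketch on synthetic data. The data, the choice of `alpha=0.1`, the use of standardisation, and the metrics shown (R² and mean absolute error) are all assumptions for illustration, not the expected solution.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, only the first two influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Standardise features first so the LASSO penalty treats them on the same scale
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),  # alpha is arbitrary here
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)          # fit on the training set only
    y_pred = model.predict(X_test)       # evaluate on the held-out test set
    scores[name] = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }
    print(name, scores[name])

# LASSO shrinks the coefficients of weak predictors towards zero
print(models["lasso"][-1].coef_)
```

Inspecting the LASSO coefficients alongside the test metrics shows which predictors the penalty has shrunk, which is useful when comparing the two models.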

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT, and search online for help, when working on this assignment. If you use an AI tool, however minimally, you are asked to report which tool you used and to add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and that does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.