✏️ W07 Formative

2024/25 Winter Term

Author

Dr Ghita Berrada

⏲️ Due Date:

10 March 2025 at 7.30pm

🎯 Main Objectives:

To practice using GitHub Classroom
To practice creating and styling your own Quarto documents
To practice writing Python code of your own
To practice selecting classification models, tuning their parameters and evaluating and interpreting their results

Please submit your work even if you didn’t manage to go very far with the Python code. As this is a formative assignment, it won’t be graded, and the main point is for you to get used to submitting your work through GitHub Classroom and to get a bit more practice on the models you’ve encountered so far (before the actual summative!).

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Preparation (if you are new to GitHub)

You will use GitHub Classroom ¹ to submit your work. You will need to have a GitHub account to do this.

Create an account on GitHub.

Never heard of GitHub²? Or maybe you have heard of it but never used it? Then, follow the instructions below to get started.

Go to our Slack workspace’s #announcements channel to find the link to ‘Intro to Git and GitHub’ (go to the 📌Pins tab on the channel to find the post on the W04 formative!). You will be taken to a page with instructions on how to get started with Git and GitHub.
Read the instructions in the README.md and complete the exercises.
Ask any questions about the exercise above on the #help channel on Slack.

📝 Instructions

Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!
Click on the link, sign in to GitHub and then click on the green button Accept this assignment.
You will be redirected to a new private repository created just for you. The repository will be named ds202w-2025-2024-w07-formative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
Many of you might still be catching up with Python and GitHub, so it’s okay if you can only complete a few questions. You will still get feedback on your answers, which will still count as completed (important for General Course and Exchange students).
Create your own .qmd file with your answers.

You can create a .qmd file from a Jupyter notebook (i.e. an .ipynb file). Open the VSCode Terminal, make sure you are in the same directory as your Jupyter notebook (use the pwd command to check which directory you’re in and the cd command to change directory if needed), and then type the following command:

quarto convert <name_of_notebook>.ipynb

where <name_of_notebook>.ipynb is the name of the Jupyter notebook you want to convert into .qmd

Also check out the Quarto documentation to better understand the conversion from ipynb to qmd.

And check out this tutorial if you want to better understand the commands you can run on your VSCode terminal (e.g. to change the current directory).

You can also use the .qmd file you used in the W01 lab as a template. Just remove anything that is not relevant to this assignment.

Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
Use the #help channel on Slack liberally if you get stuck.

“What do I submit?”

⚠️ Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
An HTML file render of the Quarto markdown file. To generate a render, the easiest way is to include these lines

editor:
  render-on-save: true
  preview: true

in your .qmd header so that an HTML file is generated each time you preview your document. Make sure you also have the Quarto extension installed in VSCode, so that you can preview by clicking on a button at the top right corner of the VSCode menu bar without having to use the Terminal! Also, don’t forget to add the line self-contained: true to your .qmd header, otherwise none of your plots will show!

Your .qmd header should look something like this:

---
title: "✏️ W07 Formative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

✔️ How we will grade your work

We won’t! This is formative. But you will get feedback on your answers. It won’t be super detailed at this stage, but it should give you an idea of how you are doing.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Tasks

The questions below build mainly on your code from 💻 W04 Lab and 💻 W05 Lab (as well as the corresponding lectures).

About the Data

We will use a dataset that combines indicators from several sources: World Bank databases, Reporters Without Borders (RSF) data (relating to the World Press Freedom Index³), Harvard Growth Lab data on the Economic Complexity Index, Transparency International data (the 2022 Corruption Perception Index, or CPI, score), the World Income Inequality Database (WIID) of the United Nations University World Institute for Development Economics Research (UNU-WIDER) and, finally, World Population Review data (adult literacy rates).

Here is an overview of the indicators in our data and their meaning:

Indicator	Definition	Source
Inflation, consumer prices (annual %)	Measures the annual percentage change in the average price level of a basket of goods and services. Indicates the rate at which the general price level is rising.	World Bank
Urbanization (%)	Represents the percentage of a country’s population living in urban areas. Reflects the extent of urbanization in a country.	World Bank
FDI (% of GDP)	Foreign Direct Investment as a percentage of GDP. Reflects foreign investment in the economy, showing a country’s level of economic openness.	World Bank
GDP per capita, PPP (current international $)	GDP per capita adjusted for purchasing power parity (PPP). Provides a comparative measure of living standards across countries.	World Bank
Unemployment, total (% of total labor force)	Percentage of the labor force that is unemployed, based on the International Labour Organization’s model.	World Bank
Tax revenue (% of GDP)	The total revenue from taxes collected by the government, expressed as a percentage of GDP. Indicates the government’s capacity to generate tax income.	World Bank
Individuals using the Internet (% of population)	The percentage of the total population that has access to and uses the internet. Reflects digital connectivity within a country.	World Bank
Trade Openness (% of GDP)	The sum of exports and imports of goods and services as a percentage of GDP. Shows how integrated a country is with the global economy.	World Bank
Rule of Law Score	Measures the extent to which individuals can rely on the legal system to protect their rights. Includes factors like law enforcement and judicial independence.	World Bank
Government Effectiveness	Reflects the quality of public services, civil service quality, and the government’s ability to implement policies.	World Bank
Regulatory Quality	Measures the government’s ability to design and implement policies that promote the private sector’s development.	World Bank
Real GDP Growth (% Change)	The annual percentage change in real GDP. Reflects the economic growth or contraction within a country.	World Bank
Voice and Accountability: Estimate	Measures the extent to which a country’s citizens can participate in selecting their government, freedom of expression, and access to information.	World Bank
Political Stability and Absence of Violence/Terrorism: Estimate	An estimate of the political stability of a country, including the absence of violence, terrorism, and civil unrest.	World Bank
Central government debt, total (% of GDP)	Total government debt expressed as a percentage of GDP. Reflects how much a government borrows relative to its economic output.	World Bank
External debt stocks (% of GNI)	Total external debt as a percentage of a country’s Gross National Income (GNI). Indicates a country’s financial obligations to foreign creditors.	World Bank
Public and publicly guaranteed debt service (% of exports)	The ratio of a country’s debt service payments (interest and principal) to its export earnings. Shows the financial strain from external debt.	World Bank
Domestic credit to private sector (% of GDP)	The total financial credit extended to the private sector expressed as a percentage of GDP. Indicates the financial sector’s contribution to economic activity.	World Bank
RSF World Freedom of Press Score	A score measuring the degree of press freedom in a country. Higher scores indicate greater freedom of the press and less governmental interference.	RSF
ECI (Economic Complexity Index)	Measures the diversity and sophistication of a country’s productive capabilities. It evaluates how complex and interconnected a country’s exports are.	Atlas of Economic Complexity
Corruption Perception Index (CPI)	A score indicating the perceived levels of corruption in the public sector. A lower score suggests higher levels of perceived corruption.	Transparency International
Gini Index	A measure of income inequality within a country. A Gini index of 0 represents perfect equality, while an index of 100 represents perfect inequality.	WID.World
Total Population Literacy Rate	The percentage of people aged 15 and above who can read and write. Reflects the education level and access to literacy programs within a country.	World Population Review

All data, except Gini index and Total Population Literacy Rate, were collected for the year 2022. For both Gini Index and Total Population Literacy Rate, the data for the most recent year on record (up to 2022) was retrieved (the data goes back to 2016 for Gini and to 2011 for literacy rates).
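To make the "most recent year on record (up to 2022)" rule concrete, here is a minimal sketch of how such a selection could be done with pandas. It is only an illustration on made-up numbers: the `records` table, its column names and its values are all hypothetical, and this is not necessarily how the dataset was actually assembled.

```python
import pandas as pd

# Hypothetical long-format records (made-up values, hypothetical column names).
records = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B"],
        "year": [2016, 2019, 2023, 2018, 2021],
        "gini": [40.2, 38.5, 37.9, None, 33.0],
    }
)

# Keep years up to 2022, drop missing values, then take the latest row per country.
latest = (
    records[records["year"] <= 2022]
    .dropna(subset=["gini"])
    .sort_values("year")
    .groupby("country", as_index=False)
    .last()
)
print(latest)
```

One consequence of this kind of rule is that values for different countries may refer to different years, which is worth keeping in mind when interpreting these two indicators.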

Click on the button below to download the dataset:

Our aim through this dataset is to try and classify countries according to their level of corruption (corruption perception index 2022).

Question 1

Load the data and explore its characteristics (e.g. missingness patterns, interesting variable distributions, etc.)
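As a starting point, a first look at missingness and distributions might resemble the sketch below. It uses a tiny made-up DataFrame rather than the real dataset, and the column names are hypothetical. In your own work you would load the downloaded file first (for example with `pd.read_csv`, assuming it is a CSV).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; column names and values are hypothetical.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "gdp_per_capita": [1200.0, np.nan, 45000.0, 8000.0],
        "gini": [35.1, 41.0, np.nan, np.nan],
    }
)

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Summary statistics for the numeric columns.
summary = df.describe()
print(summary)
```

From here, plots of individual variables (histograms, boxplots) are a natural next step for spotting skewed distributions or outliers.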

Question 2

Create the outcome variable corruption for our classification models as follows:

if the corruption perception index 2022 is greater than or equal to 50, the country is classified NotCorrupt and assigned to class 0
otherwise, the country is classified Corrupt and assigned to class 1
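The thresholding rule above can be sketched as follows. This is one possible approach on toy data, and `cpi_2022` is a hypothetical column name that you would replace with the actual column in the dataset.

```python
import numpy as np
import pandas as pd

# Toy data; "cpi_2022" is a hypothetical column name.
df = pd.DataFrame({"country": ["A", "B", "C"], "cpi_2022": [72, 50, 31]})

# 0 = NotCorrupt (CPI >= 50), 1 = Corrupt (CPI < 50).
df["corruption"] = np.where(df["cpi_2022"] >= 50, 0, 1)
print(df)
```

Note that with this approach a missing CPI value would fail the comparison and silently end up in class 1, so it is worth checking for missing values in that column before creating the outcome.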

Question 3

Create a baseline model (using the model of your choice) using a simple training/test split to predict corruption. Explain your model and feature selection and any other modeling choices you made. Explain the performance of your model in the context of your dataset/application.
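For orientation, the overall shape of a baseline with a simple training/test split could look like the sketch below. It runs on synthetic data generated by scikit-learn, not on the country dataset, and logistic regression is used only as an example of a simple model. The choice of model, features and metrics is yours to make and justify.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the country indicators (not the real data).
X, y = make_classification(
    n_samples=180, n_features=8, weights=[0.35, 0.65], random_state=42
)

# Stratify so both classes keep similar proportions in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling is fitted on the training set only, via the pipeline.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
print("F1:", f1_score(y_test, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```

Putting the scaler inside the pipeline keeps test-set information out of the preprocessing step. Metrics such as F1 or balanced accuracy tend to be more informative than plain accuracy when the two classes are not equally frequent.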

Question 4

Select two/three models (one being a tree-based model). Tune their hyperparameters and/or evaluate their performance using cross-validation. How well do they perform compared to your baseline model?
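As an illustration of tuning with cross-validation, the sketch below runs a small grid search over a decision tree on synthetic data. The parameter grid, the scoring metric and the number of folds are arbitrary choices made for the example, not recommendations for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the country indicators (not the real data).
X, y = make_classification(n_samples=180, n_features=8, random_state=42)

# Stratified folds keep the class balance similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative grid; the values are arbitrary.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, scoring="f1", cv=cv
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1:", search.best_score_)
```

In your own answer, running the search on the training portion only and keeping the test set for a final check makes the comparison with the baseline fairer.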

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT, and search online for help, when working on this assignment. If you use such a tool, however minimally, you are asked to report which AI tool you used and to add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and that does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.

Footnotes

¹ What is GitHub Classroom?
² What is GitHub?
³ To read more about the various World Press Freedom Index components, see here.