✏️ W07 Formative
2024/25 Winter Term
⏲️ Due Date:
- 04 March 2025 at 5pm
🎯 Main Objectives:
- To practice using GitHub Classroom
- To practice creating and styling your own Quarto documents
- To practice writing Python code of your own
- To practice selecting classification models, tuning their parameters and evaluating and interpreting their results
Please submit your work even if you didn’t manage to go very far with the Python code. As this is a formative assignment, it won’t be graded, and the main point is for you to get used to submitting your work through GitHub Classroom and to get a bit more practice on the models you’ve encountered so far (before the actual summative!).
👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.
📚 Preparation (if you are new to GitHub)
You will use GitHub Classroom to submit your work. You will need a GitHub account to do this.
- Create an account on GitHub.
Never heard of GitHub? Or maybe you have heard of it but never used it? Then follow the instructions below to get started.
- Go to our Slack workspace’s #announcements channel to find the link to ‘Intro to Git and GitHub’ (go to the 📌 Pins tab on the channel to find the post on the W04 formative!). You will be taken to a page with instructions on how to get started with Git and GitHub.
- Read the instructions in the README.md and complete the exercises.
- Ask any questions about the exercise above on the #help channel on Slack.
📝 Instructions
- Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!
- Click on the link, sign in to GitHub and then click on the green Accept this assignment button.
- You will be redirected to a new private repository created just for you. The repository will be named ds202w-2025-2024-w07-formative--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.
- Many of you might still be catching up with Python and GitHub, so it’s okay if you can only complete a few questions. You will still get feedback on your answers, and your submission will still count as completed (important for General Course and Exchange students).
- Create your own .qmd file with your answers.
You can create a .qmd file from a Jupyter notebook (i.e. .ipynb) by going to the VSCode Terminal, making sure you are in the same directory as your Jupyter notebook (use the pwd command to check which directory you’re in and the cd command to change directory if needed) and then typing the following command:
quarto convert <name_of_notebook>.ipynb
where <name_of_notebook>.ipynb is the name of the Jupyter notebook you want to convert into .qmd.
Also check out the Quarto documentation to better understand the conversion from ipynb to qmd.
And check out this tutorial if you want to better understand the commands you can run on your VSCode terminal (e.g. to change the current directory).
You can also use the .qmd file you used in the W01 lab as a template. Just remove anything that is not relevant to this assignment.
Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
Use the #help channel on Slack liberally if you get stuck.
“What do I submit?”
⚠️ Do you know your CANDIDATE NUMBER? You will need it.
“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”
Source: LSE
- A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.
- An HTML file render of the Quarto markdown file. To generate a render, the easiest way is to include these lines
editor:
  render-on-save: true
  preview: true
in your .qmd header so that an HTML file is generated each time you preview your document (make sure you also have the Quarto extension installed in VSCode so that you can preview by clicking on a button at the top right corner of the VSCode menu bar without having to use the Terminal!). Also, don’t forget to add the line self-contained: true to your .qmd header, otherwise none of your plots will show!
Your .qmd header should look something like this:
---
title: "✏️ W07 Formative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only look at the last version of your assignment.
✔️ How we will grade your work
We won’t! This is formative. But you will get feedback on your answers. It won’t be super detailed at this stage, but it should give you an idea of how you are doing.
👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.
📚 Tasks
The questions below build mainly on your code from 💻 W04 Lab and 💻 W05 Lab (as well as the corresponding lectures).
About the Data
We will use a dataset that combines indicators from several sources: World Bank databases; Reporters Without Borders (RSF) data (relating to the World Press Freedom Index); data from the Harvard Growth Lab about the Economic Complexity Index; data from Transparency International (the 2022 Corruption Perception Index, or CPI score 2022); data from the United Nations University World Institute for Development Economics Research (UNU-WIDER)’s World Income Inequality Database (WIID); and data from the World Population Review (adult literacy rates).
Here is an overview of the indicators in our data and their meaning:
Indicator | Definition | Source |
---|---|---|
Inflation, consumer prices (annual %) | Measures the annual percentage change in the average price level of a basket of goods and services. Indicates the rate at which the general price level is rising. | World Bank |
Urbanization (%) | Represents the percentage of a country’s population living in urban areas. Reflects the extent of urbanization in a country. | World Bank |
FDI (% of GDP) | Foreign Direct Investment as a percentage of GDP. Reflects foreign investment in the economy, showing a country’s level of economic openness. | World Bank |
GDP per capita, PPP (current international $) | GDP per capita adjusted for purchasing power parity (PPP). Provides a comparative measure of living standards across countries. | World Bank |
Unemployment, total (% of total labor force) | Percentage of the labor force that is unemployed, based on the International Labour Organization’s model. | World Bank |
Tax revenue (% of GDP) | The total revenue from taxes collected by the government, expressed as a percentage of GDP. Indicates the government’s capacity to generate tax income. | World Bank |
Individuals using the Internet (% of population) | The percentage of the total population that has access to and uses the internet. Reflects digital connectivity within a country. | World Bank |
Trade Openness (% of GDP) | The sum of exports and imports of goods and services as a percentage of GDP. Shows how integrated a country is with the global economy. | World Bank |
Rule of Law Score | Measures the extent to which individuals can rely on the legal system to protect their rights. Includes factors like law enforcement and judicial independence. | World Bank |
Government Effectiveness | Reflects the quality of public services, civil service quality, and the government’s ability to implement policies. | World Bank |
Regulatory Quality | Measures the government’s ability to design and implement policies that promote the private sector’s development. | World Bank |
Real GDP Growth (% Change) | The annual percentage change in real GDP. Reflects the economic growth or contraction within a country. | World Bank |
Voice and Accountability: Estimate | Measures the extent to which a country’s citizens can participate in selecting their government, freedom of expression, and access to information. | World Bank |
Political Stability and Absence of Violence/Terrorism: Estimate | An estimate of the political stability of a country, including the absence of violence, terrorism, and civil unrest. | World Bank |
Central government debt, total (% of GDP) | Total government debt expressed as a percentage of GDP. Reflects how much a government borrows relative to its economic output. | World Bank |
External debt stocks (% of GNI) | Total external debt as a percentage of a country’s Gross National Income (GNI). Indicates a country’s financial obligations to foreign creditors. | World Bank |
Public and publicly guaranteed debt service (% of exports) | The ratio of a country’s debt service payments (interest and principal) to its export earnings. Shows the financial strain from external debt. | World Bank |
Domestic credit to private sector (% of GDP) | The total financial credit extended to the private sector expressed as a percentage of GDP. Indicates the financial sector’s contribution to economic activity. | World Bank |
RSF World Freedom of Press Score | A score measuring the degree of press freedom in a country. Higher scores indicate greater freedom of the press and less governmental interference. | RSF |
ECI (Economic Complexity Index) | Measures the diversity and sophistication of a country’s productive capabilities. It evaluates how complex and interconnected a country’s exports are. | Atlas of Economic Complexity |
Corruption Perception Index (CPI) | A score indicating the perceived levels of corruption in the public sector. A lower score suggests higher levels of perceived corruption. | Transparency International |
Gini Index | A measure of income inequality within a country. A Gini index of 0 represents perfect equality, while an index of 100 represents perfect inequality. | WID.World |
Total Population Literacy Rate | The percentage of people aged 15 and above who can read and write. Reflects the education level and access to literacy programs within a country. | World Population Review |
All data, except Gini index and Total Population Literacy Rate, were collected for the year 2022. For both Gini Index and Total Population Literacy Rate, the data for the most recent year on record (up to 2022) was retrieved (the data goes back to 2016 for Gini and to 2011 for literacy rates).
Click on the button below to download the dataset:
Our aim through this dataset is to try and classify countries according to their level of corruption (corruption perception index 2022).
Question 1
Load the data and explore its characteristics (e.g. missingness patterns, interesting variable distributions, etc.).
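As a starting point for this kind of exploration, here is a minimal sketch using pandas. It runs on a small made-up DataFrame rather than the real dataset: the column names (`cpi_score`, `gini_index`, `internet_use_pct`) and all values are hypothetical. With the real file, you would build `df` with `pd.read_csv` pointed at wherever you saved the download, and the same calls should apply.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: column names and values are made up.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "cpi_score": [72.0, 35.0, np.nan, 50.0, 28.0],
    "gini_index": [30.1, np.nan, np.nan, 41.5, 52.3],
    "internet_use_pct": [95.0, 60.2, 48.7, 80.0, 33.3],
})

# Share of missing values per column, from most to least missing.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of missing values per row (here, per country).
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row)

# Summary statistics for the numeric columns (missing values are skipped).
print(df.describe())
```

Looking at missingness both by column and by row helps you decide whether to drop a variable, drop a country, or impute.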
Question 2
Create the outcome variable corruption for our classification models, i.e.:
- if the corruption perception index 2022 is greater than or equal to 50, the country is classified NotCorrupt and assigned to class 0
- otherwise, the country is classified Corrupt and assigned to class 1
Question 3
Create a baseline model (using the model of your choice) using a simple training/test split to predict corruption. Explain your model and feature selection and any other modeling choices you made. Explain the performance of your model in the context of your dataset/application.
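For orientation, here is one possible shape for a baseline, assuming scikit-learn and a logistic regression (any other simple classifier would do). The data is synthetic, generated with `make_classification`, so the features and class balance are made up and the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the country indicators (class 1 plays the role of "Corrupt").
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    weights=[0.35, 0.65], random_state=42,
)

# Simple train/test split; stratify keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fitted on the training data only.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=["NotCorrupt", "Corrupt"]))
```

With imbalanced classes, accuracy alone can be misleading, which is why the sketch prints the confusion matrix and per-class precision and recall.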
Question 4
Select two or three models (one of which must be a tree-based model). Tune their hyperparameters and/or evaluate their performance using cross-validation. How well do they perform compared to your baseline model?
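A hedged sketch of cross-validated tuning for the tree-based model, assuming scikit-learn's `GridSearchCV` with a decision tree. The data is again synthetic, and the parameter grid and the F1 scoring choice are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own features and outcome.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid: 3 x 3 = 9 candidate settings.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

# Stratified 5-fold cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=cv, scoring="f1"
)
search.fit(X_train, y_train)

print(search.best_params_)                        # best setting found by CV
print(round(search.best_score_, 3))               # mean CV F1 of that setting
print(round(search.score(X_test, y_test), 3))     # F1 on the held-out test set
```

The test set is held out from the search entirely, so its score can be compared with the baseline's test score on an equal footing.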
🤖 Using AI help?
You can use Generative AI tools such as ChatGPT when doing this research and search online for help. If you use one, however minimally, you are asked to report the AI tool you used and add an extra section to your notebook explaining how much you used it.
Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.
To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.