📝 W10 Summative
2024/25 Winter Term
⏲️ Due Date:
- 27 March 2025 at 5pm (London time)
If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.
Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧
🎯 Main Objectives:
- Demonstrate your ability to write a report in Quarto Markdown
- Demonstrate your ability to fit a linear/logistic regression model
- Demonstrate your ability to interpret and evaluate the performance of a linear/logistic regression model
- Demonstrate your understanding of supervised learning techniques
- Demonstrate your ability to defend your model choices
⚖️ Assignment Weight:
This assignment is worth 30% of your final grade in this course.
“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”
Source: LSE
📝 Instructions
Go to our Slack workspace's `#announcements` channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!

Click on the link, sign in to GitHub and then click on the green "Accept this assignment" button.

You will be redirected to a new private repository created just for you. The repository will be named `ds202w-2025-2024-w10-summative--yourusername`, where `yourusername` is your GitHub username. The repository will be private and will contain a `README.md` file with a copy of these instructions.

Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

Create your own `<CANDIDATE_NUMBER>.qmd` file with your answers, replacing the text `<CANDIDATE_NUMBER>` with your actual LSE candidate number.
You can create a `.qmd` file from a Jupyter notebook (i.e. a `.ipynb` file) via the VSCode Terminal. Make sure you are in the same directory as your Jupyter notebook (use the `pwd` command to check which directory you're in and the `cd` command to change directory if needed), then type the following command:

quarto convert <CANDIDATE_NUMBER>.ipynb

where `<CANDIDATE_NUMBER>.ipynb` is the name of the Jupyter notebook you want to convert into `.qmd`.

Also check out the Quarto documentation to better understand the conversion from `ipynb` to `qmd`. And check out this tutorial if you want to better understand the commands you can run in your VSCode terminal (e.g. to change the current directory).
You can also use the `.qmd` file you used in the W01 lab as a template. Just remove anything that is not relevant to this assignment.
Then, replace whatever is between the `---` lines at the top of your newly created `.qmd` file with the following:

---
title: "DS202W - W10 Summative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
Once again, replace the text `<CANDIDATE_NUMBER>` with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your `.qmd` file should start with:

---
title: "DS202W - W10 Summative"
author: 12345
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
Fill out the `.qmd` file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

Use the `#help` channel on Slack liberally if you get stuck.

Once you are done, click on the "Render" button at the top of the `.qmd` file. This will create an `.html` file with the same name as your `.qmd` file. For example, if your `.qmd` file is named `12345.qmd`, then the `.html` file will be named `12345.html`.

Ensure that your `.qmd` code is reproducible, that is, if we were to restart VSCode and run your notebook from scratch, from top to bottom, we would get the same results as you did.

Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
Read the section How to get help and how to collaborate with others at the end of this document.
“What do I submit?”
You will submit two files:
- A Quarto markdown file with the following naming convention: `<CANDIDATE_NUMBER>.qmd`, where `<CANDIDATE_NUMBER>` is your candidate number. For example, if your candidate number is 12345, then your file should be named `12345.qmd`.
- An HTML render of the Quarto markdown file. To generate a render, the easiest way is to include the lines

editor:
  render-on-save: true
  preview: true

in your `.qmd` header so that an HTML file is generated each time you preview your document (make sure you also have the Quarto extension installed in VSCode so that you can preview by clicking a button in the top right corner of the VSCode menu bar, without having to use the Terminal!). Also, don't forget to add the line

self-contained: true

to your `.qmd` header, otherwise none of your plots will show!

Your `.qmd` header should look something like this:
---
title: "✏️ W10 Summative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---
You don't need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.
🗄️ Get the data
What data will you be using?
You will be using two distinct datasets for this summative.
Parts 1 and 2
The dataset for this part comes from the replication package of an academic paper (Guan, Palma, and Wu 2024) that describes inflation, money and prices during the Yuan Dynasty era in China. The Yuan were a Mongol dynasty that ruled for only around a century (1260–1368), but they had a long-lasting influence on China's culture, economy, and politics. They were also the first political regime in history that pegged paper money to precious metals, and the first that deployed fiat money as the sole legal tender. The successive Yuan rulers each introduced their own version of paper money, initially pegged to a different value of silver (hence inflation!). At the beginning of the dynasty, the paper money was exchangeable for silver. From 1276, the money was no longer fully backed by silver and, in 1310, the government began to issue pure fiat money, just like we use today. For further details, have a look at the paper by Guan, Palma, and Wu (2024).
The dataset we have records an estimate of the consumer price index (CPI) from 1260 to 1355, as well as several factors that could have affected it, e.g. the number of disasters, external warfare, unification warfare, rebellions and total warfare, as well as the nominal money issues and imperial grants.1
📚 Preparation
- Download the data by clicking on the button below. 2
You can download the accompanying README file to check the meaning of the data variables:
Refer to (Guan, Palma, and Wu 2024) for more information on the dataset and its context.
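As footnote 2 mentions, the data comes as a Stata (`.dta`) file, which `pandas` can read directly. A minimal sketch of the round trip (the filename and columns below are purely illustrative, not the real dataset — use the file and variable names from the download and README):

```python
import pandas as pd

# Illustrative round-trip: write a tiny .dta file, then read it back.
# The real file is the Yuan dataset you download above; its name and
# variables will differ (check the accompanying README).
demo = pd.DataFrame({"year": [1260, 1261], "cpi": [100.0, 103.5]})
demo.to_stata("demo.dta", write_index=False)

# pd.read_stata loads a Stata file straight into a DataFrame
yuan = pd.read_stata("demo.dta")
print(yuan.dtypes)
```

`pd.read_stata` handles the proprietary format for you, so no conversion tools are needed.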
Part 3
In this part, we leave China and the Middle Ages behind and return squarely to the UK and the 20th/21st century, for a closer look at the setting of Bank of England interest rates. The Bank Rate is the single most important interest rate in the UK and has a wide-ranging impact on the economy as a whole. Set it too high and economic activity grinds to a halt, unemployment rates increase and the risk of recession/depression rises; set it too low and the economy overheats and inflation climbs too high (with a risk of walking or even galloping inflation). If you want to learn more about interest rates, you can visit the Bank of England's page.
The Bank of England sets interest rates periodically and relies on a number of indicators to do so. For the purposes of this exercise, you'll place yourself in the shoes of a financial analyst and try to predict, based on quarterly indicators (i.e. indicators in the three months up to the rate-setting meeting), whether the interest rate will go up, go down or stay the same.
The economic indicators we’re looking at are as follows:
| Indicator | Meaning | Source |
| --- | --- | --- |
| Consumer Confidence Index (CCI) | The CCI is a standardised confidence indicator providing an indication of future developments of households' consumption and saving. | OECD |
| Consumer Prices Index including owner occupiers' housing costs (CPIH, monthly estimates) | CPIH is the most comprehensive measure of inflation. | UK's Office for National Statistics |
| GDP (monthly estimates) | GDP is the standard measure of the value added created through the production of goods and services in a country during a certain period. | UK's Office for National Statistics |
| Exchange rates | We track GBP to EUR and GBP to USD monthly exchange rates. | Bank of England |
| 10-year gilt yields | A 10-year gilt yield is the return (or interest rate) that investors receive when they buy a 10-year UK government bond (gilt) and hold it to maturity: the annualised yield an investor would earn over the bond's life. Gilts are issued by the UK government to borrow money; a 10-year gilt matures in 10 years; the yield is the effective return from holding the bond. Gilt yields matter because they are a key indicator of market expectations for economic growth and inflation: rising yields suggest higher borrowing costs and possibly higher interest rates, while falling yields suggest lower interest rates and a preference for safer assets. Example: if the UK 10-year gilt yield is 4%, investors on average demand a 4% annual return over the next 10 years for lending money to the UK government; if the yield rises to 5%, investors require a higher return (possibly due to inflation fears or expectations of higher interest rates). | Federal Reserve Bank of St. Louis |
| Unemployment rate (aged 16 and over, seasonally adjusted): % | Unemployment measures people without a job who have been actively seeking work within the last four weeks and are available to start work within the next two weeks. | UK's Office for National Statistics |
These indicators are contained in the first of two separate datasets (the economic indicators dataset, which you can download below) and are recorded for every month from 01/01/1997 to 01/11/2024.
Aside from these indicators, you have a second (separate) dataset, obtained from the Bank of England, that records the Bank Rate (i.e. the base interest rate set by the Bank) from 06/05/1997 to 06/02/2025. It also contains a variable `rate_change` that indicates whether the interest rate went down (value -1), went up (value 1) or stayed the same (value 0) compared to the previous rate-setting event.
📚 Preparation
- Download the datasets by clicking on the buttons below.
📋 Your Tasks
What do we actually want from you?
Part 1: Show us your `pandas` and `lets_plot` muscles! (10 marks)
You don't need to use a chunk for each question. Feel free to organise your code and markdown for this part as you see fit.
1. Load the data into a data frame called `yuan`. Freely explore the data on your own.
2. This dataset comes in a clean format, but let's explore it a bit and derive some insights from it before we do any modeling with it:
   a. What are the years with the top 10 highest number of total wars? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?
   b. Similarly, what are the years with the top 10 highest number of disasters? How many of those years (and which ones) overlap with the years with the top 10 nominal money issues?
   c. Create a single plot that shows the evolution over time of CPI, total wars, disasters and nominal money issues (be mindful of variable scaling!). What does this plot tell you?
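The top-10 overlap questions above can be tackled with `pandas`' `nlargest` and a set intersection. A minimal sketch on synthetic data (the column names are assumptions — check the README for the real variable names):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real Yuan data; column names are hypothetical
rng = np.random.default_rng(42)
years = np.arange(1260, 1356)
yuan = pd.DataFrame({
    "year": years,
    "total_war": rng.integers(0, 12, len(years)),
    "money_issue": rng.random(len(years)) * 100,
})

# Years with the 10 highest total-war counts vs the 10 highest money issues
top_war = set(yuan.nlargest(10, "total_war")["year"])
top_money = set(yuan.nlargest(10, "money_issue")["year"])
overlap = sorted(top_war & top_money)
print(len(overlap), overlap)
```

The same pattern applies to disasters vs money issues; only the column name changes.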
Part 2: Create regression models (45 marks)
In this part, we focus on predicting `cpi`.
As in the previous section, you don't need to use a chunk for each question. Feel free to organise your code and markdown for this part.
Create a baseline linear regression model:

- Create the training and test sets:
  - create `yuan_train` to contain data before 1327
  - create `yuan_test` to contain data from 1327 onwards
- Now, using only the `yuan_train` data as a starting point, create a linear regression model that predicts the target variable (`cpi`).
- How well does your model perform? Just as in the week 3 lab, and drawing on insights from the lectures, use the residuals plot and a metric of your choosing to justify your reasoning. Can you explain the performance change between the training and test sets?
Now is your time to shine! Come up with your own feature selection, feature engineering or model selection strategy3 and try to get better model performance than you had before. Don't forget to validate your results using the appropriate resampling techniques!
Whatever you do, this is what we expect from you:

- Show us your code and your model.
- Explain your choices (of feature engineering, model selection or resampling strategy).
- Evaluate your model's performance. If you created a new model, compare it to the baseline model. If you performed a more robust resampling, compare it to the single train-test split you did in the previous question.
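For the "appropriate resampling" with ordered data like this, one option is scikit-learn's `TimeSeriesSplit`, which keeps every training fold strictly before its validation fold. A sketch on synthetic ordered data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic ordered data standing in for the chronologically sorted Yuan frame
rng = np.random.default_rng(1)
X = rng.random((96, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 96)

# Each fold trains only on observations that precede the validation block,
# so there is no leakage from the future into the past
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(scores)
```

Plain shuffled k-fold would let the model peek at future years, which is exactly the leakage the rubric penalises.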
Part 3: Create classification models (45 marks)
In this part, we'll focus on predicting `rate_change`, contained in the Bank of England interest rates dataset.
Before you do any classification, you'll need to do a bit of data processing:

- Download the Bank of England interest rates dataset and load it into a dataframe called `df`. Carefully inspect the dataset and check that rate-setting events occur every once in a while.
- Now download the economic indicators dataset into another dataframe. Your task is to assign to each row of `df` the values of the economic indicators from the last quarter, i.e. the average of each indicator over the three months up to the date of the rate-setting event. For example, if the rate-setting event is on 06/05/1997, you will need to average the GDP data for May 1997, April 1997 and March 1997, and do the same separately for the other indicators, i.e. exchange rates, 10-year gilt yield, unemployment rate, CPIH and CCI.
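One way to build this "last-quarter average" join is a three-month rolling mean on the monthly indicators followed by `pandas.merge_asof`. A sketch with a single hypothetical indicator (the real frame has CCI, CPIH, GDP, exchange rates, gilt yields and unemployment, and the dates below are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly indicator series (one column stands in for them all)
months = pd.date_range("1997-01-01", "1997-12-01", freq="MS")
indicators = pd.DataFrame({"date": months,
                           "gdp": np.arange(len(months), dtype=float)})

# Three-month rolling mean: each row now holds the average of that month
# and the two preceding months
indicators = indicators.sort_values("date")
indicators["gdp_q"] = indicators["gdp"].rolling(3).mean()

# Rate-setting events (dates and rate_change values are illustrative)
df = pd.DataFrame({"date": pd.to_datetime(["1997-05-06", "1997-11-06"]),
                   "rate_change": [1, 0]})

# Attach the latest quarterly average available on or before each event date
df = pd.merge_asof(df.sort_values("date"),
                   indicators[["date", "gdp_q"]], on="date")
print(df)
```

For the 06/05/1997 event this picks up the May-row average, i.e. the mean of the March, April and May values, matching the instruction above; repeat the rolling mean for each indicator column before the join.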
Create a baseline logistic regression model:
- Split your data into training and test sets (70% of the years for the training set - be careful with the ordering!)
- Use whatever metric you feel is most apt for this task to evaluate your model’s performance. Explain why you chose this metric.
- Explain what the regression coefficients mean in the context of this problem.
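The baseline steps above can be sketched as follows (synthetic data stands in for the processed frame, and the column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in: the real frame has the quarterly-averaged indicators
# plus the rate_change target (-1, 0 or 1)
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({"cpih_q": rng.normal(2.0, 1.0, n),
                  "gdp_q": rng.normal(0.5, 0.3, n)})
y = pd.Series(rng.choice([-1, 0, 1], n))

# Time-ordered split: first 70% of rows for training, rest for testing
# (no shuffling - the rows are ordered in time!)
cut = int(0.7 * n)
X_train, X_test = X.iloc[:cut], X.iloc[cut:]
y_train, y_test = y.iloc[:cut], y.iloc[cut:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = f1_score(y_test, clf.predict(X_test), average="macro")
print(score)
```

Macro F1 is just one candidate metric here; part of the task is arguing for whichever metric you pick (accuracy is rarely apt when one class, such as "no change", dominates).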
Now is your time to shine once again! Come up with your own feature selection, feature engineering and/or model selection strategy and try to get better model performance than you had before. Don't forget to validate your results using the appropriate resampling techniques!
Whatever you do, this is what we expect from you:

- Show us your code and your model.
- Explain your choices (of feature engineering, model selection or resampling strategy).
- Evaluate your model's performance. If you created a new model, compare it to the baseline model. If you performed a more robust resampling, compare it to the single train-test split you did in the previous question.
We've seen many models so far, and you might be tempted to show us every single model you know, particularly in the questions that call upon you to improve model performance. Don't!

Resist the siren calls🧜‍♀️ and make resolute model choices. Model selection is a skill! So DO NOT TRY EVERY SINGLE MODEL UNDER THE SUN to tackle the questions. State your modeling hypotheses clearly, justify your choices, and only choose a couple of models to try to solve the questions. You are obviously allowed to use dimensionality reduction techniques (e.g. PCA/UMAP) if you think they might help with your modeling (again, justify their use if you do use them!). But you don't have to use them. This summative is mainly about supervised learning techniques.
Simply lining up code without explanation will not get you high grades. We expect you to justify your modeling choices (e.g why did you choose to use a particular model in the particular context of the problem you’re solving? why is it uniquely suitable for the dataset/problem context? How did you set its parameters?) and to explain the model results and metrics in the context of the problem you’re dealing with.
✔️ How we will grade your work
If you follow all the instructions, you should expect a score of around 70/100. Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code4 or text will not get you a higher score; you need to add interesting insights or analyses to get a distinction.
⚠️ You will incur a penalty if you only submit a `.qmd` file and not also a properly rendered `.html` file alongside it!
Part 1: Show us your `pandas` and `lets_plot` muscles! (10 marks)
Here is a rough rubric for this part:
- <4 marks: You wrote some code but simply did not follow the instructions or didn’t interpret your plots or results at all.
- 4-6 marks: You might have made some mistakes when tallying the years in the overlap between the highest nominal money issues and the highest total wars/disasters, or your plot and conclusions for 2.c are not correct.
- 7-9 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- 10 marks: You did everything correctly, and your submission was perfect. Wow! Your code and markdown were well-organised, and your answers were concise and to the point.
Part 2: Create regression models (45 marks)
Here is a rough rubric for this part:
- <11 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised that we cannot understand what you did.
- 11-21 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (like not using linear regression).
- 22-33 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (e.g using variables that define others to predict them), or perhaps your analysis of your model was way off.
- 34-38 marks: Good; you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
- ~39 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- >39 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of linear regression and other models. We are likely to print a photo of your submission and hang it on the wall of our offices.
Part 3: Create classification models (45 marks)
Here is a rough rubric for this part:
- <11 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised that we cannot understand what you did.
- 11-21 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (like not using logistic regression).
- 22-33 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (e.g using variables that define others to predict them), or perhaps your analysis of your model was way off.
- 34-38 marks: Good; you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
- ~39 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
- >39 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function and other models. We are likely to print a photo of your submission and hang it on the wall of our offices.
How to get help and how to collaborate with others
🙋 Getting help
You can post general coding questions on Slack but should not reveal code that is part of your solution.
For example, you can ask:
- "Does anyone know how I can create a logistic regression in `scikit-learn` with a `Pipeline`?"
- "Has anyone figured out how to do time-aware cross-validation?"
- "I tried using something like `pd.query("Date>'1997-05-06'")` but then I got an error" (reproducible example)
- "Does anyone know how I can create a new variable that is the sum of two other variables?"
You are allowed to share 'aesthetic' elements of your code if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic `df` as the data frame, but you should not share the code for the data wrangling that led to the creation of `df`.
If we find that you posted something on Slack that violates this principle without realising it, you won't be penalised for it - don't worry - but we will delete your message and let you know.
👯 Collaborating with others
You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others — pretty much the same rules as above.
🤖 Using AI help?
You can use Generative AI tools such as ChatGPT when doing this assignment, as well as search online for help. If you use such a tool, however minimal the use, you are asked to report which AI tool you used and add an extra section to your notebook explaining how much you used it.
Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.
To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.
References
Footnotes
Imperial grants could be fixed annual grants or occasional (ad hoc) grants. According to (Guan, Palma, and Wu 2024):
Fixed annual grants were usually set each year during an emperor’s reign. There were also occasional imperial grants, typically occurring when an emperor was enthroned, when other Mongol nobles presented themselves before an emperor, or when emperors rewarded imperial bodyguards or other soldiers for their service. The early reign of Kublai coincided with the period of the strict silver standard (1260–75). During this time, imperial grants remained low and stable, with fixed grants occupying most of the total. However, Kublai granted more occasional grants from 1285 until the last year of his reign in 1294. Kublai’s successors typically kept fixed imperial grants low, but occasionally they granted substantial grants. One of the primary reasons for this change was because of the politically volatile imperial power. From 1294 to 1333, nine emperors ascended the throne, each having reigned for an average of 4 years and some having been assassinated. To win the support of the Mongol nobles, emperors resorted to ad hoc grants. During the reign of the last Yuan emperor, Toghon Temür (enthroned in 1333), the tendency to issue substantial grants gradually ceased. Our findings are consistent with qualitative historical documentation and other studies on the Yuan’s imperial grants. The composition of the imperial grants over time […] shows that paper money constituted the largest part of the imperial grants throughout the Yuan dynasty. While silver grants occupied approximately one-tenth of the total grants in the early Yuan period, their proportion diminished as the government loosened the silver standard.
↩︎

Note that this is a Stata file (you will encounter such proprietary files in "the wild" once in a while as a data scientist!). Check the `pandas` documentation on how to read Stata files.↩︎

Feature engineering is creating new variables from existing ones. For example, you could create a new variable that results from a mathematical transformation of an existing variable. Or you could enrich your dataset with some other publicly available data. Hint: don't forget you are dealing with time series data😉.↩︎

Hint: don't just write code, especially uncommented chunks of code. It won't get you very far. You need to explain the code results, interpret them and put them in context.↩︎