πŸ“ W08 Summative

2025/26 Autumn Term

Author

⏲️ Due Date:

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

🎯 Main Objectives

βš–οΈ Assignment Weight:

This assignment is worth 30% of your final grade in this course.


Important: Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

πŸ“ Instructions

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled 📝 W08 Summative. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  3. You will be redirected to a new private repository created just for you, named ds202a-2025-2026-w08-summative-yourusername, where yourusername is your GitHub username. It will contain a README.md file with a copy of these instructions.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE number.

    For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

    ---
    title: "DS202A - W08 Summative"
    author: <CANDIDATE_NUMBER>
    format: html
    self-contained: true
    ---

    Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

    ---
    title: "DS202A - W08 Summative"
    author: 12345
    format: html
    self-contained: true
    ---
  7. Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

  8. Once you are done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

    Ensure that your .qmd code is reproducible; that is, if we were to restart R and RStudio and run your notebook from scratch, from top to bottom, we would get the same results as you did.
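    One way to help keep your notebook reproducible is a minimal setup chunk near the top of the .qmd that loads your packages and fixes the random seed (the seed value below is arbitrary; any fixed number works):

    ```r
    # Setup chunk: load packages and fix the random seed so that
    # train/test splits and resampling give the same results on every run.
    library(tidyverse)
    library(tidymodels)

    set.seed(123)  # keep this fixed so a fresh run reproduces your results
    ```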

  9. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

  10. Read the section How to get help and how to collaborate with others at the end of this document.

“What do I submit?”

You will submit two files:

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML render of the Quarto markdown file (your file should be self-contained, i.e. all figures should be embedded in it).

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

πŸ—„οΈ Get the data

What data will you be using?

You will be using a single dataset throughout this summative.

Your dataset for this summative is the Sleep and Lifestyle Dataset (Kaggle, 2019) available on Kaggle. This is a synthetic dataset whose features include: age, gender, occupation, sleep duration (hours), quality of sleep (scale: 1-10), physical activity level (minutes/day), stress level (scale: 1-10), BMI category, blood pressure (systolic/diastolic), heart rate (bpm), daily steps, and sleep disorder.

📚 Preparation

Download the data by clicking on the button below.

📋 Your Tasks

What do we actually want from you?

Important: General guidelines
  1. Model selection matters: You should make clear, justified choices about which models to use. Resist the temptation to try every model you know! State your modelling hypotheses clearly, justify your choices, and select only a few models to explore (more models does not mean a better grade!).
  2. Dimensionality reduction: You may use PCA, UMAP, or similar techniques if you think it helps your modelling. You must justify their use. These are optional.
  3. Justification is crucial: Simply presenting code without explanation will not get you high marks. In fact, simply lining up code with little or no explanation is a fail. You must justify your modelling choices (e.g., why a model is suitable for the dataset and problem context, how parameters were set), and interpret results and metrics in context.
  4. Evaluation: For all models, evaluate performance and justify your choice of metrics or evaluation strategy. Do not assume a single metric suffices for all conclusions.

Part 1: Data Exploration (15 marks)

Q1.1 Load the dataset into a DataFrame sleep.
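To get started, a minimal sketch, assuming you saved the Kaggle CSV as data/sleep.csv (the path and file name are assumptions; adjust them to wherever you stored the download):

```r
library(tidyverse)

# Read the CSV into a data frame called `sleep`; the path is an assumption.
sleep <- read_csv("data/sleep.csv")

glimpse(sleep)  # quick look at column names and types
```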

Q1.2 Explore and visualise the data:

  • Identify which features appear most strongly correlated with sleep duration.
  • Examine differences by demographic subgroups (e.g., gender, age groups, occupation).
  • Create at least one multi-variable plot showing sleep duration against two predictors (e.g., physical activity and heart rate).
  • Identify missing data or anomalies and briefly discuss potential implications.
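One way to sketch the multi-variable plot, assuming the columns are named sleep_duration, physical_activity_level, and heart_rate after cleaning (your column names may differ):

```r
library(ggplot2)

# Sleep duration against physical activity, with heart rate mapped to colour.
ggplot(sleep, aes(x = physical_activity_level, y = sleep_duration,
                  colour = heart_rate)) +
  geom_point(alpha = 0.6) +
  labs(x = "Physical activity (minutes/day)",
       y = "Sleep duration (hours)",
       colour = "Heart rate (bpm)")
```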

Part 2: Create regression models (35 marks)

In this part, we focus on predicting sleep_duration (hours).

Q2.1 Split the data into training and test sets. Consider how your split might affect model performance.
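With tidymodels, a split along these lines is a reasonable starting point (the 75/25 proportion is a common default, not a requirement; justify whatever you choose):

```r
library(tidymodels)

set.seed(123)  # fix the seed so the split is reproducible
sleep_split <- initial_split(sleep, prop = 0.75)
sleep_train <- training(sleep_split)
sleep_test  <- testing(sleep_split)
```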

Q2.2 Fit a baseline linear regression model on the training set.
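A baseline linear regression in tidymodels could be sketched as below; the predictor set is purely illustrative (choose and justify your own):

```r
library(tidymodels)

# Baseline: linear regression of sleep duration on a few predictors.
lm_spec <- linear_reg() %>% set_engine("lm")

lm_fit <- lm_spec %>%
  fit(sleep_duration ~ physical_activity_level + stress_level + heart_rate,
      data = sleep_train)

tidy(lm_fit)  # coefficient estimates and standard errors
```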

Q2.3 Evaluate your model performance and justify your choice of evaluation approach and metrics.
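For example, assuming a fitted parsnip model lm_fit and a held-out set sleep_test, regression metrics can be computed with yardstick (which metrics to report, and why, is for you to argue):

```r
library(tidymodels)

# Predict on the test set and compute a small set of regression metrics.
preds <- predict(lm_fit, new_data = sleep_test) %>%
  bind_cols(sleep_test)

reg_metrics <- metric_set(rmse, mae, rsq)
reg_metrics(preds, truth = sleep_duration, estimate = .pred)
```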

Q2.4 Explore ways to improve your model. You might experiment with:

  • Feature transformations or new derived variables
  • Alternative modelling approaches
  • Adjustments to the training/test split or resampling strategy

Justify your modelling choices.

Q2.5 Discuss your results:

  • Which predictors appear most influential?
  • How does the model perform across different subgroups or conditions?
  • Explain plausible reasons for performance changes

Part 3: Classification (35 marks)

Our task in this part is to predict whether a participant achieves adequate vs. inadequate sleep (define a reasonable threshold and justify it).

Q3.1 Split your data into training and test sets (consider if stratification is useful).
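If the two classes are imbalanced, stratifying the split on the outcome keeps the class proportions similar in both sets. A sketch, assuming a binary factor column adequate_sleep has already been created:

```r
library(tidymodels)

set.seed(123)
class_split <- initial_split(sleep, prop = 0.75, strata = adequate_sleep)
class_train <- training(class_split)
class_test  <- testing(class_split)
```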

Q3.2 Fit a baseline logistic regression classifier. Explain the model coefficients.
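A baseline logistic regression might look like the following (again assuming an adequate_sleep factor outcome; the predictors are placeholders):

```r
library(tidymodels)

log_spec <- logistic_reg() %>% set_engine("glm")

log_fit <- log_spec %>%
  fit(adequate_sleep ~ stress_level + physical_activity_level + heart_rate,
      data = class_train)

# Exponentiating the coefficients gives odds ratios, which are
# often easier to interpret than log-odds.
tidy(log_fit, exponentiate = TRUE)
```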

Q3.3 Evaluate your model and justify your choice of evaluation approach and metrics.
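Classification metrics could be computed along these lines, assuming a fitted classifier log_fit and test set class_test; whether a single metric such as accuracy suffices is exactly the question to discuss. Note that the probability column name (.pred_Adequate here) depends on your factor levels and is an assumption:

```r
library(tidymodels)

class_preds <- predict(log_fit, new_data = class_test) %>%
  bind_cols(predict(log_fit, new_data = class_test, type = "prob")) %>%
  bind_cols(class_test)

conf_mat(class_preds, truth = adequate_sleep, estimate = .pred_class)
roc_auc(class_preds, truth = adequate_sleep, .pred_Adequate)
```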

Q3.4 Explore ways to improve the classifier. You might experiment with:

  • Feature engineering/selection
  • Alternative classification algorithms
  • Treatment of categorical variables

Q3.5 Interpret your findings:

  • Which features appear most predictive?
  • What practical implications can you draw?

Part 4: Reflection (15 marks)

Q4.1 Discuss ethical considerations of predicting sleep outcomes from lifestyle data.

Q4.2 Identify potential sources of bias or confounding.

Q4.3 Suggest additional variables or modelling approaches that could strengthen predictions.

Q4.4 Consider how findings could be communicated responsibly to a non-technical audience.

βœ”οΈ How we will grade your work

If you follow all the instructions, you should expect a score of around 70/100. Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code¹ or text will not get you a higher score; you need to add interesting insights or analyses to get a distinction.

⚠️ You will incur a penalty if you only submit a .qmd file and not also a properly rendered .html file alongside it!

Part 1: Data Exploration (15 marks)

  • <5 marks: A deep fail. Minimal or no code, or code/plots with no accompanying interpretation. Exploratory steps largely missing or completely incorrect. Correlations and subgroup comparisons not attempted; plots absent, meaningless, or completely unannotated.
  • 5–7 marks: A fail. You wrote some code and/or produced plots, but important aspects of the instructions were ignored (e.g., correlations miscalculated, subgroup comparisons missing, or plots unclear). Interpretations of plots are largely absent or incorrect.
  • 8–11 marks: You made some critical mistakes or did not complete all tasks. Correlations, subgroups, or visualisations partially incorrect. Plot annotations may be missing or inappropriate. Explanations and interpretations provided but sometimes unclear, incomplete, or not fully justified.
  • 12–13 marks: Good. Most tasks completed correctly. Code and visualisations mostly accurate. Plot annotations generally appropriate. Interpretations are clear, mostly concise, and demonstrate solid understanding. Minor organisation issues or minor gaps in insight.
  • ~14 marks: You did everything correctly as instructed. Code and markdown well organised. Visualisations accurate and well-annotated. Interpretations clear, insightful, and mostly concise. Minor room for improvement in explanation depth or presentation.
  • >14 marks: Impressive! All tasks completed correctly and elegantly. Code and markdown highly organised. Visualisations informative, appropriately annotated, and clearly interpreted. Explanations demonstrate deep understanding, critical insight, and concise communication.

Part 2: Regression (35 marks)

  • <9 marks: A deep fail. Minimal or no code, or code/markdown so disorganised we cannot understand what you did. Baseline regression missing or completely incorrect. No evaluation or interpretation of results.
  • 9–17 marks: A fail. Some code and/or text present but major instructions ignored (e.g., baseline regression not implemented, train/test split poorly handled, evaluation absent or unjustified). Interpretations largely missing or incorrect.
  • 18–26 marks: You made some critical mistakes or did not complete all tasks. Preprocessing errors, partial data leakage, or model mis-specification may be present. Evaluation metrics may be misapplied or unjustified. Interpretations of model results are partially incorrect, incomplete, or lack clear justification.
  • 27–31 marks: Good. Most tasks completed correctly. Minor issues in code, organisation, or explanations. Evaluation and interpretations mostly correct but may be slightly incomplete or not fully justified.
  • ~32–33 marks: You did everything correctly as instructed. Code and markdown well organised. Baseline and improved models correctly implemented. Evaluation metrics chosen appropriately, and results interpreted clearly. Minor room for conciseness or depth of insight.
  • >33 marks: Impressive! All tasks completed correctly and elegantly. Code and markdown highly organised. Modelling choices clearly justified. Evaluation metrics chosen and interpreted appropriately. Interpretations and insights demonstrate deep understanding and critical thinking in context.

Part 3: Classification (35 marks)

  • <9 marks: A deep fail. Minimal or no code, or code/markdown so disorganised we cannot understand what you did. Logistic regression baseline missing or completely incorrect. No evaluation or interpretation.
  • 9–17 marks: A fail. Some code/text present but major instructions ignored (e.g., baseline logistic regression not implemented, train/test split poorly handled, evaluation absent or unjustified). Interpretations of model coefficients or predictions largely missing or incorrect.
  • 18–26 marks: You made some critical mistakes or did not complete all tasks. Preprocessing errors, partial data leakage, or mis-specification of the logistic regression model may be present. Evaluation metrics may be misapplied or unjustified. Interpretations of coefficients and predictions are partially incorrect or incomplete.
  • 27–31 marks: Good. Most tasks completed correctly. Minor issues in code, organisation, or explanation. Baseline and improved models mostly correct. Evaluation and interpretations generally accurate but may lack full justification or insight.
  • ~32–33 marks: You did everything correctly as instructed. Code and markdown well organised. Logistic regression baseline correctly implemented. Improvements explored appropriately. Evaluation metrics chosen and interpreted clearly. Coefficients and predictions interpreted accurately and meaningfully.
  • >33 marks: Impressive! All tasks completed correctly and elegantly. Code and markdown highly organised. Modelling choices clearly justified. Evaluation metrics and model coefficients interpreted with depth and insight. Interpretations demonstrate critical thinking and contextual understanding.

Part 4: Reflection (15 marks)

  • <5 marks: A deep fail. Minimal or no reflection provided. Ethical considerations, limitations, and interpretation of results not discussed.
  • 5–7 marks: A fail. Some reflection present, but major aspects ignored (e.g., ethical issues, biases, or communication considerations). Explanations superficial or largely incorrect.
  • 8–11 marks: You made some critical omissions or your discussion was incomplete. Ethical considerations, potential biases, or limitations partially discussed. Interpretation and communication insights partially incorrect, unclear, or not fully justified.
  • 12–13 marks: Good. Most reflection tasks addressed correctly. Ethical, bias, and limitation discussions mostly accurate. Interpretation and communication insights generally clear, but minor gaps in depth or justification.
  • ~14 marks: All reflection tasks completed correctly. Discussion of ethical considerations, bias, limitations, and communication clear and mostly concise. Minor room for deeper insight or more precise justification.
  • >14 marks: Impressive! Reflection is thorough, insightful, and well-articulated. Ethical considerations, biases, limitations, and communication strategies are clearly and critically discussed. Demonstrates deep understanding of the context and responsible use of modelling.

How to get help and how to collaborate with others

🙋 Getting help

You can post general coding questions on Slack but should not reveal code that is part of your solution.

For example, you can ask:

  • “Does anyone know how I can create a logistic regression in tidymodels without a recipe?”
  • “Has anyone figured out how to do time-aware cross-validation, grouped per country?”
  • “I tried using something like df %>% mutate(col1 = col2 + col3) but then I got an error” (a reproducible example)
  • “Does anyone know how I can create a new variable that is the sum of two other variables?”

You are allowed to share ‘aesthetic’ elements of your code if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.

If we find that you posted something on Slack that violates this principle without realising it, you won’t be penalised (don’t worry), but we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others; pretty much the same rules as above apply.

🤖 Using AI help?

You may use Generative AI tools such as ChatGPT for this assignment, and you may search online for help. If you use such a tool, however minimal the use, you must report which tool you used and add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to produce formulaic and repetitive responses, which limits your chances of getting a high mark. When it comes to coding, these tools often generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.

Footnotes

  1. Hint: don’t just write code, especially uncommented chunks of code. It won’t get you very far. You need to explain the code results, interpret them and put them in context. ↩︎