📝 W11+1 Summative

2025-2026 Autumn Term

Author

Dr. Ghita Berrada

💡 Notes:
  1. You should only provide code if it adds value to your answer: make sure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.
  2. You are free to choose the dataset(s) on which you do the analysis. You may use one or two datasets.
  3. Given that the datasets are large, you are allowed to use a subsample of the datasets. What matters is that the analysis serves as a proof-of-concept of your approach. If you choose to sample the data, simply explain how and why you did the sampling and what effect you think it might have on your results.
  4. If you use any method not seen in the course, justify why you’re using it and explain how it works.
  5. All analyses should be conducted in R using tidyverse/tidymodels/quanteda where practicable.

⏲️ Due Date: Friday, December 19th, 5pm

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Do you have an extenuating circumstance and need an extension? Send an e-mail to 📧

⚖️ Assignment Weight:

This assignment is worth 30% of your final grade in this course.


Important: Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

📝 Instructions

👉 Read these instructions carefully, as some details might change from one assignment to another.

  1. Go to our Slack workspace’s #ds202a_central channel to find a GitHub Classroom link (the message announcing the summative will be pinned to the channel). Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2025-w11p1-summative--yourusername, where yourusername is your GitHub username.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create your own <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number. You can use the .qmd file you used in previous labs as a template.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

---
title: "DS202A - W11+1 Summative"
author: <CANDIDATE_NUMBER>
format: html
self-contained: true
editor:
  render-on-save: true
  preview: true
---

Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER.

  7. Fill out the .qmd file with your answers. Use headers and code chunks to keep your work organised.

  8. Use the #help channel on Slack liberally if you get stuck.

  9. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file.

    • If you added any code, ensure your .qmd code is reproducible. That is, if we were to restart VSCode and run your notebook from scratch, from top to bottom, we would get the same results as you did.
  10. Push both files (i.e., .qmd and rendered HTML) to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

  11. Read the section How to get help and collaborate with others at the end of this document.

“What do I submit?”

You will submit two files:

  • A Quarto markdown file: <CANDIDATE_NUMBER>.qmd
  • An HTML file render: <CANDIDATE_NUMBER>.html

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub.

Warning: Your HTML must be self-contained

Your rendered HTML must be self-contained (i.e., all images, plots, scripts, and styles must be embedded in the HTML file).
HTML files that are not self-contained will receive a penalty because they cannot be viewed offline or archived reliably.

🗄️ The Data

For this assignment, you have a choice of four datasets. You may use one dataset for both parts or different datasets for each part.

Note: Reproducibility and data handling

Do not upload your dataset(s) to GitHub Classroom.
Large data files should not be tracked in version control and may cause issues with your repository.

Instead, you must provide clear reproducibility instructions in your .qmd file(s), explaining:

  • where the data would be placed in the repository (e.g., data/my_dataset.csv),
  • how the file would be named,
  • how your code would load it (a minimal loading sketch is shown after this note),
  • and any additional steps needed to reproduce your analysis.

These instructions must appear in both Part A and Part B wherever relevant.
They ensure your work is fully reproducible without committing the dataset itself.
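
As an illustration, a minimal loading chunk might look like this. The file name data/my_dataset.csv is the hypothetical placeholder from the example above, not a file we provide:

```r
# Load the dataset from a relative path inside the repo.
# NOTE: data/my_dataset.csv is a hypothetical placeholder; document
# the real download source and file name in your reproducibility notes.
library(readr)

my_data <- read_csv("data/my_dataset.csv")
```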

Dataset 1: UN General Debate Corpus (1946-2024)

Download: Harvard Dataverse

The UN General Debate Corpus contains speeches from the General Debate of the United Nations General Assembly from 1946 to 2024. This corpus includes over 10,000 speeches from 202 countries.

Each year, leaders or representatives from UN member states deliver speeches addressing global issues, national priorities, and international relations. The dataset includes the full text of these speeches along with metadata such as country, year, session number, and speaker information.

🔍 Technical note: Speeches are available in English. Older speeches (1946-1990s) were originally delivered in English; more recent speeches may be translations.

Dataset 2: Reddit Climate Change Discussion Dataset (2024)

Download: Hugging Face

This dataset contains approximately 80,400 comments from climate-related subreddits collected in February 2024. The data comes from eight subreddits: r/Climate, r/Energy, r/RenewableEnergy, r/ClimateChange, r/ClimateActionPlan, r/Environment, r/Sustainability, and r/ZeroWaste.

The dataset includes the full comment text, along with metadata such as subreddit, post title, comment author, timestamp, upvotes, and the hierarchical structure of comments and replies.

🔍 Technical note: The dataset has a nested structure (posts → comments → replies). You may need to flatten this structure depending on your analysis approach.
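
If the data arrives as nested JSON, one possible flattening step is sketched below. This is only an illustration: the file name and the comments list-column are assumptions about the structure, so inspect the actual data (e.g., with glimpse()) before adapting it.

```r
library(jsonlite)
library(tidyr)
library(dplyr)

# Hypothetical file name; the real one depends on how you download the data.
posts <- fromJSON("data/reddit_climate.json") |> as_tibble()

# Flatten one level of nesting: one row per comment, with each comment's
# fields spread into columns. `comments` is an assumed list-column name.
comments_flat <- posts |>
  unnest_longer(comments) |>
  unnest_wider(comments, names_sep = "_")
```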

Dataset 3: CEO Departures Dataset (Tidy Tuesday)

Download: Tidy Tuesday GitHub

This dataset contains 9,423 CEO departure events from S&P 1500 firms between 1987 and 2020. Each record includes a narrative text field describing the circumstances of the departure, along with structured data about the departure reason, company, industry, and whether the departure was voluntary or involuntary.

The dataset was compiled from various sources including news reports, company filings, and databases tracking executive movements.

🔍 Technical note: The dataset combines narrative text with structured categorical and numerical variables.

Dataset 4: GDPR Violations Dataset (Tidy Tuesday)

Download: Tidy Tuesday GitHub

This dataset contains 250 fines issued for violations of the European Union’s General Data Protection Regulation (GDPR) between May 2018 and the date the dataset was compiled. Each violation includes a summary text describing the infringement, along with data on the country, the fine amount, the specific GDPR articles violated, and the date.

The GDPR is the EU’s comprehensive data protection law that came into effect in May 2018, giving individuals greater control over their personal data and imposing strict obligations on organizations that process such data.

🔍 Technical note: The dataset is relatively small (250 cases) but each case has rich textual description. Fine amounts vary dramatically (from €0 to €50+ million).

📋 Your Task

Warning: 🎯 Quality over quantity

Do not try every method you know. Choose your methods thoughtfully and go deep with analysis and interpretation. We value:

  • Focused analysis with well-justified choices over scattered attempts at multiple methods
  • Depth of insight over breadth of techniques
  • Decisive methodology with clear reasoning

A submission using 2-3 well-chosen methods with deep analysis will score higher than one trying 6-7 methods superficially.

Also, aim for clear and concise writing. Avoid unnecessary filler or overly long paragraphs: each paragraph should develop a single idea and contribute directly to your analysis.

Part A: Similarity and Anomalies (40 marks)

Scenario: You are a data analyst working for a research organization studying your chosen dataset. Your manager has asked you to provide insights into both the typical patterns and exceptional cases in the data. They want to understand: what’s normal, what’s unusual, and why both perspectives matter. Your manager will use your findings to brief senior stakeholders, so clarity and insight matter.

Your task: Using your chosen dataset, investigate both typical and atypical documents by:

  1. Identifying documents that are similar to each other
  2. Identifying documents that are anomalous or unusual
  3. Comparing what these two perspectives reveal about your data

Your analysis should use appropriate text mining methods and two techniques: one for assessing similarity and one for detecting anomalies. Explain what you discover from each approach and what the comparison teaches us about the social world.
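
As a purely illustrative sketch of one way to operationalise these ideas (assuming a hypothetical quanteda corpus my_corpus built from your chosen dataset; this is not a prescribed approach):

```r
library(quanteda)
library(quanteda.textstats)

# Build a tf-idf weighted document-feature matrix from a hypothetical corpus.
dfmat <- my_corpus |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_tfidf()

# Similarity: pairwise cosine similarity between documents.
sims <- textstat_simil(dfmat, method = "cosine", margin = "documents")

# One crude anomaly heuristic: documents with the lowest mean similarity
# to the rest of the corpus (self-similarity inflates the mean slightly).
avg_sim <- rowMeans(as.matrix(sims))
head(sort(avg_sim), 5)
```

Whether cosine similarity over tf-idf features is the right notion of “similar” for your data, and whether low average similarity is a sensible notion of “anomalous”, are exactly the kinds of decisions the note below asks you to justify.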

Important: What you need to decide
  • What does “similarity” mean in the context of your dataset? How will you measure it?
  • What makes a document “anomalous” in your data? How will you identify unusual cases?
  • Why are these definitions and methods appropriate for your specific dataset and analysis goals?

These are analytical decisions you must make and justify.

Part B: Research Question (60 marks)

Scenario: Your research organization is preparing a report on your chosen dataset for relevant stakeholders (e.g., policymakers, business leaders, advocacy organizations, or other decision-makers appropriate to your data). You’ve been asked to investigate a specific question using data-driven methods and present findings that will inform their understanding or decisions.

Your task: Formulate and investigate a research question using unsupervised learning methods. Your question should be distinct from the similarity/anomaly analysis in Part A.

Your response must address:

  • Your research question: What are you investigating and why does it matter to your stakeholders?
  • Your approach: Which methods did you use and why? How do they address your question?
  • Your findings: What did you discover and what does it mean in context? Provide concrete examples from your data.
  • Limitations: What are the limitations of your approach? What assumptions did you make? What cautions should stakeholders have when interpreting your findings?
  • Ethical considerations: Are there any ethical issues with your data or methods? How might your findings be misused or misinterpreted? Who might benefit or be harmed?
  • Why unsupervised learning?: Explain clearly why unsupervised learning is appropriate for your research question and dataset. Why was supervised learning not used here? Is supervised learning feasible for the question you’re trying to answer? If supervised learning is not feasible, explain why; if it is, explain why unsupervised is still the better choice.

Note: Your Part B analysis should use different primary methods from Part A (e.g., clustering, topic modeling, dimensionality reduction rather than similarity/anomaly detection). However, you may build on Part A findings if it makes sense for your research question. For example, you might use Part A’s anomaly findings to motivate your Part B question, or use similarity patterns from Part A as one input to a broader analysis in Part B. Well-justified connections between parts will be rewarded, not penalized.

Tip: Guidance on methods

You need at least TWO unsupervised learning techniques (e.g., topic modeling + clustering, or clustering + anomaly detection, or topic modeling + hierarchical clustering).

Note: Dimensionality reduction (PCA, UMAP, MCA, etc.) doesn’t count as one of your two required techniques; it’s a preprocessing/visualization step. So “PCA + k-means” counts as ONE technique (clustering). You’d need to add another method (e.g., PCA + k-means + topic modeling, or PCA + k-means + DBSCAN); see the sketch after this note.

Choose methods that work together to answer your question. Don’t just apply multiple methods independently - show how they complement each other in your analysis. Focus on integration and insight over trying many techniques.
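
To make the counting rule concrete, here is a minimal sketch. Here feats is a hypothetical numeric document-feature matrix, and the parameter choices (10 components, k = 4) are arbitrary illustrations you would need to justify:

```r
set.seed(202)

# Preprocessing/visualization step: does NOT count as a technique.
pca_out <- prcomp(feats, scale. = TRUE)
reduced <- pca_out$x[, 1:10]  # arbitrary choice of 10 components

# Technique #1: k-means clustering on the reduced space (k = 4 is arbitrary).
km <- kmeans(reduced, centers = 4, nstart = 25)
table(km$cluster)

# A second technique (e.g., topic modeling or DBSCAN) is still needed
# to meet the two-technique requirement.
```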


Note: What we’re assessing

Part A assesses: Your ability to work with similarity measures and anomaly detection, integrate multiple methods, compare different analytical perspectives, and communicate insights clearly.

Part B assesses: Your ability to formulate research questions, select appropriate methods, interpret findings with concrete examples, and think critically about limitations and ethics.

Across both parts: We value focused, decisive analysis over trying every method. Choose your methods thoughtfully, justify your choices, tie them to your narrative and dataset, and provide concrete examples from your findings. Quality and insight matter more than quantity.

✔️ How we will grade your work

A score of 70% or above is considered a distinction under typical LSE expectations. This means that you can expect to score around 70% if you provide adequate responses to both parts, in line with the learning outcomes of the course and the instructions provided.

Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not help. You need to provide unique insights or analyses that make us go, “wow, that’s a great idea! I hadn’t thought of that”.

Here is a detailed rubric of how we will grade your answers. Note that the rigor of our marking varies with the expected difficulty of the question.

Part A: Similarity and Anomalies (40 marks)

  • >29 marks: Your response, besides being accurate, well-explained, effectively formatted, and concise, surprised us positively. You might have devised a custom similarity measure logically justified for your chosen dataset, or used sophisticated anomaly detection methods with clear rationale. Most importantly, your comparison of what similarity vs. anomaly perspectives reveal is insightful and non-obvious. You provided concrete examples from your data - specific documents or groups - showing exactly what makes them similar or anomalous and what this reveals about social dynamics. Your code (when provided) is well-documented and your plots are excellent. You tied everything together into a coherent narrative about typical and exceptional patterns in your dataset.

  • 29 marks: Your response is well-structured and covers all requirements concisely. You applied appropriate similarity and anomaly detection methods with good justification specific to your dataset (not generic). Your comparison clearly explains what each perspective reveals and how they complement each other. You provided concrete examples from your data to illustrate your findings. Your interpretation focuses on what the patterns mean for your specific dataset, not generic observations. Code (when provided) is well-documented. Markdown formatting is excellent.

  • 23-28 marks: Your response is accurate and relevant. You applied similarity and anomaly detection methods appropriately. You included some comparison of the two perspectives, though it could be deeper. You provided some concrete examples but could have more detail. Some justifications are vague or generic. Formatting could be improved. Minor gaps in explanation.

  • 17-22 marks: Your response has significant issues. Methods are applied but poorly justified. The comparison between similarity and anomaly perspectives is weak or superficial - more like two separate analyses without real integration. Examples from data are sparse or generic. Interpretation doesn’t connect well to the specific dataset. Formatting is poor. Missing crucial details.

  • 16 marks: A pass. Methods are barely applied. Little to no comparison of perspectives. No concrete examples from data. Generic observations that could apply to any dataset. Response resembles content copy-pasted from the web or an AI chatbot without adaptation.

  • <16 marks: No answer provided, response is very inadequate, or missing key required elements (e.g., no anomaly detection, no comparison).

Part B: Research Question (60 marks)

  • >43 marks: Your response, besides being correct, precise, excellently formatted, and well-explained, formulated a compelling research question distinct from Part A with clear stakeholder relevance. You applied sophisticated unsupervised learning methods (at least two distinct techniques) with excellent justification for why these methods suit your question and dataset. Most importantly, you provided detailed analysis with concrete examples - specific findings from your data with document excerpts, cluster descriptions, topic examples, or pattern illustrations. Your interpretation ties methods to the narrative and shows deep understanding of what the analysis reveals about social phenomena. Your limitations discussion is honest and specific (not generic “more data would help”), showing what your methods capture and what they miss. Your ethics discussion is thoughtful and considers multiple dimensions. Code (when provided) aligns with course style, is well-documented, and produces insights beyond basic outputs.

  • 43 marks: Your response is well-structured and covers all requirements concisely. Research question is clear, appropriate for unsupervised learning, and distinct from Part A. You explained clearly why unsupervised rather than supervised learning is appropriate. Methods are well-chosen and justified specifically for your dataset and question. You provided concrete examples from your findings - showing specific patterns, clusters, topics, or anomalies you discovered with illustrative detail. Interpretation is dataset-specific and avoids generic explanations. Limitations discussion is thoughtful and specific. Ethics discussion shows genuine engagement. Code (when provided) is well-documented. Markdown formatting is excellent.

  • 34-42 marks: Your response is accurate and relevant. Research question is appropriate, though may have minor overlap with Part A. Methods are applied competently with reasonable justification. You provided some concrete examples but could have more detail or specificity. Interpretation is mostly dataset-specific with some generic elements. Limitations are addressed but could be more specific. Ethics is discussed but somewhat superficial. Some minor gaps in explanation or formatting.

  • 25-33 marks: Your response has significant issues. Research question may substantially overlap with Part A or not be well-suited to unsupervised learning. Methods are applied but poorly justified or don’t clearly connect to the question. Few concrete examples from data; mostly generic observations. Interpretation is superficial and could apply to any dataset. Limitations are generic platitudes. Ethics is perfunctory or checkbox exercise. Formatting is poor. Missing crucial details.

  • 24 marks: A pass. Research question is vague or essentially repeats Part A. Methods barely connect to the question. No concrete examples from data. Generic observations throughout. Limitations are “more data would help” type statements. Ethics is missing or completely generic. Response resembles content copy-pasted from the web or AI without adaptation.

  • <24 marks: No answer provided, response is very inadequate, irrelevant to the question, or missing key required elements (e.g., no research question, no limitations discussion, no ethics discussion).


Key Criteria for High Marks

To achieve distinction-level marks (70+), your work must demonstrate:

  1. Concrete examples and detail: Don’t just say “I found 3 clusters” - describe what’s IN those clusters with specific document examples, representative terms, or illustrative cases. Show us, don’t just tell us.

  2. Deep justification of methods and parameters:

    • WHY this method for THIS dataset and question? What perspective does it bring?
    • WHY these parameters? How were they determined? Why are they appropriate for your data?
    • Connect every choice to your analytical goals and data characteristics
  3. Dataset-specific insights: Your interpretation must engage with the specific social context of your data. Generic observations that could apply to any text dataset will not achieve high marks.

  4. Concrete, dataset-specific limitations:

    • Not “more data would help” but specific issues you encountered or trade-offs you made
    • Problems revealed by YOUR analysis of YOUR data
    • What your methods captured and what they missed in THIS context
  5. Thoughtful, grounded ethics discussion:

    • Specific ethical issues arising from YOUR dataset and YOUR findings
    • Consider concrete harms or misuses particular to your analysis
    • Who specifically might be affected by your work?
  6. Code that demonstrates your methods: Some working code is expected to show how you implemented your analysis and generated your findings. Well-documented code enhances your submission, but the focus should be on using code to support your narrative, not code for its own sake.

  7. Integration and coherence: Part A should compare perspectives meaningfully. Part B should tell a coherent analytical story from question through methods to findings. Building on Part A in Part B (when well-justified) demonstrates sophisticated thinking.


Important: Code and Reproducibility

Code is expected to support your analysis and demonstrate your methods. You should include code that:

  • Implements your text mining and unsupervised learning methods
  • Generates your findings and visualizations
  • Supports your narrative with evidence

Working code demonstrating your methods is required. For distinction-level work, your code should be well-documented and produce meaningful outputs that support your analysis.

Acceptable code formats:

  • Fully executable R code (ideal: runs when we render your .qmd)
  • Non-executable code chunks with results shown (fully acceptable for high marks if computationally expensive; you must explain why the code is set not to execute and include the results/outputs)
  • Pseudo-code or proof-of-concept (acceptable for explaining your approach before implementation, or for supplementary analyses where you encountered bugs; explain what you intended. Note: your main analysis must use working code, but occasional pseudo-code for supplementary ideas is fine)

Code quality matters for high marks:

  • Load all libraries in a single chunk at the start of your document (see the example setup chunk below)
  • Use relative paths that work within your repo structure (not absolute paths like C:/Users/YourName/...)
  • Follow the tidyverse/quanteda style we taught in the course
  • Document your code with clear comments
  • Produce meaningful outputs (not just running code for the sake of it)
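
For example, a setup chunk at the top of your .qmd might look like this (the package list is illustrative; load only what you actually use):

```r
# Setup chunk: load ALL libraries here, once, at the start of the document.
library(tidyverse)
library(tidymodels)
library(quanteda)
library(quanteda.textstats)
```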

Data and reproducibility:

  • Do NOT upload large data files to your repo
  • Provide clear instructions for where to download the data and where to place it
  • Your code should be reproducible if we follow your instructions

Penalties will apply for:

  • Not submitting the HTML render
  • Submitting HTML that is not self-contained (missing images/plots)
  • Using absolute paths that break reproducibility
  • Loading libraries scattered throughout the document instead of at the start
  • Missing reproducibility instructions for data
  • Using AI without acknowledgment (AI use is allowed but must be disclosed in the “Use of AI Tools” section)
  • Code that produces no meaningful outputs or doesn’t support the analysis


Subsampling Data

You are allowed to subsample large datasets for computational reasons. However:

  • You must explain how and why you sampled
  • You must justify your sampling approach (random? stratified? time period? why? see the sketch below for one option)
  • You must discuss limitations: how might sampling affect your results?
  • Without this, you will be penalized for unclear methodology
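
For example, a stratified subsample might look like this. Note that full_data, the grouping variable year, and the 10% proportion are all hypothetical choices you would need to justify for your own data:

```r
library(dplyr)

set.seed(202)  # make the subsample reproducible

# Hypothetical: keep 10% of documents within each year so the sample
# preserves the corpus's temporal composition.
sampled <- full_data |>
  group_by(year) |>
  slice_sample(prop = 0.1) |>
  ungroup()
```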


General Advice

  • Be concise and decisive: Don’t try every method under the sun. Choose methods thoughtfully, justify them, and go deep with interpretation.
  • Show concrete examples: The best responses show us specific findings from the data, not just summaries.
  • Tie methods to narrative: Every method should serve your analytical story and be justified in context.
  • Think about stakeholders: Remember your scenario - you’re providing insights to inform decisions. What do they need to know?
  • Be honest about limitations: Good researchers acknowledge what their methods can and cannot tell us.
  • Consider ethics seriously: This is not a checkbox. Think genuinely about implications.

How to get help and collaborate with others

🙋 Getting help

You can post general clarifying questions on Slack.

For example, you can ask:

  • “I’m working with the Reddit dataset and having trouble with the nested structure. What’s the best way to approach this in R?”
  • “What’s a reasonable subsample size for topic modeling with 10,000 documents?”

You cannot post questions that reveal your specific approach, analysis, or findings.

You won’t be penalized for posting something on Slack that violates this principle without realizing it. We will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others and work alongside each other. However:

  • You cannot share or copy code
  • You cannot share your specific analytical approach or findings
  • You cannot work together on the same analysis
  • You can discuss general concepts, technical problems, or dataset characteristics
  • You can help each other troubleshoot R errors or technical issues

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT, Claude, or GitHub Copilot when doing this assignment and you can search online for help.

However:

  1. You must report any AI use. Add a section at the end of your notebook titled “Use of AI Tools” that describes:
    • Which tool(s) you used
    • What you used them for
    • Approximately how much you relied on them
  2. Be aware of AI limitations:
    • AI tools often generate plausible-sounding responses that are incorrect
    • They tend to produce generic, formulaic responses that limit your chances for a high mark
    • For coding, they often generate inefficient code that doesn’t follow the principles we teach
    • They cannot understand your specific dataset context as well as you can
  3. AI use will not excuse errors. If you submit incorrect code or analysis based on AI suggestions, you are responsible for those errors.

Examples of reporting AI use:

“I used ChatGPT to help debug an error in my quanteda code where tokens weren’t being created correctly. I also used it to get a second explanation of how DBSCAN’s eps parameter works after reviewing the lecture notes.”

“I used GitHub Copilot for code completion throughout the assignment. It helped with syntax but I wrote all the analytical logic myself.”

“I did not use any AI tools for this assignment.”


Good luck! 🍀