LSE DS202 – Data Science for Social Scientists
19 Jan 2026
decision support systems
machine learning applications
databases
provenance
ethical AI/XAI
Write an e-mail to Kevin:

Sign up for DSI events at lse.ac.uk/DSI/Events



Follow the seminar series: 🔗 Link
Hear from alumni or industry experts about their career paths and how they got to where they are today.
Latest events:
🗓️ Data Science across industries (03 December 2024 - 4.00 to 5.30pm)
Machine learning is transforming large parts of the economy, and data scientists have the opportunity to apply their skills in an incredibly broad variety of domains. The field is progressing rapidly, and professional roles are in continuous development as companies navigate successive waves of technological and economic change. Data scientists must therefore craft skill paths that balance a focus on rapid learning with capabilities that complement their domain, organisation and wider industry.
Drawing on his experience from startups, consulting and tech, Christian Svalesen, Senior Machine Learning Engineer at SoundCloud, will provide insights into what data science roles and projects can involve across industries. He will share advice on how students can prepare and develop through their professional journey.
Read more about this series of events: 🔗 Link
🗓️ Navigating Data Science from Academia to Media, and Beyond (23 October 2024 - 4.30 to 6pm)
With the rise in adoption of AI/ML technologies and the increasing demand for data-driven decision-making, data science has become a vital component across many industries, including media. As data science transforms the media landscape (enhancing content personalisation, optimising conversion strategies, and improving audience engagement), it is also becoming an increasingly popular tool for addressing complex business challenges. Navigating a role in this field can be both exciting and challenging.
Tabtim Duenger, Senior Data Scientist at The Economist, and Riya Chhikara, Data Scientist at The Economist, both LSE graduates, will offer insights into their paths to entering the field of data science. They will discuss their experiences in landing their first roles, negotiating their functions and responsibilities within the media sector, and how they use these experiences and networks to continue guiding their careers.
Read more about this series of events: 🔗 Link


‘Winners’ of the upcoming Bank of England trip will be announced soon!
| Programme | Freq |
|---|---|
| General Course | 14 |
| BSc in Economics | 10 |
| BSc in Psychological and Behavioural Science | 4 |
| BSc in Politics and Data Science | 3 |
| BSc in Social Anthropology | 2 |
| BSc in Philosophy and Economics | 1 |
| BSc in Philosophy, Politics and Economics | 1 |
| BSc in Sociology | 1 |
| Year | Count |
|---|---|
| 1 | 17 |
| 2 | 10 |
| 3 | 11 |
| 4 | 2 |

Key insight: Diverse backgrounds → diverse perspectives on DS problems
Course Rep Selection:
What is this course about?
Focus: learn and understand the most fundamental machine learning algorithms
How: practical use of machine learning techniques and their evaluation metrics, applied to relevant data sets
What is this course about?
Two Critical Principles:
1. Learn to Learn
2. No Single “Right Answer”
How will this course be taught?
How do I prepare for this course?
Important
There might be some preparatory work to do before each lab!
Always check Moodle/the webpage at least a day before coming to the lab.
Each week, you will have a roadmap of what to do.
The roadmap will typically contain the following elements:
| Type of activity | Description |
|---|---|
| 🧑🏻‍🏫 TEACHING MOMENT | Your class teacher deserves your full attention |
| 🎯 ACTION POINTS | Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher. |
| 👥 IN PAIRS/GROUPS | You will benefit from completing that task with your peers more than doing it alone |
| 🗣️ CLASSROOM DISCUSSION | Your class teacher will facilitate a discussion about the task |
| 📝 SUBMISSION | Submit your work |
👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.
If you are reading this but you are not an LSE student, the same content is available on the course’s 🌐 public website
We assume that you have some basic knowledge of:
- If you took ST102, you should be fine.
- Nothing crazy, mostly matrix operations (simpler than MA107)
- It’s ok if you are new to Python, but do reserve some extra hours in the first weeks to practice the basics.
New to Python?
Why it matters:
Our approach in this course:
Weeks 1-3: Python 3.13
- Latest Anaconda distribution
- Great for pandas, numpy, matplotlib, seaborn
- Basic data science work

Week 4 onwards: Python 3.12
- Better support for scikit-learn, statsmodels
- Some advanced ML packages not yet on 3.13
- We’ll guide you through the switch
There are three official positions at LSE:
Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)
Position 2: Limited authorised use of generative AI in assessment.
Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course
Source: School position on generative AI, LSE Website, September 2024
Our Policy - Responsible Use (NOT Optional!):
✅ You CAN use:
- ChatGPT, Copilot, Claude, etc. for lectures, labs, assignments

⚠️ You MUST:
- Acknowledge every use in your submissions
- Explain HOW you used it (see examples below)
- Check and understand all AI-generated code/content
- Critically evaluate AI suggestions against course materials

❌ You CANNOT:
- Use AI when explicitly told not to
- Submit AI output without understanding it
- Claim AI work as entirely your own
Example acknowledgment:
“I used ChatGPT to debug my pandas merge operation. It suggested using pd.merge() with on='date', but this produced duplicates. I revised it to include how='left' after reviewing the pandas documentation.”
Why this matters:
👉 Full policy on Moodle - read it carefully!
Empirical, experience-focused learning:
This is not a “spoon-feeding” course:
Quick Poll (Mentimeter):
How comfortable are you with Python basics? (1-5 scale)
Have you used pandas before? (Yes/No/A little)
Have you used numpy before? (Yes/No/A little)
Results will guide our Python review depth
“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.
Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),
and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”
New data to answer old questions:
New questions enabled by new data/new technologies:
We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.
For Economics students (10) & General Course (14, mostly business/econ):
- Predicting UK inflation trends using consumer spending data
- Analyzing income inequality patterns across London boroughs
- Forecasting housing market shifts using property transaction data

For Politics & Data Science students (3):
- Tracking public sentiment on Brexit using survey data over time
- Predicting UK election outcomes at the constituency level
- Analyzing parliamentary voting patterns to identify party factions

For Psychology & Behavioural Science students (4):
- Understanding mental health trends among university students from NHS data
- Predicting therapy dropout rates based on early session patterns
- Analyzing social media usage and wellbeing correlations

For Sociology & Anthropology students (3):
- Mapping gentrification patterns in East London using census data
- Understanding migration flows and integration outcomes
- Analyzing cultural consumption patterns across UK demographics
👉 Traditional Statistics in the social sciences: the goal is typically explanation
👉 Data science: the focus is frequently put more on data exploration and prediction
It is often said that 80% of the time and effort spent on a data science project goes to data gathering, cleaning, and preparation.
In this course:
Note
If you want to practice the “80%” more, check out our other course: DS105.
The Question: “Will the Bank of England raise, lower, or hold interest rates next month?”
Why it matters:
- Affects mortgages for homeowners
- Changes savings account interest
- Impacts business loan costs
- Influences overall UK economic growth
This was a real assignment from last year
By the end of this course, you’ll be able to tackle this problem yourself!
The research phase (requires domain knowledge!):
What factors influence BoE decisions?
Identify relevant indicators:
The reality check:
- Is this data available? Where?
- What sources: ONS, OECD, Bank of England, Federal Reserve
- In what format? How far back does it go?
- Can we legally use it?
Important
This IS data science: Finding the right data to answer your question comes first!
Download from multiple sources:
Bank of England: Interest rate decisions
- First discovery: Decisions don’t happen every month!
- They occur roughly every 6-8 weeks
Economic indicators from different sources:
| Indicator | Source |
|---|---|
| Consumer Confidence Index (CCI) | OECD |
| CPIH inflation | ONS |
| GDP monthly estimates | ONS |
| GBP/EUR and GBP/USD exchange rates | Bank of England |
| 10-year gilt yields | Federal Reserve Bank of St. Louis |
| Unemployment rate | ONS |
Warning
The challenge: Different sources = different formats, different frequencies, different date conventions!
The tricky bit:
For each BoE decision date, calculate 3-month average of each indicator
Example:
- Decision date: 06/05/1997
- Calculate average GDP for: May 1997, April 1997, March 1997
- Repeat separately for all 6 other indicators
- Align everything to the decision date
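The averaging step in the example can be sketched in pandas. This is a minimal illustration under stated assumptions, not the assignment's actual solution: the GDP values are invented placeholders, and the names `gdp`, `decision_dates` and `three_month_average` are hypothetical.

```python
import pandas as pd

# Placeholder monthly series (values are invented for illustration only),
# indexed by calendar month.
gdp = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0],
    index=pd.period_range("1997-01", periods=5, freq="M"),
    name="gdp",
)

# Decision dates written day-first (UK convention), as in the example.
decision_dates = pd.to_datetime(["06/05/1997"], format="%d/%m/%Y")


def three_month_average(series, decision_date):
    """Average the decision month and the two months before it."""
    end = decision_date.to_period("M")
    window = pd.period_range(end=end, periods=3, freq="M")
    # reindex keeps missing months as NaN; mean() then skips them.
    return series.reindex(window).mean()


features = pd.DataFrame(
    {"gdp_3m_avg": [three_month_average(gdp, d) for d in decision_dates]},
    index=decision_dates,
)
print(features)
```

The same helper could be applied to each of the other indicators in turn, giving one averaged column per indicator and one row per decision date.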
Why this matters:
- BoE makes decisions based on recent economic context
- Need to capture the “state of the economy” at decision time
- Must handle:
  - Monthly vs quarterly data
  - Missing values
  - Different date formatting (UK vs US formats!)
  - Alignment of different time series
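Two of the issues listed above (date conventions and mixed frequencies) can be illustrated with a short sketch. The values are invented, the variable names are hypothetical, and repeating each quarterly value across its three months is only one of several possible alignment choices.

```python
import pandas as pd

# The same calendar day written in two conventions.
uk_style = "06/05/1997"  # day/month/year -> 6 May 1997
us_style = "05/06/1997"  # month/day/year -> 6 May 1997

# Stating the format explicitly avoids silent day/month swaps.
uk_date = pd.to_datetime(uk_style, format="%d/%m/%Y")
us_date = pd.to_datetime(us_style, format="%m/%d/%Y")
print(uk_date == us_date)

# Placeholder quarterly series (invented values), indexed by quarter.
quarterly = pd.Series(
    [1.0, 2.0],
    index=pd.period_range("1997Q1", periods=2, freq="Q"),
)

# One simple way to align with monthly data: repeat each quarterly
# value for the three months of that quarter.
months = pd.period_range("1997-01", "1997-06", freq="M")
monthly = pd.Series(
    quarterly.reindex(months.asfreq("Q")).to_numpy(),
    index=months,
)
print(monthly)
```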
Tip
This is the 80%: Getting data into the right shape for analysis
After all that preparation:
A major challenge - Distribution Shift:
- Models learn from past BoE decisions
- But what if the economic environment changes fundamentally?
- Examples: Post-2008 financial crisis, COVID-19 pandemic, Brexit
- Past patterns may not apply to new contexts
- We’ll discuss this more in Week 11
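One simple way to look for this kind of shift is to split the data by time and compare the two periods. The sketch below uses invented toy rows (not real BoE decisions) and a hypothetical cutoff date, purely to show the mechanics.

```python
import pandas as pd

# Invented toy decisions, for illustration only (not real BoE data).
decisions = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2005-01-01", "2006-01-01", "2007-01-01",
             "2009-01-01", "2010-01-01", "2011-01-01"]
        ),
        "decision": ["raise", "raise", "hold", "lower", "hold", "hold"],
    }
)

cutoff = pd.Timestamp("2008-01-01")  # hypothetical split point
train = decisions[decisions["date"] < cutoff]   # the "past" the model sees
test = decisions[decisions["date"] >= cutoff]   # the "future" it is judged on

# Compare how often each outcome occurs before and after the cutoff.
print(train["decision"].value_counts(normalize=True))
print(test["decision"].value_counts(normalize=True))
```

If the two sets of proportions look very different, a model trained on the earlier period may not transfer well to the later one.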
Important
The reality: No model is perfect. Your job is to:
Key insight: “By the time you’re ready to do ‘machine learning,’ you’ve already done the hard work”
Your journey in this course:
Weeks 1-4: Data Foundation
- Data handling with pandas
- Data cleaning and transformation
- Exploratory data analysis
- Visualization

Weeks 5-10: Machine Learning
- Classification algorithms
- Regression models
- Model evaluation
- Parameter tuning
- Interpreting results

Week 11: Advanced Topics
- Distribution shift and model limitations
- Ethical considerations
- Real-world deployment challenges
Before: Messy reality

After: Clean and ready
Note
The message: Even government data needs serious cleaning. You’ll spend most of your time here, but that’s where real insights emerge.
Coming up next: Python Review & Setup
Take a 10-minute break, then we’ll dive into Python!
Python Review & Environment Setup
A few stats
Data types
Python lists
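A minimal sketch of creating a list, written to match the output shown next (the variable name `numbers` is an assumption, not taken from the original slide):

```python
numbers = [1, 2, 3, 4]  # square brackets create a list
print(numbers)
```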
returns:
[1, 2, 3, 4]
You could use the append method to add elements to a list:
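For instance, assuming a list named `numbers` (a hypothetical name):

```python
numbers = [1, 2, 3, 4]
numbers.append(5)  # adds 5 to the end, modifying the list in place
print(numbers)
```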
returns:
[1, 2, 3, 4, 5]
Python lists (cont.)
Other types of data collections
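Tuples are one such collection. A minimal sketch, with `numbers_tuple` as an assumed variable name:

```python
numbers_tuple = (1, 2, 3, 4)  # parentheses create a tuple
print(numbers_tuple)
```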
returns
(1, 2, 3, 4)
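Elements of a tuple can be read by position, just like a list. The snippet below is an assumed example of an operation that gives the value shown next; it may not be the one used on the original slide.

```python
numbers_tuple = (1, 2, 3, 4)
second_item = numbers_tuple[1]  # indexing starts at 0
print(second_item)
```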
returns
2
What do you think is the difference here?
Other types of data collections
Aside from lists and tuples, you also have dictionaries and other more complex data collection types (see the documentation).
A Python dictionary is a collection of key-value pairs:
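A minimal sketch matching the output shown next (the variable name `person` is an assumption):

```python
# Keys go before the colon, values after it
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}
print(person)
```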
returns:
{'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London'}
Some basic operations
With the built-in type() function, you will get the type of your Python object.
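A sketch of what such a check might look like; `person` is a hypothetical dictionary, and `var` follows the comment in the output further down.

```python
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}
var = float(2.0)

print(type(person))  # the dictionary
print(type(var))     # the floating-point number
```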
The above returns:
<class 'dict'>
<class 'float'> # since var = float(2.0)
Sometimes, we need to perform operations repeatedly
We have loops (for or while loops):
(Note that Python needs indentation and you absolutely can’t mix tabs and spaces!)
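A minimal sketch of both loop styles doing the same work (the numbers are arbitrary):

```python
# for loop: visit each item of a sequence in turn
for i in range(3):
    print(i * 2)

# while loop: repeat for as long as the condition holds
i = 0
while i < 3:
    print(i * 2)
    i += 1  # without this line the loop would never end
```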
And you have list comprehensions (as well as dictionary comprehensions):
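The snippets below are assumed examples, written so that the two dictionary comprehensions produce the outputs shown next; the variable names are hypothetical.

```python
# List comprehension: double each number
doubled = [i * 2 for i in range(5)]

# Dictionary comprehension: map each number to its square
squares = {i: i**2 for i in range(10)}
print(squares)

# Dictionary comprehension with a condition: keep even values only
letters = {"a": 1, "b": 2, "c": 3, "d": 4}
even_only = {k: v for k, v in letters.items() if v % 2 == 0}
print(even_only)
```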
returns
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25,
6: 36, 7: 49, 8: 64, 9: 81}
{'b': 2, 'd': 4}
Custom functions definition
Let’s define functions based on the loops and list comprehension from before. We’ll do some code profiling!
```python
import cProfile

def for_loop_example():
    result = []
    for i in range(100000):
        result.append(i * 2)

def while_loop_example():
    result = []
    i = 0
    while i < 100000:
        result.append(i * 2)
        i += 1

def list_comprehension_example():
    result = [i * 2 for i in range(100000)]

# Profile each function
print("Profiling for loop:")
cProfile.run("for_loop_example()")
print("\nProfiling while loop:")
cProfile.run("while_loop_example()")
print("\nProfiling list comprehension:")
cProfile.run("list_comprehension_example()")
```
Results from the loops and list comprehension profiling
```text
Profiling for loop:
         100004 function calls in 0.022 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013    0.021    0.021 <python-input>:3(for_loop_example)
   100000    0.008    0.000    0.008    0.000 {method 'append' of 'list' objects}

Profiling while loop:
         100004 function calls in 0.021 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.014    0.014    0.020    0.020 <python-input>:8(while_loop_example)
   100000    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}

Profiling list comprehension:
         4 function calls in 0.003 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 <python-input>:15(list_comprehension_example)
```
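Takeaway: List comprehensions are much faster! (~7x speedup in this profiled run). Note that cProfile adds overhead to each of the 100,000 append calls, so the gap without profiling is smaller; timeit gives a fairer comparison.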
pandas and scikit-learn (briefly)
Python comes with many standard libraries (os, collections, math, etc. - see the Python documentation)
The pandas, numpy and scikit-learn libraries are not part of the standard Python libraries, but they are very popular and actively maintained packages
These packages contain most of the functionality needed to handle datasets, manipulate them (pandas mainly), perform statistical operations on them and apply machine learning models
These are the libraries we will rely on most in this course
Note to R users
Think of pandas as what tidyverse is to R and, to some extent, of scikit-learn as what tidymodels (and perhaps caret) are to R.
pandas
Example: reading a csv file
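A minimal sketch of reading CSV data with `pd.read_csv`. To keep it self-contained, the data comes from an in-memory string rather than a file; the column names, values and the file name in the comment are all made up.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file you would pass its
# path instead, e.g. pd.read_csv("my_data.csv") (hypothetical file name).
csv_text = io.StringIO("name,age\nJane,30\nJohn,25\n")

df = pd.read_csv(csv_text)
print(df.head())  # first rows of the DataFrame
```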
pandas
In pandas, you write multiple functions in succession or use method chaining:
Without method chaining
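The two styles can be compared side by side. This is an illustrative sketch on a made-up DataFrame; the column names and the filter threshold are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jane", "John", "Ana"], "age": [30, 25, 41]})

# Without method chaining: one step at a time, with intermediate variables
older = df[df["age"] > 26]
older_sorted = older.sort_values("age", ascending=False)
step_by_step = older_sorted.reset_index(drop=True)

# With method chaining: the same steps written as one expression
chained = (
    df.query("age > 26")
    .sort_values("age", ascending=False)
    .reset_index(drop=True)
)

# Both approaches give the same table
print(chained.equals(step_by_step))
```

Chaining avoids the intermediate variables, at the cost of making it slightly harder to inspect each step while debugging.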
pandas
Example: filtering rows
Filtering when the values are integers
Filtering when the values are strings
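Both kinds of filter use a boolean condition inside square brackets. A minimal sketch on a made-up DataFrame (the columns `city` and `year` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["London", "Paris", "London"], "year": [2023, 2024, 2025]}
)

# Filtering when the values are integers
recent = df[df["year"] >= 2024]

# Filtering when the values are strings
london = df[df["city"] == "London"]

print(recent)
print(london)
```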
Example: concatenating dataframes
Say we have two random datasets:
If we want to concatenate vertically:
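A sketch of vertical concatenation with `pd.concat`, using two small random datasets. The shapes, column names and seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the example is repeatable

# Two small random datasets with the same columns
df1 = pd.DataFrame(rng.integers(0, 10, size=(3, 2)), columns=["a", "b"])
df2 = pd.DataFrame(rng.integers(0, 10, size=(2, 2)), columns=["a", "b"])

# Stack the rows of df2 under df1; ignore_index renumbers the rows 0..4
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked)
```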
which returns
Let’s load some real economic data:
Try it yourself:
This connects immediately to the BoE example and gets you coding!
Next Week’s Lab:
- Hands-on with pandas
- Data cleaning exercises
- Prep work: Check Moodle by Wednesday

Resources:
- Former DS105 students: Check ME204 for more pandas practice
- New to Python: Budget extra time for basics, use office hours/Digital Skills Lab
Office Hours: Check Moodle for schedule
LSE DS202A (2025/26) – Week 01