🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

Dr. Ghita Berrada

LSE Data Science Institute

19 Jan 2026

Who we are

Your lecturer

Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor

PhD in Computer Science (University of Twente, Netherlands)
Background: Engineering, Databases, Health Informatics, ML for cybersecurity
Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Dr Stuart Bramwell
Guest Lecturer
Data Science Institute
DPhil in Politics (Oxford University)
📧 E-mail
guest teacher

Yassine Lahna
Data Scientist
MSc in Statistical Science (Oxford University)
📧 E-mail
guest teacher

Jonas Weinert
MPhil/PhD candidate in Health Policy and Health Economics (LSE)
Research Consultant: Development Impact Evaluations (World Bank Group)
📧 E-mail
guest teacher

Administrative Support

Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

if you cannot find the lecture recording on Moodle
when you need an extension for an assignment
(👉 check LSE’s extension policy)
to request a class group change
(you will be asked to provide a reason for this)
to inform us of any other issues that may affect your studies

The Data Science Institute

This course is offered by the LSE Data Science Institute (DSI).
DSI is the hub for LSE’s interdisciplinary collaboration in data science
⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Latest events:

🗓️ Data Science across industries (03 December 2024 - 4.00 to 5.30pm)

Machine learning is transforming large parts of the economy, and data scientists have the opportunity to apply their skills in an incredibly broad variety of domains. The technical field is in rapid progress and professional roles in continuous development as companies navigate successive waves of technological and economic change. Data scientists must therefore craft skill paths which balance focus on rapid learning with capabilities complementing their domain, organisations and wider industry.

Drawing on his experience from startups, consulting and tech, Christian Svalesen, Senior Machine Learning Engineer at SoundCloud will provide insights into what data science roles and projects can involve across industries. He will share advice on how students can prepare and develop through their professional journey.

🗓️ Navigating Data Science from Academia to Media, and Beyond (23 October 2024 - 4.30 to 6pm)

With the rise in adoption of AI/ML technologies and the increasing demand for data-driven decision-making, data science has become a vital component across many industries, including media. As data science transforms the media landscape - enhancing content personalisation, optimising conversion strategies, and improving audience engagement - it is also becoming an increasingly popular tool for addressing complex business challenges. Navigating a role in this field can be both exciting and challenging.

Tabtim Duenger, Senior Data Scientist at The Economist, and Riya Chhikara, Data Scientist at The Economist, both LSE graduates, will offer insights into their paths into the field of data science. They will discuss their experiences in landing their first roles, negotiating their functions and responsibilities within the media sector, and how they use these experiences and networks to continue guiding their careers.

Industry “field trips”

Who are you?

Programme	Freq
General Course	14
BSc in Economics	10
BSc in Psychological and Behavioural Science	4
BSc in Politics and Data Science	3
BSc in Social Anthropology	2
BSc in Philosophy and Economics	1
BSc in Philosophy, Politics and Economics	1
BSc in Sociology	1

Year	Count
1	17
2	10
3	11
4	2

Who are you? (cont.)

Key insight: Diverse backgrounds → diverse perspectives on DS problems

Course Rep Selection:

We’ll elect a course representative in Week 2
Important role: Represent student voice, provide feedback to teaching team
Think about whether you’d like to nominate yourself!

What is this course about?

Course Brief

What is this course about?

Focus: learn and understand the most fundamental machine learning algorithms

No neural networks, no deep learning, no large-scale data

How: practical use of machine learning techniques and their metrics, applied to relevant data sets

Some but not a lot of theory, math proofs and derivations
Lots of coding, examples and exercises

Course Brief

Two Critical Principles:

1. Learn to Learn

We provide essential building blocks, not everything on a platter
You’ll need to read documentation, explore independently
Homework is not optional - it’s where deep learning happens
Ask questions, but try to solve problems first

2. No Single “Right Answer”

Data science is about justified choices, not just code
You must explain WHY you chose a model, parameter, metric
Interpretation matters as much as implementation
Context of the problem/dataset shapes your decisions

🎯 Learning Objectives

Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds.
Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
Know how to fit and apply supervised machine learning models for classification and prediction.
Know how to evaluate and compare fitted models, and improve model performance.
Use applied computer programming, including the hands-on use of programming through course exercises.
Apply the methods learned to real data through hands-on exercises.
Integrate the insights from data analytics into knowledge generation and decision-making.
Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

How will this course be taught?
How do I prepare for this course?

🧑🏻‍💻 Labs (90 min each week)

Purpose: introduce new concepts and tools which will only be explored in more detail in the lectures
- Why? So you can come to the lectures with good questions!
Typically:
- your class teacher might give you some context about the new tools/algorithms
- you will be given time to work on something by yourself
- there will be moments to share your interpretation of the results of algorithms with the classroom
You have to attend the lab you are enrolled in; you can’t switch on the day.

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity	Description
🧑🏻‍🏫 TEACHING MOMENT	Your class teacher deserves your full attention
🎯 ACTION POINTS	Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS	You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION	Your class teacher will facilitate a discussion about the task
📝 SUBMISSION	Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

👩🏻‍🏫 Lectures (2 hours per week)

The first sessions will have slides, but mostly, it will be live coding
Feel free to code along with your lecturer
Pair/group exercises and discussions to interpret results
Bring a laptop if you can! (💡 you can borrow one from the library)
Recorded sessions will be available on Moodle on the next working day

Programming

Programming Language:

Python

Integrated Development Environment (IDE) option:

VS Code

In this course, we use VS Code. It is a general-purpose IDE that works well with many programming languages. The flip side is that it requires a bit more configuration.

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

Descriptive Statistics: if you took ST102, you should be fine.
Some linear algebra: nothing crazy, mostly matrix operations (simpler than MA107).
Programming: it’s ok if you are new to Python, but do reserve some extra hours in the first weeks to practice the basics.

New to Python?

LSE Digital Skills Lab offers pre-sessional Python workshops
Check: lse.ac.uk/dsl
Highly recommended if you’re starting from scratch!

Python Environment Management

Why it matters:

Different packages need different Python versions
ML/AI packages often lag behind latest Python release

Our approach in this course:

Weeks 1-3: Python 3.13
- Latest Anaconda distribution
- Great for pandas, numpy, matplotlib, seaborn
- Basic data science work

Week 4 onwards: Python 3.12
- Better support for scikit-learn, statsmodels
- Some advanced ML packages not yet on 3.13
- We’ll guide you through the switch

How to manage environments:

# Check your Python version
python --version

# Create environment with specific version
conda create -n ds202 python=3.12
conda activate ds202

Note

We’ll walk through this together in Week 4!
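Once an environment is activated, you can also confirm from inside Python which interpreter is running. This is only a minimal sketch: the helper names `describe_python` and `is_expected_version` are made up for this example, and 3.12 is simply the version named in the plan above.

```python
import sys

def describe_python(version_info=sys.version_info):
    """Return the running interpreter's version as a 'major.minor' string."""
    return f"{version_info[0]}.{version_info[1]}"

def is_expected_version(expected, version_info=sys.version_info):
    """Check whether the interpreter matches an expected (major, minor) pair."""
    return tuple(version_info[:2]) == tuple(expected)

print("Running Python", describe_python())
print("Matches 3.12?", is_expected_version((3, 12)))
```

If the second line prints False after `conda activate ds202`, the most likely cause is that your editor or terminal is still pointing at a different interpreter.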

Generative AI Tools Policy

There are three official positions at LSE:

Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)

Position 2: Limited authorised use of generative AI in assessment.

Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course

Source: School position on generative AI, LSE Website, September 2024

Responsible Use of AI - REQUIRED

Our Policy - Responsible Use (NOT Optional!):

✅ You CAN use:
- ChatGPT, Copilot, Claude, etc. for lectures, labs, assignments

⚠️ You MUST:
- Acknowledge every use in your submissions
- Explain HOW you used it (see examples below)
- Check and understand all AI-generated code/content
- Critically evaluate AI suggestions against course materials

❌ You CANNOT:
- Use AI when explicitly told not to
- Submit AI output without understanding it
- Claim AI work as entirely your own

Example acknowledgment:

“I used ChatGPT to debug my pandas merge operation. It suggested using pd.merge() with on='date', but this produced duplicates. I revised it to include how='left' after reviewing the pandas documentation.”

Why this matters:

Part of our GenAI Learning research project
Learning to use AI responsibly is a key skill
We’re studying how it affects your learning

👉 Full policy on Moodle - read it carefully!

Teaching Philosophy

Empirical, experience-focused learning:

Learning by doing (trial and error is encouraged!)
You’ll often encounter new concepts before formal explanations
Struggle is part of the process - ask “dumb questions”
Help and learn from each other

This is not a “spoon-feeding” course:

We provide building blocks and guidance
You build understanding through practice and exploration
Read documentation, try things, bring us your questions

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

Python Comfort Check

Quick Poll (Mentimeter):

How comfortable are you with Python basics? (1-5 scale)
- 1 = Never used it
- 5 = Very comfortable
Have you used pandas before? (Yes/No/A little)
Have you used numpy before? (Yes/No/A little)

Results will guide our Python review depth

If >70% comfortable: Quick 10-min refresher, focus on pandas nuances
If 40-70% comfortable: Moderate pace (15 min), emphasize pandas
If <40% comfortable: Full review (25 min) - we’ll catch you up!

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

Humans and machines nowadays generate A LOT of data ALL THE TIME

It has become very cheap to collect and store this data

Source: (Roser, Ritchie, and Mathieu 2023)

This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?
How can we predict unemployment rates accurately?

New questions enabled by new data/new technologies:

Is social media a threat to democracy/public order?
Is generative AI a threat to the job market?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

Applications in YOUR Fields

For Economics students (10) & General Course (14, mostly business/econ):
- Predicting UK inflation trends using consumer spending data
- Analyzing income inequality patterns across London boroughs
- Forecasting housing market shifts using property transaction data

For Politics & Data Science students (3):
- Tracking public sentiment on Brexit using survey data over time
- Predicting UK election outcomes at the constituency level
- Analyzing parliamentary voting patterns to identify party factions

For Psychology & Behavioural Science students (4):
- Understanding mental health trends among university students from NHS data
- Predicting therapy dropout rates based on early session patterns
- Analyzing social media usage and wellbeing correlations

For Sociology & Anthropology students (3):
- Mapping gentrification patterns in East London using census data
- Understanding migration flows and integration outcomes
- Analyzing cultural consumption patterns across UK demographics

The Data Science Workflow

It is often said that 80% of the time and effort spent on a data science project goes to data gathering, cleaning, and preparation.

In this course:

We focus mainly on the “20%” (exploration → ML → insights)
But you’ll definitely see the “80%” reality too!
Most datasets we give you are already clean…
…but you’ll experience the data wrangling challenge in assignments

Note

If you want to practice the “80%” more, check out our other course: DS105.

Real Example: Bank of England Interest Rates

The Challenge

The Question: “Will the Bank of England raise, lower, or hold interest rates next month?”

Why it matters:
- Affects mortgages for homeowners
- Changes savings account interest
- Impacts business loan costs
- Influences overall UK economic growth

This was a real assignment from last year

By the end of this course, you’ll be able to tackle this problem yourself!

Step 0: What Data Do We Need?

The research phase (requires domain knowledge!):

What factors influence BoE decisions?

Identify relevant indicators:

Consumer Confidence Index (CCI): Are people optimistic about spending?
Inflation (CPIH): How fast are prices rising?
GDP Growth: Is the economy expanding or contracting?

Exchange Rates: GBP vs EUR/USD strength
10-year Gilt Yields: What do bond markets expect?
Unemployment Rate: Labor market health

The reality check:
- Is this data available? Where?
- What sources: ONS, OECD, Bank of England, Federal Reserve
- In what format? How far back does it go?
- Can we legally use it?

Important

This IS data science: Finding the right data to answer your question comes first!

Step 1: Gather the Data

Download from multiple sources:

Bank of England: Interest rate decisions
- First discovery: Decisions don’t happen every month!
- They occur roughly every 6-8 weeks

Economic indicators from different sources:

Indicator	Source
Consumer Confidence Index (CCI)	OECD
CPIH inflation	ONS
GDP monthly estimates	ONS
GBP/EUR and GBP/USD exchange rates	Bank of England
10-year gilt yields	Federal Reserve Bank of St. Louis
Unemployment rate	ONS

Warning

The challenge: Different sources = different formats, different frequencies, different date conventions!

Step 2: The Alignment Challenge

The tricky bit:

For each BoE decision date, calculate 3-month average of each indicator

Example:
- Decision date: 06/05/1997
- Calculate average GDP for: May 1997, April 1997, March 1997
- Repeat separately for all 6 other indicators
- Align everything to the decision date

Why this matters:
- BoE makes decisions based on recent economic context
- Need to capture the “state of the economy” at decision time
- Must handle:
  - Monthly vs quarterly data
  - Missing values
  - Different date formatting (UK vs US formats!)
  - Alignment of different time series

Tip

This is the 80%: Getting data into the right shape for analysis
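The alignment step described above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the assignment's solution: the GDP values are made up, the helper `three_month_average` is a hypothetical name, and the window (decision month plus the two months before it) simply mirrors the May/April/March example.

```python
import pandas as pd

# Made-up monthly GDP figures, indexed by month (illustration only)
gdp = pd.Series(
    [1.0, 1.3, 1.6, 1.9],
    index=pd.period_range("1997-02", periods=4, freq="M"),
    name="gdp",
)

# Decision dates in UK day/month/year format, parsed with an explicit format
decisions = pd.DataFrame({"decision_date": ["06/05/1997"]})
decisions["decision_date"] = pd.to_datetime(
    decisions["decision_date"], format="%d/%m/%Y"
)

def three_month_average(indicator, decision_date):
    """Average the decision month and the two months before it."""
    last_month = decision_date.to_period("M")
    window = pd.period_range(end=last_month, periods=3, freq="M")
    # Months absent from the indicator become NaN and are skipped by mean()
    return indicator.reindex(window).mean()

decisions["gdp_3m_avg"] = decisions["decision_date"].apply(
    lambda date: three_month_average(gdp, date)
)
print(decisions)
```

Passing `format="%d/%m/%Y"` avoids the UK/US ambiguity: without it, 06/05/1997 could be read as 5 June instead of 6 May.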

Step 3: Only NOW Can You Predict

After all that preparation:

Explore patterns: Which indicators correlate with rate increases?
Build models: Learn from historical decisions (Weeks 5-10 of this course)
Make predictions: Up, down, or hold?
Evaluate: How accurate were we?

A major challenge - Distribution Shift:
- Models learn from past BoE decisions
- But what if the economic environment changes fundamentally?
- Examples: Post-2008 financial crisis, COVID-19 pandemic, Brexit
- Past patterns may not apply to new contexts
- We’ll discuss this more in Week 11

Important

The reality: No model is perfect. Your job is to:

Build the best model you can with available data
Understand its limitations
Communicate uncertainty clearly
Justify your modeling choices

The Journey Ahead

Key insight: “By the time you’re ready to do ‘machine learning,’ you’ve already done the hard work”

Your journey in this course:

Weeks 1-4: Data Foundation
- Data handling with pandas
- Data cleaning and transformation
- Exploratory data analysis
- Visualization

Weeks 5-10: Machine Learning
- Classification algorithms
- Regression models
- Model evaluation
- Parameter tuning
- Interpreting results

Week 11: Advanced Topics
- Distribution shift and model limitations
- Ethical considerations
- Real-world deployment challenges

The 80/20 Reality

Before: Messy reality

Merged cells, inconsistent formatting
Multiple sheets with different structures
Date formats: “Jan-97”, “01/1997”, “1997-01”
Missing values marked as “..”, “N/A”, or blank
Column names with special characters
Footnotes mixed with data

After: Clean and ready

   decision_date  rate_change  cci  cpih  gdp  ...
0     1997-05-06         0.25  102   2.8  1.2
1     1997-07-10         0.25  104   2.9  1.3
2     1997-09-04         0.00  103   3.0  1.1
3     1997-11-06         0.25  105   3.1  1.4
...

Consistent date format (YYYY-MM-DD)
No missing values (or properly handled)
Clean column names
One observation per row
Ready for analysis!
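To make the before/after contrast concrete, here is a minimal sketch of two of the fixes: mixed month formats and placeholder strings for missing values. The rows are made up to imitate the problems listed above, and `parse_month` is a hypothetical helper, not a pandas function.

```python
import io
from datetime import datetime

import pandas as pd

# Made-up rows that imitate the messy formats listed above
raw_text = """month,cpih
Jan-97,2.1
02/1997,..
1997-03,N/A
Apr-97,
05/1997,2.4
"""

# Treat the placeholder strings as missing values while reading
df = pd.read_csv(io.StringIO(raw_text), na_values=["..", "N/A"])

def parse_month(text):
    """Try each known month format in turn; return NaT if none matches."""
    for fmt in ("%b-%y", "%m/%Y", "%Y-%m"):
        try:
            return pd.Timestamp(datetime.strptime(text, fmt))
        except ValueError:
            continue
    return pd.NaT

df["month"] = pd.to_datetime(df["month"].map(parse_month))
print(df)
```

Trying formats explicitly, one at a time, keeps the parsing rules visible; how to fill or drop the remaining missing values is a separate, justified choice.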

Note

The message: Even government data needs serious cleaning. You’ll spend most of your time here, but that’s where real insights emerge.

☕️ Time for a break

Image created with DALL·E2. Prompt: “Cat drinking tea in a classroom, Renoir style.”

Coming up next: Python Review & Setup

Take a 10-minute break, then we’ll dive into Python!

Let’s get more technical

Python Review & Environment Setup

Python vs R

A few stats

Python ranked number 1 in the TIOBE Programming Community index of January 2025 (rating of 23.28% and change of +9.32% compared to January 2024) vs R at number 18 (rating of 1.00% and change of +0.27%).
Python is at the top of the 2024 IEEE Spectrum rankings of programming languages on two of the popularity measures. R sits at rank 20 in the “Spectrum” ranking, rank 17 in the “Trending” ranking and rank 21 in the “Jobs” ranking.
The PYPL PopularitY of Programming Language index also has Python at the top in January 2025 (share 29.8% with +1.7% increase). R ranks at number 6 (share 4.63% and no increase).

Python vs R

Python

Python is a general-purpose programming language
It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.

R

R is more niche. It is a programming language created for statistical computing
You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

Some Python basics

Data types

In R, you assign a variable using the operator <- :

var <- 2

Some basic data types:

var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

Whereas in Python, assignments are done with = :

var = 2

The Python equivalent:

var = "value" # A string. Single quotes are OK too
var = """I want to write a
         sentence without caring for line breaks""" # Triple quotes allow multi-line strings

var = 2.2      # A float
var = 2        # An int (🏅)
var = float(2) # A float

In Python, less is more! Always be explicit when using the greedier data types…
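A small sketch of how these types behave in practice (the variable names are illustrative): `type()` reports the type, mixing an int with a float gives a float, and converting a float to an int drops the fractional part.

```python
whole = 2        # int
decimal = 2.2    # float

print(type(whole).__name__)            # int
print(type(decimal).__name__)          # float
print(type(whole + decimal).__name__)  # float: the int is promoted

truncated = int(decimal)   # 2: the fractional part is dropped, not rounded
as_float = float(whole)    # 2.0
true_division = 7 / 2      # 3.5: / always returns a float
floor_division = 7 // 2    # 3: // keeps ints as ints
```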

Python basics

Python lists

We can put basic data types (e.g. strings, integers, floats) in collections of data (e.g. lists, dictionaries, tuples)

# I am creating a list of integers
l = [1, 2, 3, 4]
l

returns:

[1, 2, 3, 4]

You could use the append method to add elements to a list:

l.append(5) # Adding element to list l in-place
l

returns:

[1, 2, 3, 4, 5]

You could also use the extend method to add several elements:

l.extend([6,7,8]) # Adding elements in-place
l

returns:

[1, 2, 3, 4, 5, 6, 7, 8]

Yet another way to add elements to a list:

l2 = [5,7,9]   # defining a list
l2 += [8,0,4]  # adding elements
l2

returns:

[5, 7, 9, 8, 0, 4]
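The difference between these methods is easy to miss, so here is a short sketch (the list names are illustrative): `append` adds its argument as a single element, `extend` adds each item separately, and `+` builds a new list without touching the original.

```python
nested = [1, 2, 3]
nested.append([4, 5])      # the whole list becomes ONE new element
print(nested)              # [1, 2, 3, [4, 5]]

flat = [1, 2, 3]
flat.extend([4, 5])        # each item is added separately
print(flat)                # [1, 2, 3, 4, 5]

original = [1, 2, 3]
combined = original + [4]  # + builds a new list; original is unchanged
print(original, combined)  # [1, 2, 3] [1, 2, 3, 4]
```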

Python basics

Python lists (cont.)

Lists don’t need to contain a single data type

mixed_type = [2.0, 7, "bananas"] # this is valid!

You can define lists of lists and lists of lists of lists, etc…

nested_list = [[1,2,3], [4,5,6]] # nested list

You can access elements of lists and nested lists:

l = [1, 2, 3, 4, 5, 6, 7, 8]
l[0]

returns:

1

nested_list = [[1, 2, 3], [4, 5, 6]]
nested_list[0][-1]
nested_list[1][1]

returns:

3
5

Python basics

Other types of data collections

Aside from lists, you also have tuples

my_tuple = (1,2,3,4)
my_tuple

returns

(1, 2, 3, 4)

my_tuple[1]

returns

2

What do you think is the difference here?

Tuples are immutable!

Is there a way to update tuples? Yes, but only by building a new tuple and reassigning the name!

First method

my_tuple += (9,) # adding an element

Second method

# convert tuple → list → tuple
temp = list(my_tuple)  # to list
temp += [9]            # append
my_tuple = tuple(temp) # back to tuple
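A minimal sketch of what immutability means here (the names `point` and `alias` are illustrative): item assignment fails with a `TypeError`, and `+=` does not change the existing tuple but creates a new one and rebinds the name.

```python
point = (1, 2, 3, 4)
alias = point                 # second name for the same tuple object

try:
    point[0] = 99             # item assignment is not allowed on tuples
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

point += (9,)                 # builds a NEW tuple and rebinds `point`
print(point)                  # (1, 2, 3, 4, 9)
print(alias)                  # (1, 2, 3, 4): the original is untouched
print(point is alias)         # False
```

With a list, the same `+=` would modify the object in place, so `alias` would see the change too.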

Python basics

Other types of data collections

Aside from lists and tuples, you also have dictionaries and other more complex data collection types (see the documentation).

A Python dictionary is a collection of key-value pairs:

my_dict = {
    "first_name": "Jane",
    "last_name": "Doe",
    "city": "London"
}
my_dict

returns:

{'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London'}

You can access dictionary elements:

my_dict["first_name"]

returns

'Jane'

You can add an element to the dictionary:

my_dict["country"] = "UK"
my_dict

returns

{'first_name': 'Jane', 'last_name': 'Doe', 
 'city': 'London', 'country': 'UK'}
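Two things worth knowing early when working with dictionaries: asking for a missing key with square brackets raises a `KeyError`, while `.get()` returns `None` or a default of your choice. A short sketch, using an illustrative `person` dictionary:

```python
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}

has_city = "city" in person                         # membership test on keys
country = person.get("country")                     # None: key is missing
country_or_default = person.get("country", "unknown")

try:
    person["country"]                               # square brackets raise on a missing key
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# Loop over key-value pairs in insertion order
labels = [f"{key}: {value}" for key, value in person.items()]
print(labels)
```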

Python basics

Some basic operations

If you run:

type(my_dict)
type(var)

you will get the type of your Python object.

The above returns:

<class 'dict'>
<class 'float'> # since var = float(2)

You can get the length of a collection with the len function:

len(l)        # l = [1, 2, 3, 4, 5, 6, 7, 8]
len(my_tuple) # my_tuple = (1,2,3,4)
len(my_dict)  # my_dict = {'first_name': 'Jane', ...}

returns

8
4
4

Python basics

Sometimes, we need to perform operations repeatedly

We have loops (for or while loops):

result = []
for i in range(100000):
    result.append(i * 2)

result = []
i = 0
while i < 100000:
    result.append(i * 2)
    i += 1

(Note that Python needs indentation and you absolutely can’t mix tabs and spaces!)

And you have list comprehensions (as well as dictionary comprehensions):

result = [i * 2 for i in range(100000)]

# dictionary: number → its square
squares = {x: x**2 for x in range(10)}
print(squares)

returns

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 
 6: 36, 7: 49, 8: 64, 9: 81}

# filter only even values
original_dict = {"a": 1, "b": 2, "c": 3, "d": 4}
filtered_dict = {k: v for k, v in original_dict.items() 
                 if v % 2 == 0}
print(filtered_dict)

returns

{'b': 2, 'd': 4}

Python basics

Custom functions definition

def my_function(x):
    return x + 1

my_function(2)

In R, the return keyword exists, but it is optional: the value of the last expression evaluated in the function is returned. In Python, a function without an explicit return statement returns None.

my_function <- function(x) {
  x + 1
}

Python basics

Custom functions definition

Let’s define functions based on the loops and list comprehension from before. We’ll do some code profiling!

import cProfile

def for_loop_example():
    result = []
    for i in range(100000):
        result.append(i * 2)

def while_loop_example():
    result = []
    i = 0
    while i < 100000:
        result.append(i * 2)
        i += 1

def list_comprehension_example():
    result = [i * 2 for i in range(100000)]

# Profile each function
print("Profiling for loop:")
cProfile.run("for_loop_example()")

print("\nProfiling while loop:")
cProfile.run("while_loop_example()")

print("\nProfiling list comprehension:")
cProfile.run("list_comprehension_example()")

Python basics

Results from the loops and list comprehension profiling

Profiling for loop:
         100004 function calls in 0.022 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013    0.021    0.021 <python-input>:3(for_loop_example)
   100000    0.008    0.000    0.008    0.000 {method 'append' of 'list' objects}


Profiling while loop:
         100004 function calls in 0.021 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.014    0.014    0.020    0.020 <python-input>:8(while_loop_example)
   100000    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}


Profiling list comprehension:
         4 function calls in 0.003 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 <python-input>:15(list_comprehension_example)

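Takeaway: under the profiler, the list comprehension is about 7x faster than the explicit loops (0.003 s vs 0.021-0.022 s). Part of that gap is profiler overhead on the 100,000 append calls, so expect a smaller speedup when running without cProfile.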
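Because cProfile instruments every function call, a wall-clock benchmark is a useful cross-check. The sketch below uses the standard-library `timeit` module on the for loop and the list comprehension; the `n` parameter and the returned lists are additions for this example, and timings will differ from machine to machine.

```python
import timeit

def for_loop_example(n=100_000):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

def list_comprehension_example(n=100_000):
    return [i * 2 for i in range(n)]

# Run each function 10 times per measurement, repeat 3 times, keep the best
loop_time = min(timeit.repeat(for_loop_example, number=10, repeat=3))
comp_time = min(timeit.repeat(list_comprehension_example, number=10, repeat=3))

print(f"for loop:           {loop_time:.4f} s for 10 runs")
print(f"list comprehension: {comp_time:.4f} s for 10 runs")
print(f"ratio:              {loop_time / comp_time:.1f}x")
```

Taking the minimum over repeats is a common convention: it is the measurement least disturbed by other activity on the machine.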

`pandas` and `scikit-learn` (briefly)

Python has a base set of functions and libraries that come with the installation (e.g. os, collections, math; see the Python documentation)

The pandas, numpy and scikit-learn libraries are not part of the standard Python libraries, but they are very popular and actively maintained packages
These packages contain most of the functionality needed to handle datasets, manipulate them (pandas mainly), perform statistical operations on them and apply machine learning models
These are the libraries we will rely on most in this course

Note to R users

Think of pandas as roughly what the tidyverse is to R and, to some extent, of scikit-learn as what tidymodels (and perhaps caret) is to R.

A touch of `pandas`

Example: reading a csv file

import pandas as pd
my_data = pd.read_csv("my_file.csv")

Example: selecting columns

# accessing a column in the dataframe
my_data['col'] 
my_data.col  # works only if the name is a valid Python identifier and doesn't clash with a DataFrame attribute
my_data[['col1', 'col2']]  # multiple columns
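A runnable version of the selection patterns, using a small made-up DataFrame (the column names are illustrative). Note the return types: single brackets give a Series, double brackets give a DataFrame.

```python
import pandas as pd

# Small made-up dataset so the example runs without a CSV file
my_data = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "total count": [10, 20, 30],
})

one_column = my_data["col1"]             # Series
same_column = my_data.col1               # attribute access, same Series
two_columns = my_data[["col1", "col2"]]  # DataFrame with two columns
spaced = my_data["total count"]          # brackets are required here

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```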

A touch of `pandas`

In pandas, you can either apply operations one statement at a time or use method chaining:

Without method chaining

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    .assign(new_column=lambda x: pd.cut(...))
    .pivot_table(...)
    .rename(...)
)
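Since the chain above uses `...` placeholders, here is a complete, runnable chain on a small made-up DataFrame. The data, column names, bin edges and labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data with one missing age
raw = pd.DataFrame({
    "name": ["Ann", "Ben", "Cat", "Dan"],
    "age": [23.0, None, 41.0, 67.0],
})

tidy = (
    raw
    .fillna({"age": raw["age"].median()})   # fill the gap with the median age
    .query("age >= 30")                     # keep rows matching the condition
    .assign(
        age_band=lambda d: pd.cut(
            d["age"],
            bins=[0, 40, 65, 120],
            labels=["under 40", "40-65", "over 65"],
        )
    )
    .rename(columns={"name": "first_name"})
    .reset_index(drop=True)
)
print(tidy)
```

The `lambda d:` inside `assign` matters: it receives the DataFrame as it exists at that point in the chain (after filling and filtering), not the original `raw`.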

A touch of `pandas`

Example: filtering rows

Filtering when the values are integers

df.query("col == 2")  # integer column

Filtering when the values are strings

df.query("col == 'python'")  # string column
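The same filters on a small made-up DataFrame, plus two variations you will likely need: referring to a Python variable with `@`, and combining conditions. The boolean-mask version at the end is an equivalent alternative to `query`.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "col": [1, 2, 2, 3],
    "lang": ["r", "python", "julia", "python"],
})

target = 2
by_variable = df.query("col == @target")                 # @ refers to a Python variable
combined = df.query("col >= 2 and lang == 'python'")     # several conditions at once

# Equivalent filter written with boolean masks
same_with_mask = df[(df["col"] >= 2) & (df["lang"] == "python")]

print(by_variable)
print(combined)
```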

Example: concatenating dataframes

Say we have two random datasets:

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30]
})

df2 = pd.DataFrame({
    "Name": ["Charlie", "David"],
    "Age": [35, 40]
})

If we want to concatenate vertically:

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)
print(result)

which returns

       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40

Quick Hands-on - Real Data

Let’s load some real economic data:

import pandas as pd
import matplotlib.pyplot as plt

# Load UK inflation data
df = pd.read_csv('uk_cpih.csv')

# Quick exploration
print(df.head())
print(df.describe())

# Simple visualization
df.plot(x='date', y='cpih', kind='line')
plt.title('UK Inflation (CPIH) Over Time')
plt.show()

Try it yourself:

What patterns do you see?
When did inflation spike?
What might explain those spikes?

This connects immediately to the BoE example and gets you coding!
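For the "when did inflation spike?" question, two pandas calls go a long way: `idxmax()` to locate the peak and `diff()` to see period-to-period changes. The sketch below uses made-up numbers in place of `uk_cpih.csv`, and assumes the same `date` and `cpih` column names as the snippet above.

```python
import pandas as pd

# Made-up values standing in for the real file (illustration only)
toy = pd.DataFrame({
    "date": ["2021-01-01", "2022-01-01", "2023-01-01"],
    "cpih": [1.0, 5.0, 3.0],
})

toy["date"] = pd.to_datetime(toy["date"])   # parse text into real dates
toy = toy.set_index("date")

peak_date = toy["cpih"].idxmax()            # index label of the highest value
change = toy["cpih"].diff()                 # change from the previous row

print("Peak at:", peak_date.date())
print(change)
```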

Coming Up

Next Week’s Lab:
- Hands-on with pandas
- Data cleaning exercises
- Prep work: Check Moodle by Wednesday

Resources:
- Former DS105 students: Check ME204 for more pandas practice
- New to Python: Budget extra time for basics, use office hours/Digital Skills Lab

Office Hours: Check Moodle for schedule

References

Roser, Max, Hannah Ritchie, and Edouard Mathieu. 2023. “Technological Change.” Our World in Data. https://ourworldindata.org/technological-change.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.
