🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

Dr. Ghita Berrada

LSE Data Science Institute

04 Oct 2024

Who we are

Your lecturer

Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor

PhD in Computer Science (University of Twente, Netherlands)
Background: Engineering, Databases, Health Informatics, ML for cybersecurity
Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Tabtim Duenger
Data Scientist
The Economist
MSc in Applied Social Data Science (LSE)
📧 E-mail
guest teacher

Andreas Stöffelbauer
Data Scientist
Microsoft
MSc in Data Science (LSE)
📧 E-mail
guest teacher

Dr Stuart Bramwell
Guest Lecturer
Data Science Institute
DPhil in Politics (Oxford University)
📧 E-mail
guest teacher

Yassine Lahna
Data Scientist
MSc in Statistical Science (Oxford University)
📧 E-mail
guest teacher

Support Sessions

Sara Luxmoore
Research Officer
LSE Data Science Institute and LSE Cities
📧 E-mail

🦸🏻‍♀️ Runs weekly drop-in sessions for all DSI courses!

DS202A Weekly Drop-in sessions:

Typically every Tuesday from 12.15 to 13.30 in COL.1.06 (Visualisation studio), but check announcements and calendar invites for updates.

Administrative Support

Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

if you cannot find the lecture recording on Moodle
when you need an extension for an assignment
(👉 check LSE’s extension policy)
to request a class group change
(you will be asked to provide a reason for this)
to inform us of any other issues that may affect your studies

The Data Science Institute

This course is offered by the LSE Data Science Institute (DSI).
DSI is the hub for LSE’s interdisciplinary collaboration in data science
⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Latest event:

🗓️ Keeping London Moving with Data (28 February, 4-5.30pm)

A talk about life in the data world at TfL. Jemima, Graduate Data Scientist at Transport for London (TfL) will talk about her experience as a Data Science Graduate in our inaugural programme. Lauren Sager Weinstein, Chief Data Officer, at Transport for London (TfL) will talk about how she’s leading TfL’s data strategy, and how all the components of data careers (data scientists, data developers, data product managers, and data users) can come together to deliver on our data vision: To empower our people to make better decisions with data.

Industry “field trips”

Who are you?

Programme	Freq
BSc in Psychological and Behavioural Science	44
BSc in Politics and Data Science	4
BSc in Economics	3
General Course	3
BSc in International Social and Public Policy	2
BSc in Philosophy, Politics and Economics	2
Exchange Programme for Students from Stockholm School of Economics	2
BSc in International Relations	1
BSc in International Social and Public Policy with Politics	1
BSc in Sociology	1
Exchange Programme for Students from Central European University	1
Exchange Programme for Students from SGH Warsaw School of Economics	1

Year	Count
1	9
2	6
3	49
4	1

Who are you? (cont.)

What is this course about?

Course Brief

What is this course about?

Focus: learn and understand the most fundamental machine learning algorithms

No neural networks, no deep learning, no large-scale data

How: practical use of machine learning techniques and their metrics, applied to relevant data sets

Some, but not a lot of, theory, math proofs and derivations
Lots of coding, examples and exercises

🎯 Learning Objectives

Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds.
Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
Know how to fit and apply supervised machine learning models for classification and prediction.
Know how to evaluate and compare fitted models, and improve model performance.
Use applied computer programming, including the hands-on use of programming through course exercises.
Apply the methods learned to real data through hands-on exercises.
Integrate the insights from data analytics into knowledge generation and decision-making.
Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

How will this course be taught?
How do I prepare for this course?

🧑🏻‍💻 Labs (90 min each week)

Purpose: introduce new concepts and tools which will only be explored in more detail in the lectures
- Why? So you can come to the lectures with good questions!
Typically:
- your class teacher might give you some context about the new tools/algorithms
- you will be given time to work on something by yourself
- there will be moments to share your interpretation of the results of algorithms with the classroom
You have to attend the lab you are enrolled in. You can’t switch on the day

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity	Description
🧑🏻‍🏫 TEACHING MOMENT	Your class teacher deserves your full attention
🎯 ACTION POINTS	Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS	You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION	Your class teacher will facilitate a discussion about the task
📝 SUBMISSION	Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

👩🏻‍🏫 Lectures (2 hours per week)

The first sessions will have slides, but mostly, it will be live coding
Feel free to code along with your lecturer
Pair/group exercises and discussions to interpret results
Bring a laptop if you can! (💡 you can borrow one from the library)
Recorded sessions will be available on Moodle on the next working day

Programming

Programming Language: R

Integrated Development Environment (IDE) options:

RStudio

VS Code

You choose:
- RStudio is the most popular IDE for R
- VS Code is a more general IDE, good for many programming languages. It is more lightweight than RStudio, but it requires more configuration.

Software and Tools

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

Descriptive Statistics: if you took ST102, you should be fine.
Some linear algebra: nothing crazy, mostly matrix operations (simpler than MA107).
Programming: it’s OK if you are new to R, but do reserve some extra hours in the first weeks to practise the basics.

Teaching Philosophy

My teaching approach is grounded in empiricism.
I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
In summary: learning by doing (or, said more bluntly 😂, learning by trial and error) serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

Frequently, we will present you with tasks that involve new concepts before diving into the corresponding theory or background knowledge.
- For example, I might ask you to consult the tidyverse documentation instead of explaining it directly.
Reasoning: letting your ‘struggles’ guide the learning process.
👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
- You are very much encouraged to help and learn from each other.
If this course is too easy for you, try to apply its concepts to your own data sets or to more complex problems and bring us your questions.
If you feel this teaching style is not working, drop us an e-mail or discuss it during office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

There are three official positions at LSE:

Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)

Position 2: Limited authorised use of generative AI in assessment.

Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course

Source: School position on generative AI, LSE Website, September 2024

Our policy in this course

You can use AI tools during lectures, labs, and for your assignments.
- Except when the lecturer or class teachers expressly ask you not to use them.
When using AI tools for assignments, you must acknowledge their use and tell us how you used them.
- Examples:
  
  “I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.”
  
  “I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.”

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

We see many students using ChatGPT during lectures, labs, and assessments.
Frankly, most university instructors are clueless as to whether this is helping or hindering your learning.
So we did some research to try to figure out:
- How are students using generative AI tools in their studies?
- What are the benefits and drawbacks of using generative AI tools?

Participating Courses:

DS105W (Data for Data Science)
DS202W (Data Science for Social Scientists)
ST456 (Deep Learning)
PP422 (Data Science for Public Policy)

The GENIAL project

You can read more about the GENIAL project on the project page.

What we have learned so far:

We haven’t fully analysed the data yet (lots of it!⛰️) but here’s what we can say for now about the good and bad aspects of using generative AI tools in education:

Good: The students who made the most resourceful use of GenAI remained in control of their learning. They often gave the chatbots a lot of context (“I want to perform web scraping of this website with the library scrapy, the code must contain functions – no classes – and I want to save the data in a CSV file.”) and would always check the code/output generated by GenAI against the course materials or reputable sources. They were able to identify when the AI was suggesting something that was not correct or not following best practices and would never blindly accept the AI’s suggestions.

Bad: If you don’t master a subject, GenAI can make you feel like you do. This pattern was frequent, for example, among students who had gaps in their understanding of programming concepts. They would ask the AI to generate code for them, and the AI would produce code that seemed to work but that generated the incorrect response or was so complex, it was virtually impossible to edit.

☕️ Time for a break

Image created with DALL·E2. Prompt: “Cat drinking tea in a classroom, Renoir style.”

Our first proper lecture will start in a few minutes.

“What really is data science? + R tips”

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

Humans and machines nowadays generate A LOT of data ALL THE TIME

It has become very cheap to collect and store this data

Source: Roser, Ritchie, and Mathieu (2023)

This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?
How can we predict unemployment rates accurately?

New questions enabled by new data/new technologies:

Is social media a threat to democracy/public order?
Is generative AI a threat to the job market?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

You might ask:

“How is data science any different from what I have learned in other stats courses?”

The Data Science Workflow

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

The Data Science Workflow

This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.

Next week, we will discuss together what it means for a machine to learn something.

But first, a word about programming skills 👉

Let’s get more technical

Python vs R
base R vs tidyverse

Python vs R

Python

Python is a general-purpose programming language
It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.

R

R is more niche. It is a programming language created for statistical computing
You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

R for Python users

Data types

In R, you assign a variable using the operator <- :

var <- 2

Some basic data types:

var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

Whereas in Python, assignments are done with = :

var = 2

The Python equivalent:

var = "value" # A string. Single quotes are OK too

var = 2.2     # A float
var = 2       # An int (🏅)
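
For readers coming from Python, one way to check the contrast described here is to inspect the types directly. The following is a minimal sketch using only built-in functions; the variable names are illustrative and not part of the course materials.

```python
# Python infers int for whole-number literals and float for decimals,
# whereas R stores both 2 and 2.2 as doubles unless told otherwise.
whole = 2
decimal = 2.2
text = "value"

print(type(whole).__name__)    # int
print(type(decimal).__name__)  # float
print(type(text).__name__)     # str

# Explicit conversion, comparable in spirit to R's as.integer()
converted = int(decimal)       # truncates towards zero
print(converted)               # 2
```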

R for Python users

R Vectors

R is a vectorized language. Everything is a vector!
Amongst other things, this means we can call length() on any variable:

In the example below, var is a vector of length 1.

# This is a vector!
var <- 2.2

length(var)

returns:

[1] 1

Use the c( ) function to concatenate vectors:

# A vector of length 3
var <- c(2.2, 3.3, 4.4)

length(var)

returns:

[1] 3
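
Python behaves differently here, which can surprise people moving between the two languages. The sketch below (illustrative names, built-ins only) shows that a bare number in Python has no length, while a list does.

```python
# In R, length(2.2) is 1 because a single number is a vector of length 1.
# In Python, a bare float is a scalar and has no length.
single = 2.2
try:
    len(single)
    scalar_has_length = True
except TypeError:
    scalar_has_length = False

print(scalar_has_length)  # False

# The closest built-in analogue to c(2.2, 3.3, 4.4) is a list
values = [2.2, 3.3, 4.4]
print(len(values))  # 3
```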

R for Python users

R Vectors (cont.)

⚠️ R vectors can only have one data type!

This is straightforward:

# A vector of numbers
c(1, 2, 3, 4, 5)

# A vector of characters
c("a", "b", "c", "d", "e")

# A vector of booleans
c(TRUE, FALSE, TRUE, FALSE, TRUE)

But beware! The code below is also valid:

my_vec <- c(2, "3", as.integer(4))

It won’t throw an error, but once you inspect the type of the vector, you will see that typeof(my_vec) returns "character": R silently converted every element to a string.

If you type:

my_vec

You will see:

[1] "2" "3" "4"
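
To make the coercion rule concrete for Python users, here is a rough sketch that imitates it. The function coerce_like_r is a hypothetical helper written for this illustration, and it assumes a simplified ordering (logical < integer < double < character); R's real rules cover more types.

```python
def coerce_like_r(values):
    """Hypothetical helper: mimic how R's c() settles on one common type."""
    # Position of each Python type in a simplified version of R's hierarchy
    rank = {bool: 0, int: 1, float: 2, str: 3}
    # The "highest" type present wins, and every element is converted to it
    target = max((type(v) for v in values), key=lambda t: rank[t])
    return [target(v) for v in values]

print(coerce_like_r([2, "3", 4]))     # ['2', '3', '4']
print(coerce_like_r([1, 2.5, True]))  # [1.0, 2.5, 1.0]
```

A plain Python list would have kept the three original types; the conversion only happens because the helper forces a single type, as an R vector does.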

R for Python users

Lists (not the same as vectors)

If you need to keep elements of different data types, create a list instead.

my_list <- list(2, "3", as.integer(4))

Now, if you type:

my_list

You will see:

[[1]]
[1] 2

[[2]]
[1] "3"

[[3]]
[1] 4

We see a list with a length of 3
Each element of the list is shown after the double brackets, [[ ]]
The first element of the list ([[1]]) is a vector of size 1 ([1]) that contains the number 2, etc.

R for Python users

Lists are more flexible than vectors (they are also slower to process)

Vectors are always flat:

# Me trying to do something complicated
my_silly_vector <- c(1, c(2, c(3, 4)), 5)
my_silly_vector

yields a simple vector:

[1] 1 2 3 4 5

Lists preserve the structure:

# Let's try converting some vectors into a list
my_silly_list <- list(1, list(2, c(3, 4)), 5)
my_silly_list

This produces a list of length 3 (not 5) with a more complex, nested structure:

[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[1] 3 4


[[3]]
[1] 5
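
The same distinction can be sketched in Python, where nested lists keep their structure by default. The flatten function below is a hypothetical helper that imitates what c() does to nested vectors; it is not a built-in.

```python
def flatten(items):
    """Hypothetical helper: recursively flatten nested lists, as c() does for vectors."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))  # unpack the inner list in place
        else:
            flat.append(item)
    return flat

nested = [1, [2, [3, 4]], 5]
print(len(nested))      # 3
print(flatten(nested))  # [1, 2, 3, 4, 5]
```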

R for Python users

Note: Python has no built-in vector type, only lists

If you run:

elements = [2, "3", 4]

You will get a list of length 3, with elements of different data types:

type(elements)

list

len(elements)

3

elements

[2, '3', 4]

(preserved structure)

R for Python users

Loops are not that different

for (i in 1:10) {
  print(i)
}

i <- 1  # initialise the counter before the loop
while (i < 10) {
  print(i)
  i <- i + 1
}

(R needs the curly brackets)

for i in range(1, 11):
    print(i)

i = 1  # initialise the counter before the loop
while i < 10:
    print(i)
    i += 1

(Python needs the indentation)
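
One detail worth checking when translating loops between the two languages is where the sequence stops. The short sketch below (illustrative variable names) shows that range() excludes its stop value, and that a while loop depends on its counter being set beforehand.

```python
# R's 1:10 includes both endpoints; Python's range() excludes the stop value,
# so range(1, 11) is needed to get the same ten numbers.
r_style = list(range(1, 11))
print(r_style[0], r_style[-1], len(r_style))  # 1 10 10

# A while loop tests its condition first, so the counter must already exist
i = 1
seen = []
while i < 10:
    seen.append(i)
    i += 1
print(seen)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```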

R for Python users

Defining custom functions, compared

my_function <- function(x) {
  return(x + 1)
}

my_function(2)

In R, return() exists, but it is optional: a function returns the value of the last expression it evaluates.

my_function <- function(x) {
  x + 1
}

The Python equivalent:

def my_function(x):
    return x + 1

my_function(2)
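
Python users should note that the optional return is specific to R. As a small sketch (function names are illustrative), leaving out return in Python does not raise an error, but the function then returns None.

```python
def add_one_explicit(x):
    return x + 1

def add_one_forgot_return(x):
    x + 1  # computed, then discarded: Python has no implicit return

print(add_one_explicit(2))       # 3
print(add_one_forgot_return(2))  # None
```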

Base R vs tidyverse

R has a base set of functions that come with the installation of the language
The base functions are OK - they are just not awesome.

The tidyverse is not part of the base R installation, but it is a very popular package
It is actually a collection of several packages that make it easier to manipulate data (+ databases + plotting + modelling + etc.)
This is what we will use in this course. (We suffered tremendously teaching base R two years ago)

Note to Python users

Think of the tidyverse as being to R what pandas is to Python

Base R vs tidyverse

Example: reading a csv file

# Base R
my_data <- read.csv("my_file.csv")

# tidyverse (run library(tidyverse) first)
my_data <- read_csv("my_file.csv")

Base R vs tidyverse

Example: selecting columns

# Base R
my_data <- my_data[, c("col1", "col2")]

# tidyverse
my_data <- select(my_data, col1, col2)

Base R vs tidyverse

The pipe operator

If there is one thing that beginners tend to find counterintuitive about tidyverse, it is the pipe operator %>%. But it is quite simple:

my_data <- read_csv("my_file.csv") %>% select(col1, col2)

This is equivalent to the common, nested way of writing:

my_data <- select(read_csv("my_file.csv"), col1, col2)

The pipe operator takes the output of the function on the left and passes it as the first argument of the function on the right.
When you see %>%, think of it as the word “then”.
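
To see the idea outside R, here is a minimal Python sketch of piping. The pipe function and the two list helpers are hypothetical names invented for this illustration; Python has no built-in pipe operator.

```python
from functools import reduce

def pipe(value, *steps):
    """Hypothetical helper: pass value through each step, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)

def drop_negatives(numbers):
    return [n for n in numbers if n >= 0]

def double(numbers):
    return [n * 2 for n in numbers]

raw = [3, -1, 4]

nested = sum(double(drop_negatives(raw)))       # read inside-out
piped = pipe(raw, drop_negatives, double, sum)  # read left to right: "then"

print(nested, piped)  # 14 14
```

Both lines compute the same result; the piped version simply lists the steps in the order they happen.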

Base R vs tidyverse

The pipe operator

This method chaining operator became so popular that base R now has its own pipe operator, |> (available from R 4.1)
- For simple chains like the one below, you can use %>% and |> interchangeably:

# This also works
my_data <- read_csv("my_file.csv") |> select(col1, col2)

pandas (in Python) also supports method chaining:

Without method chaining

import pandas as pd

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)  # cut() is a pandas function, not a DataFrame method
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    # the lambda receives the intermediate DataFrame, since df does not exist yet
    .assign(new_column=lambda d: pd.cut(...))
    .pivot_table(...)
    .rename(...)
)

Base R vs tidyverse

Example: filtering rows

# Base R
my_data <- my_data[my_data$col1 == 1, ]

# tidyverse
my_data <- my_data %>% filter(col1 == 1)

Example: combining columns

# Base R
my_data$col3 <- my_data$col1 + my_data$col2

# tidyverse
my_data <- my_data %>% mutate(col3 = col1 + col2)
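
For intuition about what filter() and mutate() do to a table, the sketch below reproduces both steps in plain Python on a tiny list of dictionaries. The data values are made up for illustration; only the column names come from the examples above.

```python
# Each dictionary is one row of the table
rows = [
    {"col1": 1, "col2": 10},
    {"col1": 2, "col2": 20},
    {"col1": 1, "col2": 30},
]

# Like filter(col1 == 1): keep only the rows that satisfy the condition
filtered = [row for row in rows if row["col1"] == 1]

# Like mutate(col3 = col1 + col2): add a column computed from existing ones
mutated = [{**row, "col3": row["col1"] + row["col2"]} for row in filtered]

print(mutated)
# [{'col1': 1, 'col2': 10, 'col3': 11}, {'col1': 1, 'col2': 30, 'col3': 31}]
```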

Base R vs tidyverse

Example: grouping and summarizing

Say we have a random dataset:

# Generate a random data frame
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))

If we want to calculate the mean of col2 for each value of col1:

# The Base R way
means_base <- aggregate(col2 ~ col1, data = my_data, FUN = mean)

# Over time, you will see that the tidyverse way becomes more intuitive
means_tidy <- my_data %>% group_by(col1) %>% summarize(mean_col2 = mean(col2))
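
Grouping and summarizing follows a split, apply, combine pattern. The sketch below spells out those three steps in plain Python, using the standard library only; the data values are made up so that the means are easy to verify by hand.

```python
from collections import defaultdict
from statistics import mean

# (col1, col2) pairs with made-up values
rows = [(1, 0.5), (2, 1.5), (1, 1.5), (3, 4.0), (2, 2.5)]

# Split: collect the col2 values belonging to each col1 group
groups = defaultdict(list)
for col1, col2 in rows:
    groups[col1].append(col2)

# Apply and combine: one mean per group, ordered by group key
summary = {key: mean(vals) for key, vals in sorted(groups.items())}
print(summary)  # {1: 1.0, 2: 2.0, 3: 4.0}
```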

Coming Up

Next Week’s Lab: Prepare for hands-on exercises in tidyverse.
If you are a former DS105 student: Explore ME204 for code and exercises in the DS105 style, but in tidyverse instead of pandas.

References

Roser, Max, Hannah Ritchie, and Edouard Mathieu. 2023. “Technological Change.” Our World in Data.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.
