🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

04 Oct 2024

Who we are

Your lecturer

Photo of Ghita Berrada
Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor
  • PhD in Computer Science (University of Twente, Netherlands)
  • Background: Engineering, Databases, Health Informatics, ML for cybersecurity
  • Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Photo of Tabtim Duenger
Tabtim Duenger
Data Scientist
The Economist
MSc in Applied Social Data Science (LSE)
📧 E-mail
guest teacher
Photo of Andreas Stöffelbauer
Andreas Stöffelbauer
Data Scientist
Microsoft
MSc in Data Science (LSE)
📧 E-mail
guest teacher
Photo of Stuart Bramwell
Dr Stuart Bramwell
Guest Lecturer
Data Science Institute
DPhil in Politics (Oxford University)
📧 E-mail
guest teacher
Photo of Yassine Lahna
Yassine Lahna
Data Scientist
MSc in Statistical Science (Oxford University)
📧 E-mail
guest teacher

Support Sessions

Photo of Sara Luxmoore
Sara Luxmoore
Research Officer
LSE Data Science Institute and LSE Cities
📧 E-mail
  • 🦸🏻‍♀️ Runs weekly drop-in sessions for all DSI courses!

DS202A Weekly Drop-in sessions:

  • Typically every Tuesday from 12.15 to 13.30 in COL.1.06 (Visualisation Studio), but check announcements and calendar invites for updates.

Administrative Support

Photo of Kevin Kittoe
Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

  • if you cannot find the lecture recording on Moodle
  • when you need an extension for an assignment
    (👉 check LSE’s extension policy)
  • to request a class group change
    (you will be asked to provide a reason for this)
  • to inform us of any other issues that may affect your studies

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • DSI is the hub for LSE’s interdisciplinary collaboration in data science
  • ⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Latest event:

🗓️ Keeping London Moving with Data (28 February, 4 to 5.30pm)

A talk about life in the data world at TfL. Jemima, Graduate Data Scientist at Transport for London (TfL) will talk about her experience as a Data Science Graduate in our inaugural programme. Lauren Sager Weinstein, Chief Data Officer, at Transport for London (TfL) will talk about how she’s leading TfL’s data strategy, and how all the components of data careers (data scientists, data developers, data product managers, and data users) can come together to deliver on our data vision: To empower our people to make better decisions with data.

Industry “field trips”

Visit at Lloyds (2023)

Who are you?

Programme | Freq
BSc in Psychological and Behavioural Science | 44
BSc in Politics and Data Science | 4
BSc in Economics | 3
General Course | 3
BSc in International Social and Public Policy | 2
BSc in Philosophy, Politics and Economics | 2
Exchange Programme for Students from Stockholm School of Economics | 2
BSc in International Relations | 1
BSc in International Social and Public Policy with Politics | 1
BSc in Sociology | 1
Exchange Programme for Students from Central European University | 1
Exchange Programme for Students from SGH Warsaw School of Economics | 1

Year | Count
1 | 9
2 | 6
3 | 49
4 | 1

Who are you? (cont.)

What is this course about?

Course Brief

What is this course about?

  • Focus: learn and understand the most fundamental machine learning algorithms
  • No neural networks, no deep learning, no large-scale data
  • How: practical use of machine learning techniques and their metrics, applied to relevant data sets
  • Some but not a lot of theory, math proofs and derivations
  • Lots of coding, examples and exercises

🎯 Learning Objectives

  • Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds;
  • Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
  • Know how to fit and apply supervised machine learning models for classification and prediction.
  • Know how to evaluate and compare fitted models, and improve model performance.
  • Use applied computer programming, including the hands-on use of programming through course exercises.
  • Apply the methods learned to real data through hands-on exercises.
  • Integrate the insights from data analytics into knowledge generation and decision-making.
  • Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
  • Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

  • How will this course be taught?

  • How do I prepare for this course?

🧑🏻‍💻 Labs (90 min each week)

  • Purpose: introduce new concepts and tools which will only be explored in more detail in the lectures
    • Why? So you can come to the lectures with good questions!
  • Typically:
    • your class teacher might give you some context about the new tools/algorithms
    • you will be given time to work on something by yourself
    • there will be moments to share your interpretation of the results of algorithms with the classroom
  • You have to attend the lab you are enrolled in. You can’t switch on the day

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity | Description
🧑🏻‍🏫 TEACHING MOMENT | Your class teacher deserves your full attention
🎯 ACTION POINTS | Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS | You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION | Your class teacher will facilitate a discussion about the task
📝 SUBMISSION | Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

👩🏻‍🏫 Lectures (2 hours per week)

  • The first sessions will have slides, but mostly, it will be live coding
  • Feel free to code along with your lecturer
  • Pair/group exercises and discussions to interpret results
  • Bring a laptop if you can! (💡 you can borrow one from the library)
  • Recorded sessions will be available on Moodle on the next working day

Programming

  • Programming Language:
Logo of the programming language R
R
  • Integrated Development Environment (IDE) options:
Logo of the software RStudio
RStudio
Logo of the software Visual Studio Code
VS Code


  • You choose:
    • RStudio is the most popular IDE for R
    • VS Code is a more general IDE, good for many programming languages. It is more lightweight than RStudio, but it requires more configuration.

Software and Tools

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

  1. Descriptive Statistics
     • If you took ST102, you should be fine.
  2. Some linear algebra
     • Nothing crazy, mostly matrix operations (simpler than MA107)
  3. Programming
     • It’s ok if you are new to R, but do reserve some extra hours in the first weeks to practice the basics.

Teaching Philosophy


  • My teaching approach is grounded in empiricism.
  • I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
  • In summary: learning by doing (or, put more bluntly 😂, learning by trial and error) serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?


Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

  • Frequently, we will present you with tasks that involve new concepts before diving into the corresponding theory or background knowledge.
    • For example, I might ask you to consult the tidyverse documentation instead of explaining it directly.
  • Reasoning: letting your ‘struggles’ guide the learning process.
  • 👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
    • You are very much encouraged to help and learn from each other.
  • If this course is too easy for you, try to apply its concepts to your own data sets or to more complex problems and bring us your questions.
  • If you feel this teaching style is not working, drop us an e-mail or discuss it during office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

There are three official positions at LSE:

Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)

Position 2: Limited authorised use of generative AI in assessment.

Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course

Source: School position on generative AI, LSE Website, September 2024

Our policy in this course

  • You can use AI tools during lectures, labs, and for your assignments.
    • Except when the lecturer or class teachers expressly ask you not to use them.
  • When using AI tools for assignments, you must acknowledge their use and tell us how you used them.
    • Examples:

      I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.

      I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

  • We see many students using ChatGPT during lectures, labs, and assessments.
  • Frankly, most university instructors are clueless as to whether this is helping or hindering your learning.
  • So we did some research to try to figure out:
    • How are students using generative AI tools in their studies?
    • What are the benefits and drawbacks of using generative AI tools?

Participating Courses:

  • DS105W (Data for Data Science)
  • DS202W (Data Science for Social Scientists)
  • ST456 (Deep Learning)
  • PP422 (Data Science for Public Policy)

The GENIAL project

You can read more about the GENIAL project on the project page.

What we have learned so far:

We haven’t fully analysed the data yet (lots of it!⛰️) but here’s what we can say for now about the good and bad aspects of using generative AI tools in education:

  • Good: The students who made the most resourceful use of GenAI remained in control of their learning. They often gave the chatbots a lot of context (“I want to perform web scraping of this website with the library scrapy, the code must contain functions – no classes – and I want to save the data in a CSV file.”) and would always check the code/output generated by GenAI against the course materials or reputable sources. They were able to identify when the AI was suggesting something that was not correct or not following best practices and would never blindly accept the AI’s suggestions.

  • Bad: If you don’t master a subject, GenAI can make you feel like you do. This pattern was frequent, for example, among students who had gaps in their understanding of programming concepts. They would ask the AI to generate code for them, and the AI would produce code that seemed to work but that produced incorrect results or was so complex that it was virtually impossible to edit.

Read more about it in our preprint:

Dorottya Sallai, Jonathan Cardoso-Silva, Marcos E. Barreto, Francesca Panero, Ghita Berrada, and Sara Luxmoore. “Approach Generative AI Tools Proactively or Risk Bypassing the Learning Process in Higher Education”, Preprint, July 2024.

☕️ Time for a break

Image created with DALL·E2. Prompt: “Cat drinking tea in a classroom, Renoir style.”

Our first proper lecture will start in a few minutes.

What really is data science? + R tips

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

  • Humans and machines nowadays generate A LOT of data ALL THE TIME
  • It has become very cheap to collect and store this data

Source: Roser, Ritchie, and Mathieu (2023)
  • This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

  • How do rumours spread?
  • How can we predict unemployment rates accurately?

New questions enabled by new data/new technologies:

  • Is social media a threat to democracy/public order?
  • Is generative AI a threat to the job market?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

You might ask:

“How is data science any different from what I have learned in other stats courses?”

Data Science and Social Science

👉 Traditional Statistics in the social sciences: the goal is typically explanation

👉 Data science: the focus is frequently put more on data exploration and prediction

  • Data science is heavily influenced by computer science and engineering
  • There is a strong emphasis on computational efficiency and scalability (due to big data)
  • Many of the algorithms and methods you will learn in this course can be used in both contexts (explanation vs prediction)
    • We will try to highlight the differences in these approaches throughout the course
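
To make the explanation vs prediction contrast a bit more concrete, here is a minimal sketch in plain Python (no packages needed). Everything in it is an assumption made for illustration: the data is synthetic, and fit_line is a hypothetical helper, not something from the course materials. The same fitted line is read in two ways: once for its slope (explanation), once for its error on points it has not seen (prediction).

```python
import random

random.seed(0)  # Fixed seed so the sketch is repeatable

# Synthetic data (made up for illustration): y is roughly 2 * x plus noise
xs = [float(i) for i in range(20)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


# Explanation mindset: fit on all the data and interpret the slope
slope_all, intercept_all = fit_line(xs, ys)
print("Estimated effect of x on y:", round(slope_all, 2))

# Prediction mindset: hold out every fourth point and check the error on it
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
test_mse = sum((intercept + slope * x - y) ** 2 for x, y in test) / len(test)
print("Mean squared error on held-out points:", round(test_mse, 2))
```

The algorithm is identical in both halves; what changes is the question asked of it.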

The Data Science Workflow

[Diagram: Start -> Gather data -> Store it somewhere -> Clean & pre-process -> Build a dataset -> Exploratory data analysis -> Machine learning -> Obtain insights -> Communicate results -> End]

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.



Next week, we will discuss together what it means for a machine to learn something.


But first, a word about programming skills 👉

Let’s get more technical

  • Python vs R
  • base R vs tidyverse

Python vs R

Logo of the programming language python
Python
  • Python is a general-purpose programming language
  • It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.
Logo of the programming language R
R
  • R is more niche. It is a programming language created for statistical computing
  • You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

R for Python users

Data types

  • In R, you assign a variable using the operator <- :
var <- 2
  • Some basic data types:
var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

  • Whereas in Python, assignments are done with = :
var = 2
  • The python equivalent:
var = "value" # A string. Single quotes are OK too

var = 2.2     # A float
var = 2       # An int (🏅)
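
If you want to check the Python side of this comparison yourself, you can inspect the types directly. This is a small sketch in plain Python; the variable names are illustrative only.

```python
# Inspect the type Python infers for each literal
as_string = "value"
as_float = 2.2
as_int = 2  # No decimal point, so Python stores an int (R would store a double)

type_names = [type(v).__name__ for v in (as_string, as_float, as_int)]
print(type_names)  # ['str', 'float', 'int']

# Explicit conversion, similar in spirit to R's as.integer()
converted = int(2.0)
print(type(converted).__name__)  # int
```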

R for Python users

R Vectors

  • R is a vectorized language. Everything is a vector!
  • Amongst other things, this means we can call length() on any variable:

In the example below, var is a vector of length 1.

# This is a vector!
var <- 2.2

length(var)

returns:

[1] 1

Use the c( ) function to concatenate vectors:

# A vector of length 3
var <- c(2.2, 3.3, 4.4)

length(var)

returns:

[1] 3

R for Python users

R Vectors (cont.)

  • ⚠️ R vectors can only have one data type!

This is straightforward:

# A vector of numbers
c(1, 2, 3, 4, 5)

# A vector of characters
c("a", "b", "c", "d", "e")

# A vector of booleans
c(TRUE, FALSE, TRUE, FALSE, TRUE)

But beware! The code below is also valid:

my_vec <- c(2, "3", as.integer(4))

It won’t throw an error, but once you inspect the type of the vector, you will see that typeof(my_vec) returns "character": R silently converted every element to a string.

If you type:

my_vec

You will see:

[1] "2" "3" "4"

R for Python users

Lists (not the same as vectors)

  • If you need to keep elements of different data types, create a list instead.
my_list <- list(2, "3", as.integer(4))

Now, if you type:

my_list

You will see:

[[1]]
[1] 2

[[2]]
[1] "3"

[[3]]
[1] 4
  • We see a list with a length of 3
  • Each element of the list is shown after the double brackets, [[ ]]
  • The first element of the list ([[1]]) is a vector of length 1 that contains the number 2, etc. (the [1] at the start of the line is the index of the first value printed on that line)

R for Python users

Lists are more flexible than vectors (they are also slower to process)

Vectors are always flat:

# Me trying to do something complicated
my_silly_vector <- c(1, c(2, c(3, 4)), 5)
my_silly_vector

yields a simple vector:

[1] 1 2 3 4 5

Lists preserve the structure:

# Let's try converting some vectors into a list
my_silly_list <- list(1, list(2, c(3, 4)), 5)
my_silly_list

This produces a list of length 3 (not 5) with a more complex, nested structure:

[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[1] 3 4


[[3]]
[1] 5

R for Python users

Note: Python has no built-in vector type, only lists

  • If you run:
elements = [2, "3", 4]
  • You will get a list of length 3, with elements of different data types:
type(elements)
list
len(elements)
3
elements
[2, '3', 4]

(preserved structure)
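
The nesting behaviour can be checked in plain Python too. In the sketch below, flatten is a hypothetical helper written only for this illustration: it mimics what R's c() does to nested vectors, which Python lists never do on their own.

```python
# Python lists keep whatever nesting you give them (like R lists, unlike c())
nested = [1, [2, [3, 4]], 5]
print(len(nested))  # 3


def flatten(items):
    """Recursively flatten nested lists into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


print(flatten(nested))  # [1, 2, 3, 4, 5]
```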

R for Python users

Loops are not that different

for (i in 1:10) {
  print(i)
}

i <- 1  # Initialise the counter before the while loop
while (i < 10) {
  print(i)
  i <- i + 1
}

(R needs the curly brackets)

for i in range(1, 11):
    print(i)

i = 1  # Initialise the counter before the while loop
while i < 10:
    print(i)
    i += 1

(Python needs the indentation)
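
One detail that often trips people up when moving between the two languages: R's 1:10 includes both ends, whereas Python's range() stops just before its second argument. A quick check in plain Python:

```python
# range(1, 11) yields 1 through 10; the stop value 11 itself is excluded
values = list(range(1, 11))
print(values[0], values[-1], len(values))  # 1 10 10
```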

R for Python users

Custom functions definition, compared

my_function <- function(x) {
  return(x + 1)
}
my_function(2)

In R, return() exists, but it is optional. The value of the last expression evaluated in the function is returned automatically.

my_function <- function(x) {
  x + 1
}

def my_function(x):
    return x + 1
my_function(2)
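
Python, by contrast, has no implicit return. The sketch below (function names are illustrative) shows what happens if you carry the R habit over: the last expression is computed but thrown away, and the function returns None.

```python
def with_return(x):
    return x + 1


def without_return(x):
    x + 1  # Computed, then discarded: Python does not return it


print(with_return(2))     # 3
print(without_return(2))  # None
```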

Base R vs tidyverse

  • R has a base set of functions that come with the installation of the language

  • The base functions are OK - they are just not awesome.

  • The tidyverse is not part of the base R installation, but it is a very popular package

  • It is actually a collection of several packages that make it easier to manipulate data (+ databases + plotting + modelling + etc.)

  • This is what we will use in this course. (We suffered tremendously teaching base R two years ago)

Note to Python users

Think of the tidyverse as what pandas is to Python

Base R vs tidyverse

Example: reading a csv file

# Base R
my_data <- read.csv("my_file.csv")

# tidyverse (read_csv() comes from the readr package, which library(tidyverse) loads)
library(tidyverse)
my_data <- read_csv("my_file.csv")

Base R vs tidyverse

Example: selecting columns

# Base R
my_data <- my_data[, c("col1", "col2")]

# tidyverse
my_data <- select(my_data, col1, col2)

Base R vs tidyverse

The pipe operator

  • If there is one thing that beginners tend to find counterintuitive about tidyverse, it is the pipe operator %>%. But it is quite simple:
my_data <- read_csv("my_file.csv") %>% select(col1, col2)
  • This is equivalent to the common, nested way of writing:
my_data <- select(read_csv("my_file.csv"), col1, col2)
  • The pipe operator takes the output of the function on the left and passes it as the first argument of the function on the right.
  • When you see %>%, think of it as the word “then”.
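
For those who think in Python, the idea behind the pipe can be sketched in a few lines. This is only an analogy, not how R implements it: pipe, add_one and double are hypothetical names, and each step here takes a single argument, whereas R's pipe also lets you pass extra arguments to the function on the right.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass value through each step in order, feeding each output to the next."""
    return reduce(lambda acc, step: step(acc), steps, value)


def add_one(x):
    return x + 1


def double(x):
    return x * 2


nested = double(add_one(3))      # Read inside-out
piped = pipe(3, add_one, double)  # Read left to right: take 3, then add one, then double
print(nested, piped)  # 8 8
```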

Base R vs tidyverse

The pipe operator

  • This method chaining operator became so popular that even base R has a pipe operator now (|>)
    • In most everyday cases, you can use %>% and |> interchangeably (the base pipe requires R 4.1 or later):
# This also works
my_data <- read_csv("my_file.csv") |> select(col1, col2)
  • pandas (in Python) also supports method chaining:

Without method chaining

import pandas as pd

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)  # cut is a pandas function, not a DataFrame method
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    .assign(new_column=lambda d: pd.cut(...))  # the lambda receives the intermediate DataFrame as d
    .pivot_table(...)
    .rename(...)
)

Base R vs tidyverse

Example: filtering rows

# Base R
my_data <- my_data[my_data$col1 == 1, ]

# tidyverse
my_data <- my_data %>% filter(col1 == 1)

Example: combining columns together

# Base R
my_data$col3 <- my_data$col1 + my_data$col2

# tidyverse
my_data <- my_data %>% mutate(col3 = col1 + col2)
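
If it helps to see what filter() and mutate() are doing conceptually, here is the same pair of operations in plain Python, on a tiny made-up table stored as a list of dictionaries. It is only a sketch of the logic; in practice you would use pandas for this in Python.

```python
# A tiny made-up table: each dictionary is one row
rows = [
    {"col1": 1, "col2": 10},
    {"col1": 2, "col2": 20},
    {"col1": 1, "col2": 30},
]

# "filter": keep only the rows where col1 equals 1
filtered = [row for row in rows if row["col1"] == 1]

# "mutate": build new rows with an extra column, leaving the originals untouched
mutated = [{**row, "col3": row["col1"] + row["col2"]} for row in filtered]

print(mutated)
# [{'col1': 1, 'col2': 10, 'col3': 11}, {'col1': 1, 'col2': 30, 'col3': 31}]
```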

Base R vs tidyverse

Example: grouping and summarizing

Say we have a random dataset:

# Generate a random dataset
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))

If we want to calculate the mean of col2 for each value of col1:

# The Base R way (stored under a new name so my_data stays intact)
means_base <- aggregate(col2 ~ col1, data = my_data, FUN = mean)

# Over time, you will see that the tidyverse way becomes more intuitive
means_tidy <- my_data %>% group_by(col1) %>% summarize(mean_col2 = mean(col2))
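
Under the hood, "group by, then summarize" is a split-apply-combine pattern. The sketch below spells out those steps in plain Python on a small made-up dataset (the values are invented for illustration and are not the random data generated above).

```python
from collections import defaultdict
from statistics import mean

# Made-up (col1, col2) pairs
rows = [(1, 2.0), (2, 4.0), (1, 6.0), (3, 1.0), (2, 8.0)]

# Split: collect the col2 values belonging to each col1 group
groups = defaultdict(list)
for col1, col2 in rows:
    groups[col1].append(col2)

# Apply and combine: one mean per group, ordered by group key
group_means = {key: mean(values) for key, values in sorted(groups.items())}
print(group_means)  # {1: 4.0, 2: 6.0, 3: 1.0}
```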

Coming Up

  • Next Week’s Lab: Prepare for hands-on exercises in tidyverse.
  • If you are a former DS105 student: Explore ME204 for code and exercises in the DS105 style, but in tidyverse instead of pandas.

References

Roser, Max, Hannah Ritchie, and Edouard Mathieu. 2023. “Technological Change.” Our World in Data.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.