🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

29 Sep 2023

Who we are

Your lecturers (W01 - W05)

Dr. Jon Cardoso-Silva
Assistant Professor of Data Science (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor

PhD in Computer Science (King’s College London)
Background: Engineering, Bio & Health Informatics
Former Lead Data Scientist

networks
optimisation
software engineering
data science workflow
machine learning applications

Your lecturers (W07 - W11)

Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer

PhD in Computer Science (University of Twente, Netherlands)
Background: Engineering, Databases, Health Informatics, ML for cybersecurity
Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Tabtim Duenger
Data Scientist
The Economist
MSc in Applied Social Data Science (LSE)
📧 E-mail
guest teacher

Andreas Stöffelbauer
Data Scientist
Microsoft
MSc in Data Science (LSE)
📧 E-mail
guest teacher

Dr Stuart Bramwell
Postdoctoral researcher
New News Project
Royal Holloway
📧 E-mail
guest teacher

Xiaowei Gao
PhD Candidate
SpaceTimeLab
University College London (UCL)
📧 E-mail
support

Administrative Support

Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

if you cannot find the lecture recording on Moodle
when you need an extension for an assignment
(👉 check LSE’s extension policy)
to request a class group change
(you will be asked to provide a reason for this)
to inform us of any other issues that may affect your studies

The Data Science Institute

This course is offered by the LSE Data Science Institute (DSI).
DSI is the hub for LSE’s interdisciplinary collaboration in data science
⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Example of past event:

🗓️ Data Science Careers Panel and Networking (31 January)

A panel of alumni followed by Q&A and a networking session.

Panel:

Chinonye Dianne Pat-Ekeji, Data Scientist at Tripledot Studios
Mark Perfect, Data Scientist and Senior Consultant at Deloitte
Micha Panagiotidi, Data Scientist at Updraft
Tabtim Duenger, Data Scientist at Greater London Authority
Pauline Ting, Data Scientist at Amazon Web Services

Industry “field trips”

Summer internships

Who are you?

Programme	Count
BSc in Psychological and Behavioural Science	39
BSc in Economics	4
General Course	4
BSc in Philosophy and Economics	2
BSc in Politics and Data Science	2
BSc in Politics and International Relations	2
BA in Geography	1
BA in Social Anthropology	1
BSc in International Social and Public Policy	1
BSc in International Social and Public Policy with Politics	1
BSc in Sociology	1
Erasmus Reciprocal Programme of Study	1
MPhil/PhD in Psychological and Behavioural Science	1

Year	Freq
1	8
2	4
3	48

Who are you? (cont.)

What is this course about?

Course Brief

What is this course about?

Focus: learn and understand the most fundamental machine learning algorithms

-  No neural networks, no deep learning, no large-scale data

How: practical use of machine learning techniques and their metrics, applied to relevant data sets

- Some but not a lot of theory, math proofs and derivations
- Lots of coding, examples and exercises

🎯 Learning Objectives

Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds.
Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
Know how to fit and apply supervised machine learning models for classification and prediction.
Know how to evaluate and compare fitted models, and improve model performance.
Use applied computer programming, including the hands-on use of programming through course exercises.
Apply the methods learned to real data through hands-on exercises.
Integrate the insights from data analytics into knowledge generation and decision-making.
Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

How will this course be taught?
How do I prepare for this course?

👨🏻‍🏫 Lectures (2 hours per week)

Some sessions will have slides, others will be live coding
Feel free to code along with me
Pair/group exercises and discussions to interpret results
Bring a laptop if you can! (💡 you can borrow one from the library)
Recorded sessions will be available on Moodle on the next day

🧑🏻‍💻 Labs (90 min each week)

Purpose: reinforce concepts from the lecture
Typically:
- you will be given time to work on something by yourself
- there will be moments to share your interpretation with the classroom
You have to attend the lab you are enrolled in; you can’t switch on the day

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity	Description
🧑🏻‍🏫 TEACHING MOMENT	Your class teacher deserves your full attention
🎯 ACTION POINTS	Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS	You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION	Your class teacher will facilitate a discussion about the task
📝 SUBMISSION	Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

Programming

Programming Language: R

Integrated Development Environment (IDE) options:

RStudio

VS Code

You choose:
- RStudio is the most popular IDE for R
- VS Code is a more general IDE, good for many programming languages. It is more lightweight than RStudio, but it requires more configuration.

Software and Tools

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

Descriptive statistics: if you took ST102, you should be fine.
Some linear algebra: nothing crazy, mostly matrix operations (simpler than MA107).
Programming: it’s OK if you are new to R, but do reserve some extra hours in the first weeks to practise the basics.

Teaching Philosophy

My teaching approach is grounded in empiricism.
I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
In summary: learning by doing serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

Occasionally, I’ll present you with tasks before diving into the corresponding theory or background knowledge.
- For example: asking you to consult the tidyverse documentation instead of explaining it directly.
Reasoning: letting your ‘struggles’ guide the learning process.
👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
- But if you feel this is not working, drop me an e-mail or come to my office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

“LSE takes challenges to academic integrity and to the value of its degrees with the utmost seriousness. The School has detailed regulations and processes for ensuring academic integrity in summative work.

Unless Departments provide otherwise in guidance on the authorised use of generative AI, its use in summative and formative assessment is prohibited. Departmental Teaching Committees are strongly encouraged to define what constitutes authorised use of Generative AI tools (if any) for students taking courses in their Department. Where they do so, they must clearly communicate this to colleagues, and to students.”

Source: LSE (2023) (Emphasis added)

Our policy in this course

You can use AI tools during lectures, labs, and for your assignments.
- Except when the lecturer or class teachers expressly ask you not to use them.
When using AI tools for assignments, you must acknowledge their use and tell us how you used them.
- Examples:
  
  “I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.”
  
  “I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.”

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

More and more students are giving ChatGPT a try during lectures and labs.
So we are doing some research to try to figure out:
- When do AI tools like ChatGPT and GitHub Copilot help you learn better?
- Are they genuinely useful or just a distraction?

Participating Courses:

DS105

Data for Data Science

DS202

Data Science for Social Scientists

ST207

Databases

The GENIAL project

How will it work:

W01 lab

Normal lab

W02 lab

Normal lab

W03 lab

1st Half: Normal lab
2nd Half: Work independently
(just you + ChatGPT)

W04 lab

Participants will be split into two groups:

Those who can use ChatGPT
Those who can’t use ChatGPT

W05 lab

Normal lab

W07 lab

Participants will be split into two groups at random:

Those who CAN vs CANNOT use GitHub Copilot/ChatGPT

W08 lab

Normal lab

W09 lab

Participants will be split into two groups at random:

Those who CAN vs CANNOT use GitHub Copilot/ChatGPT

W10 lab

Normal lab

W11 lab

Normal lab

The GENIAL project

Participation is voluntary, but you MUST opt in to participate in the study 👉
You can opt out at any time.

☕️ Time for a break

Image created with DALL·E via Bing Chat AI bot. Prompt: “robots enjoying a coffee break. Circular tables, white room, pops of color, modern, cosy, clean flat design.”

Our first proper lecture will start in a few minutes.

“What really is data science? + R tips”

In the meantime, consider signing up for the GENIAL project:

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

Humans and machines nowadays generate A LOT of data ALL THE TIME

It has become cheap to collect and store this data

This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?

New questions enabled by new data:

Is social media a threat to democracy?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

You might ask:

“How is data science any different from what I have learned in other stats courses?”

The Data Science Workflow

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

The Data Science Workflow

This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.

Next week, we will discuss together what it means for a machine to learn something.

But first, a word about programming skills 👉

Let’s get more technical

Python vs R
base R vs tidyverse

Python vs R

Python

Python is a general-purpose programming language
It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.

R

R is more niche. It is a programming language created for statistical computing
You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

R for Python users

Data types

In R, you assign a variable using the operator <- :

var <- 2

Some basic data types:

var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

Whereas in Python, assignments are done with = :

var = 2

The python equivalent:

var = "value" # A string. Single quotes are OK too

var = 2.2     # A float
var = 2       # An int (🏅)
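
A quick way to check the claims above is to ask R directly with typeof(). The sketch below uses only base R. The 2L literal (an integer written with an L suffix) does not appear in the slide above; it is added here as a shorthand for as.integer(2), and the variable names are made up for illustration.

```r
# Ask R for the storage type of each value
type_string  <- typeof("value")        # "character"
type_double  <- typeof(2.2)            # "double"
type_plain   <- typeof(2)              # "double": no decimal point, still a double
type_integer <- typeof(as.integer(2))  # "integer"
type_literal <- typeof(2L)             # "integer": the L suffix marks an integer literal

print(c(type_string, type_double, type_plain, type_integer, type_literal))
```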

R for Python users

R Vectors

R is a vectorized language. Everything is a vector!
Amongst other things, this means we can call length() on any variable:

In the example below, var is a vector of length 1.

# This is a vector!
var <- 2.2

length(var)

returns:

[1] 1

Use the c( ) function to concatenate vectors:

# A vector of length 3
var <- c(2.2, 3.3, 4.4)

length(var)

returns:

[1] 3
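
One practical consequence of this design: arithmetic and comparisons apply to every element at once, with no explicit loop. A minimal base R sketch (the names doubled, above_3 and n_above are made up for illustration):

```r
var <- c(2.2, 3.3, 4.4)

doubled <- var * 2       # multiplies every element: 4.4 6.6 8.8
above_3 <- var > 3       # compares every element: FALSE TRUE TRUE
n_above <- sum(above_3)  # TRUE counts as 1, so this counts the matches: 2

print(doubled)
print(above_3)
print(n_above)
```

The result of each operation is itself a vector, so length(doubled) is still 3.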

R for Python users

R Vectors (cont.)

⚠️ R vectors can only have one data type!

This is straightforward:

# A vector of numbers
c(1, 2, 3, 4, 5)

# A vector of characters
c("a", "b", "c", "d", "e")

# A vector of booleans
c(TRUE, FALSE, TRUE, FALSE, TRUE)

But beware! The code below is also valid:

my_vec <- c(2, "3", as.integer(4))

It won’t throw an error, but once you inspect the type of the vector, you will see that typeof(my_vec) returns "character": R silently converted every element to a string.

If you type:

my_vec

You will see:

[1] "2" "3" "4"
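
The conversion above follows a fixed order. For the common types it runs logical, then integer, then double, then character, and R converts every element to the most flexible type present. A short base R sketch of that rule, plus one way to convert the strings back to numbers (variable names are illustrative):

```r
# Mixing types: the result takes the most flexible type in the vector
type_lgl_int <- typeof(c(TRUE, 2L))   # "integer"
type_int_dbl <- typeof(c(2L, 2.5))    # "double"
type_dbl_chr <- typeof(c(2.5, "3"))   # "character"

# Converting the character vector from the slide back to numbers
my_vec <- c(2, "3", as.integer(4))
my_numbers <- as.numeric(my_vec)

print(typeof(my_numbers))  # "double"
print(sum(my_numbers))     # 9
```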

R for Python users

Lists (not the same as vectors)

If you need to keep elements of different data types, create a list instead.

my_list <- list(2, "3", as.integer(4))

Now, if you type:

my_list

You will see:

[[1]]
[1] 2

[[2]]
[1] "3"

[[3]]
[1] 4

We see a list with a length of 3
Each element of the list is shown after the double brackets, [[ ]]
The first element of the list ([[1]]) is a vector of length 1 that contains the number 2 (the [1] in front of it is the index of the first value printed on that line), etc.
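
The same double brackets are used to pull elements out of a list. A small base R sketch of the difference between [[ ]] and [ ], reusing my_list from above (second_value and second_slice are illustrative names):

```r
my_list <- list(2, "3", as.integer(4))

second_value <- my_list[[2]]  # double brackets extract the element itself: "3"
second_slice <- my_list[2]    # single brackets return a shorter list

print(class(second_value))    # "character"
print(class(second_slice))    # "list"
print(length(second_slice))   # 1
print(typeof(my_list[[3]]))   # "integer": each element keeps its own type
```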

R for Python users

Lists are more flexible than vectors (they are also slower to process)

Vectors are always flat:

# Me trying to do something complicated
my_silly_vector <- c(1, c(2, c(3, 4)), 5)
my_silly_vector

yields a simple vector:

[1] 1 2 3 4 5

Lists preserve the structure:

# Let's try converting some vectors into a list
my_silly_list <- list(1, list(2, c(3, 4)), 5)
my_silly_list

This produces a list of length 3 (not 5) with a more complex, nested structure:

[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[1] 3 4


[[3]]
[1] 5

R for Python users

Note: Python has no built-in equivalent of R vectors, only lists

If you run:

elements = [2, "3", 4]

You will get a list of length 3, with elements of different data types:

type(elements)

list

len(elements)

3

elements

[2, '3', 4]

(preserved structure)

R for Python users

Loops are not that different

for (i in 1:10) {
  print(i)
}

i <- 1  # initialise the counter before the loop
while (i < 10) {
  print(i)
  i <- i + 1
}

(R needs the curly brackets)

for i in range(1, 11):
    print(i)

i = 1  # initialise the counter before the loop
while i < 10:
    print(i)
    i += 1

(Python needs the indentation)

R for Python users

Custom functions definition, compared

my_function <- function(x) {
  return(x + 1)
}

my_function(2)

In R, return() exists, but it is optional. The value of the last expression evaluated in the function is returned.

my_function <- function(x) {
  x + 1
}

def my_function(x):
    return x + 1

my_function(2)
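
Where return() still earns its place in R is when you want to leave a function early. The sketch below is only an illustration of that idea; sign_label is a made-up function, not something used elsewhere in the course.

```r
sign_label <- function(x) {
  if (x < 0) {
    return("negative")  # explicit return: exit the function right here
  }
  "non-negative"        # last expression: returned implicitly
}

print(sign_label(-1))  # "negative"
print(sign_label(2))   # "non-negative"
```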

Base R vs tidyverse

R has a base set of functions that come with the installation of the language
The base functions are OK - they are just not awesome.

The tidyverse is not part of the base R installation, but it is a very popular package
It is actually a collection of several packages that make it easier to manipulate data (+ databases + plotting + modelling + etc.)
This is what we will use in this course. (We suffered tremendously teaching base R last year)

Note to Python users

Think of the tidyverse as being to R what pandas is to Python

Base R vs tidyverse

Example: reading a csv file

# Base R
my_data <- read.csv("my_file.csv")

# tidyverse (read_csv comes from the readr package, loaded with the tidyverse)
library(tidyverse)
my_data <- read_csv("my_file.csv")

Base R vs tidyverse

Example: selecting columns

# Base R
my_data <- my_data[, c("col1", "col2")]

# tidyverse
my_data <- select(my_data, col1, col2)

Base R vs tidyverse

The pipe operator

If there is one thing that beginners tend to find counterintuitive about tidyverse, it is the pipe operator %>%. But it is quite simple:

my_data <- read_csv("my_file.csv") %>% select(col1, col2)

This is equivalent to the common, nested way of writing:

my_data <- select(read_csv("my_file.csv"), col1, col2)

The pipe operator takes the output of the function on the left and passes it as the first argument of the function on the right.
When you see %>%, think of it as the word “then”.
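
To see the “first argument” rule without needing a csv file, here is a minimal sketch on a plain numeric vector. It assumes the magrittr package is installed (it comes with the tidyverse); the variable names are made up for illustration.

```r
library(magrittr)  # provides %>%; installed as part of the tidyverse

squares <- c(1, 4, 9)

nested <- sum(sqrt(squares))            # read inside-out
piped  <- squares %>% sqrt() %>% sum()  # read left to right: squares, then sqrt, then sum

# Extra arguments go after the piped value: this is round(3.14159, 2)
rounded <- 3.14159 %>% round(2)

print(nested)   # 6
print(piped)    # 6
print(rounded)  # 3.14
```

Both versions compute the same thing; the piped one just lists the steps in the order they happen.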

Base R vs tidyverse

The pipe operator

This method chaining operator became so popular that even base R has a pipe operator now (|>, available from R 4.1 onwards)
- In most everyday cases, you can use %>% and |> interchangeably:

# This also works
my_data <- read_csv("my_file.csv") |> select(col1, col2)

pandas (in Python) also supports method chaining:

Without method chaining

import pandas as pd

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)  # cut is a pandas function, not a DataFrame method
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    # the lambda receives the intermediate DataFrame, which has no name inside the chain
    .assign(new_column=lambda d: pd.cut(...))
    .pivot_table(...)
    .rename(...)
)

Base R vs tidyverse

Example: filtering rows

# Base R
my_data <- my_data[my_data$col1 == 1, ]

# tidyverse
my_data <- my_data %>% filter(col1 == 1)

Example: combining columns together

# Base R
my_data$col3 <- my_data$col1 + my_data$col2

# tidyverse
my_data <- my_data %>% mutate(col3 = col1 + col2)

Base R vs tidyverse

Example: grouping and summarizing

Say we have a random dataset:

# Generate a random dataset
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))

If we want to calculate the mean of col2 for each value of col1:

# The Base R way (stored under a new name so that my_data is not overwritten)
means_base <- aggregate(col2 ~ col1, data = my_data, FUN = mean)

# Over time, you will see that the tidyverse way becomes more intuitive
means_tidy <- my_data %>% group_by(col1) %>% summarize(mean_col2 = mean(col2))
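
With random data it is hard to tell whether the two approaches agree, so here is a small sketch on a hand-made dataset where the group means are easy to check by eye. It assumes the dplyr package (part of the tidyverse) is installed; the data frame scores and the result names are invented for this illustration.

```r
library(dplyr)  # part of the tidyverse; provides %>%, group_by() and summarize()

# A tiny dataset: group 1 averages 15, group 2 averages 40, group 3 averages 5
scores <- data.frame(
  col1 = c(1, 1, 2, 2, 3),
  col2 = c(10, 20, 30, 50, 5)
)

# Base R: formula interface, "col2 summarised by col1"
means_base <- aggregate(col2 ~ col1, data = scores, FUN = mean)

# tidyverse: group, then summarise, naming the new columns explicitly
means_tidy <- scores %>%
  group_by(col1) %>%
  summarize(mean_col2 = mean(col2), n = n())

print(means_base)
print(means_tidy)
```

Naming the summary column (mean_col2) is optional, but it makes the result much easier to use in later steps.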

Coming Up

First Stop: Check out the 📋 Getting Ready page.
Next Week’s Lab: Prepare for hands-on exercises in tidyverse.
If you are a former DS105 student: Explore ME204 for code and exercises in the DS105 style, but in tidyverse instead of pandas.

References

LSE. 2023. “LSE Short-Term Guidance for Teachers on Artificial Intelligence, Assessment and Academic Integrity in Preparation for the 2022-23 Assessment Period.” https://info.lse.ac.uk/staff/divisions/Eden-Centre/Assets-EC/Documents/AI-web-expansion-Feb-23/Updated-Guidance-for-staff-on-AI-A-AI-March-15-2023.Final.pdf.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.
