🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

29 Sep 2023

Who we are

Your lecturers (W01 - W05)

Dr. Jon Cardoso-Silva
Assistant Professor of Data Science (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor

  • PhD in Computer Science (King’s College London)
  • Background: Engineering, Bio & Health Informatics
  • Former Lead Data Scientist

networks
optimisation
software engineering
data science workflow
machine learning applications

Your lecturers (W07 - W11)

Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer
  • PhD in Computer Science (University of Twente, Netherlands)
  • Background: Engineering, Databases, Health Informatics, ML for cybersecurity
  • Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Tabtim Duenger
Data Scientist
The Economist
MSc in Applied Social Data Science (LSE)
📧 E-mail
guest teacher

Andreas Stöffelbauer
Data Scientist
Microsoft
MSc in Data Science (LSE)
📧 E-mail
guest teacher

Dr Stuart Bramwell
Postdoctoral researcher
New News Project
Royal Holloway
📧 E-mail
guest teacher

Xiaowei Gao
PhD Candidate
SpaceTimeLab
University College London (UCL)
📧 E-mail
support

Administrative Support

Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

  • if you cannot find the lecture recording on Moodle
  • when you need an extension for an assignment
    (👉 check LSE’s extension policy)
  • to request a class group change
    (you will be asked to provide a reason for this)
  • to inform us of any other issues that may affect your studies

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • DSI is the hub for LSE’s interdisciplinary collaboration in data science
  • ⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Example of past event:

🗓️ Data Science Careers Panel and Networking (31 January)

A panel of alumni followed by Q&A and a networking session.

Panel:

  • Chinonye Dianne Pat-Ekeji, Data Scientist at Tripledot Studios
  • Mark Perfect, Data Scientist and Senior Consultant at Deloitte
  • Micha Panagiotidi, Data Scientist at Updraft
  • Tabtim Duenger, Data Scientist at Greater London Authority
  • Pauline Ting, Data Scientist at Amazon Web Services

Industry “field trips”

Summer internships

Who are you?

Programme                                                       Count
BSc in Psychological and Behavioural Science                    39
BSc in Economics                                                4
General Course                                                  4
BSc in Philosophy and Economics                                 2
BSc in Politics and Data Science                                2
BSc in Politics and International Relations                     2
BA in Geography                                                 1
BA in Social Anthropology                                       1
BSc in International Social and Public Policy                   1
BSc in International Social and Public Policy with Politics     1
BSc in Sociology                                                1
Erasmus Reciprocal Programme of Study                           1
MPhil/PhD in Psychological and Behavioural Science              1

Year    Freq
1       8
2       4
3       48

Who are you? (cont.)

What is this course about?

Course Brief

What is this course about?

  • Focus: learn and understand the most fundamental machine learning algorithms
    • No neural networks, no deep learning, no large-scale data
  • How: practical use of machine learning techniques and their metrics, applied to relevant data sets
    • Some, but not a lot of, theory, math proofs and derivations
    • Lots of coding, examples and exercises

🎯 Learning Objectives

  • Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds.
  • Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
  • Know how to fit and apply supervised machine learning models for classification and prediction.
  • Know how to evaluate and compare fitted models, and improve model performance.
  • Use applied computer programming, including the hands-on use of programming through course exercises.
  • Apply the methods learned to real data through hands-on exercises.
  • Integrate the insights from data analytics into knowledge generation and decision-making.
  • Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
  • Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

  • How will this course be taught?

  • How do I prepare for this course?

👨🏻‍🏫 Lectures (2 hours per week)

  • Some sessions will have slides, others will be live coding
  • Feel free to code along with me
  • Pair/group exercises and discussions to interpret results
  • Bring a laptop if you can! (💡 you can borrow one from the library)
  • Recorded sessions will be available on Moodle on the next day

🧑🏻‍💻 Labs (90 min each week)

  • Purpose: reinforce concepts from the lecture
  • Typically:
    • you will be given time to work on something by yourself
    • there will be moments to share your interpretation with the classroom
  • You have to attend the lab you are enrolled in; you can’t switch on the day

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity           Description
🧑🏻‍🏫 TEACHING MOMENT         Your class teacher deserves your full attention
🎯 ACTION POINTS           Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS         You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION    Your class teacher will facilitate a discussion about the task
📝 SUBMISSION              Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

Programming

  • Programming Language:
    • R
  • Integrated Development Environment (IDE) options:
    • RStudio
    • VS Code


  • You choose:
    • RStudio is the most popular IDE for R
    • VS Code is a more general IDE, good for many programming languages. It is more lightweight than RStudio, but it requires more configuration.

Software and Tools

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

  1. Descriptive Statistics
    • If you took ST102, you should be fine.
  2. Some linear algebra
    • Nothing crazy, mostly matrix operations (simpler than MA107)
  3. Programming
    • It’s ok if you are new to R, but do reserve some extra hours in the first weeks to practice the basics.

Teaching Philosophy


  • My teaching approach is grounded in empiricism.
  • I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
  • In summary: learning by doing serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?


Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

  • Occasionally, I’ll present you with tasks before diving into the corresponding theory or background knowledge.
    • For example: asking you to consult the tidyverse documentation instead of explaining it directly.
  • Reasoning: letting your ‘struggles’ guide the learning process.
  • 👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
    • But if you feel this is not working, drop me an e-mail or come to my office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

LSE takes challenges to academic integrity and to the value of its degrees with the utmost seriousness. The School has detailed regulations and processes for ensuring academic integrity in summative work.

Unless Departments provide otherwise in guidance on the authorised use of generative AI, its use in summative and formative assessment is prohibited. Departmental Teaching Committees are strongly encouraged to define what constitutes authorised use of Generative AI tools (if any) for students taking courses in their Department. Where they do so, they must clearly communicate this to colleagues, and to students.

Source: LSE (2023) (Emphasis added)

Our policy in this course

  • You can use AI tools during lectures, labs, and for your assignments.
    • Except when the lecturer or class teachers expressly ask you not to use them.
  • When using them for assignments, you must acknowledge the use of AI tools and tell us how you used them.
    • Examples:

      I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.

      I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

  • More and more students are giving ChatGPT a try during lectures and labs.
  • So we are doing some research to try to figure out:
    • When do AI tools like ChatGPT and GitHub Copilot help you learn better?
    • Are they genuinely useful or just a distraction?

Participating Courses:

DS105

Data for Data Science

DS202

Data Science for Social Scientists

ST207

Databases

The GENIAL project

How will it work:

W01 lab

Normal lab

W02 lab

Normal lab

W03 lab

  • 1st Half: Normal lab
  • 2nd Half: Work independently
    (just you + ChatGPT)

W04 lab

Participants will be split into two groups:

  • Those who can use ChatGPT
  • Those who can’t use ChatGPT

W05 lab

Normal lab

W07 lab

Participants will be split into two groups at random:

  • Those who CAN vs CANNOT use GitHub Copilot/ChatGPT

W08 lab

Normal lab

W09 lab

Participants will be split into two groups at random:

  • Those who CAN vs CANNOT use GitHub Copilot/ChatGPT

W10 lab

Normal lab

W11 lab

Normal lab

The GENIAL project

  • Participation is voluntary, but you MUST opt in to participate in the study 👉
  • You can opt out at any time.

☕️ Time for a break

Image created with DALL·E via Bing Chat AI bot. Prompt: “robots enjoying a coffee break. Circular tables, white room, pops of color, modern, cosy, clean flat design.”

Our first proper lecture will start in a few minutes.

What really is data science? + R tips

In the meantime, consider signing up for the GENIAL project:

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

  • Humans and machines nowadays generate A LOT of data ALL THE TIME
  • It has become cheap to collect and store this data
  • This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?

New questions enabled by new data:

Is social media a threat to democracy?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

You might ask:

“How is data science any different from what I have learned in other stats courses?”

Data Science and Social Science

👉 Traditional Statistics in the social sciences: the goal is typically explanation

👉 Data science: the focus is frequently put more on data exploration and prediction

  • Data science is heavily influenced by computer science and engineering
  • There is a strong emphasis on computational efficiency and scalability (due to big data)
  • Many of the algorithms and methods you will learn in this course can be used in both contexts (explanation vs prediction)
    • We will try to highlight the differences in these approaches throughout the course

The Data Science Workflow

Workflow diagram: Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End


It is often said that 80% of the time and effort spent on a data science project goes to the data-preparation stages of this workflow (gathering, storing, cleaning, and building the dataset).


This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.



Next week, we will discuss together what it means for a machine to learn something.


But first, a word about programming skills 👉

Let’s get more technical

  • Python vs R
  • base R vs tidyverse

Python vs R

Python
  • Python is a general-purpose programming language
  • It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.

R
  • R is more niche. It is a programming language created for statistical computing
  • You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

R for Python users

Data types

  • In R, you assign a variable using the operator <- :
var <- 2
  • Some basic data types:
var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

  • Whereas in Python, assignments are done with = :
var = 2
  • The python equivalent:
var = "value" # A string. Single quotes are OK too

var = 2.2     # A float
var = 2       # An int (🏅)
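Back in R, a quick way to confirm the types described above is to call typeof() on each value. The short sketch below is illustrative only: the variable names (x_chr, x_dbl, x_int, x_cast) are made up for this example. It also shows the L suffix, which is a shorter way to write an integer literal than as.integer().

```r
# Illustrative variable names; they are not part of the course material
x_chr  <- "value"
x_dbl  <- 2              # a plain number is stored as a double
x_int  <- 2L             # the L suffix creates an integer literal
x_cast <- as.integer(2)  # explicit conversion, same result as 2L

typeof(x_chr)   # "character"
typeof(x_dbl)   # "double"
typeof(x_int)   # "integer"

# Both ways of creating an integer give the same object
identical(x_int, x_cast)  # TRUE
```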

R for Python users

R Vectors

  • R is a vectorized language. Everything is a vector!
  • Amongst other things, this means we can call length() on any variable:

In the example below, var is a vector of length 1.

# This is a vector!
var <- 2.2

length(var)

returns:

[1] 1

Use the c( ) function to concatenate vectors:

# A vector of length 3
var <- c(2.2, 3.3, 4.4)

length(var)

returns:

[1] 3

R for Python users

R Vectors (cont.)

  • ⚠️ R vectors can only have one data type!

This is straightforward:

# A vector of numbers
c(1, 2, 3, 4, 5)

# A vector of characters
c("a", "b", "c", "d", "e")

# A vector of booleans
c(TRUE, FALSE, TRUE, FALSE, TRUE)

But beware! The code below is also valid:

my_vec <- c(2, "3", as.integer(4))

It won’t throw an error, but once you inspect the type of the vector, you will see that typeof(my_vec) returns "character": R silently converted every element to a string.

If you type:

my_vec

You will see:

[1] "2" "3" "4"
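The silent conversion shown above follows a fixed order: when types are mixed inside c(), R converts everything to the most flexible type present (logical, then integer, then double, then character). The sketch below, using made-up values, illustrates each step and how to convert strings back to numbers when that is what you intended.

```r
# Mixing types in c() promotes every element to the most flexible type
typeof(c(TRUE, 2L))    # "integer"   (TRUE becomes 1L)
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character"

# If the strings really are numbers, convert them back explicitly
my_vec <- c(2, "3", as.integer(4))
my_numbers <- as.numeric(my_vec)
my_numbers             # 2 3 4
typeof(my_numbers)     # "double"
```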

R for Python users

Lists (not the same as vectors)

  • If you need to keep elements of different data types, create a list instead.
my_list <- list(2, "3", as.integer(4))

Now, if you type:

my_list

You will see:

[[1]]
[1] 2

[[2]]
[1] "3"

[[3]]
[1] 4
  • We see a list with a length of 3
  • Each element of the list is shown after the double brackets, [[ ]]
  • The first element of the list ([[1]]) is a vector of size 1 ([1]) that contains the number 2, etc.

R for Python users

Lists are more flexible than vectors (they are also slower to process)

Vectors are always flat:

# Me trying to do something complicated
my_silly_vector <- c(1, c(2, c(3, 4)), 5)
my_silly_vector

yields a simple vector:

[1] 1 2 3 4 5

Lists preserve the structure:

# Let's try converting some vectors into a list
my_silly_list <- list(1, list(2, c(3, 4)), 5)
my_silly_list

This produces a list of length 3 (not 5) with a more complex, nested structure:

[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[1] 3 4


[[3]]
[1] 5
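To get values back out of a list, the bracket style matters. As a minimal sketch that reuses the two lists defined above: double brackets [[ ]] extract the element itself, single brackets [ ] return a shorter list, and unlist() flattens a nested list into a plain vector.

```r
my_list <- list(2, "3", as.integer(4))

# [[ ]] extracts the element itself
my_list[[2]]            # "3"
typeof(my_list[[2]])    # "character"

# [ ] returns a list that contains the element
typeof(my_list[2])      # "list"
length(my_list[2])      # 1

# Double brackets can be chained to reach into nested lists
my_silly_list <- list(1, list(2, c(3, 4)), 5)
my_silly_list[[2]][[2]] # 3 4

# unlist() discards the nesting and returns a flat vector
unlist(my_silly_list)   # 1 2 3 4 5
```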

R for Python users

Note: Python has no built-in vector type, only lists

  • If you run:
elements = [2, "3", 4]
  • You will get a list of length 3, with elements of different data types:
type(elements)
list
len(elements)
3
elements
[2, '3', 4]

(preserved structure)

R for Python users

Loops are not that different

for (i in 1:10) {
  print(i)
}

i <- 1  # reset the counter; after the for loop, i is already 10
while (i < 10) {
  print(i)
  i <- i + 1
}

(R needs the curly brackets)

for i in range(1, 11):
    print(i)

i = 1  # reset the counter; after the for loop, i is already 10
while i < 10:
    print(i)
    i += 1

(Python needs the indentation)
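Because R is vectorized, many loops can be replaced by a single expression that operates on the whole vector at once. The sketch below computes the squares of 1 to 10 both ways; the names squares_loop and squares_vec are chosen for illustration only.

```r
# Loop version: fill a pre-allocated vector one element at a time
squares_loop <- numeric(10)
for (i in 1:10) {
  squares_loop[i] <- i^2
}

# Vectorized version: the operator is applied to every element at once
squares_vec <- (1:10)^2

# Both approaches produce the same values
all(squares_loop == squares_vec)  # TRUE
```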

R for Python users

Custom functions definition, compared

my_function <- function(x) {
  return(x + 1)
}
my_function(2)

In R, return() exists, but it is optional. A function returns the value of the last expression it evaluates.

my_function <- function(x) {
  x + 1
}

def my_function(x):
    return x + 1
my_function(2)
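Back in R, the two styles of function definition behave identically, and because arithmetic is vectorized, the same function also works on a whole vector without a loop. A small sketch, with hypothetical function names so the two versions can sit side by side:

```r
# Explicit return()
add_one_explicit <- function(x) {
  return(x + 1)
}

# Implicit return: the last evaluated expression is the result
add_one_implicit <- function(x) {
  x + 1
}

add_one_explicit(2)           # 3
add_one_implicit(2)           # 3

# The function works element-wise on vectors too
add_one_implicit(c(1, 2, 3))  # 2 3 4
```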

Base R vs tidyverse

  • R has a base set of functions that come with the installation of the language

  • The base functions are OK - they are just not awesome.

  • The tidyverse is not part of the base R installation, but it is a very popular package

  • It is actually a collection of several packages that make it easier to manipulate data (+ databases + plotting + modelling + etc.)

  • This is what we will use in this course. (We suffered tremendously teaching base R last year)

Note to Python users

Think of the tidyverse as what pandas is to Python

Base R vs tidyverse

Example: reading a csv file

# Base R
my_data <- read.csv("my_file.csv")

# tidyverse (read_csv() comes from the readr package, loaded with the tidyverse)
library(tidyverse)
my_data <- read_csv("my_file.csv")
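The two functions look alike but do not return exactly the same thing. The sketch below is a hedged illustration: it writes a tiny, made-up CSV to a temporary file (so no my_file.csv is needed), reads it both ways, and compares the results. It assumes the readr package is installed and recent enough (2.0 or later) to accept show_col_types.

```r
library(readr)

# Write a small, made-up dataset to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(col1 = 1:3, col2 = c("a", "b", "c")),
  tmp,
  row.names = FALSE
)

base_df <- read.csv(tmp)                           # base R
tidy_df <- read_csv(tmp, show_col_types = FALSE)   # readr (tidyverse)

class(base_df)               # "data.frame"
inherits(tidy_df, "tbl_df")  # TRUE: read_csv() returns a tibble

# Whole numbers are parsed differently by default
typeof(base_df$col1)         # "integer"
typeof(tidy_df$col1)         # "double"
```

A tibble is still a data frame, so everything that works on a data frame works on it; it mainly prints more compactly.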

Base R vs tidyverse

Example: selecting columns

# Base R
my_data <- my_data[, c("col1", "col2")]

# tidyverse
my_data <- select(my_data, col1, col2)

Base R vs tidyverse

The pipe operator

  • If there is one thing that beginners tend to find counterintuitive about tidyverse, it is the pipe operator %>%. But it is quite simple:
my_data <- read_csv("my_file.csv") %>% select(col1, col2)
  • This is equivalent to the common, nested way of writing:
my_data <- select(read_csv("my_file.csv"), col1, col2)
  • The pipe operator takes the output of the function on the left and passes it as the first argument of the function on the right.
  • When you see %>%, think of it as the word “then”.
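To see the "first argument" rule at work without needing a file on disk, here is a minimal sketch on a small, made-up data frame. It assumes the dplyr package (part of the tidyverse) is installed; the column values are invented for illustration.

```r
library(dplyr)

my_data <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("a", "b", "c"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Nested: read from the inside out
nested <- head(select(my_data, col1, col2), 2)

# Piped: read top to bottom as "take my_data, then select, then head"
piped <- my_data %>%
  select(col1, col2) %>%
  head(2)

# Same function calls, same result
identical(nested, piped)  # TRUE
```

Extra arguments (col1, col2, and the 2 in head()) simply follow the piped-in value, which always fills the first slot.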

Base R vs tidyverse

The pipe operator

  • This chaining operator became so popular that base R now has its own pipe operator, |> (available from R 4.1)
    • In most everyday cases, you can use %>% and |> interchangeably:
# This also works
my_data <- read_csv("my_file.csv") |> select(col1, col2)
  • pandas (in Python) also supports method chaining:

Without method chaining

import pandas as pd

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)  # cut() is a pandas function, not a DataFrame method
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    # a lambda lets assign() refer to the intermediate DataFrame
    .assign(new_column=lambda d: pd.cut(...))
    .pivot_table(...)
    .rename(...)
)

Base R vs tidyverse

Example: filtering rows

# Base R
my_data <- my_data[my_data$col1 == 1, ]

# tidyverse
my_data <- my_data %>% filter(col1 == 1)

Example: combining columns together

# Base R
my_data$col3 <- my_data$col1 + my_data$col2

# tidyverse
my_data <- my_data %>% mutate(col3 = col1 + col2)
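The two examples above can be combined and checked on a small, made-up data frame. This is a sketch only, assuming dplyr is installed; the names base_result and tidy_result are illustrative. It filters the rows where col1 equals 1 and then adds col3, once in base R and once with the tidyverse.

```r
library(dplyr)

my_data <- data.frame(col1 = c(1, 2, 1), col2 = c(10, 20, 30))

# Base R: filter rows, then add the new column
base_result <- my_data[my_data$col1 == 1, ]
base_result$col3 <- base_result$col1 + base_result$col2

# tidyverse: the same two steps, chained with the pipe
tidy_result <- my_data %>%
  filter(col1 == 1) %>%
  mutate(col3 = col1 + col2)

base_result$col3  # 11 31
tidy_result$col3  # 11 31
```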

Base R vs tidyverse

Example: grouping and summarizing

Say we have a random dataset:

# Generate a random dataset
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))

If we want to calculate the mean of col2 for each value of col1:

# The Base R way (the formula reads: "col2 grouped by col1")
base_means <- aggregate(col2 ~ col1, data = my_data, FUN = mean)

# Over time, you will see that the tidyverse way becomes more intuitive
tidy_means <- my_data %>% group_by(col1) %>% summarize(mean_col2 = mean(col2))
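summarize() can compute several summaries in one call, and its output is easy to cross-check against base R. The sketch below is illustrative: the seed value is arbitrary (it only makes the random data reproducible), the output column names n and mean_col2 are choices made for this example, and dplyr is assumed to be installed.

```r
library(dplyr)

set.seed(202)  # arbitrary seed, only for reproducibility
my_data <- data.frame(
  col1 = sample(1:3, 100, replace = TRUE),
  col2 = rnorm(100)
)

# Group size and mean of col2 for each value of col1
tidy_summary <- my_data %>%
  group_by(col1) %>%
  summarize(n = n(), mean_col2 = mean(col2))

# Cross-check the means with base R's tapply()
base_means <- tapply(my_data$col2, my_data$col1, mean)
all.equal(as.numeric(base_means), tidy_summary$mean_col2)

# The group sizes add up to the number of rows
sum(tidy_summary$n)  # 100
```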

Coming Up

  • First Stop: Check out the 📋 Getting Ready page.
  • Next Week’s Lab: Prepare for hands-on exercises in tidyverse.
  • If you are a former DS105 student: Explore ME204 for code and exercises in the DS105 style, but in tidyverse instead of pandas.

References

LSE. 2023. “LSE Short-Term Guidance for Teachers on Artificial Intelligence, Assessment and Academic Integrity in Preparation for the 2022-23 Assessment Period.” https://info.lse.ac.uk/staff/divisions/Eden-Centre/Assets-EC/Documents/AI-web-expansion-Feb-23/Updated-Guidance-for-staff-on-AI-A-AI-March-15-2023.Final.pdf.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.