🗓️ Week 01
Structure of this course

DS202 Data Science for Social Scientists

9/30/22

Who are we

The Data Science Institute

Activities of interest to you:

Our courses

The DSI offers accessible introductions to Data Science:

DS101

Fundamentals of Data Science

🎯 Focus:
theoretical concepts of data science

📂 How:
reflections through reading and writing

DS105

Data for Data Scientists

🎯 Focus:
collection and handling of real data

📂 How:
hands-on coding exercises and a group project

DS202

Data Science for Social Scientists

🎯 Focus:
fundamental machine learning algorithms

📂 How:
practical use of ML techniques and metrics

Your lecturer



Dr. Jonathan Cardoso-Silva

  • PhD in Computer Science
  • Background: Engineering, Bio & Health Informatics
  • Former Lead Data Scientist
  • Research:
    • Networks
    • Optimisation
    • Machine Learning applications
    • Data Science Workflow

Teaching Assistants

Dr. Stuart Bramwell
ESRC Postdoctoral Fellow
Department of Methodology
PhD in Politics (Oxford)

Yijun Wang
Guest Teacher at the DSI
PhD cand. in Health Informatics (KCL)
MSc in Data Science (KCL)

Mustafa Can Ozkan
Guest Teacher at the DSI
PhD cand. in the Spacetime Lab (UCL)
MSc in Transport (Imperial/UCL)

Xiaowei Gao
Guest Teacher at the DSI
PhD cand. in the Spacetime Lab (UCL)
MSc in Data Science (KCL)

Anton Boichenko
Guest Teacher at the DSI
Product Developer at Decoded
MSc in Applied Social Data Science (LSE)

Who are you

Programme | Freq
BSc in Economics | 34
BSc in Psychological and Behavioural Science | 32
General Course | 11
BSc in Politics and Economics | 4
LLB in Laws | 3
BSc in International Relations | 2
BSc in Philosophy and Economics | 2
BSc in Philosophy, Politics and Economics | 2
BSc in Economic History and Geography | 1
BSc in Economics and Economic History | 1
BSc in Geography with Economics | 1
BSc in International Relations and History | 1
BSc in Mathematics, Statistics and Business | 1
BSc in Philosophy, Logic and Scientific Method | 1
Erasmus Reciprocal Programme of Study | 1
Exchange Programme for Students from University of California, Berkeley | 1

Degree Programme vs Year of Study

Course Selection Options

BSc in Economics

BSc in Psychological and Behavioural Science

BSc in Politics and Economics

BSc in International Relations

Learning Objectives

  • Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds.
  • Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or for data mining.
  • Know how to fit and apply supervised machine learning models for classification and prediction.

Learning Objectives (cont.)

  • Know how to evaluate and compare fitted models, and to improve model performance.
  • Use applied computer programming, including the hands-on use of programming through course exercises.
  • Apply the methods learned to real data through hands-on exercises.
  • Integrate the insights from data analytics into knowledge generation and decision-making.

Learning Objectives (cont.)

  • Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
  • Learn how data science methods have been applied to a particular domain of study (applications).

Philosophy of this course

  • It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
  • One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
  • It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones!).

Philosophy of this course (cont.)

  • This is an exciting research area, having important applications in science, industry and policy.
  • Machine learning is a fundamental ingredient in the training of a modern data scientist.

What do you need to know to get the most out of this course?

The basics of statistics

Basic concepts of Statistics you might want to recap:

  • Expected value, mean, median, variance, standard deviation
  • Probabilities and simple probability distributions
  • Types of data
    • discrete vs continuous
    • categorical vs numerical vs ordinal
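
As a quick self-check on the concepts above, here is a minimal sketch in base R. All values are made up for illustration; none of them come from the course materials.

```r
# Hypothetical sample of marks (made-up numbers, for illustration only)
scores <- c(52, 61, 61, 68, 70, 74, 83)

avg    <- mean(scores)    # arithmetic mean
middle <- median(scores)  # middle value of the sorted sample
s2     <- var(scores)     # sample variance (divides by n - 1)
s      <- sd(scores)      # sample standard deviation, equal to sqrt(s2)

# Expected value of a simple discrete distribution: one roll of a fair die
faces <- 1:6
probs <- rep(1 / 6, 6)
expected_value <- sum(faces * probs)

# Types of data (hypothetical variables)
n_siblings <- c(0L, 2L, 1L, 3L)                  # numerical, discrete
height_cm  <- c(171.2, 164.8, 180.1, 158.4)      # numerical, continuous
programme  <- factor(c("Economics", "Laws", "Economics", "Politics"))  # categorical
agreement  <- factor(c("low", "high", "medium", "low"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)             # ordinal
```

If any of these lines looks unfamiliar, that is a good pointer to what to recap.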

Resources (Stats)

A few references that might be useful to read or skim through:

The basics of R programming

Basic concepts of programming in R to recap:

  • data structures (vectors, matrices, data frames)
  • how to manipulate data (filter, subset, select)
  • read/write data files (for example: CSV, JSON, TXT)
  • (optional but encouraged) some knowledge of the tidyverse can give you a productive boost
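
The sketch below touches each item in the list using base R only. The data frame and its values are hypothetical, and the file is written to a temporary path so nothing on your machine is overwritten.

```r
# Data structures
v <- c(1, 2, 3)              # numeric vector
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column

# A small, made-up data frame (hypothetical values, for illustration only)
students <- data.frame(
  name      = c("Ana", "Ben", "Chen", "Dee"),
  programme = c("Economics", "Laws", "Economics", "Politics"),
  year      = c(2, 3, 2, 1)
)

# Manipulate data: keep only Economics students, and only two columns
econ <- subset(students, programme == "Economics", select = c(name, year))

# Write the data frame to a CSV file, then read it back
path <- tempfile(fileext = ".csv")
write.csv(students, path, row.names = FALSE)
students_back <- read.csv(path)
```

The tidyverse offers equivalents for the filtering and file steps, but base R is enough to check that you are comfortable with the ideas.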

Resources (R)

‘What if I struggle with R?’

➡️ Our first lab (Week 02) is a recap of some basic R commands, plus some ggplot2.

  • If you are not confident in your R skills, I strongly encourage you to invest time in studying the basics over the next couple of weeks.
  • Contact the LSE Digital Skills Lab to attend in-person workshops or take self-paced online R courses.

Any questions?

Structure of this course

Syllabus

Intro
    Week 01: Introduction, Context & Key Concepts
Supervised Learning
    Week 02: Simple and Multiple Linear Regression
    Week 03: Classifiers (Logistic Regression & Naive Bayes)
    Week 04: Resampling methods
    Week 05: Non-linear algorithms (SVM & tree-based models)
Unsupervised Learning
    Week 07: Clustering
    Week 08: PCA
Applications
    Week 09: Predictive Modelling on Tabular Data
    Week 10: Text as Data & Topic Modelling
    Week 11: Social Media Data

Structure of lectures 👨🏻‍🏫

Our lectures will be split into two parts:

  • Part I (~ 50 min): Traditional exposition of theoretical content
  • break (~ 10 min): Grab coffee or relax 🧘
  • Part II (~ 50 min): Live demo
    • Typically, an exploratory analysis or application of an algorithm
    • Feel free to follow along on your own laptop.

Structure of classes 👩‍💻

  • Students will work on weekly, structured problem sets in the staff-led class sessions.
  • Tips to get the most out of classes:
    • Bring your own laptop 💻 (most tablets are not suitable for programming)
    • Do the recommended reading before class
    • Skim through the problem set before class

Class groups

Group 01

  • 📆 Mondays
  • ⌚ 09:00 — 10:30
  • 📍 PAN.1.03

Group 02

  • 📆 Mondays
  • ⌚ 10:30 — 12:00
  • 📍 PAN.1.03

Group 03

  • 📆 Mondays
  • ⌚ 13:00 — 14:30
  • 📍 MAR.1.09

Group 04

  • 📆 Fridays
  • ⌚ 16:00 — 17:30
  • 📍 NAB.1.04

Group 05

  • 📆 Mondays
  • ⌚ 09:00 — 10:30
  • 📍 32L.LG.11

Group 06

  • 📆 Mondays
  • ⌚ 10:30 — 12:00
  • 📍 32L.LG.11

Group 07

  • 📆 Fridays
  • ⌚ 09:30 — 11:00
  • 📍 CBG.2.06

Your background knowledge

  • Please help our teaching team understand your needs as we prepare for the first labs next week.

  • Find the link to the survey on our Slack group or point your phone at the QR code below.

Assessments 📔

The breakdown of assessment for this class will be as follows:

Problem sets (60%)

  • Summative problem sets will be released in Weeks 5, 8 & 11.
  • These will have a similar style to the formative problem sets: a mix of R tasks and your written interpretation of the analyses.
  • You will have 4-6 days to submit your solutions.
  • Each of the three summative problem sets is worth 20% of the final mark and will be graded on a 100-point scale.

Take-home exam (40%)

  • An open-book take-home exam, taken during the January exams period.
  • Exam questions will be comparable in style to the problem sets.
  • The exam questions will be released on Moodle on 5 January 2023 (tentative).
  • The exam is due on 11 January 2023 at 4pm (tentative).
  • ⚠️ Update 11/10/2022: Last year, the DS202 exam was held entirely online due to COVID-19 mitigation procedures. We want to run it online via our own Moodle page again this academic term, but we first need to check LSE regulations on exams for this year. We will update you on this very soon (hopefully by the end of W04).
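
To see how the weights combine (three problem sets at 20% each plus the exam at 40%), here is a small R sketch. The marks are hypothetical, and the exam is assumed to be marked on the same 100-point scale as the problem sets.

```r
# Hypothetical marks on a 100-point scale (made-up numbers)
problem_sets <- c(68, 72, 65)   # the three summative problem sets
exam         <- 70              # the take-home exam (scale is an assumption)

# Weights stated on the slides: 3 x 20% + 40% = 100%
weights <- c(0.20, 0.20, 0.20, 0.40)

# The final mark is the weighted sum of the four components
final_mark <- sum(weights * c(problem_sets, exam))
```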

Office hours

  • It is probably a good idea to book office hours if:
    • you struggled with a technical or theoretical aspect of a problem set in the previous week,
    • you have queries about careers in data science,
    • you want guidance on how to apply data science to other things you are studying outside this course.
  • Come prepared. You only have 15 minutes.
  • Ask for help sooner rather than later.
  • Book slots via StudentHub up to 12 hours in advance.

Communication

  • Join our Slack group (more info here).
  • Use the public Slack channels to talk and to share links, content (or memes) with your colleagues.
  • Our teaching team will dedicate some time during the week to answering questions and other interactions on Slack.
  • Reserve 📧 e-mail for formal requests: extensions, deferrals, etc.
    • There is no need to e-mail us to say you will miss a class, for example.

Any questions?

How did we get here?

This abundance of data is strongly associated with the dramatic changes in technology in the past few decades.

St. Peter’s Basilica at the Vatican on 📅 19 April 2005, when Ratzinger was elected the 265th pope.

St. Peter’s Basilica at the Vatican on 📅 13 March 2013, when Pope Francis was elected the 266th pope.

We changed how we consume music 🎧

We changed how we consume video 🎞️

Smartphones 📱 are a very recent thing

We spend a lot more time connected

… and our social media habits keep on changing

The possibilities

  • Humans and machines nowadays generate A LOT of data ALL THE TIME
  • It has become cheap to collect and store this data
  • This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?

New questions enabled by new data:

Is social media a threat to democracy?

What’s next

After our 10-min break ☕:

  • Given all this, what do we mean by data science?
  • A tale of unicorns
  • How do machines learn?
  • Different types of learning
  • What to expect from the rest of this course
  • The tools you will need

References

Fischer-Baum, Reuben. 2017. “What ‘Tech World’ Did You Grow up In?” Washington Post. https://www.washingtonpost.com/graphics/2017/entertainment/tech-generations/.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781139161879.
Ismay, Chester, and Albert Young-Sun Kim. 2020. Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Chapman & Hall/CRC The R Series. Boca Raton: CRC Press / Taylor & Francis Group. https://moderndive.com/.
Kolawole, Emi. 2013. “About Those 2005 and 2013 Photos of the Crowds in St. Peter’s Square.” Washington Post. http://wapo.st/WKKTMh.
Warne, Russell T. 2018. Statistics for the Social Sciences: A General Linear Model Approach. https://www.cambridge.org/highereducation/books/statistics-for-the-social-sciences/716FF25785A6154CC6822D067A959445.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly. https://r4ds.had.co.nz/.