LSE DS202 – Data Science for Social Scientists
04 Oct 2024
decision support systems
machine learning applications
databases
provenance
ethical AI/XAI
DS202A Weekly Drop-in sessions:
Write an e-mail to Kevin:
Sign up for DSI events at lse.ac.uk/DSI/Events
Follow the seminar series: 🔗 Link
Hear from alumni or industry experts about their career paths and how they got to where they are today.
Latest event:
🗓️ Keeping London Moving with Data (28 February 4 - 5.30pm)
A talk about life in the data world at TfL. Jemima, Graduate Data Scientist at Transport for London (TfL), will talk about her experience as a Data Science Graduate in our inaugural programme. Lauren Sager Weinstein, Chief Data Officer at Transport for London (TfL), will talk about how she’s leading TfL’s data strategy, and how all the components of data careers (data scientists, data developers, data product managers, and data users) can come together to deliver on our data vision: To empower our people to make better decisions with data.
Read more about this series of events: 🔗 Link
‘Winners’ of the upcoming Bank of England trip will be announced soon!
Programme | Freq |
---|---|
BSc in Psychological and Behavioural Science | 44 |
BSc in Politics and Data Science | 4 |
BSc in Economics | 3 |
General Course | 3 |
BSc in International Social and Public Policy | 2 |
BSc in Philosophy, Politics and Economics | 2 |
Exchange Programme for Students from Stockholm School of Economics | 2 |
BSc in International Relations | 1 |
BSc in International Social and Public Policy with Politics | 1 |
BSc in Sociology | 1 |
Exchange Programme for Students from Central European University | 1 |
Exchange Programme for Students from SGH Warsaw School of Economics | 1 |
Year | Count |
---|---|
1 | 9 |
2 | 6 |
3 | 49 |
4 | 1 |
What is this course about?
Focus: learn and understand the most fundamental machine learning algorithms
How: practical use of machine learning techniques and their metrics, applied to relevant data sets
What is this course about?
How will this course be taught?
How do I prepare for this course?
Important
There might be some preparatory work to do before each lab!
Always check Moodle/the webpage at least a day before coming to the lab.
Each week, you will have a roadmap of what to do.
The roadmap will typically contain the following elements:
Type of activity | Description |
---|---|
🧑🏻🏫 TEACHING MOMENT | Your class teacher deserves your full attention |
🎯 ACTION POINTS | Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher. |
👥 IN PAIRS/GROUPS | You will benefit from completing that task with your peers more than doing it alone |
🗣️ CLASSROOM DISCUSSION | Your class teacher will facilitate a discussion about the task |
📝 SUBMISSION | Submit your work |
👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.
If you are reading this but you are not an LSE student, the same content is available on the course’s 🌐 public website
More on that later…
We assume that you have some basic knowledge of:
- If you took ST102, you should be fine.
- Nothing crazy, mostly matrix operations (simpler than MA107)
- It’s ok if you are new to R, but do reserve some extra hours in the first weeks to practice the basics.
Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”
[…] tidyverse documentation instead of explaining it directly.
Do you use ChatGPT, GitHub Copilot, or other AI tools?
There are three official positions at LSE:
Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)
Position 2: Limited authorised use of generative AI in assessment.
Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course
Source: School position on generative AI, LSE Website, September 2024
Examples:
“I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.”
“I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.”
What do you think of generative AI tools?
Participating Courses:
You can read more about the GENIAL project on the project page.
What we have learned so far:
We haven’t fully analysed the data yet (lots of it!⛰️) but here’s what we can say for now about the good and bad aspects of using generative AI tools in education:
[…] scrapy, the code must contain functions – no classes – and I want to save the data in a CSV file.”) and would always check the code/output generated by GenAI against the course materials or reputable sources. They were able to identify when the AI was suggesting something that was not correct or not following best practices and would never blindly accept the AI’s suggestions.
Read more about it in our preprint:
Dorottya Sallai, Jonathan Cardoso-Silva, Marcos E. Barreto, Francesca Panero, Ghita Berrada, and Sara Luxmoore. “Approach Generative AI Tools Proactively or Risk Bypassing the Learning Process in Higher Education”, Preprint, July 2024.
Image created with DALL·E2. Prompt: “Cat drinking tea in a classroom, Renoir style.”
Our first proper lecture will start in a few minutes.
“What really is data science? + R tips”
“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.
Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),
and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”
New data to answer old questions:
New questions enabled by new data/new technologies:
We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.
You might ask:
“How is data science any different from what I have learned in other stats courses?”
👉 Traditional Statistics in the social sciences: the goal is typically explanation
👉 Data science: the focus is frequently put more on data exploration and prediction
It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.
This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.
Next week, we will discuss together what it means for a machine to learn something.
But first, a word about programming skills 👉
Data types
R Vectors
[…] length() on any variable:
R Vectors (cont.)
This is straightforward:
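As a small illustration of the vector ideas on these slides (a sketch only; the variable names and values are invented here, not taken from the original slides): in R, even a single number is stored as a vector of length one, which length() confirms.

```r
# Hypothetical values, chosen only for illustration

# A single value is already a vector of length 1
x <- 42
length(x)        # 1

# c() combines values into a longer vector
ages <- c(21, 35, 48)
length(ages)     # 3

# Indexing starts at 1 in R (not 0 as in Python)
ages[1]          # 21

# Arithmetic is vectorised: applied to every element at once
ages_next_year <- ages + 1
ages_next_year   # 22 36 49
```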
Lists (not the same as vectors)
[…] [[ ]]
[…] [[1]]) is a vector of size 1 ([1]) that contains the number 2, etc.
Lists are more flexible than vectors (they are also slower to process)
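A minimal sketch of the list behaviour described above, using a made-up list whose first element is the number 2 (the list on the original slide is not shown here, so its contents are an assumption). Each element of a list is itself a vector, and double brackets pull that element out.

```r
# Hypothetical list: it can mix types and lengths, which a vector cannot
my_list <- list(2, "hello", c(TRUE, FALSE))

# Double brackets extract the element itself
# (here, a numeric vector of size 1)
first <- my_list[[1]]
first               # 2
class(first)        # "numeric"
length(first)       # 1

# Single brackets return a shorter list, not the element
class(my_list[1])   # "list"
```

The difference between `[[ ]]` and `[ ]` is a common source of confusion: the first gives you the contents, the second gives you a list that still wraps the contents.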
Vectors are always flat:
yields a simple vector:
[1] 1 2 3 4 5
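One way to produce output like this (an assumed example, since the exact call is not shown here): nesting c() calls does not create nested vectors, whereas the same nesting with list() keeps the structure.

```r
# Nested calls to c() are flattened into a single vector
flat <- c(1, c(2, 3), c(4, 5))
print(flat)      # [1] 1 2 3 4 5
length(flat)     # 5

# The same nesting with list() is preserved
nested <- list(1, list(2, 3), list(4, 5))
length(nested)   # 3
```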
Obs: Python does not have vectors, only lists
Loops are not that different
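A minimal sketch of an R for loop, with the Python equivalent noted in a comment for comparison. The vector is invented for illustration.

```r
# Hypothetical data
fruits <- c("apple", "banana", "cherry")

# Python equivalent:  for fruit in fruits: print(fruit)
for (fruit in fruits) {
  print(fruit)
}

# Looping over positions; seq_along() is safer than 1:length(fruits)
# because it also behaves correctly for empty vectors
n_chars <- integer(length(fruits))
for (i in seq_along(fruits)) {
  n_chars[i] <- nchar(fruits[i])
}
n_chars   # 5 6 6
```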
Custom function definitions, compared
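A short sketch of a custom function in R, with a Python version of the same function in comments. The function name and arguments are hypothetical.

```r
# Python equivalent:
#   def add_numbers(a, b=1):
#       return a + b
add_numbers <- function(a, b = 1) {
  # The value of the last expression is returned automatically;
  # an explicit return(a + b) would also work
  a + b
}

add_numbers(2)          # 3
add_numbers(2, b = 5)   # 7
```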
R has a base set of functions that come with the installation of the language
The base functions are OK - they are just not awesome.
The tidyverse is not part of the base R installation, but it is a very popular package
It is actually a collection of several packages that make it easier to manipulate data (+ databases + plotting + modelling + etc.)
This is what we will use in this course. (We suffered tremendously teaching base R two years ago)
Note to Python users
Think of the tidyverse as what pandas is to Python
Example: reading a csv file
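A minimal sketch using read_csv() from the readr package (part of the tidyverse). The file and its contents are made up here, written to a temporary location so the example runs on its own; in practice you would pass the path to your own file.

```r
library(readr)  # part of the tidyverse

# Hypothetical data: write a tiny CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), csv_path)

# read_csv() returns a tibble and guesses the column types
scores <- read_csv(csv_path, show_col_types = FALSE)
scores
```

The `show_col_types = FALSE` argument only silences the message about guessed column types; it can be left out.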
Example: selecting columns
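A small sketch of column selection with dplyr's select(), on an invented data frame (the data on the original slide is not shown here).

```r
library(dplyr)  # part of the tidyverse

# Hypothetical data
students <- data.frame(
  name  = c("Ana", "Ben", "Cai"),
  year  = c(1, 3, 3),
  score = c(90, 85, 72)
)

# Keep only the columns you name
selected <- select(students, name, score)

# A minus sign drops a column instead
without_year <- select(students, -year)
```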
The pipe operator
[…] %>%. But it is quite simple: when you see %>%, think of it as the word “then”.
The pipe operator
[…] |>)
[…] %>% and |>:
Without method chaining
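A rough sketch of the contrast between nested calls and a piped chain, using R's native pipe |> (available from R 4.1). The numbers are invented; the magrittr pipe %>% would read the same way in this example.

```r
# Hypothetical data
scores <- c(72, 85, 90, 64)

# Without method chaining: read from the inside out
nested_result <- round(mean(sqrt(scores)), 1)

# With the pipe: read left to right,
# "take scores, then sqrt, then mean, then round"
piped_result <- scores |> sqrt() |> mean() |> round(1)

identical(nested_result, piped_result)   # TRUE
```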
Example: filtering rows
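A small sketch of row filtering with dplyr's filter(), again on an invented data frame.

```r
library(dplyr)  # part of the tidyverse

# Hypothetical data
students <- data.frame(
  name  = c("Ana", "Ben", "Cai"),
  year  = c(1, 3, 3),
  score = c(90, 85, 72)
)

# Keep rows where the condition is TRUE
third_years <- students |> filter(year == 3)

# Several conditions separated by commas are combined with AND
high_third <- students |> filter(year == 3, score > 80)
```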
Example: combining columns together
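One possible reading of this example, assuming it means building a new column out of two existing ones: dplyr's mutate() together with base R's paste(). The data frame and column names are hypothetical, and the original slide may have used a different function.

```r
library(dplyr)  # part of the tidyverse

# Hypothetical data
people <- data.frame(
  first = c("Ana", "Ben"),
  last  = c("Silva", "Jones")
)

# mutate() adds a column; paste() joins the two strings with a space
people <- people |> mutate(full_name = paste(first, last))
people$full_name   # "Ana Silva" "Ben Jones"
```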
Example: grouping and summarizing
Say we have a random dataset:
# Generate a random data frame called my_data
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))
If we want to calculate the mean of col2 for each value of col1:
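A minimal sketch of that calculation with dplyr's group_by() and summarise(). The data frame is re-created so the block runs on its own; the set.seed() call and the output column name mean_col2 are additions made here for reproducibility and readability, not part of the original.

```r
library(dplyr)  # part of the tidyverse

set.seed(202)  # assumption: fixed seed so the numbers can be reproduced
my_data <- data.frame(col1 = sample(1:3, 100, replace = TRUE), col2 = rnorm(100))

# Take my_data, then group it by col1, then compute one mean per group
group_means <- my_data |>
  group_by(col1) |>
  summarise(mean_col2 = mean(col2))

group_means
```

The result has one row per distinct value of col1, which is the same shape you would get from a groupby followed by a mean in pandas.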
[…] tidyverse instead of pandas.
LSE DS202A (2023/24) – Week 01