LSE DS202 – Data Science for Social Scientists
24 Jan 2025
decision support systems
machine learning applications
databases
provenance
ethical AI/XAI
DS202W Weekly Drop-in sessions:
Write an e-mail to Kevin:
Sign up for DSI events at lse.ac.uk/DSI/Events
Follow the seminar series: 🔗 Link
Hear from alumni or industry experts about their career paths and how they got to where they are today.
Latest events:
🗓️ Data Science across industries (03 December 2024 - 4.00 to 5.30pm)
Machine learning is transforming large parts of the economy, and data scientists have the opportunity to apply their skills in an incredibly broad variety of domains. The technical field is progressing rapidly and professional roles are in continuous development as companies navigate successive waves of technological and economic change. Data scientists must therefore craft skill paths which balance a focus on rapid learning with capabilities complementing their domain, organisations and wider industry.
Drawing on his experience from startups, consulting and tech, Christian Svalesen, Senior Machine Learning Engineer at SoundCloud will provide insights into what data science roles and projects can involve across industries. He will share advice on how students can prepare and develop through their professional journey.
Read more about this series of events: 🔗 Link
🗓️ Navigating Data Science from Academia to Media, and Beyond (23 October 2024 - 4.30 to 6pm)
With the rise in adoption of AI/ML technologies and the increasing demand for data-driven decision-making, data science has become a vital component across many industries, including media. As data science transforms the media landscape - enhancing content personalisation, optimising conversion strategies, and improving audience engagement, it is also becoming an increasingly popular tool for addressing complex business challenges. Navigating a role in this field can be both exciting and challenging.
Tabtim Duenger, Senior Data Scientist at The Economist, and Riya Chhikara, Data Scientist at The Economist, both LSE graduates, will offer insights into their paths to entering the field of data science. They will discuss their experiences in landing their first roles, negotiating their functions and responsibilities within the media sector, and how they use these experiences and networks to continue guiding their careers.
Read more about this series of events: 🔗 Link
‘Winners’ of the upcoming Bank of England trip will be announced soon!
Programme | Freq |
---|---|
General Course | 14 |
BSc in Economics | 11 |
BSc in Politics and Data Science | 6 |
BSc in Psychological and Behavioural Science | 3 |
BSc in Philosophy, Politics and Economics | 2 |
BSc in Economics and Economic History | 1 |
BSc in International Social and Public Policy and Economics | 1 |
BSc in Politics and Economics | 1 |
BSc in Politics and International Relations | 1 |
Year | Count |
---|---|
1 | 17 |
2 | 10 |
3 | 11 |
4 | 2 |
What is this course about?
Focus: learn and understand the most fundamental machine learning algorithms
How: practical use of machine learning techniques and their metrics, applied to relevant data sets
How will this course be taught?
How do I prepare for this course?
Important
There might be some preparatory work to do before each lab!
Always check Moodle/the webpage at least a day before coming to the lab.
Each week, you will have a roadmap of what to do.
The roadmap will typically contain the following elements:
Type of activity | Description |
---|---|
🧑🏻‍🏫 TEACHING MOMENT | Your class teacher deserves your full attention |
🎯 ACTION POINTS | Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher. |
👥 IN PAIRS/GROUPS | You will benefit from completing that task with your peers more than doing it alone |
🗣️ CLASSROOM DISCUSSION | Your class teacher will facilitate a discussion about the task |
📝 SUBMISSION | Submit your work |
👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.
If you are reading this but you are not an LSE student, the same content is available on the course’s 🌐 public website
We assume that you have some basic knowledge of:
- If you took ST102, you should be fine.
- Nothing crazy, mostly matrix operations (simpler than MA107)
- It’s ok if you are new to Python, but do reserve some extra hours in the first weeks to practice the basics.
Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”
pandas or scikit-learn or numpy or Python documentation instead of explaining it directly.
Do you use ChatGPT, GitHub Copilot, or other AI tools?
There are three official positions at LSE:
Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)
Position 2: Limited authorised use of generative AI in assessment.
Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course
Source: School position on generative AI, LSE Website, September 2024
Examples:
“I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.”
“I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.”
What do you think of generative AI tools?
Participating Courses:
You can read more about the GENIAL project on the project page.
What we have learned so far:
We haven’t fully analysed the data yet (lots of it!⛰️) but here’s what we can say for now about the good and bad aspects of using generative AI tools in education:
scrapy, the code must contain functions – no classes – and I want to save the data in a CSV file.”) and would always check the code/output generated by GenAI against the course materials or reputable sources. They were able to identify when the AI was suggesting something that was not correct or not following best practices and would never blindly accept the AI’s suggestions.
Read more about it in our preprint:
Dorottya Sallai, Jonathan Cardoso-Silva, Marcos E. Barreto, Francesca Panero, Ghita Berrada, and Sara Luxmoore. “Approach Generative AI Tools Proactively or Risk Bypassing the Learning Process in Higher Education”, LSE Public Policy Review, 3(3), p. 7, 2024.
Our first proper lecture will start in a few minutes.
“What really is data science? + Python tips”
“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.
Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),
and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”
New data to answer old questions:
New questions enabled by new data/new technologies:
We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.
You might ask:
“How is data science any different from what I have learned in other stats courses?”
👉 Traditional Statistics in the social sciences: the goal is typically explanation
👉 Data science: the focus is frequently put more on data exploration and prediction
It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.
This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.
Next week, we will discuss together what it means for a machine to learn something.
But first, a word about programming skills 👉
A few stats
Data types
var = "value" # A string. Single quotes are OK too
# Python also has an additional option (triple double quotes) to simplify the handling of strings that contain newlines
var = """I want to write a
sentence without caring for line breaks"""
var = 2.2 # A float
var = 2 # An int (🏅)
var = float(2) # A float
In Python, less is more! Always be explicit when using the greedier data types…
Python lists
returns:
[1, 2, 3, 4]
You could use the append method to add elements to a list:
returns:
[1, 2, 3, 4, 5]
You could also use the extend method to add several elements in one go:
returns:
[1, 2, 3, 4, 5, 6, 7, 8]
Yet another way to add elements to a list is as follows:
returns:
[5, 7, 9, 8, 0, 4]
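The code for the list examples above is not shown on this page. A minimal sketch that reproduces the same outputs could look like the following; the variable names are illustrative, and the use of the `+` operator for the last example is an assumption rather than the slide's confirmed approach.

```python
# Create a list (variable names are illustrative, not from the slides)
numbers = [1, 2, 3, 4]
print(numbers)  # [1, 2, 3, 4]

# append adds a single element to the end of the list, in place
numbers.append(5)
print(numbers)  # [1, 2, 3, 4, 5]

# extend adds every element of another iterable, in place
numbers.extend([6, 7, 8])
print(numbers)  # [1, 2, 3, 4, 5, 6, 7, 8]

# Assumption: the "other way" is list concatenation with +,
# which builds a new list and leaves both inputs unchanged
first = [5, 7, 9]
second = [8, 0, 4]
combined = first + second
print(combined)  # [5, 7, 9, 8, 0, 4]
```

Note that `append` and `extend` modify the existing list, whereas `+` returns a new one.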
Python lists (cont.)
Other types of data collections
returns
(1, 2, 3)
returns
2
What do you think is the difference here?
Tuples are immutable!
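As a hedged illustration of what immutability means in practice, the sketch below creates a tuple, reads one element, and then contrasts item assignment on a list with the same attempt on a tuple. The names `point` and `values` are illustrative and not taken from the slides.

```python
# A tuple is written with parentheses; indexing works as for lists
point = (1, 2, 3)
print(point)     # (1, 2, 3)
print(point[1])  # 2

# A list can be changed in place...
values = [1, 2, 3]
values[1] = 20
print(values)    # [1, 20, 3]

# ...but assigning to a tuple element raises a TypeError
try:
    point[1] = 20
except TypeError as error:
    message = str(error)
    print("Tuples cannot be modified:", message)
```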
Is there a way to update tuples? Yes!
First method
Second method
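The two methods themselves are not shown on this page. Two common workarounds are sketched below as an assumption about what they might be: converting to a list and back, and building a new tuple by concatenation. In both cases the original tuple is never modified; a new tuple is created instead.

```python
original = (1, 2, 3)

# Workaround A (assumed): convert to a list, change it, convert back
as_list = list(original)
as_list[1] = 20
updated = tuple(as_list)
print(updated)   # (1, 20, 3)

# Workaround B (assumed): concatenate with another tuple to get a new one
# (the trailing comma is what makes (4,) a one-element tuple)
extended = original + (4,)
print(extended)  # (1, 2, 3, 4)

# The original tuple is left untouched
print(original)  # (1, 2, 3)
```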
Aside from lists and tuples, you also have dictionaries and other more complex data collection types (for these, see the documentation).
A Python dictionary is a collection of key-value pairs, where each key corresponds to its associated value. For example:
returns:
{'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London'}
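A minimal sketch that produces the dictionary shown above might look like this. The variable name `person` is an assumption, and the lookups are added only to illustrate how keys map to values.

```python
# Keys and values are taken from the output shown above;
# the variable name is illustrative
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}
print(person)

# Look up a value by its key
print(person["city"])  # London

# get returns a default instead of raising KeyError for a missing key
country = person.get("country", "unknown")
print(country)  # unknown
```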
Some basic operations
With type(), you will get the type of your Python object.
The above returns:
<class 'dict'>
<class 'float'> #since var=float(2.0)
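The calls that produced this output are not shown on this page. A sketch consistent with it, assuming a dictionary named `person` (a hypothetical name) and the `var = float(2)` assignment from the data types slide, could be:

```python
# Hypothetical objects matching the two types in the output above
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}
var = float(2)

# type() reports the class of any Python object
print(type(person))  # <class 'dict'>
print(type(var))     # <class 'float'>
```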
Sometimes, we need to perform operations repeatedly
We have loops (for or while loops):
(Note that Python needs indentation and you absolutely can’t mix tabs and spaces!)
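The loop examples are not shown on this page. As a small sketch of the two loop types, with illustrative variable names, note how the indented lines form the body of each loop:

```python
# for loop: iterate over a known sequence of values
squares = []
for number in range(5):
    squares.append(number ** 2)
print(squares)  # [0, 1, 4, 9, 16]

# while loop: repeat for as long as a condition holds
countdown = []
remaining = 3
while remaining > 0:
    countdown.append(remaining)
    remaining -= 1  # without this update the loop would never end
print(countdown)  # [3, 2, 1]
```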
And you have list comprehensions (as well as dictionary comprehensions)
# dictionary that associates a number with its square
squares = {x: x**2 for x in range(10)}
print(squares)
returns
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
# code that produces a dictionary that only contains the even values from the original dictionary
original_dict = {"a": 1, "b": 2, "c": 3, "d": 4}
filtered_dict = {k: v for k, v in original_dict.items() if v % 2 == 0}
print(filtered_dict)
returns
{'b': 2, 'd': 4}
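The two examples above are dictionary comprehensions. For comparison, a sketch of the equivalent list comprehensions (names are illustrative) uses square brackets and a single expression instead of a key-value pair:

```python
# list of the squares of the numbers 0 to 9
squares_list = [x**2 for x in range(10)]
print(squares_list)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# an optional if clause filters the elements, as in the dictionary case
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]
```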
Custom functions definition
Let’s define functions based on the loops and list comprehension from before. We’ll do some code profiling!
import cProfile

def for_loop_example():
    result = []
    for i in range(100000):
        result.append(i * 2)

def while_loop_example():
    result = []
    i = 0
    while i < 100000:
        result.append(i * 2)
        i += 1

def list_comprehension_example():
    result = [i * 2 for i in range(100000)]

# Profile each function
print("Profiling for loop:")
cProfile.run("for_loop_example()")
print("\nProfiling while loop:")
cProfile.run("while_loop_example()")
print("\nProfiling list comprehension:")
cProfile.run("list_comprehension_example()")
Results from the loops and list comprehension profiling
Profiling for loop:
100004 function calls in 0.022 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.021 0.021 <python-input-66>:3(for_loop_example)
1 0.001 0.001 0.022 0.022 <string>:1(<module>)
1 0.000 0.000 0.022 0.022 {built-in method builtins.exec}
100000 0.008 0.000 0.008 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Profiling while loop:
100004 function calls in 0.021 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.014 0.014 0.020 0.020 <python-input-66>:8(while_loop_example)
1 0.001 0.001 0.021 0.021 <string>:1(<module>)
1 0.000 0.000 0.021 0.021 {built-in method builtins.exec}
100000 0.006 0.000 0.006 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Profiling list comprehension:
4 function calls in 0.003 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.002 0.002 0.002 0.002 <python-input-66>:15(list_comprehension_example)
1 0.001 0.001 0.003 0.003 <string>:1(<module>)
1 0.000 0.000 0.003 0.003 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
pandas and scikit-learn (briefly)
Python ships with standard library modules (os, collections, math, etc. - see the Python documentation for details)
The pandas, numpy and scikit-learn libraries are not part of the standard Python libraries, but they are very popular and very actively maintained packages.
These packages contain most of the functionality needed to handle datasets, manipulate them (pandas mainly), perform statistical operations on them and apply machine learning models to them.
These are the libraries we will rely on most in this course.
Note to R users
Think of pandas as what tidyverse is to R and, to some extent, of scikit-learn as what tidymodels (and perhaps caret) are to R.
pandas
Example: reading a csv file
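The original snippet for this example is not shown on this page. A minimal sketch using `pd.read_csv` follows; the CSV content is made up for illustration and is read from an in-memory string so the example runs without any file on disk. In practice you would pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; with a real file you would write
# pd.read_csv("path/to/your_file.csv") instead
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv parses the header row into column names and infers column types
df = pd.read_csv(io.StringIO(csv_text))

# head shows the first rows, a quick sanity check after loading
print(df.head())
```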
In pandas, you write multiple functions in succession or use method chaining:
Without method chaining
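To make the contrast concrete, here is a hedged sketch of the same three steps written both ways. The dataframe and the chosen steps (filter, sort, reset the index) are illustrative assumptions, not the slide's original example.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

# Without method chaining: one intermediate variable per step
filtered = df[df["Age"] > 25]
sorted_df = filtered.sort_values("Age", ascending=False)
no_chain = sorted_df.reset_index(drop=True)

# With method chaining: each method is called on the previous result
chained = (
    df.query("Age > 25")
    .sort_values("Age", ascending=False)
    .reset_index(drop=True)
)

print(chained)
```

Both versions give the same result; chaining avoids naming intermediate dataframes that are used only once.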
Example: filtering rows
Filtering when the values are integers
Filtering when the values are strings
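The filtering snippets are not shown on this page. A sketch of both cases using boolean masks, on a made-up dataframe (the `City` column and all values are assumptions), could be:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["London", "Paris", "London", "Berlin"],
})

# Filtering when the values are integers: compare the column to a number
older = df[df["Age"] >= 30]

# Filtering when the values are strings: compare to a string...
londoners = df[df["City"] == "London"]

# ...or use isin to match any value from a list of strings
selected = df[df["City"].isin(["London", "Berlin"])]

print(older)
print(londoners)
print(selected)
```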
Example: concatenating dataframes
Say we have two random datasets:
df1 = pd.DataFrame({
"Name": ["Alice", "Bob"],
"Age": [25, 30]
})
df2 = pd.DataFrame({
"Name": ["Charlie", "David"],
"Age": [35, 40]
})
If we want to concatenate both dataframes vertically (i.e. Name and Age stay the columns) then:
which returns
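The concatenation call and its output are not shown on this page. A minimal sketch, assuming `pd.concat` with `ignore_index=True` (the original may have kept the default index), would be:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30]
})
df2 = pd.DataFrame({
    "Name": ["Charlie", "David"],
    "Age": [35, 40]
})

# Stack the rows of df2 under those of df1 (axis=0 is the default).
# ignore_index=True renumbers the rows 0..3 instead of repeating 0, 1
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

The result has four rows and the same two columns, Name and Age. Without `ignore_index=True`, the row labels of each input would be kept, giving 0, 1, 0, 1.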
LSE DS202W (2024/25) – Week 01