ME204 – Data Engineering for the Social World
14 July 2025
Dr Jon Cardoso-Silva
🌐 jonjoncardoso.github.io
📧 J.Cardoso-Silva@lse.ac.uk
Assistant Professor (Education)
Background:
Highlighted research:
GENIAL: Generative AI in Education
📢 I will talk more about our findings shortly when I talk about the AI policy for this course!
Learn more about the GENIAL project
Dr Stuart Bramwell
Guest Teacher
📧 s.bramwell@lse.ac.uk
The LSE Data Science Institute offers a range of undergraduate data science courses. These courses are aimed at students who are NOT primarily on a data science track.
🎯 Focus: Theoretical concepts of data science.
📂 How: Reflections through reading and writing.
(ME204 is inspired by this course!)
🎯 Focus: Fundamental machine learning algorithms.
📂 How: Practical use of ML techniques and metrics.
🎯 Focus: Professional-grade data engineering.
📂 How: Building APIs, web scraping, and NLP pipelines.
I will share a link/QR code to a Mentimeter activity and we will interact with it together.
“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon. Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.), and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”
Source of this definition: Shah, C. (2020). A hands-on introduction to data science. Cambridge University Press.
Emphasis and emojis are of my own making.
knows everything about statistics
able to communicate insights perfectly
fully understands businesses like no one
is a fluent computer programmer
We are all jugglers 🤹
Image from Chapter 1 of Schutt, R., & O’Neil, C. (2013). Doing data science (1st edition). O’Reilly Media.
⚠️ In practice, the process is not linear, and many feedback loops exist.
People in the field like to joke that 80% of the time and effort spent on a data science project goes to the tasks highlighted in the diagram above.
And this is what this course is about! You will learn some of the most common tools used during this data wrangling process.
The meme is real:
The struggle is real.
by u/ali_azg in r/dataengineering
Preparing data for analysis doesn’t get as much attention as algorithms (what people usually think of when they hear the term data science), but it’s an ESSENTIAL skill if you want to work with data.
In industry, data engineers are the ones who are responsible for the data pipeline.
👉 You won’t be qualified for these jobs after this course just yet, but if you focus on understanding the why behind the things we will teach you, at the end you will be thinking like a data engineer.
this year) inspired by it using a procedural art script in Python.
Let’s look at the course syllabus:
📔 Syllabus on Moodle.
By the way, everything you need is on Moodle but we also have a public-facing website, so you can refer to it even after the course is over:
25% | Individual | ✍️ Midterm Project: Is London really all that rainy? | Due: 22 July 2025
75% | Individual | 📦 Final Project (bring your own data) | Due: 01 August 2025
Important
This course is designed with students who have some basic coding experience in mind.
I recommended the free online book 📗 Automate the Boring Stuff as a pre-course reading.
💆♂️ HOWEVER: coding beginners have done really well in the previous ME204 iterations (and in my other regular course, DS105)! As long as you are aware that you will need to put in a bit more effort than others, you will be totally fine.
This book, written by Al Sweigart, is a great resource for beginners. Find it on: https://automatetheboringstuff.com/.
You have some basic coding experience to the following level:
Basic data types (str, int, float, bool)
Boolean operators (and, or, not)
Conditionals (if, elif, else)
Loops (for and while)
Functions (def)
Lists ([]), dictionaries ({}) and sets ({})
Coding novices, check out the pre-course reading: 📗 Automate the Boring Stuff. You might need to refer to it here and there during the course.
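If you want a quick self-check against the list above, here is a minimal sketch that touches each item once. The rainfall numbers and the thresholds inside `classify` are made up purely for illustration; they are not course data. If you can read it and predict what it prints, you are at the expected level.

```python
# Basic data types
city = "London"        # str
rainy_days = 12        # int
avg_rain_mm = 1.8      # float
is_summer = True       # bool

# Boolean operators: and, or, not
needs_umbrella = rainy_days > 10 or (is_summer and not avg_rain_mm < 1.0)


# A function with conditionals (thresholds are invented for this example)
def classify(rain_mm):
    """Return a rough label for one day's rainfall in millimetres."""
    if rain_mm == 0:
        return "dry"
    elif rain_mm < 5:
        return "drizzle"
    else:
        return "rainy"


daily_rain = [0.0, 2.5, 7.1, 0.0, 4.9]   # list
counts = {}                               # empty dictionary (note: {} is a dict, not a set)

# for loop: count how many days fall under each label
for amount in daily_rain:
    label = classify(amount)
    counts[label] = counts.get(label, 0) + 1

labels_seen = set(counts)                 # set of the distinct labels

# while loop: find the position of the first "rainy" day
first_rainy = 0
while first_rainy < len(daily_rain) and classify(daily_rain[first_rainy]) != "rainy":
    first_rainy += 1

print(city, needs_umbrella, counts, sorted(labels_seen), first_rainy)
```

One detail worth knowing: in Python, `{}` on its own creates an empty dictionary. An empty set is written `set()`, while a non-empty set can use braces, as in `{"dry", "rainy"}`.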
Let me go back to the Mentimeter activity 👉
🤖 LSE’s AI Policy Framework
LSE asks course leaders to adopt one of three positions on generative AI use.
For ME204, I’ve chosen Position 3: Full authorised use of generative AI in assessment which means:
👉 You are also allowed NOT to use AI tools if you prefer.
📑 Our marking strategy:
In this course, we really favour process over output when it comes to learning.
When marking, we will always be searching for evidence of learning. It is important that you explain, at a high-level, the rationale behind your choices for all your key decisions.
Even if you produce incredibly advanced stuff, if it does not engage with the things we have been discussing in lectures and labs (whether because of AI or not), you are unlikely to get a good grade.
Here are a few things that often tell us you have not been interacting much with us and the course material:
If we teach you to use pandas a certain way but you do it differently, you must provide a clear explanation for why your version is superior.
Throughout this course, we will try to help you make use of Generative AI tools in ways that support your learning.
Let’s take 15 minutes to get a coffee and come back refreshed.
When we return:
Before thinking about coding, let’s see what it feels like to think about a workflow to process data for analysis.
This report unites the efforts of the Boston Women’s Workforce Council to understand the wage gap in the city. It is a great model of the kind of data reporting to aspire to in this course.
They communicate key takeaways in a simple, clear and engaging way, and the text accompanying the figures tells us exactly what the author’s point is. The readers are directed to the plot for confirmation of that point.
Boston Women’s Workforce Council. (2023). Driving Wage Equity for Ten Years (Annual Report, p. 51). Boston Women’s Workforce Council.
I have no idea what their raw data actually look like, but just for the sake of our exercise, let’s imagine a hypothetical scenario where they had to deal with:
Typical problems:
Inconsistent formats (for example, some dates in MM/DD/YYYY, others in DD/MM/YYYY and others in YYYY-MM-DD). The same could be said, perhaps, for salary information, where some values will be in the format £100,000 and others 100000, or where some will even be calculated per hour and others per year.
Time: 12 minutes
Task: Form groups of 3-4 and produce a flowchart of the data cleaning and data processing steps you imagine you would need to take before you could arrive at the charts in the report. We know you have not been taught anything formal yet, so don’t worry too much about the technical details. Think about it in terms of the big steps, like “write a script to put all dates in the same format”, and the order in which you would do them.
Format: Either draw on paper, use a tool like draw.io or even Google Slides or PowerPoint.
Share-out: At the end, take a screenshot of your flowchart and share it on the #social channel on Slack.
Then, a 🗣️ CLASSROOM DISCUSSION will follow.
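For after the discussion: one of the big steps mentioned in the task, “write a script to put all dates in the same format”, could look roughly like the sketch below. It is only an illustration under stated assumptions, not the Council's actual process. The records are invented, the function names `to_iso_date` and `to_annual_salary` are hypothetical, and the hourly-to-annual conversion assumes a 40-hour week over 52 weeks.

```python
from datetime import datetime

# Assumption for illustration only: a full-time year is 40 hours x 52 weeks.
HOURS_PER_YEAR = 40 * 52


def to_iso_date(raw, source_format):
    """Convert a date string in a known source format to YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), source_format).strftime("%Y-%m-%d")


def to_annual_salary(raw, per="year"):
    """Strip the pound sign and thousands separators, then annualise."""
    amount = float(raw.replace("£", "").replace(",", "").strip())
    if per == "hour":
        return amount * HOURS_PER_YEAR
    if per == "year":
        return amount
    raise ValueError(f"Unknown pay period: {per!r}")


# Invented records, one per imaginary employer, each with its own conventions
records = [
    {"date": "07/14/2025", "date_format": "%m/%d/%Y", "salary": "£100,000", "per": "year"},
    {"date": "14/07/2025", "date_format": "%d/%m/%Y", "salary": "100000", "per": "year"},
    {"date": "2025-07-14", "date_format": "%Y-%m-%d", "salary": "£25.50", "per": "hour"},
]

# Apply both cleaning steps to every record
cleaned = [
    {
        "date": to_iso_date(r["date"], r["date_format"]),
        "annual_salary": to_annual_salary(r["salary"], r["per"]),
    }
    for r in records
]

for row in cleaned:
    print(row)
```

Notice that the date format is passed in explicitly for each source rather than guessed. A value like 03/04/2025 is valid in both MM/DD/YYYY and DD/MM/YYYY, so the script cannot tell them apart on its own; someone has to record which convention each data provider used.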
In case you are curious about how equivalent the two languages are, here is a comparison of the most common data science packages in each language.
Python
GitHub!
Use GitHub for everything related to your project!
Important
Don’t share code via e-mail, Dropbox, Google Drive, or anything like that!
It is a bad practice. Things get messy very quickly.
We have a cloud environment called Nuvolos that has been set up for you to use during these 3 weeks of ME204.
Before we break for lunch:
Your labs happen either from 2.00pm - 3.30pm or from 3.30pm - 5.00pm. Check the timetable information you received from the Summer School Office.
LSE Summer School 2025 | ME204 Week 01 Day 01