🗓️ Week 01 | Day 01
Foundations of Data Wrangling with Python

ME204 – Data Engineering for the Social World

Dr Jon Cardoso-Silva
Assistant Professor (Education)

14 July 2025

Today’s Goals:

Welcome to LSE Summer School 2025!
- Introduce myself and Dr Stuart Bramwell
- Get to know you
- Set the stage for the course
Understand: what is data engineering (out there in the world)?
Discover the foundational skills for data engineering that we will cover in this course.
Learn our expectations about your coding background

Who we are 10:05 – 10:20

Meet your lecturer
Meet your class teacher
Hear about the LSE Data Science Institute and our regular courses

Your lecturer

Photo of Jon Cardoso-Silva

Dr Jon Cardoso-Silva

🌐 jonjoncardoso.github.io
📧 J.Cardoso-Silva@lse.ac.uk

Assistant Professor (Education)

Background:

PhD in Computer Science (King’s College London)
Industry experience in:
- software development (Java, Python, R, SQL, etc.)
- as a lead data scientist (big projects, team management, etc.)

Highlighted research:

GENIAL project logo

GENIAL: Generative AI in Education

How are generative AI tools (like ChatGPT, Gemini, Claude, etc.) changing the way we learn and code?
Building evidence and practical guidance for using GenAI in data science education.

📢 I will talk more about our findings shortly when I talk about the AI policy for this course!

Who we are: your class teacher

Photo of Dr Stuart Bramwell

Dr Stuart Bramwell

Guest Teacher
📧 s.bramwell@lse.ac.uk

Has been teaching at the LSE Data Science Institute for the past couple of years.
Teaches on DS105 (Jon’s regular course that inspires ME204).
Also teaches in our machine learning course (DS202) and the introductory course to data science (DS101).
Research background: Political scientist (DPhil, Oxford). Co-creator of the WhoGov dataset, which won the Lijphart/Przeworski/Verba data set award. Researches political elites, social identity (gender, class, ethnicity), and democratisation.

The Data Science Institute

This course is offered by the LSE Data Science Institute (DSI).
DSI is the hub for LSE’s interdisciplinary collaboration in data science
👉 Sign up to the DSI newsletter. Even when you go back home, you can keep up with the latest news and events from the DSI.

Regular DSI Courses 10:27 – 10:32

The LSE Data Science Institute offers a range of undergraduate data science courses. These courses are aimed at students who are NOT primarily on a data science track.

DS101: Fundamentals of Data Science

🎯 Focus: Theoretical concepts of data science.
📂 How: Reflections through reading and writing.

DS105: Data for Data Scientists

🎯 Focus: Collection and handling of real data.
📂 How: Hands-on coding and a group project.

(ME204 is inspired by this course!)

DS202: Data Science for Social Scientists

🎯 Focus: Fundamental machine learning algorithms.
📂 How: Practical use of ML techniques and metrics.

DS205: Advanced Data Manipulation

🎯 Focus: Professional-grade data engineering.
📂 How: Building APIs, web scraping, and NLP pipelines.

Who are you? 🫵 10:20 – 10:50

Course Structure & Syllabus 10:50 – 11:30

What we will cover in this course
A look at the course syllabus
The two pieces of assessment
Our expectations of you

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon. Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.), and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

Source of this definition: Shah, C. (2020). A hands-on introduction to data science. Cambridge University Press.

Emphasis and emojis are of my own making.

It’s not true that data scientists are these mythical unicorns 🦄

knows everything about statistics

is able to communicate insights perfectly

fully understands businesses like no one

is a fluent computer programmer

In reality…

We are all jugglers 🤹

Everyone brings a different skill set.
We need multi-disciplinary teams.
Good data scientists know a bit of everything.
- They are not fluent in all things
- They understand their strengths and weaknesses
- They know when and where to interface with others

The Data Science Workflow

The Data Science Workflow (of this course)

People in the field like to joke that 80% of the time and effort spent on a data science project goes to the tasks highlighted in the diagram above.

And this is what this course is about! You will learn some of the most common tools used during this data wrangling process.

The meme is real:

The struggle is real.
by u/ali_azg in r/dataengineering

Preparing data for analysis doesn’t get as much attention as algorithms (what people usually think of when they hear the term data science), but it’s an ESSENTIAL skill if you want to work with data.

A few words on Data Engineering

In industry, data engineers are the ones who are responsible for the data pipeline.

Data Engineer jobs - Part II

The aspects of data engineering we will cover in this course:

Data collection via programming
Best practices in:
- Data formatting (tidy data)
- Software engineering (neat, replicable code)
Data pre-processing (with Python, Pandas and SQL)
Data products (dataviz and interactive dashboards)
Modern reporting (computational notebooks like Jupyter and Quarto Markdown)

Course syllabus

ME204’s 2025 favicon is an upgraded version of the blurry AI-slop one I used last year. This year, I kept the resemblance to the original logo but created a proper vectorised version (an `.svg` file) using a procedural art script in Python.

Let’s look at the course syllabus:

📔 Syllabus on Moodle.

By the way, everything you need is on Moodle but we also have a public-facing website, so you can refer to it even after the course is over:

🌐 ME204’s website

✍️ Assessment Structure

25%	Individual	✍️ Midterm Project: Is London really all that rainy?	Due: 22 July 2025
75%	Individual	📦 Final Project (bring your own data)	Due: 01 August 2025

Our expectations of you

Let us know when you think we have been unclear when teaching a particular topic
You have a good sense of self-direction:
- If you feel behind others: you are not afraid of asking questions
- If you are ahead of others: help those around you to reach the same level as you
Your use of 🤖 GenAI tools is productive and not a substitute for your own learning.

Our expectations of you, coding-wise

Important

This course is designed with students who have some basic coding experience in mind.

I recommended the free online book 📗 Automate the Boring Stuff as a pre-course reading.

💆‍♂️ HOWEVER: coding beginners have done really well in the previous ME204 iterations (and in my other regular course, DS105)! As long as you are aware that you will need to put in a bit more effort than others, you will be totally fine.

Our expectations of you, coding-wise (continued)

You have some basic coding experience to the following level:

know what variables are and how to create them
understand the different data types (str, int, float, bool)
know how to use logical operators (and, or, not)
know how to use conditionals (if, elif, else)
know how to use loops (for and while)
know how to use functions (def)
know how to use lists and dictionaries
you have an intuition about how to use lists ([1, 2]), dictionaries ({"a": 1}) and sets ({1, 2})
(the topic of 💻 Week 01 Day 01 Lab)
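
As a rough self-check, the short sketch below touches each item on the list above. The function name `describe_rainfall` and all the values are made up for illustration; they are not taken from the lab material.

```python
# Hypothetical self-check: every name and value here is invented for illustration.

def describe_rainfall(mm):
    """Return a label for a daily rainfall amount in millimetres."""
    if mm == 0:
        return "dry"
    elif mm < 5:
        return "drizzle"
    else:
        return "rainy"

city = "London"                       # str
days_recorded = 4                     # int
rainfall_mm = [0.0, 2.5, 12.0, 0.0]   # list of float
is_summer = True                      # bool

# A dictionary maps keys to values; here, label -> number of days.
counts = {}
for mm in rainfall_mm:
    label = describe_rainfall(mm)
    counts[label] = counts.get(label, 0) + 1

# A set keeps unique values only. Note that {} on its own creates an
# empty dictionary; an empty set is written set().
unique_labels = set(counts)

# Logical operators combine boolean conditions.
if is_summer and not counts.get("rainy", 0) > 2:
    summary = f"{city}: mostly fine over {days_recorded} days"
else:
    summary = f"{city}: bring an umbrella"

# A while loop repeats until its condition becomes False.
dry_streak = 0
while dry_streak < len(rainfall_mm) and rainfall_mm[dry_streak] == 0:
    dry_streak += 1

print(counts, unique_labels, summary, dry_streak)
```

If you can predict what this prints before running it, you are at the level we expect.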

Use of Generative AI in this course

Let me go back to the Mentimeter activity 👉

🤖 LSE’s AI Policy Framework

LSE asks course leaders to adopt one of three positions on generative AI use.

For ME204, I’ve chosen Position 3: Full authorised use of generative AI in assessment, which means:

You can use any AI tools you want
- You might have access to LSE’s Claude for Education.
You are allowed to use AI tools in your assignments, too.
We will show productive examples of AI use in this course.

👉 You are also allowed NOT to use AI tools if you prefer.

📑 Our marking strategy:

In this course, we really favour process over output when it comes to learning.

When marking, we will always be searching for evidence of learning. It is important that you explain, at a high-level, the rationale behind your choices for all your key decisions.

Even if you produce incredibly advanced stuff, if it does not engage with the things we have been discussing in lectures and labs (whether because of AI or not), you are unlikely to get a good grade.

🚨 Signs of Under-Engagement

Here are a few things that often tell us you have not been interacting much with us and the course material:

Code Style Mismatches:

Using different libraries than we teach.
This is oddly very common! If we teach how to use pandas a certain way but you do it differently, you must provide a clear explanation for why your version is clearly superior.
Very long and convoluted code.
We strive for code that is concise, elegantly efficient yet easy to understand.
Logic that doesn’t match our course approach

Writing Style Issues:

Long, repetitive, self-aggrandising accounts
The sort of verbose writing AI favours
Missing your authentic voice

What We Want to See

Purposeful Code:

Concise, focused solutions
Clear understanding of what you need
Evidence you’ve thought through the problem

Authentic Writing:

Your voice coming through
Concise and purposeful explanations
Use AI to polish ideas, not replace thinking

This MIT Media Lab research suggests that relying too much on AI can lead to ‘cognitive offloading’. I find that this type of brain rot is particularly pronounced when you are dealing with a topic you are not familiar with.

🍵 Coffee Break 11:30 – 11:45

Let’s take 15 minutes to get a coffee and come back refreshed.

When we return:

What does it feel like to plan a data processing pipeline?
Which tools will we use in this course?
What to expect from the afternoon lab and the rest of the week?

A group activity: map the data workflow 11:45 – 12:30

Before thinking about coding, let’s see what it feels like to think through a workflow to process data for analysis.

Case Study: Boston's Gender Wage Gap

This report unites the efforts of the Boston Women’s Workforce Council to understand the wage gap in the city. It is a great model of data reporting to aspire to in this course.

They communicate key takeaways in a simple, clear and engaging way, and the text accompanying the figures tells us exactly what the author’s point is. The readers are directed to the plot for confirmation of that point.

Let’s emulate their process

I have no idea what their raw data actually look like, but just for the sake of our exercise, let’s imagine a hypothetical scenario where they had to deal with:

The Starting Point: Over 250 separate Excel files from different companies, containing in total 156,000 anonymised employee records. The spreadsheets may have similar, comparable data but each might have a different style and structure.
External benchmark: They did analysis within their data but also compared their wage gap results to those reported by the Current Population Survey (CPS) compiled by the U.S. Bureau of Labor Statistics.
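
To make the starting point concrete, here is a minimal sketch of how differently structured spreadsheets could be brought into one table with pandas. It assumes two tiny in-memory tables standing in for the Excel files (in practice each would be loaded with `pd.read_excel`). The column names, the `COLUMN_MAP` lookup and the `harmonise` helper are all hypothetical.

```python
import pandas as pd

# Two tiny stand-ins for company spreadsheets. The column names and values
# are invented; real files would be loaded with pd.read_excel(path).
company_a = pd.DataFrame({"Job Title": ["Software Engineer"], "Salary": [100000]})
company_b = pd.DataFrame({"role": ["Analyst"], "annual_pay": [65000]})

# Hypothetical mapping from each file's column names to one shared schema.
COLUMN_MAP = {
    "Job Title": "job_title",
    "role": "job_title",
    "Salary": "salary",
    "annual_pay": "salary",
}

def harmonise(df, source):
    """Rename columns to the shared schema and record where the rows came from."""
    return df.rename(columns=COLUMN_MAP).assign(source=source)

frames = [harmonise(company_a, "company_a"), harmonise(company_b, "company_b")]

# Stack all tables on top of each other, with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping a `source` column is a design choice: it lets you trace any odd value back to the file it came from.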

Typical problems:

Inconsistent Job Titles: “Software Engineer”, “Software Dev”, “Eng - Software”. All needed to be standardised somehow.
Missing Data: It’s common in all data sets for some records to be missing some information. Perhaps some employees asked to remove their demographic information from the data set, for example.
Varied Formats: Dates may well be formatted differently across files (some as MM/DD/YYYY, others as DD/MM/YYYY, and others as YYYY-MM-DD). The same could be true of salary information: some values might appear as £100,000 and others as 100000, and some might be recorded per hour while others are per year.
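
The three problems above could be tackled roughly as follows. This is only a sketch on invented records. It assumes we know which date format each source file uses (stored here in a hypothetical `date_format` column, because a date like 03/04/2021 cannot be disambiguated otherwise) and that hourly pay can be annualised at 40 hours a week for 52 weeks. The `TITLE_MAP` lookup is also hypothetical.

```python
import pandas as pd

# Invented records showing the three problems described above.
raw = pd.DataFrame({
    "job_title": ["Software Engineer", "Software Dev", "Eng - Software", None],
    "start_date": ["03/15/2021", "15/03/2021", "2021-03-15", "03/04/2021"],
    # Assumption: the date format of each source file is known in advance.
    "date_format": ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d", "%d/%m/%Y"],
    "salary": ["£100,000", "100000", "£85,500", "48.50"],
    "pay_period": ["year", "year", "year", "hour"],
})

# Hypothetical lookup table; a real one would need careful human review.
TITLE_MAP = {
    "software engineer": "Software Engineer",
    "software dev": "Software Engineer",
    "eng - software": "Software Engineer",
}

clean = raw.copy()

# 1. Standardise job titles; missing or unknown titles stay missing (NaN).
clean["job_title"] = clean["job_title"].str.strip().str.lower().map(TITLE_MAP)

# 2. Parse each date with the format declared for its source file.
clean["start_date"] = [
    pd.to_datetime(value, format=fmt)
    for value, fmt in zip(raw["start_date"], raw["date_format"])
]

# 3. Strip currency symbols and commas, then put every salary on a yearly basis.
amount = clean["salary"].str.replace("[£,]", "", regex=True).astype(float)
HOURS_PER_YEAR = 40 * 52  # assumption: full-time, 40 hours a week
clean["annual_salary"] = amount.where(
    clean["pay_period"] == "year",  # keep yearly amounts as they are
    amount * HOURS_PER_YEAR,        # otherwise convert hourly pay to yearly
)

print(clean[["job_title", "start_date", "annual_salary"]])
```

Note that the missing job title is left missing instead of being guessed; deciding what to do with such records is a separate step in the workflow.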

Student Activity: Map the Journey

Time: 12 minutes

Task: Form groups of 3-4 and produce a flowchart of the data cleaning and data processing steps you imagine you would need to take before you could arrive at the charts in the report. We know you have not been taught anything formal yet, so don’t worry too much about the technical details. Think in terms of the big steps, like “write a script to put all dates in the same format”, and the order in which you would do them.
Format: Either draw on paper, use a tool like draw.io or even Google Slides or PowerPoint.
Share-out: At the end, take a screenshot of your flowchart and share it on the #social channel on Slack.
Then, a 🗣️ CLASSROOM DISCUSSION will follow.

Our Toolbox 12:30 – 13:00

Python vs R ??
- You can do pretty much the same in both languages
- Python is more widely adopted in industry, whereas R is more popular in academia
I used to teach this course in R (I 🫶 R) but by popular demand of previous cohorts, I have switched to Python.

Python vs R

In case you are curious about how equivalent the two languages are, here is a comparison of the most common data science packages in each language.

Python

Pull data from webpages & APIs:
- selenium, scrapy, requests
Reshape data
- pandas, numpy, scipy
Plotting data
- matplotlib, lets-plot, seaborn, plotly
Share & Report
- GitHub Markdown
- Jupyter

R

Pull data from webpages & APIs:
- RSelenium, httr, rvest
Reshape data
- tidyverse (all packages)
Plotting data
- ggplot2, leaflet
Share & Report
- GitHub Markdown
- RMarkdown, knitr, Quarto

Let’s get started with Nuvolos

We have a cloud environment called Nuvolos that has been set up for you to use during these 3 weeks of ME204.

Questions & Next Steps 12:40 – 13:00

Before we break for lunch:

Questions about the course? Now’s the time to ask!
Technical concerns? We’ll address them in the lab
Excited about a particular project idea? Share it with us!

Afternoon Labs

Your labs happen either from 2.00pm - 3.30pm or from 3.30pm - 5.00pm. Check the timetable information you received from the Summer School Office.

You can use the computers in the lab classrooms
OR you can use your own laptop

Thanks for coming! THE END

LSE Summer School 2025 | ME204 Week 01 Day 01