ME204 – Data Engineering for the Social World
15 July 2025
requests library to demonstrate), and to the concept of DataFrames (using the pandas library).
One interesting question from yesterday’s lecture:
“could we get an AI to do the whole data processing and data analysis for us?”
Let’s put it to the test 👉
I came up with this exploratory question which we could subject to data analysis:
How has the frequency of heatwaves in the UK changed over the last 10 years?
Let’s build this together.
I will use Claude to help me with this, just because it’s the one we have access to at LSE (you can try to mirror what I’m doing, live, with any other AI of your choice).
Would the following prompt work? Why or why not?
Create a full visual data science analysis report on the frequency of heatwaves in the UK over the last 10 years.
Let’s try it, think about the output and then tweak it. We will force it to use a specific data source and a specific definition of a “heatwave” and see how it changes the output.
Our Goal: To critically evaluate the output. Do I trust the criteria used to define a “heatwave”? Can I trust the data? Can I easily reproduce the analysis if I need to?
Based on my tests ahead of the live demo, the results may vary when I repeat the prompt live, but I think we will always find that, for a reliable and reproducible answer, we need to be in control.
This is why we need to learn what’s “under the hood”.
Before we discuss how to collect data, let’s go even deeper into the computer’s memory.
Numbers, text, images, and sounds are all stored as sequences of 0s and 1s in your computer’s memory. Each 0 or 1 is called a bit.
Think of a bit as a tiny box:
\[ \require{color} \fcolorbox{black}{white}{$\phantom{0}$} \phantom{\leftarrow \text{a bit can have a value of $0$}} \]
\[ \require{color} \begin{array}{ccc} \fcolorbox{black}{#eeeeee}{0} & \leftarrow & \text{a bit can have a value of $0$} \end{array} \]
\[ \begin{array}{ccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \leftarrow & \text{OR it can have a value of $1$} \end{array} \]
but nothing else!
bool
For everything that has a ‘Yes’ or ‘No’ answer, we can use a single bit.
\[ \textcolor{#9753b8}{\texttt{is_it_raining}} = \begin{cases} \fcolorbox{black}{#eeeeee}{$\textcolor{black}{0}$} & \text{if it is not raining} \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \text{if it is raining} \end{cases} \]
In Python:
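The snippet that followed on the slide is not available here. As a minimal sketch, assuming the variable name is_it_raining from the formula above, a yes/no value in Python is a bool:

```python
# A yes/no answer is stored as a bool: either True or False
is_it_raining = True

print(type(is_it_raining))   # <class 'bool'>

# Under the hood, True behaves like 1 and False like 0
print(int(is_it_raining))    # 1
print(int(False))            # 0
```

Conceptually this is a single bit of information, even though a Python bool object takes up more than one bit in memory.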
Suppose we want to represent positive numbers (0 included). We can’t do that with just a single bit!
With \(2\) bits, we can represent \(4\) different numbers:
\[\begin{array}{ccc} \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 0 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 1 \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 2 \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 3 \\ \end{array}\]
With \(3\) bits, I can represent double the amount of numbers: \(8\)
\[\begin{array}{ccccccccc} \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 0 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 4 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 1 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 5 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 2 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 6 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 3 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 7 \\ \end{array}\]
Here is another way of looking at it:
\[ \begin{array}{cccccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^3 & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^2 & \fcolorbox{black}{#eeeeee}{0} \times 2^1 & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^0 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 8 & 4 & 0 & 1 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 8 & +\quad4 & +\quad0 & +\quad1 & = & 13 \end{array} \]
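The same arithmetic can be checked in Python. This is only a sketch to mirror the diagram above (the variable names are mine): it counts how many values fit in a given number of bits, then adds up the positional weights of 1101.

```python
# Each extra bit doubles how many different values we can represent
for n_bits in (1, 2, 3, 8):
    print(n_bits, "bits ->", 2 ** n_bits, "values")

# Positional weights: the bits 1101 mean 1*8 + 1*4 + 0*2 + 1*1
bits = "1101"
total = 0
for position, bit in enumerate(reversed(bits)):
    total += int(bit) * 2 ** position
print(total)            # 13

# Python's built-ins do the same conversion in both directions
print(int("1101", 2))   # 13
print(bin(13))          # 0b1101
```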
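A simple way to handle negative numbers is to reserve the first bit to represent the sign of the number (real hardware uses a closely related scheme called two’s complement, but the idea of spending one bit on the sign is the same):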
\[ \begin{array}{c|ccccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \textcolor{red}{-} & 4 & 0 & 1 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \textcolor{red}{-} & 4 & +\quad0 & +\quad1 & = & -5 \\ \textcolor{red}{sign} & \text{value} & & & & \\ \end{array} \]
💡 You won’t need to do this type of calculation manually in this course!
However, understanding the theory behind binary numbers is crucial for using numpy and pandas efficiently: choosing the right data type (int32, float64) saves memory and makes your code faster.
Enjoy the simplicity of Python but remember that understanding what’s under the hood matters!
Decimal numbers are represented using the floating-point (float) data type.
\[ \textcolor{#9753b8}{\texttt{pi}} = 3.14159 \]
It’s conceptually similar: a certain number of bits is used to store the number, split between a sign, an exponent and a significand (the significant digits).
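To make the split concrete, here is a small sketch that prints the 64 bits of a Python float. It assumes the standard IEEE 754 layout that Python floats use (1 sign bit, 11 exponent bits, 52 fraction bits); the helper name float_bits is made up for this example.

```python
import struct


def float_bits(x: float) -> str:
    """Return the 64 bits of a Python float as a string of 0s and 1s."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{as_int:064b}"


bits = float_bits(3.14159)
sign, exponent, fraction = bits[0], bits[1:12], bits[12:]
print("sign:    ", sign)
print("exponent:", exponent)
print("fraction:", fraction)

# Not every decimal number fits exactly in a finite number of bits
print(0.1 + 0.2 == 0.3)   # False
```

The last line is the practical consequence: floats are approximations, so comparing them with == can surprise you.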
Text is stored by mapping each character to a number. Unicode is the modern standard that assigns these numbers and covers characters from most of the world’s languages, and even emojis! UTF-8 is the most common way of storing those numbers as bytes.
‘A’ is 65
‘a’ is 97
‘😊’ is 128522
Each of those numbers is then stored in binary, just like we saw before.
Browse the full UTF-8 table if you’re interested.
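You can look these numbers up yourself. A short sketch using only built-in functions: ord gives the number assigned to a character, and encode("utf-8") gives the bytes that actually get stored.

```python
for character in ["A", "a", "😊"]:
    code_point = ord(character)              # the number assigned to the character
    utf8_bytes = character.encode("utf-8")   # the bytes actually stored
    bits = " ".join(f"{byte:08b}" for byte in utf8_bytes)
    print(character, code_point, len(utf8_bytes), "byte(s):", bits)

# chr goes the other way: from number back to character
print(chr(65))   # A
```

Note that the emoji needs four bytes in UTF-8, while ‘A’ and ‘a’ need only one each.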
Think about this code, which you might have seen yesterday.
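The original snippet is not reproduced here. What follows is my guess at the kind of code meant, assuming three parallel lists whose names (dates, max_temps, conditions) and values I have borrowed from the CSV example later in these notes:

```python
# Hypothetical reconstruction: three parallel lists, one per "column"
dates = ["2025-07-14", "2025-07-15"]
max_temps = [22, 25]
conditions = ["Cloudy", "Sunny"]

# Nothing stops us from appending a value of the wrong type...
max_temps.append("twenty-one")

# ...or from letting the lists drift out of sync with each other
print(len(dates), len(max_temps), len(conditions))   # 2 3 2
```

Python accepts all of this without complaint, which is exactly the problem.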
max_temps, leading to all sorts of errors down the line.
Enter numpy and pandas
numpy is the foundational library for numerical computing in Python. It provides an efficient* array object that stores data of the same type, using the binary representations we just discussed. This makes it incredibly fast.
pandas is built on top of numpy. It provides the DataFrame, a two-dimensional table that can hold columns of different types.
*: numpy runs its calculations in pre-compiled C code and uses other optimisations to make them even faster.
A numpy array is a grid of values, all of the same type.
import numpy as np
# A regular Python list
python_list = [22, 25, 21, 19]
# A numpy array where each value
# is stored as a single byte (8 bits)
# 8 bits can represent only 256 different values
numpy_array = np.array(python_list, dtype=np.int8)
# repr() gives the array(...) form; a plain print() would show [22 25 21 19]
print(repr(numpy_array))
Output: array([22, 25, 21, 19], dtype=int8)
The formatting of the output is just an explicit signal to the reader that this is not a simple list, but a numpy array.
Whereas a list stores pointers to scattered objects, a numpy array stores this data as a single, contiguous block of integers in memory.
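A quick way to see the difference is to ask numpy how many bytes an array occupies. This is a sketch, not a benchmark; the variable names as_int8 and as_int64 are mine.

```python
import sys

import numpy as np

python_list = [22, 25, 21, 19]
as_int8 = np.array(python_list, dtype=np.int8)
as_int64 = np.array(python_list, dtype=np.int64)

# itemsize = bytes per value, nbytes = bytes for all the values together
print(as_int8.itemsize, as_int8.nbytes)     # 1 4
print(as_int64.itemsize, as_int64.nbytes)   # 8 32

# For comparison: the size in bytes of a single Python int object
print(sys.getsizeof(22))
```

The same four temperatures take 4 bytes as int8 and 32 bytes as int64, and a single Python int object is larger still.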
Choosing the right dtype is a trade-off between memory and precision. Using a smaller dtype can save a lot of memory, which is crucial for large datasets.
Data Type | Signed/Unsigned | Bits | Range (Approximate) |
---|---|---|---|
np.int8 | Signed | 8 | -128 to 127 |
np.uint8 | Unsigned | 8 | 0 to 255 |
np.int16 | Signed | 16 | -32,768 to 32,767 |
np.uint16 | Unsigned | 16 | 0 to 65,535 |
np.int32 | Signed | 32 | -2,147,483,648 to 2,147,483,647 |
np.uint32 | Unsigned | 32 | 0 to 4,294,967,295 |
np.int64 | Signed | 64 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
np.uint64 | Unsigned | 64 | 0 to 18,446,744,073,709,551,615 |
np.float16 | Signed | 16 | -65,504 to 65,504 (~3 digits precision) |
np.float32 | Signed | 32 | -3.4e+38 to 3.4e+38 (~7 digits precision) |
np.float64 | Signed | 64 | -1.8e+308 to 1.8e+308 (~16 digits precision) |
This is why choosing the right data type matters: it impacts memory efficiency and helps you avoid overflow errors (when a number is too large for the type to hold).
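Here is a small sketch of what overflow looks like in practice. np.iinfo reports the limits of an integer type; the array temps and its values are made up for illustration.

```python
import numpy as np

# Ask numpy for the limits of a few integer types
for dtype in (np.int8, np.uint8, np.int16):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# 127 is the largest value an int8 can hold
temps = np.array([126, 127], dtype=np.int8)
wrapped = temps + 1   # 128 does not fit in int8

# The second value silently wraps around to -128
print(wrapped)
```

Notice that numpy does not raise an error here: the value quietly wraps around, which is why it pays to pick a dtype with enough room.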
A pandas DataFrame is essentially a dictionary where the keys are column names and the values are numpy arrays (wrapped inside a pandas object called a Series).
We can convert our messy lists into a clean, structured DataFrame with pandas.
Code:
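The slide’s code is not available here, so this is a minimal sketch. It assumes three lists named dates, max_temps and conditions, with column names and values borrowed from the CSV example further down.

```python
import pandas as pd

# Assumed input: one list per column
dates = ["2025-07-14", "2025-07-15"]
max_temps = [22, 25]
conditions = ["Cloudy", "Sunny"]

# A dictionary of column name -> values becomes a DataFrame
df = pd.DataFrame({
    "date": dates,
    "max_temp": max_temps,
    "condition": conditions,
})

print(df)
print(df.dtypes)              # one data type per column
print(type(df["max_temp"]))   # each column is a pandas Series
```

Unlike separate lists, pandas refuses to build a DataFrame from lists of different lengths (it raises a ValueError), so the columns cannot drift out of sync.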
DataFrames are brilliant for working with data. One of the most common file formats, and one that mirrors the structure of a DataFrame very closely, is the Comma-Separated Values (CSV) file.
DataFrames can be easily saved to a CSV file:
If you open a CSV file in a text editor, you’ll see that it’s just a plain text file with a few commas separating the columns:
date,max_temp,condition
2025-07-14,22,Cloudy
2025-07-15,25,Sunny
And loaded from a CSV file:
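The save and load steps might look like the sketch below. The file name weather.csv and the temporary folder are my own choices for the example; to_csv and read_csv are the pandas functions doing the work.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "date": ["2025-07-14", "2025-07-15"],
    "max_temp": [22, 25],
    "condition": ["Cloudy", "Sunny"],
})

# Hypothetical location: a throwaway folder so the example leaves no mess
csv_path = Path(tempfile.mkdtemp()) / "weather.csv"

# index=False leaves out the row labels (0, 1, ...) so only our columns are saved
df.to_csv(csv_path, index=False)
print(csv_path.read_text())

# Load it back into a new DataFrame
df_loaded = pd.read_csv(csv_path)
print(df_loaded)
```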
Whenever you need to save data to continue to work with it in pandas later, use the to_csv method. pandas is set up to work with tabular data and it comes with many powerful functions to help us work with it.
Data doesn’t always come in a tabular format. Sometimes you will inevitably have to work with JSON data.
Somewhat formally:
JSON stands for JavaScript Object Notation. It’s a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.
But I like to think about it as a mixture of lists and dictionaries.
For example, this is a JSON object:
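The example from the slide is not available here. As an illustrative stand-in, here is a made-up JSON object (the "city" key is hypothetical; the temperatures are borrowed from later in these notes) and how Python’s built-in json module turns it into dictionaries and lists:

```python
import json

# A JSON object written as text: dictionaries ({}) and lists ([]) nested together
json_text = """
{
  "city": "London",
  "daily": {
    "time": ["2025-07-14", "2025-07-15"],
    "temperature_2m_max": [24.6, 22.1]
  }
}
"""

record = json.loads(json_text)

print(type(record))                               # <class 'dict'>
print(type(record["daily"]["time"]))              # <class 'list'>
print(record["daily"]["temperature_2m_max"][0])   # 24.6
```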
Let’s browse the Open-Meteo website: https://open-meteo.com/
API stands for Application Programming Interface.
requests library
In Python, we can use the requests library to make HTTP requests to APIs.
It does not come pre-installed with Python, so we need to install it first, for example by running pip install requests in a terminal.
Then, after importing it, we can use it to make a request to the API:
import requests
base_url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 51.5085,
    "longitude": -0.1257,
    "daily": "temperature_2m_max,temperature_2m_min,weather_code",
    "forecast_days": 3
}
response = requests.get(base_url, params=params)
# Check the response status code
print(response.status_code) # 200 means success!
# Get the content in JSON format
data = response.json()
(I will show you some live examples during the lecture)
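It can help to see what the params dictionary turns into. The sketch below builds the full URL by hand with Python’s standard library, without contacting the API. I am assuming this closely mirrors what requests does for us behind the scenes.

```python
from urllib.parse import urlencode

base_url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 51.5085,
    "longitude": -0.1257,
    "daily": "temperature_2m_max,temperature_2m_min,weather_code",
    "forecast_days": 3,
}

# Each key=value pair is joined with & and attached after a ?
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

You can paste the printed URL into a browser to see the raw JSON that the API sends back.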
Assuming the request was successful, you will get something like this:
Which continues to:
To create a meaningful DataFrame, I would need to first navigate to the relevant part of the JSON object (the daily part).
Which will render like a dictionary:
Which I can then convert to a DataFrame:
Rendering this on the console will look like this:
time [2025-07-14, 2025-07-15, 2025-07-16]
temperature_2m_max [24.6, 22.1, 24.8]
temperature_2m_min [17.4, 15.1, 15.9]
weather_code [80, 95, 3]
Name: daily, dtype: object
Or, on Jupyter, it will look like a proper table:
time | temperature_2m_max | temperature_2m_min | weather_code |
---|---|---|---|
2025-07-14 | 24.6 | 17.4 | 80 |
2025-07-15 | 22.1 | 15.1 | 95 |
2025-07-16 | 24.8 | 15.9 | 3 |
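The navigation and conversion steps above could be sketched as follows. To keep the example runnable offline, data is a trimmed, hard-coded stand-in for response.json() that holds only the daily part, with the values from the table above; a real response contains more keys.

```python
import pandas as pd

# Stand-in for `data = response.json()` (trimmed to the part we need)
data = {
    "daily": {
        "time": ["2025-07-14", "2025-07-15", "2025-07-16"],
        "temperature_2m_max": [24.6, 22.1, 24.8],
        "temperature_2m_min": [17.4, 15.1, 15.9],
        "weather_code": [80, 95, 3],
    }
}

# Step 1: navigate to the relevant part of the JSON object
daily = data["daily"]

# Step 2: a dictionary of equal-length lists maps directly onto columns
df = pd.DataFrame(daily)
print(df)
```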
Whenever you want to get data from an API, you will need to follow these steps:
GET request: Use the requests library to ask for the data.
You have now unlocked the power to acquire and structure your data in a suitable format for analysis.
Let’s take 15 minutes to get a coffee and come back refreshed.
When we return:
LSE Summer School 2025 | ME204 Week 01 Day 02