🗓️ Week 01 | Day 02
Working with Data Types and APIs

ME204 – Data Engineering for the Social World

Dr Jon Cardoso-Silva

15 July 2025

Recap & Today’s Agenda

  • Yesterday: We mapped a data workflow and thought about data pipelines at a conceptual, high level. In the afternoon, you familiarised yourself with the Nuvolos environment and started running some Python code.
  • Today: We start to look at the technical details of how to handle data effectively in a computer. It will be our formal introduction to the concept of an API (using the requests library to demonstrate), and to the concept of DataFrames (using the pandas library).

Can’t an AI do the analysis for us? 10:05 – 10:30

One interesting question from yesterday’s lecture:

“could we get an AI to do the whole data processing and data analysis for us?”

Let’s put it to the test 👉

The Data Analysis Challenge

London has seen its third heatwave of the summer so far.

It’s likely that temperatures of 40°C (in Fahrenheit: 104°F) will become the new normal in the UK.

I came up with this exploratory question which we could subject to data analysis:

How has the frequency of heatwaves in the UK changed over the last 10 years?

An Initial Prompt

Let’s build this together.

  • I will use Claude to help me with this, just because it’s the one we have access to at LSE (you can try to mirror what I’m doing, live, with any other AI of your choice).

  • Would the following prompt work? Why or why not?

Create a full visual data science analysis report on the frequency of heatwaves in the UK over the last 10 years.

Let’s try it, think about the output and then tweak it. We will force it to use a specific data source and a specific definition of a “heatwave” and see how it changes the output.

The “Human in the Loop”

Based on my tests ahead of the live demo, the results may vary when I repeat the prompt live. Still, I think we will always find that, for a reliable, reproducible answer, we need to be in control.

  • We need to choose a trustworthy data source.
  • We need to use a clear, specific, robust definition of a “heatwave”.
  • We need to build a reproducible workflow that anyone can re-run and verify.

This is why we need to learn what’s “under the hood”.

Under the Hood: Data Types 10:30 – 11:00

Before we discuss how to collect data, let’s go even deeper into the computer’s memory.

  • How is data actually stored in a computer?

Computers only understand 0s and 1s

Numbers, text, images, and sounds are all stored as sequences of 0s and 1s in your computer’s memory. Each 0 or 1 is called a bit.

Think of a bit as a tiny box:

\[ \require{color} \fcolorbox{black}{white}{$\phantom{0}$} \phantom{\leftarrow \text{a bit can have a value of $0$}} \]

\[ \require{color} \begin{array}{ccc} \fcolorbox{black}{#eeeeee}{0} & \leftarrow & \text{a bit can have a value of $0$} \end{array} \]

\[ \begin{array}{ccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \leftarrow & \text{OR it can have a value of $1$} \end{array} \]

but nothing else!

Boolean data type (aka bool)

For everything that has a ‘Yes’ or ‘No’ answer, we can use a single bit.

\[ \textcolor{#9753b8}{\texttt{is_it_raining}} = \begin{cases} \fcolorbox{black}{#eeeeee}{$\textcolor{black}{0}$} & \text{if it is not raining} \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \text{if it is raining} \end{cases} \]

In Python:

# if it is raining
is_it_raining = True

# if it is not raining
is_it_raining = False

What about numbers?

Suppose we want to represent positive numbers (0 included). We can’t do that with just a single bit!

With \(2\) bits, we can represent \(4\) different numbers:

\[\begin{array}{ccc} \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 0 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 1 \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 2 \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 3 \\ \end{array}\]

Positive whole numbers

With \(3\) bits, we can represent twice as many numbers: \(8\)

\[\begin{array}{ccccccc} \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 0 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 4 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 1 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 5 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 2 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \rightarrow 6 \\ \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 3 & & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \rightarrow 7 \\ \end{array}\]

Positive whole numbers

Here is another way of looking at it:

\[ \begin{array}{cccccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^3 & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^2 & \fcolorbox{black}{#eeeeee}{0} \times 2^1 & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \times 2^0 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 8 & 4 & 0 & 1 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 8 & +\quad4 & +\quad0 & +\quad1 & = & 13 \end{array} \]
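
The place-value calculation above can be checked in Python. This is only an illustrative sketch: `bits_to_int` is a made-up helper name, while `int(..., 2)` and `format(..., "b")` are Python built-ins.

```python
def bits_to_int(bits: str) -> int:
    """Add up bit * 2**position, counting positions from the right."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total

print(bits_to_int("1101"))   # 8 + 4 + 0 + 1 = 13
print(int("1101", 2))        # Python's built-in conversion agrees: 13
print(format(13, "04b"))     # and back again: 1101
print(2 ** 3)                # distinct values that fit in 3 bits: 8
```

Each extra bit doubles the number of values that can be represented, which is why 2 bits give 4 values and 3 bits give 8.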

But we need negative numbers too!

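A simple way to do this is to reserve the first bit to represent the sign of the number (real hardware uses a refinement of this idea called two’s complement, in which the first bit still tells you the sign):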

\[ \begin{array}{c|ccccc} \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} & \fcolorbox{black}{#eeeeee}{0} & \fcolorbox{black}{#111111}{$\textcolor{white}{1}$} \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \textcolor{red}{-} & 4 & 0 & 1 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ \textcolor{red}{-} & 4 & +\quad0 & +\quad1 & = & -5 \\ \textcolor{red}{sign} & \text{value} & & & & \\ \end{array} \]
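
To make the sign-bit idea concrete, here is a minimal sketch that decodes a bit string the way the diagram does: the first bit is read as the sign and the remaining bits as the value. The function name `decode_sign_magnitude` is hypothetical and exists only for this illustration.

```python
def decode_sign_magnitude(bits: str) -> int:
    """Treat the first bit as the sign and the remaining bits as the value."""
    sign = -1 if bits[0] == "1" else 1
    magnitude = int(bits[1:], 2)
    return sign * magnitude

print(decode_sign_magnitude("1101"))  # -(4 + 0 + 1) = -5
print(decode_sign_magnitude("0101"))  # +(4 + 0 + 1) = 5
print(decode_sign_magnitude("0111"))  # largest value with 4 bits: 7
```

Spending one bit on the sign halves the largest value we can store: with 4 bits the maximum is 7 rather than 15.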

Don’t Worry!

💡 You won’t need to do this type of calculation manually in this course!

However, understanding the theory behind binary numbers is crucial for:

  1. Using libraries like numpy and pandas efficiently:
    • Choosing the correct data types (e.g., int32, float64) saves memory and makes your code faster.
    • It helps you avoid common pitfalls like overflow errors.
  2. Working with real-world datasets:
    • Some data is stored in binary formats.
    • Understanding the fundamentals helps you debug weird data issues.

Enjoy the simplicity of Python but remember that understanding what’s under the hood matters!
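
As an illustration of the overflow pitfall mentioned above, the sketch below stores a value at the top of the int8 range and adds one to it. The variable names and temperature values are made up for the example.

```python
import numpy as np

# int8 can hold whole numbers from -128 to 127
temps = np.array([125, 126, 127], dtype=np.int8)

# Adding 1 pushes the last value past 127, so it silently wraps around
bumped = temps + np.int8(1)
print(bumped)   # the last element is now -128, not 128

# A wider dtype has room for the result
safe = temps.astype(np.int16) + np.int16(1)
print(safe)     # the last element is 128
```

No error is raised in the first case, which is exactly why this kind of bug is hard to spot in a large dataset.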

What if I need a decimal number?

Decimal numbers are represented using the floating-point (float) data type.

\[ \textcolor{#9753b8}{\texttt{pi}} = 3.14159 \]

It’s conceptually similar: a certain number of bits are used to store the number, split between:

  • the sign,
  • the main digits (mantissa), and
  • the position of the decimal point (exponent).
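
If you are curious, the three parts can be inspected from Python. The sketch below assumes the standard 64-bit layout used by Python's float (1 sign bit, 11 exponent bits, 52 mantissa bits); `float_bits` is a hypothetical helper written for this illustration.

```python
import struct

def float_bits(x: float) -> tuple[str, str, str]:
    """Split a 64-bit float into its sign, exponent and mantissa bits."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(as_int, "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = float_bits(3.14159)
print(sign)            # 0, because the number is positive
print(len(exponent))   # 11 bits for the exponent
print(len(mantissa))   # 52 bits for the mantissa

# A fixed number of bits means limited precision:
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
```

The last two lines show a practical consequence: some decimal numbers cannot be stored exactly, so comparing floats with `==` can surprise you.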

And what about text?

Text is stored by mapping each character to a number. Unicode is the modern standard that assigns a number to characters from most of the world’s languages, and even emojis! UTF-8 is the most common way of storing those numbers as bytes.

  • The letter ‘A’ is 65
  • The letter ‘a’ is 97
  • The emoji ‘😊’ is 128522

Each of those numbers is then stored in binary, just like we saw before.
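
The character-to-number mapping can be explored with Python's built-in `ord` and `chr`. This short sketch also shows how many bytes each character takes when encoded as UTF-8.

```python
for char in ["A", "a", "😊"]:
    code_point = ord(char)              # character -> number
    utf8_bytes = char.encode("utf-8")   # number -> bytes as stored on disk
    print(char, code_point, len(utf8_bytes))

print(chr(65))             # number -> character: A
print(format(65, "08b"))   # 65 written as 8 bits: 01000001
```

The letters take one byte each, while the emoji needs four bytes.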

The Problem with Separate Lists

Think about this code, which you might have seen yesterday.

# From yesterday's lab
dates = ['2025-07-14', '2025-07-15', ...]
max_temps = [22, 25, ...]
conditions = ['Cloudy', 'Sunny', ...]

# To get info for the first day...
msg = f"Date: {dates[0]}"
msg += f", Temp: {max_temps[0]}"
msg += f", Condition: {conditions[0]}"

print(msg)
  • Coordination Problem: The data is disconnected. If you sort one list, but not the others, you’ve corrupted your data.
  • No Type Enforcement: A list can hold anything. You could accidentally put a string in max_temps, leading to all sorts of errors down the line.
  • Inefficient: Python lists are flexible but slow for numerical calculations.
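
The first two problems are easy to reproduce. In the sketch below, the third date and temperature are made-up values added so that sorting has a visible effect.

```python
dates = ["2025-07-14", "2025-07-15", "2025-07-16"]
max_temps = [22, 25, 21]   # third value is made up for illustration

# Sorting only one of the lists breaks the link between them
max_temps.sort()
print(dates[0], max_temps[0])   # 2025-07-14 is now paired with 21, not 22

# Nothing stops a string from sneaking into a list of numbers either
max_temps.append("hot")
print(max_temps)   # [21, 22, 25, 'hot']
```

Python raises no error in either case; the damage only shows up later, when a calculation fails or gives a wrong answer.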

The Solution: numpy and pandas

Enter numpy and pandas

  • numpy is the foundational library for numerical computing in Python. It provides an efficient array object that stores data of the same type, using the binary representations we just discussed. Storing values this way makes numerical operations much faster than with plain Python lists.

  • pandas is built on top of numpy. It provides the DataFrame, a two-dimensional table that can hold columns of different types.

Figure 1: Numpy and Pandas logos.

The NumPy Array: A Grid of Values

A numpy array is a grid of values, all of the same type.

import numpy as np

# A regular Python list
python_list = [22, 25, 21, 19]

# A numpy array where each value
# is stored as a single byte (8 bits)
# 8 bits can represent only 256 different values
numpy_array = np.array(python_list, dtype=np.int8)

# repr() shows the array together with its dtype
print(repr(numpy_array))

Output: array([22, 25, 21, 19], dtype=int8)

The array(...) wrapper and the dtype in the output are an explicit signal to the reader that this is not a simple list, but a numpy array.

Whereas a list stores pointers to scattered objects, a numpy array stores this data as a single, contiguous block of integers in memory.
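
A quick way to see this compact storage is to ask the array how much memory it uses. The sketch below reuses the same four values; `itemsize` and `nbytes` are standard numpy array attributes.

```python
import numpy as np

numpy_array = np.array([22, 25, 21, 19], dtype=np.int8)

print(numpy_array.dtype)      # int8
print(numpy_array.itemsize)   # bytes used per value: 1
print(numpy_array.nbytes)     # bytes used by all four values: 4

# Arithmetic applies to every element at once, with no explicit loop
difference = numpy_array - np.int8(20)
print(difference)             # [ 2  5  1 -1]
```

Because every value has the same size and sits next to its neighbours in memory, numpy can process the whole block in one go.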

Choosing the Right Data Type

Choosing the right dtype is a trade-off between memory and the range or precision of the values you can store. Using a smaller dtype can save a lot of memory, which is crucial for large datasets.

Data Type    Signed/Unsigned  Bits  Range (Approximate)
np.int8      Signed           8     -128 to 127
np.uint8     Unsigned         8     0 to 255
np.int16     Signed           16    -32,768 to 32,767
np.uint16    Unsigned         16    0 to 65,535
np.int32     Signed           32    -2,147,483,648 to 2,147,483,647
np.uint32    Unsigned         32    0 to 4,294,967,295
np.int64     Signed           64    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
np.uint64    Unsigned         64    0 to 18,446,744,073,709,551,615
np.float16   Signed           16    -65,504 to 65,504 (~3 digits precision)
np.float32   Signed           32    -3.4e+38 to 3.4e+38 (~7 digits precision)
np.float64   Signed           64    -1.8e+308 to 1.8e+308 (~16 digits precision)
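
You don't have to memorise this table: numpy can report the limits of each type through `np.iinfo` and `np.finfo`. The sketch below also compares the memory used by one million whole-number temperatures stored in two different dtypes (the array contents are made up).

```python
import numpy as np

# Look up the limits instead of memorising them
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)   # -128 127
print(np.iinfo(np.uint16).max)                        # 65535
print(np.finfo(np.float32).max)                       # roughly 3.4e+38

# One million values stored two ways
temps_64 = np.full(1_000_000, 25, dtype=np.int64)
temps_8 = temps_64.astype(np.int8)

print(temps_64.nbytes)   # 8000000 bytes
print(temps_8.nbytes)    # 1000000 bytes
```

The int8 version is eight times smaller, but it is only safe if every value fits between -128 and 127.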

The DataFrame: A Collection of Arrays

A pandas DataFrame is essentially a dictionary where the keys are column names and the values are numpy arrays (wrapped inside a pandas object called a Series).

We can convert our messy lists into a clean, structured DataFrame with pandas.

Code:

import pandas as pd

# Combine our lists into a dictionary
weather_data = {
    'date': ['2025-07-14', '2025-07-15'],
    'max_temp': [22, 25],
    'condition': ['Cloudy', 'Sunny']
}

# Create a DataFrame
df = pd.DataFrame(weather_data)

print(df)

Output:

         date  max_temp condition
0  2025-07-14        22    Cloudy
1  2025-07-15        25     Sunny
  • Data is now organised, linked, and typed.
  • max_temp will be an integer type (int64), making it fast for calculations.
  • date and condition will be ‘object’ types (strings).
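
Here is a small sketch of what the typed columns make possible, using the same two-row table. The conversion with `pd.to_datetime` is an extra step added for illustration and is not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-07-14", "2025-07-15"],
    "max_temp": [22, 25],
    "condition": ["Cloudy", "Sunny"],
})

# Each column has its own dtype
print(df.dtypes)

# Numeric columns support calculations directly
print(df["max_temp"].mean())   # 23.5

# Dates arrive as text; pd.to_datetime converts them to a proper date type
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dt.day_name().tolist())   # ['Monday', 'Tuesday']
```

Once the date column has a real date type, pandas can answer date questions (day of the week, month, year) without any string handling.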

The CSV Connection

DataFrames are brilliant for working with data. One of the most common file formats, the Comma-Separated Values (CSV) file, mirrors the structure of a DataFrame very closely.

  • DataFrames can be easily saved to a CSV file:

    df.to_csv('weather_data.csv', index=False)
  • If you open a CSV file in a text editor, you’ll see that it’s just a plain text file with commas separating the columns:

    date,max_temp,condition
    2025-07-14,22,Cloudy
    2025-07-15,25,Sunny
  • A DataFrame can also be loaded back from a CSV file:

    df_loaded = pd.read_csv('weather_data.csv')
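
One thing to keep in mind is that a CSV file stores only text, so pandas has to infer the column types again when loading. The sketch below does a round trip through an in-memory buffer (an assumption made so the example does not write a file to disk) and uses the `parse_dates` option of `pd.read_csv` to recover the dates.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-07-14", "2025-07-15"],
    "max_temp": [22, 25],
    "condition": ["Cloudy", "Sunny"],
})

# Write the CSV text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, asking pandas to parse the date column as dates
df_loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df_loaded.dtypes)
```

Without `parse_dates`, the date column would come back as plain text.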

JSON

  • Data doesn’t always come in a tabular format. Sooner or later, you will have to work with JSON data.

  • Somewhat formally:

    JSON stands for JavaScript Object Notation. It’s a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.

  • But I like to think about it as a mixture of lists and dictionaries.

  • For example, this is a JSON object:

    {
      "name": "Jon",
      "city": "London",
      "job": "Assistant Professor",
      "courses": ["ME204", "DS105", "DS205"]
    }
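
The "lists and dictionaries" view is quite literal in Python. As a minimal sketch, the standard-library `json` module turns the JSON text above into a dict containing a list:

```python
import json

json_text = """
{
  "name": "Jon",
  "city": "London",
  "job": "Assistant Professor",
  "courses": ["ME204", "DS105", "DS205"]
}
"""

person = json.loads(json_text)   # JSON text -> Python dict

print(type(person).__name__)     # dict
print(person["city"])            # London
print(person["courses"][0])      # ME204

# And back again: Python objects -> JSON text
print(json.dumps(person["courses"]))   # ["ME204", "DS105", "DS205"]
```

Navigating JSON data is therefore a matter of chaining dictionary keys and list positions.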

A source of JSON data: Open-Meteo

Let’s browse the Open-Meteo website: https://open-meteo.com/

  • It’s a website that provides weather data for a given location.
  • It’s a free service (with some limits on the number of requests).

Collecting Data from APIs 11:00 – 11:30

What’s an API?

API stands for Application Programming Interface.

  • Think of it as a waiter at a restaurant.
  • You (the “client”) have a menu of requests you can make.
  • You give your order to the waiter (the API), who goes to the kitchen (the “server”) and brings back the food (the “data”) you asked for.
  • It’s a structured way for computer programs to talk to each other and for us to talk to them.

Using the requests library

In Python, we can use the requests library to make HTTP requests to APIs.

  • It does not come pre-installed with Python, so we need to install it:

    pip install requests
  • Then, after importing it, we can use it to make a request to the API:

    import requests
    
    base_url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": 51.5085,
        "longitude": -0.1257,
        "daily": "temperature_2m_max,temperature_2m_min,weather_code",
        "forecast_days": 3
    }
    
    response = requests.get(base_url, params=params)
    
    # Check the response status code
    print(response.status_code) # 200 means success!
    
    # Get the content in JSON format
    data = response.json()
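
It can help to see what the params dictionary turns into. The requests library builds the final URL for you; the sketch below approximates that step with the standard-library `urlencode` function, so it runs without sending any request. The variable `full_url` is introduced here only for illustration.

```python
from urllib.parse import urlencode

base_url = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 51.5085,
    "longitude": -0.1257,
    "daily": "temperature_2m_max,temperature_2m_min,weather_code",
    "forecast_days": 3,
}

# Roughly what gets assembled from base_url and params
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

Each key and value becomes a `key=value` pair after the `?`, joined by `&`, and the commas in the daily list are percent-encoded as `%2C`. Pasting the printed URL into a browser should show the same JSON that `response.json()` returns.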

From API to DataFrame

Assuming the request was successful, you will get something like this:

{
  "latitude": 51.5,
  "longitude": -0.120000124,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "elevation": 23.0,
  "daily_units": {
    "time": "iso8601",
    "temperature_2m_max": "°C",
    "temperature_2m_min": "°C",
    "weather_code": "wmo code"
  },
... it continues

From API to DataFrame (continued)

Which continues to:

  "daily": {
    "time": [
      "2025-07-14",
      "2025-07-15",
      "2025-07-16"
    ],
    "temperature_2m_max": [
      24.6,
      22.1,
      24.8
    ],
    "temperature_2m_min": [
      17.4,
      15.1,
      15.9
    ],
    "weather_code": [
      80,
      95,
      3
    ]
  }
}

From API to DataFrame (continued)

To create a meaningful DataFrame, I would need to first navigate to the relevant part of the JSON object (the daily part).

daily_data = data['daily']

This is a plain Python dictionary:

# what daily_data looks like in Python
{
    "time": ["2025-07-14", "2025-07-15", "2025-07-16"],
    "temperature_2m_max": [24.6, 22.1, 24.8],
    "temperature_2m_min": [17.4, 15.1, 15.9],
    "weather_code": [80, 95, 3]
}

From API to DataFrame (continued)

Which I can then convert to a DataFrame:

df = pd.DataFrame(daily_data)

Printing this on the console will look like this:

         time  temperature_2m_max  temperature_2m_min  weather_code
0  2025-07-14                24.6                17.4            80
1  2025-07-15                22.1                15.1            95
2  2025-07-16                24.8                15.9             3

Or, on Jupyter, it will look like a proper table:

time        temperature_2m_max  temperature_2m_min  weather_code
2025-07-14  24.6                17.4                80
2025-07-15  22.1                15.1                95
2025-07-16  24.8                15.9                3

From API to DataFrame: the workflow

Whenever you want to get data from an API, you will need to follow these steps:

  1. Find the API endpoint you want to use: Work out which URL points to the data you want. For example:
    • Open-Meteo’s forecast endpoint is
    https://api.open-meteo.com/v1/forecast
    • Open-Meteo’s historical weather endpoint is
    https://archive-api.open-meteo.com/v1/archive
  2. Make a GET request: Use the requests library to ask for the data.
  3. Get a JSON response: The server sends back the data in a semi-structured JSON format.
  4. Parse the JSON: Extract the useful parts.
  5. Load into a DataFrame: Convert the clean data into a pandas DataFrame.
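
Steps 4 and 5 can be sketched as a small reusable function. To keep the example runnable without an internet connection, it uses a trimmed copy of the response shown earlier in place of `response.json()`; the function name `daily_to_dataframe` is hypothetical.

```python
import pandas as pd

# Trimmed copy of the response shown earlier (stands in for response.json())
payload = {
    "latitude": 51.5,
    "longitude": -0.120000124,
    "daily_units": {"time": "iso8601", "temperature_2m_max": "°C"},
    "daily": {
        "time": ["2025-07-14", "2025-07-15", "2025-07-16"],
        "temperature_2m_max": [24.6, 22.1, 24.8],
        "temperature_2m_min": [17.4, 15.1, 15.9],
        "weather_code": [80, 95, 3],
    },
}

def daily_to_dataframe(payload: dict) -> pd.DataFrame:
    """Pick out the 'daily' part of the response and load it into a DataFrame."""
    if "daily" not in payload:
        raise KeyError("response has no 'daily' section")
    df = pd.DataFrame(payload["daily"])
    df["time"] = pd.to_datetime(df["time"])   # text -> proper dates
    return df

df = daily_to_dataframe(payload)
print(df.shape)                          # (3, 4)
print(df["temperature_2m_max"].max())    # 24.8
```

Wrapping the parsing in a function means the same steps can be re-run on a fresh response, which is what makes the workflow reproducible.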

🍵 Coffee Break 11:30 – 11:45



Let’s take 15 minutes to get a coffee and come back refreshed.

When we return:

  • Let’s create a Jupyter Notebook from scratch: the goal is to make our code more reproducible.
  • Let’s use Markdown to document our code.
  • Let’s go back to our AI quest: if I am more specific about what I want, can the AI be more helpful?

🖥️ Live Demos 11:45 – 13:00

  • Follow along with me if you brought your laptop.
  • Otherwise, just take notes of key steps and practice in the afternoon lab.

LSE Summer School 2025 | ME204 Week 01 Day 02