🗓️ Week 04
Working with Tabular Data in Python (NumPy & Pandas)

DS105W – Data for Data Science

Dr Jon Cardoso-Silva

LSE Data Science Institute

13 Feb 2025

1️⃣ Solution to the 📝 W04 Formative Exercise

16:03 – 16:15

(Early Insights)

Good Patterns in Your Code

🖥️ Live Demo: Some good practices I saw in your submissions.

Unhelpful Patterns I’ve Seen (so far)

Seemingly uncritical use of AI?
(ChatGPT loves suggesting this but I suspect many don’t understand what it does)
- dict.get('key', {}) instead of dict['key']
- for a, b in zip(list_a, list_b)
- Repeatedly importing the same library in the notebook
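For anyone unsure what the first two patterns actually do, here is a minimal sketch. The `record` dictionary and the two lists are made-up illustrations, not data from the formative.

```python
# Hypothetical record, loosely shaped like nested JSON weather data
record = {"city": "London", "daily": {"temp_max": [25.3, 28.1]}}

# dict['key'] raises KeyError when the key is missing,
# so the problem is visible immediately
try:
    record["hourly"]
except KeyError as err:
    missing_key = err.args[0]
print(missing_key)  # hourly

# dict.get('key', {}) returns the default instead of raising,
# so missing data can go unnoticed
hourly = record.get("hourly", {})
print(hourly)  # {}

# zip pairs items by position and stops at the shorter input
dates = ["2024-06-01", "2024-06-02", "2024-06-03"]
temps = [25.3, 28.1]
pairs = list(zip(dates, temps))
print(pairs)  # [('2024-06-01', 25.3), ('2024-06-02', 28.1)]
```

Neither pattern is wrong in itself. The point is to know that `.get()` hides missing keys and that `zip` drops unmatched items, and to choose them on purpose.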

Today’s Agenda

Code We’ve Seen So Far

Lists
Dictionaries
for loops
JSON data

Where We’re Going

From lists to arrays
From dictionaries to DataFrames
From loops to vectorisation
Tabular data analysis
using professional tools (numpy and pandas)

2️⃣ Introduction to NumPy

16:15 – 16:50

Why NumPy? 🤔

Key Benefits

More efficient memory usage
Faster mathematical operations
Built for scientific computing
Foundation for AI/ML libraries

Core Features

Vectorised operations (no for loops)
Strong data types
Powerful slicing and indexing
Multi-dimensional data

Python Lists vs NumPy Arrays

Say you want to create a new list with the temperatures in Fahrenheit.

Python List

temps_c = [25.3, 28.1, 22.5]

# Modify each temp in a loop
temps_f = []
for t in temps_c:
    temps_f.append(t * 9/5 + 32)

print(temps_f)
# [77.54, 82.58, 72.5]

NumPy Array

import numpy as np

temps_c = np.array([25.3, 28.1, 22.5])

# Modify everything at once
temps_f = temps_c * 9/5 + 32

print(temps_f)
# [77.54 82.58 72.5 ]

Python List

Dynamic size
Mixed data types
Python objects overhead
Slower operations

NumPy Array

Fixed size
Same data type
Contiguous memory
Vectorised operations
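Two of these differences are easy to see directly. The sketch below uses a small made-up list (`mixed_list` is just an illustrative name) to show the single data type and the element-wise behaviour of arrays.

```python
import numpy as np

mixed_list = [1, 2.5, 3]            # a list keeps each element's own type
mixed_array = np.array(mixed_list)  # an array converts everything to one dtype

print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']
print(mixed_array.dtype)                       # float64

# Multiplying a list repeats it; multiplying an array works element by element
print(mixed_list * 2)   # [1, 2.5, 3, 1, 2.5, 3]
print(mixed_array * 2)  # [2. 5. 6.]
```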

Data Types in NumPy

NumPy Types

Integers: np.int8 to np.int64
Floats: np.float32, np.float64
Others: np.bool_, np.str_ (the older np.string_ alias was removed in NumPy 2.0)

💡 Choose types based on memory needs

🔗 Link: NumPy Data Types

Working with Types

# Create with specific type
temps = np.array([25.3, 28.1], 
                 dtype=np.float32)

# Check type
print(temps.dtype)

# Convert type
temps = temps.astype(np.float64)
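To make "choose types based on memory needs" concrete, the sketch below compares the footprint of the same million values stored as float64 and float32. The array `readings` is a made-up placeholder, and the last line shows the precision you give up in exchange.

```python
import numpy as np

# One million placeholder readings stored as 64-bit floats
readings = np.zeros(1_000_000, dtype=np.float64)
print(readings.itemsize, readings.nbytes)  # 8 8000000

# The same values as 32-bit floats take half the memory
small = readings.astype(np.float32)
print(small.itemsize, small.nbytes)  # 4 4000000

# The trade-off: float32 keeps fewer significant digits,
# so 25.3 is stored slightly differently
same_value = float(np.float32(25.3)) == 25.3
print(same_value)  # False
```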

Accessing and Modifying Arrays

Basic Indexing

# First element
temps[0]
# Last element
temps[-1]

# 2nd to 4th element (indices 1, 2 and 3)
temps[1:4]
# Every second element
temps[::2]

Boolean Indexing

# Find hot days (>25°C)
# This is also called a mask
hot_days = temps > 25

# Filter according to the mask
hot_temps = temps[hot_days]
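Here is the same idea as a runnable sketch with four made-up temperatures, so you can see what the mask looks like. Because True counts as 1, the mask can also be summed or averaged.

```python
import numpy as np

temps = np.array([25.3, 28.1, 22.5, 26.0])

# The mask is an array of booleans, one per element
hot_days = temps > 25
print(hot_days)  # [ True  True False  True]

# Indexing with the mask keeps only the True positions
hot_temps = temps[hot_days]
print(hot_temps)  # [25.3 28.1 26. ]

# Count and share of hot days
print(hot_days.sum())   # 3
print(hot_days.mean())  # 0.75
```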

Useful NumPy Functions

Statistics

# Mean
np.mean(temps)   
# Median
np.median(temps) 
# Standard deviation
np.std(temps)    
# Minimum
np.min(temps)    
# Maximum
np.max(temps)

Array Operations

# Array of zeros
np.zeros(5)  # [0. 0. 0. 0. 0.]

# Array of ones
np.ones(3)   # [1. 1. 1.]

# Range (start, stop, step); stop is excluded
np.arange(0, 10, 2)  # [0 2 4 6 8]

# 5 evenly spaced numbers from 0 to 1 (both ends included)
np.linspace(0, 1, 5)

Logical Operations

# Single condition
hot = temps > 25

# This AND that
perfect = (temps >= 20) & (temps <= 25)

# This OR that (too cold or too hot)
extreme = (temps < 20) | (temps > 25)

💡 Use & and |, not and/or

⚠️ Using and/or with arrays raises a ValueError (the truth value of an array with more than one element is ambiguous)!
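A minimal sketch of what happens if you try `and` on two arrays, using the same five temperatures as the next slide. The error is caught here only so the script keeps running.

```python
import numpy as np

temps = np.array([25, 28, 22, 27, 24])

# `and` asks Python for a single True/False for the whole array,
# which NumPy refuses to guess
try:
    perfect = (temps >= 20) and (temps <= 25)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# `&` compares element by element instead
perfect = (temps >= 20) & (temps <= 25)
print(perfect)  # [ True False  True False  True]
```

The parentheses around each condition matter too: `&` binds more tightly than the comparison operators.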

Where Function

Use it like an if-else statement but for arrays.

Say temps = np.array([25, 28, 22, 27, 24]):

# If True, return 'Hot', otherwise 'Cool'
np.where(temps > 25,'Hot','Cool')

You can also chain them together:

# If True, return 'Hot', otherwise check if cold:
# if True, return 'Cold', otherwise return 'Mild'
np.where(temps > 25, 'Hot',
         np.where(temps < 15, 'Cold', 'Mild'))
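As a quick check of the chained version, the sketch below adds one cold value (10) to the example array so that all three labels appear. That extra value is an assumption made for illustration.

```python
import numpy as np

# Same temperatures as above, plus one cold day so every branch is used
temps = np.array([25, 28, 22, 27, 24, 10])

labels = np.where(temps > 25, 'Hot',
                  np.where(temps < 15, 'Cold', 'Mild'))
print(labels)  # ['Mild' 'Hot' 'Mild' 'Hot' 'Mild' 'Cold']
```

Note that 25 itself is labelled 'Mild', because the condition is strictly greater than 25.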

Mathematical Operations

Beyond basic arithmetic, NumPy has many built-in functions.

# Square root
np.sqrt(temps)

# Exponential
np.exp(temps)

# Logarithm
np.log(temps)

# Sine
np.sin(temps)

# Cosine
np.cos(temps)

NumPy in Action

We’ll work in the Jupyter notebook to:

Load weather data into NumPy arrays
Find hot days using array operations
Compare performance with Python lists
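If you want to try the performance comparison on your own machine first, something like the sketch below should work. The data is synthetic and the function names are made up for illustration. The exact timings will vary, so none are shown.

```python
import timeit
import numpy as np

# Synthetic temperatures: 100,000 values between 20 and 34
temps_list = [20.0 + (i % 15) for i in range(100_000)]
temps_array = np.array(temps_list)

def to_fahrenheit_loop():
    # Pure Python: one multiplication and addition per element
    return [t * 9 / 5 + 32 for t in temps_list]

def to_fahrenheit_numpy():
    # NumPy: the whole array in one vectorised expression
    return temps_array * 9 / 5 + 32

# Run each version 20 times and report the total time
loop_seconds = timeit.timeit(to_fahrenheit_loop, number=20)
numpy_seconds = timeit.timeit(to_fahrenheit_numpy, number=20)
print(f"list:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```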

🍵 Quick Break (+ challenges)

16:50 – 17:15

+ Practice Exercise

I will give you two challenges to solve during the break.

Exercise 1: Filtering data with numpy
Exercise 2: “Perfect” vs “Not Perfect” summer days

💡 The exercises are in the Jupyter notebook I will share at the lecture.

3️⃣ Pandas Series, DataFrames & Operations

17:15 – 18:00

From NumPy to Pandas

NumPy Array

temps = np.array([25.3, 28.1, 22.5])
dates = np.array(['2024-06-01', 
                 '2024-06-02',
                 '2024-06-03'])

# No built-in labeling
print(temps[0])  # 25.3

Pandas Series

temps = pd.Series(
    [25.3, 28.1, 22.5],
    index=['2024-06-01', 
           '2024-06-02',
           '2024-06-03']
)

# Access by label: 25.3
print(temps['2024-06-01'])

Pandas Series

Key Features

1D labeled array
Built on NumPy
Automatic alignment by index
Handles missing data
Built-in time series tools
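"Automatic alignment by index" and "handles missing data" are easiest to understand together. In the sketch below, `other_station` is a made-up second Series with its dates in a different order and one date missing.

```python
import pandas as pd

temps = pd.Series(
    [25.3, 28.1, 22.5],
    index=['2024-06-01', '2024-06-02', '2024-06-03']
)

# Hypothetical second station: different order, no reading for 1 June
other_station = pd.Series(
    [24.0, 27.0],
    index=['2024-06-03', '2024-06-02']
)

# Values are matched by date label, not by position;
# dates missing from either side become NaN
diff = temps - other_station
print(diff)

# Statistics skip NaN by default
print(diff.count())  # 2
```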

Common Operations

# Statistics
temps.mean()
temps.describe()

# Filtering
hot_days = temps[temps > 25]

# This also works:
hot_days = temps.loc[temps > 25]

⚠️ With DataFrames, always use .loc[] for boolean indexing!
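A short sketch of boolean indexing with `.loc[]` on a DataFrame, using the same small weather table that appears in the DataFrames section. One advantage of `.loc[]` is that you can pick rows and columns in a single step.

```python
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# Rows where the mask is True, all columns
hot = weather_df.loc[weather_df['temp'] > 25]

# Rows where the mask is True, only the 'rain' column
hot_rain = weather_df.loc[weather_df['temp'] > 25, 'rain']
print(hot_rain.tolist())  # [0.0, 0.5]

# Combine conditions with & and parentheses, as in NumPy
hot_and_dry = weather_df.loc[
    (weather_df['temp'] > 25) & (weather_df['rain'] < 0.1)
]
print(hot_and_dry.index.tolist())  # ['2024-06-01']
```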

Pandas DataFrames

Structure

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', 
         '2024-06-02',
         '2024-06-03'])

weather_df.to_csv('weather.csv')

weather_df.to_json('weather.json')

weather_df['temp'].plot()

Advantages

Multiple columns (like Excel)
Easy data manipulation
CSV/JSON import/export
Built-in plotting
(for prototyping)

The `.loc` and `.iloc` Attributes

.loc: Label-based

# Get by date
df.loc['2024-06-01']

# Date and column
df.loc['2024-06-01', 'temp']

# Multiple dates
df.loc['2024-06-01':'2024-06-02']

.iloc: Integer-based

# First row
df.iloc[0]

# First row, first column
df.iloc[0, 0]

# First two rows
df.iloc[0:2]
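One detail that often catches people out: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, just like list slicing. A small sketch, assuming `df` is the weather table from the previous slide:

```python
import pandas as pd

df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# Label slice: both '2024-06-01' and '2024-06-02' are included
by_label = df.loc['2024-06-01':'2024-06-02']
print(len(by_label))  # 2

# Position slice: position 2 is excluded, so rows 0 and 1
by_position = df.iloc[0:2]
print(len(by_position))  # 2

# Both select the same two rows here
print(by_label.equals(by_position))  # True

# Single cell: label-based and position-based access agree
print(df.loc['2024-06-01', 'temp'], df.iloc[0, 0])  # 25.3 25.3
```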

The `.assign` Method

Adding New Columns

# Convert to Fahrenheit and add weather type
weather_df = (
    weather_df
    .assign(
        temp_f=weather_df["temp"] * 9/5 + 32,
        weather=np.where(weather_df["temp"] > 25, 'Hot', 'Cool')
    )
)

💡 .assign() is “chainable” and doesn’t modify the original DataFrame

⚠️ Avoid modifying DataFrames in-place:

# Instead of this:
df['new_col'] = df['old_col'] * 2

# DO this instead:
df = df.assign(new_col=lambda x: x.old_col * 2)
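The sketch below checks both claims on a tiny made-up DataFrame: `.assign()` returns a new DataFrame and leaves the original alone, and a lambda receives the DataFrame as it stands at that point in the chain, so it can use a column created one step earlier.

```python
import pandas as pd

df = pd.DataFrame({'old_col': [1, 2, 3]})

# .assign() returns a new DataFrame
result = df.assign(new_col=lambda x: x.old_col * 2)
print(result['new_col'].tolist())  # [2, 4, 6]

# The original is untouched unless you reassign it
print('new_col' in df.columns)  # False

# In a chain, the second lambda can use the column made by the first
chained = (
    df
    .assign(doubled=lambda x: x.old_col * 2)
    .assign(quadrupled=lambda x: x.doubled * 2)
)
print(chained['quadrupled'].tolist())  # [4, 8, 12]
```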

The `.apply` Method (Series)

Apply to Series

def classify_temp(temp):
    if temp > 25:
        return 'Hot'
    return 'Cool'

# Apply to temperature column
weather_df['temp'].apply(classify_temp)

The `.apply` Method (DataFrame)

Apply to DataFrame

def classify_weather(row):
    if row['temp'] > 25 and row['rain'] < 1:
        return 'Hot & Dry'
    return 'Other'

# Apply to each row
weather_df.apply(classify_weather, axis=1)

You need to specify axis=1 if you want to apply the function to each row. axis=0 is the default and applies the function to each column.
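For a rule this simple, a vectorised alternative to row-wise `.apply` is worth knowing. This is a sketch of one option using `np.where`, not the only way to do it. It gives the same labels on the example data without calling a Python function once per row.

```python
import numpy as np
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

def classify_weather(row):
    if row['temp'] > 25 and row['rain'] < 1:
        return 'Hot & Dry'
    return 'Other'

# Row by row: the function runs once per row
applied = weather_df.apply(classify_weather, axis=1)
print(applied.tolist())  # ['Hot & Dry', 'Hot & Dry', 'Other']

# Vectorised: whole columns are compared at once
vectorised = np.where(
    (weather_df['temp'] > 25) & (weather_df['rain'] < 1),
    'Hot & Dry',
    'Other'
)
print(vectorised.tolist())  # ['Hot & Dry', 'Hot & Dry', 'Other']
```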

💡 Pro tip: Chain pandas methods together

(weather_df
    .assign(temp_f=lambda x: x.temp * 9/5 + 32)
    .query('temp_f > 80')
    .sort_values('temp_f', ascending=False))
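Here is the same chain as a self-contained sketch on the three-day weather table, so you can see what comes out. Only 2 June is above 80°F in this data.

```python
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

hottest = (
    weather_df
    .assign(temp_f=lambda x: x.temp * 9/5 + 32)  # add a Fahrenheit column
    .query('temp_f > 80')                        # keep rows above 80°F
    .sort_values('temp_f', ascending=False)      # hottest first
)
print(hottest.index.tolist())  # ['2024-06-02']

# The chain did not change the original DataFrame
print(weather_df.columns.tolist())  # ['temp', 'rain']
```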

What Next?

THE END

Practice NumPy and Pandas tomorrow
Read the NumPy and Pandas quickstart guides
Start working on your ✍🏻 Mini-Project I
- Released on Moodle tomorrow
- Due Thursday 27 February 2025, 8pm
- Similar structure to formative (but with numpy and pandas)
- Worth 20% of your final mark
