🗓️ Week 04
Working with Tabular Data in Python (NumPy & Pandas)

DS105W – Data for Data Science

13 Feb 2025

1️⃣ Solution to the 📝 W04 Formative Exercise

16:03 – 16:15

(Early Insights)

Good Patterns in Your Code

🖥️ Live Demo: Some good practices I saw in your submissions.

Unhelpful Patterns I’ve Seen (so far)

  • Seemingly uncritical use of AI?
    (ChatGPT loves suggesting these patterns, but I suspect many of you don’t understand what they do)
    • dict.get('key', {}) instead of dict['key']
    • for a, b in zip(list_a, list_b)
    • Repeatedly importing the same library in the notebook
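For reference, here is a minimal sketch of what the first two patterns actually do. The `record`, `cities` and `city_temps` values are made up for illustration and do not come from the formative exercise.

```python
# Hypothetical record, for illustration only
record = {"name": "London", "weather": {"temp": 25.3}}

# Square brackets raise a KeyError when the key is missing,
# which tells you straight away that the data is not what you expected
try:
    record["humidity"]
except KeyError as err:
    missing_key = err.args[0]
print(missing_key)  # humidity

# .get() with a default quietly returns the fallback instead
fallback = record.get("humidity", {})
print(fallback)  # {}

# zip() pairs up items from two lists, position by position
cities = ["London", "Paris"]
city_temps = [25.3, 28.1]
pairs = [(city, temp) for city, temp in zip(cities, city_temps)]
print(pairs)  # [('London', 25.3), ('Paris', 28.1)]
```

Neither pattern is wrong in itself. The point is to use `.get()` only when a missing key is genuinely expected, because otherwise it can hide problems in the data.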

Today’s Agenda

Code We’ve Seen So Far

  • Lists
  • Dictionaries
  • for loops
  • JSON data

Where We’re Going

  • From lists to arrays
  • From dictionaries to DataFrames
  • From loops to vectorisation
  • Tabular data analysis
    using professional tools (numpy and pandas)

2️⃣ Introduction to NumPy

16:15 – 16:50

Why NumPy? 🤔

Key Benefits

  • More efficient memory usage
  • Faster mathematical operations
  • Built for scientific computing
  • Foundation for AI/ML libraries

Core Features

  • Vectorised operations (no for loops)
  • Strong data types
  • Powerful slicing and indexing
  • Multi-dimensional data

Python Lists vs NumPy Arrays

Say you want to create a new list with the temperatures in Fahrenheit.

Python List

temps_c = [25.3, 28.1, 22.5]

# Modify each temp in a loop
temps_f = []
for t in temps_c:
    temps_f.append(t * 9/5 + 32)

print(temps_f)
# [77.54, 82.58, 72.5]

NumPy Array

import numpy as np

temps_c = np.array([25.3, 28.1, 22.5])

# Convert every element at once (no loop)
temps_f = temps_c * 9/5 + 32

print(temps_f)
# [77.54 82.58 72.5 ]

Python list characteristics

  • Dynamic size
  • Mixed data types
  • Python objects overhead
  • Slower operations

NumPy array characteristics

  • Fixed size
  • Same data type
  • Contiguous memory
  • Vectorised operations
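
A small sketch of the "mixed data types" versus "same data type" contrast. The values are arbitrary and chosen only to show how NumPy settles on one dtype for the whole array.

```python
import numpy as np

# A list happily mixes types and can grow
mixed = [1, 2.5, "three"]
mixed.append(4)

# An array keeps a single dtype: the integers are upcast to float
arr = np.array([1, 2.5, 3])
print(arr.dtype)   # float64
print(arr.nbytes)  # 24 (3 elements x 8 bytes each)

# Mixing numbers and text turns every element into a string
as_text = np.array([1, 2.5, "three"])
print(as_text.dtype.kind)  # U (Unicode string)
```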

Data Types in NumPy

NumPy Types

  • Integers: np.int8 to np.int64
  • Floats: np.float32, np.float64
  • Others: np.bool_, np.str_

💡 Choose types based on memory needs

🔗 Link: NumPy Data Types

Working with Types

# Create with specific type
temps = np.array([25.3, 28.1], 
                 dtype=np.float32)

# Check type
print(temps.dtype)

# Convert type
temps = temps.astype(np.float64)
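
To make the memory tip concrete, the sketch below compares the same one million values stored as `float64` and as `float32`. The array `readings` is a placeholder, not real weather data.

```python
import numpy as np

# One million placeholder readings stored as 64-bit floats
readings = np.ones(1_000_000, dtype=np.float64)

# .astype() returns a converted copy; the original is unchanged
compact = readings.astype(np.float32)

print(readings.nbytes)  # 8000000
print(compact.nbytes)   # 4000000
print(readings.dtype, compact.dtype)  # float64 float32
```

Halving the memory comes at the cost of precision, so `float32` is a reasonable choice only when roughly seven significant digits are enough.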

Accessing and Modifying Arrays

Basic Indexing

# First element
temps[0]
# Last element
temps[-1]

# From the 2nd to the 4th element (indices 1 to 3)
temps[1:4]
# Every second element
temps[::2] 

Boolean Indexing

# Find hot days (>25°C)
# This is also called a mask
hot_days = temps > 25

# Filter according to the mask
hot_temps = temps[hot_days]
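
A worked version of the mask idea, assuming a five-day `temps` array (the same values used later for `np.where`). It also shows a mask being used to modify values, here on a copy named `capped`, which is an illustrative name.

```python
import numpy as np

temps = np.array([25, 28, 22, 27, 24])

# The mask is an array of True/False, one per element
hot_days = temps > 25
print(hot_days)        # [False  True False  True False]

# Filtering keeps only the positions where the mask is True
hot_temps = temps[hot_days]
print(hot_temps)       # [28 27]

# True counts as 1, so summing the mask counts the hot days
print(hot_days.sum())  # 2

# A mask can also select the elements to overwrite
capped = temps.copy()
capped[hot_days] = 25
print(capped)          # [25 25 22 25 24]
```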

Useful NumPy Functions

Statistics

# Mean
np.mean(temps)   
# Median
np.median(temps) 
# Standard deviation
np.std(temps)    
# Minimum
np.min(temps)    
# Maximum
np.max(temps)    
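
One detail worth knowing about these functions: `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `.std()` uses the sample version (`ddof=1`). A short sketch with an assumed five-day `temps` array:

```python
import numpy as np

temps = np.array([25, 28, 22, 27, 24])

print(np.mean(temps))    # 25.2
print(np.median(temps))  # 25.0

# Population standard deviation (NumPy's default, ddof=0)
pop_std = np.std(temps)
# Sample standard deviation (divide by n - 1 instead of n)
sample_std = np.std(temps, ddof=1)

print(f"{pop_std:.3f}")     # 2.135
print(f"{sample_std:.3f}")  # 2.387
```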

Array Operations

# Array of zeros
np.zeros(5)  # [0. 0. 0. 0. 0.]

# Array of ones
np.ones(3)   # [1. 1. 1.]

# Range: start, stop (excluded), step
np.arange(0, 10, 2)  # [0 2 4 6 8]

# 5 evenly spaced numbers from 0 to 1 (both included)
np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

Logical Operations

# Single condition
hot = temps > 25

# This AND that
perfect = (temps >= 20) & (temps <= 25)

# This OR that (outside the 20 to 25 range)
not_perfect = (temps < 20) | (temps > 25)

💡 Use & and |, not and/or

⚠️ Using and/or with arrays of more than one element raises a ValueError (the truth value of an array is ambiguous)!
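
A minimal sketch of the difference, assuming a five-day `temps` array. With current NumPy versions, combining two masks with `and` raises an error, while `&` works element by element.

```python
import numpy as np

temps = np.array([25, 28, 22, 27, 24])

# & compares the two masks element by element
perfect = (temps >= 20) & (temps <= 25)
print(perfect)  # [ True False  True False  True]

# `and` asks Python for a single True/False for the whole array,
# which NumPy refuses to guess
try:
    (temps >= 20) and (temps <= 25)
except ValueError as err:
    message = str(err)
print(message)
```

The parentheses around each condition matter too: `&` and `|` bind more tightly than comparisons such as `>=`.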

Where Function

Use it like an if-else statement but for arrays.

Say temps = np.array([25, 28, 22, 27, 24]):

# If True, return 'Hot', otherwise 'Cool'
np.where(temps > 25,'Hot','Cool')

You can also chain them together:

# If True, return 'Hot', otherwise check if cold
np.where(temps > 25,'Hot',
        # If True, return 'Cold', otherwise return 'Mild'
         np.where(temps < 15, 'Cold', 'Mild'))
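
Here is what both calls return on a concrete array. The values below are hypothetical: the third day is set to 12 so that the 'Cold' branch is actually reached.

```python
import numpy as np

# Hypothetical temperatures, including one cold day
temps = np.array([25, 28, 12, 27, 24])

# Two-way split: exactly 25 is not > 25, so it is 'Cool'
two_way = np.where(temps > 25, 'Hot', 'Cool')
print(two_way)    # ['Cool' 'Hot' 'Cool' 'Hot' 'Cool']

# Three-way split: the inner np.where only decides the non-hot days
three_way = np.where(temps > 25, 'Hot',
                     np.where(temps < 15, 'Cold', 'Mild'))
print(three_way)  # ['Mild' 'Hot' 'Cold' 'Hot' 'Mild']
```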

Mathematical Operations

Beyond basic arithmetic, NumPy has many built-in functions.

# Square root
np.sqrt(temps)

# Exponential
np.exp(temps)

# Logarithm
np.log(temps)

# Sine
np.sin(temps)

# Cosine
np.cos(temps)

NumPy in Action

We’ll work in the Jupyter notebook to:

  1. Load weather data into NumPy arrays
  2. Find hot days using array operations
  3. Compare performance with Python lists
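
As a preview of step 3, a rough timing sketch using the standard-library `timeit` module. The data is synthetic, the function names are illustrative, and the exact timings will differ from machine to machine, so no expected numbers are shown.

```python
import timeit

import numpy as np

# Synthetic data: 100,000 temperatures between 20 and 34 °C
temps_list = [20.0 + (i % 15) for i in range(100_000)]
temps_arr = np.array(temps_list)

def convert_list():
    # Pure Python: one multiplication and addition per loop step
    return [t * 9/5 + 32 for t in temps_list]

def convert_array():
    # NumPy: the whole array is converted in one vectorised step
    return temps_arr * 9/5 + 32

# Run each version 20 times and report the total time
list_time = timeit.timeit(convert_list, number=20)
array_time = timeit.timeit(convert_array, number=20)
print(f"list: {list_time:.4f}s, array: {array_time:.4f}s")
```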

🍵 Quick Break (+ challenges)

16:50 – 17:15

+ Practice Exercise

I will give you two challenges to solve during the break.

  • Exercise 1: Filtering data with numpy
  • Exercise 2: “Perfect” vs “Not Perfect” summer days

💡 The exercises are in the Jupyter notebook I will share at the lecture.

3️⃣ Pandas Series, DataFrames & Operations

17:15 – 18:00

From NumPy to Pandas

NumPy Array

temps = np.array([25.3, 28.1, 22.5])
dates = np.array(['2024-06-01', 
                 '2024-06-02',
                 '2024-06-03'])

# No built-in labeling
print(temps[0])  # 25.3

Pandas Series

import pandas as pd

temps = pd.Series(
    [25.3, 28.1, 22.5],
    index=['2024-06-01', 
           '2024-06-02',
           '2024-06-03']
)

# Access by label: 25.3
print(temps['2024-06-01'])  

Pandas Series

Key Features

  • 1D labeled array
  • Built on NumPy
  • Automatic alignment by index
  • Handles missing data
  • Built-in time series tools
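
A small sketch of "automatic alignment by index" and "handles missing data". The two Series and their names (`june`, `adjustment`) are invented for illustration.

```python
import pandas as pd

june = pd.Series([25.3, 28.1],
                 index=['2024-06-01', '2024-06-02'])
adjustment = pd.Series([1.0, 1.0],
                       index=['2024-06-02', '2024-06-03'])

# Values are matched by label, not by position.
# Labels present in only one Series become NaN (missing).
total = june + adjustment
print(total)

# Most statistics skip missing values by default
print(total.count())  # 1
```

Only '2024-06-02' appears in both Series, so it is the only label with a number in `total`.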

Common Operations

# Statistics
temps.mean()
temps.describe()

# Filtering
hot_days = temps[temps > 25]

# This also works:
hot_days = temps.loc[temps > 25]

⚠️ With DataFrames, always use .loc[] for boolean indexing!
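
A minimal sketch of boolean indexing with `.loc[]` on a DataFrame, using a small weather table like the one in this lecture. One practical benefit is that `.loc[]` lets you pick rows and columns in the same step.

```python
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# Rows where the condition is True, all columns
hot = weather_df.loc[weather_df['temp'] > 25]
print(hot)

# Same rows, but only the 'rain' column
rain_on_hot_days = weather_df.loc[weather_df['temp'] > 25, 'rain']
print(rain_on_hot_days.tolist())  # [0.0, 0.5]
```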

Pandas DataFrames

Structure

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', 
         '2024-06-02',
         '2024-06-03'])

weather_df.to_csv('weather.csv')

weather_df.to_json('weather.json')

weather_df['temp'].plot()

Advantages

  • Multiple columns (like Excel)
  • Easy data manipulation
  • CSV/JSON import/export
  • Built-in plotting
    (for prototyping)
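
The export calls have matching import functions. The sketch below round-trips the table through CSV text in memory (via `io.StringIO`) rather than a file on disk, which is an assumption made here just to keep the example self-contained.

```python
import io

import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# Without a file path, to_csv() returns the CSV as a string
csv_text = weather_df.to_csv()
print(csv_text.splitlines()[0])  # ,temp,rain

# index_col=0 turns the first column back into the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

With a real file, the equivalent import would be `pd.read_csv('weather.csv', index_col=0)`.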

The .loc and .iloc Attributes

.loc: Label-based

# Get by date
df.loc['2024-06-01']

# Date and column
df.loc['2024-06-01', 'temp']

# Multiple dates
df.loc['2024-06-01':'2024-06-02']

.iloc: Integer-based

# First row
df.iloc[0]

# First row, first column
df.iloc[0, 0]

# First two rows
df.iloc[0:2]
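
One difference that often trips people up: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position (like regular Python slicing). A sketch using the weather table from this lecture:

```python
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# Label slice: '2024-06-02' IS included
by_label = weather_df.loc['2024-06-01':'2024-06-02']
print(len(by_label))     # 2

# Position slice: position 2 is NOT included
by_position = weather_df.iloc[0:2]
print(len(by_position))  # 2

# Single cell, two ways
print(weather_df.loc['2024-06-01', 'temp'])  # 25.3
print(weather_df.iloc[0, 0])                 # 25.3
```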

The .assign Method

Adding New Columns

# Convert to Fahrenheit and add weather type
weather_df = (
    weather_df
    .assign(
        temp_f=weather_df["temp"] * 9/5 + 32,
        weather=np.where(weather_df["temp"] > 25, 'Hot', 'Cool')
    )
)

💡 .assign() is “chainable” and doesn’t modify the original DataFrame

⚠️ Avoid modifying DataFrames in-place:

# Instead of this:
df['new_col'] = df['old_col'] * 2

# DO this instead:
df = df.assign(new_col=lambda x: x.old_col * 2)
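
A quick check of the claim that `.assign()` leaves the original untouched. The name `with_fahrenheit` is illustrative.

```python
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

# .assign() returns a NEW DataFrame with the extra column
with_fahrenheit = weather_df.assign(
    temp_f=lambda x: x['temp'] * 9/5 + 32
)

print(list(weather_df.columns))       # ['temp', 'rain']
print(list(with_fahrenheit.columns))  # ['temp', 'rain', 'temp_f']
```

Because nothing changes unless you reassign the result, re-running a notebook cell gives the same output each time.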

The .apply Method (Series)

Apply to Series

def classify_temp(temp):
    if temp > 25:
        return 'Hot'
    return 'Cool'

# Apply to temperature column
weather_df['temp'].apply(classify_temp)

The .apply Method (DataFrame)

Apply to DataFrame

def classify_weather(row):
    if row['temp'] > 25 and row['rain'] < 1:
        return 'Hot & Dry'
    return 'Other'

# Apply to each row
weather_df.apply(classify_weather, axis=1)

You need to specify axis=1 if you want to apply the function to each row. axis=0 is the default and applies the function to each column.
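
The sketch below shows both axes on the lecture's weather table, followed by a vectorised alternative using `np.where` for comparison. The vectorised version is one possible equivalent, not a required replacement.

```python
import numpy as np
import pandas as pd

weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5],
    'rain': [0.0, 0.5, 1.2]
}, index=['2024-06-01', '2024-06-02', '2024-06-03'])

def classify_weather(row):
    if row['temp'] > 25 and row['rain'] < 1:
        return 'Hot & Dry'
    return 'Other'

# axis=1: the function receives one ROW at a time
row_labels = weather_df.apply(classify_weather, axis=1)
print(row_labels.tolist())  # ['Hot & Dry', 'Hot & Dry', 'Other']

# axis=0 (default): the function receives one COLUMN at a time
col_max = weather_df.apply(lambda col: col.max())
print(col_max.to_dict())    # {'temp': 28.1, 'rain': 1.2}

# Same classification without a Python-level loop over rows
vectorised = np.where(
    (weather_df['temp'] > 25) & (weather_df['rain'] < 1),
    'Hot & Dry', 'Other'
)
print(vectorised.tolist())  # ['Hot & Dry', 'Hot & Dry', 'Other']
```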

💡 Pro tip: Chain pandas methods together

(weather_df
    .assign(temp_f=lambda x: x.temp * 9/5 + 32)
    .query('temp_f > 80')
    .sort_values('temp_f', ascending=False))
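
To see the chain produce something, here it is on the lecture's weather table with one extra, hypothetical day ('2024-06-04' at 30.0 °C) added so that more than one row passes the filter.

```python
import pandas as pd

# The fourth row is hypothetical, added for illustration
weather_df = pd.DataFrame({
    'temp': [25.3, 28.1, 22.5, 30.0],
    'rain': [0.0, 0.5, 1.2, 0.0]
}, index=['2024-06-01', '2024-06-02',
          '2024-06-03', '2024-06-04'])

hottest = (
    weather_df
    .assign(temp_f=lambda x: x.temp * 9/5 + 32)  # add Fahrenheit
    .query('temp_f > 80')                        # keep hot rows
    .sort_values('temp_f', ascending=False)      # hottest first
)

print(list(hottest.index))  # ['2024-06-04', '2024-06-02']
```

Each step receives the DataFrame returned by the previous one, which is why `.query()` can already see the `temp_f` column.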

What Next?

THE END

  • Practice NumPy and Pandas tomorrow
  • Read the NumPy and Pandas quickstart guides
  • Start working on your ✍🏻 Mini-Project I
    • Released on Moodle tomorrow
    • Due Thursday 27 February 2025, 8pm
    • Similar structure to formative (but with numpy and pandas)
    • Worth 20% of your final mark