πŸ—„οΈ Databases & SQL Cookbook

A practical guide to storing and connecting DataFrames

Author: Dr Jon Cardoso-Silva

Published: 11 March 2025

Use this guide when you need to store pandas DataFrames in a database, or if you have access to a database and need to get data from it into a pandas DataFrame.

1. Why Use Databases?

Think of a database as a collection of interconnected DataFrames with superpowers:

  • Efficient Storage: Only load the data you need
  • Data Relationships: Connect related DataFrames using common keys
  • Data Integrity: Ensure data consistency
  • Query Optimization: Efficiently filter and aggregate data
  • Concurrent Access: Multiple users can access the data simultaneously

Notice the progression in this course: we started at a much lower level of abstraction, observing how numbers and strings are stored in memory, then moved to storing data in plain text files (CSV/JSON), and now we are at a higher level of abstraction, storing data in a database.

2. Data Types in SQLite

Before you start storing data, you need to design your database schema. This is like creating a blueprint for how your data will be organized.

Understanding Data Types

SQLite supports these basic types:

TEXT      -- For strings (like 'hello')
INTEGER   -- For whole numbers
REAL      -- For decimal numbers
BLOB      -- For binary data (rarely used in this course)
NULL      -- For missing values
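
Because SQLite uses dynamic typing, you can inspect how a value was actually stored with the built-in typeof() function. A minimal sketch (the table and values are made up for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE demo (a TEXT, b INTEGER, c REAL)")
conn.execute("INSERT INTO demo VALUES ('hello', 42, 3.14)")

# typeof() reports the storage class actually used for each value
print(conn.execute("SELECT typeof(a), typeof(b), typeof(c) FROM demo").fetchone())
# ('text', 'integer', 'real')

conn.close()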

Read more about the different data types in the SQLite documentation.

Really Precise Data Types

SQL also lets you declare precise integer sizes, similar to what we learned in Week 1 about bits and bytes. (A caveat: SQLite accepts these declarations for compatibility but does not enforce them; it chooses how many bytes to use per value automatically.)

INTEGER(1)   -- 1-byte integer (range: -128 to 127)
INTEGER(2)   -- 2-byte integer (range: -32768 to 32767)
INTEGER(4)   -- 4-byte integer (range: -2147483648 to 2147483647)
INTEGER(8)   -- 8-byte integer (range: -9223372036854775808 to 9223372036854775807)

The same is true for variable-length text storage (again, SQLite accepts but does not enforce the character limit):

VARCHAR(10)   -- 10-character variable text (max 10 characters)
VARCHAR(100)  -- 100-character variable text (max 100 characters)
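
You can verify that SQLite ignores the declared limit with a quick experiment (a throwaway in-memory database, purely illustrative):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE demo (code VARCHAR(10))")

# SQLite happily stores a string longer than the declared 10 characters
conn.execute("INSERT INTO demo VALUES ('definitely more than ten characters')")
print(conn.execute("SELECT length(code) FROM demo").fetchone())  # (35,)

conn.close()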

Date and Time

SQLite does not have a dedicated date/time data type. Instead, dates are conventionally stored as TEXT in ISO-8601 format and manipulated with built-in functions:

SELECT datetime('now');  -- Current date and time
SELECT date('now');      -- Current date
SELECT time('now');      -- Current time
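
These functions also handle date arithmetic and formatting directly on ISO-8601 strings. A small sketch (the dates below are arbitrary):

import sqlite3

conn = sqlite3.connect(':memory:')

# Date arithmetic with modifiers
print(conn.execute("SELECT date('now', '+7 days')").fetchone())  # one week from today

# Extract parts of a date with strftime()
print(conn.execute("SELECT strftime('%Y', '2025-03-11')").fetchone())  # ('2025',)

# Days between two dates via julianday()
print(conn.execute(
    "SELECT julianday('2025-03-11') - julianday('2025-01-01')"
).fetchone())  # (69.0,)

conn.close()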

Read more about the date and time functions in the SQLite documentation.

3. Designing Your Database Schema

Creating Tables with Proper Relationships

Always create your tables BEFORE you start storing data. Here’s an example for Spotify data:

from sqlalchemy import create_engine, text

filepath = '../data/spotify.db'

# Create a database connection
engine = create_engine(f'sqlite:///{filepath}')

# Define table schemas with proper data types and 
# relationships
artists_schema = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    followers INTEGER,
    popularity INTEGER,
    genres TEXT
)
"""

albums_schema = """
CREATE TABLE IF NOT EXISTS albums (
    album_id TEXT PRIMARY KEY,
    artist_id TEXT NOT NULL,
    name TEXT NOT NULL,
    release_date DATE,
    total_tracks INTEGER DEFAULT 0,
    FOREIGN KEY (artist_id)
    REFERENCES artists(artist_id)
)
"""

tracks_schema = """
CREATE TABLE IF NOT EXISTS tracks (
    track_id TEXT PRIMARY KEY,
    album_id TEXT NOT NULL,
    name TEXT NOT NULL,
    duration_ms INTEGER,
    popularity INTEGER DEFAULT 0,
    FOREIGN KEY (album_id) REFERENCES albums(album_id)
)
"""

# Create the tables
# (engine.begin() opens a transaction and commits it on exit,
# so the CREATE TABLE statements are actually persisted)
with engine.begin() as conn:
    conn.execute(text(artists_schema))
    conn.execute(text(albums_schema))
    conn.execute(text(tracks_schema))

4. Storing Data in Your Database

Preparing Your Data

Before storing data, make sure it matches your schema:

import pandas as pd

# Example: One column at a time (basic approach)
artists_df['artist_id'] = artists_df['artist_id'].astype(str)
artists_df['followers'] = artists_df['followers'].astype(int)

# Better: Method chaining (more elegant approach)
artists_df = (artists_df
    .assign(
        artist_id=lambda x: x['artist_id'].astype(str),
        name=lambda x: x['name'].astype(str),
        followers=lambda x: x['followers'].astype(int),
        popularity=lambda x: x['popularity'].astype(int),
        genres=lambda x: x['genres'].astype(str)
    )
)

albums_df = (albums_df
    .assign(
        album_id=lambda x: x['album_id'].astype(str),
        artist_id=lambda x: x['artist_id'].astype(str),
        name=lambda x: x['name'].astype(str),
        release_date=lambda x: pd.to_datetime(x['release_date']),
        total_tracks=lambda x: x['total_tracks'].astype(int)
    )
)

tracks_df = (tracks_df
    .assign(
        track_id=lambda x: x['track_id'].astype(str),
        album_id=lambda x: x['album_id'].astype(str),
        name=lambda x: x['name'].astype(str),
        duration_ms=lambda x: x['duration_ms'].astype(int),
        popularity=lambda x: x['popularity'].astype(int)
    )
)
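
One gotcha worth knowing: .astype(int) raises an error if the column contains missing values. A minimal sketch, assuming followers might contain NaNs (whether filling with 0 is appropriate depends on your data):

# Fill (or drop) missing values before casting to int
artists_df['followers'] = artists_df['followers'].fillna(0).astype(int)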

Storing DataFrames

ALWAYS use if_exists='append' unless you specifically want to delete existing data:

# Store data while preserving existing records
artists_df.to_sql(
    'artists',
    engine,
    if_exists='append',  # This preserves existing data
    index=False
)

albums_df.to_sql(
    'albums',
    engine,
    if_exists='append',
    index=False
)

tracks_df.to_sql(
    'tracks',
    engine,
    if_exists='append',
    index=False
)
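
After an append, it is worth verifying that the rows actually arrived. A quick check (the table names match the schema above):

# Count the rows in each table after the insert
for table in ['artists', 'albums', 'tracks']:
    count = pd.read_sql(f'SELECT COUNT(*) AS n FROM {table}', engine)
    print(table, count.loc[0, 'n'])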

Specifying Data Types

When storing data, you can specify SQLAlchemy types to ensure proper data conversion:

from sqlalchemy import Integer, Float, String, DateTime

# Define column types
dtypes = {
    'artist_id': String,
    'name': String,
    'followers': Integer,
    'popularity': Integer,
    'genres': String
}

# Store with specific types
artists_df.to_sql(
    'artists',
    engine,
    if_exists='append',
    index=False,
    dtype=dtypes
)
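
You can confirm the column types SQLite ended up with using the table_info pragma:

# Inspect the stored schema of the artists table
with engine.connect() as conn:
    for row in conn.execute(text("PRAGMA table_info(artists)")):
        print(row)  # (column id, name, declared type, notnull, default, pk)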

5. Querying Your Data

Basic Queries

# Read entire table
df = pd.read_sql('SELECT * FROM artists', engine)

# Select specific columns
df = pd.read_sql('SELECT name, followers FROM artists', engine)

# Filter data
df = pd.read_sql(
    'SELECT * FROM tracks WHERE popularity > 80',
    engine
)
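
If part of the filter comes from a variable (for example, user input), prefer bound parameters over string formatting. One way to do this, using SQLAlchemy's text() (the threshold value is arbitrary):

from sqlalchemy import text

# Bound parameter instead of f-string interpolation
threshold = 80
df = pd.read_sql(
    text('SELECT * FROM tracks WHERE popularity > :threshold'),
    engine,
    params={'threshold': threshold}
)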

Joining Tables

# Using SQL JOIN
query = """
SELECT t.*, a.name as album_name, ar.name as artist_name
FROM tracks t
JOIN albums a ON t.album_id = a.album_id
JOIN artists ar ON a.artist_id = ar.artist_id
WHERE t.popularity > 80
ORDER BY t.popularity DESC
"""
joined_df = pd.read_sql(query, engine)

# Alternative: Using pandas merge
artists = pd.read_sql('SELECT * FROM artists', engine)
albums = pd.read_sql('SELECT * FROM albums', engine)
tracks = pd.read_sql('SELECT * FROM tracks', engine)

# Chain the merges (columns that appear in more than one table,
# like 'name' and 'popularity', get '_x'/'_y' suffixes;
# pass suffixes=... to merge() to control the names)
joined_df = (tracks
    .merge(albums, on='album_id')
    .merge(artists, on='artist_id')
)

Aggregating Data

# Get artist popularity metrics
query = """
SELECT 
    ar.name as artist_name,
    COUNT(DISTINCT a.album_id) as num_albums,
    COUNT(DISTINCT t.track_id) as num_tracks,
    AVG(t.popularity) as avg_track_popularity,
    SUM(t.duration_ms) / 60000.0 as total_duration_minutes
FROM artists ar
LEFT JOIN albums a ON ar.artist_id = a.artist_id
LEFT JOIN tracks t ON a.album_id = t.album_id
GROUP BY ar.name
ORDER BY avg_track_popularity DESC
"""
metrics_df = pd.read_sql(query, engine)
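
For comparison, a rough pandas equivalent, assuming the three DataFrames loaded in the previous section (note this uses inner merges, so artists with no albums are dropped, unlike the SQL LEFT JOINs):

metrics_df = (tracks
    .merge(albums[['album_id', 'artist_id']], on='album_id')
    .merge(artists[['artist_id', 'name']].rename(columns={'name': 'artist_name'}),
           on='artist_id')
    .groupby('artist_name')
    .agg(
        num_albums=('album_id', 'nunique'),
        num_tracks=('track_id', 'nunique'),
        avg_track_popularity=('popularity', 'mean'),
        total_duration_minutes=('duration_ms', lambda s: s.sum() / 60000.0)
    )
    .sort_values('avg_track_popularity', ascending=False)
    .reset_index()
)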

6. Best Practices

  1. Design First: Always design your schema BEFORE storing data

    • Define tables and relationships
    • Choose appropriate data types
    • Add necessary constraints
    • Create indexes for performance
  2. Data Integrity:

    • Use PRIMARY KEY constraints
    • Use FOREIGN KEY constraints
    • Set NOT NULL where appropriate
    • Use UNIQUE constraints when needed
  3. Safe Data Storage:

    • Always use if_exists='append' unless you specifically want to overwrite
    • Verify data types match your schema
    • Test with small datasets first
  4. Performance:

    • Create indexes for frequently queried columns (see the sketch after this list)
    • Only SELECT the columns you need
    • Use appropriate data types
  5. Backup:

    # Backup your database regularly
    import shutil
    from datetime import datetime
    
    backup_name = f"spotify_db_backup_{datetime.now().strftime('%Y%m%d')}.db"
    shutil.copy2('spotify.db', backup_name)

    If your SQLite database file is large (more than 100MB), do not commit it to GitHub!

    • Database files can be very large and are not suitable for version control

    • GitHub has a file size limit of 100MB

    • Store large databases in cloud storage (even Google Drive is fine for the purposes of this course), then share download links with collaborators (add them to README.md)

    • If you do that, remember to add database files to .gitignore:

      # Add to .gitignore
      *.db
      *.sqlite
      *.sqlite3
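
As promised in the performance tips above, a minimal sketch of adding an index (the index name and column choice are illustrative; index the columns you actually filter and sort on):

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///../data/spotify.db')

# Index the column we filter and sort by most often
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_tracks_popularity ON tracks (popularity)"
    ))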

πŸ“š Additional Resources