Databases & SQL Cookbook
A practical guide to storing and connecting DataFrames
Use this guide when you need to store pandas DataFrames in a database, or if you have access to a database and need to get data from it into a pandas DataFrame.
1. Why Use Databases?
Think of a database as a collection of interconnected DataFrames with superpowers:
- Efficient Storage: Only load the data you need.
- Data Relationships: Connect related DataFrames using common keys.
- Data Integrity: Enforce rules to ensure data consistency.
- Query Optimization: Efficiently filter and aggregate huge datasets.
- Concurrent Access: Allow multiple users or scripts to access the data simultaneously.
In this course, we are moving up the ladder of abstraction: from raw text files (like CSVs or JSON) to a structured, robust, and scalable way of managing data.
2. Data Types in SQLite
Before you can store data, you must design your database schema. This is the blueprint for how your data will be organised. A key part of this is choosing the right data type for each column.
Core Data Types
SQLite supports these basic types:
TEXT    -- For strings of any length
INTEGER -- For whole numbers
REAL    -- For decimal (floating-point) numbers
BLOB    -- For binary data (e.g., images, rarely used in this course)
NULL    -- Represents a missing value
You can read more about SQLite data types here.
Precise Data Types
You can also be more specific to save space and ensure data integrity.
For integers, you can specify the number of bytes, just as we discussed in Week 1:
INTEGER(1) -- 1-byte integer (-128 to 127)
INTEGER(2) -- 2-byte integer (-32,768 to 32,767)
INTEGER(4) -- 4-byte integer (~ -2.1b to 2.1b)
INTEGER(8) -- 8-byte integer (very, very large)
For text with a maximum length, you can use VARCHAR (variable character):
VARCHAR(10) -- Stores up to 10 characters
VARCHAR(300) -- Stores up to 300 characters
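For example, a compact schema might combine these precise types. The table and column names below are invented purely for illustration:

# A hypothetical lookup table using precise types; names are invented for this example
countries_schema = """
CREATE TABLE IF NOT EXISTS countries (
    country_code VARCHAR(2) PRIMARY KEY,  -- e.g. 'UK', 'DE'
    name VARCHAR(60),                     -- up to 60 characters
    population INTEGER(8)                 -- large whole number
);
"""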
Date and Time
SQLite doesn't have a dedicated DATETIME type, but it provides powerful functions to handle date and time information, which is usually stored as TEXT, REAL, or INTEGER.
-- Get the current date and time
SELECT datetime('now');
-- Get just the date
SELECT date('now');
-- Get just the time
SELECT time('now');
Read more about the date and time functions here.
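Because the Reddit posts in the lab store created_utc as an INTEGER Unix timestamp, you can let SQLite convert it to a readable date at query time. A minimal sketch, assuming pandas, the engine, and the posts table defined in the next section:

# Convert Unix timestamps into readable dates inside the query itself
query = """
SELECT
    title,
    datetime(created_utc, 'unixepoch') AS created_at
FROM posts
LIMIT 5;
"""
df_recent = pd.read_sql(query, engine)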
3. Designing and Building Your Database
The Workflow: CREATE then INSERT
Always create your tables with a defined schema before you insert any data. Here is an example using the Reddit data from our lab.
import sqlite3
import pandas as pd  # used by the .to_sql() and pd.read_sql() examples below
from sqlalchemy import create_engine, text

# Define the location for our database file
db_filepath = 'political_discourse.db'

# Create a database connection engine
engine = create_engine(f'sqlite:///{db_filepath}')

# Define the schemas for our tables with proper data types and keys
posts_schema = """
CREATE TABLE IF NOT EXISTS posts (
    post_id VARCHAR(10) PRIMARY KEY,
    subreddit VARCHAR(50),
    title VARCHAR(300),
    author VARCHAR(50),
    created_utc INTEGER
);
"""

comments_schema = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    comment_body TEXT,
    comment_author VARCHAR(50),
    score INTEGER,
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
"""

# Use the connection to execute the SQL statements
with engine.connect() as conn:
    conn.execute(text("DROP TABLE IF EXISTS comments;"))  # Start fresh
    conn.execute(text("DROP TABLE IF EXISTS posts;"))     # Start fresh
    conn.execute(text(posts_schema))
    conn.execute(text(comments_schema))
    conn.commit()  # Persist the DDL (needed with SQLAlchemy 2.x)
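To double-check that the tables were created as intended, you can ask SQLAlchemy's inspector. This is a quick sanity check, not part of the lab workflow:

from sqlalchemy import inspect

# List the tables that now exist in the database file
inspector = inspect(engine)
print(inspector.get_table_names())  # should include 'posts' and 'comments'

# Peek at the columns and types of one table
for column in inspector.get_columns('posts'):
    print(column['name'], column['type'])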
Populating Your Tables with pandas
Once the tables exist, you can use the DataFrame .to_sql() method to insert the data from your DataFrames.
Crucially, use if_exists='append' to add data to your existing, well-defined tables. Using 'replace' would cause pandas to guess the data types, ignoring your carefully crafted schema.
# Assume df_all_posts and df_all_comments are your prepared pandas DataFrames

# Store data in the 'posts' table
df_all_posts.to_sql(
    'posts',
    engine,
    if_exists='append',  # Use 'append' to respect the existing schema
    index=False
)

# Store data in the 'comments' table
df_all_comments.to_sql(
    'comments',
    engine,
    if_exists='append',
    index=False
)
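After inserting, it is worth confirming that the row counts in the database match your DataFrames. A quick check along these lines should work:

# Sanity check: row counts in the database should match the DataFrames
n_posts = pd.read_sql('SELECT COUNT(*) AS n FROM posts', engine)['n'].iloc[0]
n_comments = pd.read_sql('SELECT COUNT(*) AS n FROM comments', engine)['n'].iloc[0]

print(n_posts == len(df_all_posts))        # True if every post was inserted
print(n_comments == len(df_all_comments))  # True if every comment was inserted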
4. Querying Your Data with SQL
Basic Queries with pd.read_sql()
You can execute any SQL query and load the results directly into a DataFrame.
# Read the entire 'posts' table
df = pd.read_sql('SELECT * FROM posts', engine)

# Select specific columns from the 'comments' table
df = pd.read_sql('SELECT comment_author, score FROM comments', engine)

# Filter for highly-upvoted comments
df = pd.read_sql(
    "SELECT * FROM comments WHERE score > 100",
    engine
)
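If the filter value lives in a Python variable, it is safer to pass it as a bound parameter than to paste it into the SQL string. One way to do this with a SQLAlchemy engine (the variable name min_score is just an example):

from sqlalchemy import text

# Pass the threshold as a bound parameter rather than formatting it into the string
min_score = 100
df = pd.read_sql(
    text("SELECT * FROM comments WHERE score > :min_score"),
    engine,
    params={"min_score": min_score}
)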
Joining Tables to Create Insights
This is where the power of relational databases shines. You can combine data from multiple tables in a single, efficient query.
# SQL JOIN to get posts and their comment counts
query = """
SELECT
    p.subreddit,
    p.title,
    p.author,
    COUNT(c.comment_id) as num_comments
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.post_id
ORDER BY
    num_comments DESC;
"""

# Execute the query and get the result as a DataFrame
df_joined = pd.read_sql(query, engine)
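For comparison, you could get the same counts in pandas after loading both tables, but the SQL version lets the database do the combining before anything reaches memory. A rough pandas equivalent:

# Rough pandas equivalent of the JOIN above (loads both tables into memory first)
posts = pd.read_sql('SELECT * FROM posts', engine)
comments = pd.read_sql('SELECT * FROM comments', engine)

df_counts = (
    posts.merge(comments, on='post_id', how='left')
         .groupby(['post_id', 'subreddit', 'title', 'author'], as_index=False)['comment_id']
         .count()
         .rename(columns={'comment_id': 'num_comments'})
         .sort_values('num_comments', ascending=False)
)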
Aggregating Data
SQL is excellent for calculating summary statistics.
# Get engagement metrics per subreddit
query = """
SELECT
    p.subreddit,
    COUNT(DISTINCT p.post_id) as num_posts,
    COUNT(c.comment_id) as num_comments,
    AVG(c.score) as avg_comment_score
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.subreddit
ORDER BY
    num_comments DESC;
"""

df_metrics = pd.read_sql(query, engine)
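If you only care about subreddits above a minimum level of activity, aggregated groups can be filtered with a HAVING clause. A small extension of the query above (the threshold of 50 comments is arbitrary):

# Keep only subreddits with at least 50 comments (arbitrary threshold)
query = """
SELECT
    p.subreddit,
    COUNT(c.comment_id) as num_comments,
    AVG(c.score) as avg_comment_score
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.subreddit
HAVING
    COUNT(c.comment_id) >= 50
ORDER BY
    num_comments DESC;
"""

df_active = pd.read_sql(query, engine)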
5. Best Practices
- Design First: Always design your schema before writing code to insert data. Think about your tables, columns, data types, and the relationships between them.
- Data Integrity is Key:
  - Use a PRIMARY KEY to uniquely identify each row.
  - Use a FOREIGN KEY to enforce relationships between tables.
  - Use NOT NULL constraints on columns that must have a value.
- Safe Data Storage:
  - Always use if_exists='append' with a pre-defined schema unless you have a very good reason to overwrite everything.
  - Validate your DataFrame's columns and dtypes before trying to insert them.
- Version Control for Databases:
  - NEVER commit large database files (.db, .sqlite) to Git. GitHub has a file size limit (usually 100MB) and is not designed for versioning large binary files.
  - Add *.db and *.sqlite to your .gitignore file.
  - If you need to share a database, upload it to a cloud storage service (like Google Drive, S3, or Dropbox) and include the download link in your project's README.md.
- Backup Your Work:
  - Databases are just files. You can copy them to create backups.
import shutil
from datetime import datetime

# Create a dated copy of the database file as a backup
backup_name = f"database_backup_{datetime.now().strftime('%Y%m%d')}.db"
shutil.copy2('political_discourse.db', backup_name)