Databases & SQL Cookbook
A practical guide to storing and connecting DataFrames
Use this guide when you need to store pandas DataFrames in a database, or if you have access to a database and need to get data from it into a pandas DataFrame.
1. Why Use Databases?
Think of a database as a collection of interconnected DataFrames with superpowers:
- Efficient Storage: Only load the data you need.
- Data Relationships: Connect related DataFrames using common keys.
- Data Integrity: Enforce rules to ensure data consistency.
- Query Optimization: Efficiently filter and aggregate huge datasets.
- Concurrent Access: Allow multiple users or scripts to access the data simultaneously.
In this course, we are moving up the ladder of abstraction: from raw text files (like CSVs or JSON) to a structured, robust, and scalable way of managing data.
2. Data Types in SQLite
Before you can store data, you must design your database schema. This is the blueprint for how your data will be organised. A key part of this is choosing the right data type for each column.
Core Data Types
SQLite supports these basic types:
TEXT    -- For strings of any length
INTEGER -- For whole numbers
REAL    -- For decimal (floating-point) numbers
BLOB    -- For binary data (e.g., images, rarely used in this course)
NULL    -- Represents a missing value
You can read more about SQLite data types here.
Precise Data Types
You can also be more specific to save space and ensure data integrity.
For integers, you can specify the number of bytes, just as we discussed in Week 1:
INTEGER(1) -- 1-byte integer (-128 to 127)
INTEGER(2) -- 2-byte integer (-32,768 to 32,767)
INTEGER(4) -- 4-byte integer (~ -2.1b to 2.1b)
INTEGER(8) -- 8-byte integer (very, very large)
For text with a maximum length, you can use VARCHAR (variable character):
VARCHAR(10) -- Stores up to 10 characters
VARCHAR(300) -- Stores up to 300 characters
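For example, a compact schema might combine these precise types. The table and column names below are invented purely for illustration:

# A hypothetical lookup table using precise types; names are invented for this example
countries_schema = """
CREATE TABLE IF NOT EXISTS countries (
    country_code VARCHAR(2) PRIMARY KEY,  -- e.g. 'UK', 'DE'
    name VARCHAR(60),                     -- up to 60 characters
    population INTEGER(8)                 -- large whole number
);
"""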
Date and Time
SQLite doesn't have a dedicated DATETIME type, but it provides powerful functions to handle date and time information, which is usually stored as TEXT, REAL, or INTEGER.
-- Get the current date and time
SELECT datetime('now');
-- Get just the date
SELECT date('now');
-- Get just the time
SELECT time('now');
Read more about the date and time functions here.
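Because the Reddit posts in the lab store created_utc as an INTEGER Unix timestamp, you can let SQLite convert it to a readable date at query time. A minimal sketch, assuming pandas, the engine, and the posts table defined in the next section:

# Convert Unix timestamps into readable dates inside the query itself
query = """
SELECT
    title,
    datetime(created_utc, 'unixepoch') AS created_at
FROM posts
LIMIT 5;
"""
df_recent = pd.read_sql(query, engine)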
3. Designing and Building Your Database
The Workflow: CREATE then INSERT
Always create your tables with a defined schema before you insert any data. Here is an example using the Reddit data from our lab.
import sqlite3
import pandas as pd  # used by the .to_sql() and pd.read_sql() examples below
from sqlalchemy import create_engine, text

# Define the location for our database file
db_filepath = 'political_discourse.db'

# Create a database connection engine
engine = create_engine(f'sqlite:///{db_filepath}')

# Define the schemas for our tables with proper data types and keys
posts_schema = """
CREATE TABLE IF NOT EXISTS posts (
    post_id VARCHAR(10) PRIMARY KEY,
    subreddit VARCHAR(50),
    title VARCHAR(300),
    author VARCHAR(50),
    created_utc INTEGER
);
"""

comments_schema = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id VARCHAR(10) PRIMARY KEY,
    post_id VARCHAR(10),
    comment_body TEXT,
    comment_author VARCHAR(50),
    score INTEGER,
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
"""

# Use the connection to execute the SQL statements
with engine.connect() as conn:
    conn.execute(text("DROP TABLE IF EXISTS comments;"))  # Start fresh
    conn.execute(text("DROP TABLE IF EXISTS posts;"))     # Start fresh
    conn.execute(text(posts_schema))
    conn.execute(text(comments_schema))
    conn.commit()  # Persist the DDL (needed with SQLAlchemy 2.x)
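To double-check that the tables were created as intended, you can ask SQLAlchemy's inspector. This is a quick sanity check, not part of the lab workflow:

from sqlalchemy import inspect

# List the tables that now exist in the database file
inspector = inspect(engine)
print(inspector.get_table_names())  # should include 'posts' and 'comments'

# Peek at the columns and types of one table
for column in inspector.get_columns('posts'):
    print(column['name'], column['type'])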
Populating Your Tables with pandas
Once the tables exist, you can use the DataFrame .to_sql() method to insert the data from your DataFrames.
Crucially, use if_exists='append' to add data to your existing, well-defined tables. Using 'replace' would cause pandas to guess the data types, ignoring your carefully crafted schema.
# Assume df_all_posts and df_all_comments are your prepared pandas DataFrames

# Store data in the 'posts' table
df_all_posts.to_sql(
    'posts',
    engine,
    if_exists='append',  # Use 'append' to respect the existing schema
    index=False
)

# Store data in the 'comments' table
df_all_comments.to_sql(
    'comments',
    engine,
    if_exists='append',
    index=False
)
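After inserting, it is worth confirming that the row counts in the database match your DataFrames. A quick check along these lines should work:

# Sanity check: row counts in the database should match the DataFrames
n_posts = pd.read_sql('SELECT COUNT(*) AS n FROM posts', engine)['n'].iloc[0]
n_comments = pd.read_sql('SELECT COUNT(*) AS n FROM comments', engine)['n'].iloc[0]

print(n_posts == len(df_all_posts))        # True if every post was inserted
print(n_comments == len(df_all_comments))  # True if every comment was inserted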
4. Querying Your Data with SQL
Basic Queries with pd.read_sql()
You can execute any SQL query and load the results directly into a DataFrame.
# Read the entire 'posts' table
df = pd.read_sql('SELECT * FROM posts', engine)

# Select specific columns from the 'comments' table
df = pd.read_sql('SELECT comment_author, score FROM comments', engine)

# Filter for highly-upvoted comments
df = pd.read_sql(
    "SELECT * FROM comments WHERE score > 100",
    engine
)
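If the filter value lives in a Python variable, it is safer to pass it as a bound parameter than to paste it into the SQL string. One way to do this with a SQLAlchemy engine (the variable name min_score is just an example):

from sqlalchemy import text

# Pass the threshold as a bound parameter rather than formatting it into the string
min_score = 100
df = pd.read_sql(
    text("SELECT * FROM comments WHERE score > :min_score"),
    engine,
    params={"min_score": min_score}
)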
Joining Tables to Create Insights
This is where the power of relational databases shines. You can combine data from multiple tables in a single, efficient query.
# SQL JOIN to get posts and their comment counts
query = """
SELECT
    p.subreddit,
    p.title,
    p.author,
    COUNT(c.comment_id) as num_comments
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.post_id
ORDER BY
    num_comments DESC;
"""

# Execute the query and get the result as a DataFrame
df_joined = pd.read_sql(query, engine)
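For comparison, you could get the same counts in pandas after loading both tables, but the SQL version lets the database do the combining before anything reaches memory. A rough pandas equivalent:

# Rough pandas equivalent of the JOIN above (loads both tables into memory first)
posts = pd.read_sql('SELECT * FROM posts', engine)
comments = pd.read_sql('SELECT * FROM comments', engine)

df_counts = (
    posts.merge(comments, on='post_id', how='left')
         .groupby(['post_id', 'subreddit', 'title', 'author'], as_index=False)['comment_id']
         .count()
         .rename(columns={'comment_id': 'num_comments'})
         .sort_values('num_comments', ascending=False)
)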
Aggregating Data
SQL is excellent for calculating summary statistics.
# Get engagement metrics per subreddit
query = """
SELECT
    p.subreddit,
    COUNT(DISTINCT p.post_id) as num_posts,
    COUNT(c.comment_id) as num_comments,
    AVG(c.score) as avg_comment_score
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.subreddit
ORDER BY
    num_comments DESC;
"""

df_metrics = pd.read_sql(query, engine)
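If you only care about subreddits above a minimum level of activity, aggregated groups can be filtered with a HAVING clause. A small extension of the query above (the threshold of 50 comments is arbitrary):

# Keep only subreddits with at least 50 comments (arbitrary threshold)
query = """
SELECT
    p.subreddit,
    COUNT(c.comment_id) as num_comments,
    AVG(c.score) as avg_comment_score
FROM
    posts AS p
LEFT JOIN
    comments AS c ON p.post_id = c.post_id
GROUP BY
    p.subreddit
HAVING
    COUNT(c.comment_id) >= 50
ORDER BY
    num_comments DESC;
"""

df_active = pd.read_sql(query, engine)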
5. Best Practices
- Design First: Always design your schema before writing code to insert data. Think about your tables, columns, data types, and the relationships between them.
- Data Integrity is Key:
  - Use a PRIMARY KEY to uniquely identify each row.
  - Use a FOREIGN KEY to enforce relationships between tables.
  - Use NOT NULL constraints on columns that must have a value.
- Safe Data Storage:
  - Always use if_exists='append' with a pre-defined schema unless you have a very good reason to overwrite everything.
  - Validate your DataFrame's columns and dtypes before trying to insert them.
- Version Control for Databases:
  - NEVER commit large database files (.db, .sqlite) to Git. GitHub has a file size limit (usually 100MB) and is not designed for versioning large binary files.
  - Add *.db and *.sqlite to your .gitignore file.
  - If you need to share a database, upload it to a cloud storage service (like Google Drive, S3, or Dropbox) and include the download link in your project's README.md.
- Backup Your Work:
  - Databases are just files. You can copy them to create backups.
import shutil
from datetime import datetime

# Create a dated copy of the database file as a backup
backup_name = f"database_backup_{datetime.now().strftime('%Y%m%d')}.db"
shutil.copy2('political_discourse.db', backup_name)