πŸ’» Week 02, Day 04 - Lab

Creating a Database and designing exploratory queries

Author

Dr Jon Cardoso-Silva

Last updated

24 July 2025

🎯 Lab Objectives

In this lab, you will go beyond practising a single technique and conduct a small-scale analysis. You will:

  1. Choose and collect post and comment data from two contrasting communities of your choice.
  2. Combine the datasets using pandas.merge().
  3. Design and create a new SQLite database to properly store the relational data.
  4. Populate your database with the data you collected.
  5. Write SQL queries to compare engagement patterns and discover insights across the two communities.

Introduction: From Technical Practice to Real Analysis

In the lecture, we walked through the mechanics of storing and querying relational data. Now, you’ll apply that workflow to answer a research question of your choosing: How does online discourse differ between two contrasting communities?

Your mission is to select two communities that interest you, collect their data, structure it in a database, and perform queries to discover patterns and differences.

Choose Your Research Subject

You have complete freedom to choose which communities to analyse. Here are some successful approaches from previous students:

Topic-Based Contrasts:

  • Gaming: r/NintendoSwitch vs r/pcmasterrace
  • Finance: r/investing vs r/wallstreetbets
  • Education: r/AskAcademia vs r/careeradvice
  • Health: r/fitness vs r/bodyweightfitness

Ideological/Political Contrasts:

  • r/democrats vs r/Conservative
  • r/climate vs r/ClimateSkeptics
  • r/Futurology vs r/collapse

Size/Style Contrasts:

  • r/movies (large community) vs r/TrueFilm (smaller, focused)
  • r/science vs r/EverythingScience

πŸ“– Pro Tip: This lab can serve as an excellent foundation for your final project! Choose communities that genuinely interest you and could support a deeper analysis later in the course.

Setup

  1. Make sure you are working inside your me204-study-notes repository.
  2. Create a new folder for this week’s work, for example: W02D04/.
  3. Create a new Jupyter Notebook (e.g., W02D04-lab.ipynb). This is where you will build your analysis.

Lab Tasks

Follow these steps in your notebook. Use Markdown cells to document your process and explain your findings.

πŸ“ Task 1: Data Collection

  • Choose and collect data from two contrasting subreddits.

    Select two communities that offer an interesting comparison for your research question. Then, using the pagination techniques from yesterday’s lab, collect the top 50-100 posts (and their comments) from each subreddit. Focus on posts from the last month to keep the data current. Save the data for each subreddit into four separate files (or, better yet, produce just two files: one for posts and one for comments):

    • [subreddit1]_posts.csv
    • [subreddit1]_comments.csv
    • [subreddit2]_posts.csv
    • [subreddit2]_comments.csv

    Self-Correction: Make sure the post and comment DataFrames for each subreddit share a common post_id column!
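A quick sanity check like the one below can catch problems early. This is a minimal sketch using made-up mini-DataFrames in place of your real collected data; the column names (post_id, title, body) are assumptions you should adapt to your own files.

```python
import pandas as pd

# Hypothetical stand-ins for the data you collected.
df_posts = pd.DataFrame({
    "post_id": ["abc1", "abc2"],
    "title": ["First post", "Second post"],
})
df_comments = pd.DataFrame({
    "post_id": ["abc1", "abc1", "abc2"],
    "body": ["Nice!", "Agreed.", "Interesting."],
})

# Both frames must share the post_id column, and every comment
# should point to a post we actually collected.
assert "post_id" in df_posts.columns and "post_id" in df_comments.columns
orphans = set(df_comments["post_id"]) - set(df_posts["post_id"])
assert not orphans, f"Comments reference unknown posts: {orphans}"
```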

πŸ“ Task 2: Consolidate and Merge

  • Load and combine your data.

    Read your four CSV files into separate pandas DataFrames. Add a subreddit column to each DataFrame before you combine them, so you can tell which data came from where. Combine the two post DataFrames into a single df_all_posts. Combine the comment DataFrames into a single df_all_comments.

  • Merge into a final analysis frame.

    Use pd.merge() to join df_all_posts and df_all_comments into one master DataFrame called df_merged.
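The combine-then-merge workflow might look like the sketch below. The tiny DataFrames and their column names are placeholders for your four CSV files; note how the suffixes argument disambiguates columns (like subreddit) that appear in both frames.

```python
import pandas as pd

# Toy stand-ins for the four CSV files (column names are assumptions).
posts_a = pd.DataFrame({"post_id": ["a1"], "title": ["Hello"]})
posts_b = pd.DataFrame({"post_id": ["b1"], "title": ["World"]})
comments_a = pd.DataFrame({"post_id": ["a1"], "body": ["Hi!"]})
comments_b = pd.DataFrame({"post_id": ["b1"], "body": ["Nice."]})

# Tag each frame with its subreddit before combining.
posts_a["subreddit"] = "subreddit1"
posts_b["subreddit"] = "subreddit2"
comments_a["subreddit"] = "subreddit1"
comments_b["subreddit"] = "subreddit2"

df_all_posts = pd.concat([posts_a, posts_b], ignore_index=True)
df_all_comments = pd.concat([comments_a, comments_b], ignore_index=True)

# One row per comment, with the parent post's columns attached.
df_merged = pd.merge(
    df_all_posts, df_all_comments,
    on="post_id", how="inner",
    suffixes=("_post", "_comment"),
)
```

An inner join keeps only comments whose parent post is present; if you want to keep posts with zero comments in df_merged, use how="left" instead.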

πŸ“ Task 3: Design and Build Your Database

  • Create a database connection.

    Use sqlalchemy.create_engine() to create a connection to a new database file. Name it meaningfully (e.g., gaming_communities.db, financial_discourse.db).
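For SQLite, the connection string needs three slashes before a relative file path. The filename below is a placeholder; use one that reflects your topic.

```python
from sqlalchemy import create_engine

# "my_communities.db" is a placeholder name; the file is created
# in your current working directory on first use.
engine = create_engine("sqlite:///my_communities.db")
```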

  • Design and create your tables with SQL.

    Write CREATE TABLE statements for a posts table and a comments table. As you write them, think carefully about the data types. Should score be an INTEGER? Should title be VARCHAR(300) or TEXT?

    Crucially, define the PRIMARY KEY for each table and the FOREIGN KEY relationship between them. Execute the SQL to create the tables.
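One possible design is sketched below, using an in-memory database for illustration (swap in your file-based URL). The column names and types are assumptions; adapt them to the fields you actually collected.

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # swap in your file-based URL

# Column names and types here are assumptions; adapt to your data.
create_posts = """
CREATE TABLE IF NOT EXISTS posts (
    post_id      TEXT PRIMARY KEY,
    subreddit    TEXT NOT NULL,
    title        TEXT,
    score        INTEGER,
    num_comments INTEGER
);
"""
create_comments = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id TEXT PRIMARY KEY,
    post_id    TEXT,
    subreddit  TEXT,
    body       TEXT,
    score      INTEGER,
    FOREIGN KEY (post_id) REFERENCES posts (post_id)
);
"""

# engine.begin() opens a transaction and commits it on exit.
with engine.begin() as conn:
    conn.execute(text(create_posts))
    conn.execute(text(create_comments))
```

The FOREIGN KEY clause documents that every comment belongs to exactly one post, which is the one-to-many relationship your merge exploited earlier.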

πŸ“ Task 4: Populate and Query

  • Insert your data.

    Use the to_sql() method to insert the data from your consolidated df_all_posts and df_all_comments DataFrames into your new database tables. Remember to set if_exists='append' to avoid overwriting the existing table definitions.
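A minimal sketch of the insertion step is below, again using an in-memory database and made-up rows; the table schema and column names are assumptions matching the earlier design.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # swap in your file-based URL

# A cut-down posts table for illustration (columns are assumptions).
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE posts (post_id TEXT PRIMARY KEY, subreddit TEXT, score INTEGER)"
    ))

df_all_posts = pd.DataFrame({
    "post_id": ["a1", "b1"],
    "subreddit": ["subreddit1", "subreddit2"],
    "score": [10, 25],
})

# if_exists='append' keeps the table (and keys) you defined with SQL;
# index=False stops pandas from writing its row index as an extra column.
df_all_posts.to_sql("posts", engine, if_exists="append", index=False)
```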

  • Query your database to compare the communities.

    • Write a SQL query to calculate the average number of comments per post for each subreddit. Which community is more talkative?

    • Write a SQL query to find the average score of posts in each subreddit. Which community’s posts are more highly upvoted on average?

    • Use pd.read_sql() to run these queries and present your findings in your notebook.
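The two comparison queries could be written as below. The tiny dataset is illustrative only; real numbers come from your own database. Note the `* 1.0` trick, which forces SQLite to do floating-point rather than integer division.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # swap in your file-based URL

# Tiny illustrative dataset (subreddit names and values are made up).
posts = pd.DataFrame({
    "post_id": ["a1", "a2", "b1"],
    "subreddit": ["sub_one", "sub_one", "sub_two"],
    "score": [10, 30, 100],
})
comments = pd.DataFrame({
    "post_id": ["a1", "a1", "a2", "b1"],
    "subreddit": ["sub_one"] * 3 + ["sub_two"],
})
posts.to_sql("posts", engine, index=False)
comments.to_sql("comments", engine, index=False)

# Average number of comments per post, per subreddit.
avg_comments = pd.read_sql("""
    SELECT p.subreddit,
           COUNT(c.post_id) * 1.0 / COUNT(DISTINCT p.post_id) AS avg_comments_per_post
    FROM posts p
    LEFT JOIN comments c ON c.post_id = p.post_id
    GROUP BY p.subreddit
""", engine)

# Average post score, per subreddit.
avg_score = pd.read_sql(
    "SELECT subreddit, AVG(score) AS avg_score FROM posts GROUP BY subreddit",
    engine,
)
```

The LEFT JOIN ensures subreddits whose posts received no comments still appear in the result with an average of zero.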

πŸ† Bonus Challenge: Discover Deeper Patterns

Expand your analysis with custom SQL (or pandas) queries tailored to your topic, exploring patterns specific to your chosen communities.

  • If you went for a topic-based contrast, can you identify common themes or keywords that appear in both communities? How does their reception differ?

  • If you went for an ideological contrast, are there news articles or events discussed in both communities? Compare their scores and comment engagement.

  • If you went for a size contrast, how does post complexity (title length, comment depth) vary between large and small communities?
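As one example of a bonus query, title length is easy to measure directly in SQL with SQLite's LENGTH() function. The rows below are invented for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # swap in your file-based URL

# Made-up rows; subreddit names and titles are placeholders.
pd.DataFrame({
    "subreddit": ["big_sub", "big_sub", "small_sub"],
    "title": ["short", "also short", "A much longer, more deliberate title"],
}).to_sql("posts", engine, index=False)

# LENGTH() counts characters in a TEXT value.
title_length = pd.read_sql(
    "SELECT subreddit, AVG(LENGTH(title)) AS avg_title_length "
    "FROM posts GROUP BY subreddit",
    engine,
)
```

The same pattern (an aggregate over a per-row expression, grouped by subreddit) works for many of the bonus questions, e.g. swapping LENGTH(title) for a score or comment-count column.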

πŸŽ“ Reflection Questions

In a final Markdown cell, reflect on:

  1. What was the most surprising difference you discovered between your chosen communities?
  2. How did using a database compare to working with CSV files and pandas alone?
  3. If you were to expand this analysis for a final project, what additional data would you want to collect?