Week 02, Day 04 - Lab
Creating a Database and designing exploratory queries
In this lab, you will go beyond just practicing a technique and conduct a small-scale analysis. You will:
1. Choose and collect post and comment data from two contrasting communities of your choice.
2. Combine the datasets using `pandas.merge()`.
3. Design and create a new SQLite database to properly store the relational data.
4. Populate your database with the data you collected.
5. Write SQL queries to compare engagement patterns and discover insights between the two communities.
Introduction: From Technical Practice to Real Analysis
In the lecture, we walked through the mechanics of storing and querying relational data. Now, you'll apply that workflow to answer a research question of your choosing: How does online discourse differ between two contrasting communities?
Your mission is to select two communities that interest you, collect their data, structure it in a database, and perform queries to discover patterns and differences.
Choose Your Research Subject
You have complete freedom to choose which communities to analyse. Here are some successful approaches from previous students:
Topic-Based Contrasts:
- Gaming: r/NintendoSwitch vs r/pcmasterrace
- Finance: r/investing vs r/wallstreetbets
- Education: r/AskAcademia vs r/careeradvice
- Health: r/fitness vs r/bodyweightfitness
Ideological/Political Contrasts:
- r/democrats vs r/Conservative
- r/climate vs r/ClimateSkeptics
- r/Futurology vs r/collapse
Size/Style Contrasts:
- r/movies (large community) vs r/TrueFilm (smaller, focused)
- r/science vs r/EverythingScience
Pro Tip: This lab can serve as an excellent foundation for your final project! Choose communities that genuinely interest you and could support a deeper analysis later in the course.
Setup
- Make sure you are working inside your `me204-study-notes` repository.
- Create a new folder for this week's work, for example: `W02D04/`.
- Create a new Jupyter Notebook (e.g., `W02D04-lab.ipynb`). This is where you will build your analysis.
Lab Tasks
Follow these steps in your notebook. Use Markdown cells to document your process and explain your findings.
Task 1: Data Collection
Choose and collect data from two contrasting subreddits.
Select two communities that offer an interesting comparison for your research question. Then, using the pagination techniques from yesterday's lab, collect the top 50-100 posts (and their comments) from both subreddits. Focus on posts from the last month to keep the data current. Store the data in four separate files, two per subreddit (or, better yet, produce just two files straight away: one for posts and one for comments):
- `[subreddit1]_posts.csv`
- `[subreddit1]_comments.csv`
- `[subreddit2]_posts.csv`
- `[subreddit2]_comments.csv`
Self-Correction: Make sure your post and comment dataframes for each subreddit have a common `post_id` column!
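If you want a quick sanity check before moving on, the sketch below verifies that posts and comments can be linked. It assumes the file names above (with `subreddit1` replaced by your actual subreddit) and a `post_id` column in both files:

```python
import pandas as pd

# Placeholder file names -- substitute your own subreddit.
posts = pd.read_csv("subreddit1_posts.csv")
comments = pd.read_csv("subreddit1_comments.csv")

# Every comment should point at a post you actually collected.
shared = comments["post_id"].isin(posts["post_id"])
print(f"{shared.mean():.0%} of comments match a collected post")
```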
Task 2: Consolidate and Merge
Load and combine your data.
Read your four CSV files into separate pandas DataFrames. Add a `subreddit` column to each DataFrame before you combine them, so you can tell which data came from where. Combine the two post DataFrames into a single `df_all_posts`. Combine the comment DataFrames into a single `df_all_comments`.
Merge into a final analysis frame.
Use `pd.merge()` to join `df_all_posts` and `df_all_comments` into one master DataFrame called `df_merged`.
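If you are unsure where to start, here is a minimal sketch of this step. It assumes the placeholder file names from Task 1 and a shared `post_id` column; adapt the names to your own data:

```python
import pandas as pd

# Load the four CSV files (placeholder names -- use your own).
posts_1 = pd.read_csv("subreddit1_posts.csv")
posts_2 = pd.read_csv("subreddit2_posts.csv")
comments_1 = pd.read_csv("subreddit1_comments.csv")
comments_2 = pd.read_csv("subreddit2_comments.csv")

# Tag each DataFrame with its source community before combining.
posts_1["subreddit"], posts_2["subreddit"] = "subreddit1", "subreddit2"
comments_1["subreddit"], comments_2["subreddit"] = "subreddit1", "subreddit2"

# Stack the two communities into single posts/comments frames.
df_all_posts = pd.concat([posts_1, posts_2], ignore_index=True)
df_all_comments = pd.concat([comments_1, comments_2], ignore_index=True)

# Join every comment to its parent post via the shared post_id.
# Suffixes keep overlapping column names (e.g. score) distinguishable.
df_merged = pd.merge(
    df_all_comments,
    df_all_posts,
    on="post_id",
    how="left",
    suffixes=("_comment", "_post"),
)
```

This keeps every comment and drops posts with no comments; merge in the other direction (posts on the left) if you also want to keep comment-less posts.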
Task 3: Design and Build Your Database
Create a database connection.
Use `sqlalchemy.create_engine()` to create a connection to a new database file. Name it meaningfully (e.g., `gaming_communities.db`, `financial_discourse.db`).
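A minimal sketch of the connection step, assuming a hypothetical database file named `gaming_communities.db` inside your `W02D04/` folder:

```python
from sqlalchemy import create_engine

# SQLite creates the file on first use if it does not exist yet.
engine = create_engine("sqlite:///gaming_communities.db")
```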
Design and create your tables with SQL.
Write `CREATE TABLE` statements for a `posts` table and a `comments` table. Then think carefully about the data types. Should `score` be an `INTEGER`? Should `title` be `VARCHAR(300)` or `TEXT`?
Crucially, define the `PRIMARY KEY` for each table and the `FOREIGN KEY` relationship between them. Execute the SQL to create the tables.
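As a starting point, here is a hedged sketch of what the two table definitions might look like. Only `post_id` is required by the tasks above; the other columns are illustrative and should match what you actually collected:

```python
from sqlalchemy import create_engine, text

# Same engine as the previous step (re-created here for completeness).
engine = create_engine("sqlite:///gaming_communities.db")

create_posts = text("""
CREATE TABLE IF NOT EXISTS posts (
    post_id      TEXT PRIMARY KEY,
    subreddit    TEXT,
    title        TEXT,
    score        INTEGER,
    created_utc  REAL
);
""")

create_comments = text("""
CREATE TABLE IF NOT EXISTS comments (
    comment_id  TEXT PRIMARY KEY,
    post_id     TEXT,
    subreddit   TEXT,
    body        TEXT,
    score       INTEGER,
    FOREIGN KEY (post_id) REFERENCES posts (post_id)
);
""")

# engine.begin() opens a transaction and commits it on exit.
with engine.begin() as conn:
    conn.execute(create_posts)
    conn.execute(create_comments)
```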
Task 4: Populate and Query
Insert your data.
Use the `to_sql()` method to insert the data from your consolidated `df_all_posts` and `df_all_comments` DataFrames into your new database tables. Remember to set `if_exists='append'` to avoid overwriting the existing table definitions.
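A minimal sketch, reusing the `engine` from Task 3 and the DataFrames from Task 2, and assuming their columns line up with the table definitions:

```python
# index=False keeps pandas' row index out of the tables;
# if_exists='append' adds rows to the tables created above
# instead of replacing their definitions.
df_all_posts.to_sql("posts", con=engine, if_exists="append", index=False)
df_all_comments.to_sql("comments", con=engine, if_exists="append", index=False)
```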
Query your database to compare the communities.
Write a SQL query to calculate the average number of comments per post for each subreddit. Which community is more talkative?
Write a SQL query to find the average score of posts in each subreddit. Which community's posts are more highly upvoted on average?
Use `pd.read_sql()` to run these queries and present your findings in your notebook.
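For example, the two comparisons could be sketched as follows, using the `engine` from Task 3 and assuming the illustrative schema above (adjust the column names to your own):

```python
import pandas as pd

# Average number of comments per post, per subreddit.
avg_comments_query = """
SELECT p.subreddit,
       COUNT(c.comment_id) * 1.0 / COUNT(DISTINCT p.post_id) AS avg_comments_per_post
FROM posts AS p
LEFT JOIN comments AS c ON c.post_id = p.post_id
GROUP BY p.subreddit;
"""

# Average post score, per subreddit.
avg_score_query = """
SELECT subreddit, AVG(score) AS avg_post_score
FROM posts
GROUP BY subreddit;
"""

print(pd.read_sql(avg_comments_query, con=engine))
print(pd.read_sql(avg_score_query, con=engine))
```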
Bonus Challenge: Discover Deeper Patterns
Expand your analysis by writing custom SQL queries (or pandas queries) tailored to your topic, exploring patterns specific to your chosen communities:
- If you went for a topic-based contrast, can you identify common themes or keywords that appear in both communities? How does their reception differ?
- If you went for an ideological contrast, are there news articles or events discussed in both communities? Compare their scores and comment engagement.
- If you went for a size contrast, how does post complexity (title length, comment depth) vary between large and small communities?
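As one illustration (here for the size/style angle), the sketch below compares average title length per subreddit. It reuses `pd` and the `engine` from earlier, assumes the illustrative `posts` schema from Task 3, and is only a starting point for queries of your own:

```python
title_length_query = """
SELECT subreddit,
       AVG(LENGTH(title)) AS avg_title_length
FROM posts
GROUP BY subreddit;
"""

pd.read_sql(title_length_query, con=engine)
```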
Reflection Questions
In a final Markdown cell, reflect on:
- What was the most surprising difference you discovered between your chosen communities?
- How did using a database compare to working with CSV files and pandas alone?
- If you were to expand this analysis for a final project, what additional data would you want to collect?