ME204 – Data Engineering for the Social World
30 July 2025
We want you to practise communicating data-driven insights to different audiences. In your Final Project, you will need to think of three hypothetical audiences for your project:
- README.md: Technical Colleagues. Other data scientists who might want to reproduce your work.
- notebooks/: Data Analysts. Technical professionals who understand Python, pandas, and SQL.
- docs/index.md: General Public. Educated readers without any particular technical background.
Today's Focus: Building Your Website (docs/index.md, for the General Public)
Learn to create beautiful, accessible data stories that anyone can understand and engage with.
GitHub Pages is a static site hosting service that takes files from a GitHub repository, runs them through a build process, and publishes a website.
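Key Benefits:
- You get a public URL for your project (username.github.io/repository-name)

Enable GitHub Pages:
- In your repository's Settings, open the Pages section
- Point to the docs/ folder: select the main branch and the /docs folder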
Does your project structure look like this?
- Create a docs/ folder
- Add an index.md file to the docs/ folder
Note: The docs/index.md file can be empty for now. We'll add content to it in the next step.
Your Project Structure:
your-me204-project/
├── data/
│   ├── reddit.db
│   └── raw/
│       ├── comments.json
│       ├── posts.csv
│       └── subreddits.json
├── docs/
│   └── index.md                      ← ADD THIS FILE
├── .gitignore                        ← Ignore files
├── notebooks/
│   ├── NB01-data-collection.ipynb    ← Saves JSON/CSV files
│   ├── NB02-data-processing.ipynb    ← Creates SQLite database
│   └── NB03-analysis.ipynb           ← Exploratory analysis
├── README.md                         ← High-level description and reproducibility
└── requirements.txt                  ← Python dependencies (optional)
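If you prefer to create the folder and the empty file from Python rather than by hand, a minimal sketch could look like the one below. It assumes the code runs from the root of your repository; the variable names are only illustrative.

```python
from pathlib import Path

# Assumption: the current working directory is the repository root.
repo_root = Path(".")

# Create the docs/ folder if it does not exist yet.
docs_dir = repo_root / "docs"
docs_dir.mkdir(exist_ok=True)

# Create an empty docs/index.md (an existing file is left untouched).
index_file = docs_dir / "index.md"
index_file.touch(exist_ok=True)

print(f"Created: {index_file}")
```

Remember to commit and push the new file so that GitHub can see it.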
docs/index.md
Your Project Structure:
your-me204-project/
├── README.md        ← For technical colleagues
├── notebooks/       ← For data analysts
│   ├── NB01.ipynb
│   └── NB02.ipynb
└── docs/
    └── index.md     ← For general public
Key Points:
- docs/index.md is just a regular markdown file, like your README.md file
- You can mix markdown and HTML (the language of the Web that we saw yesterday)
Example docs/index.md content:
# My Reddit Analysis Project
This project, which I did for [ME204](https://lse-dsi.github.io/ME204/2025/), is about [...].
**What you will find in this website:**
1. **I discovered that [...].**
2. **Some other finding**
## Methodology & Justification
Because I was curious about [...], I chose to collect data
from the following three subreddits: `r/AskReddit`, `r/explainlikeimfive`, and `r/AmItheAsshole`.
I focused on [...this and that rankings...] because [reasons...].
I also collected **all** comments from the posts.
The diagram below illustrates how I collected and preprocessed the data.

## Findings
Isn't this following figure cool?
Just like in Jupyter Notebooks, you can mix markdown and HTML (the language of the Web that we saw yesterday).
Example docs/index.md content:
# My Reddit Analysis Project
This project, which I did for [ME204](https://lse-dsi.github.io/ME204/2025/), is about [...].
<div style="border:1px solid #000;padding:10px;margin-bottom:10px;border-radius:5px;">
<span style="font-weight:bold;font-size:1.1em">What you will find in this website:</span>
<ol>
<li>I discovered that [...]</li>
<li>Some other finding</li>
</ol>
</div>
## Methodology & Justification
Because I was curious about [...], I chose to collect data
from the following three subreddits: `r/AskReddit`, `r/explainlikeimfive`, and `r/AmItheAsshole`.
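Go to the Actions tab: check that the build and deployment of your site has finished.
Go back to Settings: the Pages section shows the link to your published site.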
Your URL will look like: https://username.github.io/repository-name
Note: It may take a few minutes for your changes to appear live.
Students often make these mistakes when building their websites. Let's clear up the confusion.
This is NOT your public website. It is what you see when you view any .md file directly on GitHub (for example, your README.md file).
Key point: This is just GitHub's way of displaying markdown files nicely within their platform.
This is your actual public website:
https://username.github.io/repository-name
Key point: This is your real website that you can share with anyone.
docs/ Folder
Your project structure:
your-me204-project/
├── data/
│   ├── reddit.db
│   └── raw/
│       ├── comments.json
│       ├── posts.csv
│       └── subreddits.json
├── docs/
│   ├── index.md
│   └── figures/
│       └── my-plot.png               ← ✅ WILL be rendered
├── notebooks/
│   ├── NB01-data-collection.ipynb
│   ├── NB02-data-processing.ipynb
│   └── NB03-analysis.ipynb
├── figures/
│   └── another-plot.png              ← ❌ WON'T be rendered
├── README.md
└── requirements.txt
Key point: Only files inside the docs/ folder are accessible to your GitHub Pages website.
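To check a path before you use it in docs/index.md, the rule can be written as a small Python sketch. The helper name `stays_inside_docs` is hypothetical (it is not part of any library), and the sketch only normalises the path text; it does not check that the file actually exists.

```python
import posixpath


def stays_inside_docs(relative_link: str, docs_dir: str = "docs") -> bool:
    """Return True if a link written in docs/index.md resolves inside docs/."""
    # Resolve the link relative to the docs/ folder, collapsing ./ and ../
    resolved = posixpath.normpath(posixpath.join(docs_dir, relative_link))
    return resolved == docs_dir or resolved.startswith(docs_dir + "/")


print(stays_inside_docs("./figures/my-plot.png"))        # stays in docs/
print(stays_inside_docs("../figures/another-plot.png"))  # escapes docs/
```

The first call resolves to docs/figures/my-plot.png, which the website can serve. The second resolves to figures/another-plot.png at the repository root, which it cannot.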
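✅ This will work: a path that stays inside docs/, such as ./figures/my-plot.png
❌ This won't work: a path that leaves docs/, such as ../figures/another-plot.png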
Understanding ./ and ../:
- ./ means "current folder" (the docs/ folder)
- ../ means "parent folder" (the repository root)
- The website is built only from docs/, so ../ won't work
Examples:
- ./figures/plot.png = look in docs/figures/plot.png
- ../figures/plot.png = look in figures/plot.png (outside docs/)
If you want to learn more about HTML and CSS, here are some useful resources:
- CSS Styling
- HTML Elements
- GenAI chatbots
Time: 15 minutes
When you are done creating plots for your project, you will need to export them as PNG or SVG files so you can add them to your website.
In your NB03 notebook:
After creating your plot with matplotlib, add these lines to save it:
- plt.savefig() saves the current figure
- dpi=300 gives high quality
- bbox_inches='tight' removes extra whitespace
- plt.close() frees up memory
PNG vs SVG:
Choose PNG for: Complex visualisations with many data points
Choose SVG for: Simple charts, graphs, and when you need crisp resolution
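The saving steps described above can be sketched as follows. The bar chart, the file names and the docs/figures output folder are assumptions made for illustration; the post counts are taken from the example table used later in these slides. If you run this from inside notebooks/, the output path would need to be ../docs/figures instead.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

# Assumption: figures go inside docs/ so the website can display them.
figures_dir = Path("docs/figures")
figures_dir.mkdir(parents=True, exist_ok=True)

# A small illustrative plot
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["AskReddit", "explainlikeimfive", "AmItheAsshole"], [152, 229, 142])
ax.set_ylabel("Posts")
ax.set_title("Posts per subreddit")

# Save the current figure as PNG (high resolution, no extra whitespace)
png_path = figures_dir / "posts-per-subreddit.png"
plt.savefig(png_path, dpi=300, bbox_inches="tight")

# Save the same figure as SVG (vector format, so dpi matters much less)
svg_path = figures_dir / "posts-per-subreddit.svg"
plt.savefig(svg_path, bbox_inches="tight")

# Free up memory once the files are written
plt.close(fig)
```

Matplotlib picks the output format from the file extension, so the only difference between the two calls is the file name.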
You don't always need a plot to show your results. You can also use a table.
It helps if you customise how it looks so it's not just a boring screenshot!
pandas Styler
Let's give a bland DataFrame a makeover!
Your basic pandas table:
At any point you can save a DataFrame to HTML.
The data is there, but it could be more attractive.
Subreddit | Posts | Comments | Avg_Score | Engagement_Rate |
---|---|---|---|---|
AskReddit | 152 | 1595 | 3.116167 | 0.778937 |
explainlikeimfive | 229 | 1544 | 4.732352 | 0.682710 |
AmItheAsshole | 142 | 621 | 4.202230 | 0.248637 |
todayilearned | 64 | 966 | 4.416145 | 0.227277 |
relationship_advice | 156 | 1738 | 3.041169 | 0.228383 |
Your table might look slightly different (because my slides have some built-in styling).
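As a starting point, here is a minimal sketch that rebuilds the table above and saves it as plain HTML. The values are copied from the table; the output file name is an assumption for illustration.

```python
import pandas as pd

# Rebuild the example table shown above
df = pd.DataFrame({
    "Subreddit": ["AskReddit", "explainlikeimfive", "AmItheAsshole",
                  "todayilearned", "relationship_advice"],
    "Posts": [152, 229, 142, 64, 156],
    "Comments": [1595, 1544, 621, 966, 1738],
    "Avg_Score": [3.116167, 4.732352, 4.202230, 4.416145, 3.041169],
    "Engagement_Rate": [0.778937, 0.682710, 0.248637, 0.227277, 0.228383],
})

# Convert the DataFrame to an HTML string, leaving out the row index
html_table = df.to_html(index=False)

# Write it to a file (hypothetical name) that you can paste into docs/index.md
with open("basic_table.html", "w", encoding="utf-8") as f:
    f.write(html_table)
```

Because markdown and HTML can be mixed, the contents of that file can be pasted directly into docs/index.md.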
What if I want to change how the table, as a whole, looks?
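You can use df.style.set_properties() to apply CSS styling to every data cell of your table.
# Apply basic CSS styling (df is the DataFrame shown above)
styled_df = df.style.set_properties(**{
    'text-align': 'center',
    'font-size': '0.85em',        # Reduces the font to 85% of its normal size
    'padding': '0.5em',           # Adds space around the text
    'border': '1px solid #ddd'    # Adds a border around each cell
})
# Save the styled table without the row index.
# Note: Styler.to_html() has no index argument (unlike DataFrame.to_html()),
# so we hide the index first. Styler.hide() needs pandas 1.4 or later.
styled_df.hide(axis='index').to_html('basic_styled_table.html')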
Now let's make the headers stand out:
Use set_table_styles() to style specific parts of your table. Let's make the headers bold and give them a different background.
The skills you need here are similar to those I showed yesterday in the web scraping section. You'd need to know, for example, that th represents the table headers, and which CSS properties to use to style them.
# Add header styling
styled_df = df.style.set_properties(**{
    'text-align': 'center',
    'font-size': '0.85em',
    'padding': '0.5em',
    'border': '1px solid #ddd'
}).set_table_styles([
    {'selector': 'th', 'props': [
        ('background-color', '#f8f9fa'),
        ('font-weight', 'bold'),
        ('padding', '0.5em'),
        ('text-align', 'center')
    ]}
])
Let's add a colour scale to highlight the data:
Use background_gradient() to add colour scales to numeric columns. This helps readers quickly spot patterns and compare values.
# Add colour gradients to numeric data
# Assume we also added the CSS styling from the previous step
styled_df = (
    styled_df
    .background_gradient(cmap='Blues', subset=['Posts', 'Comments'])
    .background_gradient(cmap='Greens', subset=['Avg_Score'])
    .background_gradient(cmap='Oranges', subset=['Engagement_Rate'])
)
Final touches for a professional look:
Add zebra striping, hover effects, and modern styling.
This creates a table that looks like it belongs on a professional website.
# Professional styling with all features
styled_df = (
    styled_df
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('background-color', '#f8f9fa'),
            ('font-weight', 'bold'),
            ('padding', '0.5em'),
            ('border-radius', '0.25em 0.25em 0 0')
        ]},
        {'selector': 'tr:nth-child(even)', 'props': [
            ('background-color', '#fafafa')
        ]},
        {'selector': 'tr:hover', 'props': [
            ('background-color', '#f0f8ff')
        ]}
    ])
)
Subreddit | Posts | Comments | Avg. Score | Engagement Rate |
---|---|---|---|---|
AskReddit | 152 | 1595 | 3.116167 | 0.778937 |
explainlikeimfive | 229 | 1544 | 4.732352 | 0.682710 |
AmItheAsshole | 142 | 621 | 4.202230 | 0.248637 |
todayilearned | 64 | 966 | 4.416145 | 0.227277 |
relationship_advice | 156 | 1738 | 3.041169 | 0.228383 |
The user is responsible for interpreting the data and drawing their own conclusions.
Streamlit is a Python library that makes it easy to create web apps for data science.
Perfect for:
import streamlit as st
import pandas as pd
import plotly.express as px

# Load your data
df = pd.read_csv('data/processed/reddit_data.csv')

# Create the app
st.title('Reddit Analysis Dashboard')

# Add filters
selected_subreddit = st.selectbox(
    'Choose a subreddit:',
    df['subreddit'].unique()
)

# Show only the rows for the selected subreddit
filtered_df = df[df['subreddit'] == selected_subreddit]
st.dataframe(filtered_df)
To run your dashboard:
1. Save the code above in a file called app.py
2. In your terminal, run: streamlit run app.py
Deployment options:
If anyone is interested in using Quarto for their project, let me know. I will help you with the part about publishing to GitHub Pages.
LSE Summer School 2025 | ME204 Week 03 Day 03