Syllabus
LSE ME204 (2024) - Data Engineering for the Social World
Week 01 (08 July - 14 July) | Know your Data
Day 01
(Mon 8 Jul)
TOPIC: Welcome, Course Logistics and Computer Setup
Objectives
Review the goals for today
At the end of the day you should be able to:
- Articulate why data preparation is important for data analysis
- Discuss the essential differences between R and Python for data science
- Follow official tutorials to set up your computer for data analysis
- Install the necessary software for the course
Morning Lecture
10.00am - 1.00pm
Slides: Introduction & The Data Science Toolbox
(10.00am - 11.00am)
Little Break
(11.00am - 11.15am)
Activity: Setting up your computer
(11.15am - 1.00pm)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Meet the Data Frame (a comparison of R vs. Python)
Day 02
(Tue 9 Jul)
TOPIC: Data Types & Common File Formats
Objectives
Review the goals for today
At the end of the day you should be able to:
- Understand how different data types are stored in computer memory
- Modify data types of columns in data frames
- Modify contents of columns in a data frame
- Create new columns in a data frame
- Compare and contrast CSV, XML and JSON file formats
- Read and write data in different file formats
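As a rough illustration of the file-format and data-type objectives above, the sketch below reads a small CSV, converts a column from text to integers, adds a new column, and writes the result as JSON. It uses only the Python standard library; the sample data and the column names (`name`, `age`, `age_in_months`) are invented for illustration, and in class you would more likely use a data frame library.

```python
import csv
import io
import json

# Hypothetical sample data: CSV stores every value as text.
csv_text = "name,age\nAda,36\nGrace,45\n"

# Read the CSV into a list of dictionaries (one per row).
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Convert the 'age' column from str to int and create a derived column.
for row in rows:
    row["age"] = int(row["age"])
    row["age_in_months"] = row["age"] * 12

# JSON keeps numbers and strings distinct, so the types survive a round trip.
json_text = json.dumps(rows, indent=2)
round_trip = json.loads(json_text)
```

Note the contrast: the CSV reader hands back strings that must be converted by hand, whereas the JSON round trip returns integers as integers.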
Morning Lecture
10.00am - 1.00pm
Programming Practice:
(10:00am - 11:15am)
- Solve a few programming puzzles to help us reach a consensus: R or Python?
- Sort out pending installation issues you might have
Little Break
(11.15am - 11.30am)
Slides: Data Types & File Formats
(11:30am - 1:00pm)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Tidying up tabular data
Day 03
(Wed 10 Jul)
TOPIC: Summarizing and Visualizing Data
Objectives
Review the goals for today
At the end of the day you should be able to:
- Use computational notebooks to document your data analysis
- Write Markdown to format your computational notebooks
- Articulate the relevance of summarizing data
- Use the groupby -> apply -> combine pattern to summarize data
- Create scatterplots, bar charts, histograms, and box plots
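The groupby -> apply -> combine pattern named in the objectives can be sketched without any data frame library. The records and the `city`/`temp` field names below are hypothetical; a data frame library would express the same three steps more compactly.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one dictionary per observation.
records = [
    {"city": "London", "temp": 18.0},
    {"city": "London", "temp": 22.0},
    {"city": "Lagos", "temp": 30.0},
    {"city": "Lagos", "temp": 32.0},
]

# 1. Group: split the observations by the grouping key.
groups = defaultdict(list)
for rec in records:
    groups[rec["city"]].append(rec["temp"])

# 2. Apply: compute one summary value per group.
# 3. Combine: collect the per-group summaries into a single result.
summary = {city: mean(temps) for city, temps in groups.items()}
```

Here `summary` maps each city to its mean temperature, which is the kind of table a summarizing step typically produces before plotting.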
Morning Lecture
10.00am - 1.00pm
Live Coding: Summarizing Data
(10:00am - 11:15am)
- How to create computational notebooks (Jupyter if Python, Quarto Markdown if R)
- Introduction to Markdown
- Introduction to the groupby -> apply -> combine pattern
- Demonstration of GitHub Copilot in action
Little Break
(11.15am - 11.30am)
Live Coding: Data visualization
(11:30am - 1:00pm)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Dataviz practice
Day 04
(Thu 11 Jul)
TOPIC: Reshaping Data for Visualization
Objectives
Review the goals for today
At the end of the day you should be able to:
Morning Lecture
10.00am - 1.00pm
Activity: Practice data reshaping with dataviz exercises
(10:00am - 11:15am)
- Practice updating functions to apply to data
- Practice rewriting code as you scale up your data analysis
Little Break
(11.15am - 11.30am)
Activity: Practice data reshaping with dataviz exercises
(11:30am - 1:00pm)
- Practice some groupby -> apply -> combine patterns
- Putting it all together:
- Neat computational notebooks
- Neat documentation with Markdown
- Tidy data
- Appropriate data types
- Data visualization
- Groupby -> Apply -> Combine as needed
Midterm Assignment Reveal
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Super Tech Support: Data pre-processing + dataviz
- Get help with your midterm assignment
Enjoy!
(Fri 12 Jul - Sun 14 Jul)
Sightseeing Tips
Midterm support
Office Hours: Friday, 12 July 2024 from 10am-12pm
- Attend office hours if you need additional assistance.
- No need to book, but please be patient if there are other students ahead of you in the queue.
- A typical office hour session lasts ~15 minutes.
Week 02 (15 July - 21 July) | Collecting Data
Day 01
(Mon 15 Jul)
TOPIC: Collecting data from the Web
Objectives
Review the goals for today
At the end of the day you should be able to:
- Articulate the difference between the Internet and the Web
- Use HTML and CSS to create a simple webpage
- Write code to automate the collection of data from websites
Morning Lecture
10.00am - 1.00pm
Slides: The Internet and the Web
(10.00am - 10.45am)
Little Break
(10.45am - 11.00am)
Activity: HTML and CSS
(11.00am - 12.00pm)
Live Coding: Collecting data from websites
(12:00pm - 1:00pm)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Web scraping practice I
Day 02
(Tue 16 Jul)
TOPIC: Web scraping tricks & Generative AI for debugging code
Objectives
Review the goals for today
At the end of the day you should be able to:
- Create your own website with just Markdown (no HTML or CSS required)
- Choose between CSS and XPath selectors to scrape data from websites
- Write human-readable scraping code
- Store scraped data in a structured format
- Use Generative AI tools, such as GitHub Copilot, to debug code
Morning Lecture
10.00am - 1.00pm
Activity: Web scraping practice
(10:00am - 11.15am)
- Continue to practice web scraping
- How to spot the identifiable HTML element near the data you want to scrape
- How to handle <br> tags in your scraped data
Little Break
(11.15am - 11.30am)
Live Coding: A deep dive into CSS and XPath Selectors
(11:30am - 1:00pm)
- How to choose between CSS and XPath selectors
- How to write human-readable scraping code
- Organizing your scraping code into functions (and why you should)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Web scraping practice II
Take-Home Assignment
Activity: Creating a website with Markdown
- Set up a GitHub account
- Create your profile page
- Create a new repository to store code for this course
- Create a website with Markdown and publish it on GitHub Pages
Day 03
(Wed 17 Jul)
TOPIC: JSON and APIs
Objectives
Review the goals for today
At the end of the day you should be able to:
- Understand the JSON file format
- Navigate the JSON file structure
- Write code to collect data from APIs
- Navigate API documentation
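To make "navigating the JSON file structure" concrete, here is a minimal sketch that parses a nested JSON document and walks down to the fields of interest. The payload is invented for illustration and is not the output of any real API; real responses should be explored against the API's own documentation.

```python
import json

# Hypothetical API response: nested objects and a list of items.
payload = """
{
  "data": {
    "children": [
      {"data": {"title": "First post", "score": 10}},
      {"data": {"title": "Second post", "score": 25}}
    ]
  }
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
response = json.loads(payload)

# Navigate level by level: dict key -> dict key -> list of items.
posts = response["data"]["children"]
titles = [post["data"]["title"] for post in posts]

# Pick the title of the item with the highest score.
top_title = max(posts, key=lambda p: p["data"]["score"])["data"]["title"]
```

Inspecting one level at a time (for example, printing `response.keys()` first) is usually the easiest way to find the path to the data you need.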
Morning Lecture
10.00am - 1.00pm
Review & Solutions: Wikipedia scraping
(10:00am - 11:30am)
- Live demo of solutions to Week 02 Day 01 Lab
- Discussion of your solutions to Week 02 Day 02 Lab
- Using GitHub Copilot while coding
- Ethical scraping: when is it not OK to scrape data from a website? The role of the robots.txt file.
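As one possible illustration of the robots.txt point above, Python's standard library can parse a robots.txt file and report whether a given path may be fetched. The rules, the `my-course-bot` user agent, and the example.com URLs below are all made up; a real scraper would load the site's actual file, and robots.txt is only one part of deciding whether scraping is acceptable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network is needed.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a (made-up) crawler may fetch two example URLs.
allowed = parser.can_fetch("my-course-bot", "https://example.com/articles/1")
blocked = parser.can_fetch("my-course-bot", "https://example.com/private/data")
```

With these rules, the first URL is permitted and the second is disallowed, so a polite scraper would skip anything under `/private/`.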
Little Break
(11.30am - 11.45am)
Live Coding: Collecting data from APIs
(12:00pm - 1:00pm)
- Set up a developer account with Reddit
- Connect to the Reddit API
- How to read the Reddit API documentation
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Collecting data from social media APIs (Reddit)
- Continuation of the morning's live coding session
Day 04
(Thu 18 Jul)
TOPIC: Web Crawlers & Browser Automation
Objectives
Review the goals for today
At the end of the day you should be able to:
- Set up Git on your machine.
- Understand the advantages of using a scrapy spider over requests + Scrapy Selectors.
- Understand the architecture of a scrapy spider.
- Use the scrapy shell to test your CSS selectors and XPath expressions.
- Create a new Scrapy project and a new spider.
- Use the scrapy crawl command to run your spider.
- Save the scraped data to a JSON or JSONL file.
Morning Lecture
10.00am - 1.00pm
Activity: Setting Up Git on Your Machine
(10:00am - 10:45am)
Little Break
(10.45am - 11.00am)
Live Coding: Scrapy spiders and Selenium
(11:00am - 1:00pm)
Assignment Reveal: Instructions about your final project
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Super Tech Support: Data Collection and Project Setup
- Get support for any issues you might have with scrapy spiders or Selenium
- Get help with your final project
Enjoy!
(Fri 19 Jul - Sun 21 Jul)
Sightseeing Tips
Office Hours
- Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
- No need to book, but please be patient if there are other students ahead of you in the queue.
- A typical office hour session lasts ~15 minutes.
Week 03 (22 July - 28 July) | Databases & Dashboards
Day 01
(Mon 22 Jul)
TOPIC: Intro to Databases and SQL
Objectives
Review the goals for today
At the end of the day you should be able to:
- Understand the similarities and differences between SQL and Python/R's data manipulation libraries
- Write basic SQL commands: SELECT, WHERE, GROUP BY, ORDER BY, JOIN
- Translate your data analysis from Python/R to SQL
- Use SQL to query databases
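The SQL commands listed above (SELECT, WHERE, GROUP BY, ORDER BY, JOIN) can be tried out with an in-memory SQLite database from Python's standard library. This is only a sketch: the `students` and `grades` tables and their contents are hypothetical, and the course may use a different database engine.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, invented for illustration.
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, programme TEXT);
CREATE TABLE grades (student_id INTEGER, mark REAL);
INSERT INTO students VALUES
    (1, 'Ada', 'Economics'), (2, 'Grace', 'Sociology'), (3, 'Alan', 'Economics');
INSERT INTO grades VALUES (1, 70), (1, 80), (2, 65), (3, 50);
""")

# JOIN the tables, keep marks of 60 or more (WHERE), average per programme
# (GROUP BY), and sort from highest to lowest average (ORDER BY).
query = """
SELECT s.programme, AVG(g.mark) AS avg_mark
FROM students AS s
JOIN grades AS g ON g.student_id = s.id
WHERE g.mark >= 60
GROUP BY s.programme
ORDER BY avg_mark DESC
"""
result = conn.execute(query).fetchall()
conn.close()
```

The GROUP BY step plays the same role as the groupby -> apply -> combine pattern in Python or R: rows are split by programme, averaged, and returned as one row per group.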
Morning Lecture
10.00am - 1.00pm
Live Coding: Moving away from simple data files
(10.00am - 11.15am)
Little Break
(11.15am - 11.30am)
Live Coding: Basic SQL commands
(11:30am - 1:00pm)
- How SQL compares to Python/R's data manipulation libraries
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: SQL Practice
- Replicating parts of your data analysis in SQL
Day 02
(Tue 23 Jul)
TOPIC: Reporting and Dashboards
Objectives
Review the goals for today
At the end of the day you should be able to:
- Create interactive visualizations
- Create a dashboard with multiple visualizations
- Use a dashboard to tell a story with data
- Use a dashboard to make data-driven decisions
Morning Lecture
10.00am - 1.00pm
Super Tech Support: Revision and Project Support
(10:00am - 11.15am)
- Q&A session on data manipulation and SQL
- Get help with your final project
- Further Generative AI tips (GitHub Copilot / ChatGPT)
Little Break
(11.15am - 11.30am)
Live Coding: Interactive Visualisations and Dashboards
(11:30am - 1:00pm)
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Lab: Dashboard Practice
Day 03
(Wed 24 Jul)
TOPIC: Review and Support
Morning Lecture
10.00am - 1.00pm
Super Tech Support
(10:00am - 1:00pm)
Get help with:
- Your specific scraping needs
- Selenium
- Databases
- Merging data from multiple tables
- Data visualization
- Creating markdown websites
- Git/GitHub
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Super Tech Support: Project Setup
- Get help with your final project
Day 04
(Thu 25 Jul)
TOPIC: Managing your data pipeline
Objectives
Review the goals for today
At the end of the day you should be able to:
- Organise your folders for your data pipeline
- Use GitHub for version control of your code
- Write good markdown documentation
- Keep track of data provenance
- Ensure reproducibility of your work
Morning Lecture
10.00am - 1.00pm
Live Coding: Managing your data pipeline
(10:00am - 11:45am)
Little Break
(11.45am - 12.00pm)
Super Tech Support: Final Project
(12:00pm - 1:00pm)
- Get help with your final project
Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm
Super Tech Support: Final Project
- Get help with your final project
Day 05
(Fri 26 Jul)
Office Hours
- Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
- No need to book, but please be patient if there are other students ahead of you in the queue.
- A typical office hour session lasts ~15 minutes.
Deadline:
Submit your final project