📓 Syllabus

LSE ME204 (2024) – Data Engineering for the Social World

Week 01 (08 July - 14 July) | Know your Data

🗓️ Day 01
(Mon 8 Jul)

TOPIC: Welcome, Course Logistics and Computer Setup

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Articulate why data preparation is important for data analysis
  • Discuss the essential differences between R and Python for data science
  • Follow official tutorials to set up your computer for data analysis
  • Install the necessary software for the course

Morning Lecture
10.00am - 1.00pm

🧑‍🏫 Slides: Introduction & The Data Science Toolbox
(10.00am - 11.00am)

🍡 Little Break
(11.00am - 11.15am)

📋 Activity: Setting up your computer
(11.15am - 1.00pm)

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Meet the Data Frame (a comparison of R vs. Python)

🗓️ Day 02
(Tue 9 Jul)

TOPIC: Data Types & Common File Formats

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Understand how different data types are stored in computer memory
  • Modify data types of columns in data frames
  • Modify contents of columns in a data frame
  • Create new columns in a data frame
  • Compare and contrast CSV, XML and JSON file formats
  • Read and write data in different file formats
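
To make the comparison of file formats concrete, here is a minimal sketch using only Python's standard library. The records and column names are hypothetical, and in class you would more likely work with data frames. It converts a column's data type, creates a new column, and shows that CSV and XML store every value as text while JSON keeps numbers and booleans distinct.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records: every value starts out as a string.
rows = [
    {"country": "Brazil", "population_millions": "216.4"},
    {"country": "Portugal", "population_millions": "10.5"},
]

# Change a column's data type (string -> float) and derive a new column.
for row in rows:
    row["population_millions"] = float(row["population_millions"])
    row["is_large"] = row["population_millions"] > 100

# CSV: flat and compact, but every value becomes text once written.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON: numbers and booleans stay distinct from strings.
json_text = json.dumps(rows)

# XML: values are text again, wrapped in nested tags and attributes.
root = ET.Element("countries")
for row in rows:
    item = ET.SubElement(root, "country", name=row["country"])
    ET.SubElement(item, "population_millions").text = str(row["population_millions"])
xml_text = ET.tostring(root, encoding="unicode")

# Reading the data back shows which formats preserved the data type.
csv_back = list(csv.DictReader(io.StringIO(csv_text)))
json_back = json.loads(json_text)
print(type(csv_back[0]["population_millions"]))   # str
print(type(json_back[0]["population_millions"]))  # float
```

After a round trip through CSV you have to convert the column back to a number yourself, which is one reason data types deserve attention right after loading a file.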

Morning Lecture
10.00am - 1.00pm

🧑‍💻 Programming Practice:
(10:00am - 11:15am)

  • Solve a few programming puzzles to help us reach a consensus: R or Python?
  • Sort out pending installation issues you might have

🍡 Little Break
(11.15am - 11.30am)

🧑‍🏫 Slides: Data Types & File Formats
(11:30am - 1:00pm)

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Tidying up tabular data

🗓️ Day 03
(Wed 10 Jul)

TOPIC: Summarizing and Visualizing Data

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Use computational notebooks to document your data analysis
  • Write Markdown to format your computational notebooks
  • Articulate the relevance of summarizing data
  • Use the groupby -> apply -> combine pattern to summarize data
  • Create scatterplots, bar charts, histograms, and box plots
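
The groupby -> apply -> combine pattern can be sketched in plain Python before reaching for a data frame library. The records below are hypothetical, and this is an illustration of the idea rather than the code used in class; data frame libraries in Python and R offer the same three steps as built-in operations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: one record per survey respondent.
records = [
    {"city": "London", "age": 31},
    {"city": "London", "age": 45},
    {"city": "Leeds", "age": 28},
    {"city": "Leeds", "age": 36},
    {"city": "Leeds", "age": 41},
]

# 1. Group: split the records by the grouping key.
groups = defaultdict(list)
for record in records:
    groups[record["city"]].append(record["age"])

# 2. Apply: compute a summary for each group independently.
# 3. Combine: collect the per-group results into one table-like list.
summary = [
    {"city": city, "n": len(ages), "mean_age": mean(ages)}
    for city, ages in sorted(groups.items())
]

for row in summary:
    print(row)
```

Each group is summarized without looking at the others, which is what makes the pattern easy to reason about and easy to scale.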

Morning Lecture
10.00am - 1.00pm

🧑‍💻 Live Coding: Summarizing Data
(10:00am - 11:15am)

  • How to create computational notebooks (Jupyter if Python, Quarto Markdown if R)
  • Introduction to Markdown
  • Introduction to the groupby -> apply -> combine pattern
  • Demonstration of GitHub Copilot in action

🍡 Little Break
(11.15am - 11.30am)

🧑‍💻 Live Coding: Data visualization
(11:30am - 1:00pm)

  • Intro to the grammar of graphics (using ggplot2 if R, lets-plot if Python)
  • Scatterplots, bar charts, histograms, and box plots
  • Annotations, themes, and scales

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Dataviz practice

🗓️ Day 04
(Thu 11 Jul)

TOPIC: Reshaping Data for Visualization

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

Morning Lecture
10.00am - 1.00pm

📋 Activity: Practice data reshaping with dataviz exercises
(10:00am - 11:15am)

  • Practice updating functions to apply to data
  • Practice rewriting code as you scale up your data analysis

🍡 Little Break
(11.15am - 11.30am)

📋 Activity: Practice data reshaping with dataviz exercises
(11:30am - 1:00pm)

  • Practice some groupby -> apply -> combine patterns
  • Putting it all together:
    • Neat computational notebooks
    • Neat documentation with Markdown
    • Tidy data
    • Appropriate data types
    • Data visualization
    • Groupby -> Apply -> Combine as needed

📢 Midterm Assignment Reveal

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

🦸 Super Tech Support: Data pre-processing + dataviz

  • Get help with your midterm assignment

💂 Enjoy!
(Fri 12 Jul – Sun 14 Jul)

Sightseeing Tips

Midterm support

🧑‍💼 Office Hours: Friday, 12 July 2024 from 10am-12pm

  • Attend office hours if you need additional assistance.
  • No need to book, but please be patient if there are other students ahead of you in the queue.
  • A typical office hour session lasts ~15 minutes.

Week 02 (15 July - 21 July) | Collecting Data

🗓️ Day 01
(Mon 15 Jul)

TOPIC: Collecting data from the Web

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Articulate the difference between the Internet and the Web
  • Use HTML and CSS to create a simple webpage
  • Write code to automate the collection of data from websites
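
As a first taste of automated collection, the sketch below pulls every link out of a small HTML page using only Python's standard library. The page content and the `LinkCollector` class are hypothetical, and the HTML is held in a string so nothing is downloaded; in the labs you may use dedicated scraping libraries instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append({"href": self._current_href, "text": text})
            self._current_href = None


# Hypothetical page standing in for a downloaded website.
page = """
<html><body>
  <h1>Reading list</h1>
  <ul>
    <li><a href="/week01">Know your Data</a></li>
    <li><a href="/week02">Collecting Data</a></li>
  </ul>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The result is a list of dictionaries, a structure that converts directly into a data frame or a JSON file.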

Morning Lecture
10.00am - 1.00pm

🧑‍🏫 Slides: The Internet and the Web
(10.00am - 10.45am)

🍡 Little Break
(10.45am - 11.00am)

📋 Activity: HTML and CSS
(11.00am - 12.00pm)

🧑‍💻 Live Coding: Collecting data from websites
(12:00pm - 1:00pm)

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Web scraping practice I

🗓️ Day 02
(Tue 16 Jul)

TOPIC: Web scraping tricks & Generative AI for debugging code

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Create your own website with just Markdown (no HTML or CSS required)
  • Choose between CSS and XPath selectors to scrape data from websites
  • Write human-readable scraping code
  • Store scraped data in a structured format
  • Use Generative AI tools, such as GitHub Copilot, to debug code

Morning Lecture
10.00am - 1.00pm

📋 Activity: Web scraping practice
(10:00am - 11:15am)

  • Continue to practice web scraping
  • How to spot the identifiable HTML element near the data you want to scrape
  • How to handle <br> tags in your scraped data

🍡 Little Break
(11.15am - 11.30am)

🧑‍💻 Live Coding: A deep dive into CSS and XPath Selectors
(11:30am - 1:00pm)

  • How to choose between CSS and XPath selectors
  • How to write human-readable scraping code
  • Organizing your scraping code into functions (and why you should)
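
The sketch below shows one way to organize scraping code into small, named functions, including a helper for `<br>` tags. It makes several assumptions: the markup is a hypothetical, well-formed snippet, and it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than a full scraping library. Real web pages are rarely well-formed XML, so treat this as an illustration of the structure, not a production scraper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page.
SNIPPET = """
<table>
  <tr class="header"><td>Name</td><td>Address</td></tr>
  <tr class="person">
    <td class="name">Ada</td>
    <td class="address">12 Example Road<br/>London</td>
  </tr>
  <tr class="person">
    <td class="name">Grace</td>
    <td class="address">34 Sample Street<br/>Leeds</td>
  </tr>
</table>
"""


def text_with_breaks(element, separator=", "):
    """Return the element's text, turning each <br> into a separator."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "br":
            parts.append(separator)
        else:
            parts.append(text_with_breaks(child, separator))
        # Text that follows a child tag is stored in its .tail attribute.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def parse_person(row):
    """Turn one <tr class="person"> row into a dictionary."""
    return {
        "name": text_with_breaks(row.find("td[@class='name']")),
        "address": text_with_breaks(row.find("td[@class='address']")),
    }


def parse_people(markup):
    """Parse the whole table; the XPath skips the header row."""
    table = ET.fromstring(markup)
    # A roughly equivalent CSS selector would be: tr.person
    return [parse_person(row) for row in table.findall(".//tr[@class='person']")]


people = parse_people(SNIPPET)
print(people)
```

Keeping one function per level of the page (table, row, cell) means that a change in the website's layout usually requires editing only one function.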

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Web scraping practice II

Take-Home Assignment

📋 Activity: Creating a website with Markdown

  • Set up a GitHub account
  • Create your profile page
  • Create a new repository to store code for this course
  • Create a website with Markdown and publish it on GitHub Pages

🗓️ Day 03
(Wed 17 Jul)

TOPIC: JSON and APIs

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Understand the JSON file format
  • Navigate the JSON file structure
  • Write code to collect data from APIs
  • Navigate API documentation

Morning Lecture
10.00am - 1.00pm

⏪ Review & Solutions: Wikipedia scraping
(10:00am - 11:30am)

  • Live demo of solutions to 💻 Week 02 Day 01 Lab
  • Discussion of your solutions to 💻 Week 02 Day 02 Lab
  • Using GitHub Copilot while coding
  • Ethical scraping: when is it not OK to scrape data from a website? The role of the robots.txt file.
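
One practical check before scraping is to read the site's robots.txt rules. The sketch below uses `urllib.robotparser` from Python's standard library on a hypothetical rules file held in a string; the bot name and URLs are invented. Passing this check does not settle the ethical or legal question on its own, since a site's terms of use and the nature of the data also matter.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical user agent name and URLs.
user_agent = "me204-student-bot"
allowed = parser.can_fetch(user_agent, "https://example.com/articles/1")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data")

print(allowed, blocked)
```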

🍡 Little Break
(11.30am - 11.45am)

🧑‍💻 Live Coding: Collecting data from APIs
(12:00pm - 1:00pm)

  • Set up a developer account with Reddit
  • Connect to the Reddit API
  • How to read the Reddit API documentation

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Collecting data from social media APIs (Reddit)

  • Continuation of the morning’s live coding session

🗓️ Day 04
(Thu 18 Jul)

TOPIC: Web Crawlers & Browser Automation

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Set up Git on your machine.
  • Understand the advantages of using a Scrapy spider over requests + Scrapy Selectors.
  • Understand the architecture of a Scrapy spider.
  • Use the scrapy shell to test your CSS selectors and XPath expressions.
  • Create a new Scrapy project and a new spider.
  • Use the scrapy crawl command to run your spider.
  • Save the scraped data to a JSON or JSONL file.
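
To see how JSON and JSONL output differ, the sketch below writes the same hypothetical items in both formats using only the standard library. It does not use Scrapy itself; the item fields and file names are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical scraped items.
items = [
    {"title": "First page", "url": "https://example.com/1"},
    {"title": "Second page", "url": "https://example.com/2"},
]

out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "items.json"
jsonl_path = out_dir / "items.jsonl"

# JSON: one array holding every item, written in a single step.
json_path.write_text(json.dumps(items, indent=2), encoding="utf-8")

# JSONL: one complete JSON object per line.
with jsonl_path.open("w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# JSONL can be appended to later without rewriting the whole file.
with jsonl_path.open("a", encoding="utf-8") as f:
    f.write(json.dumps({"title": "Third page", "url": "https://example.com/3"}) + "\n")

# Reading back: parse the array at once, or parse line by line.
from_json = json.loads(json_path.read_text(encoding="utf-8"))
with jsonl_path.open(encoding="utf-8") as f:
    from_jsonl = [json.loads(line) for line in f if line.strip()]

print(len(from_json), len(from_jsonl))
```

Because each line stands alone, a JSONL file remains readable even if a long crawl is interrupted partway through.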

Morning Lecture
10.00am - 1.00pm

📋 Activity: Setting Up Git on Your Machine
(10:00am - 10:45am)

🍡 Little Break
(10.45am - 11.00am)

👨‍💻 Live Coding: Scrapy spiders and Selenium
(11:00am - 1:00pm)

📢 Assignment Reveal: Instructions about your final project

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

🦸 Super Tech Support: Data Collection and Project Setup

  • Get support for any issues you might have with scrapy spiders or Selenium
  • Get help with your final project

💂 Enjoy!
(Fri 19 Jul – Sun 21 Jul)

Sightseeing Tips

🧑‍💼 Office Hours

  • Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
  • No need to book, but please be patient if there are other students ahead of you in the queue.
  • A typical office hour session lasts ~15 minutes.

Week 03 (22 July - 28 July) | Databases & Dashboards

🗓️ Day 01
(Mon 22 Jul)

TOPIC: Intro to Databases and SQL

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Understand the similarities and differences between SQL and Python/R’s data manipulation libraries
  • Write basic SQL commands: SELECT, WHERE, GROUP BY, ORDER BY, JOIN
  • Translate your data analysis from Python/R to SQL
  • Use SQL to query databases
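
The basic commands listed above fit into a single query. The sketch below runs one against an in-memory SQLite database using Python's built-in `sqlite3` module; the tables, columns and values are hypothetical, and the database system used in class may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: students and their marks.
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, programme TEXT);
CREATE TABLE grades (student_id INTEGER, assignment TEXT, mark REAL);

INSERT INTO students VALUES
  (1, 'Ana', 'Economics'), (2, 'Ben', 'Sociology'), (3, 'Chi', 'Economics');
INSERT INTO grades VALUES
  (1, 'midterm', 68), (1, 'final', 74),
  (2, 'midterm', 55), (2, 'final', 61),
  (3, 'midterm', 80), (3, 'final', 70);
""")

# JOIN links the tables, WHERE filters rows, GROUP BY summarizes per
# programme, and ORDER BY sorts the summary.
query = """
SELECT s.programme, COUNT(*) AS n_marks, AVG(g.mark) AS mean_mark
FROM grades AS g
JOIN students AS s ON s.id = g.student_id
WHERE g.mark >= 60
GROUP BY s.programme
ORDER BY mean_mark DESC;
"""

result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

Read top to bottom, the query mirrors a typical data frame workflow: merge two tables, filter rows, group, aggregate, then sort.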

Morning Lecture
10.00am - 1.00pm

🧑‍💻 Live Coding: Moving away from simple data files
(10.00am - 11.15am)

🍡 Little Break
(11.15am - 11.30am)

🧑‍💻 Live Coding: Basic SQL commands
(11:30am - 1:00pm)

  • How SQL compares to Python/R’s data manipulation libraries

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: SQL Practice

  • Replicating parts of your data analysis in SQL

🗓️ Day 02
(Tue 23 Jul)

TOPIC: Reporting and Dashboards

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Create interactive visualizations
  • Create a dashboard with multiple visualizations
  • Use a dashboard to tell a story with data
  • Use a dashboard to make data-driven decisions

Morning Lecture
10.00am - 1.00pm

🦸 Super Tech Support: Revision and Project Support
(10:00am - 11:15am)

  • Q&A session on data manipulation and SQL
  • Get help with your final project
  • Further Generative AI tips (GitHub Copilot / ChatGPT)

🍡 Little Break
(11.15am - 11.30am)

🧑‍💻 Live Coding: Interactive Visualisations and Dashboards
(11:30am - 1:00pm)

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

💻 Lab: Dashboard Practice

🗓️ Day 03
(Wed 24 Jul)

TOPIC: Review and Support

Morning Lecture
10.00am - 1.00pm

🦸 Super Tech Support
(10:00am - 1:00pm)

Get help with:

  • Your specific scraping needs
  • Selenium
  • Databases
  • Merging data from multiple tables
  • Data visualization
  • Creating markdown websites
  • Git/GitHub

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

🦸 Super Tech Support: Project Setup

  • Get help with your final project

🗓️ Day 04
(Thu 25 Jul)

TOPIC: Managing your data pipeline

🥅 Objectives

Review the goals for today

At the end of the day you should be able to:

  • Organise your folders for your data pipeline
  • Use GitHub for version control of your code
  • Write good markdown documentation
  • Keep track of data provenance
  • Ensure reproducibility of your work
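
One lightweight way to track data provenance is to store a small record next to each raw file. The sketch below is only an illustration under stated assumptions: the folder layout (`data/raw`), the `describe_file` helper, the file names and the source URL are all hypothetical, not a convention prescribed by the course.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path, source_url):
    """Build a provenance record for one raw data file (hypothetical helper)."""
    return {
        "file": path.name,
        "source_url": source_url,
        # The hash changes whenever the file's contents change.
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "size_bytes": path.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical project layout, created in a temporary folder.
project = Path(tempfile.mkdtemp())
raw_dir = project / "data" / "raw"
raw_dir.mkdir(parents=True)

raw_content = b"id,title\n1,Hello\n"
raw_file = raw_dir / "posts.csv"
raw_file.write_bytes(raw_content)

record = describe_file(raw_file, "https://example.com/posts.csv")

# Store the record alongside the data so the two travel together.
record_path = raw_dir / "posts.provenance.json"
record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")

print(record["file"], record["size_bytes"])
```

Committing the record (but not necessarily the raw data) to version control lets you confirm later that an analysis ran on the same file.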

Morning Lecture
10.00am - 1.00pm

🧑‍💻 Live Coding: Managing your data pipeline
(10:00am - 11:45am)

🍡 Little Break
(11.45am - 12.00pm)

🦸 Super Tech Support: Final Project
(12:00pm - 1:00pm)

  • Get help with your final project

Afternoon Class
2.00pm - 3.30pm or
3.30pm - 5.00pm

🦸 Super Tech Support: Final Project

  • Get help with your final project

🗓️ Day 05
(Fri 26 Jul)

🧑‍💼 Office Hours

  • Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
  • No need to book, but please be patient if there are other students ahead of you in the queue.
  • A typical office hour session lasts ~15 minutes.

⏳ Deadline:

Submit your final project