πŸ—“οΈ Week 07 - Putting it all together: from web scraping to initial data cleaning

Theme: Cleaning and reshaping data


The original title of this lecture was "Data Summarisation and the Grammar of graphics". However, following your feedback via Slack and office hours, this week’s class will instead work through a practical example of web scraping and initial data cleaning, putting together the skills we have learned so far.

📃 Lecture Schedule

πŸ“Location: Thursday 29 February 2024, 4 pm - 6 pm at MAR.1.04

πŸ‘¨β€πŸ« Lecture Material

🎥 Looking for lecture recordings? You can only find those on Moodle, typically a day after the lecture. If you can’t find the recordings, please contact 📧 .

GitHub Repository

We will create a GitHub repository at the lecture. Once created, the link below will take you to the repository.

LINK TO GITHUB REPOSITORY

Goals

In this lecture, we revisit the core concepts of web scraping by compiling a list of the most recent UK general elections from Wikipedia.

The case study covers the following topics:

  1. Finding CSS/XPath selectors on a page
  2. Writing functions for web scraping tasks
  3. List comprehensions for data extraction
  4. Using DataFrame.apply() and Series.apply() in pandas for data manipulation
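
The first three topics above can be sketched in a few lines. This is only an illustration, not the code written in the lecture. It parses a small hand-written HTML snippet (a hypothetical stand-in for a Wikipedia table, not the real page markup) using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. The names SAMPLE_HTML, parse_row and scrape_elections are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a results table.
# Real Wikipedia markup is more complex and is not reproduced here.
SAMPLE_HTML = """
<html>
  <body>
    <table class="wikitable">
      <tr><th>Election</th><th>Date</th></tr>
      <tr>
        <td><a href="/wiki/2019_United_Kingdom_general_election">2019</a></td>
        <td>12 December 2019</td>
      </tr>
      <tr>
        <td><a href="/wiki/2017_United_Kingdom_general_election">2017</a></td>
        <td>8 June 2017</td>
      </tr>
    </table>
  </body>
</html>
"""


def parse_row(row):
    """Turn one <tr> element that contains <td> cells into a dictionary."""
    cells = row.findall("td")
    link = cells[0].find("a")
    return {
        "election": link.text.strip(),
        "url": link.get("href"),
        "date": cells[1].text.strip(),
    }


def scrape_elections(html):
    """Select the table rows with an XPath expression and parse each data row."""
    root = ET.fromstring(html.strip())
    # XPath: every <tr> that is a direct child of a table with class "wikitable"
    rows = root.findall(".//table[@class='wikitable']/tr")
    # List comprehension: keep only rows with <td> cells (this skips the header row)
    return [parse_row(row) for row in rows if row.find("td") is not None]


elections = scrape_elections(SAMPLE_HTML)
for election in elections:
    print(election)
```

ElementTree only accepts well-formed markup, so it works here because the snippet is tidy. Real pages usually need a dedicated HTML parser or scraping library, but the structure stays the same: a selector, a small function per row, and a list comprehension.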

The repository will be created from scratch during the lecture, providing a hands-on approach to Git commands and web scraping techniques. This case study aims to reinforce the concepts learned and help students apply them to their W08 assignment.
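
For the initial data cleaning step, the sketch below shows one way to use apply() in pandas. It assumes the scraped values arrive as messy strings. The footnote markers and stray spaces in raw are invented for this example, and clean_text is a hypothetical helper, not a pandas function.

```python
import re

import pandas as pd

# Hypothetical scraped rows: the footnote markers and extra spaces are
# invented here to imitate typical scraping noise.
raw = pd.DataFrame(
    {
        "election": ["2019 ", " 2017", "2015"],
        "date": ["12 December 2019[a]", "8 June 2017", "7 May 2015[1]"],
    }
)


def clean_text(value):
    """Remove footnote markers such as [1] or [a] and trim whitespace."""
    return re.sub(r"\[[^\]]*\]", "", value).strip()


clean = raw.copy()

# Series.apply calls the function once per value in the column
clean["election"] = clean["election"].apply(clean_text)
clean["date"] = clean["date"].apply(clean_text)
clean["year"] = clean["date"].apply(lambda d: int(d.split()[-1]))

# DataFrame.apply with axis=1 calls the function once per row
clean["label"] = clean.apply(
    lambda row: f"{row['election']} general election, held on {row['date']}",
    axis=1,
)

print(clean)
```

Series.apply suits a transformation that needs one column only, while DataFrame.apply with axis=1 is useful when the result depends on several columns of the same row.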