🗣️ Week 04 Lecture

Introduction to Web Scraping

Published: 10 February 2025

Last Updated: 9 February 2025, 23:30.

Welcome to Week 04 of DS205, where we will explore web scraping techniques and their ethical implications.

📍 Session Details

  • Date: Monday, 10 February 2025
  • Time: 10:00 am - 12:00 pm
  • Location: KSW.1.01

🗣️ Lecture Content

1. Introduction to Web Scraping

  • What is web scraping and why is it important?
  • Real-world applications:
    • Market research
    • Academic research
    • Financial analysis
    • Job market analysis
    • News monitoring

2. How Websites are Structured

  • HTML fundamentals
  • CSS basics and styling
  • The Document Object Model (DOM)
  • Using browser developer tools for inspection
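To make the idea of a document tree concrete, here is a minimal sketch that prints the nesting of tags in a small page using only Python's standard library. The sample HTML, the `OutlineParser` class, and the `dom_outline` helper are all invented for illustration; they are not taken from the lecture materials, and a browser's developer tools show a far richer view of the same structure.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Week 04</h1>
    <ul id="topics">
      <li class="topic">HTML</li>
      <li class="topic">CSS</li>
    </ul>
  </body>
</html>
"""

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class OutlineParser(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes) tuples

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def dom_outline(html):
    """Return the tags of a well-formed HTML string with their depths."""
    parser = OutlineParser()
    parser.feed(html)
    return parser.outline


# Indentation mirrors how deeply each element is nested in the tree.
for depth, tag, attrs in dom_outline(SAMPLE_HTML):
    print("  " * depth + tag, attrs)
```

The sketch assumes well-formed markup in which every non-void element is explicitly closed. Real pages are often messier, which is one reason dedicated parsing libraries exist.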

3. Selecting Elements

  • HTML elements and attributes
  • CSS selectors
  • Live demonstration of element selection
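As a rough illustration of what a CSS selector does, the sketch below hand-implements three of the simplest selector forms (`tag`, `.class`, and `#id`) on top of Python's standard-library HTML parser. This is not how the live demonstration works and not how scraping libraries implement selectors; the sample HTML and the names `matches`, `SelectParser`, and `select_text` are hypothetical, and combinators such as `ul > li` are deliberately left out.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<ul id="topics">
  <li class="topic core">HTML</li>
  <li class="topic">CSS</li>
  <li class="aside">History</li>
</ul>
"""


def matches(selector, tag, attrs):
    """Test one simple selector: 'tag', '.class', or '#id' (no combinators)."""
    if selector.startswith("."):
        # The class attribute is a space-separated list of class names.
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector


class SelectParser(HTMLParser):
    """Collects the text content of every element matching the selector.

    Assumes well-formed markup with no void elements (such as <br>)
    inside a matched element.
    """

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.depth = 0      # nesting depth inside the current match; 0 = outside
        self.buffer = []    # text fragments of the current match
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif matches(self.selector, tag, dict(attrs)):
            self.depth = 1
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.results.append(" ".join("".join(self.buffer).split()))


def select_text(html, selector):
    """Return the text of each element that matches a simple selector."""
    parser = SelectParser(selector)
    parser.feed(html)
    return parser.results


print(select_text(SAMPLE_HTML, "li"))       # selects by tag name
print(select_text(SAMPLE_HTML, ".topic"))   # selects by class
print(select_text(SAMPLE_HTML, "#topics"))  # selects by id
```

The three calls show the trade-off between selector types: a tag selector is broad, a class selector narrows to elements sharing a role, and an id selector targets a single element.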

4. Ethical and Industry Perspectives

  • Legal framework (GDPR, CCPA, DMCA)
  • Technical controls and best practices
  • Industry cases:
    • News Publishers v. AI Companies
    • Small Websites v. AI Companies
    • Industry responses to AI-powered data collection
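One widely used technical control is a site's `robots.txt` file, which states which crawlers may visit which paths. The sketch below, which assumes an invented `robots.txt` and made-up crawler names (`ExampleAIBot`, `CourseScraper`), shows how a scraper could check those rules with Python's standard `urllib.robotparser` before requesting a page. Note that `robots.txt` is a voluntary convention: honouring it is good practice, but it does not by itself settle the legal questions listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only.
# It blocks one named crawler entirely and limits everyone else.
SAMPLE_ROBOTS_TXT = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)

# The crawler names and URLs below are assumptions made for this example.
print(rules.can_fetch("ExampleAIBot", "https://example.org/articles/1"))
print(rules.can_fetch("CourseScraper", "https://example.org/articles/1"))
print(rules.can_fetch("CourseScraper", "https://example.org/private/data"))

# Seconds the site asks crawlers to wait between requests, if stated.
print(rules.crawl_delay("CourseScraper"))
```

In a real scraper, the rules would be loaded from the target site itself (for example with `set_url` and `read`) rather than from a string, and the crawl delay would be respected between requests.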

5. Looking Ahead

  • Introduction to Python web scraping libraries
  • Preview of the lab session
  • Overview of the formative exercise

👉 Next Steps

  1. Review the Black (2016) paper discussed in class
  2. Consider reading the Liu et al. (2024) paper on how content creators can protect their content from being scraped by AI crawlers
  3. Access the 💻 W04 Lab materials to practice web scraping techniques
  4. Reserve time later in the week to work on the 📝 W04-W05 Formative Exercise once it is available on Moodle

🎥 Session Recording

The lecture recording will be available on Moodle by the afternoon of the lecture.