DS205 2025-2026 Winter Term

πŸ’» Week 02 Lab

Scrapy selectors, then Selenium with Chromium

nuvolos
selenium
web-scraping
Practise CSS selectors with Scrapy, then switch to Selenium when needed.
Author

Dr Jon Cardoso-Silva

Published

02 February 2026

Modified

02 February 2026

πŸ₯… Learning Goals

By the end of this lab, you should be able to:

  • Use Scrapy’s Selector and CSS selectors to extract links from a page
  • Explain Scrapy’s ::text and ::attr(...) pseudo-elements
  • Use Selenium with Chromium to extract the same information from a live browser
  • Collect Waitrose category links into a simple JSON structure

This lab focuses on selectors. You will practise with Scrapy first (fast feedback), then switch to Selenium when the page needs a real browser.

πŸ“‹ Preparation

πŸ›£οΈ Lab Roadmap

How the W02 lab will be structured
| Part | Activity Type | Focus | Time | Outcome |
|------|---------------|-------|------|---------|
| Section 0 | πŸ‘€ Teaching Moment | Create and activate the food environment | 10 min | Everyone runs the same environment |
| Section 1 | πŸ‘€ Teaching Moment | Scrapy selector recap | 15 min | You can target elements and chain selectors |
| Section 2 | ⏸️ Self-paced | Wikipedia link extraction (Scrapy) | 30 min | You can extract links and explain the > combinator |
| Section 3 | πŸ‘€ Teaching Moment | Selenium with Chromium | Remaining time | You can repeat the extraction in a live browser |
| Final task | ⏸️ Self-paced | Waitrose category links | Remaining time / take-home | A JSON list of {category_name, link} |

πŸ‘‰ NOTE: Whenever you see a πŸ‘€ TEACHING MOMENT, this means your class teacher deserves your full attention!

Section 0: Setup (10 mins)

🎯 ACTION POINT

Follow the steps below as soon as you enter the lab. Raise your hand if you need help and your class teacher will assist you. (Feel free to work in pairs or small groups if you prefer.)

  1. We want to make sure we are all using the same environment, so that if there are bugs, we can face them together and fix them together!

    So, let’s start by creating and activating the food conda environment. If you already did this in yesterday’s lecture, note that the environment.yml file has changed slightly, so you will need to update your environment.

If you don’t have a food conda environment: create it

Point conda at the environment.yml file in the week02 folder and run the following commands:

# Change the path if you moved the environment.yml file to a different location
conda env create -f /files/week02/environment.yml
conda activate food

If you can’t find the environment.yml file on Nuvolos, here’s the content you need. Just create a file called environment.yml and paste the content below.

name: food
channels:
  - defaults
  - conda-forge
dependencies:
  # --- Core runtime ---
  - python=3.12

  # --- Notebooks in VS Code ---
  - ipykernel
  - ipython

  # --- Data collection ---
  - requests
  - scrapy
  - selenium

  # --- Data manipulation ---
  - pandas

If you need to update the food conda environment

Whenever you edit the environment.yml file, apply the change like this:

conda env update -f environment.yml --prune
conda activate food

If things get messy, remove and start again:

conda env remove -n food
conda env create -f environment.yml
conda activate food
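
Once the environment is active, you can sanity-check that the key packages import correctly (a minimal check; if any import fails, rebuild the environment as above):

# run inside the activated food environment
import pandas
import scrapy
import selenium

print("scrapy", scrapy.__version__)
print("selenium", selenium.__version__)
print("pandas", pandas.__version__)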

Section 1: Teaching moment (15 mins)

This section is a TEACHING MOMENT

Your class teacher will recap the essentials of HTML and CSS selectors in Scrapy and show how to chain selectors if needed.

πŸ’‘ TIP: Reuse the same notebook from the lecture for this section. Just add a ## Section 6: Scrapy selectors recap section at the end and add any new code you need to answer the questions in the lab.

We will continue to work with the same Scrapy Selector object called response, which contains the HTML of the page we are scraping: Wikipedia’s List of foods page.
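
If you are starting from a fresh notebook instead, here is a minimal sketch of how that response object can be rebuilt. It assumes the lecture used requests together with Scrapy’s Selector; the exact URL and construction are assumptions:

import requests
from scrapy.selector import Selector

# download the page's HTML and wrap it in a Selector so we can use .css()
html = requests.get("https://en.wikipedia.org/wiki/List_of_foods").text
response = Selector(text=html)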

CSS classes and IDs (quick recap)

  • Class: .product captures all elements with class="product". div.product means a <div> with that class.
  • ID: #main captures the element with id="main".
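
A quick illustration on a made-up snippet (the class and ID names below are invented for the example, not taken from the real page):

from scrapy.selector import Selector

sel = Selector(text="""
<div id="main">
  <div class="product">Bread</div>
  <div class="product">Milk</div>
</div>
""")

sel.css(".product::text").getall()  # ['Bread', 'Milk']
sel.css("#main").get()              # the single element with id="main"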

πŸ†• You can chain the CSS selectors

Barry will show that you can achieve the same result as response.css("h1 span::text").get() by chaining multiple CSS selectors:

response.css("h1").css("span")              # still a SelectorList, so you can keep chaining
response.css("h1").css("span::text").get()  # same result as response.css("h1 span::text").get()

⚠️ WARNING: if you need/want to chain selectors, do not call .get() too early. Once you call .get(), you get a string back. You cannot keep selecting inside a string.
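
For example, here is the pitfall in miniature:

first_h1 = response.css("h1").get()  # .get() returns a str, not a SelectorList
# first_h1.css("span")               # AttributeError: a string has no .css() method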

Section 2: Practise with Scrapy selectors (30 mins)

🎯 ACTION POINTS

Work in pairs or on your own for this section. Your class teacher will give you the answers at the end of the section.

Action Point 1: Finding the best container

Open the page in your browser and use Inspect:

Your task: Find the <div> that most closely wraps the list of foods.

Confirm that you have found the right container by replacing the ? with your selector and running the code.

div_selector = "?"
response.css(div_selector).get()

πŸ’‘ TIP: for classes you can use div.some-class or .some-class.

Action Point 2: what does > do?

What does this particular selector (div > ul > li) return?

response.css("div > ul > li")

Does it return all the listed foods on the page? Or does it return something else?

Your task: If your answer to the question above is β€œno”, write a more specific selector that returns only the listed foods.

Action Point 3: remove the > and compare

Remove the > symbols and run the selector again.

response.css("div ul li")

Your task: Write down an explanation of why the selector above returns something different from the one in the previous question.
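
If it helps, here is a toy snippet (invented for illustration, deliberately not the Wikipedia page) showing how the two selectors behave on nested lists:

from scrapy.selector import Selector

sel = Selector(text="""
<div>
  <ul>
    <li>Plant milk
      <ul>
        <li>Almond milk</li>
      </ul>
    </li>
  </ul>
</div>
""")

len(sel.css("div > ul > li"))  # 1: only the <li> whose parent <ul> is a direct child of the <div>
len(sel.css("div ul li"))      # 2: any <li> nested at any depth under a <ul> inside the <div>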

πŸ† (Optional) Challenge: keep the nested structure

Some items have nested lists (e.g. Plant milk has Almond milk, Coconut milk, etc.).

Your task: If you have the time, try to write a solution using a loop to convert that nested HTML structure (the list of <li> elements inside other <li> elements) into a nested JSON structure.

πŸ€– Ask the robot for help: You could use the DS205 Claude bot to help you design the approach. Can you get Claude to write the neatest/shortest code possible without over-engineering the solution?

Section 3: Time to switch to Selenium (remaining time)

This section is a TEACHING MOMENT

Your class teacher will show how to use Selenium with Chromium on Nuvolos, then connect the same selector thinking back to Scrapy.

Now open the second notebook of the week and follow along with Barry:

  • W02-NB02-Lab-Selenium.ipynb

That notebook contains:

  • a Scrapy vs Selenium cheat sheet
  • working Selenium code for the same Wikipedia tasks from Section 2 (solutions included)
  • a short demo of clicking and navigation
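
To give you a flavour before you open it, here is a minimal sketch of the Selenium setup. The headless flag, URL, and selector are illustrative assumptions; the notebook on Nuvolos has the authoritative version:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chromium without opening a window
# on Nuvolos you may need to set options.binary_location to the Chromium binary (path varies)

driver = webdriver.Chrome(options=options)
driver.get("https://en.wikipedia.org/wiki/List_of_foods")

# same CSS selector thinking as Scrapy, but there are no ::text / ::attr(...) pseudo-elements:
# read .text and .get_attribute("href") on the returned elements instead
for a in driver.find_elements(By.CSS_SELECTOR, "div ul li a")[:5]:
    print(a.text, a.get_attribute("href"))

driver.quit()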

Your Problem Set 1 (remaining time or take-home)

While the full instructions will come a bit later, you can already start building your Problem Set 1.

  • Create a separate Jupyter Notebook or Python script to build your scraper.
  • Go to Waitrose and collect the list of categories and their links from: https://www.waitrose.com/ecom/shop/browse/groceries
  • Decide whether you need Scrapy or Selenium to extract the category names and URLs into a Python list or DataFrame.
  • Write an initial version of your scraper.
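
To make the target concrete, the roadmap’s outcome is a JSON list of {category_name, link} objects. Here is a minimal sketch of writing that structure to disk (the values and filename are illustrative):

import json

# illustrative placeholder; your scraper should populate this list
categories = [
    {"category_name": "Example Category", "link": "https://www.waitrose.com/..."},
]

with open("waitrose_categories.json", "w") as f:
    json.dump(categories, f, indent=2)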