💻 Week 02 Lab
Scrapy selectors, then Selenium with Chromium
By the end of this lab, you should be able to:
- Use Scrapy's `Selector` and CSS selectors to extract links from a page
- Explain pseudo-selectors like `::text` and `::attr(...)`
- Use Selenium with Chromium to extract the same information from a live browser
- Collect Waitrose category links into a simple JSON structure
This lab focuses on selectors. You will practise with Scrapy first (fast feedback), then switch to Selenium when the page needs a real browser.
📚 Preparation
- Review the 🗣️ Week 02 Lecture
- On Nuvolos, open the app VS Code + Chromium + Selenium
- Download the notebook you will use in the last section: W02-NB02-Lab-Selenium.ipynb
🗣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Section 0 | 🤝 Teaching Moment | Create and activate the `food` conda environment | 10 min | Everyone runs the same environment |
| Section 1 | 🤝 Teaching Moment | Scrapy selector recap | 15 min | You can target elements and chain selectors |
| Section 2 | ⏸️ Self-paced | Wikipedia link extraction (Scrapy) | 30 min | You can extract links and explain the `>` combinator |
| Section 3 | 🤝 Teaching Moment | Selenium with Chromium | Remaining time | You can repeat the extraction in a live browser |
| Final task | ⏸️ Self-paced | Waitrose category links | Remaining time / take-home | A JSON list of `{category_name, link}` |
📝 NOTE: Whenever you see a 🤝 TEACHING MOMENT, this means your class teacher deserves your full attention!
Section 0: Setup (10 mins)
🎯 ACTION POINT
Follow the steps below as soon as you enter the lab. Raise your hand if you need help and your class teacher will assist you. (Feel free to work in pairs or small groups if you prefer.)
We want to make sure we are all using the same environment. And if there are bugs, we all want to be able to face them together and fix them together!
So, let's start by creating and activating the `food` conda environment. If you have already done so yesterday in the lecture, note that the `environment.yml` file has changed a bit and you need to update your environment.
If you don't have a `food` conda environment: create it
Point to the `environment.yml` file in the `week02` folder and run the following commands:
```bash
# Change the path if you moved the environment.yml file to a different location
conda env create -f /files/week02/environment.yml
conda activate food
```

If you can't find the `environment.yml` file on Nuvolos, here's the content you need. Just create a file called `environment.yml` and paste the content below.
```yaml
name: food
channels:
  - defaults
  - conda-forge
dependencies:
  # --- Core runtime ---
  - python=3.12
  # --- Notebooks in VS Code ---
  - ipykernel
  - ipython
  # --- Data collection ---
  - requests
  - scrapy
  - selenium
  # --- Data manipulation ---
  - pandas
```
If you need to update the `food` conda environment
Whenever you edit the `environment.yml` file, apply the change like this:
```bash
conda env update -f environment.yml --prune
conda activate food
```

If things get messy, remove and start again:

```bash
conda env remove -n food
conda env create -f environment.yml
conda activate food
```

Section 1: Teaching moment (15 mins)
This section is a TEACHING MOMENT
Your class teacher will recap the essentials of HTML and CSS selectors in Scrapy and show how to chain selectors if needed.
💡 TIP: Reuse the same notebook from the lecture for this section. Just add a `## Section 6: Scrapy selectors recap` section at the end and add any new code you need to answer the questions in the lab.
We will continue to work with the same Scrapy `Selector` object called `response`, which contains the HTML of the page we are scraping: List of foods.
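If you are starting from a fresh notebook rather than the lecture one, here is a minimal sketch of how you could rebuild such a `response` object. The URL below is an assumption; swap in whichever page you actually used in the lecture.

```python
import requests
from scrapy import Selector

# Assumed URL -- replace with the exact page from the lecture if it differs
url = "https://en.wikipedia.org/wiki/List_of_foods"
html = requests.get(url).text

# A Selector built from raw HTML behaves like the `response` object
# you get inside a Scrapy spider (for .css()/.xpath() purposes)
response = Selector(text=html)
```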
CSS classes and IDs (quick recap)
- Class: `.product` captures all elements with `class="product"`. `div.product` means a `<div>` with that class.
- ID: `#main` captures the element with `id="main"`.
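To make the recap concrete, here is a small self-contained sketch you can run. The HTML snippet and its class/ID names are invented purely for illustration; it also previews the `::text` and `::attr(...)` pseudo-selectors from the lab objectives.

```python
from scrapy import Selector

# Toy HTML -- the class and ID names here are made up for this example
toy = Selector(text="""
<div id="main">
  <div class="product"><a href="/bread">Bread</a></div>
  <div class="product"><a href="/milk">Milk</a></div>
</div>
""")

print(toy.css(".product a::text").getall())      # ['Bread', 'Milk']
print(toy.css("div.product a::text").getall())   # same result, but also requires a <div>
print(toy.css("#main a::attr(href)").getall())   # ['/bread', '/milk']
```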
🔗 You can chain the CSS selectors
Barry will show that you could achieve the same result as `response.css("h1 span::text").get()` by chaining multiple CSS selectors:
```python
response.css("h1").css("span")
response.css("h1").css("span::text")
```

⚠️ WARNING: if you need/want to chain selectors, do not call `.get()` too early. Once you call `.get()`, you get a string back. You cannot keep selecting inside a string.
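To see the warning in action, here is a short sketch (it assumes the `response` object built above):

```python
# Still a SelectorList, so we can keep chaining
heading = response.css("h1")
print(heading.css("span::text").get())

# After .get() we only have a plain string -- there is no more .css()
html_string = response.css("h1").get()
# html_string.css("span")  # AttributeError: 'str' object has no attribute 'css'
```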
Section 2: Practise with Scrapy selectors (30 mins)
🎯 ACTION POINTS
Work in pairs or on your own for this section. Your class teacher will give you the answers at the end of the section.
Action Point 1: Finding the best container
Open the List of foods page in your browser and use Inspect.
Your task: Find the `<div>` closest to the list of foods.
Confirm that you have found the right container by replacing the `?` with your selector and running the code.

```python
div_selector = "?"
response.css(div_selector).get()
```

💡 TIP: for classes you can use `div.some-class` or `.some-class`.
Action Point 2: what does `>` do?
What does this particular selector (`div > ul > li`) return?

```python
response.css("div > ul > li")
```

Does it return all the listed foods of the page? Or does it return something else?
Your task: if your answer to the question above is "no", write a more specific selector that returns only the listed foods.
Action Point 3: remove the `>` and compare
Remove the `>` symbols and run the selector again.

```python
response.css("div ul li")
```

Your task: Write down an explanation of why this selector returns something different from the previous one.
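If you want to see the difference between the two combinators in isolation (without spoiling the Wikipedia answer), you can experiment on a toy snippet like this one. The HTML is invented for illustration.

```python
from scrapy import Selector

toy = Selector(text="""
<div>
  <ul>
    <li>top-level item
      <ul><li>nested item</li></ul>
    </li>
  </ul>
</div>
""")

# `>` matches direct children only: the nested <ul> sits inside an <li>,
# not directly inside the <div>, so its <li> is excluded
print(len(toy.css("div > ul > li")))  # 1

# A space matches descendants at any depth, so the nested <li> counts too
print(len(toy.css("div ul li")))      # 2
```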
🏆 (Optional) Challenge: keep the nested structure
Some items have nested lists (e.g. Plant milk has Almond milk, Coconut milk, etc.).
Your task: If you have the time, try to write a solution using a loop to convert that nested HTML structure (the list of `<li>` elements inside other `<li>` elements) into a nested JSON structure.
🤖 Ask the robot for help: You could use the DS205 Claude bot to help you design the approach. Can you get Claude to write the neatest/shortest code possible without over-engineering the solution?
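If you want a starting point before (or after) asking Claude, here is one possible shape for a recursive approach, demonstrated on an invented snippet rather than the real page so it does not give away the earlier answers:

```python
from scrapy import Selector

# Toy snippet standing in for the Wikipedia structure
toy = Selector(text="""
<ul>
  <li>Bread</li>
  <li>Plant milk
    <ul>
      <li>Almond milk</li>
      <li>Coconut milk</li>
    </ul>
  </li>
</ul>
""")

def li_to_dict(li):
    # Text belonging directly to this <li>, not to any nested list
    name = li.xpath("./text()").get(default="").strip()
    # Recurse only into <li> elements of a directly nested <ul>
    children = [li_to_dict(child) for child in li.xpath("./ul/li")]
    return {"name": name, "children": children}

nested = [li_to_dict(li) for li in toy.css("body > ul > li")]
print(nested)
```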
Section 3: Time to switch to Selenium (remaining time)
This section is a TEACHING MOMENT
Your class teacher will show how to use Selenium with Chromium on Nuvolos, then connect the same selector thinking back to Scrapy.
Now open the second notebook of the week and follow along with Barry:
`W02-NB02-Lab-Selenium.ipynb`
That notebook contains:
- a Scrapy vs Selenium cheat sheet
- working Selenium code for the same Wikipedia tasks from Section 2 (solutions included)
- a short demo of clicking and navigation
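As a reference, here is a minimal sketch of the Selenium equivalent of the Scrapy extraction. Note that Selenium has no `::text` or `::attr(...)` pseudo-selectors; you call `.text` and `.get_attribute(...)` on the returned elements instead. The URL is the same assumption as before, and on Nuvolos the preconfigured app should already know where Chromium lives; elsewhere you may need to set `options.binary_location` to your Chromium binary.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # drop this line to watch the browser work

driver = webdriver.Chrome(options=options)
driver.get("https://en.wikipedia.org/wiki/List_of_foods")  # assumed URL, as above

# Same selector thinking as Scrapy, but against a live browser
for li in driver.find_elements(By.CSS_SELECTOR, "div ul li")[:5]:
    print(li.text)

driver.quit()
```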
Your Problem Set 1 (remaining time or take-home)
While the full instructions will come a bit later, you can already start to build your Problem Set 1.
- Create a separate Jupyter Notebook or Python script to build your scraper.
- Go to Waitrose and collect the list of products and their links from: https://www.waitrose.com/ecom/shop/browse/groceries
- Decide whether you need `scrapy` or `selenium` to extract product names and URLs into a Python list or DataFrame.
- Write an initial version of your scraper (a hedged starting skeleton follows below).
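To get you moving, here is one possible skeleton for the Selenium route. The CSS selector and output filename are placeholders, not the real ones: inspect the page and substitute whatever actually matches the links you want.

```python
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.waitrose.com/ecom/shop/browse/groceries")

# PLACEHOLDER selector -- inspect the page to find the real one
anchors = driver.find_elements(By.CSS_SELECTOR, "a.category-link")

data = [
    {"category_name": a.text, "link": a.get_attribute("href")}
    for a in anchors
]
driver.quit()

# Save in the {category_name, link} shape from the lab roadmap
with open("waitrose_categories.json", "w") as f:
    json.dump(data, f, indent=2)
```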