💻 Week 03 Lab
From Notebook to Scrapy Project
By the end of this lab, you should be able to:
- Structure your scraper as a Scrapy project rather than a single notebook
- Use Selenium’s waiting mechanisms for dynamic content
- Integrate Selenium into a Scrapy crawler via middleware
- Make meaningful progress on Problem Set 1
This lab is about making progress on ✍️ Problem Set 1. The guides below cover techniques you will likely need. Work through them as they become relevant to your scraper.
📋 Preparation
- Review the 🖥️ Week 03 Lecture, particularly Guide 2 on Scrapy project structure
- On Nuvolos, open the VS Code + Chromium + Selenium app
- Make sure your food conda environment is activated
🛣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Section 0 | 👤 Teaching Moment | Check-in and troubleshooting | 10 min | Everyone has a working setup |
| Section 1 | 👤 Teaching Moment | Selenium waiting patterns | 20 min | You understand WebDriverWait and expected_conditions |
| Section 2 | ⏸️ Self-paced | Problem Set 1 work | Remaining time | Progress on your scraper |
👉 NOTE: Whenever you see a 👤 TEACHING MOMENT, this means your class teacher deserves your full attention!
Section 0: Check-in (10 mins)
This section is a TEACHING MOMENT
Your class teacher will check where everyone is with Problem Set 1 and help troubleshoot common issues.
🎯 ACTION POINT
Answer these questions (to yourself or with someone nearby):
- Have you cloned your problem-set-1-<username> repository?
- Have you extracted at least some data from Waitrose?
If the answer to any of these is “no”, that’s fine. Use this lab to catch up. Raise your hand if you need help.
Section 1: Selenium Waiting Patterns (20 mins)
This section is a TEACHING MOMENT
Your class teacher will demonstrate how to wait for dynamic content in Selenium, then show how this fits into a Scrapy project.
📖 Guide 3: Selenium Patterns for Dynamic Content
Refer to 🖥️ W03 Lecture for the previous guides of this week: 📖 Guides 1 & 2.
If you are using Selenium and you call driver.get(url), that is the moment the browser starts loading the page and fetching all the other pieces of content to be displayed. You might also need to click on something (a button, a menu item, etc.) to reveal the content you want to extract. In either case, the content you are after may not be in the page yet by the time your next line of code runs, so you need to tell Selenium what to wait for.
3.1 Waiting for dynamic content
If the thing you want to extract is not yet loaded on the page and you don’t tell Selenium to wait, it will raise a NoSuchElementException. The naive fix is time.sleep(5): it sometimes works, but it wastes time when the page loads quickly and still fails when the page loads slowly.
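For illustration, this is roughly what the naive pattern looks like (the URL and the button.load-more selector are placeholders, not the real Waitrose markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL
time.sleep(5)  # fixed wait: wasted if the page loads in 1 second, still too short if it needs 6
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")  # raises NoSuchElementException if the button is not there yet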
A better approach when using Selenium is to wait for a specific condition rather than a fixed time.
Meet the expected_conditions part of Selenium
Selenium provides a library of conditions you can wait for. Import it like this:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Here are some conditions you might want to use when scraping a page:
# Element exists in the DOM (may not be visible)
EC.presence_of_element_located((By.CSS_SELECTOR, "button.load-more"))
# Element is visible AND enabled (can be clicked)
EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more"))
# Element is visible on the page
EC.visibility_of_element_located((By.CSS_SELECTOR, "div.product"))

Note the double parentheses. The condition function takes a tuple (By.CSS_SELECTOR, "...") as its argument.
🗒️ NOTE: If you need to make sure something is present on the page, the first thing you need to do is to figure out a CSS selector that will uniquely identify the element you are looking for. Then, you can use one of the conditions above to tell Selenium to ensure it is present on the page before the code continues.
Telling Selenium to wait with WebDriverWait
But what if that button/menu item takes a few seconds to show up or to become clickable? In this case, you should combine the condition with a wait.
WebDriverWait repeatedly checks whether a specific condition is met, until it becomes true (or the wait runs out of time):
from selenium.webdriver.support.ui import WebDriverWait
wait = WebDriverWait(driver, 10) # wait up to 10 seconds
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more")))

If the button becomes clickable within 10 seconds, wait.until() returns the element. If not, it raises TimeoutException.
What if something goes wrong? Handling exceptions
If something unexpected happens, say the button takes longer than 10 seconds to become clickable, or it never shows up on the page at all, Selenium will throw an exception. You need to handle these exceptions to stop your code from crashing.
from selenium.common.exceptions import TimeoutException, NoSuchElementException

- TimeoutException: the condition never became true within the timeout
- NoSuchElementException: you tried to find an element that doesn’t exist (without waiting)
To avoid your code crashing, wrap the wait in a try-except block. A common pattern you can copy is the following:
try:
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more"))
    )
    button.click()
except TimeoutException:
    # No more "load more" button; we've reached the end
    pass

Use NoSuchElementException for optional elements:
try:
    cookie_banner = driver.find_element(By.CSS_SELECTOR, "button.reject-cookies")
    cookie_banner.click()
except NoSuchElementException:
    # No cookie banner on this page
    pass

3.2 Clicking elements that won’t cooperate
Sometimes .click() fails even though the element exists. Common reasons:
- Another element covers it (a modal, a sticky header)
- The element is outside the viewport
A workaround is to click via JavaScript. This bypasses Selenium’s checks and clicks directly. Reserve it for when normal clicking fails, but prefer normal .click() when it works.
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
driver.execute_script("arguments[0].click();", button)
3.3 Complete example: a “load more” loop
This pattern loads all products on a page that uses a “load more” button:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
driver = webdriver.Chrome()
driver.get("https://www.waitrose.com/ecom/shop/browse/groceries/bakery")
# Dismiss cookie banner if present
try:
    reject_btn = driver.find_element(By.CSS_SELECTOR, 'button[data-testid="reject-all"]')
    reject_btn.click()
except NoSuchElementException:
    pass
# Click "load more" until it disappears
while True:
    try:
        load_more = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[data-actiontype="load"]'))
        )
        driver.execute_script("arguments[0].click();", load_more)
    except TimeoutException:
        break  # No more button; done loading
# Now extract all products
products = driver.find_elements(By.CSS_SELECTOR, "...your CSS selector here...")
print(f"Found {len(products)} products")3.4 Running Selenium within a Scrapy crawler
For ✍️ Problem Set 1, you are expected to submit a Scrapy project rather than a notebook. This means that if you need to use Selenium, you will have to adhere to the Scrapy project structure and use the Scrapy command-line tools.
The advantage of this approach is that you can use Selenium for navigation (clicking “load more”, handling JavaScript) while relying on Scrapy’s more robust framework for structured extraction and pipelines.
There are several ways to set this up. The approach we prefer is a Scrapy middleware that intercepts requests, loads them in Selenium (rather than through Scrapy’s default downloader), and returns the rendered HTML to your spider.
Add this to your middlewares.py:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
class SeleniumMiddleware:
    """Middleware that uses Selenium to render JavaScript-heavy pages."""

    def __init__(self):
        self.driver = None

    @classmethod
    def from_crawler(cls, crawler):
        """Connect the spider_opened/spider_closed signals so Scrapy actually calls them."""
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        """Create the browser when the spider starts."""
        options = Options()
        # options.add_argument("--headless")  # Uncomment this line to use headless mode if you think your website allows it
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=options)

    def spider_closed(self, spider):
        """Close the browser when the spider finishes."""
        if self.driver:
            self.driver.quit()

    def process_request(self, request, spider):
        """Intercept requests and load them in Selenium."""
        self.driver.get(request.url)

        # Add any waiting logic here (WebDriverWait, clicking, etc.)
        # For example:
        # WebDriverWait(self.driver, 10).until(...)

        body = self.driver.page_source
        return HtmlResponse(
            url=request.url,
            body=body,
            encoding='utf-8',
            request=request
        )

Enable it in settings.py:
🗒️ NOTE: The number 500 is the priority of the middleware. Set a unique number for each middleware you enable.
DOWNLOADER_MIDDLEWARES = {
    'supermarkets.middlewares.SeleniumMiddleware': 500,
}

For Problem Set 1, a single driver for the whole crawl is usually fine. If you notice strange behaviour after many requests, try clearing cookies with driver.delete_all_cookies().
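If you do run into that, one possible variation (a sketch, not a requirement; the class name and the threshold of 50 are arbitrary choices) is to subclass the middleware above and clear cookies periodically:

class CookieClearingSeleniumMiddleware(SeleniumMiddleware):
    """Same as SeleniumMiddleware, but clears cookies every 50 requests."""

    def __init__(self):
        super().__init__()
        self.request_count = 0

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % 50 == 0:
            self.driver.delete_all_cookies()  # reset cookie state during a long crawl
        return super().process_request(request, spider)

Subclassing keeps the sketch short; you could equally add the counter directly to SeleniumMiddleware. Remember to point DOWNLOADER_MIDDLEWARES at whichever class you actually use.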
Section 2: Problem Set 1 Work (Remaining Time)
🎯 ACTION POINT
Use the remaining lab time to make progress on your scraper. Choose the path that matches your current state:
Path A: You have some scraping code in a notebook
Port your code to a Scrapy project:
From your repository root, run:
scrapy startproject supermarkets ./scraper/
cd scraper
scrapy genspider waitrose www.waitrose.com

Open supermarkets/spiders/waitrose.py and move your extraction logic into the parse() method. Test with scrapy crawl waitrose and fix errors as they appear.

Move step by step. Don’t try to port everything at once.
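To give you an idea of where your extraction logic ends up, here is a minimal sketch of a spider. The start URL and every CSS selector in it are placeholders; replace them with ones that match the real Waitrose markup:

import scrapy

class WaitroseSpider(scrapy.Spider):
    name = "waitrose"
    allowed_domains = ["www.waitrose.com"]
    start_urls = ["https://www.waitrose.com/ecom/shop/browse/groceries/bakery"]

    def parse(self, response):
        # Placeholder selectors: inspect the page and swap in real ones
        for product in response.css("article.product-pod"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get() or ""),
            }

If you enabled the Selenium middleware above, response here is the rendered HTML that the middleware returned, so the same selectors work on JavaScript-loaded content.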
Path B: You haven’t started scraping yet
You have two options:
Option 1: Start with a Scrapy project directly (follow Path A above, then write your extraction logic in the spider).
Option 2: Prototype in a Jupyter Notebook first, then port later. This can be faster for experimentation, but you will need to restructure before submission.
Either way, your goal today is to extract product data from at least one Waitrose category.
What to work on
Refer to ✍️ Problem Set 1 for the full requirements. Key questions to answer:
- Can Scrapy alone see the product data, or do you need Selenium?
- How will you handle the “load more” button (if present)?
- What data fields will you extract from each product?
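One way to tackle the first question is to fetch a category page with scrapy shell and check whether your selectors return anything without a browser. The URL and selector below are placeholders; an empty result (or a blocked request) is a sign you need Selenium:

scrapy shell "https://www.waitrose.com/ecom/shop/browse/groceries/bakery"
>>> response.css("article.product-pod")   # placeholder selector; an empty list means Scrapy alone cannot see the data
>>> view(response)                        # opens the HTML Scrapy actually received in your browser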
If you get stuck, check the principles we discussed in 🖥️ W03 Lecture. Ask your class teacher or use the DS205 Claude bot.