💻 Week 03 Lab
From Notebook to Scrapy Project
By the end of this lab, you should be able to:
- Structure your scraper as a Scrapy project rather than a single notebook
- Use Selenium’s waiting mechanisms for dynamic content
- Integrate Selenium into a Scrapy crawler via middleware
- Make meaningful progress on Problem Set 1
This lab is about making progress on ✍️ Problem Set 1. The guides below cover techniques you will likely need. Work through them as they become relevant to your scraper.
📋 Preparation
- Review the 🖥️ Week 03 Lecture, particularly Guide 2 on Scrapy project structure
- On Nuvolos, open the VS Code + Chromium + Selenium app
- Make sure your food conda environment is activated
🛣️ Lab Roadmap
| Part | Activity Type | Focus | Time | Outcome |
|---|---|---|---|---|
| Section 0 | 👤 Teaching Moment | Check-in and troubleshooting | 10 min | Everyone has a working setup |
| Section 1 | 👤 Teaching Moment | Selenium waiting patterns | 20 min | You understand WebDriverWait and expected_conditions |
| Section 2 | ⏸️ Self-paced | Problem Set 1 work | Remaining time | Progress on your scraper |
👉 NOTE: Whenever you see a 👤 TEACHING MOMENT, this means your class teacher deserves your full attention!
Section 0: Check-in (10 mins)
This section is a TEACHING MOMENT
Your class teacher will check where everyone is with Problem Set 1 and help troubleshoot common issues.
🎯 ACTION POINT
Answer these questions (to yourself or with someone nearby):
- Have you cloned your problem-set-1-<username> repository?
- Have you extracted at least some data from Waitrose?
If the answer to any of these is “no”, that’s fine. Use this lab to catch up. Raise your hand if you need help.
Section 1: Selenium Waiting Patterns (20 mins)
This section is a TEACHING MOMENT
Your class teacher will demonstrate how to wait for dynamic content in Selenium, then show how this fits into a Scrapy project.
📖 Guide 3: Selenium Patterns for Dynamic Content
Refer to 🖥️ W03 Lecture for the previous guides of this week: 📖 Guides 1 & 2.
If you are using Selenium and you call driver.get(url), that is the moment the browser starts loading the page and fetching all the other pieces of content to be displayed. You might also need to click on something (a button, a menu item, etc.) to reveal the content you want to extract. In either case, the content you are after may not be in the page yet by the time your next line of code runs, so you need to tell Selenium what to wait for.
3.1 Waiting for dynamic content
If the thing you want to extract is not yet loaded on the page and you don’t tell Selenium to wait, it will raise a NoSuchElementException. The naive fix is time.sleep(5): it sometimes works, but it wastes time when the page loads quickly and still fails when the page loads slowly.
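For illustration, this is roughly what the naive pattern looks like (the URL and the button.load-more selector are placeholders, not the real Waitrose markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL
time.sleep(5)  # fixed wait: wasted if the page loads in 1 second, still too short if it needs 6
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")  # raises NoSuchElementException if the button is not there yet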
A better approach when using Selenium is to wait for a specific condition rather than a fixed time.
Meet the expected_conditions part of Selenium
Selenium provides a library of conditions you can wait for. Import it like this:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Here are some conditions you might want to use when scraping a page:
# Element exists in the DOM (may not be visible)
EC.presence_of_element_located((By.CSS_SELECTOR, "button.load-more"))
# Element is visible AND enabled (can be clicked)
EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more"))
# Element is visible on the page
EC.visibility_of_element_located((By.CSS_SELECTOR, "div.product"))

Note the double parentheses. The condition function takes a tuple (By.CSS_SELECTOR, "...") as its argument.
🗒️ NOTE: If you need to make sure something is present on the page, the first thing you need to do is to figure out a CSS selector that will uniquely identify the element you are looking for. Then, you can use one of the conditions above to tell Selenium to ensure it is present on the page before the code continues.
Telling Selenium to wait with WebDriverWait
But what if that button/menu item takes a few seconds to show up or to become clickable? In this case, you should combine the condition with a wait.
WebDriverWait repeatedly checks whether a specific condition is met, until it becomes true (or the wait runs out of time):
from selenium.webdriver.support.ui import WebDriverWait
wait = WebDriverWait(driver, 10) # wait up to 10 seconds
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more")))

If the button becomes clickable within 10 seconds, wait.until() returns the element. If not, it raises TimeoutException.
What if something goes wrong? Handling exceptions
If something unexpected happens, say the button takes longer than 10 seconds to become clickable, or it never shows up on the page at all, Selenium will throw an exception. You need to handle these exceptions to stop your code from crashing.
from selenium.common.exceptions import TimeoutException, NoSuchElementException

- TimeoutException: the condition never became true within the timeout
- NoSuchElementException: you tried to find an element that doesn’t exist (without waiting)
To avoid your code crashing, wrap the wait in a try-except block. A common pattern you can copy is the following:
try:
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more"))
    )
    button.click()
except TimeoutException:
    # No more "load more" button; we've reached the end
    pass

Use NoSuchElementException for optional elements:
try:
    cookie_banner = driver.find_element(By.CSS_SELECTOR, "button.reject-cookies")
    cookie_banner.click()
except NoSuchElementException:
    # No cookie banner on this page
    pass

3.2 Clicking elements that won’t cooperate
Sometimes .click() fails even though the element exists. Common reasons:
- Another element covers it (a modal, a sticky header)
- The element is outside the viewport
A workaround is to click via JavaScript. This bypasses Selenium’s checks and clicks directly. Reserve it for when normal clicking fails, but prefer normal .click() when it works.
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
driver.execute_script("arguments[0].click();", button)
3.3 Complete example: a “load more” loop
This pattern loads all products on a page that uses a “load more” button:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
driver = webdriver.Chrome()
driver.get("https://www.waitrose.com/ecom/shop/browse/groceries/bakery")
# Dismiss cookie banner if present
try:
    reject_btn = driver.find_element(By.CSS_SELECTOR, 'button[data-testid="reject-all"]')
    reject_btn.click()
except NoSuchElementException:
    pass
# Click "load more" until it disappears
while True:
    try:
        load_more = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[data-actiontype="load"]'))
        )
        driver.execute_script("arguments[0].click();", load_more)
    except TimeoutException:
        break  # No more button; done loading
# Now extract all products
products = driver.find_elements(By.CSS_SELECTOR, "...your CSS selector here...")
print(f"Found {len(products)} products")3.4 Running Selenium within a Scrapy crawler
For ✍️ Problem Set 1, you are expected to submit a Scrapy project rather than a notebook. This means that if you need to use Selenium, you will have to adhere to the Scrapy project structure and use the Scrapy command-line tools.
The advantage of this approach is that you can use Selenium for navigation (clicking “load more”, handling JavaScript) while relying on Scrapy’s more robust framework for structured extraction and pipelines.
There are several ways to set this up. The approach we prefer is a Scrapy middleware that intercepts requests, loads them in Selenium (rather than through Scrapy’s default downloader), and returns the rendered HTML to your spider.
Add this to your middlewares.py:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
class SeleniumMiddleware:
    """Middleware that uses Selenium to render JavaScript-heavy pages."""

    def __init__(self):
        self.driver = None

    @classmethod
    def from_crawler(cls, crawler):
        """Connect the spider_opened/spider_closed signals so Scrapy actually calls them."""
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        """Create the browser when the spider starts."""
        options = Options()
        # options.add_argument("--headless")  # Uncomment this line to use headless mode if you think your website allows it
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=options)

    def spider_closed(self, spider):
        """Close the browser when the spider finishes."""
        if self.driver:
            self.driver.quit()

    def process_request(self, request, spider):
        """Intercept requests and load them in Selenium."""
        self.driver.get(request.url)

        # Add any waiting logic here (WebDriverWait, clicking, etc.)
        # For example:
        # WebDriverWait(self.driver, 10).until(...)

        body = self.driver.page_source
        return HtmlResponse(
            url=request.url,
            body=body,
            encoding='utf-8',
            request=request
        )

Enable it in settings.py:
🗒️ NOTE: The number 500 is the priority of the middleware. Set a unique number for each middleware you enable.
DOWNLOADER_MIDDLEWARES = {
    'supermarkets.middlewares.SeleniumMiddleware': 500,
}

For Problem Set 1, a single driver for the whole crawl is usually fine. If you notice strange behaviour after many requests, try clearing cookies with driver.delete_all_cookies().
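If you do run into that, one possible variation (a sketch, not a requirement; the class name and the threshold of 50 are arbitrary choices) is to subclass the middleware above and clear cookies periodically:

class CookieClearingSeleniumMiddleware(SeleniumMiddleware):
    """Same as SeleniumMiddleware, but clears cookies every 50 requests."""

    def __init__(self):
        super().__init__()
        self.request_count = 0

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % 50 == 0:
            self.driver.delete_all_cookies()  # reset cookie state during a long crawl
        return super().process_request(request, spider)

Subclassing keeps the sketch short; you could equally add the counter directly to SeleniumMiddleware. Remember to point DOWNLOADER_MIDDLEWARES at whichever class you actually use.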
Section 2: Problem Set 1 Work (Remaining Time)
🎯 ACTION POINT
Use the remaining lab time to make progress on your scraper. Choose the path that matches your current state:
Path A: You have some scraping code in a notebook
Port your code to a Scrapy project:
From your repository root, run:
scrapy startproject supermarkets ./scraper/
cd scraper
scrapy genspider waitrose www.waitrose.com

Open supermarkets/spiders/waitrose.py and move your extraction logic into the parse() method. Test with scrapy crawl waitrose and fix errors as they appear.

Move step by step. Don’t try to port everything at once.
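To give you an idea of where your extraction logic ends up, here is a minimal sketch of a spider. The start URL and every CSS selector in it are placeholders; replace them with ones that match the real Waitrose markup:

import scrapy

class WaitroseSpider(scrapy.Spider):
    name = "waitrose"
    allowed_domains = ["www.waitrose.com"]
    start_urls = ["https://www.waitrose.com/ecom/shop/browse/groceries/bakery"]

    def parse(self, response):
        # Placeholder selectors: inspect the page and swap in real ones
        for product in response.css("article.product-pod"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get() or ""),
            }

If you enabled the Selenium middleware above, response here is the rendered HTML that the middleware returned, so the same selectors work on JavaScript-loaded content.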
Path B: You haven’t started scraping yet
You have two options:
Option 1: Start with a Scrapy project directly (follow Path A above, then write your extraction logic in the spider).
Option 2: Prototype in a Jupyter Notebook first, then port later. This can be faster for experimentation, but you will need to restructure before submission.
Either way, your goal today is to extract product data from at least one Waitrose category.
What to work on
Refer to ✍️ Problem Set 1 for the full requirements. Key questions to answer:
- Can Scrapy alone see the product data, or do you need Selenium?
- How will you handle the “load more” button (if present)?
- What data fields will you extract from each product?
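One way to tackle the first question is to fetch a category page with scrapy shell and check whether your selectors return anything without a browser. The URL and selector below are placeholders; an empty result (or a blocked request) is a sign you need Selenium:

scrapy shell "https://www.waitrose.com/ecom/shop/browse/groceries/bakery"
>>> response.css("article.product-pod")   # placeholder selector; an empty list means Scrapy alone cannot see the data
>>> view(response)                        # opens the HTML Scrapy actually received in your browser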
If you get stuck, check the principles we discussed in 🖥️ W03 Lecture. Ask your class teacher or use the DS205 Claude bot.