🗓️ Week 05
Topics in Web Scraping: XPath Selectors, Item Pipelines, and Dynamic URL Discovery

DS205 – Advanced Data Manipulation

17 Feb 2025

1️⃣ Review and Coding Tips

10:03 – 10:15

A few unusual Python concepts that might have left you a bit puzzled, plus some new debugging tools.

Why Python Scripts vs Notebooks?

You will probably have found it unusual that we're not using Jupyter Notebooks much in this course. That's because Jupyter Notebooks are not a great fit for code that needs to run in a production environment.

👉 We want our code to run without a graphical user interface

Jupyter Notebooks

  • Great for exploration
  • Interactive development
  • Visual outputs
  • Documentation
  • Reporting tool

Python Scripts

  • Production code
  • Version control
  • Command line usage
  • Automation
  • Fewer dependencies

The yield Keyword in Scrapy

You are aware of lists in Python, right? But maybe you don’t know about generators?

In the words of the Python documentation:

Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.

This is useful when you want to process items one at a time, without loading them all into memory. It’s a type of lazy evaluation, a concept from functional programming in computer science.

Example

def parse(self, response):
    # yield hands this item to the Scrapy engine and keeps the function's state
    yield {
        'title': response.css('h1::text').get(),
        'date': response.css('.date::text').get()
    }

The function above does not return the item immediately. Instead, it yields the item to the Scrapy engine, which processes it later.

In the end, as the user of the function, you perceive this as if it were returning a kind of list of items for you (technically, an iterator).
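If you want to see the same idea outside Scrapy, here is a tiny sketch of a generator in plain Python (the function is made up purely for illustration):

def count_up_to(n):
    """A generator: it produces numbers one at a time, lazily."""
    i = 1
    while i <= n:
        yield i   # hand back one value, then pause right here
        i += 1

numbers = count_up_to(3)   # nothing has been computed yet
print(next(numbers))       # 1
print(next(numbers))       # 2
print(list(numbers))       # [3]  (consumes whatever is left)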

Debugging with ipdb

Key Commands

def parse(self, response):
    # I typically import it on the same line
    # as I will only use it temporarily
    # I just delete the line before committing
    import ipdb;ipdb.set_trace()
    
    title = response.css('h1::text')
    # ...

Common Commands

  • n: Next line
  • s: Step into function
  • c: Continue execution
  • p variable: Print variable
  • ll: List source code
  • q: Quit debugger

Debugging Tools Compared

print()

def parse(self, response):
    print(f"URL: {response.url}")
    print(f"Found: {item}")
  • ➕ Simple to use
  • ➕ Always visible
  • 👎 Clutters the output
  • 👎 No way to indicate the severity of the message
  • 👎 Hard to disable

ipdb

def parse(self, response):
    import ipdb; ipdb.set_trace()
    item = response.css('h1')
  • ➕ Interactive debugging
  • ➕ Inspect what is in each variable at that point in the code
  • ➕ Step through code
  • 👎 Temporary use
  • 👎 Stops execution

Logging

You can also monitor your Python code using the Python logging library.

import logging

logger = logging.getLogger()

def parse(self, response):
    # INFO shows general information
    logger.info("Starting parse")
    # ... some code ...

    try:
        ...  # some code that might fail
    except Exception as e:
        logger.error(f"Failed: {str(e)}")

Logging is a more professional way to monitor your code. You can enable different severity levels, making it easier to filter later on.
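For instance, here is a minimal sketch of how you could control which severity levels are actually shown (the exact configuration is up to you):

import logging

# Show INFO and above; DEBUG messages are filtered out
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)
logger.debug("This will NOT be shown")
logger.info("This WILL be shown")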

Using Logging in Scrapy

import logging

from scrapy import Spider

logger = logging.getLogger(__name__)

class MySpider(Spider):
    def parse(self, response):
        # Different severity levels
        logger.debug("Detailed info for debugging")
        logger.info("General information")
        logger.warning("Something unexpected")
        logger.error("Something failed")

        try:
            ...  # scraping code that might fail
        except Exception as e:
            logger.error(f"Failed: {str(e)}")

💡 Logging is the professional way to monitor your spiders

(ADVANCED) Logging Configuration Example

You can customise how the logs are displayed. Here is the way I like to do it. I create a custom formatter that adds colours and mimics the Scrapy logging style:

class ColorFormatter(logging.Formatter):
    green = "\033[32m"
    reset = "\033[0m"

    FORMATS = {
        logging.INFO: green +
            "%(asctime)s [%(name)s] %(levelname)s: "
            "%(message)s" + reset
    }

    def __init__(self):
        super().__init__(datefmt="%Y-%m-%d %H:%M:%S")

    def format(self, record):
        # Use the coloured template for this level, or fall back to a plain one
        log_fmt = self.FORMATS.get(
            record.levelno,
            "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
        )
        return logging.Formatter(log_fmt, datefmt=self.datefmt).format(record)

The string "\033[32m" is the ANSI escape code for green. ANSI is a standard for controlling the formatting of text output on terminals.
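Just to see ANSI codes in action on their own (a throwaway sketch, nothing Scrapy-specific):

GREEN = "\033[32m"
RESET = "\033[0m"

# Prints the word in green on terminals that understand ANSI escape codes
print(GREEN + "Success" + RESET)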

How to use the formatter

import logging

from somewhere import ColorFormatter

logger = logging.getLogger(__name__)

# A formatter is attached to a handler; the handler is attached to the logger
handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter())
logger.addHandler(handler)

Logging Output Example

This way, instead of seeing:

Parsing https://climateactiontracker.org/countries/brazil/
Successfully parsed data for Brazil

You see a colourful output that is more like this:

2025-02-16 12:21:08 [climate_tracker.spiders] INFO: Parsing https://climateactiontracker.org/countries/brazil/
2025-02-16 12:21:09 [climate_tracker.spiders] DEBUG: Successfully parsed data for Brazil

2️⃣ Web Scraping Selectors: Beyond CSS Selectors

10:15 – 10:40

Back to the world of web scraping…

CSS Selectors

Remember how last week we used CSS selectors to extract data?

# Using CSS selectors
title = response.css('h1::text').get()
rating = response.css('.ratings-matrix__overall dd::text').get()

The same with XPath

There’s another way to select elements: XPath.

It’s more powerful but also more verbose 👇

# Using XPath selectors
title = response.xpath('//h1/text()').get()
rating = response.xpath('//div[@class="ratings-matrix__overall"]/dd/text()').get()

CSS vs XPath Selectors

CSS Selector

# Select by class
response.css('.intro')

# Select text
response.css('p::text')

# Can't select parent! 😢
# No way to do this in CSS

XPath Selector

# Select by class
response.xpath('//p[@class="intro"]')

# Select text
response.xpath('//p/text()')

# Select parent! 🎉
response.xpath('//p/parent::div')

XPath Power Features

# Find elements containing specific text
response.xpath('//p[contains(text(), "climate")]')

# Complex conditions
response.xpath('//div[@class="rating" and @data-value > 5]')

# Navigate up the tree
response.xpath('//span[@class="price"]/ancestor::div')

# Select nth child
response.xpath('//ul/li[2]')  # second list item

💡 We’ll use both CSS and XPath in our spiders - each has its strengths

CSS Selector Cheatsheet

Common CSS Selectors

h1                  /* Element type */
.intro              /* Class */
#title              /* ID */
div p               /* Descendant */
div > p             /* Direct child */
img[alt]            /* Has attribute */
[data-value='10']   /* Attribute value */
p:first-child       /* First child */

Scrapy-specific Pseudo-elements

response.css('p::text').get()      # Extract text content
response.css('a::attr(href)').get()  # Extract attribute value

Note there is a difference between response.css('p::text').get() and response.css('p ::text').get(): the first extracts only the text nodes sitting directly inside the <p> element, while the second (note the space) extracts text from all of the <p>'s descendants.
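Here is a minimal sketch of that difference, using Scrapy's Selector on a made-up HTML snippet:

from scrapy.selector import Selector

sel = Selector(text="<p>Hello <b>world</b></p>")

sel.css('p::text').getall()    # ['Hello ']            (direct text only)
sel.css('p ::text').getall()   # ['Hello ', 'world']   (text of all descendants)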

Understanding CSS Nesting

The > Combinator

<span>
    <em>            <!-- selected by both 'span *' and 'span > *' -->
        <strong>    <!-- selected by 'span *' only -->
            text    <!-- selected by 'span *' only -->
        </strong>
    </em>
    <b>             <!-- selected by both 'span *' and 'span > *' -->
        <i>         <!-- selected by 'span *' only -->
            more    <!-- selected by 'span *' only -->
        </i>
    </b>
</span>

In Scrapy

response.css('span *')     # All descendants
response.css('span > *')   # Direct children only

💡 The space in 'span *' means “any descendant”, while 'span > *' means “direct child”
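To convince yourself, here is a throwaway sketch with Scrapy's Selector, using the HTML above and printing only the element tags:

from scrapy.selector import Selector

html = "<span><em><strong>text</strong></em><b><i>more</i></b></span>"
sel = Selector(text=html)

[s.root.tag for s in sel.css('span *')]    # ['em', 'strong', 'b', 'i']
[s.root.tag for s in sel.css('span > *')]  # ['em', 'b']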

XPath Cheatsheet

XPath looks more like working with paths in a file system.

Basic Selection

//div                    # Any div anywhere

//div/p                  # Direct child p of div

//div//p                 # Any p descendant of any div
                         # no matter how deep

Attributes and Text

//p[@class='intro']     # p with class 'intro'

//p[contains(@class,'intro')]  # p with class containing 'intro'

//p[text()='Hello']     # p containing exact text

XPath Cheatsheet (continued)

Navigation

//div/parent::*         # Parent of div

//div/following::p      # p elements after div

//div/ancestor::section # Any section ancestor of div

Indexing

//p[1]                  # First p (index starts at 1!)

//p[last()-1]          # Second-to-last p

//ul/li[position()=2]   # Second li in ul

//div[p][1]             # First div that has a p child

(//p)[1]                # First p in entire document (it starts at 1)

//div//p[2]             # Second p within each div

3️⃣ Making our code more robust

10:40 – 11:05

Just like we did with our API in Weeks 2 & 3, we need to make our spider more robust. We will do this by:

  1. Testing our spiders with Scrapy Contracts
  2. Validating data with Scrapy Items

Testing Selectors

Interactive Testing

def parse(self, response):
    ipdb.set_trace()
    
    # Test in console
    response.xpath('//p')
    response.css('p')

Unit Testing

def test_title_extraction(self):
    selector = response.xpath('//h1/text()')
    h1_text  = selector.get()
    
    self.assertEqual(
        h1_text,
        'Expected Title'
    )
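If you want to unit test selectors without hitting the network, one option is to build a fake response from a raw HTML string (a sketch with made-up HTML; the helper name is arbitrary):

from scrapy.http import HtmlResponse

def fake_response(html, url="https://example.com"):
    """Wrap a raw HTML string in a response we can run selectors against."""
    return HtmlResponse(url=url, body=html, encoding="utf-8")

response = fake_response("<html><body><h1>Expected Title</h1></body></html>")
assert response.xpath('//h1/text()').get() == 'Expected Title'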

Traditional Unit Testing

Python Unit Tests

# test_api.py

def test_get_user():
    response = client.get("/users/1")
    assert response.status_code == 200
    assert "name" in response.json()

Remember the ascor-api tests?

Key Concepts

  • Test one thing at a time

  • Arrange-Act-Assert pattern

  • Mock external dependencies

  • Clear test names

  • Isolated tests

Why Not Just pytest?

Challenges

  • Network dependencies

  • Dynamic content

  • Rate limiting

  • State management

  • Complex setup

Web Scraping Needs

  • Test selectors

  • Validate data formats

  • Check pagination

  • Handle failures

  • Test pipelines

Enter Scrapy Contracts


def parse(self, response):
    """Extract data from country pages.
    @url https://climateactiontracker.org/countries/brazil/
    @returns items 1 1
    @scrapes country_name overall_rating flag_url
    """
    # ... spider code ...

💡 Contracts are docstring-based tests

Built-in Contracts

Basic Contracts

  • @url - Test URL

  • @returns - Expected output

  • @scrapes - Required fields

  • @cb_kwargs - Callback args

  • @meta - Request metadata

Example Usage

"""
@url http://example.com
@returns items 1 1
@returns requests 0 0
@scrapes title price
"""

Custom Contracts

import pycountry

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class ValidCountryContract(Contract):
    """Check if country_name is a valid country.
    @valid_country
    """
    name = 'valid_country'

    def post_process(self, output):
        for item in output:
            search_country = pycountry.countries.get(name=item['country_name'])

            if not search_country:
                raise ContractFail(
                    f"Invalid country name: {item['country_name']}"
                )

Running Contract Tests

Command

scrapy check

Tests all contracts in all spiders in this Scrapy project

Configuration

# settings.py
SPIDER_CONTRACTS = {
    'climate_tracker.contracts.ValidCountryContract': 100
}

Contracts vs Unit Tests

Unit Tests

  • More flexible

  • Better for complex logic

  • Can mock dependencies

  • Standard Python tools

  • IDE integration

Contracts

  • Spider-specific

  • Built into Scrapy

  • Tests real responses

  • Simpler to write

  • Self-documenting

Data Validation with Scrapy Items

API Models (W02-W03)

from pydantic import BaseModel, Field

class Country(BaseModel):
    name: str
    rating: str
    flag_url: str = Field(pattern=r'^https?://.+')

Pydantic models served as a good example of how to enforce data structure.

Scrapy Items

from scrapy import Item, Field

class CountryClimateItem(Item):
    name = Field(
        serializer=str,
        required=True
    )
    rating = Field(
        serializer=str,
        choices=['Insufficient', 
                 'Compatible']
    )

Scrapy Items serve a very similar purpose to Pydantic models; the syntax is just different. One caveat: keys such as required or choices above are stored as field metadata rather than enforced automatically, so you typically check them yourself, for example in an item pipeline (as we do later in this lecture).
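One thing Items do enforce out of the box is the set of allowed field names. A quick sketch:

item = CountryClimateItem()
item['name'] = 'Brazil'       # fine: 'name' is a declared field
item['population'] = 123      # raises KeyError, because 'population' was never declared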

Why Use Items?

Benefits

  • Type validation
  • Required fields
  • Field choices
  • Documentation
  • IDE support

Usage

def parse(self, response):
    item = CountryClimateItem()
    item['name'] = response.css('h1::text').get()
    return item

🍵 Quick Coffee Break

11:05 – 11:15

After the break:

  • Item Pipelines for Data Processing
  • Dynamic URL Discovery & Crawling
  • Best Practices & Common Pitfalls

4️⃣ Data Processing with Item Pipelines

11:15 – 11:35

  • Spider -> Item -> Pipeline -> Output

1. Adding a data validation pipeline

from scrapy.exceptions import DropItem

from climate_tracker.items import CountryClimateItem  # wherever your Item class lives


class ValidateItemPipeline:

    def process_item(self, item, spider):

        if not isinstance(item, CountryClimateItem):
            raise DropItem(f"Unknown item type: {type(item)}")

        # Suppose we have some validation logic for the rating
        if item['overall_rating'] not in ['Insufficient', 'Compatible']:
            raise DropItem(f"Invalid rating: {item['overall_rating']}")
            
        return item

Then add this to your settings.py:

# Check settings.py is configured
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 300
}

Debugging Pipelines

Add logging to your pipeline to help debug.

import logging

from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class ValidateItemPipeline:

    def process_item(self, item, spider):

        logger.debug(f"Validating item: {item}")
        if not isinstance(item, CountryClimateItem):
            logger.error(f"Item {item} is not a valid CountryClimateItem")
            raise DropItem(f"Unknown item type: {type(item)}")

        # Suppose we have some validation logic for the rating
        if item['overall_rating'] not in ['Insufficient', 'Compatible']:
            error_msg = (
                f"Dropping item {item} because "
                f"it has an invalid rating: {item['overall_rating']}"
            )
            logger.error(error_msg)
            raise DropItem(error_msg)
            
        return item

Running & Checking

You can run your spider with a specific type of log level.

# Run with logging enabled
scrapy crawl spider_name --logfile log.txt

# Run with debug level
scrapy crawl spider_name -L DEBUG

🐞 Current bug: the log level is not being applied to our custom logging, only to Scrapy’s default logging. If anyone finds a fix, send a PR!

Pipeline Lifecycle

When Pipelines Run

  1. Spider yields item
  2. Each pipeline processes in order
  3. Items can be:
    • Modified
    • Dropped
    • Passed through unchanged (see the sketch after these lists)

Common Issues

  • Pipeline not in settings
  • Wrong priority number
  • Not returning item
  • Incorrect field names
  • Missing error handling
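Here is a minimal sketch of a pipeline that modifies items rather than dropping them (the field name matches our hypothetical CountryClimateItem):

class NormaliseCountryNamePipeline:
    """Tidy up the country name before later pipelines see it."""

    def process_item(self, item, spider):
        if item.get('country_name'):
            # Strip stray whitespace and normalise capitalisation
            item['country_name'] = item['country_name'].strip().title()
        # Always return the (possibly modified) item so the next pipeline receives it
        return item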

2. Adding a pipeline for downloading images

import logging

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class CountryFlagsPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        """Request SVG download if URL is present."""
        if item.get('flag_url'):
            yield Request(item['flag_url'])

    def file_path(self, request, response=None, info=None, *, item=None):
        """Generate file path for storing the SVG."""
        country = item['country_name'].lower().replace(' ', '_')
        return f'flags/{country}.svg'

    def item_completed(self, results, item, info):
        """Update item with local file path after download."""
        if results and results[0][0]:  # if success
            item['flag_path'] = results[0][1]['path']
            logger.debug(f"Downloaded flag for {item['country_name']}")
        else:
            logger.warning(f"Failed to download flag for {item['country_name']}")
            item['flag_path'] = None
        return item

3. Remember to add the pipeline to the settings

# settings.py
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 100,
    'climate_tracker.pipelines.CountryFlagsPipeline': 300
}

# Lower numbers = higher priority
# Range: 0-1000

Run a single spider callback against one URL, with the pipelines enabled

scrapy parse --spider=spider_name --pipelines url

4. Control the output format

We can use Scrapy's Feed Exports to control the output format. They play a similar role to a custom output pipeline, but they come built into Scrapy, so you don't need to write any code for common formats.

# settings.py
FEEDS = {
    'data/output.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
        'store_empty': False,
        'overwrite': True,
    },
}

💡 Use built-in feed exports instead of custom pipelines when possible

You can also specify the output file when running the spider:

scrapy crawl spider_name -O output.jsonl

Common Gotchas of Item Pipelines

⚠️ Common gotchas:

  • Forgetting to add pipelines to settings
  • Not handling file resources properly
  • Wrong pipeline order
  • Missing error handling

5️⃣ Dynamic URL Discovery

11:35 – 12:00

From Static to Dynamic URLs

Static Approach

class MySpider(Spider):
    start_urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

Dynamic Discovery

# Start from a single URL
start_urls = [
    'https://example.com/countries/'
]

def parse(self, response):
    # Find all country links
    for href in response.css(
        '.country-link::attr(href)'
    ):
        yield response.follow(
            href,
            self.parse_country
        )

The .follow() method is used to follow a link and call a different callback function to parse the response.
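If you are following many links with the same callback, newer versions of Scrapy also provide response.follow_all(), which does the loop for you. A sketch, reusing the same hypothetical .country-link selector:

def parse(self, response):
    # Follow every country link and parse each page with parse_country
    yield from response.follow_all(
        css='.country-link::attr(href)',
        callback=self.parse_country
    )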

Callback Functions

Main Parser

def parse(self, response):
    """Find all country URLs."""
    for href in response.css(
        '.country-link::attr(href)'
    ).getall():
        yield response.follow(
            href, 
            self.parse_country
        )

Detail Parser

def parse_country(self, response):
    """Extract country data."""
    return {
        'name': response.css(
            'h1::text'
        ).get(),
        'rating': response.css(
            '.rating::text'
        ).get()
    }

💡 Each callback function has a specific responsibility

Handling Pagination

By the way, you can also handle pagination in your callback functions.

def parse(self, response):
    # Process current page
    for item in response.css('.item'):
        yield self.parse_item(item)
    
    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    
    if next_page:
        yield response.follow(next_page, self.parse)

def parse_item(self, item_selector):
    return {
        'title': item_selector.css('h2::text').get(),
        'date': item_selector.css('.date::text').get()
    }

💡 Separating item parsing into its own method makes the code more maintainable

Best Practices

URL Management

  • Start from index pages
  • Follow internal links only
  • Implement rate limiting
  • Handle URL parameters
  • Check URL validity

Error Handling

  • Handle network errors
  • Implement retries (see the sketch after this list)
  • Log exceptions
  • Validate data
  • Monitor performance
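A minimal sketch of what some of these practices look like in Scrapy (the settings are real Scrapy settings; the spider, selectors and URLs are made up):

# settings.py -- rate limiting and automatic retries
DOWNLOAD_DELAY = 1.0          # wait between requests
AUTOTHROTTLE_ENABLED = True   # adapt the delay to how the server responds
RETRY_ENABLED = True
RETRY_TIMES = 2               # retry failed requests a couple of times

Attaching an errback lets you log failed requests instead of silently losing pages:

import scrapy

class CountrySpider(scrapy.Spider):
    name = 'countries'
    start_urls = ['https://example.com/countries/']

    def parse(self, response):
        for href in response.css('.country-link::attr(href)').getall():
            yield response.follow(href, self.parse_country,
                                  errback=self.handle_error)

    def parse_country(self, response):
        yield {'name': response.css('h1::text').get()}

    def handle_error(self, failure):
        # Called for DNS errors, timeouts, HTTP errors, etc.
        self.logger.error(repr(failure))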

What’s Next?

THE END

  1. Join us in the lab for a demo of Selenium.
  2. Practice with the 🗓️ W04-W05 Formative Exercise
  3. Review the Scrapy documentation on the topics covered today
  4. Start thinking about your Summative Exercise options