🗓️ Week 05
Topics in Web Scraping: XPath Selectors, Item Pipelines, and Dynamic URL Discovery

DS205 – Advanced Data Manipulation

17 Feb 2025

1️⃣ Review and Coding Tips

10:03 – 10:15

A few unusual Python concepts that might have left you a bit puzzled, plus some new debugging tools.

Why Python Scripts vs Notebooks?

You will probably have found it unusual that we're not using Jupyter Notebooks much in this course. That's because Jupyter Notebooks are not a great fit for code that needs to run in a production environment.

👉 We want our code to run without a graphical user interface

Jupyter Notebooks

  • Great for exploration
  • Interactive development
  • Visual outputs
  • Documentation
  • Reporting tool

Python Scripts

  • Production code
  • Version control
  • Command line usage
  • Automation
  • Fewer dependencies

The yield Keyword in Scrapy

You are aware of lists in Python, right? But maybe you don’t know about generators?

In the words of the Python documentation:

Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.

This is useful when you want to process items one at a time, without loading them all into memory. It’s a type of lazy evaluation, a concept from functional programming in computer science.

Example

def parse(self, response):
    # yield hands this item to the Scrapy engine and keeps the function's state
    yield {
        'title': response.css('h1::text').get(),
        'date': response.css('.date::text').get()
    }

The function above does not return the item immediately. Instead, it yields the item to the Scrapy engine, which processes it later.

In the end, as the user of the function, you perceive this as if it were returning a kind of list of items for you (technically, an iterator).
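If you want to see the same idea outside Scrapy, here is a tiny sketch of a generator in plain Python (the function is made up purely for illustration):

def count_up_to(n):
    """A generator: it produces numbers one at a time, lazily."""
    i = 1
    while i <= n:
        yield i   # hand back one value, then pause right here
        i += 1

numbers = count_up_to(3)   # nothing has been computed yet
print(next(numbers))       # 1
print(next(numbers))       # 2
print(list(numbers))       # [3]  (consumes whatever is left)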

Debugging with ipdb

Key Commands

def parse(self, response):
    # I typically import it on the same line
    # as I will only use it temporarily
    # I just delete the line before committing
    import ipdb;ipdb.set_trace()
    
    title = response.css('h1::text')
    # ...

Common Commands

  • n: Next line
  • s: Step into function
  • c: Continue execution
  • p variable: Print variable
  • ll: List source code
  • q: Quit debugger

Debugging Tools Compared

print()

def parse(self, response):
    print(f"URL: {response.url}")
    print(f"Found: {item}")
  • ➕ Simple to use
  • ➕ Always visible
  • 👎 Clutters the output
  • 👎 No way to indicate the severity of the message
  • 👎 Hard to disable

ipdb

def parse(self, response):
    import ipdb; ipdb.set_trace()
    item = response.css('h1')
  • ➕ Interactive debugging
  • ➕ Inspect what is in each variable at that point in the code
  • ➕ Step through code
  • 👎 Temporary use
  • 👎 Stops execution

Logging

You can also monitor your Python code using the Python logging library.

import logging

logger = logging.getLogger()

def parse(self, response):
    # INFO shows general information
    logger.info("Starting parse")
    # ... some code ...

    try:
        ...  # some code that might fail
    except Exception as e:
        logger.error(f"Failed: {str(e)}")

Logging is a more professional way to monitor your code. You can enable different severity levels, making it easier to filter later on.
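For instance, here is a minimal sketch of how you could control which severity levels are actually shown (the exact configuration is up to you):

import logging

# Show INFO and above; DEBUG messages are filtered out
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)
logger.debug("This will NOT be shown")
logger.info("This WILL be shown")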

Using Logging in Scrapy

import logging

from scrapy import Spider

logger = logging.getLogger(__name__)

class MySpider(Spider):
    def parse(self, response):
        # Different severity levels
        logger.debug("Detailed info for debugging")
        logger.info("General information")
        logger.warning("Something unexpected")
        logger.error("Something failed")

        try:
            ...  # scraping code that might fail
        except Exception as e:
            logger.error(f"Failed: {str(e)}")

💡 Logging is the professional way to monitor your spiders

(ADVANCED) Logging Configuration Example

You can customise how the logs are displayed. Here is the way I like to do it. I create a custom formatter that adds colours and mimics the Scrapy logging style:

class ColorFormatter(logging.Formatter):
    green = "\033[32m"
    reset = "\033[0m"

    FORMATS = {
        logging.INFO: green +
            "%(asctime)s [%(name)s] %(levelname)s: "
            "%(message)s" + reset
    }

    def __init__(self):
        super().__init__(datefmt="%Y-%m-%d %H:%M:%S")

    def format(self, record):
        # Use the coloured template for this level, or fall back to a plain one
        log_fmt = self.FORMATS.get(
            record.levelno,
            "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
        )
        return logging.Formatter(log_fmt, datefmt=self.datefmt).format(record)

The string "\033[32m" is the ANSI escape code for green. ANSI is a standard for controlling the formatting of text output on terminals.
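Just to see ANSI codes in action on their own (a throwaway sketch, nothing Scrapy-specific):

GREEN = "\033[32m"
RESET = "\033[0m"

# Prints the word in green on terminals that understand ANSI escape codes
print(GREEN + "Success" + RESET)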

How to use the formatter

import logging

from somewhere import ColorFormatter

logger = logging.getLogger(__name__)

# A formatter is attached to a handler; the handler is attached to the logger
handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter())
logger.addHandler(handler)

Logging Output Example

This way, instead of seeing:

Parsing https://climateactiontracker.org/countries/brazil/
Successfully parsed data for Brazil

You see a colourful output that is more like this:

2025-02-16 12:21:08 [climate_tracker.spiders] INFO: Parsing https://climateactiontracker.org/countries/brazil/
2025-02-16 12:21:09 [climate_tracker.spiders] DEBUG: Successfully parsed data for Brazil

2️⃣ Web Scraping Selectors: Beyond CSS Selectors

10:15 – 10:40

Back to the world of web scraping…

CSS Selectors

Remember how last week we used CSS selectors to extract data?

# Using CSS selectors
title = response.css('h1::text').get()
rating = response.css('.ratings-matrix__overall dd::text').get()

The same with XPath

There’s another way to select elements: XPath.

It’s more powerful but also more verbose 👇

# Using XPath selectors
title = response.xpath('//h1/text()').get()
rating = response.xpath('//div[@class="ratings-matrix__overall"]/dd/text()').get()

CSS vs XPath Selectors

CSS Selector

# Select by class
response.css('.intro')

# Select text
response.css('p::text')

# Can't select parent! 😢
# No way to do this in CSS

XPath Selector

# Select by class
response.xpath('//p[@class="intro"]')

# Select text
response.xpath('//p/text()')

# Select parent! 🎉
response.xpath('//p/parent::div')

XPath Power Features

# Find elements containing specific text
response.xpath('//p[contains(text(), "climate")]')

# Complex conditions
response.xpath('//div[@class="rating" and @data-value > 5]')

# Navigate up the tree
response.xpath('//span[@class="price"]/ancestor::div')

# Select nth child
response.xpath('//ul/li[2]')  # second list item

💡 We’ll use both CSS and XPath in our spiders - each has its strengths

CSS Selector Cheatsheet

Common CSS Selectors

h1                  /* Element type */
.intro              /* Class */
#title              /* ID */
div p               /* Descendant */
div > p             /* Direct child */
img[alt]            /* Has attribute */
[data-value='10']   /* Attribute value */
p:first-child       /* First child */

Scrapy-specific Pseudo-elements

response.css('p::text').get()      # Extract text content
response.css('a::attr(href)').get()  # Extract attribute value

Note there is a difference between response.css('p::text').get() and response.css('p ::text').get(): the first extracts only the text nodes sitting directly inside the <p> element, while the second (note the space) extracts text from all of the <p>'s descendants.
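Here is a minimal sketch of that difference, using Scrapy's Selector on a made-up HTML snippet:

from scrapy.selector import Selector

sel = Selector(text="<p>Hello <b>world</b></p>")

sel.css('p::text').getall()    # ['Hello ']            (direct text only)
sel.css('p ::text').getall()   # ['Hello ', 'world']   (text of all descendants)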

Understanding CSS Nesting

The > Combinator

<span>
    <em>            <!-- selected by both 'span *' and 'span > *' -->
        <strong>    <!-- selected by 'span *' only -->
            text    <!-- selected by 'span *' only -->
        </strong>
    </em>
    <b>             <!-- selected by both 'span *' and 'span > *' -->
        <i>         <!-- selected by 'span *' only -->
            more    <!-- selected by 'span *' only -->
        </i>
    </b>
</span>

In Scrapy

response.css('span *')     # All descendants
response.css('span > *')   # Direct children only

💡 The space in 'span *' means “any descendant”, while 'span > *' means “direct child”
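To convince yourself, here is a throwaway sketch with Scrapy's Selector, using the HTML above and printing only the element tags:

from scrapy.selector import Selector

html = "<span><em><strong>text</strong></em><b><i>more</i></b></span>"
sel = Selector(text=html)

[s.root.tag for s in sel.css('span *')]    # ['em', 'strong', 'b', 'i']
[s.root.tag for s in sel.css('span > *')]  # ['em', 'b']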

XPath Cheatsheet

XPath looks more like working with paths in a file system.

Basic Selection

//div                    # Any div anywhere

//div/p                  # Direct child p of div

//div//p                 # Any p descendant of any div
                         # no matter how deep

Attributes and Text

//p[@class='intro']     # p with class 'intro'

//p[contains(@class,'intro')]  # p with class containing 'intro'

//p[text()='Hello']     # p containing exact text

XPath Cheatsheet (continued)

Navigation

//div/parent::*         # Parent of div

//div/following::p      # p elements after div

//div/ancestor::section # Any section ancestor of div

Indexing

//p[1]                  # First p (index starts at 1!)

//p[last()-1]          # Second-to-last p

//ul/li[position()=2]   # Second li in ul

//div[p][1]             # First div that has a p child

(//p)[1]                # First p in entire document (it starts at 1)

//div//p[2]             # Second p within each div

3️⃣ Making our code more robust

10:40 – 11:05

Just like we did with our API in Weeks 2 & 3, we need to make our spider more robust. We will do this by:

  1. Testing our spiders with Scrapy Contracts
  2. Validating data with Scrapy Items

Testing Selectors

Interactive Testing

def parse(self, response):
    ipdb.set_trace()
    
    # Test in console
    response.xpath('//p')
    response.css('p')

Unit Testing

def test_title_extraction(self):
    selector = response.xpath('//h1/text()')
    h1_text  = selector.get()
    
    self.assertEqual(
        h1_text,
        'Expected Title'
    )
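If you want to unit test selectors without hitting the network, one option is to build a fake response from a raw HTML string (a sketch with made-up HTML; the helper name is arbitrary):

from scrapy.http import HtmlResponse

def fake_response(html, url="https://example.com"):
    """Wrap a raw HTML string in a response we can run selectors against."""
    return HtmlResponse(url=url, body=html, encoding="utf-8")

response = fake_response("<html><body><h1>Expected Title</h1></body></html>")
assert response.xpath('//h1/text()').get() == 'Expected Title'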

Traditional Unit Testing

Python Unit Tests

# test_api.py

def test_get_user():
    response = client.get("/users/1")
    assert response.status_code == 200
    assert "name" in response.json()

Remember the ascor-api tests?

Key Concepts

  • Test one thing at a time

  • Arrange-Act-Assert pattern

  • Mock external dependencies

  • Clear test names

  • Isolated tests

Why Not Just pytest?

Challenges

  • Network dependencies

  • Dynamic content

  • Rate limiting

  • State management

  • Complex setup

Web Scraping Needs

  • Test selectors

  • Validate data formats

  • Check pagination

  • Handle failures

  • Test pipelines

Enter Scrapy Contracts


def parse(self, response):
    """Extract data from country pages.
    @url https://climateactiontracker.org/countries/brazil/
    @returns items 1 1
    @scrapes country_name overall_rating flag_url
    """
    # ... spider code ...

💡 Contracts are docstring-based tests

Built-in Contracts

Basic Contracts

  • @url - Test URL

  • @returns - Expected output

  • @scrapes - Required fields

  • @cb_kwargs - Callback args

  • @meta - Request metadata

Example Usage

"""
@url http://example.com
@returns items 1 1
@returns requests 0 0
@scrapes title price
"""

Custom Contracts

import pycountry

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class ValidCountryContract(Contract):
    """Check if country_name is a valid country.
    @valid_country
    """
    name = 'valid_country'

    def post_process(self, output):
        for item in output:
            search_country = pycountry.countries.get(name=item['country_name'])

            if not search_country:
                raise ContractFail(
                    f"Invalid country name: {item['country_name']}"
                )

Running Contract Tests

Command

scrapy check

Tests all contracts in all spiders in this Scrapy project

Configuration

# settings.py
SPIDER_CONTRACTS = {
    'climate_tracker.contracts.ValidCountryContract': 100
}

Contracts vs Unit Tests

Unit Tests

  • More flexible

  • Better for complex logic

  • Can mock dependencies

  • Standard Python tools

  • IDE integration

Contracts

  • Spider-specific

  • Built into Scrapy

  • Tests real responses

  • Simpler to write

  • Self-documenting

Data Validation with Scrapy Items

API Models (W02-W03)

from pydantic import BaseModel, Field

class Country(BaseModel):
    name: str
    rating: str
    flag_url: str = Field(pattern=r'^https?://.+')

Pydantic models served as a good example of how to enforce data structure.

Scrapy Items

from scrapy import Item, Field

class CountryClimateItem(Item):
    name = Field(
        serializer=str,
        required=True
    )
    rating = Field(
        serializer=str,
        choices=['Insufficient', 
                 'Compatible']
    )

Scrapy Items serve a very similar purpose to Pydantic models; the syntax is just different. One caveat: keys such as required or choices above are stored as field metadata rather than enforced automatically, so you typically check them yourself, for example in an item pipeline (as we do later in this lecture).
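One thing Items do enforce out of the box is the set of allowed field names. A quick sketch:

item = CountryClimateItem()
item['name'] = 'Brazil'       # fine: 'name' is a declared field
item['population'] = 123      # raises KeyError, because 'population' was never declared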

Why Use Items?

Benefits

  • Type validation
  • Required fields
  • Field choices
  • Documentation
  • IDE support

Usage

def parse(self, response):
    item = CountryClimateItem()
    item['name'] = response.css('h1::text').get()
    return item

🍵 Quick Coffee Break

11:05 – 11:15

After the break:

  • Item Pipelines for Data Processing
  • Dynamic URL Discovery & Crawling
  • Best Practices & Common Pitfalls

4️⃣ Data Processing with Item Pipelines

11:15 – 11:35

  • Spider -> Item -> Pipeline -> Output

1. Adding a data validation pipeline

from scrapy.exceptions import DropItem

from climate_tracker.items import CountryClimateItem  # wherever your Item class lives


class ValidateItemPipeline:

    def process_item(self, item, spider):

        if not isinstance(item, CountryClimateItem):
            raise DropItem(f"Unknown item type: {type(item)}")

        # Suppose we have some validation logic for the rating
        if item['overall_rating'] not in ['Insufficient', 'Compatible']:
            raise DropItem(f"Invalid rating: {item['overall_rating']}")
            
        return item

Then add this to your settings.py:

# Check settings.py is configured
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 300
}

Debugging Pipelines

Add logging to your pipeline to help debug.

import logging

from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class ValidateItemPipeline:

    def process_item(self, item, spider):

        logger.debug(f"Validating item: {item}")
        if not isinstance(item, CountryClimateItem):
            logger.error(f"Item {item} is not a valid CountryClimateItem")
            raise DropItem(f"Unknown item type: {type(item)}")

        # Suppose we have some validation logic for the rating
        if item['overall_rating'] not in ['Insufficient', 'Compatible']:
            error_msg = (
                f"Dropping item {item} because "
                f"it has an invalid rating: {item['overall_rating']}"
            )
            logger.error(error_msg)
            raise DropItem(error_msg)
            
        return item

Running & Checking

You can run your spider with a specific type of log level.

# Run with logging enabled
scrapy crawl spider_name --logfile log.txt

# Run with debug level
scrapy crawl spider_name -L DEBUG

🐞 Current bug: the log level is not being applied to our custom logging, only to Scrapy’s default logging. If anyone finds a fix, send a PR!

Pipeline Lifecycle

When Pipelines Run

  1. Spider yields item
  2. Each pipeline processes in order
  3. Items can be:
    • Modified
    • Dropped
    • Passed through unchanged (see the sketch after these lists)

Common Issues

  • Pipeline not in settings
  • Wrong priority number
  • Not returning item
  • Incorrect field names
  • Missing error handling
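Here is a minimal sketch of a pipeline that modifies items rather than dropping them (the field name matches our hypothetical CountryClimateItem):

class NormaliseCountryNamePipeline:
    """Tidy up the country name before later pipelines see it."""

    def process_item(self, item, spider):
        if item.get('country_name'):
            # Strip stray whitespace and normalise capitalisation
            item['country_name'] = item['country_name'].strip().title()
        # Always return the (possibly modified) item so the next pipeline receives it
        return item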

2. Adding a pipeline for downloading images

import logging

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class CountryFlagsPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        """Request SVG download if URL is present."""
        if item.get('flag_url'):
            yield Request(item['flag_url'])

    def file_path(self, request, response=None, info=None, *, item=None):
        """Generate file path for storing the SVG."""
        country = item['country_name'].lower().replace(' ', '_')
        return f'flags/{country}.svg'

    def item_completed(self, results, item, info):
        """Update item with local file path after download."""
        if results and results[0][0]:  # if success
            item['flag_path'] = results[0][1]['path']
            logger.debug(f"Downloaded flag for {item['country_name']}")
        else:
            logger.warning(f"Failed to download flag for {item['country_name']}")
            item['flag_path'] = None
        return item

3. Remember to add the pipeline to the settings

# settings.py
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 100,
    'climate_tracker.pipelines.CountryFlagsPipeline': 300
}

# Lower numbers = higher priority
# Range: 0-1000

Run a single spider callback against one URL, with the pipelines enabled

scrapy parse --spider=spider_name --pipelines url

4. Control the output format

We can use Scrapy's Feed Exports to control the output format. They play a similar role to a custom output pipeline, but they come built into Scrapy, so you don't need to write any code for common formats.

# settings.py
FEEDS = {
    'data/output.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
        'store_empty': False,
        'overwrite': True,
    },
}

💡 Use built-in feed exports instead of custom pipelines when possible

You can also specify the output file when running the spider:

scrapy crawl spider_name -O output.jsonl

Common Gotchas of Item Pipelines

⚠️ Common gotchas:

  • Forgetting to add pipelines to settings
  • Not handling file resources properly
  • Wrong pipeline order
  • Missing error handling

5️⃣ Dynamic URL Discovery

11:35 – 12:00

From Static to Dynamic URLs

Static Approach

class MySpider(Spider):
    start_urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

Dynamic Discovery

# Start from a single URL
start_urls = [
    'https://example.com/countries/'
]

def parse(self, response):
    # Find all country links
    for href in response.css(
        '.country-link::attr(href)'
    ):
        yield response.follow(
            href,
            self.parse_country
        )

The .follow() method is used to follow a link and call a different callback function to parse the response.
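If you are following many links with the same callback, newer versions of Scrapy also provide response.follow_all(), which does the loop for you. A sketch, reusing the same hypothetical .country-link selector:

def parse(self, response):
    # Follow every country link and parse each page with parse_country
    yield from response.follow_all(
        css='.country-link::attr(href)',
        callback=self.parse_country
    )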

Callback Functions

Main Parser

def parse(self, response):
    """Find all country URLs."""
    for href in response.css(
        '.country-link::attr(href)'
    ).getall():
        yield response.follow(
            href, 
            self.parse_country
        )

Detail Parser

def parse_country(self, response):
    """Extract country data."""
    return {
        'name': response.css(
            'h1::text'
        ).get(),
        'rating': response.css(
            '.rating::text'
        ).get()
    }

💡 Each callback function has a specific responsibility

Handling Pagination

By the way, you can also handle pagination in your callback functions.

def parse(self, response):
    # Process current page
    for item in response.css('.item'):
        yield self.parse_item(item)
    
    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    
    if next_page:
        yield response.follow(next_page, self.parse)

def parse_item(self, item_selector):
    return {
        'title': item_selector.css('h2::text').get(),
        'date': item_selector.css('.date::text').get()
    }

💡 Separating item parsing into its own method makes the code more maintainable

Best Practices

URL Management

  • Start from index pages
  • Follow internal links only
  • Implement rate limiting
  • Handle URL parameters
  • Check URL validity

Error Handling

  • Handle network errors
  • Implement retries (see the sketch after this list)
  • Log exceptions
  • Validate data
  • Monitor performance
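A minimal sketch of what some of these practices look like in Scrapy (the settings are real Scrapy settings; the spider, selectors and URLs are made up):

# settings.py -- rate limiting and automatic retries
DOWNLOAD_DELAY = 1.0          # wait between requests
AUTOTHROTTLE_ENABLED = True   # adapt the delay to how the server responds
RETRY_ENABLED = True
RETRY_TIMES = 2               # retry failed requests a couple of times

Attaching an errback lets you log failed requests instead of silently losing pages:

import scrapy

class CountrySpider(scrapy.Spider):
    name = 'countries'
    start_urls = ['https://example.com/countries/']

    def parse(self, response):
        for href in response.css('.country-link::attr(href)').getall():
            yield response.follow(href, self.parse_country,
                                  errback=self.handle_error)

    def parse_country(self, response):
        yield {'name': response.css('h1::text').get()}

    def handle_error(self, failure):
        # Called for DNS errors, timeouts, HTTP errors, etc.
        self.logger.error(repr(failure))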

What’s Next?

THE END

  1. Join us in the lab for a demo of Selenium.
  2. Practice with the 🗓️ W04-W05 Formative Exercise
  3. Review the Scrapy documentation on the topics covered today
  4. Start thinking about your Summative Exercise options