DS205 – Advanced Data Manipulation
17 Feb 2025
10:03 – 10:15
A few unusual Python concepts that might have left you a bit puzzled plus some new debugging tools.
You will probably have found it unusual that we’re not using Jupyter Notebooks much in this course. This is because for code that needs to be run in a production environment, Jupyter Notebooks are not a perfect fit.
👉 We want our code to run without a graphical user interface
Jupyter Notebooks
Python Scripts
The yield Keyword in Scrapy
You are aware of lists in Python, right? But maybe you don’t know about generators?
In the words of the Python documentation:
Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.
This is useful when you want to process items one at a time, without loading them all into memory. It’s a type of lazy evaluation, a concept from functional programming in computer science.
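For instance, here is a minimal plain-Python generator (an illustrative sketch, not from the course repo):

def squares(n):
    """Yield square numbers one at a time instead of building a full list."""
    for i in range(n):
        yield i * i

gen = squares(3)
next(gen)          # 0
next(gen)          # 1
list(squares(3))   # [0, 1, 4] -- you can still consume it like a list

In Scrapy, the same idea appears in a spider callback: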
def parse(self, response):
    # yield returns the item but keeps the function's state for the next one
    yield {
        'title': response.css('h1::text').get(),
        'date': response.css('.date::text').get()
    }
The code above will not return the item immediately. Instead, it yields the item to the Scrapy engine, which will process it later.
In the end, as the user of the function, you will perceive this as if it were returning a kind of list of items for you.
ipdb
Key Commands
Common Commands
n : Next line
s : Step into function
c : Continue execution
p variable : Print variable
ll : List source code
q : Quit debugger
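To drop into the debugger in the first place, one common pattern (a sketch, assuming ipdb is installed) is to call ipdb.set_trace() inside the callback you want to inspect:

def parse(self, response):
    import ipdb; ipdb.set_trace()  # execution pauses here; use n, s, c, p, ll, q
    title = response.css('h1::text').get()
    ...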
Going beyond print() statements, you can also monitor your Python code using the Python logging library.
import logging

logger = logging.getLogger()

def parse(self, response):
    # INFO shows general information
    logger.info("Starting parse")

    ...  # some code ...

    try:
        ...  # some code ...
    except Exception as e:
        logger.error(f"Failed: {str(e)}")
Logging is a more professional way to monitor your code. You can enable different severity levels, making it easier to filter later on.
import logging

from scrapy import Spider

logger = logging.getLogger(__name__)

class MySpider(Spider):
    def parse(self, response):
        # Different severity levels
        logger.debug("Detailed info for debugging")
        logger.info("General information")
        logger.warning("Something unexpected")
        logger.error("Something failed")
💡 Logging is the professional way to monitor your spiders
You can customise how the logs are displayed. Here is the way I like to do it. I create a custom formatter that adds colors and mimics the Scrapy logging style:
import logging

class ColorFormatter(logging.Formatter):
    green = "\033[32m"
    reset = "\033[0m"

    FORMATS = {
        logging.INFO: green +
            "%(asctime)s [%(name)s] %(levelname)s: "
            "%(message)s" + reset
    }

    def __init__(self):
        super().__init__(datefmt="%Y-%m-%d %H:%M:%S")

    def format(self, record):
        # Apply the per-level colour format (one possible implementation;
        # falls back to the default format for levels not in FORMATS)
        log_fmt = self.FORMATS.get(record.levelno, self._fmt)
        return logging.Formatter(log_fmt, datefmt=self.datefmt).format(record)
The string "\033[32m" is the ANSI escape code for green. ANSI is a standard for controlling the formatting of text output on terminals.
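To actually use the formatter, attach it to a handler (a minimal sketch; where you wire this up, for example in the package's __init__.py, is a project choice):

import logging

handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter())

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)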
This way, instead of seeing:
Parsing https://climateactiontracker.org/countries/brazil/
Successfully parsed data for Brazil
You see a colourful output that is more like this:
2025-02-16 12:21:08 [climate_tracker.spiders] INFO: Parsing https://climateactiontracker.org/countries/brazil/
2025-02-16 12:21:09 [climate_tracker.spiders] DEBUG: Successfully parsed data for Brazil
10:15 – 10:40
Back to the world of web scraping…
Remember how last week we used CSS selectors to extract data?
# Using CSS selectors
title = response.css('h1::text').get()
rating = response.css('.ratings-matrix__overall dd::text').get()
There’s another way to select elements: XPath.
It’s more powerful than CSS selectors but also more verbose 👇
# Find elements containing specific text
response.xpath('//p[contains(text(), "climate")]')
# Complex conditions
response.xpath('//div[@class="rating" and @data-value > 5]')
# Navigate up the tree
response.xpath('//span[@class="price"]/ancestor::div')
# Select nth child
response.xpath('//ul/li[2]') # second list item
💡 We’ll use both CSS and XPath in our spiders - each has its strengths
Common CSS Selectors
h1 /* Element type */
.intro /* Class */
#title /* ID */
div p /* Descendant */
div > p /* Direct child */
img[alt] /* Has attribute */
[data-value='10'] /* Attribute value */
p:first-child /* First child */
Scrapy-specific Pseudo-elements
response.css('p::text').get() # Extract text content
response.css('a::attr(href)').get() # Extract attribute value
Note there is a difference between response.css('p::text').get() and response.css('p ::text').get(): without the space you get only the text sitting directly inside the <p>, while with the space (descendant combinator) you also get text from any element nested inside it.
The > Combinator
<span>
    <em>            <!-- selected by both 'span *' and 'span > *' -->
        <strong>    <!-- selected by 'span *' only -->
            text    <!-- selected by 'span *' only -->
        </strong>
    </em>
    <b>             <!-- selected by both 'span *' and 'span > *' -->
        <i>         <!-- selected by 'span *' only -->
            more    <!-- selected by 'span *' only -->
        </i>
    </b>
</span>
In Scrapy
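A quick way to see the difference is with a standalone Selector (an illustrative sketch, not from the course repo):

from scrapy.selector import Selector

html = "<span><em><strong>text</strong></em><b><i>more</i></b></span>"
sel = Selector(text=html)

sel.css('span *').getall()    # <em>, <strong>, <b>, <i> -- every descendant element
sel.css('span > *').getall()  # <em>, <b> -- direct children only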
💡 The space in 'span *' means “any descendant”, while 'span > *' means “direct child”
Click here for the full list of CSS selectors.
XPath looks more like working with paths in a file system.
Basic Selection
//div # Any div anywhere
//div/p # Direct child p of div
//div//p # Any p descendant of any div
# no matter how deep
Attributes and Text
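A few typical patterns (illustrative examples, not necessarily the ones shown in class):

//a/@href                          # Value of the href attribute
//div[@class="rating"]             # div whose class attribute is exactly "rating"
//p/text()                         # Text nodes directly inside p
//p[contains(text(), "climate")]   # p whose text contains "climate"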
Navigation
//div/parent::* # Parent of div
//div/following::p # p elements after div
//div/ancestor::section # Any section ancestor of div
Indexing
//p[1] # First p within its parent (index starts at 1!)
//p[last()-1] # Second-to-last p
//ul/li[position()=2] # Second li in ul
//div[p][1] # First div that has a p child
(//p)[1] # First p in the entire document (note the parentheses)
//div//p[2] # Second p within each div
Click here to see the full XPath spec
10:40 – 11:05
Just like we did with our API in Weeks 2 & 3, we need to make sure our spider behaves as expected.
Python Unit Tests
# test_api.py
def test_get_user():
    response = client.get("/users/1")
    assert response.status_code == 200
    assert "name" in response.json()
Remember the ascor-api tests?
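For a spider, the same idea applies, except you build the response yourself from a saved HTML file. A minimal sketch (the fixture path and the spider import are illustrative, not the actual course repo layout):

from pathlib import Path

from scrapy.http import HtmlResponse

from climate_tracker.spiders.climate_tracker import ClimateTrackerSpider  # hypothetical import path

def test_parse_country():
    # Build a fake response from a saved HTML fixture (hypothetical path)
    body = Path("tests/fixtures/brazil.html").read_bytes()
    response = HtmlResponse(
        url="https://climateactiontracker.org/countries/brazil/",
        body=body,
        encoding="utf-8",
    )
    spider = ClimateTrackerSpider()
    item = next(spider.parse(response))  # parse() yields, so take the first item
    assert item["country_name"] == "Brazil"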
Key Concepts
Test one thing at a time
Arrange-Act-Assert pattern
Mock external dependencies
Clear test names
Isolated tests
Challenges
Network dependencies
Dynamic content
Rate limiting
State management
Complex setup
Web Scraping Needs
Test selectors
Validate data formats
Check pagination
Handle failures
Test pipelines
def parse(self, response):
    """Extract data from country pages.

    @url https://climateactiontracker.org/countries/brazil/
    @returns items 1 1
    @scrapes country_name overall_rating flag_url
    """
    # ... spider code ...
💡 Contracts are docstring-based tests
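You run contracts with Scrapy's built-in check command (the spider name below is illustrative):

scrapy check climate_tracker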
Unit Tests
More flexible
Better for complex logic
Can mock dependencies
Standard Python tools
IDE integration
Contracts
Spider-specific
Built into Scrapy
Tests real responses
Simpler to write
Self-documenting
API Models (W02-W03)
Pydantic models served as a good example of how to enforce data structure.
11:05 – 11:15
After the break:
11:15 – 11:35
Spider -> Item -> Pipeline -> Output
from scrapy.exceptions import DropItem

from climate_tracker.items import CountryClimateItem  # assuming the item is defined in items.py

class ValidateItemPipeline:
    def process_item(self, item, spider):
        if not isinstance(item, CountryClimateItem):
            raise DropItem(f"Unknown item type: {type(item)}")

        # Suppose we have a validation logic for the rating
        if item['overall_rating'] not in ['Insufficient', 'Compatible']:
            raise DropItem(f"Invalid rating: {item['overall_rating']}")

        return item
Then add this to your settings.py:
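For example, reusing the pipeline path that appears later in these notes:

# settings.py
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 100,
}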
Add logging to your pipeline to help debug.
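For instance, a debug-level message at the top of process_item (a sketch showing where such a call could sit):

import logging

logger = logging.getLogger(__name__)

class ValidateItemPipeline:
    def process_item(self, item, spider):
        # Record which item is being validated before any checks run
        logger.debug("Validating item: %s", item.get('country_name'))
        ...
        return item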
You can run your spider with a specific type of log level.
# Run with logging enabled
scrapy crawl spider_name --logfile log.txt
# Run with debug level
scrapy crawl spider_name -L DEBUG
🐞 Current bug: the log level is not being applied to our custom logging, only to Scrapy’s default logging. If anyone finds a fix, send a PR!
When Pipelines Run
Common Issues
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class CountryFlagsPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        """Request SVG download if URL is present."""
        if item.get('flag_url'):
            yield Request(item['flag_url'])

    def file_path(self, request, response=None, info=None, *, item=None):
        """Generate file path for storing the SVG."""
        country = item['country_name'].lower().replace(' ', '_')
        return f'flags/{country}.svg'
# settings.py
ITEM_PIPELINES = {
    'climate_tracker.pipelines.ValidateItemPipeline': 100,
    'climate_tracker.pipelines.CountryFlagsPipeline': 300,
}
# Lower numbers = higher priority
# Range: 0-1000
We can use Scrapy’s Feed Exports to control the output format. They are very similar to Item Pipelines, but they are built into Scrapy.
# settings.py
FEEDS = {
    'data/output.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
        'store_empty': False,
        'overwrite': True,
    },
}
💡 Use built-in feed exports instead of custom pipelines when possible
You can also specify the output file when running the spider:
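In recent Scrapy versions, -o appends to an existing file while -O overwrites it:

# Append to (or create) data/output.json
scrapy crawl spider_name -o data/output.json

# Overwrite data/output.json on every run
scrapy crawl spider_name -O data/output.json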
⚠️ Common gotchas:
11:35 – 12:00
Static Approach
Dynamic Discovery
# Start from a single URL
start_urls = [
    'https://example.com/countries/'
]

def parse(self, response):
    # Find all country links
    for href in response.css(
        '.country-link::attr(href)'
    ):
        # Follow each link and parse it with a separate callback
        # (parse_country is an illustrative name)
        yield response.follow(href, callback=self.parse_country)
The .follow() method is used to follow a link and call a different callback function to parse the response.
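A per-country callback might then look something like this (parse_country and the selectors are illustrative):

def parse_country(self, response):
    yield {
        'country_name': response.css('h1::text').get(),
        'overall_rating': response.css('.ratings-matrix__overall dd::text').get(),
    }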
Main Parser
💡 Each callback function has a specific responsibility
By the way, you can also handle pagination in your callback functions.
def parse(self, response):
    # Process current page
    for item in response.css('.item'):
        yield self.parse_item(item)

    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

def parse_item(self, item_selector):
    return {
        'title': item_selector.css('h2::text').get(),
        'date': item_selector.css('.date::text').get()
    }
💡 Separating item parsing into its own method makes the code more maintainable
URL Management
Error Handling
THE END
LSE DS205 (2024/25)