Week 04 Lab
Web Scraping with Scrapy

Last Updated: 10 February 18:20
Time and Location: Tuesday, 11 February 2025. Check your timetable for the precise time and location of your class.
Preparation
Before attending this lab, ensure that:
You are caught up with the theoretical concepts covered in the Week 04 Lecture, including the principles of ethical web scraping.
Create a dedicated folder for this lab. Place it somewhere like ~/DS205/web-scraping. For this lab, a simple local folder is sufficient; you'll be working with a dedicated GitHub repository later for the W04-W05 Formative Exercise.
(Optional but recommended) Set up a dedicated virtual environment for this web scraping project.
We will soon have two concurrent projects running during the course: i) ascor-api and ii) climate-data-web-scraping. It's a good idea to keep their Python packages separate.
Setting up a virtual environment
Don't just take my word for it. Even the Scrapy Installation Guide, the official documentation for the library we will use for web scraping, recommends setting up a virtual environment:
"We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages."
- To set up a virtual environment, run the following command in your terminal:
python -m venv scraping-env
- Then, activate the virtual environment:
# If on Mac or Linux (e.g. Nuvolos)
source scraping-env/bin/activate
# If on Windows
scraping-env\Scripts\activate
You should see something like this in your terminal:
(scraping-env) jon@jon-desktop:~/DS205/web-scraping$
- Install Scrapy in the virtual environment:
pip install scrapy pandas
- Whenever you come back to work on this project, activate the virtual environment:
source scraping-env/bin/activate
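Two quick extras, in case they are useful. To confirm the installation worked, you can ask Scrapy to print its version (the exact number will depend on when you installed it), and when you are finished for the day you can leave the virtual environment again:

scrapy version   # should print something like 'Scrapy 2.x.x'
deactivate       # leaves the virtual environment when you are done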
Lab Roadmap
Scrapy is a powerful and flexible web scraping framework that allows you to crawl websites, extract structured data, and save it in various formats. It automates a lot of the tedious tasks that you would otherwise have to do manually. This is our tool of choice for web scraping in this lab.
You might want to bookmark the following pages to consult later once you've understood the basics:
- Spiders: The main component for scraping web pages
- Item Pipeline: Process and store scraped data
Part I: Project Setup (15 min)
No need to wait for your class teacher. You can start working on this part on your own as soon as you arrive.
ACTION POINTS
Create a new Scrapy project.
Like FastAPI, Scrapy requires a specific set of files and folders that we need to set up when starting a new project. (You will find that Scrapy is more demanding in this regard.)
To create a new Scrapy project (let's call it climate_tracker), run the following commands in your terminal:

scrapy startproject climate_tracker
cd climate_tracker
Examine the project structure.
The command above will produce several files and folders. Take some time reading through them to understand the purpose of each file and folder.
climate_tracker/
├── scrapy.cfg
└── climate_tracker/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Configure ethical scraping settings.
Open the settings.py file and configure the following settings:

# Identify your spider
USER_AGENT = 'LSE DS205 Student Spider (GitHub: @your-username) (+https://lse-dsi.github.io/DS205)'

# Be polite
ROBOTSTXT_OBEY = True  # Make sure this is uncommented and set to True
DOWNLOAD_DELAY = 3     # Wait 3 seconds between requests (don't bombard the website with requests)
Replace @your-username with your GitHub username. This way, the website maintainers know we are not digital pirates.
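(Optional) If you would rather let Scrapy adapt its crawl rate automatically than rely solely on a fixed DOWNLOAD_DELAY, you could also enable the built-in AutoThrottle extension in settings.py. This is not required for the lab; the values below are just a conservative starting point:

# Optional: let Scrapy adjust the delay based on how the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # back off up to this delay if the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request at a time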
Acknowledgement:
In compliance with the Climate Analytics and NewClimate Institute copyright notice, if you end up putting your webscraping code on GitHub, remember to add the following notice to the top of your files:
Data and extracted textual content from the Climate Action Tracker website are copyrighted
© 2009-2025 by Climate Analytics and NewClimate Institute. All rights reserved.
Part II: Using the Scrapy Shell (25 min)
Note to instructor: Things to connect: the notion of DOM inspection, using Ctrl + F to search with CSS selectors in the browser's DevTools, and how it's similar to the Scrapy shell. Feel free to show other CSS pseudo-classes and how they work.
TEACHING MOMENT
Your class teacher will help you check your understanding of what each of these files represents. Then, they will link back to the concepts covered in the Week 04 Lecture as they show you around the Climate Action Tracker website.
Your class teacher will also demonstrate the steps below. Follow along and ask questions if you are unsure about anything.
Access the Scrapy shell:
Before you write code in a spider, it helps to test your CSS selectors in a shell dedicated to that purpose, the Scrapy shell.
scrapy shell "https://climateactiontracker.org/countries/brazil/"
This command sends a request to the website and loads the response into the shell, saving it to a variable called response.

Use the response object to test your CSS selectors:

If you know some HTML elements and CSS selectors, you can use the response object to test them. Use the response.css() method to select elements and the .get() method to extract the text or attributes. Run the commands below and observe how their output is different.
response.css('title::text').get()

The above returns a single string, the text of the title of the page.

response.css('title').get()

The above returns the <title> element itself!

response.css('title').getall()

The above returns a list of all the <title> elements in the page.

response.css('title')

Without the get() method, we get a SelectorList object, a list of all the <title> elements in the page.

Think about it: What do you think is a SelectorList object? Why not a simple list or string?

Finally, here's something more complex and interesting. The CSS selector below grabs the text of the first paragraph inside the "Overview" section of the page.

selected_paragraph = response.css('div.content-block > p ::text').get()
Read more about CSS selectors in the official documentation.
The ::text is not part of the official CSS specification but is supported by Scrapy.

What if you want to select the href attribute of the link in the "Overview" section? You can do that with the ::attr() pseudo-class:

# Grabs the first link that appears in the page and returns its href attribute
selected_link = response.css('a::attr(href)').get()
The ::attr() pseudo-class is another non-standard feature supported by Scrapy.

You can also use the ::attr() pseudo-class to select the src attribute of an image:

selected_image = response.css('img::attr(src)').get()
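One more tip before you practise: href and src values are often relative paths (e.g. /countries/brazil/). If you need full URLs later, for example to follow links or download images, Scrapy's response.urljoin() converts them for you:

# Turn a (possibly relative) src attribute into an absolute URL
selected_image = response.css('img::attr(src)').get()
if selected_image is not None:
    absolute_image_url = response.urljoin(selected_image)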
Part III: Scrapy Shell Practice (20 min)
It's your turn to collect some data from the page.
ACTION POINTS
On the Scrapy shell, write the necessary CSS selectors to collect the following data:
The country name (the big lettering at the top left of the page).
The flag of the country (the image at the top left of the page).
The overall rating of the country.
Keeping Brazil as an example, your code should return as output a simple string with the text "Insufficient".
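If you get stuck, the patterns from Part II are enough to get going. The snippet below is only a sketch: the h1 selector mirrors the country-name example used later in this lab, while div.rating is a made-up placeholder class; inspect the page in your browser's DevTools to find the real selector for the rating before relying on it.

# Country name: the page's main heading (same pattern used later in this lab)
response.css('h1::text').get()

# Flag: grab the src attribute of a suitable <img> element
response.css('img::attr(src)').get()

# Overall rating: 'div.rating' is a PLACEHOLDER -- replace it with the
# selector you find when inspecting the page in DevTools
response.css('div.rating ::text').get()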
When you're done, exit the Scrapy shell by typing exit() and pressing Enter.
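For future Scrapy shell sessions, two other built-in helpers are worth knowing about: view(response) opens the downloaded response in your local browser (handy for checking what Scrapy actually received, which is not always identical to what you see when browsing), and fetch() loads a different URL into the same response variable without restarting the shell:

view(response)                                               # open the current response in a browser
fetch("https://climateactiontracker.org/countries/india/")   # load another page into response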
Part IV: Creating Your First Spider (30 min)
It's time to make use of the Scrapy project we created at the beginning of the lab. The Spider is the main component of a Scrapy project. It defines how to scrape a website and what data to extract.
TEACHING MOMENT
Your class teacher will guide you through the steps below. Follow along and ask questions if you are unsure about anything.
Create a new spider:
From the root directory of your Scrapy project (climate_tracker/), use scrapy genspider to create a new spider:

scrapy genspider climate_action_tracker climateactiontracker.org
This will create a new file in your project with this structure:
climate_tracker/
├── scrapy.cfg
└── climate_tracker/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── climate_action_tracker.py   # <-- the new spider
You will see that it comes with the following code:
import scrapy


class ClimateActionTrackerSpider(scrapy.Spider):
    name = "climate_action_tracker"
    allowed_domains = ["climateactiontracker.org"]
    start_urls = ["https://climateactiontracker.org"]

    def parse(self, response):
        pass
The key thing here is that the relevant scraping logic should be implemented in the parse() method. Right now, it's empty (pass means do nothing in Python).

Implement the scraping logic and return a dictionary of items:
Inside the parse() method, add the CSS selectors you used in the previous part to extract the data you want.

For example, to extract the country name, you can do the following:
def parse(self, response):
    country_name = response.css('h1::text').get()

    # Return a dictionary of items
    yield {
        'country_name': country_name
    }
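A small aside: if a selector matches nothing on some page, .get() returns None. It also accepts a default argument, so you can yield an empty string (or any other placeholder) instead:

# Fall back to an empty string when the selector matches nothing
country_name = response.css('h1::text').get(default='')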
Think about it: What is the purpose of the yield keyword? Why not return?

Modify the start URLs:
Modify the start_urls list to include the URLs of the countries you want to scrape.

For example, to scrape the data for Brazil, you can do the following:
= ["https://climateactiontracker.org/countries/brazil/"] start_urls
Run your spider:
Now you can run your spider! From the root directory of the scrapy project, run:
scrapy crawl climate_action_tracker -O output.json
A LOT OF OUTPUT WILL BE GENERATED. This is normal. It means your spider is working as expected.
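If the logging feels overwhelming, you can turn it down. Scrapy's commands accept a --loglevel option (and the same thing can be set permanently via LOG_LEVEL in settings.py); this is entirely optional and the defaults are fine for this lab:

scrapy crawl climate_action_tracker -O output.json --loglevel=WARNING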
What is happening here?
- Scrapy sends a request to each one of the URLs in the start_urls list
- It receives the response
- It sends the response to the parse() method
- The parse() method is where your logic goes. It's where you extract the data you want
- You control which data to extract and what you return in the dictionary
- The dictionary is then saved to a JSON file, a file format that computers love.
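With the parse() method shown earlier, output.json should end up containing something along these lines (the exact text depends on what the h1 selector returns for the page you scraped):

[
    {"country_name": "Brazil"}
]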
Feel like practicing some more?
This section will be part of your W04-W05 Formative Exercise.
Now that you've learned the basics of Scrapy, it's time to practice on your own! Here's your challenge:
Expand your spider's reach:
- Modify the start_urls list to include at least 3 different countries
- For example: Brazil, China, and India

start_urls = [
    "https://climateactiontracker.org/countries/brazil/",
    "https://climateactiontracker.org/countries/china/",
    "https://climateactiontracker.org/countries/india/"
]
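As an alternative to hard-coding the full URLs above, you could build them from a list of country slugs inside Scrapy's start_requests() method. This is just a sketch of that pattern, not something the exercise requires:

import scrapy


class ClimateActionTrackerSpider(scrapy.Spider):
    name = "climate_action_tracker"
    allowed_domains = ["climateactiontracker.org"]

    # Country slugs as they appear in the Climate Action Tracker URLs
    countries = ["brazil", "china", "india"]

    def start_requests(self):
        for country in self.countries:
            url = f"https://climateactiontracker.org/countries/{country}/"
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"country_name": response.css("h1::text").get()}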
Collect more data:
Update your parse() method to collect:
- The country's overall rating
- The flag image URL
- The first paragraph of the overview section
Your output dictionary should look something like this:
yield {
    'country_name': country_name,
    'rating': rating,
    'flag_url': flag_url
}
Convert to DataFrame:
After running your spider and getting the JSON output, try loading it into a pandas DataFrame. You can create a Jupyter notebook or a Python script to do this.
Sample code:
import pandas as pd

# Read the JSON file
df = pd.read_json('output.json')

# Display the first few rows
print(df.head())
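If you prefer a spreadsheet-friendly format, the same DataFrame can be written straight back out as CSV (the filename is just a suggestion):

# Save the scraped data as a CSV file
df.to_csv('output.csv', index=False)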
Bonus Challenge: Try to extract all the indicators from the main div of the page ("Policies and action", "NDC target", etc.)
Remember to keep the DOWNLOAD_DELAY setting so you don't overwhelm the website with requests.