πŸ’» Week 04 Lab

Web Scraping with Scrapy

Published: 10 February 2025

πŸ₯… Learning Goals
By the end of this lab, you will learn how to: i) set up a Scrapy project and configure it for ethical web scraping, ii) create a spider to collect country data from the Climate Action Tracker website, iii) extract specific data like country names, ratings and flags using CSS selectors, iv) save the scraped data as JSON, and v) load and analyse the data using pandas DataFrames.

Last Updated: 10 February 18:20

πŸ“Time and Location: Tuesday, 11 February 2025. Check your timetable for the precise time and location of your class.

πŸ“‹ Preparation

Before attending this lab, ensure that:

  1. You are caught up with the theoretical concepts covered in the πŸ—£οΈ Week 04 Lecture, including the principles of ethical web scraping.

  2. Create a dedicated folder for this lab. Place it somewhere like ~/DS205/web-scraping.

    For this lab, a simple local folder is sufficient; you’ll be working with a dedicated GitHub repository later for the πŸ“ W04-W05 Formative Exercise.

  3. (Optional but recommended) Set up a dedicated virtual environment for this web scraping project.

    We will soon have two concurrent projects running during the course: i) ascor-api and ii) climate-data-web-scraping. It’s a good idea to keep their Python packages separate.

Setting up a virtual environment

Don’t just take my word for it. Even the Scrapy Installation Guide, the official documentation for the library we will use for web scraping, recommends setting up a virtual environment:

β€œWe strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages.”

  1. To set up a virtual environment, run the following command in your terminal:
python -m venv scraping-env
  2. Then, activate the virtual environment:
# If on Mac or Linux (e.g. Nuvolos)
source scraping-env/bin/activate

# If on Windows
scraping-env\Scripts\activate

You should see something like this in your terminal:

(scraping-env) jon@jon-desktop:~/DS205/web-scraping$
  3. Install Scrapy (and pandas, which we will use later) in the virtual environment:
pip install scrapy pandas
  4. Whenever you come back to work on this project, activate the virtual environment:
source scraping-env/bin/activate
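
When you are done working, you can leave the virtual environment with the standard deactivate command (it becomes available once the environment is activated):

deactivate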

πŸ›£οΈ Lab Roadmap

Scrapy is a powerful and flexible web scraping framework that allows you to crawl websites, extract structured data, and save it in various formats. It automates a lot of the tedious tasks that you would otherwise have to do manually. This is our tool of choice for web scraping in this lab.

You might want to bookmark the following pages to consult later once you’ve understood the basics:

Part I: Project Setup (15 min)

No need to wait for your class teacher. You can start working on this part on your own as soon as you arrive.

🎯 ACTION POINTS

  1. Create a new Scrapy project.

    Like FastAPI, Scrapy requires a specific set of files and folders that we need to set up when starting a new project. (You will find that Scrapy is more demanding in this regard.)

    To create a new Scrapy project, let’s call it climate_tracker, run the following command in your terminal:

    scrapy startproject climate_tracker
    cd climate_tracker
  2. Examine the project structure.

    The command above will produce several files and folders. Take some time reading through them to understand the purpose of each file and folder.

    climate_tracker/
    β”œβ”€β”€ scrapy.cfg
    └── climate_tracker/
        β”œβ”€β”€ __init__.py
        β”œβ”€β”€ items.py
        β”œβ”€β”€ middlewares.py
        β”œβ”€β”€ pipelines.py
        β”œβ”€β”€ settings.py
        └── spiders/
            └── __init__.py
  3. Configure ethical scraping settings.

    Open the settings.py file and configure the following settings:

    # Identify your spider
    USER_AGENT = 'LSE DS205 Student Spider (GitHub: @your-username) (+https://lse-dsi.github.io/DS205)'
    
    # Be polite
    ROBOTSTXT_OBEY = True # Make sure this is uncommented and set to True
    DOWNLOAD_DELAY = 3  # Wait 3 seconds between requests (don't bombard the website with requests)

    Replace @your-username with your GitHub username. This way, the website maintainers know we are not digital pirates.
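
    If you want to be extra careful, Scrapy also ships an AutoThrottle extension that adapts the crawl speed to how quickly the server responds. The settings below are optional, a suggestion rather than something this lab requires:

    # Optional: let Scrapy adjust its own speed based on server response times
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 3   # initial delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 10    # back off up to 10 seconds if the server is slow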

πŸ“Œ Acknowledgement:

In compliance with the Climate Analytics and NewClimate Institute copyright notice, if you end up putting your web scraping code on GitHub, remember to add the following notice to the top of your files:

Data and extracted textual content from the Climate Action Tracker website are copyrighted
Β© 2009-2025 by Climate Analytics and NewClimate Institute. 
All rights reserved.

Part II: Using the Scrapy Shell (25 min)

Note to instructor: things to connect here include DOM inspection, using Ctrl + F in the browser’s DevTools to test CSS selectors, and how that workflow is similar to the Scrapy shell. Feel free to show other CSS pseudo-classes and how they work.

πŸ—£οΈ TEACHING MOMENT

Your class teacher will help you check your understanding of what each of these files represents. Then, they will link back to the concepts covered in the πŸ—£οΈ Week 04 Lecture as they show you around the Climate Action Tracker website.

Your class teacher will also demonstrate the steps below. Follow along and ask questions if you are unsure about anything.

  1. Access the Scrapy shell:

    Before you write code in a spider, it helps to test your CSS selectors in a shell dedicated to that purpose, the Scrapy shell.

    scrapy shell "https://climateactiontracker.org/countries/brazil/"

    This command sends a request to the website and loads the response into the shell, saving it to a variable called response.
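
    A useful shell helper to know about is view(response), which opens the downloaded page in your web browser so you can check exactly what Scrapy received (pages can look different without JavaScript):

    view(response)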

  2. Use the response object to test your CSS selectors:

    If you know some HTML elements and CSS selectors, you can use the response object to test them. Use the response.css() method to select elements and the .get() method to extract the text or attributes:

    Run the commands below and observe how their output is different.

    response.css('title::text').get()

    The above returns a single string, the text of the title of the page.

    response.css('title').get()

    The above returns the whole <title> element, tags included, as a string!

    response.css('title').getall()

    The above returns a list of strings, one for each <title> element found in the page.

    response.css('title')

    Without the get() method, we get a SelectorList object: a list-like container of Selector objects, one for each <title> element matched on the page.

    🀨 Think about it: What do you think is a SelectorList object? Why not a simple list or string?
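
    A hint: unlike a plain list or string, a SelectorList lets you keep chaining selectors and extract every match in one go. For example:

    # Select all content-block paragraphs, then extract their text as a list of strings
    response.css('div.content-block').css('p::text').getall()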

    Finally, here’s something more complex and interesting. The CSS selector below grabs the text of the first paragraph inside the β€˜Overview’ section of the page.

    selected_paragraph = response.css('div.content-block > p ::text').get()

    Read more about CSS selectors in the official documentation.

    The ::text is not part of the official CSS specification but is supported by Scrapy.

    What if you want to select the href attribute of a link? You can do that with the ::attr() pseudo-element:

    # Grabs the first link that appears in the page and returns its href attribute
    selected_link = response.css('a::attr(href)').get()

    The ::attr() pseudo-element is another non-standard feature supported by Scrapy.

    You can also use the ::attr() pseudo-element to select the src attribute of an image:

    selected_image = response.css('img::attr(src)').get()
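
    Note that src and href attributes often contain relative URLs (for example /countries/brazil/). If you need an absolute URL, the response object can resolve it for you:

    # Resolve a (possibly relative) URL against the page's own URL
    absolute_image_url = response.urljoin(selected_image)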

Part III: Scrapy Shell Practice (20 min)

It’s your turn to collect some data from the page.

🎯 ACTION POINTS

On the Scrapy shell, write the necessary CSS selectors to collect the following data:

  1. The country name. (the large text at the top left of the page)

  2. The flag of the country. (the image at the top left of the page)

  3. πŸ† The overall rating of the country.

    Keeping Brazil as an example, your code should return a simple string with the text "Insufficient".

When you’re done, exit the Scrapy shell by typing exit() and pressing Enter.

Part IV: Creating Your First Spider (30 min)

It’s time to make use of the Scrapy project we created at the beginning of the lab. The Spider is the main component of a Scrapy project. It defines how to scrape a website and what data to extract.

πŸ—£οΈ TEACHING MOMENT

Your class teacher will guide you through the steps below. Follow along and ask questions if you are unsure about anything.

  1. Create a new spider:

    From the root directory of your Scrapy project (climate_tracker/), use scrapy genspider to create a new spider:

    scrapy genspider climate_action_tracker climateactiontracker.org

    This will create a new file in your project with this structure:

    climate_tracker/
    β”œβ”€β”€ scrapy.cfg
    └── climate_tracker/
        β”œβ”€β”€ __init__.py
        β”œβ”€β”€ items.py
        β”œβ”€β”€ middlewares.py
        β”œβ”€β”€ pipelines.py
        β”œβ”€β”€ settings.py
        └── spiders/
            β”œβ”€β”€ __init__.py
            └── climate_action_tracker.py # πŸ†•

    You will see that it comes with the following code:

    import scrapy
    
    class ClimateActionTrackerSpider(scrapy.Spider):
        name = "climate_action_tracker"
        allowed_domains = ["climateactiontracker.org"]
        start_urls = ["https://climateactiontracker.org"]
    
        def parse(self, response):
            pass

    The key thing here is that the relevant scraping logic should be implemented in the parse() method. Right now, it’s empty (pass means do nothing in Python).

  2. Implement the scraping logic and return a dictionary of items:

    Inside the parse() method, add the CSS selectors you used in the previous part to extract the data you want.

    For example, to extract the country name, you can do the following:

    def parse(self, response):
        country_name = response.css('h1::text').get()
    
        # Return a dictionary of items
        yield {
            'country_name': country_name
        }

    🀨 Think about it: What is the purpose of the yield keyword? Why not return?
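
    A hint: yield turns parse() into a generator, so it can hand back any number of items (or follow-up requests) one at a time, without building a list first. A minimal sketch, assuming a page that lists several countries (the div.country selector is a placeholder, not the site’s actual markup):

    def parse(self, response):
        # Hypothetical: yield one item per country block found on the page
        for country in response.css('div.country'):
            yield {
                'country_name': country.css('h2::text').get()
            }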

  3. Modify the start URLs:

    Modify the start_urls list to include the URLs of the countries you want to scrape.

    For example, to scrape the data for Brazil, you can do the following:

    start_urls = ["https://climateactiontracker.org/countries/brazil/"]
  4. Run your spider:

    Now you can run your spider! From the root directory of the scrapy project, run:

    scrapy crawl climate_action_tracker -O output.json

    A LOT OF OUTPUT WILL BE GENERATED. This is normal. It means your spider is working as expected.

    What is happening here?

    • Scrapy sends a request to each one of the URLs in the start_urls list
    • It receives the response
    • It sends the response to the parse() method
    • The parse() method is where your logic goes. It’s where you extract the data you want
    • You control which data to extract and what you return in the dictionary
    • The dictionary is then saved to a JSON file, a file format that computers love.

🏠 Feel like practicing some more?

This section will be part of your πŸ“ W04-W05 Formative Exercise.

Now that you’ve learned the basics of Scrapy, it’s time to practice on your own! Here’s your challenge:

  1. Expand your spider’s reach:

    • Modify the start_urls list to include at least 3 different countries
    • For example: Brazil, China, and India
    start_urls = [
        "https://climateactiontracker.org/countries/brazil/",
        "https://climateactiontracker.org/countries/china/",
        "https://climateactiontracker.org/countries/india/"
    ]
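
    If you prefer, you can also build the list programmatically from country slugs, which keeps the spider tidy as the list grows (the slugs below are just the three examples above):

    # Inside your spider class
    COUNTRIES = ["brazil", "china", "india"]
    start_urls = [f"https://climateactiontracker.org/countries/{c}/" for c in COUNTRIES]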
  2. Collect more data:

    Update your parse() method to collect:

    • The country’s overall rating
    • The flag image URL
    • The first paragraph of the overview section

    Your output dictionary should look something like this:

    yield {
        'country_name': country_name,
        'rating': rating,
        'flag_url': flag_url,
        'overview': overview
    }
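
    If you want a starting point, here is a minimal sketch of how parse() could be structured. The selectors marked as placeholders are not the site’s actual markup: replace them with the ones you found while exploring the page in Part III.

    def parse(self, response):
        yield {
            'country_name': response.css('h1::text').get(),
            # Placeholder selectors below: swap in the ones you identified in Part III
            'rating': response.css('PLACEHOLDER::text').get(),
            'flag_url': response.css('PLACEHOLDER::attr(src)').get(),
            'overview': response.css('div.content-block > p ::text').get()
        }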
  3. Convert to DataFrame:

    After running your spider and getting the JSON output, try loading it into a pandas DataFrame. You can create a Jupyter notebook or a Python script to do this.

    Sample code:

    import pandas as pd
    
    # Read the JSON file
    df = pd.read_json('output.json')
    
    # Display the first few rows
    print(df.head())
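
    From there you can analyse the data with the usual pandas tools. For example, assuming your spider produced a rating column as suggested above:

    # Count how many countries fall into each overall rating
    print(df['rating'].value_counts())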
  4. Bonus Challenge: Try to extract all the indicators from the main div of the page (β€œPolicies and action”, β€œNDC target”, etc.)

Remember to keep the DOWNLOAD_DELAY setting so you don’t overwhelm the website with requests.