๐Ÿ“ W04-W05 Formative Exercise

Web Scraping with Scrapy

Published: 10 February 2025

🎯 Learning Goals
By the end of this exercise, you will: i) understand how to read and test existing Scrapy spiders, ii) learn to extract complex nested data from web pages, and iii) practice submitting pull requests for code review.

Briefing

โณ DEADLINE Tuesday, 20 February, 23:59 GMT
๐Ÿ“‚ Repository lse-ds205/climate-data-web-scraping
๐Ÿ’Ž Key Learning Concept Ethical web scraping, complex data extraction, and code review through pull requests

💡 Your First Web Scraping Project:

This exercise builds upon the 💻 W04 Lab and extends it further. You'll start with a working spider that extracts basic information and enhance it to collect more complex, nested data.

I will dedicate Friday, 21 February to reviewing your pull requests and providing feedback. This will help prepare you for your first graded assignment in Week 06. I will only review pull requests created by the deadline (20 February, 23:59 GMT).

Late submissions will not receive feedback in time for the graded assignment.

Part I: A ✅ W04 Lab Solution

Let's start by examining a working spider, effectively a solution to the 💻 W04 Lab. This spider extracts basic information from the Climate Action Tracker website.

I've created a dedicated GitHub repository for this mini-project: lse-ds205/climate-data-web-scraping

  1. Read and follow the instructions in the README.md file to clone the repository and set up your environment.

    💡 Note: We strongly recommend using a virtual environment for this exercise. This keeps your project dependencies isolated, preventing conflicts with other Python projects (like the ascor-api).
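
    A typical setup, assuming the repository ships a requirements.txt (the README is the authoritative reference for the exact steps), looks like:

    python -m venv venv
    source venv/bin/activate   # on Windows: venv\Scripts\activate
    pip install -r requirements.txt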

  2. Create your own branch to work on this exercise.

    Just like we did in the previous mini-project, this will allow you to keep your main branch clean and to revert to it if you make any mistakes.

    Name your branch feature/my-first-spider-<your-github-username> and push it to the remote repository:

    git checkout -b feature/my-first-spider-<your-github-username>
    git push origin feature/my-first-spider-<your-github-username>

    Of course, replace <your-github-username> with your actual GitHub username, dropping the < and > characters.

    Keep committing and pushing as you work on the rest of the exercise.

  3. Run the spider to see it in action:

    cd climate_tracker
    scrapy crawl climate_action_tracker -O ./data/output.json
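
    The uppercase -O flag overwrites ./data/output.json on every run; a lowercase -o would append to the existing file instead.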
  4. Examine the spider code in climate_action_tracker.py. Pay attention to:

    • How the spider is configured (name, allowed domains, start URLs)
    • The basic data extraction in the parse() method
    • The use of CSS selectors to extract data (does it match your understanding of the HTML structure?)
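
    For orientation, a spider of this shape typically looks roughly like the sketch below. The overall_rating selector is an illustrative placeholder; the actual file in the repository is the authoritative version:

    import scrapy

    class ClimateActionTrackerSpider(scrapy.Spider):
        # The name you pass to `scrapy crawl`
        name = "climate_action_tracker"
        # Requests outside these domains are filtered out
        allowed_domains = ["climateactiontracker.org"]
        # The page(s) the spider visits first
        start_urls = ["https://climateactiontracker.org/countries/india/"]

        def parse(self, response):
            yield {
                "country_name": response.css("h1::text").get(),
                # Placeholder selector -- check the real HTML structure
                "overall_rating": response.css(".overall-rating::text").get(),
            }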

๐Ÿ” Need a refresher?

If youโ€™re having trouble understanding how the CSS selectors work, revisit the ๐Ÿ—ฃ๏ธ W04 Lecture, as well as your ๐Ÿ’ป W04 Lab notes on using the Scrapy shell. (Avoid missing lectures or you will soon feel overwhelmed with the amount of โ€˜undigestedโ€™ content!)

You can test selectors interactively:

scrapy shell "https://climateactiontracker.org/countries/india/"

Then, once you're inside the shell, you can test the selectors:

response.css('h1::text').get()
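
A couple of other shell patterns worth knowing (standard Scrapy usage, not specific to this site):

# All matching text nodes, not just the first
response.css('h1::text').getall()

# Extract an attribute value instead of text
response.css('a::attr(href)').getall()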
  5. 🆕 Test it out!

    We showed you scrapy crawl, which in the long run is the command you'll use to run your spiders, since it runs them across a whole list of URLs. Here, though, another command, scrapy parse, is also useful for testing the spider on a single URL:

    # You can specify any URL you want to test
    scrapy parse https://climateactiontracker.org/countries/india/ --spider climate_action_tracker

    This command is particularly useful when debugging as it shows you exactly what data your spider extracts from a specific page.

Part II: Extending the Spider

Now that you understand the basic spider, let's extend it to collect more detailed data. Your task is to modify the spider to extract all climate indicators for each country.

The desired output should look like this:

[{
  "country_name": "India",
  "overall_rating": "Highly insufficient",
  "indicators": [{
      "term": "Policies and action",
      "term_details": "against fair share",
      "value": "Insufficient",
      "metric": "< 3ยฐC World"
    },
    {
      "term": "Conditional NDC target",
      "term_details": "against modelled domestic pathways",
      "value": "Highly insufficient",
      "metric": "< 4ยฐC World"
    },
    // ... other indicators
  ]
}]

😬 How do I put this… this won't be easy!

You will find similarities with the way you had to construct nested dictionaries during your earlier mini-project with the ascor-api. However, while the HTML structure isn't deliberately obfuscated (like Amazon's), extracting this data will still be challenging.

We'll work through most of these challenges together in the 🗣️ Week 05 Lecture, but for now, take this as an opportunity to challenge yourself!

Tips for success:

  1. Start small - test extracting just one indicator at a time

  2. Use the Scrapy shell extensively to test your selectors

  3. You can chain .css() selectors to get more specific, like:

    response.css('div.some-class').css('h1::text').get()
  4. You can also loop through the indicators to handle their inner HTML differently (a fuller sketch combining these tips follows this list):

    for indicator in response.css('div.indicator'):
        if indicator.css(...):
            # do something
        else:
            # do something else
  5. Print intermediate results to understand what you're getting

  6. If using AI tools, make sure you understand every line of suggested code and also try to prompt the AI to simplify the code as much as possible

  7. Don't hesitate to ask for help in the #help Slack channel
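
Putting tips 3 and 4 together, the extended parse() method might be shaped roughly like the sketch below. Every class name here (div.indicator, .term, .details, and so on) is a made-up placeholder; finding the real selectors in the page's HTML is the actual exercise:

def parse(self, response):
    indicators = []
    # Placeholder selector: inspect the page to find the element
    # that wraps each indicator block
    for block in response.css("div.indicator"):
        indicators.append({
            "term": block.css(".term::text").get(),
            "term_details": block.css(".details::text").get(),
            "value": block.css(".value::text").get(),
            "metric": block.css(".metric::text").get(),
        })

    yield {
        "country_name": response.css("h1::text").get(),
        # Placeholder selector for the overall rating
        "overall_rating": response.css(".overall-rating::text").get(),
        "indicators": indicators,
    }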

Part III: Submit Your Work

Once youโ€™ve successfully extended the spider:

  1. Keep your branch updated. Commit your remaining changes and push your branch to the remote repository.
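
    For example (run git status first to confirm what changed):

    git add -A
    git commit -m "Extract nested climate indicators"
    git push origin feature/my-first-spider-<your-github-username>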

  2. Open a Pull Request. Visit the repository on GitHub, click on the Pull Requests tab, and click New pull request. Select your branch and click Create pull request. Tag me (@jonjoncardoso) as a reviewer.

    Please include a brief description explaining your implementation choices and approach.
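
    Alternatively, if you have the GitHub CLI installed, you can open the pull request from the terminal (a sketch; the web interface works just as well):

    gh pr create --title "Extend climate spider" --body "Brief description of your approach" --reviewer jonjoncardoso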

I will review your pull request during Week 05 and provide feedback on your implementation. This will be the last time I review your code before the first graded assignment.