๐Ÿ“ W04-W05 Formative Exercise

Web Scraping with Scrapy

Published: 10 February 2025

🎯 Learning Goals
By the end of this exercise, you will: i) understand how to read and test existing Scrapy spiders, ii) learn to extract complex nested data from web pages, and iii) practice submitting pull requests for code review.

Briefing

โณ DEADLINE Tuesday, 20 February, 23:59 GMT
๐Ÿ“‚ Repository lse-ds205/climate-data-web-scraping
๐Ÿ’Ž Key Learning Concept Ethical web scraping, complex data extraction, and code review through pull requests

💡 Your First Web Scraping Project:

This exercise builds upon the 💻 W04 Lab and extends it further. You'll start with a working spider that extracts basic information and enhance it to collect more complex, nested data.

I will dedicate Friday, 21 February to reviewing your pull requests and providing feedback. This will help prepare you for your first graded assignment in Week 06. I will only review pull requests created by the deadline (20 February, 23:59 GMT).

Late submissions will not receive feedback in time for the graded assignment.

Part I: A ✅ W04 Lab Solution

Let's start by examining a working spider, effectively a solution to the 💻 W04 Lab. This spider extracts basic information from the Climate Action Tracker website.

I've created a dedicated GitHub repository for this mini-project: lse-ds205/climate-data-web-scraping

  1. Read and follow the instructions in the README.md file to clone the repository and set up your environment.

    💡 Note: We strongly recommend using a virtual environment for this exercise. This keeps your project dependencies isolated, preventing conflicts with other Python projects (like the ascor-api).
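
    A typical setup, assuming the repository ships a requirements.txt (the README is the authoritative reference for the exact steps), looks like:

    python -m venv venv
    source venv/bin/activate   # on Windows: venv\Scripts\activate
    pip install -r requirements.txt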

  2. Create your own branch to work on this exercise.

    Just like we did in the previous mini-project, this will allow you to keep your main branch clean and to revert to it if you make any mistakes.

    Name your branch feature/my-first-spider-<your-github-username> and push it to the remote repository:

    git checkout -b feature/my-first-spider-<your-github-username>
    git push origin feature/my-first-spider-<your-github-username>

    Of course, replace <your-github-username> with your actual GitHub username, dropping the < and > characters.

    Keep committing and pushing as you work on the rest of the exercise.

  3. Run the spider to see it in action:

    cd climate_tracker
    scrapy crawl climate_action_tracker -O ./data/output.json
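
    The uppercase -O flag overwrites ./data/output.json on every run; a lowercase -o would append to the existing file instead.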
  4. Examine the spider code in climate_action_tracker.py. Pay attention to:

    • How the spider is configured (name, allowed domains, start URLs)
    • The basic data extraction in the parse() method
    • The use of CSS selectors to extract data (does it match your understanding of the HTML structure?)
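
    For orientation, a spider of this shape typically looks roughly like the sketch below. The overall_rating selector is an illustrative placeholder; the actual file in the repository is the authoritative version:

    import scrapy

    class ClimateActionTrackerSpider(scrapy.Spider):
        # The name you pass to `scrapy crawl`
        name = "climate_action_tracker"
        # Requests outside these domains are filtered out
        allowed_domains = ["climateactiontracker.org"]
        # The page(s) the spider visits first
        start_urls = ["https://climateactiontracker.org/countries/india/"]

        def parse(self, response):
            yield {
                "country_name": response.css("h1::text").get(),
                # Placeholder selector -- check the real HTML structure
                "overall_rating": response.css(".overall-rating::text").get(),
            }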

๐Ÿ” Need a refresher?

If youโ€™re having trouble understanding how the CSS selectors work, revisit the ๐Ÿ—ฃ๏ธ W04 Lecture, as well as your ๐Ÿ’ป W04 Lab notes on using the Scrapy shell. (Avoid missing lectures or you will soon feel overwhelmed with the amount of โ€˜undigestedโ€™ content!)

You can test selectors interactively:

scrapy shell "https://climateactiontracker.org/countries/india/"

Then, once you're inside the shell, you can test the selectors:

response.css('h1::text').get()
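
A couple of other shell patterns worth knowing (standard Scrapy usage, not specific to this site):

# All matching text nodes, not just the first
response.css('h1::text').getall()

# Extract an attribute value instead of text
response.css('a::attr(href)').getall()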
  5. 🆕 Test it out!

    We showed you scrapy crawl, which in the long run is the command you'll use to run your spiders, since it runs them across a whole list of URLs. Here, though, another command, scrapy parse, is also useful for testing the spider on a single URL:

    # You can specify any URL you want to test
    scrapy parse https://climateactiontracker.org/countries/india/ --spider climate_action_tracker

    This command is particularly useful when debugging as it shows you exactly what data your spider extracts from a specific page.

Part II: Extending the Spider

Now that you understand the basic spider, let's extend it to collect more detailed data. Your task is to modify the spider to extract all climate indicators for each country.

The desired output should look like this:

[{
  "country_name": "India",
  "overall_rating": "Highly insufficient",
  "indicators": [{
      "term": "Policies and action",
      "term_details": "against fair share",
      "value": "Insufficient",
      "metric": "< 3ยฐC World"
    },
    {
      "term": "Conditional NDC target",
      "term_details": "against modelled domestic pathways",
      "value": "Highly insufficient",
      "metric": "< 4ยฐC World"
    },
    // ... other indicators
  ]
}]

😬 How do I put this… this won't be easy!

You will find similarities with the way you had to construct nested dictionaries during your earlier mini-project with the ascor-api. However, while the HTML structure isn't deliberately obfuscated (like Amazon's), extracting this data will still be challenging.

We'll work through most of these challenges together in the 🗣️ Week 05 Lecture, but for now, take this as an opportunity to challenge yourself!

Tips for success:

  1. Start small - test extracting just one indicator at a time

  2. Use the Scrapy shell extensively to test your selectors

  3. You can chain .css() selectors to get more specific, like:

    response.css('div.some-class').css('h1::text').get()
  4. You can also loop through the indicators to handle their inner HTML differently (a fuller sketch combining these tips follows this list):

    for indicator in response.css('div.indicator'):
        if indicator.css(...):
            # do something
        else:
            # do something else
  5. Print intermediate results to understand what you're getting

  6. If using AI tools, make sure you understand every line of suggested code and also try to prompt the AI to simplify the code as much as possible

  7. Don't hesitate to ask for help in the #help Slack channel
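
Putting tips 3 and 4 together, the extended parse() method might be shaped roughly like the sketch below. Every class name here (div.indicator, .term, .details, and so on) is a made-up placeholder; finding the real selectors in the page's HTML is the actual exercise:

def parse(self, response):
    indicators = []
    # Placeholder selector: inspect the page to find the element
    # that wraps each indicator block
    for block in response.css("div.indicator"):
        indicators.append({
            "term": block.css(".term::text").get(),
            "term_details": block.css(".details::text").get(),
            "value": block.css(".value::text").get(),
            "metric": block.css(".metric::text").get(),
        })

    yield {
        "country_name": response.css("h1::text").get(),
        # Placeholder selector for the overall rating
        "overall_rating": response.css(".overall-rating::text").get(),
        "indicators": indicators,
    }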

Part III: Submit Your Work

Once youโ€™ve successfully extended the spider:

  1. Keep your branch updated. Commit your remaining changes and push your branch to the remote repository.
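
    For example (run git status first to confirm what changed):

    git add -A
    git commit -m "Extract nested climate indicators"
    git push origin feature/my-first-spider-<your-github-username>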

  2. Open a Pull Request. Visit the repository on GitHub, click on the Pull Requests tab, and click New pull request. Select your branch and click Create pull request. Tag me (@jonjoncardoso) as a reviewer.

    Please include a brief description explaining your implementation choices and approach.
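
    Alternatively, if you have the GitHub CLI installed, you can open the pull request from the terminal (a sketch; the web interface works just as well):

    gh pr create --title "Extend climate spider" --body "Brief description of your approach" --reviewer jonjoncardoso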

I will review your pull request during Week 05 and provide feedback on your implementation. This will be the last time I review your code before the first graded assignment.