W04-W05 Formative Exercise
Web Scraping with Scrapy

Briefing
- ⏳ DEADLINE: Tuesday, 20 February, 23:59 GMT
- Repository: lse-ds205/climate-data-web-scraping
- Key Learning Concept: Ethical web scraping, complex data extraction, and code review through pull requests
💡 Your First Web Scraping Project:
This exercise builds upon the 💻 W04 Lab and extends it further. You'll start with a working spider that extracts basic information and enhance it to collect more complex, nested data.
I will dedicate 21 February to reviewing your pull requests and providing feedback. This will help prepare you for your first graded assignment in Week 06. I will only review pull requests created by the deadline (20 February, 23:59 GMT).
Late submissions will not receive feedback in time for the graded assignment.
Part I: A W04 Lab Solution
Let's start by examining a working spider, effectively a solution to the 💻 W04 Lab. This spider extracts basic information from the Climate Action Tracker website.
I've created a dedicated GitHub repository for this mini-project: lse-ds205/climate-data-web-scraping
Read and follow the instructions in the README.md file to clone the repository and set up your environment.
💡 Note: We strongly recommend using a virtual environment for this exercise. This keeps your project dependencies isolated, preventing conflicts with other Python projects (like the `ascor-api`).

Create your own branch to work on this exercise. Just like we did in the previous mini-project, this will allow you to keep your main branch clean and to revert to it if you make any mistakes.

Name your branch `feature/my-first-spider-<your-github-username>` and push it to the remote repository:

```bash
git checkout -b feature/my-first-spider-<your-github-username>
git push origin feature/my-first-spider-<your-github-username>
```

Of course, replace `<your-github-username>` with your actual GitHub username, deleting the `<` and `>` characters. Keep committing and pushing as you work on the rest of the exercise.
Run the spider to see it in action:

```bash
cd climate_tracker
scrapy crawl climate_action_tracker -O ./data/output.json
```

(The uppercase `-O` flag overwrites `./data/output.json` on every run; a lowercase `-o` would append to the file instead.)
Examine the spider code in `climate_action_tracker.py`. Pay attention to:

- How the spider is configured (name, allowed domains, start URLs)
- The basic data extraction in the `parse()` method
- The use of CSS selectors to extract data (does it match your understanding of the HTML structure?)
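If it helps to have a mental model before opening the file, a spider of this kind typically looks something like the sketch below. This is an illustration, not the actual contents of `climate_action_tracker.py`: the class name, fields, and especially the CSS selectors are assumptions you should verify against the real code and HTML.

```python
import scrapy


class ClimateActionTrackerSpider(scrapy.Spider):
    # Sketch only: the real climate_action_tracker.py may differ in
    # class name, selectors, and extracted fields.
    name = "climate_action_tracker"  # what `scrapy crawl` refers to
    allowed_domains = ["climateactiontracker.org"]  # off-site requests get filtered out
    start_urls = ["https://climateactiontracker.org/countries/india/"]

    def parse(self, response):
        # Scrapy downloads each start URL and hands the response to parse().
        yield {
            # Hypothetical selectors; inspect the actual page HTML to confirm.
            "country_name": response.css("h1::text").get(),
            "overall_rating": response.css("div.rating::text").get(),
        }
```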
Need a refresher?
If you're having trouble understanding how the CSS selectors work, revisit the 🗣️ W04 Lecture, as well as your 💻 W04 Lab notes on using the Scrapy shell. (Avoid missing lectures or you will soon feel overwhelmed with the amount of "undigested" content!)
You can test selectors interactively:

```bash
scrapy shell "https://climateactiontracker.org/countries/india/"
```

Then, once you're inside the shell, you can test the selectors:

```python
response.css('h1::text').get()
```
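While you're in the shell, it's worth getting comfortable with the difference between `.get()` and `.getall()`. The variations below are generic illustrations (only the `h1` selector comes from the example above; the rest are not selectors I have checked against this site):

```python
# First match only: returns a string, or None if nothing matches
response.css('h1::text').get()

# All matches: returns a list of strings
response.css('p::text').getall()

# Extract an attribute value instead of the text content
response.css('a::attr(href)').getall()
```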
Test it out!
We showed you `scrapy crawl`, which in the long run will be the command you use to run your spiders, as it allows you to run them on a list of URLs. But here it might also be useful to use another command, `scrapy parse`, to test the spider on a single URL:

```bash
# You can specify any URL you want to test
scrapy parse https://climateactiontracker.org/countries/india/ --spider climate_action_tracker
```

This command is particularly useful when debugging, as it shows you exactly what data your spider extracts from a specific page.
Part II: Extending the Spider
Now that you understand the basic spider, let's extend it to collect more detailed data. Your task is to modify the spider to extract all climate indicators for each country.
The desired output should look like this:

```json
[
    {
        "country_name": "India",
        "overall_rating": "Highly insufficient",
        "indicators": [
            {
                "term": "Policies and action",
                "term_details": "against fair share",
                "value": "Insufficient",
                "metric": "< 3°C World"
            },
            {
                "term": "Conditional NDC target",
                "term_details": "against modelled domestic pathways",
                "value": "Highly insufficient",
                "metric": "< 4°C World"
            }
            // ... other indicators
        ]
    }
]
```
😬 How do I put this… this won't be easy!
You will find similarities with the way you had to construct nested dictionaries during your earlier mini-project with the `ascor-api`. However, while the HTML structure isn't deliberately obfuscated (like Amazon's), extracting this data will still be challenging.
We'll work through most of these challenges together in the 🗣️ Week 05 Lecture, but for now, take this as an opportunity to challenge yourself!
Tips for success:

- Start small: test extracting just one indicator at a time
- Use the Scrapy shell extensively to test your selectors
- You can chain `.css()` selectors to get more specific, like `response.css('div.some-class').css('h1::text').get()`
- You can also loop through the indicators to handle their inner HTML differently, as in the sketch after these tips:

  ```python
  for indicator in response.css('div.indicator'):
      if indicator.css(...):   # e.g. check whether some sub-element exists
          ...  # do something
      else:
          ...  # do something else
  ```

- Print intermediate results to understand what you're getting
- If using AI tools, make sure you understand every line of suggested code, and also try to prompt the AI to simplify the code as much as possible
- Don't hesitate to ask for help in the `#help` Slack channel
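To make that loop idea concrete, here is one possible shape for an extended `parse()` method (it would live inside the spider class). Treat it strictly as a sketch under assumptions: every CSS class and tag below (`div.ratings-matrix`, `dt`, `dd`, and so on) is a placeholder I have invented, so replace them with whatever the Scrapy shell reveals about the real pages:

```python
def parse(self, response):
    # All selectors below are invented placeholders, not the real structure
    # of the Climate Action Tracker pages; discover the actual classes and
    # tags in the Scrapy shell before reusing any of this.
    indicators = []
    for block in response.css("div.ratings-matrix div.indicator"):
        indicators.append({
            "term": block.css("dt::text").get(),
            "term_details": block.css("dt span::text").get(),
            "value": block.css("dd b::text").get(),
            "metric": block.css("dd i::text").get(),
        })

    yield {
        "country_name": response.css("h1::text").get(),
        "overall_rating": response.css("dd.rating::text").get(),
        "indicators": indicators,
    }
```

Note that `.get()` returns `None` when a selector matches nothing, which makes missing indicators easy to spot when you print intermediate results.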
Part III: Submit Your Work
Once you've successfully extended the spider:
Keep your branch updated. Commit your remaining changes and push your branch to the remote repository.
Open a Pull Request. Visit the repository on GitHub, click on the `Pull Requests` tab, and click `New pull request`. Select your branch and click `Create pull request`. Tag me (@jonjoncardoso) as a reviewer. Please include a brief description explaining your implementation choices and approach.
I will review your pull request during Week 05 and provide feedback on your implementation. This will be the last time I will review your code before the first graded assignment.