πŸ—“οΈ Week 02 – Day 04: Crawlers & Browser Automation

When you thought web scraping couldn’t get any cooler.

Published: 18 July 2024

πŸ₯… Learning Objectives

Review the goals for today

At the end of the day, you should be able to:

  • Set up Git on your machine.
  • Understand the advantages of using Scrapy spiders over requests + Scrapy Selectors.
  • Understand the architecture of a Scrapy spider project.
  • Use the scrapy shell to test your CSS selectors and XPath expressions.
  • Create a new Scrapy project and a new spider.
  • Use the scrapy crawl command to run your spider.
  • Save the scraped data to a JSON or JSONL file.

Today you will learn a few new web scraping tricks. But before we dive into that, we need to set up Git on your machine.

Part I: βš™οΈ Setting Up Git on Your Machine

Starting today, since we will be sharing code that no longer fits into a single Jupyter Notebook, we will use Git to manage our code. Git is free, open-source software for version control of files (mostly code), while GitHub is a platform that allows you to host your code online and collaborate with others. GitHub has a free tier that is more than enough for our purposes.

Git/GitHub is to code what Google Docs is to documents: it allows multiple people to work on the same codebase without (always) stepping on each other’s toes.

🎯 ACTION POINTS

πŸ“ NOTE: If you have already done the πŸ“‹ Take-Home Activity: Creating a website with Markdown from a few days ago, you can start from Step 2.

  1. If you haven’t done so already, head to GitHub and create an account. Choose a username that you like and that you will be happy to share with others.

    πŸ’‘ TIP: Choose a username that you wouldn't mind using in a more professional, serious setting. GitHub can be used for personal projects, but it is perhaps more commonly used for professional reasons, such as hosting a portfolio of your data science projects.

  2. Share your GitHub username with me, otherwise you won’t have access to the repository we will be using today.

    Click πŸ”— HERE to inform me of your GitHub username

  3. Check if you already have Git installed on your machine. Open a terminal and type git --version. If this does not throw an error, you have Git installed and you can skip to Step 5. If you get an error, you need to install Git.

  4. (Potentially optional) Install Git on your machine. The installation procedure depends on your Operating System (OS):

Windows

  • Install Git Bash by downloading the installer from the Git for Windows website. In addition to Git itself, this will also install the Bash terminal on your machine, a Unix-like terminal that serves as an alternative to the PowerShell or Command Prompt that comes with Windows.

  • IMPORTANT: When prompted, add Git to your PATH environment variable. This will allow you to run Git commands from any terminal on your machine, including the Command Prompt or PowerShell.

macOS

  • If you have Homebrew installed, you can install Git by running brew install git in your terminal.

  • If you don’t have Homebrew, the next easiest way is to install Xcode Command Line Tools. Open a terminal and run sudo xcode-select --install. This will open a dialog box that will guide you through the installation process.

  • If the above does not work, you can install Xcode Developer Tools manually. This is a large download and will take some time.

  5. Configure Git with your name and email address. Ideally, this email address should be the same one you used to create your GitHub account. Run the following commands in your terminal:

    git config --global user.name "<your_name>"
    git config --global user.email "<your_email>"

    Obviously, you must replace <your_name> and <your_email> with your actual name and email address.
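
    You can double-check that Git saved your details by listing your global settings. The values shown as comments below are just placeholders for whatever you typed above:

    # List your global Git configuration
    git config --global --list
    # user.name=<your_name>
    # user.email=<your_email>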

  6. Create an SSH key on your machine using a key-generator program called ssh-keygen.

    Read the instructions from GitHub's official website: Generating a new SSH key and adding it to the ssh-agent to find out how to do so. Remember to use the instructions appropriate for your OS; a sketch of the typical commands is included below.

    ⚠️ IMPORTANT: Please ignore the section "Generating a new SSH key for a hardware security key".

  7. Let GitHub know about your SSH key by adding it to your GitHub account.

    Read the instructions from GitHub’s official website: Adding a new SSH key to your GitHub account to find out how to do so.
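
    The key you paste into GitHub is the public one (the file ending in .pub). Assuming you kept the default file name from the previous step, you can print it in the terminal and copy it from there:

    # Print the public key so you can copy it into GitHub
    cat ~/.ssh/id_ed25519.pub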

  8. Test that your SSH key works by connecting to GitHub.

    Read the instructions from GitHub’s official website: Testing your SSH connection to find out how to do so.
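
    In practice, the test boils down to a single command. If everything is set up correctly, GitHub replies with a short greeting like the one shown below (with your own username; the exact wording may vary):

    ssh -T git@github.com
    # Hi <username>! You've successfully authenticated, but GitHub does not provide shell access.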

Cool. You are now all set up to use git commands from your terminal.

Part II: πŸ•·οΈ The world of spiders

The scrapy package we have been using has a lot more functionality than just the Selector class. In particular, today we will look at its spider functionality.

Scrapy's spider framework is a powerful web scraping tool that allows you to extract data from websites in an asynchronous (as opposed to sequential) manner. It provides much of the same functionality we explored previously with the requests library and Scrapy Selectors, but it is more efficient and can handle more complex tasks.
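
To give you a first taste before we build one properly, here is a minimal sketch of what a spider looks like. It scrapes quotes.toscrape.com, the practice site used in the official Scrapy tutorial; the spider name, URL, and selectors are purely illustrative and not part of today's exercise.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Every spider has a unique name; this is what you pass to `scrapy crawl`
        name = "quotes"
        # Scrapy requests these URLs and passes the responses to parse()
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # response.css() works just like the Scrapy Selectors we used before
            for quote in response.css("div.quote"):
                # Each dictionary yielded here becomes one scraped item
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

            # Built-in pagination: follow the "Next" link, if there is one
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Notice that parse() simply yields dictionaries (the scraped items) and new requests; Scrapy schedules those requests asynchronously, which is where the speed-up over a sequential requests loop comes from.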

Advantages of Scrapy spiders

Let’s compare spiders and requests:

| Characteristic | requests + Scrapy Selectors | The Scrapy spider framework |
|---|---|---|
| Asynchronous | ❌ | βœ… |
| Built-in pagination | ❌ | βœ… |
| Middleware and pipelines¹ | ❌ | βœ… |
| Caching² | ❌ | βœ… |
| robots.txt settings³ | ❌ | βœ… |
| User-Agent settings | βœ… | βœ… |
| Built-in saving to JSON, CSV, JSONL, etc. | ❌ | βœ… |
| Built-in shell for testing selectors⁴ | ❌ | βœ… |
| Built-in crawl command | ❌ | βœ… |
| Boilerplate code generation⁡ | ❌ | βœ… |
| Integration with headless browsers⁢ | ❌ | βœ… |
| Built-in sitemap parsers and link extractors⁷ | ❌ | βœ… |
| Works well in Jupyter Notebooks | βœ… | ❌ |
| Built-in FormRequest for submitting forms⁸ | ❌ | βœ… |
| Integration with scrapyd for cloud deployment⁹ | ❌ | βœ… |

The scrapy spider architecture

Spiders are not just a single file, but a collection of files that work together to scrape a website. The main components of a spider are:

| # | Component | Functionality |
|---|---|---|
| 1 | items.py | Defines the data structure that will be scraped. |
| 2 | middlewares.py | Processes requests and responses before and after they are sent and received. |
| 3 | pipelines.py | Processes the scraped data before it is saved to a file or database. |
| 4 | settings.py | Contains the configuration settings for the spider. |
| 5 | spiders/ | Contains the spider classes that will scrape the website. |

πŸ’‘ TIP: We will not be dealing with middlewares, pipelines, or items in this tutorial, but they are important components of a Scrapy project, so do not delete them.
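
To give you an idea of the workflow before you open the repository below, here is a rough sketch of the Scrapy commands you will be using today. The project and spider names are placeholders, and your assignment repository may already provide some of this boilerplate:

    # Generate the boilerplate for a new project
    # (creates items.py, middlewares.py, pipelines.py, settings.py and spiders/)
    scrapy startproject myproject
    cd myproject

    # Generate the boilerplate for a new spider inside the project
    # (arguments: spider name, then the domain it is allowed to crawl)
    scrapy genspider quotes quotes.toscrape.com

    # Open the interactive Scrapy shell to test CSS selectors and XPath expressions
    scrapy shell "https://quotes.toscrape.com/"

    # Run the spider and save the scraped data to a JSON Lines file
    # (-O overwrites the output file; -o appends to it)
    scrapy crawl quotes -O quotes.jsonl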

Time to code

To show the full power of spiders, we have set up a template GitHub repository:

πŸ–‡οΈ LINK TO ACCEPT REPOSITORY ASSIGNMENT

  • I set this up as an assignment on GitHub Classroom. Click the link above to accept the assignment. This will create a unique repository for you to work in (see the clone commands below).
  • If you get a 404 error or a permission error, it means you haven’t shared your GitHub username with me. Please do so now by clicking HERE.
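
Once your repository has been created, clone it to your machine using Git over SSH. The URL below is just a placeholder; copy the real one from the green "Code" button on your repository's page on GitHub:

    # Clone your assignment repository and move into it
    git clone git@github.com:<organisation>/<your-repository>.git
    cd <your-repository>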

Footnotes

  1. Middleware and pipelines are used to process requests and responses before and after they are sent and received, respectively.

  2. Caching is the process of storing data in a temporary storage area so that it can be accessed more quickly.

  3. robots.txt is a file that tells search engine crawlers which pages or files the crawler can or can't request from your site.

  4. The Scrapy shell is a powerful tool for testing your CSS selectors and XPath expressions in the terminal.

  5. Scrapy can generate boilerplate code for a new spider, project, or item.

  6. Scrapy can be used with headless browsers like Puppeteer and Playwright to scrape websites that require JavaScript to render.

  7. Scrapy has built-in classes for parsing sitemap.xml (e.g. the LSE sitemap) and extracting links.

  8. Scrapy has a FormRequest class that can be used to submit forms.

  9. scrapyd is a service for running, scheduling, and monitoring Scrapy spiders in the cloud.