Week 02 – Day 04: Crawlers & Browser Automation
When you thought web scraping couldn't get any cooler.
Learning Objectives
Review the goals for today
At the end of the day, you should be able to:
- Set up Git on your machine.
- Understand the advantages of using a `scrapy` spider over `requests` + Scrapy Selectors.
- Understand the architecture of a `scrapy` spider.
- Use the `scrapy shell` to test your CSS selectors and XPath expressions.
- Create a new Scrapy project and a new spider.
- Use the `scrapy crawl` command to run your spider.
- Save the scraped data to a JSON or JSONL file.
Today you will learn a few new web scraping tricks. But before we dive into that, we need to set up Git on your machine.
Part I: ⚙️ Setting Up Git on Your Machine
Starting today, as we have to share code that does not fit into single Jupyter Notebooks, we will use Git to manage our code. While Git is free and open-source software for version control of files (mostly code), GitHub is a platform that allows you to host your code online and collaborate with others. GitHub has a free tier that is more than enough for our purposes.
Git/GitHub is to code what Google Docs is to documents: it allows multiple people to work on the same codebase without (always) stepping on each other's toes.
🎯 ACTION POINTS
📝 NOTE: If you have already done the Take-Home Activity: Creating a website with Markdown from a few days back, then you can start from Step 2.
1. If you haven't done so already, head to GitHub and create an account. Choose a username that you like and that you will be happy to share with others.
💡 TIP: Choose a username that you wouldn't mind using in a more professional, serious setting. GitHub can be used for personal projects but it is perhaps more commonly used for professional reasons, such as for hosting a portfolio of your data science projects.
2. Share your GitHub username with me, otherwise you won't have access to the repository we will be using today.

Click 👉 HERE to inform me of your GitHub username.
3. Check if you already have Git installed on your machine. Open a terminal and type `git --version`. If this does not throw an error, you have Git installed and you can skip to Step 5. If you get an error, you need to install Git.

4. (Potentially optional) Install Git on your machine. The installation procedure depends on your Operating System (OS):
Windows

Install Git Bash by downloading the installer from the Git for Windows website. Besides Git itself, this will also install the `bash` terminal on your machine, a Unix-like terminal that is an alternative to the PowerShell or Command Prompt that comes with Windows.

⚠️ IMPORTANT: When prompted, add Git Bash to your PATH environment variable. This will allow you to run Git commands from any terminal on your machine, including the Command Prompt or PowerShell.
macOS

If you have Homebrew installed, you can install Git by running `brew install git` in your terminal.

If you don't have Homebrew, the next easiest way is to install the Xcode Command Line Tools. Open a terminal and run `sudo xcode-select --install`. This will open a dialog box that will guide you through the installation process.

If the above does not work, you can install the Xcode Developer Tools manually. This is a large download and will take some time.
5. Configure your Git with your name and email address. Ideally, this e-mail address should be the same one you used to create your GitHub account. Run the following commands in your terminal:

`git config --global user.name "<your_name>"`
`git config --global user.email "<your_email>"`
Obviously, you must replace `<your_name>` and `<your_email>` with your actual name and email address.

6. Create an SSH key on your machine using a key-generator program called `ssh-keygen`.

Read the instructions from GitHub's official website: Generating a new SSH key and adding it to the ssh-agent to find out how to do so. Remember to use the instructions appropriate for your OS.
⚠️ IMPORTANT: Please ignore the section "Generating a new SSH key for a hardware security key".
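For reference, a typical key-generation session looks like the sketch below (following GitHub's own instructions). The email address is a placeholder; use the one attached to your GitHub account, and follow the linked guide if your OS needs slightly different flags.

```bash
# Generate a new SSH key pair (the email is a placeholder; use your own)
ssh-keygen -t ed25519 -C "your_email@example.com"

# Start the ssh-agent in the background and hand it your new private key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
```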
7. Let GitHub know about your SSH key by adding it to your GitHub account.

Read the instructions from GitHub's official website: Adding a new SSH key to your GitHub account to find out how to do so.
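GitHub needs the public half of the key you just generated. Assuming you kept the default file name from the previous step, you can print it and copy-paste it into your account settings (Settings → SSH and GPG keys → New SSH key):

```bash
# Print the public key so you can copy it into GitHub
cat ~/.ssh/id_ed25519.pub

# macOS: copy it straight to the clipboard
pbcopy < ~/.ssh/id_ed25519.pub

# Windows (Git Bash): copy it straight to the clipboard
clip < ~/.ssh/id_ed25519.pub
```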
8. Test that your SSH key works by connecting to GitHub.

Read the instructions from GitHub's official website: Testing your SSH connection to find out how to do so.
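In short, the test boils down to a single command; the greeting you get back should mention your GitHub username.

```bash
# Attempt an SSH connection to GitHub (it will not give you a shell; that is expected)
ssh -T git@github.com
# On success you should see something like:
#   Hi <your_username>! You've successfully authenticated, but GitHub does not provide shell access.
```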
Cool. You are now all set up to use `git` commands from your terminal.
Part II: 🕷️ The world of spiders
The `scrapy` package which we have been using has a lot more functionality than just the `Selector` class. In particular, we will look at the `scrapy` spider functionality today.
The `scrapy` spider framework is a powerful web scraping tool that allows you to extract data from websites in an asynchronous ( \(\neq\) sequential) manner. It provides a lot of the same functionality as we have explored previously using the `requests` and Scrapy `Selector` modules, but it is more efficient and can handle more complex tasks.
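To give you a first taste, here is a minimal sketch of what a spider looks like. Everything in it is illustrative: the class and spider name, the target site (quotes.toscrape.com, a practice site) and the CSS selectors are placeholders, not part of today's assignment.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal illustrative spider; the site and selectors are placeholders."""

    name = "quotes"                                 # used when you run `scrapy crawl quotes`
    start_urls = ["https://quotes.toscrape.com/"]   # where the crawl begins

    def parse(self, response):
        # Yield one dictionary per quote found on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the 'Next' button, if there is one, so pagination is handled for us
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Notice that the spider never loops over pages itself: it just yields items and new requests, and Scrapy's engine schedules everything asynchronously.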
Advantages of `scrapy` spiders

Let's compare `scrapy` spiders and `requests`:
| Characteristic | `requests` + Scrapy Selectors | The `scrapy` spider framework |
|---|---|---|
| Asynchronous | ❌ | ✅ |
| Built-in pagination | ❌ | ✅ |
| Middleware and pipelines¹ | ❌ | ✅ |
| Caching² | ❌ | ✅ |
| `robots.txt` settings³ | ❌ | ✅ |
| User-Agent settings | ❌ | ✅ |
| Built-in saving to JSON, CSV, JSONL etc. | ❌ | ✅ |
| Built-in shell for testing selectors⁴ | ❌ | ✅ |
| Built-in `crawl` command | ❌ | ✅ |
| Boilerplate code generation⁵ | ❌ | ✅ |
| Integration with headless browsers⁶ | ❌ | ✅ |
| Built-in sitemap parsers and link extractors⁷ | ❌ | ✅ |
| Works well in Jupyter Notebooks | ✅ | ❌ |
| Built-in `FormRequest` for submitting forms⁸ | ❌ | ✅ |
| Integration with `scrapyd` for cloud deployment⁹ | ❌ | ✅ |
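As a quick illustration of the built-in shell mentioned in the table, here is roughly what a selector-testing session looks like. The URL and selectors are placeholders (quotes.toscrape.com is a practice site):

```bash
# Open Scrapy's interactive shell against a practice site
scrapy shell "https://quotes.toscrape.com/"

# Inside the shell, a `response` object is already populated, so you can try, e.g.:
#   response.css("span.text::text").get()                      # first quote on the page
#   response.xpath("//small[@class='author']/text()").getall() # all author names
#   exit()                                                     # leave the shell
```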
The `scrapy` spider architecture
Spiders do not live in a single file: they sit inside a Scrapy project, a collection of files that work together to scrape a website. The main components of the project are:
| # | Component | Functionality |
|---|---|---|
| 1 | `items.py` | Defines the data structure that will be scraped. |
| 2 | `middlewares.py` | Processes requests and responses before and after they are sent and received. |
| 3 | `pipelines.py` | Processes the scraped data before it is saved to a file or database. |
| 4 | `settings.py` | Contains the configuration settings for the spider. |
| 5 | `spiders/` | Contains the spider classes that will scrape the website. |
💡 TIP: We will not be dealing with `middlewares`, `pipelines`, or `items` in this tutorial, but they are important components of a Scrapy project, so do not delete them.
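When we get to the hands-on part, the command-line workflow will look roughly like the sketch below. The project and spider names are placeholders, and the template repository you will work in may already contain some of this boilerplate, so treat this as orientation rather than commands to run right now.

```bash
# Generate the boilerplate for a new Scrapy project (creates the files listed above)
scrapy startproject myproject
cd myproject

# Generate a boilerplate spider inside the spiders/ folder
scrapy genspider quotes quotes.toscrape.com

# Run the spider and save the scraped items to a JSON Lines file
# (-O overwrites the output file; -o appends to it)
scrapy crawl quotes -O quotes.jsonl

# The same command can also write JSON or CSV; Scrapy infers the format from the extension
scrapy crawl quotes -O quotes.json
```

This maps directly onto today's learning objectives: `genspider` gives you the boilerplate, `scrapy crawl` runs the spider, and the `-O`/`-o` flags handle saving to JSON or JSONL.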
Time to code
To show the full power of spiders, we have set up a template GitHub repository:
LINK TO ACCEPT REPOSITORY ASSIGNMENT
- I set this up as an assignment on GitHub Classroom. Click the link above to accept the assignment. This will create a unique repository for you to work in.
- If you get a 404 error or a permission error, it means you haven't shared your GitHub username with me. Please do so now by clicking HERE.
Footnotes
1. Middleware and pipelines are used to process requests and responses before and after they are sent and received, respectively.
2. Caching is the process of storing data in a temporary storage area so that it can be accessed more quickly.
3. `robots.txt` is a file that tells search engine crawlers which pages or files the crawler can or can't request from your site.
4. The Scrapy shell is a powerful tool for testing your CSS selectors and XPath expressions in the terminal.
5. Scrapy can generate boilerplate code for a new spider, project, or item.
6. Scrapy can be used with headless browsers like Puppeteer and Playwright to scrape websites that require JavaScript to render.
7. Scrapy has built-in classes for parsing `sitemap.xml` (e.g. the LSE sitemap) and extracting links.
8. Scrapy has a `FormRequest` class that can be used to submit forms.
9. `scrapyd` is a service for running, scheduling, and monitoring Scrapy spiders in the cloud.