# 💻 Lab 05 – Web Scraping Practice I

Week 02, Day 01 – Lab Roadmap (90 min)
## The Brief

We will practice collecting specific information from a website using programming, a process known as web scraping. We will use Python's `requests` package to send HTTP requests to a website and the `Selector` class from the `scrapy` package to parse the HTML content.
## ⚙️ Setup

Install the required packages for today's session:

```bash
pip install requests scrapy
```

Create a new Jupyter Notebook for this lab. Give it a meaningful name, such as `LSE_ME204_W02D01_lecture.ipynb`. Create a new Python cell and add the imports below:

```python
import requests              # for sending HTTP requests
from scrapy import Selector  # for parsing HTML content
```
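To check that everything is installed correctly, you can already fetch a page and wrap it in a `Selector`. Below is a minimal sketch, assuming the lab targets the English Wikipedia Main Page (the page that carries the "From today's featured article" box; adjust the URL if your class uses a different page):

```python
import requests
from scrapy import Selector

# Assumption: the lab works with the English Wikipedia Main Page,
# which is where the "From today's featured article" section lives.
url = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(url)
response.raise_for_status()          # stop early if the request failed

sel = Selector(text=response.text)   # wrap the HTML so we can query it with css()/xpath()
print(sel.css("title::text").get())  # sanity check: prints the page's <title> text
```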
## Lab Tasks

### Part 2: Understanding the HTML Structure (30 minutes)
🎯 ACTION POINTS

1. Using the `css()` method, as before, extract the `<div>` element that contains the text "From today's featured article". Call it `featured_article`.

   ⚠️ Don't use the `.get()` method yet. We will use it in the next step.

   🗣️ CLASSROOM-WIDE DISCUSSION: there are multiple ways to solve this. Did everyone solve it the same way?
2. Now extract and print just the text from the `featured_article` element.

3. Now extract all the `<a>` elements from the `featured_article` element. Call it `featured_article_links`.

   💡 Tip: You might want to use something like:

   ```python
   featured_article_links = featured_article.css(SOME_SELECTOR_HERE)
   ```

   The output is a list, but a list of what? A list of strings, of dictionaries, of objects? How can you find out?
4. CHALLENGE: Convert the `featured_article_links` object to a single data frame with two columns: `href` and `title`. (One route through all of these action points is sketched after this list.) Your data frame will look somewhat like this:

   | href                      | title              |
   |---------------------------|--------------------|
   | /wiki/Paint_It_Black      | Paint It Black     |
   | /wiki/The_Rolling_Stones  | The Rolling Stones |
   | /wiki/Single_(music)      | Single (music)     |
   | …                         | …                  |
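If you get stuck on the action points above, here is one possible route through them, reusing the `sel` object from the Setup sketch. The `div#mp-tfa` id is an assumption about the Main Page's markup (inspect the HTML to confirm, or select the `<div>` by its text with XPath instead), and `pandas` is only needed for the challenge:

```python
import pandas as pd

# One way to grab the featured-article <div>. The "mp-tfa" id is an
# assumption about the Main Page markup; other selectors also work.
featured_article = sel.css("div#mp-tfa")

# Just the text: ::text yields a list of fragments, so join them.
featured_article_text = " ".join(featured_article.css("::text").getall())
print(featured_article_text)

# All <a> elements inside the div. Note: this is a SelectorList of
# Selector objects, not strings -- try type(featured_article_links[0]).
featured_article_links = featured_article.css("a")

# CHALLENGE: pull the href and title attribute out of each link and
# build a data frame. Links without a title attribute come out as None.
links_df = pd.DataFrame(
    {
        "href": link.attrib.get("href"),
        "title": link.attrib.get("title"),
    }
    for link in featured_article_links
)
print(links_df.head())
```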
THINK ABOUT IT: Why do the `href` values (the links) start with `/wiki/`? What does this tell you about the structure of the HTML?
### Part 3: Website Crawl (30 minutes)

We will now take this knowledge beyond a single page and crawl data from related pages.
🎯 ACTION POINTS

(You might want to work in pairs for this part.)

1. Identify the first link in the `featured_article_links` data frame that relates to other Wikipedia articles. Store it in a variable called `first_link`.

2. Concatenate the `first_link` variable with the Wikipedia base URL to create a full URL. Store it in a variable called `first_link_url`.

3. Scrape the text contained in all `<h2>` headers of the page represented by the `first_link_url` variable. Ensure the output is a list of strings. (One possible sketch follows these action points.)
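A possible sketch for these steps, reusing `links_df` from the challenge above. The filter assumes that links to other Wikipedia articles are exactly those whose `href` starts with `/wiki/`; check that against what you found in the THINK ABOUT IT question:

```python
# Keep only links that look like Wikipedia article links (assumption:
# their href starts with "/wiki/"), then take the first one.
wiki_links = links_df[links_df["href"].str.startswith("/wiki/", na=False)]
first_link = wiki_links["href"].iloc[0]

# Concatenate with the Wikipedia base URL to get a full URL.
first_link_url = "https://en.wikipedia.org" + first_link

# Fetch that page and scrape the text of all its <h2> headers.
first_page = Selector(text=requests.get(first_link_url).text)
h2_headers = [
    " ".join(h2.css("::text").getall()).strip()
    for h2 in first_page.css("h2")
]
print(h2_headers)  # a list of strings
```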
## 💡 Bonus Task

🎯 ACTION POINTS
1. Repeat the steps in Part 3 for ALL the links contained in the `featured_article_links` data frame. Ignore all links that do not point to other Wikipedia articles.

2. Store the results in a data frame called `featured_article_h2`. The data frame must have the following columns:

   - `url`: the URL from which the `<h2>` header was extracted.
   - `page_title`: the title of the page from which the `<h2>` header was extracted.
   - `link`: the link from which the `<h2>` header was extracted.
   - `header`: the text contained in the `<h2>` header.

3. Save the `featured_article_h2` data frame as a CSV file called `featured_article_h2.csv`.
This exercise might be quite challenging: to do it efficiently, you will need loops and functions. One possible structure is sketched below.
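The sketch below works under the same assumptions as the Part 3 sketch (the `/wiki/` filter and the `<h2>` extraction): wrap the per-page logic in a function, loop over the links, and concatenate the results. Treat it as one possible shape, not the only solution.

```python
def scrape_h2_headers(link: str) -> pd.DataFrame:
    """Fetch one Wikipedia page and return its <h2> headers as rows."""
    url = "https://en.wikipedia.org" + link
    page = Selector(text=requests.get(url).text)
    page_title = page.css("title::text").get()
    return pd.DataFrame(
        {
            "url": url,
            "page_title": page_title,
            "link": link,
            "header": " ".join(h2.css("::text").getall()).strip(),
        }
        for h2 in page.css("h2")
    )

# Ignore links that do not point to other Wikipedia articles
# (assumption: article links are those whose href starts with "/wiki/").
article_links = links_df[links_df["href"].str.startswith("/wiki/", na=False)]

# Scrape every page and stack the per-page data frames into one.
featured_article_h2 = pd.concat(
    [scrape_h2_headers(link) for link in article_links["href"]],
    ignore_index=True,
)

featured_article_h2.to_csv("featured_article_h2.csv", index=False)
```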