💻 Lab 04 – Web Scraping with the rvest Package
Week 01 – Day 04 – Lab Roadmap (90 min)
LAB DIFFICULTY: LIKELY DIFFICULT
(It can be challenging to navigate the DOM and web documents in general. We hope that you can combine what you learned from the morning lecture with your previous knowledge of XML, HTML, and CSS to complete this lab.)
🥅 Objectives
- Apply the concepts learned in the Day 04 lecture (the DOM, HTML, CSS) on web scraping using the `rvest` package.
- Gain hands-on experience scraping data from Wikipedia.
- Practice extracting relevant information such as article titles, summaries, tables, and references.
- Learn how to navigate and parse HTML structure to locate specific elements.
- Store scraped data in a structured format for further analysis.
⚙️ Setup
Open RStudio and create a new R script. Save the script as `lab04.R`.

Load the necessary libraries for the lab:

```r
library(tidyverse)
library(rvest)
```

Or, if you prefer to load each package individually:

```r
library(dplyr)   # for data manipulation
library(stringr) # for string manipulation
library(tibble)  # for tibbles (data frames)
library(rvest)   # for web scraping
```
Lab Tasks
Part 2: Understanding HTML Structure (30 minutes)
🎯 ACTION POINTS
Using the `html_element()` function, extract the `<div>` element that is right below “From today’s featured article”. Call it `featured_article`.

🗣️ CLASSROOM-WIDE DISCUSSION: there are multiple ways to solve this. Did everyone solve it the same way?
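One possible solution sketch (the `div#mp-tfa` selector is an assumption about the Main Page's current markup — confirm it with your browser's developer tools, since Wikipedia's ids can change):

```r
library(rvest)

# Read the Wikipedia Main Page (requires an internet connection)
wiki_page <- read_html("https://en.wikipedia.org/wiki/Main_Page")

# "mp-tfa" is the id of the "From today's featured article" box
# at the time of writing -- verify it before relying on it
featured_article <- wiki_page |> html_element("div#mp-tfa")
```

An XPath expression that locates the `<div>` following the heading text would work just as well, which is why the discussion point above is worth having.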
Now extract and print just the text from the `featured_article` element.

💡 Tip: Take a look at the `html_text()` function.

Now, use the `html_elements()` function (notice the “s” at the end) to extract all the `<a>` elements from the `featured_article` element. Call it `featured_article_links`.

Convert the `featured_article_links` object to a single data frame with two columns: `href` and `title`. Your data frame will look somewhat like this:

| href                     | title              |
|--------------------------|--------------------|
| /wiki/Paint_It_Black     | Paint It Black     |
| /wiki/The_Rolling_Stones | The Rolling Stones |
| /wiki/Single_(music)     | Single (music)     |
| …                        | …                  |
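The three steps above can be sketched as follows (assuming `featured_article` from the previous task; reading the `title` attribute is one way to fill the second column — taking the link text with `html_text()` is another):

```r
# Print just the text inside the featured-article box
featured_article |> html_text() |> cat()

# All <a> elements inside the box
featured_article_links <- featured_article |> html_elements("a")

# Convert to a data frame of href/title attribute pairs
# (overwriting the node set, since Part 3 refers to
# `featured_article_links` as a data frame)
featured_article_links <- tibble::tibble(
  href  = html_attr(featured_article_links, "href"),
  title = html_attr(featured_article_links, "title")
)
```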
THINK ABOUT IT: Why do the `href` values (links) start with `/wiki/`? What does this tell you about the structure of the HTML? (This will be relevant for Part 3.)
Part 3: Website crawl (30 minutes)
We will take this knowledge beyond a single page and crawl data from related pages.
🎯 ACTION POINTS
(You might want to work in pairs for this part.)

Identify the first link in the `featured_article_links` data frame that relates to other Wikipedia articles. Store it in a variable called `first_link`.

Use the `paste0()` function to create a full URL from the `first_link` variable and store it in a variable called `first_link_url`.

Scrape the text contained in all `<h2>` headers of the page represented by the `first_link_url` variable. Ensure the output is a character vector.

💡 Tip: You will need to combine `html_nodes()` and `html_text()` for this.
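A sketch of these three steps, assuming the `featured_article_links` data frame from Part 2 (the `startsWith()` filter is a simple heuristic for "links to other Wikipedia articles" — it also matches namespaced pages such as `/wiki/Special:...`, which you may want to exclude):

```r
# First link that points to another Wikipedia article
hrefs <- featured_article_links$href
first_link <- hrefs[startsWith(hrefs, "/wiki/")][1]

# Relative links need the site's base URL prepended
first_link_url <- paste0("https://en.wikipedia.org", first_link)

# Character vector with the text of every <h2> on that page
h2_headers <- read_html(first_link_url) |>
  html_nodes("h2") |>
  html_text()
```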
💡 Bonus Task
🎯 ACTION POINTS
Repeat the steps in Part 3 for ALL the links contained in the `featured_article_links` data frame. Ignore all links that do not point to other Wikipedia articles.

Store the results in a data frame called `featured_article_h2`. The data frame must have the following columns:

- `url`: the URL from which the `<h2>` header was extracted.
- `page_title`: the title of the page from which the `<h2>` header was extracted.
- `link`: the link from which the `<h2>` header was extracted.
- `header`: the text contained in the `<h2>` header.

Save the `featured_article_h2` data frame as a CSV file called `featured_article_h2.csv`.
This exercise can be quite challenging: doing it efficiently requires knowledge of loops and functions.
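One way to structure the bonus (a sketch, not the reference solution; it reads `url` as the full URL, `link` as the relative `/wiki/...` path, and reuses the `title` column from Part 2 as `page_title` — scraping each page's own `<title>` would be an alternative):

```r
library(dplyr)
library(purrr)
library(rvest)

# Scrape one page's <h2> headers into a small data frame
scrape_h2 <- function(link, page_title) {
  url <- paste0("https://en.wikipedia.org", link)
  tibble::tibble(
    url        = url,
    page_title = page_title,
    link       = link,
    header     = read_html(url) |> html_nodes("h2") |> html_text()
  )
}

# Keep only links to other Wikipedia articles, then scrape each one
wiki_links <- featured_article_links |>
  filter(startsWith(href, "/wiki/"))

featured_article_h2 <- map2_dfr(wiki_links$href, wiki_links$title, scrape_h2)

readr::write_csv(featured_article_h2, "featured_article_h2.csv")
```

Wrapping the per-page work in a function keeps the iteration (here `purrr::map2_dfr()`, but a plain `for` loop works too) short and easy to debug.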