💻 Lab 04 – Web Scraping with the rvest Package

Week 01 – Day 04 – Lab Roadmap (90 min)

Published: 13 July 2023

📋 LAB DIFFICULTY: 😰 LIKELY DIFFICULT

(Navigating the DOM and web documents in general can be challenging. We hope you can combine what you learned in the morning lecture with your previous knowledge of XML, HTML, and CSS to complete this lab.)

🥅 Objectives

  • Apply the concepts from the Day 04 lecture (the DOM, HTML, CSS) to web scraping with the rvest package.
  • Gain hands-on experience scraping data from Wikipedia.
  • Practice extracting relevant information such as article titles, summaries, tables, and references.
  • Learn how to navigate and parse HTML structure to locate specific elements.
  • Store scraped data in a structured format for further analysis.

⚙️ Setup

  1. Open RStudio and create a new R script. Save the script as lab04.R.

  2. Load the necessary libraries for the lab:

    library(tidyverse)
    library(rvest)

    Or, if you prefer to load each package individually:

    library(dplyr)     # for data manipulation
    library(stringr)   # for string manipulation
    library(tibble)    # for tibbles (data frames)
    library(rvest)     # for web scraping
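
If the packages are not yet installed on your machine, install them once before loading them; a quick way, using the standard install.packages() function, is:

    # Run once if tidyverse and/or rvest are missing.
    install.packages(c("tidyverse", "rvest"))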

📋 Lab Tasks

Part 1: Navigating the DOM with rvest (30 min)

🎯 ACTION POINTS

  1. Open the homepage of the English version of Wikipedia in your web browser and inspect the HTML structure of the page.

  2. Identify the path in the DOM tree to the Wikipedia logo and write it down.

  3. Use the read_html() function from the rvest package to read the Wikipedia website HTML into R:

    url <- "https://en.wikipedia.org/"
    html <- read_html(url)

  4. Use the html_element() function from the rvest package to select the Wikipedia logo (the <img> element) from the HTML, using the path you identified in step 2. Save it in a variable called logo. (One possible approach is sketched after this list.)

  5. Run the code below to extract all the attributes of the Wikipedia logo:

    html_attrs(logo)

  6. Answer: What is the data type returned by the html_attrs() function? Where did you see this data type before?
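
If you get stuck on steps 3–5, here is a minimal sketch of one possible approach. The "img" selector is an assumption: it simply grabs the first <img> element on the page, which may or may not be the logo, so replace it with the path you identified in step 2.

    library(rvest)

    url  <- "https://en.wikipedia.org/"
    html <- read_html(url)

    # "img" matches the first <img> on the page; swap in the more specific
    # path you wrote down in step 2 if this does not return the logo.
    logo <- html_element(html, "img")

    # Returns every attribute of the element (src, alt, and so on).
    html_attrs(logo)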

Now, it's your turn! The following exercises will increase in difficulty. If you get stuck, ask your instructor for help.

Part 2: Understanding HTML Structure (30 minutes)

🎯 ACTION POINTS

  1. Using the html_element() function, extract the <div> element that is right below "From today's featured article". Call it featured_article.

  2. 🗣️ CLASSROOM-WIDE DISCUSSION: There are multiple ways to solve this. Did everyone solve it the same way?

  3. Now extract and print just the text from the featured_article element.

    💡 Tip: Take a look at the html_text() function.

  4. Now, use the html_elements() (notice the "s" at the end) function to extract all the <a> elements from the featured_article element. Call it featured_article_links.

  5. Convert the featured_article_links object to a single data frame with two columns: href and title (one possible approach for this part is sketched below). Your data frame will look somewhat like this:

     href                      title
     /wiki/Paint_It_Black      Paint It Black
     /wiki/The_Rolling_Stones  The Rolling Stones
     /wiki/Single_(music)      Single (music)
     …                         …

💭 THINK ABOUT IT: Why do the href values (links) start with /wiki/? What does this tell you about the structure of the HTML? (This will be relevant for Part 3.)
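
To check your work on this part, here is a minimal sketch of one possible approach. The "#mp-tfa" selector is an assumption about the id Wikipedia currently gives the featured-article block; verify it with your browser's inspector and compare it with your own answer to step 1.

    library(rvest)
    library(tibble)

    # Assumes `html` still holds the Main Page read in Part 1.
    featured_article <- html_element(html, "#mp-tfa")   # id is an assumption; check the inspector

    # Just the visible text of the block.
    html_text(featured_article)

    # All <a> elements inside the block.
    featured_article_links <- html_elements(featured_article, "a")

    # One way to turn the links into a two-column data frame.
    featured_article_links <- tibble(
      href  = html_attr(featured_article_links, "href"),
      title = html_attr(featured_article_links, "title")
    )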

Part 3: Website Crawl (30 minutes)

We will take this knowledge beyond a single page and crawl data from related pages.

🎯 ACTION POINTS

(You might want to work in pairs for this part)

  1. Identify the first link in the featured_article_links data frame that points to another Wikipedia article. Store it in a variable called first_link.

  2. Use the paste0() function to create a full URL from the first_link variable and store it in a variable called first_link_url.

  3. Scrape the text contained in all <h2> headers of the page represented by the first_link_url variable. Ensure the output is a character vector.

    💡 Tip: You will need to combine html_elements() (or its older alias html_nodes()) and html_text() for this; a sketch of one possible approach follows.
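
Here is a minimal sketch of one way to approach this part. It assumes featured_article_links is the data frame built in Part 2 and that its first row already points to a Wikipedia article; adjust the row index if it does not.

    library(rvest)

    # A relative link such as "/wiki/Paint_It_Black".
    first_link <- featured_article_links$href[1]

    # Relative links need the site's base URL prepended.
    first_link_url <- paste0("https://en.wikipedia.org", first_link)

    # Read the linked page and pull the text of every <h2> header.
    first_page <- read_html(first_link_url)
    h2_headers <- html_text(html_elements(first_page, "h2"))
    h2_headers   # a character vector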

🏑 Bonus Task

🎯 ACTION POINTS

  1. Repeat the steps in Part 3 for ALL the links contained in the featured_article_links data frame. Ignore all links that do not point to other Wikipedia articles.

  2. Store the results in a data frame called featured_article_h2. The data frame must have the following columns:

    • url: the URL from which the <h2> header was extracted.
    • page_title: the title of the page from which the <h2> header was extracted.
    • link: the link from which the <h2> header was extracted.
    • header: the text contained in the <h2> header.

  3. Save the featured_article_h2 data frame as a CSV file called featured_article_h2.csv.

This exercise might be quite challenging: to complete it efficiently, you will need loops (or functional iteration) and your own functions.
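
If you want a starting point, here is a rough skeleton of one possible approach. It assumes featured_article_links is the data frame from Part 2 and that article links are the hrefs starting with /wiki/, and it uses map_dfr() from purrr and write_csv() from readr (both loaded with the tidyverse); a plain for loop works just as well.

    library(tidyverse)
    library(rvest)

    # Keep only links that point to other Wikipedia articles.
    # (You may also want to drop special pages such as /wiki/File:... or /wiki/Wikipedia:...)
    article_links <- filter(featured_article_links, str_starts(href, "/wiki/"))

    # For each link: read the page, then return one row per <h2> header.
    featured_article_h2 <- map_dfr(article_links$href, function(link) {
      url  <- paste0("https://en.wikipedia.org", link)
      page <- read_html(url)
      Sys.sleep(1)   # be polite: pause between requests
      tibble(
        url        = url,
        page_title = html_text(html_element(page, "title")),
        link       = link,
        header     = html_text(html_elements(page, "h2"))
      )
    })

    write_csv(featured_article_h2, "featured_article_h2.csv")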