✅ (Solutions) Lab 04

Published: 13 July 2023

Solutions to the Lab 04 exercises.

(Ideally, you would have solved these exercises in a Markdown file.)

⚙️ Setup

The packages you’ll need:

library(tidyverse)
library(rvest)
Part 1: Navigating the DOM with rvest


# Reading in the HTML
url <- "https://en.wikipedia.org/"
wiki <- read_html(url)
# This is a simple way to get to the logo image
logo <- wiki %>% html_element("a.mw-logo img")

So when you run:

html_attrs(logo)

you should see something like this:

        class                                  src                                  alt 
"mw-logo-icon" "/static/images/icons/wikipedia.png"                                   "" 
    aria-hidden                               height                                width 
        "true"                                 "50"                                 "50" 
Part 2: Extracting information from the featured article


On the day this solution was written, the featured article was a summary of the Wikipedia entry for the British Egyptologist Margaret Murray, as can be seen in the screenshot below:

Step 1 is pretty straightforward: select the element whose ID is mp-tfa.

featured_article <- wiki %>% html_element("#mp-tfa")

producing:

featured_article
{html_node}
<div id="mp-tfa" class="mp-contains-float">
[1] <div id="mp-tfa-img" style="float: left; margin: 0.5em 0.9em 0.4em 0em;">\n<div class="thumbinner mp-thumb" ...
[2] <p><b><a href="/wiki/Margaret_Murray" title="Margaret Murray">Margaret Murray</a></b> (13 July 1863 – 13 No ...
[3] <div class="tfa-recent" style="text-align: right;">\nRecently featured: <style data-mw-deduplicate="Templat ...
[4] <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374">\n
[5] <div class="hlist tfa-footer noprint" style="text-align:right;">\n<ul>\n<li><b><a href="/wiki/Wikipedia:Tod ...
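The selectors used so far ("a.mw-logo img" and "#mp-tfa") are ordinary CSS selectors. Here is a minimal sketch of how they behave, run on a tiny stand-in page rather than the live site. The markup below is hypothetical and only modelled on the output shown above.

```r
library(rvest)

# Hypothetical stand-in markup, loosely modelled on the Wikipedia main page
page <- minimal_html('
  <a class="mw-logo" href="/wiki/Main_Page"><img class="mw-logo-icon" src="/static/images/icons/wikipedia.png"></a>
  <div id="mp-tfa" class="mp-contains-float"><p>Featured text</p></div>
')

# "#mp-tfa" selects the element with that id
tfa_class <- page %>% html_element("#mp-tfa") %>% html_attr("class")

# "a.mw-logo img" selects an <img> nested inside an <a> with class mw-logo
logo_src <- page %>% html_element("a.mw-logo img") %>% html_attr("src")

# "div.mp-contains-float p" selects a <p> inside a <div> with that class
tfa_text <- page %>% html_element("div.mp-contains-float p") %>% html_text()

tfa_class
logo_src
tfa_text
```

Trying selectors on a small snippet like this is a quick way to check them before running them against the real page.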

Step 2: Getting the text is easy, and we can pipe it to cat() or print():


featured_article %>% html_element("p") %>% html_text() %>% print()

You might then ask: why do I only see the beginning of the article?

[1] "Margaret Murray (13 July 1863 – 13 November 1963) was an Anglo-Indian Egyptologist, archaeologist, historian,"

This is only a display issue. The text is quite long, so the output shown is cut off, but the full string is still stored. To see the whole thing, you can write the text to a file and open it in a text editor:

featured_article %>% 
    html_element("p") %>% 
    html_text() %>% 
    writeLines("featured_article.txt", useBytes = TRUE)

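Note that I used useBytes = TRUE so that the string's bytes are written as-is instead of being re-encoded to your system's native encoding. Since rvest returns text in UTF-8, the file ends up UTF-8 encoded.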
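If you want to convince yourself that nothing was lost, here is a small sketch along the same lines. It uses a made-up stand-in string (long_text) and a temporary file instead of the real paragraph, so it runs without touching the website.

```r
# Hypothetical stand-in for the long paragraph; the real text comes from html_text()
long_text <- paste(
  rep("Margaret Murray (13 July 1863 \u2013 13 November 1963) was an Egyptologist.", 10),
  collapse = " "
)

# The whole string is stored, whatever the console chooses to display
nchar(long_text)

# Wrap to at most 80 characters per line so it is readable in the console
cat(strwrap(long_text, width = 80), sep = "\n")

# Write the bytes as-is, then read them back, declaring the encoding as UTF-8
out_file <- tempfile(fileext = ".txt")
writeLines(enc2utf8(long_text), out_file, useBytes = TRUE)
read_back <- readLines(out_file, encoding = "UTF-8")

# TRUE if the text survived the round trip unchanged
identical(read_back, long_text)
```

strwrap() only changes how the text is displayed; long_text itself is left untouched.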

Step 3: We make use of the html_elements() function, which returns every match rather than only the first one.

featured_article_links <- 
    featured_article %>% 
    html_element("p") %>%
    html_elements("a")

which produces:

{xml_nodeset (22)}
 [1] <a href="/wiki/Margaret_Murray" title="Margaret Murray">Margaret Murray</a>
 [2] <a href="/wiki/Anglo-Indian_people" title="Anglo-Indian people">Anglo-Indian</a>
 [3] <a href="/wiki/Archaeology" title="Archaeology">archaeologist</a>
 [4] <a href="/wiki/Folklore_studies" title="Folklore studies">folklorist</a>
 [5] <a href="/wiki/University_College_London" title="University College London">University College London</a>
 [6] <a href="/wiki/The_Folklore_Society" title="The Folklore Society">the Folklore Society</a>
 [7] <a href="/wiki/Flinders_Petrie" title="Flinders Petrie">Flinders Petrie</a>
 [8] <a href="/wiki/Egyptology" title="Egyptology">Egyptology</a>
 [9] <a href="/wiki/Archaeological_excavation" title="Archaeological excavation">excavations</a>
[10] <a href="/wiki/Osireion" title="Osireion">Osireion</a>
[11] <a href="/wiki/Saqqara" title="Saqqara">Saqqara</a>
[12] <a href="/wiki/British_Museum" title="British Museum">British Museum</a>
[13] <a href="/wiki/Manchester_Museum" title="Manchester Museum">Manchester Museum</a>
[14] <a href="/wiki/Mummy" title="Mummy">mummies</a>
[15] <a href="/wiki/Tomb_of_Two_Brothers" title="Tomb of Two Brothers">Tomb of the Two Brothers</a>
[16] <a href="/wiki/First-wave_feminism" title="First-wave feminism">first-wave feminist</a>
[17] <a href="/wiki/Women%27s_Social_and_Political_Union" title="Women's Social and Political Union">Women's So ...
[18] <a href="/wiki/Witch-cult_hypothesis" title="Witch-cult hypothesis">witch-cult hypothesis</a>
[19] <a href="/wiki/Witch_trials_in_the_early_modern_period" title="Witch trials in the early modern period">wi ...
[20] <a href="/wiki/Horned_God" title="Horned God">Horned God</a>
...
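The difference between html_element() and html_elements() is easiest to see on a small example. The sketch below uses a hypothetical three-link paragraph in place of the real featured article.

```r
library(rvest)

# Hypothetical stand-in paragraph with three links
page <- minimal_html('<p>
  <a href="/wiki/Egyptology">Egyptology</a>,
  <a href="/wiki/Saqqara">Saqqara</a> and
  <a href="/wiki/Wicca">Wicca</a>
</p>')

# html_element() keeps only the first match; html_elements() keeps all of them
first_link <- page %>% html_element("a")
all_links  <- page %>% html_elements("a")

html_attr(first_link, "href")
html_attr(all_links, "href")
length(all_links)

# When nothing matches, html_element() gives a missing node (its text is NA),
# while html_elements() gives an empty node set
no_table_text  <- page %>% html_element("table") %>% html_text()
no_table_count <- page %>% html_elements("table") %>% length()
```

This is why Step 3 uses html_element("p") to pick a single paragraph and then html_elements("a") to collect every link inside it.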

Step 4: Create a data frame with either the data.frame() function or the tibble() function.

featured_article_df <- 
    tibble(href = html_attr(featured_article_links, "href"),
           title = html_text(featured_article_links))

featured_article_df

produces a data frame of the form:

href                                           title
/wiki/Margaret_Murray                          Margaret Murray
/wiki/Anglo-Indian_people                      Anglo-Indian
/wiki/Archaeology                              archaeologist
/wiki/Folklore_studies                         folklorist
/wiki/University_College_London                University College London
/wiki/The_Folklore_Society                     the Folklore Society
/wiki/Flinders_Petrie                          Flinders Petrie
/wiki/Egyptology                               Egyptology
/wiki/Archaeological_excavation                excavations
/wiki/Osireion                                 Osireion
/wiki/Saqqara                                  Saqqara
/wiki/British_Museum                           British Museum
/wiki/Manchester_Museum                        Manchester Museum
/wiki/Mummy                                    mummies
/wiki/Tomb_of_Two_Brothers                     Tomb of the Two Brothers
/wiki/First-wave_feminism                      first-wave feminist
/wiki/Women%27s_Social_and_Political_Union     Women’s Social and Political Union
/wiki/Witch-cult_hypothesis                    witch-cult hypothesis
/wiki/Witch_trials_in_the_early_modern_period  witch trials of early modern Christendom
/wiki/Horned_God                               Horned God
/wiki/Wicca                                    Wicca
/wiki/Margaret_Murray                          Full article…
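Note that the title column above holds each link's visible text (from html_text()), not its title attribute. If you want both, plus a full URL, one possible variation is sketched below. The two-link snippet is a stand-in copied from the node set printed earlier, and the column names text and url are my own choice.

```r
library(rvest)
library(tibble)

# Stand-in for featured_article_links: two links copied from the output above
links <- minimal_html('<p>
  <a href="/wiki/Anglo-Indian_people" title="Anglo-Indian people">Anglo-Indian</a>
  <a href="/wiki/Mummy" title="Mummy">mummies</a>
</p>') %>%
  html_elements("a")

links_df <- tibble(
  href  = html_attr(links, "href"),    # relative link
  text  = html_text(links),            # what the reader sees
  title = html_attr(links, "title"),   # the title attribute
  # resolve the relative link against the site root
  url   = xml2::url_absolute(html_attr(links, "href"), "https://en.wikipedia.org/")
)

links_df
```

Having the url column ready also saves the paste0() step when you later want to visit a link.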
Part 3: Website crawl
# Step 1: Take the first link from the `featured_article_df`
first_link <- featured_article_df$href[1]
# Step 2: Use the `paste0()` function to create a full URL
first_link_url <- paste0("https://en.wikipedia.org", first_link)

produces:

[1] "https://en.wikipedia.org/wiki/Margaret_Murray"
# Step 3: Read the new page then use html_elements to get `<h2>` headers
first_link_html <- read_html(first_link_url)
h2_headers <- first_link_html %>%
  html_elements("h2") %>%
  html_text()

h2_headers

to obtain the solution:

 [1] "Contents"                       "Early life"                     "Later life"
 [4] "Murray's witch-cult hypotheses" "Personal life"                  "Legacy"
 [7] "Bibliography"                   "See also"                       "References"
[10] "External links"
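If you wanted to repeat Step 3 for several links rather than just the first, one option is to wrap the header extraction in a small function and map it over the pages. The sketch below is only an illustration: get_h2_headers() is a hypothetical helper, and the two pages are made-up stand-ins for the results of read_html().

```r
library(rvest)
library(purrr)

# Hypothetical helper: the text of every <h2> in an already-parsed page
get_h2_headers <- function(page) {
  page %>%
    html_elements("h2") %>%
    html_text2()
}

# Made-up stand-in pages, in place of read_html(url) for each link
pages <- list(
  first  = minimal_html("<h2>Early life</h2><p>...</p><h2>Legacy</h2>"),
  second = minimal_html("<h2>History</h2>")
)

# One character vector of headers per page, keeping the list names
headers_by_page <- map(pages, get_h2_headers)
headers_by_page
```

For live pages, you would build each URL with paste0() as in Step 2, call read_html() on it, and pause between requests (for example with Sys.sleep()) so that you do not hammer the server.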