πŸ’» Lab 05 – Web Scraping Practice I

Week 02 – Day 01 – Lab Roadmap (90 min)

Published: 15 July 2024

πŸ—’οΈ The Brief

We will practice collecting specific information from a website programmatically, a process known as web scraping. We will use Python's requests package to send an HTTP request to a website and the scrapy package to parse the HTML content it returns.
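
Once the packages from the Setup section below are installed, the whole workflow fits in a few lines. Here is a minimal sketch (each step is unpacked in the lab tasks):

    import requests
    from scrapy import Selector

    html = requests.get("https://en.wikipedia.org/").text   # 1. fetch the page
    sel = Selector(text=html)                                # 2. parse the HTML
    print(sel.css("title::text").get())                      # 3. extract something, e.g. the <title> text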

βš™οΈ Setup

  1. Install the required packages for today’s session:

    pip install requests scrapy
  2. Create a new Jupyter Notebook for this lab. Give it a meaningful name, such as LSE_ME204_W02D01_lab.ipynb.

  3. Create a new Python cell and add the imports below:

    import requests                # for sending HTTP requests
    from scrapy import Selector    # for parsing HTML content

πŸ“‹ Lab Tasks

Part 1: Navigating the structure of a webpage (30 minutes)

🎯 ACTION POINTS

  1. Open the homepage of the English version of Wikipedia in your web browser and inspect the HTML structure of the page.

  2. Identify the path in the DOM tree to the Wikipedia logo and write it down.

    • On most browsers, once you have inspected the element, you can right-click on the HTML code and select β€œCopy” > β€œCopy Selector” to get a CSS Selectyor code you could use in your Python script.
  3. Send the request to the Wikipedia homepage using the requests package and store the HTML content in a variable called html.

    url = "https://en.wikipedia.org/"
    html = requests.get(url).text   # send a GET request; .text gives the HTML as a string (Selector expects str, not bytes)
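
     Optionally, it is good practice to check that the request succeeded before parsing. A minimal sketch:

    response = requests.get(url)
    response.raise_for_status()     # raises requests.HTTPError for 4xx/5xx responses
    html = response.text            # the same HTML string, from the checked response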
  4. Use the Selector class from the scrapy package to parse the HTML content. Save it in a variable called sel.

    sel = Selector(text=html)

     This object stores the full HTML content of the Wikipedia homepage as it was at the moment you sent the request.

  5. Use the css() method from the Selector object to extract the Wikipedia logo from the HTML, using the path you identified in step 2. Save it in a variable called url_logo. Here is an example of how you can do this:

    url_logo = sel.css("your_selector_here")
  6. Print all the attributes of the Wikipedia logo using the attrib attribute of the Selector object.

    url_logo.attrib

    What is the data structure of the output? Is it a list, a dictionary, a string, or something else?
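
     If you are unsure, a quick way to find out is to inspect the type directly:

    # Inspect the type of the output to answer the question above
    print(type(url_logo.attrib))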

Now, it’s your turn! The following exercises will increase in difficulty. If you get stuck, ask your instructor for help.

Part 2: Understanding the HTML Structure (30 minutes)

🎯 ACTION POINTS

  1. Using the css() method, as before, extract the <div> element that contains the text β€œFrom today’s featured article”. Call it featured_article.

    ⚠️ Don’t use the .get() method yet. We will use it in the next step.
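
     If you get stuck, here is one possible approach. It assumes the featured-article box still uses the container id mp-tfa, which may change as Wikipedia evolves; searching for the heading text with XPath is another route:

    # One possible approach (assumes the container id is still "mp-tfa"):
    featured_article = sel.css("div#mp-tfa")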

  2. πŸ—£οΈ CLASSROOM-WIDE DISCUSSION: there are multiple ways to solve this. Did everyone solve it the same way?

  3. Now extract and print just the text from the featured_article element.
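
     A minimal sketch, using the ::text pseudo-element to select text nodes:

    # "::text" selects the text nodes; .getall() returns them as a list of strings
    text_pieces = featured_article.css("::text").getall()
    print("".join(text_pieces).strip())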

  4. Now extract all the <a> elements from the featured_article element. Call it featured_article_links.

    πŸ’‘ Tip: You might want to use something like:

    featured_article_links = featured_article.css(SOME_SELECTOR_HERE)

    The output of this object is a list. But a list of what? Strings, dictionaries, objects? How can you find out?
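
     One way to find out is to inspect the container and one of its elements directly:

    print(type(featured_article_links))      # the type of the container
    print(type(featured_article_links[0]))   # the type of a single element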

  5. CHALLENGE: Convert the featured_article_links object to a single data frame with two columns: href and title. Your data frame will look something like this:

    href                      title
    /wiki/Paint_It_Black      Paint It Black
    /wiki/The_Rolling_Stones  The Rolling Stones
    /wiki/Single_(music)      Single (music)
    …                         …
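
     If you need a nudge, here is one possible sketch. It uses pandas, which is not part of today’s setup, so you may need to pip install pandas first:

    import pandas as pd   # assumed extra dependency; install it if needed

    # Each link's attributes live in its .attrib dictionary
    rows = [{"href": link.attrib.get("href"), "title": link.attrib.get("title")}
            for link in featured_article_links]
    df = pd.DataFrame(rows)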

πŸ’­ THINK ABOUT IT: Why do the href (links) start with /wiki/? What does this tell you about the structure of the HTML?

Part 3: Website crawl (30 minutes)

We will take this knowledge beyond a single page and crawl data from related pages.

🎯 ACTION POINTS

(You might want to work in pairs for this part)

  1. Identify the first link in the featured_article_links data frame that relates to other Wikipedia articles. Store it in a variable called first_link.

  2. Concatenate the first_link variable with the Wikipedia URL to create a full URL. Store it in a variable called first_link_url.
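
     A minimal sketch, assuming first_link holds a relative path such as /wiki/...:

    base_url = "https://en.wikipedia.org"
    first_link_url = base_url + first_link   # simple string concatenation
    # A more robust alternative from the standard library:
    # from urllib.parse import urljoin
    # first_link_url = urljoin(base_url, first_link)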

  3. Scrape the text contained in all <h2> headers of the page represented by the first_link_url variable. Ensure the output is a list of strings.
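
     One possible sketch; depending on the page’s markup, heading text may sit inside child elements of <h2>, so joining all text nodes inside each header is safer:

    page_sel = Selector(text=requests.get(first_link_url).text)
    # Join the text nodes inside each <h2> and strip surrounding whitespace
    h2_texts = ["".join(h2.css("::text").getall()).strip()
                for h2 in page_sel.css("h2")]
    print(h2_texts)   # a list of strings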

🏑 Bonus Task

🎯 ACTION POINTS

  1. Repeat the steps in Part 3 for ALL the links contained in the featured_article_links data frame. Ignore all links that do not point to other Wikipedia articles.

  2. Store the results in a data frame called featured_article_h2. The data frame must have the following columns:

    • url: the URL from which the <h2> header was extracted.
    • page_title: the title of the page from which the <h2> header was extracted.
    • link: the link from which the <h2> header was extracted.
    • header: the text contained in the <h2> header.
  3. Save the featured_article_h2 data frame as a CSV file called featured_article_h2.csv.

This exercise can be quite challenging. To do it efficiently, you will need loops and functions.
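
If you want a starting point, here is one possible skeleton. The helper name scrape_h2 and the choice of title::text for the page title are illustrative assumptions, not the only way to structure this:

    import pandas as pd   # assumed available (see the challenge in Part 2)

    def scrape_h2(href):
        """Hypothetical helper: one row dict per <h2> on the page at href."""
        url = "https://en.wikipedia.org" + href
        page = Selector(text=requests.get(url).text)
        page_title = page.css("title::text").get()
        return [{"url": url,
                 "page_title": page_title,
                 "link": href,
                 "header": "".join(h2.css("::text").getall()).strip()}
                for h2 in page.css("h2")]

    rows = []
    for link in featured_article_links:
        href = link.attrib.get("href", "")
        if href.startswith("/wiki/"):   # ignore links that are not Wikipedia articles
            rows.extend(scrape_h2(href))

    featured_article_h2 = pd.DataFrame(rows)
    featured_article_h2.to_csv("featured_article_h2.csv", index=False)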