LSE Data Science Institute | DS105W (2023/24) | Week 05

# 🗓️ Week 05: More Web scraping: CSS Selectors & XPaths

Theme: Collecting Data

**DATE:** 15 February 2024

**AUTHOR:** Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

-----

**📚 LEARNING OBJECTIVES:**

- Discover the basics of XPath and how it compares to CSS selectors
- How to orient the web scraping process around **containers** instead of individual elements
- How to write **custom functions** to extract data from a website

**PRE-REQUISITES:**

To understand this notebook better, you should first revisit the following notebooks: 

- 📚 Week 04 Lecture Notebook
- 🛣️ Week 05 Lab Roadmap
- ✅ Week 05 Lab Solutions

**USEFUL LINKS:**

- <span style="font-size:1.5em;">Strong recommendation: 📘 [HTML & CSS](https://wtf.tw/ref/duckett.pdf) book</span>
- [W3 Schools - CSS Selectors](https://www.w3schools.com/CSS/css_selectors.asp)
- [Complete list of CSS selectors](https://www.w3.org/TR/selectors-3/#selectors)
- [`scrapy` extensions to CSS Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors)

<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;color:#c89020">CSS selectors cheatsheet ⭐</h2></summary> 

<div style="margin-top:1.5em;width:80%;font-size:0.9em;">

| Selector              | Example                  | Use Case Scenario                                                                                                                              |
|-----------------------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| *                     | *                        | Selects every element on the page. Rarely useful on its own, but still good to know. |
| .class                | .card-title              | The simplest CSS selector targets the class attribute. If only your target element uses that class, this might be sufficient. |
| .class1.class2        | .card-heading.card-title | Some elements have a class attribute like `class="card-heading card-title"`. The space means the element has several classes. To match an element that has both, chain the class names with dots and no space; a space inside a CSS selector means "descendant of" instead. |
| #id                   | #card-description        | What if the class is used in too many elements, or the element doesn't have a class? Picking the ID can be the next best thing. The only problem is that an ID is unique to one element, so it won't cut it if you want to scrape several elements at once. |
| element               | h4                       | To pick an element, all we need to give our parser is the HTML tag name. |
| element.class         | h4.card-title            | This is the most common selector we'll be using in our projects. |
| parentElement > childElement | div > h4          | We can tell our scraper to extract an element inside another. In this example, we want it to find the h4 element whose parent element is a div. |
| parentElement.class > childElement | div.card-body > h4 | We can combine the previous logic to specify a parent element and extract a specific child element. This is super useful when the data we want doesn't have any class or ID but is inside a parent element with a unique class/ID. |
| [attribute]           | [href]                   | Another great way to target an element with no clear class to choose from. Your scraper will extract all elements containing the specific attribute. In this case, it will mostly pick up `<a>` tags, the most common element with an `href` attribute. |
| [attribute=value]     | [target=_blank]          | We can tell our scraper to extract only the elements with a specific value inside its attribute. |
| element[attribute=value] | a[rel=next]          | This is the selector we used to add a crawling feature to our Scrapy script: `next_page = response.css('a[rel=next]').attrib['href']`. The target website was using the same class for all its pagination links, so we had to come up with a different solution. |
| [attribute~=value]    | [title~=rating]          | This selector will pick all the elements containing the word 'rating' inside their title attribute. |

</div>

Source: [The Only CSS Selectors Cheat Sheet You Need for Web Scraping](https://www.scraperapi.com/blog/css-selectors-cheat-sheet/#CSS-Selectors-Cheat-Sheet)

</details>

## ⚙️ Setup

In [None]:
import requests
from scrapy import Selector
import pandas as pd
from tqdm.notebook import tqdm

If the cell above raises a `ModuleNotFoundError`, open the terminal and run `pip install tqdm`.

**OBJECTIVE:**

In this tutorial, we will revisit the same exercise from the W05 lab, only this time we will solve it with the help of XPath instead of CSS selectors.

# 1. The (X)Path to success

We have seen how we can use CSS selectors to find elements in the HTML code of a webpage. However, there is another way to do this: **XPath**. 

<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

📕 **What is XPath?** 

XPath is a query language for selecting nodes from an XML document (and, by extension, from HTML).

XPath is a bit more powerful than CSS selectors, but it is also a bit more complicated. It allows us more flexibility when it comes to finding the children or parents of certain elements. What's more, if you find yourself able to reach a parent element with CSS but unable to reach its children because they have some obscure class name, an XPath expression can be 'chained' after the CSS selector. 

</div>

## 1.1 Loading the page and getting the HTML

First, we send a GET request to the page to obtain a response object, with which we can get the HTML code of the page:

In [None]:
url = 'https://socialdatascience.network/index.html#schedule'

# Load the first page
response = requests.get(url)

<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

💡 **PRO-TIP:** 

In reality, `requests.get(url)` is shorthand for `requests.request('GET', url)`.

Whenever we send a request to a server, we need to specify the type of request we are sending. The most common types are GET, which intuitively means that we are asking the server to send us some data, and POST, which means that we are sending some data to the server.


</div>

Right after collecting the page, it's good practice to check whether the request was successful. We can do this by checking the status code of the response:

In [None]:
# Check the status code: 200 means OK, anything else means something went wrong
if response.status_code != 200:
    print('Something went wrong, status code:', response.status_code)
else:
    print('Everything is OK, status code:', response.status_code)

# Print the first 100 characters of the HTML code
print(response.text[:100])

<div style="width:80%;font-size:0.85em;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

**Did you get a 403 error?**

If, instead, you got a 403 error at this stage, it's probably because the server is refusing to respond to our request: it thinks we are a 'robot', not a user accessing the page from a web browser. They are not entirely wrong...

When we send a request, the server looks for a header (a piece of metadata) in our request called `User-Agent`. Browsers send a `User-Agent` header that identifies the browser and the operating system being used. For example:

- Chrome on Windows 10: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36`

- Firefox on Windows 10: `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0`

- Chrome on Mac OS: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11_6_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36`

A server might be configured to block the request when it doesn't come from a standard browser. We can fix this by amending our request, specifying a `User-Agent` header, and then passing it to the `requests.get()` function. So far though, it looks like we are getting the HTML code of the page, so we can move on to the next step.

</div>
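
The fix described in the box above can be sketched as follows. This is only an illustration: the helper name `get_with_user_agent` and the `timeout` value are assumptions, and the `User-Agent` string is simply the Chrome-on-Windows example listed above. The last lines build the request without sending it, so you can inspect the header that would go out.

```python
import requests

# User-Agent copied from the Chrome-on-Windows example above
HEADERS = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
    )
}

def get_with_user_agent(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: send a GET request with a browser-like User-Agent."""
    return requests.get(url, headers=headers, timeout=timeout)

# Build (but do not send) the request to inspect its headers
prepared = requests.Request(
    'GET', 'https://socialdatascience.network/index.html', headers=HEADERS
).prepare()
print(prepared.headers['User-Agent'])
```

To actually fetch the page this way, you would call `response = get_with_user_agent(url)` and then check `response.status_code` as before.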

## 1.2 Creating a Selector object

Now that we have the HTML code, we can create a scrapy `Selector` object from it. This will allow us to use CSS and XPath selectors to find elements in the HTML code.

In [None]:
# Create a Selector object from the HTML code
sel = Selector(text = response.text)

We can now use the `sel.css()` method to find elements in the HTML code using CSS selectors. 

Let's practice with the `*` selector, which selects all elements in the HTML code:

In [None]:
# Print number of HTML elements returned by the CSS selector
print('Number of elements on the page with CSS:', len(sel.css('*')))

We can do the same with XPath selectors, using the `sel.xpath()` method. The syntax is a bit different, but the result is the same:

In [None]:
# Print out the number of elements on the page
print('Number of elements on the page with XPath:', len(sel.xpath('//*')))

# 2. CSS vs XPath selectors

## 2.1 Inspecting the elements

Let's start by inspecting the elements on the page. As we saw last week, we can inspect the elements on the page by right-clicking on the page and selecting 'Inspect'. (You can achieve the same by pressing `Ctrl+Shift+I` on Windows or `Cmd+Option+I` on Mac.)

Once you have the developer tools open, hover over the elements on the page to see their HTML code:

![](./figures/civica_screenshot.png)

Let's compare how we can get this element with CSS and XPath selectors.

## 2.2 CSS selectors

We learned that the titles of events are stored inside `h6` tags nested inside the link (`a`) tags, nested inside a `div` tag with the class `card-body`.

Using CSS selectors, we specify the tag name, the class with the `.` operator and the child element with the `>` operator:

```css
div.card-body > a > h6
```

In [None]:
# pprint is not part of the Setup imports, so import it here
from pprint import pprint

# Extract the HTML code of the element with CSS selectors
pprint(sel.css('div.card-body > a > h6').extract_first()) # this gets the first element

To extract ALL the elements that match this selector, use the `extract()` method after the `sel.css()` method:

In [None]:
# The extract() method returns a list of strings
titles = sel.css('div.card-body > a > h6').extract()
print('There are', len(titles), 'event titles on the page.') 

**OK, but this returns the tag itself. I only care about the text inside the tag.**

For this, we use [`scrapy`'s extension](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors) `::text`:

```css
div.card-body > a > h6::text
```

## 2.3 XPath

The syntax somewhat resembles the way we would write a path to a file in the **Terminal**. You can use the familiar `..` and `/` symbols when navigating through the HTML code.

### 2.3.1 Absolute paths


For example, a single forward slash `/` at the beginning of the path indicates that the path starts at the root of the document. When we get a `sel` object, we are looking at the entire HTML code of the page, so the following syntax:

```xpath
/html/head/title
```

will return the tag that contains the title of the page:

In [None]:
sel.xpath('/html/head/title').extract()

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

**How do I test that on the browser, without having to write Python code?**

On the Inspector, you can click on the 'Console' tab and type the following:

```javascript
$x('/html/head/title')
```

This will return an Array (similar to a Python list) that contains the element we are looking for. Take a look at the video below for a demonstration:

<div style="position: relative; padding-bottom: 59.375%; height: 0;"><iframe src="https://www.loom.com/embed/5c324141c4c24a6fb1ce7417c84ec422?sid=3d1cd528-fa33-4c55-ae25-adc0aeb6f0a6" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

:::

### 2.3.2 Relative paths

Just like with CSS selectors, you don't need to specify the entire path to the element you are looking for. You can use the `//` symbol to indicate that you are looking for an element anywhere in the HTML code. For example, the following XPath selector:

```xpath
//title
```
would also return the `<title>` tag.

Give it a go! Open the browser console and type:

```javascript
$x('//title')
```

You should get the same result as before.

Perhaps more usefully, use `.outerHTML` to see the entire HTML code of the element:

```javascript
$x('//title')[0].outerHTML // ONLY WORKS IN THE BROWSER
```

Like with CSS selectors, the search will not stop at the first element that matches the selector. It will return all the elements that match the selector.

### 2.3.3 Selecting elements by class

Instead of the `.` symbol used in CSS selectors, we use a mix of `contains()` and `@class` to select elements by class.

This is the syntax to select `<div>` elements with a `card-body` class:

```xpath
//div[contains(@class, 'card-body')]
```

In [None]:
# Extract the HTML code of all elements with XPath selectors
xpath_event_divs = '//div[contains(@class, "card-body")]'
xpath_all_events = sel.xpath(xpath_event_divs).extract()
print('There are', len(xpath_all_events), 'events on the page.')

In [None]:
# Uncomment to see the first event with all details
#print(xpath_all_events[0])

### 2.3.4 Of parents and children: the power of XPath

As mentioned above, XPath behaves a bit like a file system. We can use the `/..` symbol to get the parent of an element:

In [None]:
# Here's another way to grab that `div.card-body` element
print(sel.xpath('//h6/../..').extract_first())

💡 **PRO-TIP:** You can index the results with XPath (note that XPath counts from 1, not 0). Say you only care about the first match:

In [None]:
# Try it: replace [1] with [2] or any other number to get the corresponding element
indexed_div = sel.xpath('(//div[contains(@class, "card-body")])[1]').extract()

# I still used extract() instead of extract_first() 
# to show you that it returns a list with a single element
indexed_div

How do we get **just the children** of an element? We use the `/*` symbol:

In [None]:
children_of_div = sel.xpath('(//div[contains(@class, "card-body")])[1]/*').extract()

children_of_div

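**Moving up to a parent with `/..` is something you could not have done with CSS selectors!** (Direct children, on the other hand, can also be selected in CSS with `> *`.)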

How do I get the text inside the tag? We add `text()` at the end of the path (similar to `::text` in CSS selectors):


In [None]:
sel.xpath('(//div[contains(@class, "card-body")])[1]//h6/text()').extract()

How do I get attributes? We use the `@` symbol:

In [None]:
sel.xpath('(//a)[10]/@href').extract()

**What else can I do?**

- [W3 Schools XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp)
- Check out this [old school XPath Tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html)

## 2.4 More power to you: chaining CSS and XPath selectors

Note the different syntax for CSS and XPath selectors. CSS selectors are more straightforward, but XPath selectors are more powerful.

💡 **PRO-TIP:** What is really nice about `scrapy` is that you can chain CSS and XPath selectors together! This means you can use a CSS selector to get an element and an XPath selector to get its children.

Remember from the ✅ [Week 05 lab solutions](https://lse-dsi.github.io/DS105/2023/winter-term/weeks/week05/lab-solutions.html) that we used XPath to get speakers and dates:

```python
speakers_xpath = "//p[@class='card-text']/text()[1]"
speakers = selector.xpath(speakers_xpath).extract()

dates_xpath = "//p[@class='card-text']/text()[2]"
dates = selector.xpath(dates_xpath).extract()
```

Could I do this with a mix of CSS and XPath selectors?

In [None]:
speakers = sel.css('div.card-body > p').xpath('text()[1]').extract()
dates    = sel.css('div.card-body > p').xpath('text()[2]').extract()

In the end, it is up to you to decide which one you prefer.

# 3. Orienting the scraping process around containers


We have already talked about `pandas` but haven't done much with it at this stage in the course. This is because we're focused on **collecting** data. Once we have enough data and are ready to work with it, you will see how tables (data frames) provide an excellent and very convenient way to store data for analysis.

**In the meantime, trust me! You want to convert whatever data you capture into a pandas data frame.**

With what we have learned so far, it is easy to convert the event details we care about into a pandas data frame. We can summarise the whole process as:



In [None]:
# Capture all the info individually
titles = sel.css('div.card-body > a > h6::text').extract()  # ::text keeps only the title text
speakers = sel.css('div.card-body > p').xpath('text()[1]').extract()
dates = sel.css('div.card-body > p').xpath('text()[2]').extract()

# Put it all together into a DataFrame
events = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates})

# Uncomment to browse the DataFrame
#events

While the above works, it can lead to much repetition and 'workaround' code if the website you are scraping is not the most consistent. For example, some event divs may have a different class name, or some events list the speakers and dates in the opposite order.

## 3.1 Treat each event box as a template

Instead of going directly to the titles, speakers, and dates, we can capture the `<div>` that represents an event and then extract the details from its children.

In [None]:
# Capture the divs but don't extract yet!
containers = sel.css('div.card-body')

len(containers)

In [None]:
containers[0]

Note one very important thing about the `containers[0]` object: it is a `Selector` object, not a string. Therefore, it contains other methods and attributes inside it.

**Why is that useful?** 

We can treat this object as our HTML code, ignore the rest of the HTML, and use the same methods to get the information we need from it. We don't need to scrape the entire page again to get the necessary information.


### 3.1.1 Extract all details of a single event

To illustrate the observation above, let's get the title, speaker and date of _this particular event_:

In [None]:
event_title   = containers[0].css('h6::text').extract_first()
event_speaker = containers[0].xpath('p/text()[1]').extract_first()
event_date    = containers[0].xpath('p/text()[2]').extract_first() 

## 3.2 Putting it all together (NOT ELEGANT)


I know what you are thinking... _'but then I will have to do this for every event!'_

True. If you were to proceed with the above and use what you learned from your previous Python training, you would have to write a loop that goes through each event and extracts the details.

It's likely that you would be tempted to write code that looks like this:

```python
#### WE DON'T WANT YOU TO WRITE THIS TYPE OF CODE IN THIS COURSE! I'LL EXPLAIN IN SECTION 3.3 ####


# Create an empty list to store the details of each event
event_titles = []
event_speakers = []
event_dates = []

# Loop through each event
for container in containers:
    # Extract the details of the event
    title = container.css('h6::text').extract_first()
    speaker = container.xpath('p/text()[1]').extract_first()
    date = container.xpath('p/text()[2]').extract_first()

    # Append the details to the lists
    event_titles.append(title)
    event_speakers.append(speaker)
    event_dates.append(date)

```


## 3.3 Using custom functions (ELEGANT 🎩)

What we want is for you to use best practices when writing code. Code that is efficient, easy to read, and easy to alter in the future. With practice, you will realise that `for` loops are not always the best way to go about things. If you find a bug in the code above, you would have to go through the entire loop to locate the source of the bug.

**Functions** (defined with the `def` keyword) are a great way to encapsulate a piece of code that does a single task. You can run the same function with different parameters to test out different scenarios. If you find a bug in the function, you only have to fix it once.

What would a function look like for the above?


In [None]:
def scrape_event(event_container):
    event_title   = event_container.css('h6::text').extract_first()
    event_speaker = event_container.xpath('p/text()[1]').extract_first()
    event_date    = event_container.xpath('p/text()[2]').extract_first()
    return {'title': event_title, 'speaker': event_speaker, 'date': event_date}

This way, you can test the function for individual containers:

In [None]:
# Change [0] to [1] or any other number to get the corresponding event
scrape_event(containers[0])

**Notice that we returned a dictionary instead of a list.** Key-value pairs are the most natural way to store a single record of data. 

If we collect all the dictionaries in a list, we get a list of dictionaries, which is easy to convert to a pandas data frame.

In [None]:
# Scrape all events (note that we're using list comprehension)
events = [scrape_event(container) for container in containers]

# Creating a dataframe is easier
df = pd.DataFrame(events)

# Uncomment to browse the DataFrame
#df

## 3.4 Writing Great Documentation

**Your future self (and the reviewers of your code) will thank you for it!**

Writing maintainable code is not just about writing code that works. It's about writing code that is easy to understand and easy to alter in the future.


<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

💡 **TIP:** Always think to yourself: what comments or documentation can I add to my code so that if I return to it in a few months, I would still understand what I was trying to do?

</div>

One excellent way to document your code is to write a [**docstring**](https://realpython.com/documenting-python-code/). A docstring is a string that comes right after the `def` line and describes what the function does. Writing a docstring for every function you write is an excellent practice.

In [None]:
def scrape_event(event_container):
    """Scrape details from a single event container.
    
    An event container looks like this:
    <div class="card-body">
        <a href="...">
            <h6>Event title</h6>
        </a>
        <p>Speaker name</p>
        <p>Date</p>
    </div>

    This function captures the title, speaker, and date from the container.
    
    Args:
        event_container (Selector): a Selector object with the HTML of the event container

    Returns:
        dict: a dictionary with the title, speaker, and date of the event
    """

    event_title   = event_container.css('h6::text').extract_first()
    event_speaker = event_container.xpath('p/text()[1]').extract_first()
    event_date    = event_container.xpath('p/text()[2]').extract_first()
    return {'title': event_title, 'speaker': event_speaker, 'date': event_date}
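
A docstring is more than a comment: Python stores it on the function, so it can be read back later without opening the source code. Here is a minimal sketch using a small made-up function (`clean_text` is a hypothetical name used only for this illustration):

```python
import inspect

def clean_text(text):
    """Remove leading and trailing whitespace from scraped text.

    Args:
        text (str): the raw text extracted from a tag

    Returns:
        str: the text without surrounding whitespace
    """
    return text.strip()

# inspect.getdoc() returns the docstring with its indentation cleaned up
summary_line = inspect.getdoc(clean_text).splitlines()[0]
print(summary_line)

# help() prints the signature together with the full docstring
help(clean_text)
```

In a notebook, typing `clean_text?` in a cell shows the same information.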

I could also encapsulate the whole process of collecting the containers into a function:

In [None]:
def scrape_events(url):
    """Scrape all events from a given URL.
    
    Args:
        url (str): the URL of the page with events

    Returns:
        pd.DataFrame: a DataFrame with the title, speaker, and date of each event,
            or None if the request was not successful
    """

    # Load the page
    response = requests.get(url)

    if response.status_code != 200:
        print('Something went wrong, status code:', response.status_code)
        return None

    sel = Selector(text = response.text)

    # Capture the divs but don't extract yet!
    containers = sel.css('div.card-body')

    # Scrape all events
    events = [scrape_event(container) for container in containers]

    # Creating a dataframe is easier
    df = pd.DataFrame(events)
    return df
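
Note that `scrape_events()` returns `None` when the request fails, so a later call such as `df.head()` would raise an `AttributeError`. One possible alternative, sketched below, is to let `requests` raise an exception through `response.raise_for_status()`. To keep the example runnable without network access, it builds a fake response by hand; `fake_response` and `outcome` are names made up for this illustration.

```python
import requests

# Build a fake response by hand so the example runs without network access
fake_response = requests.Response()
fake_response.status_code = 403

try:
    # Raises requests.HTTPError for 4xx and 5xx status codes, does nothing otherwise
    fake_response.raise_for_status()
    outcome = 'ok'
except requests.HTTPError as error:
    outcome = 'failed'
    print('Request failed:', error)
```

Inside `scrape_events()`, a single `response.raise_for_status()` line right after `requests.get(url)` could take the place of the manual status check, if you prefer the function to stop loudly rather than return `None`.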

In theory, you wouldn't even keep the code above inside a notebook like this. You would write it in a `.py` file and import the function when needed.

Your final code would then look like this:



In [None]:
url = 'https://socialdatascience.network/index.html#schedule'
df = scrape_events(url)

df.head()

Neat!

# What to do next

- Go through all the reference links in this notebook to recap CSS and XPath selectors
- Try incorporating the notions above into your 📝 **W06 Summative (30%)** submission