{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "LSE Data Science Institute | ME204 (2023/24) | Week 02 Day 02\n", "\n", "# 🗓️ Week 02 – Day 02: Web scraping tricks\n", "\n", "\n", "MORNING NOTEBOOK\n", "\n", "**DATE:** 16 July 2024\n", "\n", "**AUTHOR:** Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io)\n", "\n", "-----\n", "\n", "**📚 LEARNING OBJECTIVES:**\n", "\n", "- Discover the basics of XPath and how it compares to CSS selectors\n", "- How to orient the web scraping process around **containers** instead of individual elements\n", "- How to write **custom functions** to extract data from a website\n", "\n", "\n", "**USEFUL LINKS:**\n", "\n", "- [W3 Schools - CSS Selectors](https://www.w3schools.com/CSS/css_selectors.asp) (note: not all of these selectors are supported by `scrapy`)\n", "- [Complete list of CSS selectors](https://www.w3.org/TR/selectors-3/#selectors)\n", "- [`scrapy` extensions to CSS Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference Material " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 1. HTML in brief\n", "\n", "- You learned that **HTML files are structured into [tags](https://www.w3schools.com/TAGs/) (or elements)**. Each tag carries a specific meaning, allowing browsers to display the information accurately. \n", "\n", "    - For example, a `<p>` tells the browser, 'This is a paragraph,' \n", "    \n", "    - whereas a `<div>` tag tells the browser, 'this is a box of elements'.\n", "\n", "- HTML tags can have **attributes**.\n", "\n", "    - For example, whenever we add a link (`<a>`), we need to specify the location where this link is pointing to (`href`):\n", "\n", "    ```html\n", "    <a href=\"...\">DS105 main page</a>\n", "    ```\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 2. Styling (CSS) in brief\n", "\n", "You also learned that one can apply **styles** to a tag using a language called CSS.\n", "\n", "- Styles can appear **inline**, as specified by the `style` attribute: \n", "\n", "   ```html\n", "   <p style=\"background-color:red; color:white\">Some text</p>\n", "   ```\n", "\n", "- But styles can also be specified separately via a `.css` file. In that file, one uses **CSS Selectors** to identify which tags should be styled and how. For example, if I want _all_ my `<p>` tags to have the same style, I'd write:\n", "\n", "    ```css\n", "    p {\n", "        margin-bottom:10px;\n", "        background-color:red;\n", "        color:white\n", "    }\n", "    ```\n", "\n", "    When I load this CSS file into my HTML, the styling above will apply to all `<p>`s.\n", "\n", "    For the above to work, I'd have to add the following to my HTML document:\n", "\n", "    ```html\n", "    <html>\n", "    <head>\n", "        <link rel=\"stylesheet\" href=\"my_styles.css\">\n", "    </head>\n", "\n", "    <body>\n", "        ...\n", "    </body>\n", "    </html>\n", "    ```\n", "\n", "

\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 3. The class attribute\n", "\n", "A class can be applied to style multiple elements at once. \n", "\n", "```html\n", "<p class=\"coloured\">Some text</p>\n", "```\n", "\n", "The way to specify the style of a class using **CSS selectors** is with a dot (`.`). \n", "\n", "For example, the class above can be specified in my CSS file as:\n", "\n", "```css\n", "p.coloured{\n", "    ......\n", "}\n", "```\n", "\n", "or simply:\n", "\n", "```css\n", ".coloured{\n", "    ......\n", "}\n", "```\n", "\n", "\n", "
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 4. The id attribute\n", "\n", "An `id` is a unique identifier of an element. It should only appear once in a page. We specify ids with the 'hashtag' symbol (`#`).\n", "\n", "Therefore, if I have an element like\n", "\n", "```html\n", "<p id=\"uniquely-huge\">Some text</p>\n", "```\n", "\n", "I could specify the **CSS selector** as:\n", "\n", "```css\n", "p#uniquely-huge {\n", "    ......\n", "}\n", "``` \n", "\n", "or simply:\n", "\n", "```css\n", "#uniquely-huge {\n", "    ......\n", "}\n", "```\n", "\n", "(we don't even need to specify the tag)\n", "\n", "

\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 5. A full example with classes and id\n", "\n", "Take, for example, the following HTML document:\n", "\n", "```html\n", "<html>\n", "    <head>\n", "        <link rel=\"stylesheet\" href=\"my_styles.css\">\n", "    </head>\n", "\n", "    <body>\n", "        <p>Some text</p>\n", "\n", "        <p class=\"coloured\">Some text with coloured background</p>\n", "\n", "        <p class=\"coloured\" id=\"uniquely-huge\">Some text with coloured background</p>\n", "    </body>\n", "</html>\n", "```\n", "\n", "Suppose we also have a `my_styles.css` file as below:\n", "\n", "```css\n", "p {\n", "    margin-bottom:10px;\n", "}\n", "\n", "p.coloured {\n", "    background-color:red;\n", "    color:white\n", "}\n", "\n", "#uniquely-huge {\n", "    font-size: 2em;\n", "}\n", "```\n", "\n", "Paste the CSS above into a text editor and save it as `my_styles.css` (in the same folder as the HTML file). Then, open the HTML file in a browser to see the styles applied.\n", "\n", "
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 6. Using CSS selectors for web scraping\n", "\n", "In brief, when collecting data from a public webpage, this is a skeleton of what you need:\n", "\n", "```python\n", "response = requests.get('<URL>')\n", "sel = Selector(text=response.text)\n", "sel.css('<CSS selector>')\n", "```\n", "\n", "The key for the rest of this notebook is learning to identify what must be written in the `<CSS selector>` part. \n", "\n", "- You learned that you can include the names of specific tags directly. For example, `sel.css('h3').getall()` will return a list of all `h3` elements in the entire page.\n", "- You also learned that you can find the closest **container** (say, `div.card-box`) and then scrape the contents of this box later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## 7. ⭐ CSS selectors cheatsheet ⭐\n", "\n", "| Selector | Example | Use Case Scenario |\n", "|-----------------------|--------------------------|------------------------------------------------|\n", "| * | * | This selector picks all elements within a page. Not much use for scraping on its own, but still good to know. |\n", "| .class | .card-title | The simplest CSS selector is targeting the class attribute. If only your target element is using it, then it might be sufficient. |\n", "| .class1.class2 | .card-heading.card-title | There are elements with a class like class=“card-heading card-title”. When we see a space, it is because the element is using several classes. However, there’s no one fixed way of selecting the element. Try keeping the space; if that doesn’t work, then replace the space with a dot. |\n", "| #id | #card-description | What if the class is used in too many elements or if the element doesn’t have a class? Picking the ID can be the next best thing. The only problem is that IDs are unique per element, so it won’t do if you want to scrape several elements at once. |\n", "| element | h4 | To pick an element, all we need to add to our parser is the HTML tag name. |\n", "| element.class | h4.card-title | This is the most common selector we’ll be using in our projects. |\n", "| parentElement > childElement | div > h4 | We can tell our scraper to extract an element inside another. In this example, we want it to find the h4 element whose parent element is a div. |\n", "| parentElement.class > childElement | div.card-body > h4 | We can combine the previous logic to specify a parent element and extract a specific CSS child element. This is super useful when the data we want doesn’t have any class or ID but is inside a parent element with a unique class/ID. |\n", "| [attribute] | [href] | Another great way to target an element with no clear class to choose from. Your scraper will extract all elements containing the specific attribute. In this case, it will take all `<a>` tags, which are the most common element to contain an href attribute. |\n", "| [attribute=value] | [target=_blank] | We can tell our scraper to extract only the elements with a specific value inside its attribute. |\n", "| element[attribute=value] | a[rel=next] | This is the selector we used to add a crawling feature to our Scrapy script: `next_page = response.css('a[rel=next]').attrib['href']`. The target website was using the same class for all its pagination links, so we had to come up with a different solution. |\n", "| [attribute~=value] | [title~=rating] | This selector will pick all the elements containing the word ‘rating’ inside their title attribute. |\n", "\n", "Source: [The Only CSS Selectors Cheat Sheet You Need for Web Scraping](https://www.scraperapi.com/blog/css-selectors-cheat-sheet/#CSS-Selectors-Cheat-Sheet)\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ⚙️ Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from scrapy import Selector\n", "import pandas as pd\n", "from tqdm.notebook import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A little demonstration of `tqdm` again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "for i in tqdm(range(10)):\n", "    print(i)\n", "    time.sleep(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 📋 Warm-Up Activity: Practice with CSS selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "🎯 **ACTION POINTS**\n", "\n", "1. Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website, inspect the page (mouse right-click + Inspect) and find the way to the name of the first event on the page. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Write down the **full** \"directions\" inside the HTML file to reach the event title. For example, maybe you will find that:\n", "\n", "    > _The first event title is inside a \\<body\\> ➡️ \\<div\\> ➡️ \\<div\\> ➡️ \\<h4\\> tag_.\n", "\n", "    Write it in the markdown cell below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Delete this line and write your answer here_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Write the required Python code to scrape the element using the CSS selector you identified above. \n", "\n", "    - Don't use the notion of containers just yet - we will practice that later in the W05 lecture. \n", "    - For now, just write the full CSS selector you identified above.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 
**Let's simplify.** Let's capture the title of the **first event** again, but instead of writing the entire full absolute path, like above, identify a more direct way to capture it. \n", "\n", " - Note: Either use scrapy's `.get()` or use `getall()` and later filter the list using regular Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. **Collect all the titles**. OK, now let's practice getting all event titles from the entire page. Save the titles into a list.\n", "\n", " **NOTE:** Again, collect all the information from the webpage at once. Don't use the notion of containers just yet. We will practice it in the W05 lecture." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Do the same with the dates of the events and speaker names and save them to separate lists. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. 🥇 **Challenge:** Combine all these lists you captured above into a single pandas data frame and save it to a CSV file. \n", "\n", " Tip 1: Say you have lists called `dates`, `titles`, `speakers`, you can create a data frame (a table) like this:\n", " \n", " ```python\n", " df = pd.DataFrame({'date': dates,\n", " 'title': titles,\n", " 'speakers': speakers})\n", " ``` \n", " \n", " Tip 2: What if an event does not have a date or speaker name? 
Set that particular event's date or speaker to `None`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Double-check that the CSV file was created correctly by opening it using pandas. Then convert the columns to appropriate data types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete this line and replace it with your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. The (X)Path to success\n", "\n", "We have seen how we can use CSS selectors to find elements in the HTML code of a webpage. However, there is another way to do this: **XPath**. \n", "\n", "
\n", "\n", "📕 **What is XPath?** \n", "\n", "XPath is a query language for selecting nodes from an XML document.\n", "\n", "XPath is a bit more powerful than CSS selectors, but it is also a bit more complicated. It allows us more flexibility when it comes to finding the children or parents of certain elements. What's more, if you find yourself able to get to a parent element with CSS but unable to get to its children because they have some obscure class name, XPath can be 'chained' at the end of a CSS selector. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 Loading the page and getting the HTML\n", "\n", "First, we send a GET request to the page to obtain a response object, with which we can get the HTML code of the page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://socialdatascience.network/index.html#schedule'\n", "\n", "# Load the first page\n", "response = requests.get(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "💡 **PRO-TIP:** \n", "\n", "In reality, `requests.get(url)` is a shortening of `requests.request('GET', url)`.\n", "\n", "Whenever we send a request to a server, we need to specify the type of request we are sending. The most common types are `GET`, which intuitively means that we are asking the server to send us some data, and `POST`, which means that we are sending some data to the server.\n", "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right after collecting the page, it's a good practice to check if the request was successful. We can do this by checking the status code of the response:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the status code, 200 means OK, anything else means something went wrong\n", "if not response.status_code == 200:\n", " print('Something went wrong, status code:', response.status_code)\n", "else:\n", " print('Everything is OK, status code:', response.status_code)\n", "\n", "# print the first few lines of the HTML code\n", "print(response.text[:100])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This can also be done with the response.ok attribute which returns True if the status code is 200\n", "if not response.ok:\n", " print('Something went wrong, status code:', response.status_code)\n", "else:\n", " print('Everything is OK, status code:', response.status_code)\n", "\n", "# print the first few lines of the HTML code\n", "print(response.text[:100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is good practice to build functions for repetitive tasks. Let's create a function that will load the page and check if the request was successful and test it within the same cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_page(url):\n", " response = requests.get(url)\n", " if not response.ok:\n", " print('Something went wrong, status code:', response.status_code)\n", " return None\n", " return response.text\n", "\n", "html = get_page(url)\n", "print(html[:100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Did you get a 403 error?**\n", "\n", "If, instead, you got a 403 error at this stage, it's probably because the server refused our request: it thinks we are a 'robot', not a user accessing the page from a web browser. They are not entirely wrong...\n", "\n", "When we send a request, the server looks for a header (a piece of metadata) in our request called `User-Agent`. Browsers send `User-Agent` headers that identify the browser and the operating system being used. For example:\n", "\n", "- Chrome on Windows 10: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36`\n", "\n", "- Firefox on Windows 10: `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0`\n", "\n", "- Chrome on Mac OS: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11_6_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36`\n", "\n", "A server might be configured to block the request when it doesn't come from a standard browser. We can fix this by amending our request: we specify a `User-Agent` header and pass it to the `requests.get()` function. So far, though, it looks like we are getting the HTML code of the page, so we can move on to the next step.\n", "\n", "
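If you ever do hit a 403, a minimal sketch of the fix looks like this (the header string below is just one of the example strings above; any realistic browser string works):

```python
import requests

# Servers sometimes block requests that don't carry a browser-like User-Agent.
# This string mimics Chrome on Windows 10 (copied from the examples above).
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/94.0.4606.81 Safari/537.36')
}

# Prepare (but don't send) a request, just to confirm the header travels with it
req = requests.Request('GET', 'https://socialdatascience.network/index.html',
                       headers=headers).prepare()
print(req.headers['User-Agent'])

# In practice, you would simply call:
# response = requests.get(url, headers=headers)
```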
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 Creating a Selector object\n", "\n", "Now that we have the HTML code, we can create a scrapy `Selector` object from it. This will allow us to use CSS and XPath selectors to find elements in the HTML code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a Selector object from the HTML code\n", "sel = Selector(text = response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You know we can now use the `sel.css()` method to find elements in the HTML code using CSS selectors. \n", "\n", "Let's practice with the `*` selector, which selects all elements in the HTML code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Print number of HTML elements returned by the CSS selector\n", "print('Number of elements on the page with CSS:', len(sel.css('*')))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sel.css('div > div#logo')[0].get())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same with XPath selectors, using the `sel.xpath()` method. The syntax is a bit different, but the result is the same:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print out the number of elements on the page\n", "print('Number of elements on the page with XPath:', len(sel.xpath('//*')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. CSS vs XPath selectors\n", "\n", "Let's compare how we can get this element with CSS and XPath selectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2. 
CSS selectors\n", "\n", "We learned that the titles of events are stored inside `h6` tags nested inside the link (`a`) tags, nested inside a `div` tag with the class `card-body`.\n", "\n", "Using CSS selectors, we specify the tag name, the class with the `.` operator and the child element with the `>` operator:\n", "\n", "```css\n", "div.card-body > a > h6\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the HTML code of the element with CSS selectors\n", "print(sel.css('div.card-body > a > h6::text').get()) # this gets the first element" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract ALL the elements with this class, use the `getall()` method (the modern name for `extract()`) after the `sel.css()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The .getall() method returns a list of strings\n", "titles = sel.css('div.card-body > a > h6').getall()\n", "print('There are', len(titles), 'event titles on the page.') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**OK, but this returns the tag itself. I only care about the text inside the tag.**\n", "\n", "For this, we use [`scrapy`'s extension](https://docs.scrapy.org/en/latest/topics/selectors.html#extensions-to-css-selectors) `::text`:\n", "\n", "```css\n", "div.card-body > a > h6::text\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 XPath\n", "\n", "The syntax resembles a bit the way we would write a path to a file in the **Terminal**. You can use the familiar `..` and `/` symbols when navigating through the HTML code.\n", "\n", "### 2.3.1 Absolute paths\n", "\n", "\n", "For example, a single forward slash `/` at the beginning of the path indicates that you're starting from the root of the document. 
When we get a `sel` object, we are looking at the entire HTML code of the page, so the following syntax:\n", "\n", "```xpath\n", "/html/head/title\n", "```\n", "\n", "will return the tag that contains the title of the page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sel.xpath('/html/head/title').get()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sel.xpath('//div/p/..').getall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "::: {style=\"margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;\"}\n", "\n", "**How do I test that on the browser, without having to write Python code?**\n", "\n", "On the Inspector, you can click on the 'Console' tab and type the following:\n", "\n", "```javascript\n", "$x('/html/head/title')\n", "```\n", "\n", "This will return an Array (similar to a Python list) that contains the element we are looking for.\n", "\n", "
\n", "\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3.2 Relative paths\n", "\n", "Just like with CSS selectors, you don't need to specify the entire path to the element you are looking for. You can use the `//` symbol to indicate that you are looking for an element anywhere in the HTML code. For example, the following XPath selector:\n", "\n", "```xpath\n", "//title\n", "```\n", "would also return the `<title>` tag.\n", "\n", "Give it a go! Open the browser console and type:\n", "\n", "```javascript\n", "$x('//title')\n", "```\n", "\n", "to confirm that you got the same result as before.\n", "\n", "Perhaps more usefully, use `.outerHTML` to see the entire HTML code of the element:\n", "\n", "```javascript\n", "$x('//title')[0].outerHTML // ONLY WORKS IN THE BROWSER\n", "```\n", "\n", "Like with CSS selectors, the search will not stop at the first element that matches the selector. It will return all the elements that match the selector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3.3 Selecting elements by class\n", "\n", "Instead of the `.` symbol used in CSS selectors, we use a mix of `contains()` and `@class` to select elements by class.\n", "\n", "This is the syntax to select `<div>` elements with a `card-body` class:\n", "\n", "```xpath\n", "//div[contains(@class, 'card-body')]\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the HTML code of all elements with XPath selectors\n", "xpath_event_divs = '//div[contains(@class, \"card-body\")]'\n", "xpath_all_events = sel.xpath(xpath_event_divs).getall()\n", "print('There are', len(xpath_all_events), 'events on the page.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# See the first event with all details\n", "print(xpath_all_events[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3.4 Of parents and children: the 
power of XPath\n", "\n", "As mentioned above, XPath behaves a bit like a file system. We can use the `/..` symbol to get the parent of an element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here's another way to grab that `div.card-body` element\n", "print(sel.xpath('//h6/../..').get())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "💡 **PRO-TIP:** You can index the results with XPath. Say you only care about the first match:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try it: replace [1] with [2] or any other number to get the corresponding element\n", "indexed_div = sel.xpath('(//div[contains(@class, \"card-body\")])[1]').getall()\n", "\n", "# I still used getall() instead of get() \n", "# to show you that it returns a list with a single element\n", "indexed_div" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we get **just the children** of an element? We use the `/*` symbol:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "children_of_div = sel.xpath('(//div[contains(@class, \"card-body\")])[1]/*').getall()\n", "\n", "children_of_div" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You could not have done this with CSS selectors!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do I get the text inside the tag? We use the `text()` method at the end (similarly to `::text` in CSS selectors):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sel.xpath('(//div[contains(@class, \"card-body\")])[1]//h6/text()').getall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do I get attributes? 
We use the `@` symbol:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sel.xpath('(//a)[10]/@href').getall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What else can I do?**\n", "\n", "- [W3 Schools XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp)\n", "- Check out this [old school XPath Tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 More power to you: chaining CSS and XPath selectors\n", "\n", "Note the different syntax for CSS and XPath selectors. CSS selectors are more straightforward, but XPath selectors are more powerful.\n", "\n", "💡 **PRO-TIP:** What is really nice about `scrapy` is that you can chain CSS and XPath selectors together! This means you can use a CSS selector to get an element and an XPath selector to get its children.\n", "\n", "With XPath, we managed to split the content of the card-text paragraph into two separate elements: the speaker and the date. \n", "\n", "```python\n", "speakers_xpath = \"//p[@class='card-text']/text()[1]\"\n", "speakers = sel.xpath(speakers_xpath).getall()\n", "\n", "dates_xpath = \"//p[@class='card-text']/text()[2]\"\n", "dates = sel.xpath(dates_xpath).getall()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Could I do this with a mix of CSS and XPath selectors?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "speakers = sel.css('div.card-body > p').xpath('text()[1]').getall()\n", "dates = sel.css('div.card-body > p').xpath('text()[2]').getall()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "speakers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the end, it is up to you to decide which one you prefer."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Orienting the scraping process around containers\n", "\n", "\n", "We have already talked about `pandas` but haven't done much with them at this stage in the course. This is because we're focused on **collecting** data. Once we have enough data and are ready to work with it, you will see how tables (data frames) provide an excellent and very convenient way to store data for analysis.\n", "\n", "**In the meantime, trust me! You want to convert whatever data you capture into a pandas data frame.**\n", "\n", "With what we have learned so far, it is easy to convert the event details we care about into a pandas data frame. We can summarise the whole process as:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Capture all the info individually\n", "titles = sel.css('div.card-body > a > h6::text').getall()\n", "speakers = sel.css('div.card-body > p').xpath('text()[1]').getall()\n", "dates = sel.css('div.card-body > p').xpath('text()[2]').getall()\n", "\n", "# Put it all together into a DataFrame\n", "events = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates})\n", "\n", "# Uncomment to browse the DataFrame\n", "events" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the above works, it can lead to much repetition and 'workaround' code if the website you are scraping is not the most consistent. For example, some event divs may have a different class name, or some events list the speakers and dates in the opposite order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 Treat each event box as a template\n", "\n", "Instead of going directly to the titles, speakers, and dates, we can capture the `<div>` that represents an event and then extract the details from its children." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Capture the divs but don't extract yet!\n", "containers = sel.css('div.card-body')\n", "\n", "len(containers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_box = containers[0]\n", "\n", "# Note: first_box = containers.get() is NOT equivalent:\n", "# .get() would return the HTML as a plain string, whereas\n", "# containers[0] gives us a Selector object we can keep querying" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(first_box.get())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note one very important thing about the `containers[0]` object: it is a `Selector` object, not a string. Therefore, it contains other methods and attributes inside it.\n", "\n", "**Why is that useful?** \n", "\n", "We can treat this object as our HTML code, ignore the rest of the HTML, and use the same methods to get the information we need from it. We don't need to scrape the entire page again to get the necessary information.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Extract all details of a single event\n", "\n", "To illustrate the observation above, let's get the title, speaker and date of _this particular event_:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_title = first_box.css('h6::text').get()\n", "event_speaker = first_box.xpath('p/text()[1]').get()\n", "event_date = first_box.xpath('p/text()[2]').get() " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_title" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_speaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 Putting it all together (NOT ELEGANT)\n", "\n", "\n", "I know what you are thinking... 
_'but then I will have to do this for every event!'_\n", "\n", "True. If you were to proceed with the above and use what you learned from your previous Python training, you would have to write a loop that goes through each event and extracts the details.\n", "\n", "It's likely that you would be tempted to write code that looks like this:\n", "\n", "```python\n", "#### WE DON'T WANT YOU TO WRITE THIS TYPE OF CODE IN THIS COURSE! I'LL EXPLAIN IN SECTION 3.4 ####\n", "\n", "\n", "# Create empty lists to store the details of each event\n", "event_titles = []\n", "event_speakers = []\n", "event_dates = []\n", "\n", "# Loop through each event\n", "for container in containers:\n", "    # Extract the details of the event\n", "    title = container.css('h6::text').get()\n", "    speaker = container.xpath('p/text()[1]').get()\n", "    date = container.xpath('p/text()[2]').get()\n", "\n", "    # Append the details to the lists\n", "    event_titles.append(title)\n", "    event_speakers.append(speaker)\n", "    event_dates.append(date)\n", "\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The goal: a pandas-style one-liner, something like\n", "# df['container'].apply(get_speaker)\n", "# ('df' and 'get_speaker' don't exist yet, which is why the line is\n", "# commented out -- it only illustrates why we avoid the for loop above)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.4 Using custom functions (ELEGANT 🎩)\n", "\n", "What we want is for you to use best practices when writing code: code that is efficient, easy to read, and easy to alter in the future. With practice, you will realise that `for` loops are not always the best way to go about things. If you find a bug in the code above, you would have to go through the entire loop to locate the source of the bug.\n", "\n", "**Functions** (with the `def` operator) are a great way to encapsulate a piece of code that does a single task. You can run the same function with different parameters to test out different scenarios. 
If you find a bug in the function, you only have to fix it once.\n", "\n", "What would a function look like for the above?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def scrape_event(event_container):\n", "    event_title = event_container.css('h6::text').get()\n", "    event_speaker = event_container.xpath('p/text()[1]').get()\n", "    event_date = event_container.xpath('p/text()[2]').get()\n", "    return {'title': event_title, 'speaker': event_speaker, 'date': event_date}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This way, you can test the function on individual containers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Change [0] to [1] or any other number to get the corresponding event\n", "scrape_event(containers[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice that we returned a dictionary instead of a list.** Key-value pairs are the most natural way to store a single record of data.\n", "\n", "If we collect all these dictionaries, we get a list of dictionaries, which is easy to convert to a pandas data frame."
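, "\n",
"\n",
"As a quick illustration with made-up records (not scraped data), `pd.DataFrame` turns a list of dictionaries straight into rows and columns, matching the dictionary keys to column names:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"records = [\n",
"    {'title': 'Talk A', 'speaker': 'Alice', 'date': '01 Jan'},\n",
"    {'title': 'Talk B', 'speaker': 'Bob', 'date': '02 Feb'},\n",
"]\n",
"\n",
"demo_df = pd.DataFrame(records)\n",
"demo_df.columns.tolist()  # -> ['title', 'speaker', 'date']\n",
"len(demo_df)              # -> 2\n",
"```"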
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Scrape all events (note that we're using list comprehension)\n", "events = [scrape_event(container) for container in containers]\n", "\n", "# Creating a dataframe is easier\n", "df = pd.DataFrame(events)\n", "\n", "# Uncomment to browse the DataFrame\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NOT IDEAL: it is better to use the list comprehension method above\n", "\n", "events = []\n", "\n", "for container in containers:\n", " events.append(scrape_event(container))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 Writing Great Documentation\n", "\n", "**Your future self (and the reviewers of your code) will thank you for it!**\n", "\n", "Writing maintainable code is not just about writing code that works. It's about writing code that is easy to understand and easy to alter in the future.\n", "\n", "\n", "<div style=\"width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;\">\n", "\n", "💡 **TIP:** Always think to yourself: what comments or documentation can I add to my code so that if I return to it in a few months, I would still understand what I was trying to do?\n", "\n", "</div>\n", "\n", "One excellent way to document your code is to write a [**docstring**](https://realpython.com/documenting-python-code/). A docstring is a string that comes right after the `def` operator and describes what the function does. Writing a docstring for every function you write is an excellent practice." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def scrape_event(event_container):\n", " \"\"\"Scrape details from a single event container.\n", " \n", " An event container looks like this:\n", " <div class=\"card-body\">\n", " <a href=\"...\">\n", " <h6>Event title</h6>\n", " </a>\n", " <p>Speaker name</p>\n", " <p>Date</p>\n", " </div>\n", "\n", " This function captures the title, speaker, and date from the container.\n", " \n", " Args:\n", " event_container (Selector): a Selector object with the HTML of the event container\n", "\n", " Returns:\n", " dict: a dictionary with the title, speaker, and date of the event\n", " \"\"\"\n", "\n", " event_title = event_container.css('h6::text').get()\n", " event_speaker = event_container.xpath('p/text()[1]').get()\n", " event_date = event_container.xpath('p/text()[2]').get()\n", " return {'title': event_title, 'speaker': event_speaker, 'date': event_date}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I could also encapsulate the whole process of collecting the containers into a function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def scrape_events(url):\n", " \"\"\"Scrape all events from a given URL.\n", " \n", " Args:\n", " url (str): the URL of the page with events\n", "\n", " Returns:\n", " pd.DataFrame: a DataFrame with the title, speaker, and date of each event\n", " \"\"\"\n", "\n", " # Load the page\n", " response = requests.get(url)\n", "\n", " if not response.status_code == 200:\n", " print('Something went wrong, status code:', response.status_code)\n", " return\n", "\n", " sel = Selector(text = response.text)\n", "\n", " # Capture the divs but don't extract yet!\n", " containers = sel.css('div.card-body')\n", "\n", " # Scrape all events\n", " events = [scrape_event(container) for container in containers]\n", "\n", " # Creating a dataframe is easier\n", " df = pd.DataFrame(events)\n", " return 
df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In theory, you wouldn't even keep the code above inside a notebook like this. You would write it in a `.py` file and import the function when needed.\n", "\n", "Your final code would then look like this:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://socialdatascience.network/index.html#schedule'\n", "df = scrape_events(url)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neat!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What to do next\n", "\n", "- Go through all the reference links in this notebook to recap CSS and XPath selectors" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }