` elements with a `card-body` class:\n",
"\n",
"```css\n",
"//div[contains(@class, 'card-body')]\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the HTML code of all elements with XPath selectors\n",
"xpath_event_divs = '//div[contains(@class, \"card-body\")]'\n",
"xpath_all_events = sel.xpath(xpath_event_divs).getall()\n",
"print('There are', len(xpath_all_events), 'events on the page.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment to see the first event with all details\n",
"print(xpath_all_events[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.4 Of parents and children: the power of XPath\n",
"\n",
"As mentioned above, XPath behaves a bit like a file system. We can use the `/..` symbol to get the parent of an element:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here's another way to grab that `div.card-body` element\n",
"print(sel.xpath('//h6/../..').get())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"💡 **PRO-TIP:** You can index the results with XPath. Say you only care about the first match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try it: replace [1] with [2] or any other number to get the corresponding element\n",
"indexed_div = sel.xpath('(//div[contains(@class, \"card-body\")])[1]').getall()\n",
"\n",
"# I still used getall() instead of get() \n",
"# to show you that it returns a list with a single element\n",
"indexed_div"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we get **just the children** of an element? We use the `/*` symbol:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"children_of_div = sel.xpath('(//div[contains(@class, \"card-body\")])[1]/*').getall()\n",
"\n",
"children_of_div"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**You could not have done this with CSS selectors!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do I get the text inside the tag? We use the `text()` method at the end (similarly to `::text` in CSS selectors):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sel.xpath('(//div[contains(@class, \"card-body\")])[1]//h6/text()').getall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do I get attributes? We use the `@` symbol:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sel.xpath('(//a)[10]/@href').getall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What else can I do?**\n",
"\n",
"- [W3 Schools XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp)\n",
"- Check out this [old school XPath Tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html)"
]
},
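{
"cell_type": "markdown",
"metadata": {},
"source": [
"To recap the XPath moves from this section, here is a minimal sketch on a made-up HTML snippet. It assumes the `parsel` package is installed (the library that `scrapy`'s `Selector` is built on); the HTML, the `toy` variable, and the `/events/1` link are invented for illustration and do not come from the real page.\n",
"\n",
"```python\n",
"from parsel import Selector\n",
"\n",
"# Made-up HTML that mimics a single event card (not the real page)\n",
"html = '''\n",
"<div class=\"card-body\">\n",
"  <a href=\"/events/1\"><h6>Event title</h6></a>\n",
"  <p class=\"card-text\">Speaker name<br>Date</p>\n",
"</div>\n",
"'''\n",
"toy = Selector(text=html)\n",
"\n",
"# Two steps up from <h6> (its parent's parent) is the <div> itself\n",
"div_class = toy.xpath('//h6/../../@class').get()\n",
"\n",
"# Direct children of the first matching <div> (XPath counts from 1)\n",
"children = toy.xpath('(//div[contains(@class, \"card-body\")])[1]/*')\n",
"child_tags = [child.xpath('name()').get() for child in children]\n",
"\n",
"# Text nodes and attributes\n",
"title = toy.xpath('//h6/text()').get()\n",
"link = toy.xpath('//a/@href').get()\n",
"\n",
"print(div_class, child_tags, title, link)\n",
"```\n",
"\n",
"Because the snippet is tiny, you can change one XPath at a time and see exactly which node it returns."
]
},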
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4 More power to you: chaining CSS and XPath selectors\n",
"\n",
"Note the different syntax for CSS and XPath selectors. CSS selectors are more straightforward, but XPath selectors are more powerful.\n",
"\n",
"💡 **PRO-TIP:** What is really nice about `scrapy` is that you can chain CSS and XPath selectors together! This means you can use a CSS selector to get an element and an XPath selector to get its children.\n",
"\n",
"With XPath, we managed to split content of the card-text paragraph into two separate elements: the speaker and the date. \n",
"\n",
"```python\n",
"speakers_xpath = \"//p[@class='card-text']/text()[1]\"\n",
"speakers = selector.xpath(speakers_xpath).get()\n",
"\n",
"dates_xpath = \"//p[@class='card-text']/text()[2]\"\n",
"dates = selector.xpath(dates_xpath).get()\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Could I do this with a mix CSS and XPath selectors?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"speakers = sel.css('div.card-body > p').xpath('text()[1]').getall()\n",
"dates = sel.css('div.card-body > p').xpath('text()[2]').getall()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"speakers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the end, it is up to you to decide which one you prefer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Orienting the scraping process around containers\n",
"\n",
"\n",
"We have already talked about `pandas` but haven't done much with them at this stage in the course. This is because we're focused on **collecting** data. Once we have enough data and are ready to work with it, you will see how tables (data frames) provide an excellent and very convenient way to store data for analysis.\n",
"\n",
"**In the meantime, trust me! You want to convert whatever data you capture into a pandas data frame.**\n",
"\n",
"With what we have learned so far, it is easy to convert the event details we care about into a pandas data frame. We can summarise the whole process as:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Capture all the info individually\n",
"titles = sel.css('div.card-body > a > h6::text').getall()\n",
"speakers = sel.css('div.card-body > p').xpath('text()[1]').getall()\n",
"dates = sel.css('div.card-body > p').xpath('text()[2]').getall()\n",
"\n",
"# Put it all together into a DataFrame\n",
"events = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates})\n",
"\n",
"# Uncomment to browse the DataFrame\n",
"events"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the above works, it can lead to much repetition and 'workaround' code if the website you are scraping is not the most consistent. For example, some event divs may have a different class name, or some events list the speakers and dates in the opposite order."
]
},
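{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why separate lists are fragile, here is a minimal sketch with made-up values (the event and speaker names are invented, not taken from the real page). It assumes one event on the page has no speaker, so the list of speakers comes back one item short:\n",
"\n",
"```python\n",
"# Made-up results, as if getall() had been called once per field.\n",
"# 'Event B' has no speaker on the page, so the speakers list is shorter.\n",
"titles = ['Event A', 'Event B', 'Event C']\n",
"speakers = ['Speaker A', 'Speaker C']\n",
"\n",
"# Pairing by position now attaches 'Speaker C' to 'Event B',\n",
"# and zip() silently drops 'Event C' altogether\n",
"pairs = list(zip(titles, speakers))\n",
"print(pairs)\n",
"```\n",
"\n",
"Passing columns of unequal length to `pd.DataFrame` would fail with a `ValueError` instead, which is at least loud. The silent mismatch is the more dangerous case."
]
},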
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Treat each event box as a template\n",
"\n",
"Instead of going directly to the titles, speakers, and dates, we can capture the `
` that represents an event and then extract the details from its children."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Capture the divs but don't extract yet!\n",
"containers = sel.css('div.card-body')\n",
"\n",
"len(containers)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"first_box = containers[0]\n",
"# equivalent to the .get() method\n",
"# first_box = containers.get()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(first_box.get())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note one very important thing about the `containers[0]` object: it is a `Selector` object, not a string. Therefore, it contains other methods and attributes inside it.\n",
"\n",
"**Why is that useful?** \n",
"\n",
"We can treat this object as our HTML code, ignore the rest of the HTML, and use the same methods to get the information we need from it. We don't need to scrape the entire page again to get the necessary information.\n"
]
},
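{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between a `Selector` and a string can be checked directly. The sketch below uses made-up HTML with two event cards and assumes the `parsel` package is installed (the library that `scrapy`'s `Selector` is built on); `toy_containers` and `second_box` are illustrative names only.\n",
"\n",
"```python\n",
"from parsel import Selector\n",
"\n",
"# Made-up HTML with two event cards (not the real page)\n",
"html = '''\n",
"<div class=\"card-body\"><a><h6>First event</h6></a></div>\n",
"<div class=\"card-body\"><a><h6>Second event</h6></a></div>\n",
"'''\n",
"toy_containers = Selector(text=html).css('div.card-body')\n",
"\n",
"second_box = toy_containers[1]     # indexing keeps a Selector\n",
"first_html = toy_containers.get()  # .get() gives a plain string\n",
"\n",
"# A query on a Selector only looks inside that container\n",
"second_title = second_box.css('h6::text').get()\n",
"\n",
"print(type(second_box).__name__, type(first_html).__name__, second_title)\n",
"```\n",
"\n",
"Because `second_box` only sees its own HTML, the same short selector (`h6::text`) works for every container without mentioning the rest of the page."
]
},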
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Extract all details of a single event\n",
"\n",
"To illustrate the observation above, let's get the title, speaker and date of _this particular event_:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"event_title = first_box.css('h6::text').get()\n",
"event_speaker = first_box.xpath('p/text()[1]').get()\n",
"event_date = first_box.xpath('p/text()[2]').get() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"event_title"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"event_speaker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"event_date"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Putting it all together (NOT ELEGANT)\n",
"\n",
"\n",
"I know what you are thinking... _'but then I will have to do this for every event!'_\n",
"\n",
"True. If you were to proceed with the above and use what you learned from your previous Python training, you would have to write a loop that goes through each event and extracts the details.\n",
"\n",
"It's likely that you would be tempted to write a code that looks like this:\n",
"\n",
"```python\n",
"#### WE DON'T WANT YOU TO WRITE THIS TYPE OF CODE IN THIS COURSE! I'LL EXPLAIN IN SECTION 3.3 ####\n",
"\n",
"\n",
"# Create an empty list to store the details of each event\n",
"event_titles = []\n",
"event_speakers = []\n",
"event_dates = []\n",
"\n",
"# Loop through each event\n",
"for container in containers:\n",
" # Extract the details of the event\n",
" title = container.css('h6::text').get()\n",
" speaker = container.xpath('p/text()[1]').get()\n",
" date = container.xpath('p/text()[2]').get()\n",
"\n",
" # Append the details to the lists\n",
" event_titles.append(title)\n",
" event_speakers.append(speaker)\n",
" event_dates.append(date)\n",
"\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['container'].apply(get_speaker) #Goal with Pandas; which is why we don't want to use the for loop method above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Using custom functions (ELEGANT 🎩)\n",
"\n",
"What we want is for you to use best practices when writing code. Code that is efficient, easy to read, and easy to alter in the future. With practice, you will realise that `for` loops are not always the best way to go about things. If you find a bug in the code above, you would have to go through the entire loop to locate the source of the bug.\n",
"\n",
"**Functions** (with the `def` operator) are a great way to encapsulate a piece of code that does a single task. You can run the same function with different parameters to test out different scenarios. If you find a bug in the function, you only have to fix it once.\n",
"\n",
"How would a function look like for the above?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def scrape_event(event_container):\n",
" event_title = event_container.css('h6::text').get()\n",
" event_speaker = event_container.xpath('p/text()[1]').get()\n",
" event_date = event_container.xpath('p/text()[2]').get()\n",
" return {'title': event_title, 'speaker': event_speaker, 'date': event_date}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This way, you can test the function for individual containers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Change [0] to [1] or any other number to get the corresponding event\n",
"scrape_event(containers[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notice that we returned a dictionary instead of a list.** Key-value pairs are the most natural way to store a single record of data. \n",
"\n",
"If we add up all the dictionaries, we get a list of dictionaries, making it easier to convert to a pandas data frame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Scrape all events (note that we're using list comprehension)\n",
"events = [scrape_event(container) for container in containers]\n",
"\n",
"# Creating a dataframe is easier\n",
"df = pd.DataFrame(events)\n",
"\n",
"# Uncomment to browse the DataFrame\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOT IDEAL: it is better to use the list comprehension method above\n",
"\n",
"events = []\n",
"\n",
"for container in containers:\n",
" events.append(scrape_event(container))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"events"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Writing Great Documentation\n",
"\n",
"**Your future self (and the reviewers of your code) will thank you for it!**\n",
"\n",
"Writing maintainable code is not just about writing code that works. It's about writing code that is easy to understand and easy to alter in the future.\n",
"\n",
"\n",
"
\n",
"\n",
"💡 **TIP:** Always think to yourself: what comments or documentation can I add to my code so that if I return to it in a few months, I would still understand what I was trying to do?\n",
"\n",
"
\n",
"\n",
"One excellent way to document your code is to write a [**docstring**](https://realpython.com/documenting-python-code/). A docstring is a string that comes right after the `def` operator and describes what the function does. Writing a docstring for every function you write is an excellent practice."
]
},
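{
"cell_type": "markdown",
"metadata": {},
"source": [
"A docstring is more than a comment: Python stores it on the function, so it can be read back later. The sketch below uses a hypothetical helper, `clean_text`, that is not part of this notebook's scraper and exists only to show the mechanism.\n",
"\n",
"```python\n",
"def clean_text(value):\n",
"    \"\"\"Strip surrounding whitespace from a scraped string.\n",
"\n",
"    Args:\n",
"        value (str or None): raw text, e.g. the result of a .get() call\n",
"\n",
"    Returns:\n",
"        str or None: the stripped text, or None if nothing was scraped\n",
"    \"\"\"\n",
"    return value.strip() if value is not None else None\n",
"\n",
"\n",
"# The docstring travels with the function\n",
"summary_line = clean_text.__doc__.splitlines()[0]\n",
"print(summary_line)\n",
"\n",
"# help() shows the full docstring, which is what a colleague would see\n",
"help(clean_text)\n",
"```\n",
"\n",
"In a notebook, typing `clean_text?` in a cell shows the same text, which is why a clear first line pays off."
]
},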
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def scrape_event(event_container):\n",
" \"\"\"Scrape details from a single event container.\n",
" \n",
" An event container looks like this:\n",
"
\n",
"
\n",
" Event title
\n",
" \n",
"
Speaker name
\n",
"
Date
\n",
"
\n",
"\n",
" This function captures the title, speaker, and date from the container.\n",
" \n",
" Args:\n",
" event_container (Selector): a Selector object with the HTML of the event container\n",
"\n",
" Returns:\n",
" dict: a dictionary with the title, speaker, and date of the event\n",
" \"\"\"\n",
"\n",
" event_title = event_container.css('h6::text').get()\n",
" event_speaker = event_container.xpath('p/text()[1]').get()\n",
" event_date = event_container.xpath('p/text()[2]').get()\n",
" return {'title': event_title, 'speaker': event_speaker, 'date': event_date}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I could also encapsulate the whole process of collecting the containers into a function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def scrape_events(url):\n",
" \"\"\"Scrape all events from a given URL.\n",
" \n",
" Args:\n",
" url (str): the URL of the page with events\n",
"\n",
" Returns:\n",
" pd.DataFrame: a DataFrame with the title, speaker, and date of each event\n",
" \"\"\"\n",
"\n",
" # Load the page\n",
" response = requests.get(url)\n",
"\n",
" if not response.status_code == 200:\n",
" print('Something went wrong, status code:', response.status_code)\n",
" return\n",
"\n",
" sel = Selector(text = response.text)\n",
"\n",
" # Capture the divs but don't extract yet!\n",
" containers = sel.css('div.card-body')\n",
"\n",
" # Scrape all events\n",
" events = [scrape_event(container) for container in containers]\n",
"\n",
" # Creating a dataframe is easier\n",
" df = pd.DataFrame(events)\n",
" return df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In theory, you wouldn't even keep the code above inside a notebook like this. You would write it in a `.py` file and import the function when needed.\n",
"\n",
"Your final code would then look like this:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://socialdatascience.network/index.html#schedule'\n",
"df = scrape_events(url)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neat!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What to do next\n",
"\n",
"- Go through all the reference links in this notebook to recap CSS and XPath selectors"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}