<font style='font-size:1.5em'>**💻 Week 04 lab – Web Scraping** </font>

<font style='font-size:1.2em'>DS105A – Data for Data Science</font>

**AUTHORS:**  [Anton Boichenko](https://github.com/antonboychenko) & [Alex Soldatkin](https://github.com/alex-soldatkin) & Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Learn how to collect data from the Web using Python packages

**LAST REVISION:** 18 October 2023

::: callout-important

## This lab is part of the ![](/figures/logos/GENIAL_favicon.png){width=1em}  [<span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project](https://lse-dsi.github.io/genial). 

If you have never used ChatGPT, you must create an account. Go to [chat.openai.com](https://chat.openai.com/) and sign up with your email address (it doesn't have to be your LSE email address).

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

:::


--- 

# Part I: ⚙️ The setup (15 min)

You will need to install the `requests` and Scrapy packages to complete this lab. I will assume you have already configured the virtual environment for this course.

🎯 **ACTION POINTS:**

1. Open the terminal (directly from within VS Code will be easier) and run each of the following commands:


    ```bash
    pip install pandas
    pip install requests
    pip install scrapy
    ```

2. Now, create a new code chunk below and import the packages you just installed. You should have something like this:

    ```python
    import requests               # This is how we access the web
    import pandas as pd           # This is how we work with data frames

    from pprint import pprint     # Print things in a pretty way
    from scrapy import Selector   # This is how we parse HTML
    ```

In [1]:
import requests               # This is how we access the web
import pandas as pd           # This is how we work with data frames

from pprint import pprint     # Print things in a pretty way
from scrapy import Selector   # This is how we parse HTML

# Part II: Requesting a web page (30 min)

👨🏻‍🏫 **TEACHING MOMENT**

The entire Part II is a teaching moment. Your class teacher will help you collect this information from the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) web page.

Pay close attention and follow along on your computer.

---

You might have heard of [CIVICA](https://www.civica.eu/who-we-are/about-civica/) before. It is a body that unites several European universities to collaborate in the social sciences, humanities, business and public policy. CIVICA hosts a [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) that might be of interest to you. Today we will collect information on some of the seminars. Maybe you can use it in the future! 

**Our main task** is to create a 🐼 pandas data frame that contains:

1. names of the seminars 
2. names of speakers of those seminars
3. dates of the seminars
4. bios of the speakers from each individual event 


## 2.1. Request a website


In [2]:
# This is the address of the website we want to scrape
my_url = 'https://socialdatascience.network/index.html#schedule'

# We send a GET request to the website
response = requests.get(my_url)

# What is the response code?
response

<Response [200]>

**📜 Other possible responses**

The response code is a standard way of communicating the status of a request. There are many other possible responses:

- **200** OK
- **204** No Content
- **400** Bad Request
- **401** Unauthorized
- **402** Payment Required
- **403** Forbidden
- **404** Not Found
- **500** Internal Server Error
- **502** Bad Gateway

🗣️ **CLASSROOM DISCUSSION:** Have you ever encountered any of these responses when browsing the Web in your browser? Where? What did you do about it?


You can find a full list [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
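If you want to look up what a code means without leaving Python, the standard library's `http.HTTPStatus` knows the official phrase for each code. The sketch below is only an illustration: the `describe_status` helper and its "family" labels are made up for this example and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. 'Not Found' for 404

    # The first digit tells you which family the code belongs to
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"

    return f"{code} {phrase} ({family})"


# The codes listed above
for code in [200, 204, 400, 401, 402, 403, 404, 500, 502]:
    print(describe_status(code))
```

With a real `requests` response, you can also call `response.raise_for_status()`, which raises an exception when the code is in the 4xx or 5xx range.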

## 2.2. A closer look at the response

What else is stored in the `response` object?

In [3]:
# The vars function returns all attributes of an object, along with their values
# You will see that it is essentially just a dictionary
vars(response)

 '_content_consumed': True,
 '_next': None,
 'status_code': 200,
 'headers': {'Connection': 'keep-alive', 'Content-Length': '18391', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Mon, 09 Oct 2023 14:47:42 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"6524128e-1b910"', 'expires': 'Thu, 19 Oct 2023 11:52:54 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'D3EA:BD7C:805B9C:81CCC7:6531163E', 'Accept-Ranges': 'bytes', 'Date': 'Thu, 19 Oct 2023 13:09:11 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-lcy-eglc8600058-LCY', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1697720951.288726,VS0,VE114', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': 'abfc4a76ccad74e73b272586690688e3ab38717a'},
 'raw': <urllib3.response.HTTPResponse at 0x7fbdb9484640>,
 'url': 'https://socialdatascience.network/index.html#schedule',
 'encoding': 'utf-8',
 'history': [],
 'reason': 
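To see why `vars()` gives you "essentially just a dictionary", here is a minimal sketch with a made-up class. The `Seminar` class and its attributes are hypothetical and exist only for this illustration.

```python
class Seminar:
    """A toy class with two attributes, used only to demonstrate vars()."""

    def __init__(self, title: str, speaker: str):
        self.title = title
        self.speaker = speaker


seminar = Seminar("An example talk", "Dr. Example")

# vars() returns the object's attributes as a dictionary
attributes = vars(seminar)
print(attributes)

# Because it is a dictionary, you can look up values by attribute name
print(attributes["speaker"])
```

The `response` object works the same way, it just has many more attributes.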

🗣️ **CLASSROOM DISCUSSION:**

You have already looked at `response.status_code`. But what do you think the following attributes of the `response` object are?

- `response.headers`
- `response.cookies`
- `response.content`

Feel free to open a new chunk of code below and explore these attributes.

But encoding is not the only **metadata** we can get from the response. Let's take a look at all the headers:

In [4]:
# Headers are metadata about the response
pprint(response.headers)

{'Connection': 'keep-alive', 'Content-Length': '18391', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Mon, 09 Oct 2023 14:47:42 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"6524128e-1b910"', 'expires': 'Thu, 19 Oct 2023 11:52:54 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'D3EA:BD7C:805B9C:81CCC7:6531163E', 'Accept-Ranges': 'bytes', 'Date': 'Thu, 19 Oct 2023 13:09:11 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-lcy-eglc8600058-LCY', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1697720951.288726,VS0,VE114', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': 'abfc4a76ccad74e73b272586690688e3ab38717a'}


We could choose to manipulate the headers above as a `pd.Series`:

In [5]:
pd.Series(response.headers)

Connection                                                   keep-alive
Content-Length                                                    18391
Server                                                       GitHub.com
Content-Type                                   text/html; charset=utf-8
Last-Modified                             Mon, 09 Oct 2023 14:47:42 GMT
Access-Control-Allow-Origin                                           *
ETag                                                 W/"6524128e-1b910"
expires                                   Thu, 19 Oct 2023 11:52:54 GMT
Cache-Control                                               max-age=600
Content-Encoding                                                   gzip
x-proxy-cache                                                      MISS
X-GitHub-Request-Id                    D3EA:BD7C:805B9C:81CCC7:6531163E
Accept-Ranges                                                     bytes
Date                                      Thu, 19 Oct 2023 13:09

Let me show you what is in `response.text` by printing it.

In [6]:
pprint(response.text)

('<!DOCTYPE html>\r\n'
 '<html lang="en">\r\n'
 '\r\n'
 '<head>\r\n'
 '  <meta charset="utf-8">\r\n'
 '  <title>CIVICA Data Science Seminar</title>\r\n'
 '  <meta content="width=device-width, initial-scale=1.0" name="viewport">\r\n'
 '  <meta content="CIVICA Data Science Seminar" name="keywords">\r\n'
 '  <meta content="A series of data science workshops and seminars" '
 'name="description">\r\n'
 '\r\n'
 '  <!-- Favicons -->\r\n'
 '  <link href="img/c-favicon.png" rel="icon">\r\n'
 '  <link href="img/apple-touch-icon.png" rel="apple-touch-icon">\r\n'
 '\r\n'
 '  <!-- Google Fonts -->\r\n'
 '  <link '
 'href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,700,700i|Raleway:300,400,500,700,800" '
 'rel="stylesheet">\r\n'
 '\r\n'
 '  <!-- Bootstrap CSS File -->\r\n'
 '  <link href="lib/bootstrap/css/bootstrap.min.css" rel="stylesheet">\r\n'
 '\r\n'
 '  <!-- Libraries CSS Files -->\r\n'
 '  <link href="lib/font-awesome/css/font-awesome.min.css" '
 'rel="stylesheet">\r\

The code chunk above makes sense here because I want to show you how to inspect objects when in **prototype mode**. However, whenever you are writing in a Jupyter Notebook to report to someone (say, when submitting your assignment), you should remove code chunks that produce a lot of unnecessary output.


💡 **A DETAIL THAT SEEMS INSIGNIFICANT BUT THAT IS EXTREMELY IMPORTANT**: 
- Text files created on Mac or Linux normally use `\n` as the line break character. 
- Text files created on Windows normally use `\r\n`. 
- In other words, Windows uses two characters to break lines, while Mac and Linux use only one. (The `\r\n` you see in the output above comes from the file stored on the web server, not from your own computer.)
- This is a common source of errors when working with text files across two different operating systems. (For example: you use Mac and collaborate with someone who uses Windows.)
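A minimal sketch of what this difference looks like in practice, using two made-up strings rather than the real web page:

```python
# The same two lines of text, written with each line-ending convention
windows_text = "first line\r\nsecond line\r\n"
unix_text = "first line\nsecond line\n"

# The strings look identical when printed, but they are not equal
print(windows_text == unix_text)

# Each Windows line break takes one extra character
print(len(windows_text), len(unix_text))

# splitlines() understands both conventions, so the lines themselves match
print(windows_text.splitlines() == unix_text.splitlines())

# One common way to normalise: replace every '\r\n' with '\n'
normalised = windows_text.replace("\r\n", "\n")
print(normalised == unix_text)
```

This is why a naive comparison of two files, or a character count, can disagree between collaborators even when the visible text is the same.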

How many characters are there in the `response.text`?

In [7]:
len(response.text)

112894

It is not very useful to treat it as a pure string, right? We need to find a better way to parse this data.


## 2.3. Parsing HTML

Scrapy's `Selector` class extracts data from HTML and XML documents. It uses CSS or XPath selectors for data extraction, making it a powerful tool for web scraping. It is a core part of the Scrapy framework but can also be used on its own, as we do here.

When you feed HTML text to the Scrapy Selector, it processes the HTML and preserves it in a particular **object** <sup>1</sup>. This object allows you to access parts of the HTML using Python's common dot notation in combination with the CSS syntax. If, for instance, you want to fetch the title of the page, you might use `selector.css('title')`.

<sup>1</sup>: Re-watch 🗓️ Week 04 lecture if you need to revise what an object is

In [8]:
# parse the HTML code using Scrapy Selector
sel = Selector(text=response.text)

💡 Note: I was only able to call `Selector()` directly because I had already imported it at the top of the notebook. Scroll up to see it. If I hadn't, the code above would have thrown an error.

**Check `sel.get()` to see the full HTML document**

This has the same effect as `response.text`.

In [9]:
sel.get()



**HTML documents usually have a \<header\> tag:**

(⚠️ not to be confused with the HTTP header we saw with `response.headers`)

In [10]:
sel.css('header')

[<Selector xpath='descendant-or-self::header' data='<header id="header">\r\n    <div class=...'>]

There is also usually a `<body>` tag, which contains the main content of the page:

In [None]:
sel.css('body')

🔑 **Takeaway of the output above:**

- The output is a list, as indicated by the square brackets. 
- HTML pages only have one `<body>` tag, so this list contains a single element, which is an object of the class Selector.

What if I want to look at the content of the `<body>` tag?

In [11]:
pprint(sel.css('body').get())

('<body>\r\n'
 '\r\n'
 '    Header\r\n'
 '  <header id="header">\r\n'
 '    <div class="container">\r\n'
 '\r\n'
 '      <div id="logo" class="pull-left">\r\n'
 '        <!-- Uncomment below if you prefer to use a text logo -->\r\n'
 '        <!-- <h1><a href="#main">C<span>o</span>nf</a></h1>-->\r\n'
 '        <a href="#intro" class="scrollto"><img src="img/logo.png" alt="" '
 'title=""></a>\r\n'
 '      </div>\r\n'
 '\r\n'
 '      <nav id="nav-menu-container">\r\n'
 '        <ul class="nav-menu">\r\n'
 '          <li class="menu-active"><a href="#intro">Home</a></li>\r\n'
 '          <li><a href="#about">About</a></li>\r\n'
 '<!--           <li><a href="#speakers">Speakers</a></li>\r\n'
 ' -->          <li><a href="#schedule">Schedule</a></li>\r\n'
 '          <li><a href="#supporters">Partner Institutions</a></li>\r\n'
 '          <li><a href="summerschool.html">Summer School</a></li>\r\n'
 '          <li><a href="#gallery">Gallery</a></li>\r\n'
 '          <li><a href="#contact">Co

**Are there any `<h1>` tags in this page?**

In [12]:
sel.css('h1').get()

'<h1 class="mb-4 pb-0">CIVICA<br><span>Data Science</span> Seminar Series</h1>'

What about `<h2>` tags?

In [13]:
sel.css('h2').getall()

['<h2>Seminar Schedule</h2>',
 '<h2>Partner Institutions</h2>',
 '<h2>Gallery</h2>',
 '<h2>F.A.Q </h2>',
 '<h2>Newsletter</h2>',
 '<h2>Contact Us</h2>']

If you care just about the **first** `<h2>` tag, you can use the `.get()` method instead of `.getall()`:

In [14]:
sel.css("h2").get()

'<h2>Seminar Schedule</h2>'

**How to get the text from a tag:**

In [15]:
sel.css("h2 ::text").get()

'Seminar Schedule'

**How to get the text of tags returned by the `.css()` method?**

You can also use `::text` on each tag element within the CSS selector returned by the `css()` method.


In [17]:
# Pure Python way: loop over each <h2> selector and extract its text
all_h2_tags = sel.css("h2")
all_h2_texts = []

for tag in all_h2_tags:
    all_h2_texts.append(tag.css("::text").get())

all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

**Consider using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) for cleaner code:**

In [18]:
# one-liner way
all_h2_texts = [tag.get() for tag in sel.css("h2 ::text")]
all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

💡 **IMPORTANT TIPS:**

- Over the next couple of weeks, make it a habit to right-click on a webpage every now and then and select "Inspect" (or "Inspect Element") to explore how the HTML is structured. This will help you understand how to use CSS selectors to extract the data you need.
- Tag names and ` ::text` are just the tip of the iceberg. Read about other CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp).
- Bookmark the [Scrapy Selectors documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html) and revisit it whenever you need to. Practice using different CSS selectors to extract data from the HTML.

# Part III: Your turn! (45 min)


<details style="border: 1px solid #D55816; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I am part of the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project</summary>

If you are participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you are asked to:

- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.

- Have **only** the following tabs open in your browser:

    1. These lab instructions

    2. The [ChatGPT](https://chat.openai.com) website (**open a new chat window and name it 'DS105A - Week 04'**)

    3. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)
    
    4. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.

- **Fill out this brief survey at the end of the lab:** 🔗 [link](https://forms.office.com/e/h0dXriciyy) (requires LSE login)

</details>

<br>

<details style="border: 1px solid gray; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I'm not participating in the GENIAL project :\</summary>

In case you are not participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

1. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)

2. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

</details>

🎯 **ACTION POINTS**

1. Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website and inspect the page (mouse right-click + Inspect) and find the way to the name of the first event on the page. 

3. Write down the "directions" inside the HTML file to reach the event title. For example, maybe you will find that:

    > _The first event title is inside a \<html\> ➡️ \<div\> ➡️ \<div\> ➡️ \<h3\> tag_.

    Write it in the markdown cell below:

_Delete this line and write your answer here_

4. Now, use the skill that you have just learned to scrape the names of ALL events. Save them all to a list.

In [41]:
# Delete this line and replace it with your code
cards = sel.css("div.card")
len(cards)
event_names = cards.css("a h6 ::text").getall()
event_names

['Data science for the Sustainable Development Goals: the case of food security',
 'CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications',
 'The Evolution of the Climate Discourse on Twitter: Polarization, Hypocrisy, and the Musk Takeover',
 'The Handbook of Computational Social Science for Policy',
 'Artificial Intelligence, Algorithmic Recommendations and Competition',
 'Exploring A New Model of Industry/Academic Collaboration: the U.S. 2020 Facebook and Instagram Election Study',
 'Using Multimodal Neural Networks to Better Understand How Voters Process Audiovisual Information',
 "Models, mathematics, and data science: how to make sure we're answering the right questions",
 'CIVICA Conference on European Polarisation',
 'New Faces of Bias in Online Platforms',
 'Introducing the Online Harms Observatory: AI powered mapping of online abuse in real-time',
 'Using Open Source Data Streams and Surveys to Improve Our Understanding of Elections',
 'Does Epi

5. Do the same with the dates of the events and speaker names and save them to separate lists. 



In [61]:
dates_speakers = cards.css("div.card-body p ::text")

speakers = dates_speakers[::2]
print(speakers.getall())

dates = dates_speakers[1::2]
print(dates.getall())


['Speaker: Prof. Elisa Omodei, CEU ', 'Speaker: Moritz Pfeifer & Vincent Philipp Marohl ', 'Speaker: Dr. Max Falkenberg ', 'Speaker: Dr. Eleonora Bertoni ', 'Speaker: Prof. Giacomo Calzolari ', 'Speaker: Prof. Pablo Barberá ', 'Speaker: Prof. Bryce Jensen Dietrich ', 'Speaker: Dr. Erica Thompson ', 'Full day event ', 'Speaker: Prof. Aniko Hannak ', 'Speaker: Pica Johnsson ', 'Speaker: Prof. Lisa Singh ', 'Speaker: Dr. Marco Meyer ', 'Speaker: Prof. Anne Beaulieu ', 'Speaker: Dr. Michelle Reddy & Dr. Hélène Thiollet ', 'Speaker: Prof. Stephanie Lackner ', 'Speaker: Dr. Omar A. Guerrero ', 'Speaker: Prof. David Chavalarias ', 'Speaker: Prof. Laszlo Barabasi ', 'Speaker: Prof. Arthur Spirling ', 'Speaker: Dr. Alexandra Scacco ', 'Speaker: Prof. Margaret Roberts ', 'Speaker: Prof. Lauren Klein ', 'Speaker: Prof. Cesar A. Hidalgo ', 'Speaker: Prof. Christopher Lucas ', 'Speaker: Prof. Camille Roth ', 'Speaker: Prof. Michelle Torrest ', 'Speaker: Prof. Suzy Moat ', 'Speaker: Prof. Macarta

6. Convert the lists to a pandas data frame and save it to a CSV file.

In [63]:
df = pd.DataFrame({'event': event_names, 'date': dates.getall(), 'speaker': speakers.getall()})
df.to_csv('events.csv', index=False)
df

|   | event | date | speaker |
|---|-------|------|---------|
| 0 | Data science for the Sustainable Development G... | Date: Wednesday, 18 October 2023 | Speaker: Prof. Elisa Omodei, CEU |
| 1 | CentralBankRoBERTa: A Fine-Tuned Large Languag... | Date: Wednesday, 27 September 2023 | Speaker: Moritz Pfeifer & Vincent Philipp Marohl |
| 2 | The Evolution of the Climate Discourse on Twit... | Date: Wednesday, 13 September 2023 | Speaker: Dr. Max Falkenberg |
| 3 | The Handbook of Computational Social Science f... | Date: Wednesday, 31 May 2023 | Speaker: Dr. Eleonora Bertoni |
| 4 | Artificial Intelligence, Algorithmic Recommend... | Date: Wednesday, 03 May 2023 | Speaker: Prof. Giacomo Calzolari |
| 5 | Exploring A New Model of Industry/Academic Col... | Date: Wednesday, 19 April 2023 | Speaker: Prof. Pablo Barberá |
| 6 | Using Multimodal Neural Networks to Better Und... | Date: Wednesday, 22 March 2023 | Speaker: Prof. Bryce Jensen Dietrich |
| 7 | Models, mathematics, and data science: how to ... | Date: Wednesday, 08 March 2023 | Speaker: Dr. Erica Thompson |
| 8 | CIVICA Conference on European Polarisation | Date: Wednesday, 15 February 2023 | Full day event |
| 9 | New Faces of Bias in Online Platforms | Date: Wednesday, 08 February 2023 | Speaker: Prof. Aniko Hannak |


7. Double-check that the CSV file was created correctly by opening it using pandas. Then convert the columns to appropriate data types (use what you've learned in Week 04 lecture).

In [72]:
# Read the CSV file 'events.csv' into a pandas dataframe
df2 = pd.read_csv('events.csv')

# Print the data types of each column in the dataframe: 'event', 'date', and 'speaker'
# Note that the 'date' column is of type 'object' (i.e. string) and not datetime
print(df2.dtypes)

# Display the dataframe
display(df2)

# Clean the 'date' column: remove the 'Date: ' prefix, then drop the weekday name
# so that only the day, month and year remain (e.g. '18 October 2023')
df2['date'] = df2['date'].apply(lambda x: x.replace('Date: ', ''))
df2['date'] = df2['date'].apply(lambda x: ' '.join(x.split(', ')[1:]))
df2['date'] = pd.to_datetime(df2['date'])

# Clean the 'speaker' column by removing the 'Speaker: ' prefix
# The lambda function takes a string as input, replaces the substring 'Speaker: ' with an empty string, and returns the modified string
df2['speaker'] = df2['speaker'].astype(str)
df2['speaker'] = df2['speaker'].apply(lambda x: x.replace('Speaker: ', ''))

# Convert the 'event' column to string data type
df2['event'] = df2['event'].astype(str) 

# Print the data types of each column in the dataframe after cleaning and see what's changed
display(df2.dtypes)

# Display the cleaned dataframe
display(df2)

event              object
date       datetime64[ns]
speaker            object
dtype: object

|   | event | date | speaker |
|---|-------|------|---------|
| 0 | Data science for the Sustainable Development G... | 2023-10-18 | Prof. Elisa Omodei, CEU |
| 1 | CentralBankRoBERTa: A Fine-Tuned Large Languag... | 2023-09-27 | Moritz Pfeifer & Vincent Philipp Marohl |
| 2 | The Evolution of the Climate Discourse on Twit... | 2023-09-13 | Dr. Max Falkenberg |
| 3 | The Handbook of Computational Social Science f... | 2023-05-31 | Dr. Eleonora Bertoni |
| 4 | Artificial Intelligence, Algorithmic Recommend... | 2023-05-03 | Prof. Giacomo Calzolari |
| 5 | Exploring A New Model of Industry/Academic Col... | 2023-04-19 | Prof. Pablo Barberá |
| 6 | Using Multimodal Neural Networks to Better Und... | 2023-03-22 | Prof. Bryce Jensen Dietrich |
| 7 | Models, mathematics, and data science: how to ... | 2023-03-08 | Dr. Erica Thompson |
| 8 | CIVICA Conference on European Polarisation | 2023-02-15 | Full day event |
| 9 | New Faces of Bias in Online Platforms | 2023-02-08 | Prof. Aniko Hannak |


In pandas, `apply()` calls a function on the data you give it. On a single column (a Series), as here, the function is called once for each element; on a whole DataFrame, it is applied along an axis (column by column or row by row). A `lambda` is a small anonymous function that can take any number of arguments but can only have one expression. 

In the code above, `apply()` is used twice on the 'date' column of the DataFrame `df2`. The first `lambda` replaces the string 'Date: ' with an empty string. The second splits what is left at ', ', discards the first piece (the weekday), and joins the remaining pieces with a space. The cleaned strings are then converted to a datetime data type with `pd.to_datetime()`.
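For simple string clean-up, pandas also offers vectorised `.str` methods that do the same job without a `lambda`. The sketch below compares the two approaches on two made-up rows that mimic the scraped format. The explicit `format=` string is an assumption about how the dates are written and relies on English month and weekday names.

```python
import pandas as pd

# Two example values in the same shape as the scraped 'date' column
raw_dates = pd.Series([
    "Date: Wednesday, 18 October 2023",
    "Date: Wednesday, 27 September 2023",
])

# Approach 1: apply() with a lambda, called once per element
with_apply = raw_dates.apply(lambda x: x.replace("Date: ", ""))

# Approach 2: the vectorised .str accessor (regex=False means plain text matching)
with_str = raw_dates.str.replace("Date: ", "", regex=False)

# Both approaches give exactly the same result
same = with_apply.equals(with_str)
print(same)

# Telling pandas the exact format lets it parse the weekday too,
# so there is no need to split the string and drop it by hand
parsed = pd.to_datetime(with_str, format="%A, %d %B %Y")
print(parsed)
```

Both styles are fine for a data set of this size; `.str` methods tend to be shorter to read, while `apply()` is more flexible when the cleaning logic does not fit a single string method.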

# Putting it all together in a function

Now that we have refined our approach, we can write a function to scrape the website and return a dataframe with the event information. We will then be able to call it on multiple pages instead of having to copy and rewrite the code.

In [1]:
import pandas as pd
import requests
from scrapy import Selector

def parse_page(response: requests.Response) -> pd.DataFrame:
    """
    Parses the HTML response from a webpage and extracts the event names, dates, and speakers.
    
    Args:
      response: requests.Response object containing the HTML response from a webpage
    
    Returns:
      pandas DataFrame with columns 'event', 'date', and 'speaker'
    """
    sel = Selector(text=response.text)
    cards = sel.css("div.card")
    event_names = cards.css("a h6 ::text")
    dates_speakers = cards.css("div.card-body p ::text")
    speakers = dates_speakers[::2]
    dates = dates_speakers[1::2]
    df = pd.DataFrame({'event': event_names.getall(), 'date': dates.getall(), 'speaker': speakers.getall()})
    return df


def scrape_page(url: str) -> pd.DataFrame:
    """
    Scrapes a single webpage and returns a pandas DataFrame with the event information.
    
    Args:
      url: string containing the URL of the webpage to scrape
    
    Returns:
      pandas DataFrame with columns 'event', 'date', and 'speaker'
    """
    response = requests.get(url)
    return parse_page(response)


# try this out on our page
scrape_page('https://socialdatascience.network/index.html#schedule') 

# save to csv
df = scrape_page('https://socialdatascience.network/index.html#schedule')
df.to_csv('events.csv', index=False)

# this is equivalent to:
scrape_page('https://socialdatascience.network/index.html#schedule').to_csv('events.csv', index=False)
# because the function returns a dataframe, so we can call the .to_csv() method on it

# Dealing with pagination

Pagination is a technique used in web development where content is divided across multiple pages. It's commonly used in scenarios where displaying all the content at once would be overwhelming or resource-intensive, such as search results, forum posts, or product listings.

When scraping data from a website, dealing with pagination can be a bit tricky because the structure and method of pagination can vary from site to site. Here are some common methods:

1. **URL-based pagination**: The page number is included in the URL. For example, `http://example.com/posts?page=2`. In this case, you can simply change the page number in the URL to navigate to different pages.

2. **Form-based pagination**: The page number is sent as a POST request. This is a bit more complex to handle, but can usually be done by inspecting the network traffic when you click on a page number, and replicating that request in your code.

3. **JavaScript-based pagination**: The page number is fetched using JavaScript. This is the most complex to handle, and may require tools like Selenium or Puppeteer that can execute JavaScript.

In all cases, you'll need to write your scraping code in a loop that goes through each page, extracts the data, and then moves on to the next page. You'll also need to include some logic to detect when you've reached the last page, so that the loop can stop.

Here's a simple example of how you might handle URL-based pagination in Python using `requests`:



In [None]:
for i in range(1, 11):  # assuming the site has 10 pages
    url = f"http://example.com/posts?page={i}"
    response = requests.get(url)
    # your scraping code here

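# The same loop written as a list comprehension, using the function we created above.
# The URL pattern is illustrative only. Note that the query string (?page=...) must come
# before the fragment (#schedule); anything after the '#' is never sent to the server.
df_list = [scrape_page(f'https://socialdatascience.network/index.html?page={i}#schedule') for i in range(1, 11)]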
# concatenate all the dataframes in the list into one dataframe
df = pd.concat(df_list)
df.to_csv('events.csv', index=False)

In this example, the loop goes from 1 to 10, and for each iteration, it constructs a URL for the corresponding page, sends a GET request to that URL, and parses the response. You would replace the comment with the actual code to extract the data from each page.
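The loop above assumes you already know there are 10 pages. Below is a minimal sketch of the "stop when you reach the last page" logic mentioned earlier. To keep it runnable without a network connection, it uses a made-up in-memory "site" (`fake_site`) and a stand-in fetch function (`fake_fetch`); both are hypothetical, as is the assumption that an empty page signals the end of the data.

```python
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]], max_pages: int = 100) -> List[str]:
    """
    Calls fetch_page(1), fetch_page(2), ... and stops at the first empty page.

    Args:
      fetch_page: function that takes a page number and returns a list of items
      max_pages: safety limit so the loop can never run forever

    Returns:
      list with the items from all pages, in order
    """
    all_items: List[str] = []
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:  # an empty page is our (assumed) end-of-data signal
            break
        all_items.extend(items)
    return all_items


# A stand-in for a real website: three pages of results, then nothing
fake_site = {1: ["event A", "event B"], 2: ["event C"], 3: ["event D", "event E"]}


def fake_fetch(page_number: int) -> List[str]:
    return fake_site.get(page_number, [])


events = collect_all_pages(fake_fetch)
print(events)
```

On a real site, the fetch function would send the request and parse the page, and the stopping condition might instead be a 404 response or a missing "next" link, depending on how the site is built.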

# A few notes on style

In the functions above, you saw good documentation and type hints. They are important for several reasons:

1. **Readability and Maintainability**: They make your code easier to understand and maintain. By providing a description of what a function does, what the arguments are, and what it returns, you make it easier for others (and your future self) to understand your code.

2. **Documentation**: Docstrings are used to automatically create documentation for your code. Tools like Sphinx can generate HTML documentation from your docstrings.

3. **Type Checking**: Type hints allow you to specify the expected type of function arguments and return values. This can help catch certain types of bugs before runtime and can also make your code easier to understand.

4. **IDE Support**: Many Integrated Development Environments (IDEs) and code editors use docstrings and type hints to provide useful features like autocompletion, function signature popups, and automatic refactoring.

Here's an example of a function with docstrings and type hints:



In [None]:
def add_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.

    Args:
        a (int): The first number.
        b (int): The second number.

    Returns:
        int: The sum of a and b.
    """
    return a + b



In this example, the docstring explains what the function does, what each argument is, and what the function returns. The type hints indicate that the function expects two integers as input and returns an integer.
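One point worth keeping in mind: Python stores type hints and docstrings on the function, but it does not enforce the hints when the code runs. The sketch below repeats a shortened version of `add_numbers` so that it runs on its own.

```python
def add_numbers(a: int, b: int) -> int:
    """Adds two numbers together."""
    return a + b


# The docstring and type hints are stored on the function itself,
# which is how editors and documentation tools find them
print(add_numbers.__doc__)
print(add_numbers.__annotations__)

# Type hints are NOT checked at runtime: passing strings raises no error,
# and the + operator simply concatenates them
result = add_numbers("web", "scraping")
print(result)
```

Catching this kind of mistake before the code runs requires a separate static type checker or an editor with type checking enabled.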