<font style='font-size:1.5em'>**💻 Week 04 lab – Web Scraping (with solutions)** </font>

<font style='font-size:1.2em'>DS105A – Data for Data Science</font>

**AUTHORS:**  [Anton Boichenko](https://github.com/antonboychenko) & [Alex Soldatkin](https://github.com/alex-soldatkin) & Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Learn how to collect data from the Web using Python packages

**LAST REVISION:** 18 October 2023

::: callout-important

## This lab is part of the ![](/figures/logos/GENIAL_favicon.png){width=1em}  [<span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project](https://lse-dsi.github.io/genial). 

If you have never used ChatGPT, you must create an account. Click on [chat.openai.com](https://chat.openai.com/) and sign up with your email address (it doesn't have to be your LSE email address).

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

:::


--- 

# Part I: ⚙️ The setup (15 min)

You will need to install the pandas, requests and Scrapy packages to complete this lab. I will assume you have already configured the virtual environment for this course; install the packages into it as follows.

🎯 **ACTION POINTS:**

1. Open the terminal (directly from within VS Code will be easier) and run each of the following commands:


    ```bash
    pip install pandas
    pip install requests
    pip install scrapy
    ```

2. Now, create a new code chunk below and import the packages you just installed. You should have something like this:

    ```python
    import requests               # This is how we access the web
    import pandas as pd           # This is how we work with data frames

    from pprint import pprint     # Print things in a pretty way
    from scrapy import Selector   # This is how we parse HTML
    ```

In [2]:
import requests               # This is how we access the web
import pandas as pd           # This is how we work with data frames

from pprint import pprint     # Print things in a pretty way
from scrapy import Selector   # This is how we parse HTML

# Part II: Requesting a web page (30 min)

üë®üèª‚Äçüè´ **TEACHING MOMENT**

The entire Part II is a teaching moment. Your class teacher will help you collect this information from the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) web page.

Pay close attention and follow along on your computer.

---

You might have heard of [CIVICA](https://www.civica.eu/who-we-are/about-civica/) before. It is a body that unites several European universities to collaborate in the areas of social sciences, humanities, business and public policy. CIVICA hosts a [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) that might be of interest to you. Today we will collect information on some of the seminars. Maybe you can use it in the future! 

**Our main task** is to create a 🐼 pandas data frame that would contain:

1. names of the seminars 
2. names of speakers of those seminars
3. dates of the seminars
4. bios of the speakers from each individual event 


## 2.1. Request a website


In [5]:
# This is the address of the website we want to scrape
my_url = 'https://socialdatascience.network/index.html#schedule'

# We set a GET request to the website
response = requests.get(my_url)

# What is the response code?
response

<Response [200]>

**📜 Other possible responses**

The response code is a standard way of communicating the status of a request. There are many other possible responses:

- **200** OK
- **204** No Content
- **400** Bad Request
- **401** Unauthorized
- **402** Payment Required
- **403** Forbidden
- **404** Not Found
- **500** Internal Server Error
- **502** Bad Gateway

🗣️ **CLASSROOM DISCUSSION:** Have you ever encountered any of these responses when browsing the Web on your browser? Where? What did you do about it?


You can find a full list [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
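If you want to look up what a status code means without leaving Python, the standard library's `http.HTTPStatus` enum knows the official names. The sketch below is only an illustration: the `describe` helper is a hypothetical name I made up, not something that `requests` provides.

```python
from http import HTTPStatus


def describe(code):
    """Return a short, human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    kind = "success" if 200 <= code < 300 else "problem"
    return f"{code} {status.phrase} ({kind})"


# The same codes listed above
for code in [200, 204, 400, 401, 402, 403, 404, 500, 502]:
    print(describe(code))
```

With a `requests` response you do not need a helper like this: `response.ok` is `True` when the status code is below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.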

## 2.2. A closer look at the response

What else is stored in the `response` object?

In [6]:
# The vars function returns all attributes of an object, along with their values
# You will see that it is essentially just a dictionary
vars(response)

 '_content_consumed': True,
 '_next': None,
 'status_code': 200,
 'headers': {'Connection': 'keep-alive', 'Content-Length': '18391', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Mon, 09 Oct 2023 14:47:42 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"6524128e-1b910"', 'expires': 'Fri, 20 Oct 2023 08:28:45 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '5554:8686:11B0ABB:11E9FA6:653237E5', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 20 Oct 2023 16:02:02 GMT', 'Via': '1.1 varnish', 'Age': '356', 'X-Served-By': 'cache-lcy-eglc8600036-LCY', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1697817722.189471,VS0,VE1', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '5007cef8a172045113bbfe177050e20bc8314d02'},
 'raw': <urllib3.response.HTTPResponse at 0x14c26a4d0>,
 'url': 'https://socialdatascience.network/index.html#schedule',
 'encoding': 'utf-8',
 'history': [],
 'reason': '
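`vars()` is not specific to `requests`; it works on most Python objects. As a minimal sketch, here is a made-up `Seminar` class (hypothetical, used only for illustration) showing that an object's attributes really are stored as a dictionary:

```python
class Seminar:
    """A made-up class, used only to illustrate vars()."""

    def __init__(self, title, speaker):
        self.title = title
        self.speaker = speaker


seminar = Seminar("Web scraping 101", "A. Lecturer")

# vars() returns the attributes as a dictionary of name -> value
attributes = vars(seminar)
print(attributes)

# Dot notation and the dictionary give access to the same value
print(seminar.title == attributes["title"])
```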

🗣️ **CLASSROOM DISCUSSION:**

You have already looked at `response.status_code`. But what do you think the following attributes of the `response` object are?

- `response.headers`
- `response.cookies`
- `response.content`

Feel free to open a new chunk of code below and explore these attributes.

But encoding is not the only **metadata** we can get from the response. Let's take a look at all the headers:

In [7]:
# Headers are metadata about the response
pprint(response.headers)

{'Connection': 'keep-alive', 'Content-Length': '18391', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Mon, 09 Oct 2023 14:47:42 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"6524128e-1b910"', 'expires': 'Fri, 20 Oct 2023 08:28:45 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '5554:8686:11B0ABB:11E9FA6:653237E5', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 20 Oct 2023 16:02:02 GMT', 'Via': '1.1 varnish', 'Age': '356', 'X-Served-By': 'cache-lcy-eglc8600036-LCY', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1697817722.189471,VS0,VE1', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '5007cef8a172045113bbfe177050e20bc8314d02'}


We could choose to manipulate the headers above as a `pd.Series`:

In [9]:
pd.Series(response.headers)

Connection                                                   keep-alive
Content-Length                                                    18391
Server                                                       GitHub.com
Content-Type                                   text/html; charset=utf-8
Last-Modified                             Mon, 09 Oct 2023 14:47:42 GMT
Access-Control-Allow-Origin                                           *
ETag                                                 W/"6524128e-1b910"
expires                                   Fri, 20 Oct 2023 08:28:45 GMT
Cache-Control                                               max-age=600
Content-Encoding                                                   gzip
x-proxy-cache                                                      MISS
X-GitHub-Request-Id                  5554:8686:11B0ABB:11E9FA6:653237E5
Accept-Ranges                                                     bytes
Date                                      Fri, 20 Oct 2023 16:02

Let me show you what is in the `response` object by printing its text.

In [10]:
pprint(response.text)

('<!DOCTYPE html>\r\n'
 '<html lang="en">\r\n'
 '\r\n'
 '<head>\r\n'
 '  <meta charset="utf-8">\r\n'
 '  <title>CIVICA Data Science Seminar</title>\r\n'
 '  <meta content="width=device-width, initial-scale=1.0" name="viewport">\r\n'
 '  <meta content="CIVICA Data Science Seminar" name="keywords">\r\n'
 '  <meta content="A series of data science workshops and seminars" '
 'name="description">\r\n'
 '\r\n'
 '  <!-- Favicons -->\r\n'
 '  <link href="img/c-favicon.png" rel="icon">\r\n'
 '  <link href="img/apple-touch-icon.png" rel="apple-touch-icon">\r\n'
 '\r\n'
 '  <!-- Google Fonts -->\r\n'
 '  <link '
 'href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,700,700i|Raleway:300,400,500,700,800" '
 'rel="stylesheet">\r\n'
 '\r\n'
 '  <!-- Bootstrap CSS File -->\r\n'
 '  <link href="lib/bootstrap/css/bootstrap.min.css" rel="stylesheet">\r\n'
 '\r\n'
 '  <!-- Libraries CSS Files -->\r\n'
 '  <link href="lib/font-awesome/css/font-awesome.min.css" '
 'rel="stylesheet">\r\

The code chunk above makes sense here because I want to show you how to inspect objects when in **prototype mode**. However, whenever you are writing in a Jupyter Notebook to report to someone (say, when submitting your assignment), you should remove code chunks that produce a lot of unnecessary output.


💡 **A DETAIL THAT SEEMS INSIGNIFICANT BUT THAT IS EXTREMELY IMPORTANT**: 
- Text files created on Mac or Linux normally use `\n` as the line break character. 
- Text files created on Windows normally use `\r\n`. 
- That is, Windows uses two characters to break lines, while Mac and Linux use only one. The `\r\n` in the output above comes from the web page itself, so you will see it whichever OS you use.
- This is a common source of errors when working with text files across two different operating systems. (For example: you use Mac and collaborate with someone who uses Windows.)
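To make the difference concrete, here is a small sketch with two made-up strings that contain the same three lines but use different line endings:

```python
windows_text = "<html>\r\n<head>\r\n</head>\r\n"  # CRLF line endings
unix_text = "<html>\n<head>\n</head>\n"           # LF line endings

# The two strings look the same on screen but they are not equal
print(windows_text == unix_text)

# One extra character per line
print(len(windows_text) - len(unix_text))

# splitlines() understands both conventions
print(windows_text.splitlines() == unix_text.splitlines())

# Normalise to a single convention before comparing or saving
normalised = windows_text.replace("\r\n", "\n")
print(normalised == unix_text)
```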

How many characters are there in the `response.text`?

In [11]:
len(response.text)

112894

Not very useful to treat it as a pure string, right? We need to find a better way to parse this data.


## 2.3. Parsing HTML

Scrapy's `Selector` class extracts data from HTML and XML documents using CSS or XPath selectors, which makes it a powerful tool for web scraping. It is a core part of the Scrapy framework but can also be used on its own, as we do here.

When you feed HTML text to the Scrapy Selector, it processes the HTML and preserves it in a particular **object** <sup>1</sup>. This object allows you to access parts of the HTML using Python's common dot notation in combination with the CSS syntax. If, for instance, you want to fetch the title of the page, you might use `selector.css('title')`.

<sup>1</sup>: Re-watch the 🗓️ Week 04 lecture if you need to revise what an object is.

In [12]:
# parse the HTML code using Scrapy Selector
sel = Selector(text=response.text)

üí° Note: I was only able to call `Selector()` directly because I had already imported it at the top of the notebook. Scroll up to see it. If I hadn't, the code above would have thrown an error.

**Check `sel.get()` to see the full HTML document**

This has the same effect as `response.text`.

In [13]:
sel.get()



**HTML documents usually have a \<header\> tag:**

(⚠️ not to be confused with the HTTP header we saw with `response.headers`)

In [14]:
sel.css('header')

[<Selector query='descendant-or-self::header' data='<header id="header">\r\n    <div class=...'>]

There is also usually a `<body>` tag, which contains the main content of the page:

In [15]:
sel.css('body')



🔑 **Takeaway of the output above:**

- The output is a list, as indicated by the square brackets. 
- HTML pages only have one `<body>` tag, so this list contains a single element, which is an object of the class Selector.

What if I want to look at the content of the `<body>` tag?

In [16]:
pprint(sel.css('body').get())

('<body>\r\n'
 '\r\n'
 '    Header\r\n'
 '  <header id="header">\r\n'
 '    <div class="container">\r\n'
 '\r\n'
 '      <div id="logo" class="pull-left">\r\n'
 '        <!-- Uncomment below if you prefer to use a text logo -->\r\n'
 '        <!-- <h1><a href="#main">C<span>o</span>nf</a></h1>-->\r\n'
 '        <a href="#intro" class="scrollto"><img src="img/logo.png" alt="" '
 'title=""></a>\r\n'
 '      </div>\r\n'
 '\r\n'
 '      <nav id="nav-menu-container">\r\n'
 '        <ul class="nav-menu">\r\n'
 '          <li class="menu-active"><a href="#intro">Home</a></li>\r\n'
 '          <li><a href="#about">About</a></li>\r\n'
 '<!--           <li><a href="#speakers">Speakers</a></li>\r\n'
 ' -->          <li><a href="#schedule">Schedule</a></li>\r\n'
 '          <li><a href="#supporters">Partner Institutions</a></li>\r\n'
 '          <li><a href="summerschool.html">Summer School</a></li>\r\n'
 '          <li><a href="#gallery">Gallery</a></li>\r\n'
 '          <li><a href="#contact">Co

**Are there any `<h1>` tags in this page?**

In [17]:
sel.css('h1').get()

'<h1 class="mb-4 pb-0">CIVICA<br><span>Data Science</span> Seminar Series</h1>'

What about `<h2>` tags?

In [18]:
sel.css('h2').getall()

['<h2>Seminar Schedule</h2>',
 '<h2>Partner Institutions</h2>',
 '<h2>Gallery</h2>',
 '<h2>F.A.Q </h2>',
 '<h2>Newsletter</h2>',
 '<h2>Contact Us</h2>']

If you care just about the **first** `<h2>` tag, you can use the `.get()` method instead of `.getall()`:

In [19]:
sel.css("h2").get()

'<h2>Seminar Schedule</h2>'

**How to get the text from a tag:**

In [23]:
sel.css("h2 ::text").getall()

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

**How to get the text of tags returned by the `.css()` method?**

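The `css()` method returns a list of `Selector` objects, one per match. Because the CSS selector below ends in `::text`, each element holds just the text of one tag, and you can loop over the list and call `.get()` on each element: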


In [33]:
# Pure Python way
all_h2_tags = sel.css("h2 ::text")
all_h2_texts = []

for tag in all_h2_tags:
    all_h2_texts.append(tag.get())

all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

**Consider using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) for cleaner code:**

In [30]:
# one-liner way
all_h2_texts = [tag.get() for tag in sel.css("h2 ::text")]
all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']
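A list comprehension can also transform or filter elements while collecting them. As a sketch, the hard-coded list below copies the output above (so it runs without the web page), and `str.strip()` removes the stray space in `'F.A.Q '`:

```python
# Copied from the output above, including the trailing space in 'F.A.Q '
raw_texts = ['Seminar Schedule', 'Partner Institutions', 'Gallery',
             'F.A.Q ', 'Newsletter', 'Contact Us']

# Transform: clean every element while building the new list
clean_texts = [text.strip() for text in raw_texts]

# Filter: keep only the elements that pass a condition
short_texts = [text for text in clean_texts if len(text) < 10]

print(clean_texts)
print(short_texts)
```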

💡 **IMPORTANT TIPS:**

- Over the next couple of weeks, make it a habit to right-click on a webpage every now and then and select "Inspect" (or "Inspect Element") to explore how the HTML is structured. This will help you understand how to use CSS selectors to extract the data you need.
- Tag names and ` ::text` are just the tip of the iceberg. Read about other CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp).
- Bookmark the [Scrapy Selectors documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html) and revisit it whenever you need to. Practice using different CSS selectors to extract data from the HTML.

# Part III: Your turn! (45 min)


<details style="border: 1px solid #D55816; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I am part of the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project</summary>

If you are participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you are asked to:

- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.

- Have **only** the following tabs open in your browser:

    1. These lab instructions

    2. The [ChatGPT](https://chat.openai.com) website (**open a new chat window and name it 'DS105A - Week 04'**)

    3. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)
    
    4. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.

- **Fill out this brief survey at the end of the lab:** 🔗 [link](https://forms.office.com/e/h0dXriciyy) (requires LSE login)

</details>

<br>

<details style="border: 1px solid gray; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I'm not participating in the GENIAL project :\</summary>

In case you are not participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

1. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)

2. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

</details>

🎯 **ACTION POINTS**

## Q1

Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website, inspect the page (right-click + Inspect), and find your way to the name of the first event on the page. 

## Q2

Write down the "directions" inside the HTML file to reach the event title. For example, maybe you will find that:

> _The first event title is inside a \<html\> ➡️ \<div\> ➡️ \<div\> ➡️ \<h3\> tag_.

Write it in the markdown cell below:

**Solution**

The full, precise directions are the following:

> html body main#main section#schedule.section-with-bg div.container.wow.fadeInUp div.card-deck.row div.col-xs-12.col-sm-6.col-md-4.portfolio-item.filter-fall2023 div.card.mb-4 div.card-body a h6.card-title

Note that I separated the tags with a space. This is because each tag is nested inside the previous one. 

For example, the `<h6>` tag that contains the class `card-title` is inside the `<a>` tag, which is inside the `<div class="card-body">` tag, and so on.

## Q3

We don't know what happened to Q3. It's a mystery.

## Q4

Now, use the skill that you have just learned to scrape the names of ALL events. Save them all to a list.

### Solution with reasoning

Here is one example of how to think about this problem, step by step:



::: {style="margin-left:2em;"}


#### Step 1

Consider what is a good CSS Selector to use to get the title of the first event. I _could_ use the extremely long CSS Selector I found above, but by browsing the HTML I would eventually realise two things:

- All titles are represented within a `<h6 class="card-title">` tag
- There are no other `<h6 class="card-title">` tags in the HTML document, other than the ones that contain the titles of the events!

Therefore: `h6.card-title` is a good CSS Selector to use.


#### Step 2

Write code to get _just_ the first title. This is to force you to practice doing things in small steps. One common mistake coding beginners make is to try to do everything at once. This is a recipe for disaster!

Therefore, using what you learned above, you would write:

```python
sel.css('h6.card-title ::text').get()
```

returning:

```text
'Data science for the Sustainable Development Goals: the case of food security'
```

Great! That's precisely what we wanted.

#### Step 3

Now, write code to get _all_ titles. This is where you will need to use the `.getall()` method instead of `.get()`.

Instead of creating a separate code chunk, just rewrite the code above to use `.getall()` instead of `.get()`.

```python
sel.css('h6.card-title ::text').getall()
```

Once you run the cell again, you will notice that this new code produces a long list of titles. It seems to be working! 🎉


#### Step 4


Going back to the question, I see that I need to _save_ this list of titles to a variable. I will call it `titles`.

That is, just replace the code above with:

```python
titles = sel.css('h6.card-title ::text').getall()
```

Great, now you could even check how many titles you have by running `len(titles)`.

:::

Here is the final solution:

In [131]:
titles = sel.css('h6.card-title ::text').getall()

## Q5

Do the same with the dates of the events and speaker names and save them to separate lists. 



### Reasoning process



::: {style="margin-left:2em;"}


#### Step 1

Using all of the reasoning above, you would eventually realise that the information you care about is inside a tag that looks like:

```html
<p class="card-text">
    Speaker: Prof. Elisa Omodei, CEU 
    <br> Date: Wednesday, 18 October 2023
</p>
```

Also, following the principles above, you wouldn't jump straight to the solution. You would first write code to get _just_ the first speaker to make sure you are on the right track.

Therefore, using what you learned above, you would write:

```python
sel.css('p.card-text ::text').get()
```

```text
'Speaker: Prof. Elisa Omodei, CEU '
```

Unfortunately, `.get()` returns only the first text node, and the `<br>` tag splits the paragraph's text into two nodes, so everything after the `<br>` is left out. It would have been nice to have the date as well.


#### Step 2

OK, even if the above code doesn't give us the full information, it is a good sign that we are on the right track. Let's try to get _all_ the speakers. This is where you will need to use the `.getall()` method instead of `.get()`.

```python
sel.css('p.card-text ::text').getall()
```

```text
['Speaker: Prof. Elisa Omodei, CEU ',
 ' Date: Wednesday, 18 October 2023',
 'Speaker: Moritz Pfeifer & Vincent Philipp Marohl ',
 ' Date: Wednesday, 27 September 2023',
 'Speaker: Dr. Max Falkenberg ',
 ' Date: Wednesday, 13 September 2023',
 'Speaker: Dr. Eleonora Bertoni ',
 ...

  'Speaker: Roman Rivera',
 'Date: Wednesday, 17 March 2021']
```

Oh! Now the date is there, but it is mixed with the speaker name. What should I do next?

#### Step 3

Well, ok. That's good, I guess. But I need to separate the speaker name from the date. How can I do that? 

A simple way is to use a `for` loop to iterate over the list above and add the speaker name and date to separate lists.

```python
speakers = []
dates = []

for i in sel.css('p.card-text ::text').getall():
    if i.startswith('Speaker:'):
        speakers.append(i)
    elif i.startswith(' Date:'):
        dates.append(i)
```

This assumes you were already familiar with the `startswith()` method. If you weren't, you would have googled something like 'how to check if a string starts with a certain character in Python' and you would have found the [startswith](https://www.w3schools.com/python/ref_string_startswith.asp) method.

#### Step 4

Double check that everything worked as expected by printing the first 5 elements of each list:

```python
pprint(speakers[:5])
```

```text
['Speaker: Prof. Elisa Omodei, CEU ',
 'Speaker: Moritz Pfeifer & Vincent Philipp Marohl ',
 'Speaker: Dr. Max Falkenberg ',
 'Speaker: Dr. Eleonora Bertoni ',
 'Speaker: Prof. Giacomo Calzolari ']
```

Wait, I don't need the word `Speaker:` in the list! How do I get rid of it?

#### Step 5

You would have to recall or discover the [replace](https://www.w3schools.com/python/ref_string_replace.asp) method and use it to remove the word `Speaker:` from each element of the list.

It's all about baby steps, so the first thing you could do is try to remove the word `Speaker:` from the first element of the list:

```python
speakers[0].replace('Speaker: ', '')
```

```text
'Prof. Elisa Omodei, CEU '
```

It works!

#### Step 6

By now, you would be ready to apply the code above to replace all elements of the list:

```python
for i in range(len(speakers)):
    speakers[i] = speakers[i].replace('Speaker: ', '')
```

Similarly, you would do the same for the `dates` list:

```python
for i in range(len(dates)):
    dates[i] = dates[i].replace(' Date: ', '')
```
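One caveat: `replace()` removes the text wherever it appears, not only at the start of the string. If you are on Python 3.9 or newer, `str.removeprefix()` only touches the beginning. A small sketch, where the second string is made up to show the difference:

```python
raw = [
    'Speaker: Prof. Elisa Omodei, CEU ',   # taken from the output above
    'Speaker: Keynote Speaker: TBA',       # made-up example
]

# replace() removes every occurrence of 'Speaker: '
with_replace = [s.replace('Speaker: ', '') for s in raw]

# removeprefix() removes it only at the start; strip() tidies the edges
with_removeprefix = [s.removeprefix('Speaker: ').strip() for s in raw]

print(with_replace)
print(with_removeprefix)
```

For the data in this lab both approaches give the same names, so the choice here is a matter of taste.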

#### Step 7

You could stop here and everything would be fine. But you could also revisit your code and rewrite it so that everything is done in a single loop, leading to what seems like the final solution:

```python
speakers = []
dates = []

for i in sel.css('p.card-text ::text').getall():
    if i.startswith('Speaker:'):
        speaker_name = i.replace('Speaker: ', '')
        speakers.append(speaker_name)
    elif i.startswith('Date:'):
        date = i.replace('Date: ', '')
        dates.append(date)
```

#### Step 8

Don't stop here. **Always check** that your code is doing what you think it is doing.

Here is what I would do: print out the number of elements in each list:

```python
print(len(speakers))
print(len(dates))
```

```text
32
11
```

NOOOO! THEY ARE NOT THE SAME LENGTH! WHAT HAPPENED? Why are there many more speakers than dates?

Really, **this** is the whole point of this course. You are here to learn how to spot and fix errors in your code, as well as problems with your data collection.
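Instead of comparing printed numbers by eye, you could let Python complain for you. Below is a minimal sketch: `check_same_length` is a hypothetical helper name, and the two short lists are made up to reproduce the kind of mismatch seen above.

```python
def check_same_length(**lists):
    """Raise an error if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Lists have different lengths: {lengths}")
    return lengths


# Made-up data with one date missing
sample_speakers = ['Prof. Elisa Omodei, CEU ', 'Dr. Max Falkenberg ']
sample_dates = ['Wednesday, 18 October 2023']

try:
    check_same_length(speakers=sample_speakers, dates=sample_dates)
except ValueError as error:
    print(error)
```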

#### Step 9 (the most important thing you can learn in this course is how to debug your code)

Here is what I would do. I would open a separate code chunk right below (leave the previous code chunk intact), copy the code above, remove all the parts where I am saving the data to lists, and replace them with a `print()` statement:

```python
for i in sel.css('p.card-text ::text').getall():
    print("Looking at element: ", "'", i, "'")
    if i.startswith('Speaker:'):
        print("This element contains a speaker name")

        speaker_name = i.replace('Speaker: ', '')
        print("The speaker name is: ", speaker_name)
    elif i.startswith('Date:'):
        print("This element contains a date")

        date = i.replace('Date: ', '')
        print("The date is: ", date)
    else:
        print("This element is neither a speaker name nor a date")
    print()
```
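When whitespace is the suspect, `repr()` can help: it shows a string with its quotes and escape sequences, so leading spaces become easy to see. A sketch with a few sample elements shaped like the ones in this lab:

```python
sample_elements = [
    'Speaker: Roman Rivera',
    ' Date: Wednesday, 18 October 2023',
    ' Full day event  ',
]

for element in sample_elements:
    # !r inside an f-string is shorthand for repr(element)
    print(f"Looking at element: {element!r}")
    print("  starts with a space?", element.startswith(' '))
```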

#### Step 10

You will see that several elements have an extra space character at the start and therefore don't start with `Speaker:` or `Date:`. Instead they start with ` Speaker:` or ` Date:`. This is why they are not being saved to the `speakers` or `dates` lists.

I would now Google 'how to remove empty spaces from the start of a string in Python' and find the [lstrip](https://www.w3schools.com/python/ref_string_lstrip.asp) method. I would modify this new `for` loop I created, adding the `lstrip()` method to the `i` variable.

Now that I am confident that this for loop works, I can remove unnecessary `print()` statements and print only inside the `else` statement:

```python
for i in sel.css('p.card-text ::text').getall():
    i = i.lstrip()
    if i.startswith('Speaker:'):
        speaker_name = i.replace('Speaker: ', '')
    elif i.startswith('Date:'):
        date = i.replace('Date: ', '')
    else:
        print("Looking at element: ", "'", i, "'")
        print("This element is neither a speaker name nor a date")
        print()
```

The above only prints out once!

```text
Looking at element:  ' Full day event  '
This element is neither a speaker name nor a date
```

There is no single speaker, so the 'speaker_name' in this case can be empty.
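For reference, Python has three related methods: `lstrip()` removes whitespace on the left, `rstrip()` on the right, and `strip()` on both sides. A quick sketch using the odd element above:

```python
element = ' Full day event  '

left_only = element.lstrip()    # removes leading whitespace only
right_only = element.rstrip()   # removes trailing whitespace only
both_sides = element.strip()    # removes both

print(repr(left_only))
print(repr(right_only))
print(repr(both_sides))
```

Since several speaker names in this lab end with a space (for example `'Prof. Elisa Omodei, CEU '`), `strip()` may be the tidier choice if you also want to clean the end of each string.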

#### Step 11 - Dealing with the odd element

Now I can go back to my original `for` loop, add the `lstrip()` method to the `i` variable, and deal with the odd element:

```python
speakers = []
dates = []

for i in sel.css('p.card-text ::text').getall():
    i = i.lstrip()
    if i.startswith('Speaker:'):
        speaker_name = i.replace('Speaker: ', '')
        speakers.append(speaker_name)
    elif i.startswith('Date:'):
        date = i.replace('Date: ', '')
        dates.append(date)
    else:
        speakers.append('')
```

Then I can check that everything worked as expected:

```python
print(len(speakers))
print(len(dates))
```

```text
33
33
```

Great! Now I can remove any unnecessary extra code chunks I created in the process, keep just the code above and move on to the next question.

Question resolved! (_Or so I thought..._)

:::

Here is the final solution:

In [130]:
speakers = []
dates = []

for i in sel.css('p.card-text ::text').getall():
    i = i.lstrip()
    if i.startswith('Speaker:'):
        speaker_name = i.replace('Speaker: ', '')
        speakers.append(speaker_name)
    elif i.startswith('Date:'):
        date = i.replace('Date: ', '')
        dates.append(date)
    else:
        speakers.append('')

## Q6

Convert the lists to a pandas data frame and save it to a CSV file.

### Reasoning process


::: {style="margin-left:2em;"}


#### Step 1

This seems straightforward, now that I have all the data in lists. I can just use the `pd.DataFrame()` function to create a data frame:


```python
df = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates})
```

But in fact, you will get an error:

```python
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
...
    669     raise ValueError(
    670         "Mixing dicts with non-Series may lead to ambiguous ordering."
    671     )

ValueError: All arrays must be of the same length
```

#### Step 2

The error message is quite clear: some of the lists are not of the same length. Let's check:

```python
print(len(titles))
print(len(speakers))
print(len(dates))
```

```text
37
33
33
```

Oh boy! That means that my answer to Q5 was wrong!!! I need to go back and fix it.
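One way to see how far lists have drifted apart is to line them up side by side. The sketch below uses made-up lists and `itertools.zip_longest` from the standard library, which pads the shorter list instead of stopping early like `zip()` does.

```python
from itertools import zip_longest

# Made-up lists: one speaker is missing
sample_titles = ['Talk A', 'Talk B', 'Talk C']
sample_speakers = ['Ada', 'Grace']

# zip_longest pads the shorter list with the fill value
pairs = list(zip_longest(sample_titles, sample_speakers, fillvalue='<MISSING>'))

for title, speaker in pairs:
    print(f"{title:<10} | {speaker}")
```

Keep in mind that the padding always lands at the end, so this tells you how many values are missing, not which ones. You still have to read the pairs and spot where they stop matching.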

#### Step 3

This would take time! You would have to inspect 'speaker' by 'speaker' in the HTML document to realise that _some_ speaker details are wrapped in a `<p>` tag instead of the `<p class="card-text">` tag we were using before. And we can't use just the `<p>` tag because there are other `<p>` tags in the HTML document that we don't care about.

#### Step 4

At this stage, you would have to find another way to specify the path to the selector. Going back to the HTML document, you would realise that the `<p>` tag we care about is inside a `<div class="card-body">` tag. 

That is, you would have to go back to the code you wrote for Q5 and change it to:

```python
speakers = []
dates = []

for i in sel.css('div.card-body p ::text').getall():
    i = i.lstrip()
    if i.startswith('Speaker:'):
        speaker_name = i.replace('Speaker: ', '')
        speakers.append(speaker_name)
    elif i.startswith('Date:'):
        date = i.replace('Date: ', '')
        dates.append(date)
    else:
        speakers.append('')
```

#### Step 5

You would again check that everything worked as expected and once you notice that yes, everything is fine, the initial code you wrote for Q6 would now work!


:::

Therefore, here is the final solution:

In [146]:
df = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates})
df.to_csv('schedule.csv', index=False)

## Q7

Double-check that the CSV file was created correctly by opening it using pandas. Then convert the columns to appropriate data types (use what you've learned in Week 04 lecture).

In [147]:
df = pd.read_csv('schedule.csv')
df

|   | title | speaker | date |
|---|---|---|---|
| 0 | Data science for the Sustainable Development G... | Prof. Elisa Omodei, CEU | Wednesday, 18 October 2023 |
| 1 | CentralBankRoBERTa: A Fine-Tuned Large Languag... | Moritz Pfeifer & Vincent Philipp Marohl | Wednesday, 27 September 2023 |
| 2 | The Evolution of the Climate Discourse on Twit... | Dr. Max Falkenberg | Wednesday, 13 September 2023 |
| 3 | The Handbook of Computational Social Science f... | Dr. Eleonora Bertoni | Wednesday, 31 May 2023 |
| 4 | Artificial Intelligence, Algorithmic Recommend... | Prof. Giacomo Calzolari | Wednesday, 03 May 2023 |
| 5 | Exploring A New Model of Industry/Academic Col... | Prof. Pablo Barberá | Wednesday, 19 April 2023 |
| 6 | Using Multimodal Neural Networks to Better Und... | Prof. Bryce Jensen Dietrich | Wednesday, 22 March 2023 |
| 7 | Models, mathematics, and data science: how to ... | Dr. Erica Thompson | Wednesday, 08 March 2023 |
| 8 | CIVICA Conference on European Polarisation |  | Wednesday, 15 February 2023 |
| 9 | New Faces of Bias in Online Platforms | Prof. Aniko Hannak | Wednesday, 08 February 2023 |
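For the second half of Q7, here is one possible sketch of the type conversion. It reads a tiny stand-in for `schedule.csv` from a string so that it runs on its own; the first title is made up, and the format string is my assumption based on dates that look like `Wednesday, 18 October 2023`.

```python
import io

import pandas as pd

# A tiny stand-in for schedule.csv ("Talk A" is a made-up title)
csv_text = '''title,speaker,date
Talk A,"Prof. Elisa Omodei, CEU","Wednesday, 18 October 2023"
CIVICA Conference on European Polarisation,,"Wednesday, 15 February 2023"
'''

sample_df = pd.read_csv(io.StringIO(csv_text))

# Parse the date strings into proper datetime values
sample_df['date'] = pd.to_datetime(sample_df['date'], format='%A, %d %B %Y')

# Use the dedicated string dtype; the empty speaker stays a missing value
sample_df['title'] = sample_df['title'].astype('string')
sample_df['speaker'] = sample_df['speaker'].astype('string')

print(sample_df.dtypes)
```

Note that an empty speaker saved to CSV is read back as a missing value rather than as an empty string.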
