<font style='font-size:1.5em'>**💻 Week 04 lab – Web Scraping** </font>

<font style='font-size:1.2em'>DS105A – Data for Data Science</font>

**AUTHORS:**  [Anton Boichenko](https://github.com/antonboychenko) & [Alex Soldatkin](https://github.com/alex-soldatkin) & Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Learn how to collect data from the Web using Python packages

**LAST REVISION:** 18 October 2023

::: callout-important

## This lab is part of the ![](/figures/logos/GENIAL_favicon.png){width=1em}  [<span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project](https://lse-dsi.github.io/genial). 

If you have never used ChatGPT, you must create an account. Go to [chat.openai.com](https://chat.openai.com/) and sign up with your email address (it doesn't have to be your LSE email address).

When you reach Part III of this lab, read the specific instructions for GENIAL participants.

:::


--- 

# Part I: ⚙️ The setup (15 min)

You will need the `requests`, `scrapy` and `pandas` packages to complete this lab. I will assume you have already configured the virtual environment for this course.

🎯 **ACTION POINTS:**

1. Open the terminal (directly from within VS Code will be easier) and run each of the following commands:


    ```bash
    pip install pandas
    pip install requests
    pip install scrapy
    ```

2. Now, create a new code chunk below and import the packages you just installed. You should have something like this:

    ```python
    import requests               # This is how we access the web
    import pandas as pd           # This is how we work with data frames

    from pprint import pprint     # Print things in a pretty way
    from scrapy import Selector   # This is how we parse HTML
    ```

# Part II: Requesting a web page (30 min)

üë®üèª‚Äçüè´ **TEACHING MOMENT**

The entire Part II is a teaching moment. Your class teacher will help you collect this information from the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) web page.

Pay close attention and follow along on your computer.

---

You might have heard of [CIVICA](https://www.civica.eu/who-we-are/about-civica/) before. It is a body that unites several European universities to collaborate in the areas of social sciences, humanities, business and public policy. CIVICA hosts a [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) that might be of interest to you. Today we will collect information on some of the seminars. Maybe you can use it in the future!

**Our main task** is to create a 🐼 pandas data frame that contains:

1. names of the seminars 
2. names of speakers of those seminars
3. dates of the seminars
4. bios of the speakers from each individual event 


## 2.1. Request a website


```python
# This is the address of the website we want to scrape
my_url = 'https://socialdatascience.network/index.html#schedule'

# We send a GET request to the website
response = requests.get(my_url)

# What is the response code?
response
```
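A request can fail for reasons that have nothing to do with your code: a typo in the URL, a server that is down, or no internet connection. Below is a minimal sketch of a more defensive version of the request above. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, or None if the request fails.

    `fetch_html` is a hypothetical helper, for illustration only.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

You would call it as `html = fetch_html(my_url)` and check whether `html` is `None` before trying to parse it.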

**📜 Other possible responses**

The response code is a standard way of communicating the status of a request. There are many other possible responses:

- **200** OK
- **204** No Content
- **400** Bad Request
- **401** Unauthorized
- **402** Payment Required
- **403** Forbidden
- **404** Not Found
- **500** Internal Server Error
- **502** Bad Gateway

🗣️ **CLASSROOM DISCUSSION:** Have you ever encountered any of these responses when browsing the Web on your browser? Where? What did you do about it?


You can find a full list [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
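If you want to look up what a status code means without leaving Python, the standard library's `http` module knows the official names. The sketch below is an aside and is not needed for the rest of the lab; the helper name `describe_status` is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found' (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


# Print the description of each code from the list above
for code in [200, 204, 400, 401, 402, 403, 404, 500, 502]:
    print(describe_status(code))
```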

## 2.2. A closer look at the response

What else is stored in the `response` object?

```python
# The vars function returns all attributes of an object, along with their values
# You will see that it is essentially just a dictionary
vars(response)
```

🗣️ **CLASSROOM DISCUSSION:**

You have already looked at `response.status_code`. But what do you think the following attributes of the `response` object are?

- `response.headers`
- `response.cookies`
- `response.content`

Feel free to open a new chunk of code below and explore these attributes.

The `encoding` attribute shown by `vars(response)` is not the only **metadata** we can get from the response. Let's take a look at all the headers:

```python
# Headers are metadata about the response
pprint(response.headers)
```

We could choose to manipulate the headers above as a `pd.Series`:

```python
pd.Series(response.headers)
```
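Often you only need one or two headers rather than the whole set. In `requests`, `response.headers` behaves like a dictionary whose keys are case-insensitive. The sketch below uses made-up headers instead of a live response, so the header values are assumptions for illustration only.

```python
from requests.structures import CaseInsensitiveDict

# Made-up headers standing in for `response.headers`
example_headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Server": "ExampleServer",
})

# Header names are matched regardless of upper or lower case
content_type = example_headers["content-type"]
print(content_type)

# .get() returns a default instead of raising KeyError when a header is missing
last_modified = example_headers.get("Last-Modified", "not provided")
print(last_modified)
```

With a real response, the equivalent call would be `response.headers.get("Content-Type")`.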

Let me show you what is in the `response` object by printing its text.

```python
pprint(response.text)
```

The code chunk above makes sense here because I want to show you how to inspect objects when in **prototype mode**. However, whenever you are writing in a Jupyter Notebook to report to someone (say, when submitting your assignment), you should remove code chunks that produce a lot of unnecessary output.


💡 **A DETAIL THAT SEEMS INSIGNIFICANT BUT THAT IS EXTREMELY IMPORTANT**:
- On Mac and Linux, the line break character is `\n`.
- On Windows, a line break is written as `\r\n`.
- Windows uses two characters to break lines, while Mac and Linux use only one.
- This is a common source of errors when working with text files across two different operating systems. (For example: you use a Mac and collaborate with someone who uses Windows.)
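To see why this matters, here is a small sketch using a made-up string (not the actual response) that contains Windows-style line breaks:

```python
# Made-up text, as if it had been saved on Windows
windows_text = "first line\r\nsecond line\r\nthird line"

# str.splitlines() understands both \n and \r\n
lines = windows_text.splitlines()
print(lines)

# Splitting on "\n" alone leaves a stray "\r" at the end of every line but the last
naive_lines = windows_text.split("\n")
print(naive_lines)

# One way to normalise everything to the Mac/Linux convention
unix_text = windows_text.replace("\r\n", "\n")
print(repr(unix_text))
```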

How many characters are there in the `response.text`?

```python
len(response.text)
```

Not very useful to treat it as a pure string, right? We need to find a better way to parse this data.


## 2.3. Parsing HTML

Scrapy's `Selector` is a Python class for extracting data from HTML and XML documents. It uses CSS or XPath selectors for data extraction, which makes it a powerful tool for web scraping. It is a core part of the Scrapy framework but can also be used on its own.

When you feed HTML text to the `Selector`, it parses the HTML and stores it in a particular **object** <sup>1</sup>. This object allows you to access parts of the HTML using Python's common dot notation in combination with CSS syntax. If, for instance, you want to fetch the title of the page, you might use `selector.css('title')`.

<sup>1</sup>: Re-watch the 🗓️ Week 04 lecture if you need to revise what an object is.

```python
# parse the HTML code using Scrapy Selector
sel = Selector(text=response.text)
```

💡 Note: I was only able to call `Selector()` directly because I had already imported it at the top of the notebook. Scroll up to see it. If I hadn't, the code above would have thrown an error.

**Check `sel.get()` to see the full HTML document**

This returns essentially the same HTML as `response.text`.

```python
sel.get()
```

**HTML documents usually have a \<header\> tag:**

(⚠️ not to be confused with the HTTP header we saw with `response.headers`)

```python
sel.css('header')
```

There is also usually a `<body>` tag, which contains the main content of the page:

```python
sel.css('body')
```

🔑 **Takeaway of the output above:**

- The output is a list, as indicated by the square brackets. 
- HTML pages only have one `<body>` tag, so this list contains a single element, which is an object of the class Selector.

What if I want to look at the content of the `<body>` tag?

```python
pprint(sel.css('body').get())
```

**Are there any `<h1>` tags in this page?**

```python
sel.css('h1').get()
```

What about `<h2>` tags?

```python
sel.css('h2').getall()
```

If you care just about the **first** `<h2>` tag, you can use the `.get()` method instead of `.getall()`:

```python
sel.css("h2").get()
```

**How to get the text from a tag:**

```python
sel.css("h2 ::text").get()
```

**How to get the text of each element returned by the `.css()` method?**

The `.css()` method returns a list of Selector objects, so you can also loop over that list and call `.get()` on each element.

```python
# Pure Python way
all_h2_texts = []

# Each element of the list is a Selector wrapping one piece of text
for tag in sel.css("h2 ::text"):
    all_h2_texts.append(tag.get())

all_h2_texts
```

**Consider using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) for cleaner code:**

```python
# one-liner way
all_h2_texts = [tag.get() for tag in sel.css("h2 ::text")]
all_h2_texts
```

💡 **IMPORTANT TIPS:**

- Over the next couple of weeks, make it a habit to right-click on a webpage every now and then and select "Inspect" (or "Inspect Element") to explore how the HTML is structured. This will help you understand how to use CSS selectors to extract the data you need.
- Tag names and ` ::text` are just the tip of the iceberg. Read about other CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp).
- Bookmark the [Scrapy Selectors documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html) and revisit it whenever you need to. Practice using different CSS selectors to extract data from the HTML.

# Part III: Your turn! (45 min)


<details style="border: 1px solid #D55816; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I am part of the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project</summary>

If you are participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you are asked to:

- Work independently (not in groups or pairs), but you can ask the class teacher for help if you get stuck.

- Have **only** the following tabs open in your browser:

    1. These lab instructions

    2. The [ChatGPT](https://chat.openai.com) website (**open a new chat window and name it 'DS105A - Week 04'**)

    3. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)
    
    4. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

- Be aware of how useful (or not) ChatGPT was in helping you answer the questions in this section.

- **Fill out this brief survey at the end of the lab:** 🔗 [link](https://forms.office.com/e/h0dXriciyy) (requires LSE login)

</details>

<br>

<details style="border: 1px solid gray; border-radius: 5px; padding: 0.5em;">
<summary style="font-weight:bold;margin-top:0.5em;margin-bottom:0.5em;font-size:1.4em;"> I'm not participating in the GENIAL project :\</summary>

In case you are not participating in the <span style="font-weight:bold"> GEN<font color='#D55816'>IA</font>L</span> project, you can work in pairs or small groups to answer the questions in this section. You can also ask the class teacher for help if you get stuck.

We suggest you have these tabs open in your browser:

1. The [W3Schools CSS Selector reference page](https://www.w3schools.com/cssref/css_selectors.asp)

2. The [Scrapy documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html)

</details>

🎯 **ACTION POINTS**

1. Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website and inspect the page (mouse right-click + Inspect) and find the way to the name of the first event on the page. 

3. Write down the "directions" inside the HTML file to reach the event title. For example, maybe you will find that:

    > _The first event title is inside a \<html\> ➡️ \<div\> ➡️ \<div\> ➡️ \<h3\> tag_.

    Write it in the markdown cell below:

_Delete this line and write your answer here_

4. Now, use the skill that you have just learned to scrape the names of ALL events. Save them all to a list.

```python
# Delete this line and replace it with your code
```

5. Do the same with the dates of the events and speaker names and save them to separate lists. 



```python
# Delete this line and replace it with your code
```

6. Convert the lists to a pandas data frame and save it to a CSV file.

```python
# Delete this line and replace it with your code
```

7. Double-check that the CSV file was created correctly by opening it using pandas. Then convert the columns to appropriate data types (use what you've learned in the Week 04 lecture).

```python
# Delete this line and replace it with your code
```
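If you get stuck on the last two steps, the general pattern might look like the sketch below. It uses made-up lists rather than the scraped data, and it writes to an in-memory buffer (`io.StringIO`) instead of a real file so that it runs anywhere; in your own solution you would pass a file name to `to_csv` and `read_csv`. The column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up lists standing in for the ones you scraped
titles = ["Talk A", "Talk B"]
speakers = ["Speaker One", "Speaker Two"]
dates = ["2023-10-18", "2023-11-01"]

# Each list becomes one column of the data frame
df = pd.DataFrame({"title": titles, "speaker": speakers, "date": dates})

# Save as CSV (here to an in-memory buffer instead of a file on disk)
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read it back to check that it was saved correctly
buffer.seek(0)
df_back = pd.read_csv(buffer)

# Dates come back as plain text, so convert them explicitly
df_back["date"] = pd.to_datetime(df_back["date"])
print(df_back.dtypes)
```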

# Final Words

Use this to save your own notes about this lab. Maybe you can get in the habit of taking notes using Jupyter Notebooks and Markdown?