LSE Data Science Institute | DS105A (2023/24) | Week 04

# 📚 Week 04 Appendix 01: HTML, CSS & principles of web scraping

Theme: Collecting Data

**DATE:** 22 October 2023

**AUTHOR:** [@jonjoncardoso](https://jonjoncardoso.github.io)

------------------------------


## 1. What does a webpage look like in its raw form?

In Week 04, you learned how to access the source code of a website. By right-clicking on a web page and selecting "Inspect" or "Inspect Element," a window appears displaying the **HTML code** that forms the structure of the page you're viewing. 

If you recall our discussion about CSV files in the W04 lecture, you will remember that `.csv` files are plain text files that contain data in a specific, **structured** format. HTML is the same. The **HTML code** you see when inspecting a page is pure text, but it has some special characters that tell the browser how to display and format the page.

<div style="position: relative; padding-bottom: 56.25%; height: 0;"><iframe src="https://www.loom.com/embed/1b6868b38bef4d2d9c3cd83331d5672d?sid=780a7fee-3e7d-46e5-aa39-3b3afb43fda6" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

In the video above, I inspect our Syllabus page and show you how to navigate the HTML code on the browser directly.

📗 **What really is an HTML document?**

HTML stands for **H**yper**T**ext **M**arkup **L**anguage.

Let's decompose that:

- A [markup language](https://www.wikiwand.com/en/Markup_language) is a set of conventions and standards we apply to text files to indicate how this text should be _rendered_. When we asked you to play around with markdown on GitHub, we were showing you a markup language.
- The hypertext part refers to the fact that HTML documents are designed to be _linked_ to each other. The web is a network of HTML documents that are linked to each other.

As I said above, HTML documents are plain text files, structured in a specific way so that they can be rendered by a browser.

### 1.1. HTML tags

Here is what the HTML code of the section I highlighted in the video above looks like:

<div style="font-size:0.6em;">

```html
<section id="introduction" style="margin: 20px 0;">
    <h1>Introduction</h1>
    <p>The first week is all about setting up your computer and getting familiar with the tools we will use in the course.</p>
    <div style="width: 100%; display: flex; flex-wrap: wrap; border-top: 1px solid #aaa; border-bottom: 1px solid #aaa; padding: 1em 0;" id="yui_3_17_2_1_1697970867037_69">
        <!-- Week 01 Calendar column -->
        <div style="flex-basis: 16.66%; padding-left: 0.5rem; font-weight: bold;">
            <p>üóìÔ∏è <a href="https://lse-dsi.github.io/DS105/2023/autumn/weeks/week01/page.qmd">Week 01</a> <br> <span style="font-weight: lighter;">25 Sep 2023 -<br> 29 Sep 2023</span></p>
        </div>
        <!-- Week 01 Main Grid -->
        <div style="flex-basis: 83.33%; display: flex; flex-wrap: wrap;">
            <!-- Week 01 Lecture -->
            <div style="flex-basis: 25%; font-weight: bold;">
                <p style="margin-top: 0;margin-bottom: 1rem;">🧑‍🏫 Lecture</p>
            </div>
            <div style="flex-basis: 75%; font-size: 0.875rem;">
                <p style="margin-top: 0;margin-bottom: 1rem;">The Data Science Toolbox and the Terminal</p>
            </div>
            <!-- Week 01 Formative -->
            <div style="flex-basis: 25%; font-weight: bold;">
                <p style="margin-top: 0;margin-bottom: 1rem;">💻 Lab</p>
            </div>
            <div style="flex-basis: 75%; font-size: 0.875rem;">
                <p style="margin-top: 0;margin-bottom: 1rem;">Setting up your computer and getting familiar with the terminal
                </p>
            </div>
            <!-- Week 01 Readings -->
            <div style="flex-basis: 25%; font-weight: bold;">
                <p style="margin-top: 0;margin-bottom: 1rem;">📖 Readings</p>
            </div>
            <div style="flex-basis: 75%; font-size: 0.875rem;">
                <details style="margin-bottom: 1em;">
                    <summary style="color: #6c757d;display: list-item;cursor: pointer;">
                        Click to see recommended resources
                    </summary>
                    <p style="margin-top: 0;margin-bottom: 1rem;"><strong>Indicative</strong></p>
                    <ul>
                        <li>üìï Book Chapter: <span class="citation" data-cites="schutt_doing_2013">(<a href="#ref-schutt_doing_2013" role="doc-biblioref" aria-expanded="false">Schutt and O‚ÄôNeil 2013,
                                    chap. 1</a>)</span> - <em>What is Data Science?</em></li>
                        <li>üìï Book Chapter: <span class="citation" data-cites="shah_hands-introduction_2020">(<a href="#ref-shah_hands-introduction_2020" role="doc-biblioref" aria-expanded="false">Shah 2020, chap.
                                    1</a>)</span> - <em>Introduction</em></li>
                    </ul>
                    <p style="margin-top: 0;margin-bottom: 1rem;"><strong>Recommended</strong></p>
                    <ul>
                        <li>üìÉ Academic Article: "<em>Beyond Unicorns: Educating, Classifying, and Certifying Business Data
                                Scientists</em>" <span class="citation" data-cites="davenport_beyond_2020">(<a href="#ref-davenport_beyond_2020" role="doc-biblioref" aria-expanded="false">Davenport
                                    2020</a>)</span></li>
                    </ul>
                </details>
            </div>
            <!-- End of Week 01 Grid - Lecture + Lab + Revisit-->
        </div>
        <!-- End of Week 01 Grid -->
    </div>

</section>
```

</div>

**What do we see there?**

We call the elements enclosed by `<` and `>` characters **HTML tags**, for example: `<section>`, `<div>`, `<p>`. Tags are the building blocks of HTML, and define the structure of a webpage and specify how it should be displayed on your browser. Each tag has a specific meaning and purpose.

Browsers try, but do not always succeed, to display the content of a webpage exactly as the author intended. The browser has to interpret the HTML code and make assumptions about how to display it.

> If you have been paying close attention, this is a type of **nested data structure**: there are tags inside tags inside tags. (_This is what we have been foreshadowing since your ✏️ W04 Formative challenge-assignment_)

**Common tags**

When inspecting a page, if you scroll up to the top of the HTML document, you find that the root of this data structure is the `<html>` tag. This tag is always there! Inside it, you will find the `<head>` and `<body>` tags, which are also always present. The `<head>` tag contains information about the page that is not displayed on the browser, such as the page title, the page description, and the page language. The `<body>` tag contains all the content that is displayed on the browser.

The skeleton of an HTML page is as follows:

```html
<!DOCTYPE html> <!-- this specifies that this text document is an HTML document -->
<html>
    <head>
        <!-- metadata goes here -->
    </head>
    <body>
        <!-- displayed content goes here -->
    </body>
</html>

```
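
To make the "tags inside tags" idea concrete, here is a minimal sketch that reads the skeleton above with Python's built-in `html.parser` module and prints each tag indented by how deeply it is nested. The class name `OutlineParser` is made up for this illustration, and the sketch assumes every tag is explicitly closed (tags such as `<img>` or `<br>`, which have no closing tag, would throw the depth counter off).

```python
from html.parser import HTMLParser

# The same skeleton shown above, stored as a plain Python string.
PAGE = """<!DOCTYPE html>
<html>
    <head>
        <!-- metadata goes here -->
    </head>
    <body>
        <!-- displayed content goes here -->
    </body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) tuples

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # anything that follows is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag brings us back up one level


parser = OutlineParser()
parser.feed(PAGE)

for depth, tag in parser.outline:
    print("    " * depth + f"<{tag}>")
```

The printed outline shows `<head>` and `<body>` indented one level under `<html>`, which is the same nesting you see when you collapse and expand elements in the browser inspector.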

All the other tags are nested inside the `<body>` tag. The most common tags you will find inside the `<body>` tag are:

- `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`: headings
- `<p>`: paragraphs
- `<a>`: links
- `<img>`: images

🎯 **ACTION POINTS**

Give it a go: create a new text file in VS Code and save it as `index.html`. Then, copy and paste the following code into it:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My First Webpage</title>
  </head>
  <body>
    <h1>Hello World! This is my first h1 heading</h1>
    <p>This is my first paragraph on my first webpage.</p>
  </body>
</html>
```

Now, locate the `index.html` file on your computer and double-click on it to open it in your browser. You should see something like this:

![Screenshot 2023-10-22 122204.png](<attachment:Screenshot 2023-10-22 122204.png>)


Note how the browser displays the content of the page according to the tags we used. The `<h1>` tag is displayed as a large heading, and the `<p>` tag is displayed as a paragraph. Also, the special `<title>` element that is inside the `<head>` tag is displayed as the title of the page, which is displayed on the browser tab.

<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

🖇️ **USEFUL LINK:**

- [HTML Element Reference - By Category](https://www.w3schools.com/TAGs/ref_byfunc.asp): to see a full list of all tags you could have on a webpage. 

</div>



## 1.2 Tags can have attributes

First, a 💡 **TIP**: you can use HTML tags inside your Jupyter Notebook markdown cells to format your text.

If you double-click on the markdown cell above with the 🖇️ **USEFUL LINK:**, you will see that I wrapped my text within a `<div>` tag and added a `style=` attribute to that div so I can customise how you see it on the notebook.

```html
<div style="...">
    ...
</div>
```

Your browser (and your Jupyter notebook) has a set of defaults for how a block of content (`<div>`), or a paragraph (`<p>`) should be displayed, down to the font, the size, the spacing, etc. But you can override these defaults by adding a `style=` attribute to your tags.

The `style=` attribute is a special attribute that allows you to add CSS (Cascading Style Sheets) to your HTML tags. (More on CSS in a bit.) But this is not the only attribute you can add to a tag.



### 1.2.1 Attributes that are specific to each tag

Some attributes only make sense for a particular tag. For example, the `<a>` tag, which represents a link to another page, normally carries an `href=` attribute, which specifies the address (the URL) of the page you want to link to.

For example, if on the Syllabus I want to have a link to the Week 01 materials, I would use the following code:

```html
<a href="https://moodle.lse.ac.uk/course/view.php?id=8796#section-3">Week 01</a>
```
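
To see how a program can read that attribute, here is a minimal sketch using Python's built-in `html.parser` module. The class name `LinkParser` is made up for this illustration; the W04 lab uses `scrapy` for the same job, so treat this as a dependency-free stand-in rather than the course's method.

```python
from html.parser import HTMLParser

SNIPPET = '<a href="https://moodle.lse.ac.uk/course/view.php?id=8796#section-3">Week 01</a>'


class LinkParser(HTMLParser):
    """Collect the href= attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, one per attribute
        if tag == "a":
            attributes = dict(attrs)
            if "href" in attributes:
                self.links.append(attributes["href"])


parser = LinkParser()
parser.feed(SNIPPET)
print(parser.links)
```

The same pattern works for any tag and attribute pair, for example collecting `src=` from every `<img>` tag.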

Other attributes that you might expect to see together with a particular tag are:

- Tag: `<img>`
    - `src=`: the URL of the image
    - `alt=`: the alternative text to display if the image cannot be loaded
- Tag: `<a>`
    - `href=`: the URL of the page you want to link to
    - `target=`: where to open the link (e.g. in a new tab)
- Tag: `<table>`
    - `border=`: the width of the border of the table
    - `cellpadding=`: the space between the cell content and the cell border
    - `cellspacing=`: the space between cells
- Tag: `<form>`
    - `action=`: the URL of the page that will process the form data
    - `method=`: the HTTP method to use when submitting the form (GET or POST)

### 1.2.2 The id attribute

Some attributes are common to all tags, such as the `id=` attribute. This attribute allows you to give a unique **identifier** to a tag. This is useful when you want to refer to a specific tag in your code. For example, if you want to link to a specific section of a page, you can use the `id=` attribute to give a unique identifier to that section, and then use that identifier in the link. 

```html
<section id="introduction">
    <h1>Introduction</h1>
    <p>The first week is all about setting up your computer and getting familiar with the tools we will use in the course.</p>
</section>
```

Later on, I could link to this section using the `#` symbol and the `id` of the section:

```html
<a href="#introduction">Go to Introduction</a>
```
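
The part after the `#` is called the URL _fragment_. As a small sketch, Python's built-in `urllib.parse` module can split a URL into its pieces, which makes it easy to see which part names the page and which part names the `id` the browser looks for. The URL is the Moodle link used earlier in this notebook.

```python
from urllib.parse import urlparse

url = "https://moodle.lse.ac.uk/course/view.php?id=8796#section-3"
parts = urlparse(url)

print(parts.path)      # the page being requested
print(parts.query)     # extra parameters sent to the server
print(parts.fragment)  # the id the browser looks for once the page has loaded

# A link that is only a fragment has no page part at all,
# which is why the browser stays on the current page.
same_page = urlparse("#introduction")
print(repr(same_page.path), same_page.fragment)
```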

The browser knows that because the href link starts with a `#`, it should look for an element with that `id` on the same page and scroll to it.

### 1.2.3 The class attribute

Another attribute that can be added to virtually all tags is the `class` attribute. Tag classes are used mostly to apply custom CSS styles to a group of tags. 

For example, say I have a bunch of paragraphs (`<p>`) in my page, but I want some of them to be displayed in a different colour. I could add a `class` attribute to those specific paragraphs, and then specify a styling rule for that class in my CSS file. 

```html
<p>This is a normal paragraph.</p>
<p class="highlighted">This is a highlighted paragraph.</p>
<p>This is another normal paragraph.</p>
```

```css
p.highlighted {
    color: red;
}
```



## 2. What is CSS?

You saw how I used the `style=` attribute to add some custom styling to the `<div>` tag in the markdown cell above. What I did there is called **inline styling**. This is a quick and dirty way to add some styling to a tag, but it is not always the best way to do it. Most web developers prefer to use **Cascading Style Sheets** (CSS) to style their web pages. 

CSS is a language that allows you to specify how HTML elements should be displayed on the browser. It is usually stored in a separate file and linked to the HTML document using a `<link>` tag in the `<head>` section of the document.

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My First Webpage</title>
    <link rel="stylesheet" href="style.css"> <!-- this links to the CSS file -->
  </head>
  <body>
    <h1>Hello World! This is my first h1 heading</h1>
    <p>This is a normal paragraph.</p>
    <p class="highlighted">This is a highlighted paragraph.</p>
    <p>This is another normal paragraph.</p>
  </body>
</html>
```

## 2.1 How does CSS work?

CSS is a **declarative** language. This means that you don't have to tell the browser how to display each element. Instead, you specify **rules** that the browser will use to display the elements.

A CSS rule is made up of a **selector** and a **declaration block**. The selector specifies which elements the rule applies to, and the declaration block specifies how those elements should be displayed. 

```css
selector {
    property: value;
    property: value;
    ...
}
```

In our example above, we could have a rule that sets the property 'color' to 'black' for all `<p>` tags, and another rule that applies a different text color only to `<p>` tags with the `highlighted` class. 

```css
p {
    color: black;
}

p.highlighted {
    color: red;
}
```

The first rule applies to all `<p>` tags, regardless of the class or attributes they have. The second rule applies only to `<p>` tags that specifically have the `highlighted` class.


<div style="width:60%;font-size:0.85em;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

⚠️ **IMPORTANT:**

Note that when specifying the class of a particular tag, we add `class="name-of-the-class"`, but when we are referring to that class in our CSS file, we use a `.` before the name of the class.

Therefore, `p.highlighted` in a CSS file indicates that we are referring to the rule that applies to all `<p class="highlighted">` tags.

</div>

**What if I want to apply a rule to an ID instead?**

Then, you would use a `#` before the name of the ID. For example, if you want to apply a rule to the `<section id="introduction">` tag, you would use `#introduction` in your CSS file.

```css
#introduction {
    color: blue;
}
```

## 2.2 What are the properties I can use in CSS?

There are hundreds of CSS properties and values that you can use to style your web page. You can find a full list of them on the [W3Schools CSS Reference](https://www.w3schools.com/cssref/default.asp).

**How do I see all CSS styles applied to a specific tag?**

If you inspect a page and hover your mouse over a tag, you will see a list of all the CSS styles that are applied to that tag at the bottom of the window.

See, for example, how the `<p>` tag is customised in the example below:

![image.png](attachment:image.png)

# 3. Use this knowledge to scrape the web

How does all of that help me with web scraping?

If we start to **think of web pages as sources of data**, we can reverse engineer the process that the browser uses to display the page to collect the information we need.

For example:

- if we want to collect the title of a page, we can look for the `<title>` tag inside the `<head>` section of the page. 

- if we want to collect the text of a paragraph, we can look for the `<p>` tag inside the `<body>` section of the page. If we want to collect the URL of an image, we can look for the `<img>` tag inside the `<body>` section of the page and extract the `src=` attribute.

- if what I want is to extract precisely the paragraphs that have been highlighted in red, I can look for the `<p class="highlighted">` tag inside the `<body>` section of the page.

To see how to extract this in Python, go back to the W04 lab notebook to learn about the `requests` and `scrapy` libraries. But stick around for a bit longer, because I want to show you how to think about declaring the data you want to extract from a page.

## 3.1 CSS Selectors

How do I specify the _path_ to the data I want to extract?

Say I want to collect the `<title>` tag that is nested inside the `<head>` tag. I can specify this as a full path:
    
```css
html > head > title
```

That is, I am indicating that I want to start at the root of the HTML document (`html`), then go inside the `<head>` tag, and then inside the `<title>` tag.

The path we wrote above is a way to point to a specific tag in an HTML document, and it is called a **CSS selector**. This is what you would need to give to the [`scrapy` library in Python](https://docs.scrapy.org/en/latest/topics/selectors.html) to extract the data you want from a page. Again, check the W04 Lab notebook for more details.
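
If you want to try the "follow a path through the tags" idea without installing anything, here is a rough sketch using Python's built-in `xml.etree.ElementTree`. Two assumptions to keep in mind: this module does not understand CSS selectors, so the path is written in its own slash-separated syntax (`head/title` playing the role of `html > head > title`), and it only accepts perfectly well-formed markup, which many real web pages are not. That is one reason the lab uses `scrapy` instead.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the index.html page from earlier in this notebook.
PAGE = """<html>
  <head>
    <title>My First Webpage</title>
  </head>
  <body>
    <h1>Hello World! This is my first h1 heading</h1>
  </body>
</html>"""

root = ET.fromstring(PAGE)       # root is the <html> element
title = root.find("head/title")  # step into <head>, then into <title>

print(root.tag)
print(title.text)
```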

<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

🖇️ **USEFUL LINK:**

- [CSS Selectors](https://www.w3schools.com/cssref/css_selectors.php): to see a full list of all CSS selectors you could use to specify the path to a tag or a tag attribute.

</div>

**How do I practice that outside of Python?**

When inspecting a page, you can use the search bar at the top of the HTML code and search using CSS selectors. For example, if I want to find the `<title>` tag, I can search for `html > head > title` and the browser will highlight the tag I am looking for:

![image.png](attachment:image.png)

### 3.1.2 How do I specify a selector for a tag with a specific class or ID?

Just like how we use a `.` before the name of the class in our CSS file, we use a `.` before the name of the class in our CSS selector. For example, if I want to select all `<p>` tags with the `highlighted` class, I would use the following selector:

```css
p.highlighted
```

This would identify all `<p>` tags with the `highlighted` class, regardless of whether they are nested inside another tag or not.

**NOTE:** When practising writing CSS selectors in Python using `scrapy`, pay close attention to the difference between `sel.get()` and `sel.getall()`. The first one returns just the first element that matches the rule, while the second one returns a list of all elements that match the rule.

Similarly, to collect a tag with a specific ID, we use a `#` before the name of the ID in our CSS selector. For example, if I want to select the `<section id="introduction">` tag, I would use the following selector:

```css
#introduction
``` 
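
Here is a small sketch of both ideas (select by class, select by ID) together with the "first match versus all matches" distinction. It uses Python's built-in `xml.etree.ElementTree` as a stand-in, so the queries are written in that module's path syntax rather than as CSS selectors, and `find()` / `findall()` play a role similar in spirit to `get()` / `getall()`. The second highlighted paragraph is made up so that there is more than one match.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<section id="introduction">
  <p>This is a normal paragraph.</p>
  <p class="highlighted">This is a highlighted paragraph.</p>
  <p>This is another normal paragraph.</p>
  <p class="highlighted">This is a second highlighted paragraph.</p>
</section>"""

root = ET.fromstring(SNIPPET)

# Roughly `p.highlighted`. Caveat: this compares the whole class attribute,
# so class="highlighted big" would NOT match here, although it would in CSS.
all_matches = root.findall(".//p[@class='highlighted']")  # a list of every match
first_match = root.find(".//p[@class='highlighted']")     # only the first match

print([p.text for p in all_matches])
print(first_match.text)

# Roughly `#introduction`. The search looks below the starting element,
# so we wrap the snippet in a <body> to be able to find the <section> itself.
wrapper = ET.fromstring(f"<body>{SNIPPET}</body>")
section = wrapper.find(".//*[@id='introduction']")
print(section.tag)
```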


### 3.1.3 How to specify a selector for a tag that contains a specific attribute?

What if the person who wrote the HTML code did not use `<p class="highlighted">` and provided an inline-style instead with the `style=` attribute? 

There is a way to specify selectors for tags that contain a specific attribute. For example, if I want to select all `<p>` tags that have a `style=` attribute, I would use the following selector:

```css
p[style]
```

If you do this in the browser, you will be able to navigate and search for all tags that have a `style=` attribute. Just hit Enter after typing the selector in the search bar.
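
Finding the tags that have a `style=` attribute is only half of the job if what you really want is "the paragraphs coloured red". The sketch below, again using the built-in `xml.etree.ElementTree` as a stand-in for a CSS selector library, first selects every `<p>` with a `style=` attribute and then reads the attribute's value in Python. The helper `parse_inline_style` is a hypothetical name, and it is deliberately naive: it would trip over values that themselves contain a `;`.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<body>
  <p>This is a normal paragraph.</p>
  <p style="color: red;">This is a highlighted paragraph.</p>
  <p style="font-weight: bold;">This is a bold paragraph.</p>
</body>"""

root = ET.fromstring(SNIPPET)

# Roughly `p[style]`: every <p> that has a style= attribute, whatever its value.
styled = root.findall(".//p[@style]")
print([p.text for p in styled])


def parse_inline_style(style):
    """Turn 'color: red; margin: 0' into {'color': 'red', 'margin': '0'}."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" in chunk:
            prop, value = chunk.split(":", 1)
            declarations[prop.strip()] = value.strip()
    return declarations


# Keep only the paragraphs whose inline style sets the colour to red.
red = [p.text for p in styled if parse_inline_style(p.get("style")).get("color") == "red"]
print(red)
```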

![image.png](attachment:image.png)

### 3.1.4 Complicated selectors

The most difficult part about writing CSS selectors is that you need to know the structure of the HTML document you are trying to scrape. Frequently, the structure of the document is not as straightforward as the example we saw above. You might find that you need to specify a more complicated path to the data you want to extract.

For example, look at the HTML snippet below that shows a nested structure with a bunch of divs and paragraphs. 

```html
<div id="main">
    <div class="container">
        <div class="row">
            <div class="col-md-3">
                <p>This is a paragraph</p>
            </div>
            <div class="col-md-6">
                <p>This is another paragraph</p>
            </div>
            <div class="col-md-3">
                <p>This is another paragraph</p>
            </div>
        </div>
        <div class="row">
            <div class="col-md-3">
                <p>This is a paragraph</p>
            </div>
            <div class="col-md-6">
                <p>This is another paragraph</p>
            </div>
            <div class="col-md-3">
                <p>This is another paragraph</p>
            </div>
        </div>
    </div>
</div>
```

Suppose you want to select only the paragraphs that are within the `col-md-6` divs. You could do this by specifying the following selector:

```css
div#main > div.container > div.row > div.col-md-6 > p
```

This would search for all `<p>` tags that are nested inside a `<div class="col-md-6">` tag, which is nested inside a `<div class="row">` tag, which is nested inside a `<div class="container">` tag, which is nested inside a `<div id="main">` tag.
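
The same walk down the tree can be sketched with Python's built-in `xml.etree.ElementTree`. This is a stand-in with its own path syntax, not a CSS selector engine: each `/` moves exactly one level down, mirroring each `>` in the selector above. The structure is the same as in the snippet, but the paragraph texts are changed (an assumption for illustration) so the matches are easy to tell apart.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div id="main">
    <div class="container">
        <div class="row">
            <div class="col-md-3"><p>Row 1, left</p></div>
            <div class="col-md-6"><p>Row 1, middle</p></div>
            <div class="col-md-3"><p>Row 1, right</p></div>
        </div>
        <div class="row">
            <div class="col-md-3"><p>Row 2, left</p></div>
            <div class="col-md-6"><p>Row 2, middle</p></div>
            <div class="col-md-3"><p>Row 2, right</p></div>
        </div>
    </div>
</div>"""

root = ET.fromstring(SNIPPET)  # root is <div id="main">

# container -> row -> col-md-6 -> p, one level at a time
path = "./div[@class='container']/div[@class='row']/div[@class='col-md-6']/p"
middle = [p.text for p in root.findall(path)]
print(middle)
```

Only the two middle paragraphs come back, one per row, because every step of the path has to match.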


### 3.1.5 Becoming a pro at CSS selectors

CSS selectors are flexible about the starting point of the path. You could have written a much shorter selector:

```css
div.col-md-6 p
```

This would still work, because all we care about is that the `<p>` tag is nested somewhere inside a `<div class="col-md-6">` tag.

⚠️ **IMPORTANT:** Note that the selector above uses a space, rather than `>`, between its two parts. With a space, we don't care if the `<p>` is _immediately_ nested inside the `<div class="col-md-6">` tag. It could be nested inside another tag that is nested inside the `<div class="col-md-6">` tag, and it would still match.

What if you _do_ want to match only the `<p>` tags that are immediately nested inside the `<div class="col-md-6">` tag? Then you would use `>` between the two parts:

```css
div.col-md-6 > p
```

(Don't confuse this with `+`, which means something else: `div.col-md-6 + p` matches a `<p>` that comes right _after_ the `<div>`, as its next sibling rather than as its child.)
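
In CSS, `>` matches only direct children, a space matches descendants at any depth, and `+` matches the next sibling. The sketch below shows the first two behaviours side by side using Python's built-in `xml.etree.ElementTree`, whose path syntax (not CSS) draws the same distinction with `/` for "directly inside" and `//` for "anywhere inside". The HTML snippet is made up for this illustration.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div id="main">
    <div class="col-md-6">
        <p>Directly inside the div</p>
        <blockquote>
            <p>One level deeper, inside a blockquote</p>
        </blockquote>
    </div>
    <p>After the div, not inside it</p>
</div>"""

root = ET.fromstring(SNIPPET)

# Like the CSS selector `div.col-md-6 > p`: direct children only.
direct_children = [p.text for p in root.findall(".//div[@class='col-md-6']/p")]

# Like the CSS selector `div.col-md-6 p`: any depth below the div.
any_depth = [p.text for p in root.findall(".//div[@class='col-md-6']//p")]

print(direct_children)
print(any_depth)
```

The last paragraph in the snippet is matched by neither query, because it sits next to the `<div>` rather than inside it. It is the one that the CSS selector `div.col-md-6 + p` would pick up.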

<span style="font-weight:bold;color:#c63c4a">Sometimes, it won't be possible to write a CSS selector that will select only the data you want. In those cases, write the selector that gets you as close as possible to the data you want, save it as a Python list and then use your Python skills to filter the data further.</span>
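
As a minimal sketch of that two-step approach: grab more than you need with a broad query, then narrow it down with ordinary Python. The list below is a simplified, made-up version of the reading list from the Syllabus snippet, and `xml.etree.ElementTree` again stands in for the selector library used in the lab.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<ul>
    <li>Book Chapter: What is Data Science?</li>
    <li>Book Chapter: Introduction</li>
    <li>Academic Article: Beyond Unicorns</li>
</ul>"""

root = ET.fromstring(SNIPPET)

# Step 1: a broad query that returns more than we need (every <li>).
items = [li.text for li in root.findall(".//li")]

# Step 2: plain Python keeps the book chapters and drops the label in front.
chapters = [text.split(": ", 1)[1] for text in items if text.startswith("Book Chapter")]

print(items)
print(chapters)
```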

Once again, the following link is a great resource to learn more about CSS selectors:

<div style="width:70%;border: 1px solid #aaa; border-radius:1em; padding: 1em; margin: 1em 0;">

🖇️ **USEFUL LINK:**

- [CSS Selectors](https://www.w3schools.com/cssref/css_selectors.php): to see a full list of all CSS selectors you could use to specify the path to a tag or a tag attribute.

</div>

# What's Next?

How does it all relate to the material of 👨‍🏫 [Week 04]() and how can you use this knowledge to improve your web scraping skills?

- I hope this notebook helped to clarify the concepts of HTML and CSS and how they relate to web scraping.

- The immediate next step is to go back to the 🛣️ [W04 Lab]() notebook and re-read it with this new knowledge in mind.

- If you felt a bit lost in 🛣️ [W04 Lab](), please reach out to me and let me know if things make more sense now!


**How does this link to the lecture?**

1. In the lecture, I talked about the difference between plain text files and binary files. I focused on CSV files as an example. HTML files are also plain text files that adhere to a specific structure (of nested elements, instead of rows and columns).

2. The rest of the tips and tricks I showed in the lecture will be most relevant after you go back to the Week 04 Lab notebook and start working on your 📝 [W05 Summative assignment](). What I showed there has more to do with what you do once you have collected the data as a list, or a dictionary, and you need to navigate said data structure to extract the information you want.

3. The lecture also featured an explanation of the data structure of the columns in a data frame. Given the changes in our lecture schedule since W03, this part of the content was not a requirement of your 📝 [W05 Summative assignment](), but it will be relevant for your W06-W07 Summative assignment. So, keep that in mind!


