🗓️ Week 02 – Day 01: The Internet and the Web

The Internet vs the Web, and the basics of web standards

Published: 15 July 2024

Let us now change our focus to the types of files one can encounter when collecting data from the Web, and the concepts underlying the Internet.

🥅 Learning Objectives

Review the goals for today

At the end of the day you should be able to:

  • Articulate the difference between the Internet and the Web
  • Use HTML and CSS to create a simple webpage
  • Write code to automate the collection of data from websites

👨‍🏫 Part I: Slides

Either click on the slide area below or click here to view it in fullscreen. Use the arrow keys on your keyboard to navigate the slides.

🎥 Looking for lecture recordings? You can only find those on Moodle.

📋 Part II: Activity – HTML and CSS

We will now create a simple webpage using HTML and CSS. This page will serve as a profile for you.

🎯 ACTION POINTS:

Preparation: Install the Live Server extension in VSCode.

  1. From within VSCode, create a new folder called profile in your ME204 directory.

  2. Create a new plain text file called index.html inside the profile folder.

    Use the following HTML template to get started:

    <!DOCTYPE html>
    <html lang="en">
    
    <head>
        <meta charset="utf-8">
        <title>MY MINI CV</title>
    </head>
    
    <body>
        <h1>NAME</h1>
        <figure>
            <img src="my_avatar_image.png" alt="DESCRIPTION OF IMAGE" style="height:5em" />
        </figure>
        <div id="bio">
            <section>
                <h2 class="title">👩‍💻 About me:</h2>
                <article>
                    <h3 class="subtitle">👩‍🎓 Education: </h3>
                    <span class="content">
                        <p>Master's in DEGREE_PROGRAMME at INSTITUTION (COUNTRY) (or equivalent)</p>
                        <p>BSc in DEGREE_PROGRAMME at INSTITUTION (COUNTRY) (or equivalent)</p>
                    </span>
                </article>
                <article>
                    <h3 class="subtitle">📚 Currently learning:</h3>
                    <span class="content">
                        <ul>
                            <li>Data Science Skill 1</li>
                            <li>Data Science Skill 2</li>
                            <li>Data Science Skill 3</li>
                            <li>Data Science Skill 4</li>
                        </ul>
                    </span>
                </article>
            </section>
        </div>
        <div id="contact">
            <h2 class="title">📞 Contact:</h2>
            <span class="content">
                <ul>
                    <li><img src="https://img.icons8.com/ios/50/000000/github.png" alt="GitHub logo" style="height:1.1em;vertical-align:middle" /> GitHub username: <a href="#">@username</a></li>
                    <li><img src="https://img.icons8.com/ios/50/000000/slack.png" alt="Slack logo" style="height:1.1em;vertical-align:middle" /> Slack username: <a href="#">@username</a></li>
                </ul>
            </span>
        </div>
    </body>
    
    </html>
  3. Save the file, then right-click it and select “Open with Live Server” to view the page in your browser.

  4. Add some CSS styling inside the <head> element of index.html. We will change the font, add some padding, turn the list items into little cards that sit side by side, and change the color of the links. Add the following code just after the <title> element:

    <style type="text/css">
        body {
            font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
            padding: 1em;
        }
    
        h1 {
            color: #333;
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        }
    
        h2 {
            color: #333;
        }
    
        h3 {
            color: #333;
        }
    
        .title {
            color: #333;
        }
    
        .subtitle {
            color: #333;
        }
    
        .content {
            color: #333;
        }
    
        ul {
            list-style-type: none;
            padding: 0;
        }
    
        li {
            background-color: #f9f9f9;
            padding: 1em;
            margin: 0.5em;
            border-radius: 5px;
            display: inline-block;
        }
    
        a {
            color: #333;
            text-decoration: none;
        }
    
        a:hover {
            color: #666;
        }
    
        img {
            vertical-align: middle;
        }
    
        figure {
            text-align: left;
        }
    
        #bio {
            margin-top: 2em;
        }
    
        #contact {
            margin-top: 2em;
            border-top: 1px solid #333;
        }
    
        article > h3 {
            margin: 0;
        }
    
        article > span {
            display: block;
            padding-left: 1em;
        }
    </style>
  5. Reload the page in your browser to see the changes.

  6. Add a new file called styles.css in the profile folder.

  7. Move all the CSS code from the <style> tag in the index.html file to the styles.css file.

  8. Delete the <style> tag from the index.html file and replace it with the following line:

    <link rel="stylesheet" type="text/css" href="styles.css">

    The above line will link the styles.css file to the index.html file.

  9. Reload the page in your browser to see the changes.

  10. Right-click somewhere on the page and click on Inspect to see the HTML and CSS code.

  11. Send your own HTML file to Slack when asked.
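
Steps 6 to 8 above (moving the inline CSS into styles.css and linking it from index.html) can also be scripted. The following is only a minimal sketch in Python, assuming the page contains a single <style> element like the one in step 4; the function name externalise_styles and the sample page string are hypothetical and only serve the illustration.

```python
import re
import textwrap

# Matches the first <style ...>...</style> element.
# Group 1 is the indentation before it, group 2 is the CSS inside it.
STYLE_BLOCK = re.compile(r"([ \t]*)<style\b[^>]*>(.*?)</style>",
                         re.DOTALL | re.IGNORECASE)

# The same <link> line used in step 8
LINK_TAG = '<link rel="stylesheet" type="text/css" href="styles.css">'


def externalise_styles(html: str) -> tuple[str, str]:
    """Return (new_html, css): the page with its first <style> element
    replaced by a <link> tag, plus the CSS that used to be inside it."""
    match = STYLE_BLOCK.search(html)
    if match is None:
        return html, ""  # no inline styles, nothing to move
    indent, raw_css = match.group(1), match.group(2)
    # Remove the indentation the CSS had inside the HTML file
    css = textwrap.dedent(raw_css).strip() + "\n"
    new_html = html[:match.start()] + indent + LINK_TAG + html[match.end():]
    return new_html, css


# A shortened, made-up version of the <head> from the activity
page = """<head>
    <title>MY MINI CV</title>
    <style type="text/css">
        body {
            padding: 1em;
        }
    </style>
</head>"""

new_page, css = externalise_styles(page)
print(new_page)  # the <head> now contains the <link> tag
print(css)       # the text that would go into styles.css
```

To apply this to the real files, one could read index.html with pathlib.Path.read_text, then write the two returned strings back to index.html and styles.css. A regular expression is enough for this one well-known template, but it is not a general HTML parser.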

💻 Part III: Collecting data from websites

In the afternoon, you will experiment with web scraping using Python more extensively. Right now, I will demonstrate how to use scrapy to extract just what we want from HTML documents.

Here's some initial code I will use:


from scrapy import Selector

# Assuming I know the path to a specific HTML file
filepath = 'path/to/your/file.html'

# Read the HTML file
with open(filepath, 'r', encoding='utf-8') as file:
    html = file.read()

# Parse the HTML into a Selector object
selector = Selector(text=html)

# Extract the title of the page as a string
# (without .get() we would only have a list of selectors, not the text)
title = selector.css('title::text').get()

# Get the element with a specific id, as an HTML string
bio = selector.css('#bio')
bio.get()

# Get all elements with a specific class, as a list of HTML strings
content_boxes = selector.css('.content')
content_boxes.getall()