Week 02 - Day 01: The Internet and the Web
The Internet vs the Web, and the basics of web standards
Let us now change our focus to the types of files one can encounter when collecting data from the Web, and the concepts underlying the Internet.
Learning Objectives
Review the goals for today
At the end of the day you should be able to:
- Articulate the difference between the Internet and the Web
- Use HTML and CSS to create a simple webpage
- Write code to automate the collection of data from websites
Part I: Slides
Either click on the slide area below or click here to view it in fullscreen. Use your keyboard to navigate the slides.
Looking for lecture recordings? You can only find those on Moodle.
Part II: Activity - HTML and CSS
We will now create a simple webpage using HTML and CSS. This page will serve as a profile for you.
Useful Links:
ACTION POINTS:
Preparation: Install the Live Server extension in VSCode.
From within VSCode, create a new folder called profile in your ME204 directory.
Create a new plain text file called index.html inside the profile folder.
Use the following HTML template to get started:
<!DOCTYPE html>
<html>
  <head>
    <title>MY MINI CV</title>
  </head>
  <body>
    <h1>NAME</h1>
    <figure>
      <img src="my_avatar_image.png" alt="DESCRIPTION OF IMAGE" style="height:5em" />
    </figure>
    <div id="bio">
      <section>
        <h2 class="title">About me:</h2>
        <article>
          <h3 class="subtitle">Education:</h3>
          <span class="content">
            <p>Master's in DEGREE_PROGRAMME at INSTITUTION (COUNTRY) (or equivalent)</p>
            <p>BSc in DEGREE_PROGRAMME at INSTITUTION (COUNTRY) (or equivalent)</p>
          </span>
        </article>
        <article>
          <h3 class="subtitle">Currently learning:</h3>
          <span class="content">
            <ul>
              <li>Data Science Skill 1</li>
              <li>Data Science Skill 2</li>
              <li>Data Science Skill 3</li>
              <li>Data Science Skill 4</li>
            </ul>
          </span>
        </article>
      </section>
    </div>
    <div id="contact">
      <h2 class="title">Contact:</h2>
      <span class="content">
        <ul>
          <li><img src="https://img.icons8.com/ios/50/000000/github.png" style="height:1.1em;vertical-align:middle" /> GitHub username: <a href="#">@username</a></li>
          <li><img src="https://img.icons8.com/ios/50/000000/slack.png" style="height:1.1em;vertical-align:middle" /> Slack username: <a href="#">@username</a></li>
        </ul>
      </span>
    </div>
  </body>
</html>
Save the file, then right-click on it and select "Open with Live Server" to view the page in your browser.
Add some CSS styling inside the <head> tag of the index.html file. The rules below change the fonts, add some padding, turn the list items into little cards that sit side by side, and change the color of the links. Add the following code just after the closing </title> tag:
<style type="text/css">
  body {
    font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
    padding: 1em;
  }
  h1 {
    color: #333;
    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  }
  h2 { color: #333; }
  h3 { color: #333; }
  .title { color: #333; }
  .subtitle { color: #333; }
  .content { color: #333; }
  ul {
    list-style-type: none;
    padding: 0;
  }
  li {
    background-color: #f9f9f9;
    padding: 1em;
    margin: 0.5em;
    border-radius: 5px;
    display: inline-block;
  }
  a {
    color: #333;
    text-decoration: none;
  }
  a:hover { color: #666; }
  img { vertical-align: middle; }
  figure { text-align: left; }
  #bio { margin-top: 2em; }
  #contact {
    margin-top: 2em;
    border-top: 1px solid #333;
  }
  article > h3 { margin: 0; }
  article > span {
    display: block;
    padding-left: 1em;
  }
</style>
Reload the page in your browser to see the changes.
Add a new file called styles.css in the profile folder.
Move all the CSS code from the <style> tag in the index.html file to the styles.css file (only the rules, not the <style> and </style> tags themselves).
Delete the <style> tag from the index.html file and replace it with the following line:
<link rel="stylesheet" type="text/css" href="styles.css">
The above line links the styles.css file to the index.html file.
Reload the page in your browser. It should look the same as before, since only the location of the CSS has changed.
Right-click somewhere on the page and click on Inspect to see the HTML and CSS code.
Send your own HTML file to Slack when asked.
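The page you viewed through Live Server in the steps above reaches your browser over HTTP, the same protocol the Web uses, only from your own machine. The sketch below illustrates that idea with Python's standard library instead of Live Server: it serves a folder on localhost and requests index.html from it. It is not how Live Server itself is implemented, and the temporary folder, the tiny stand-in page, and the function name fetch_from_local_server are assumptions made for illustration.

```python
import http.client
import tempfile
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def fetch_from_local_server(folder, filename):
    """Serve `folder` over HTTP on localhost and request one file from it."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(folder))
    # Port 0 asks the operating system for any free port
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # This is the request a browser would send for the page
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/" + filename)
        response = connection.getresponse()
        status = response.status
        content_type = response.getheader("Content-Type", "")
        body = response.read().decode("utf-8")
        connection.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, content_type, body


# Hypothetical stand-in for the profile folder, created in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp) / "profile"
    profile.mkdir()
    (profile / "index.html").write_text(
        "<!DOCTYPE html><html><head><title>MY MINI CV</title></head>"
        "<body><h1>NAME</h1></body></html>",
        encoding="utf-8",
    )
    status, content_type, body = fetch_from_local_server(profile, "index.html")

print(status, content_type)
```

To try it on your own page, pass the path of your real profile folder instead of the temporary one. Unlike Live Server, this sketch does not reload the browser when a file changes.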
Part III: Collecting data from websites
In the afternoon, you will experiment more extensively with web scraping in Python. Right now, I will demonstrate how to use scrapy to extract just what we want from HTML documents.
Here's some initial code I will use:
from scrapy import Selector

# Assuming I know the name of a specific HTML file
filepath = 'path/to/your/file.html'

# Read the HTML file
with open(filepath, 'r', encoding='utf-8') as file:
    html = file.read()

# Parse the HTML file
selector = Selector(text=html)

# Extract the title of the page (.get() returns the first match as a string)
title = selector.css('title::text').get()

# Get an object with a specific id
bio = selector.css('#bio')
bio.get()

# Get objects with a specific class
content_boxes = selector.css('.content')
content_boxes.getall()
Recommended extra content
Check out this WIRED interview with Sir Tim Berners-Lee, the inventor of the World Wide Web: