🗓️ Week 04
The Internet: protocols + scraping + APIs

DS105 Data for Data Science

10/18/22

The Internet and how it works

We are surrounded by hosts

  • Hosts are devices that can send or receive traffic
  • A host can be almost any networked device:
    • laptops
    • smartphones
    • PCs
    • supercomputers
    • etc.

Client-Server model

  • Hosts interact by exchanging “messages”.
  • Those hosts that send requests are called clients
  • Those that respond to them with content (web-pages, data, emails, etc.) are called servers

Such a request-response system is called the client-server model.
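
A minimal sketch of the client-server exchange described above, using Python's standard `socket` module. Both hosts live on the same machine here (address 127.0.0.1), and the message texts are made up for illustration; a real server would run on a different host and handle many clients.

```python
import socket
import threading


def run_server(listener: socket.socket) -> None:
    """Serve exactly one client: read its request and send back a response."""
    conn, _addr = listener.accept()
    with conn:
        request = conn.recv(1024).decode("utf-8")
        conn.sendall(f"You asked for: {request}".encode("utf-8"))


# The server listens on the local machine; port 0 lets the OS pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

server_thread = threading.Thread(target=run_server, args=(listener,))
server_thread.start()

# The client connects, sends a request and waits for the response.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"web page")
    reply = client.recv(1024).decode("utf-8")

server_thread.join()
listener.close()
print(reply)
```

The client starts the conversation and the server only answers, which is the defining feature of the model.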

IP addresses

  • Each host needs a unique name to communicate with others
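  • In networking this name is called an IP address
  • An IPv4 address is 32 bits long, i.e. a sequence of 32 ones and zeros
  • We split those bits into 4 chunks of 8 bits and write each chunk as a decimal number, which gives an address of the form A.B.C.D, where each part is between 0 and 255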
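
The conversion from 32 bits to the familiar dotted format can be sketched in a few lines of Python. The bit string below is a made-up example, not a real LSE address; the standard `ipaddress` module is used only to cross-check the manual calculation.

```python
import ipaddress

# 32 ones and zeros (a made-up example address)
bits = "00101100000101001000110000000111"

# Split into 4 chunks of 8 bits and read each chunk as a base-2 number.
chunks = [bits[i:i + 8] for i in range(0, 32, 8)]
octets = [int(chunk, 2) for chunk in chunks]
dotted = ".".join(str(octet) for octet in octets)

# Cross-check with the standard library, which accepts the address as an integer.
checked = ipaddress.IPv4Address(int(bits, 2))

print(chunks)
print(dotted)
print(checked)
```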

The hierarchy

  • IP addresses are assigned hierarchically
  • Each successive part of the IP address identifies a smaller part of the network

[Diagram: UK (44.XX.XX.XX) → LSE (44.20.XX.XX) → DSI (44.20.140.XX)]
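
The hierarchy can be checked with Python's `ipaddress` module. This is a sketch under assumptions: the `XX` placeholders are read as network prefixes of 8, 16 and 24 bits, and both host addresses are hypothetical. The UK/LSE/DSI numbers are the illustrative ones from the slide, not real allocations.

```python
import ipaddress

# Each layer fixes more of the leading bits of the address (assumed prefixes).
layers = {
    "UK": ipaddress.ip_network("44.0.0.0/8"),        # 44.XX.XX.XX
    "LSE": ipaddress.ip_network("44.20.0.0/16"),     # 44.20.XX.XX
    "DSI": ipaddress.ip_network("44.20.140.0/24"),   # 44.20.140.XX
}

dsi_host = ipaddress.ip_address("44.20.140.7")    # hypothetical host inside DSI
other_host = ipaddress.ip_address("44.20.99.1")   # hypothetical host elsewhere at LSE

dsi_membership = {name: dsi_host in net for name, net in layers.items()}
other_membership = {name: other_host in net for name, net in layers.items()}

print(dsi_membership)
print(other_membership)
```

A host inside DSI belongs to all three layers, while the second host shares only the first two parts of its address and so falls outside the DSI network.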

Networks

  • A collection of interconnected hosts that exchange traffic can be called a network
  • Examples of networks:
    • your house (laptop + printer + smartphone)
    • LSE (many PCs + laptops + supercomputer)
    • an office (laptops + printers + projectors)
  • If you connect all of these networks and add rules for how they communicate, called protocols, you get the Internet

Protocols

Why do we need protocols?

  • Before around 1973, computers had no unified system of rules for interacting with each other
  • In 1973, development started on what became TCP/IP (Transmission Control Protocol + Internet Protocol)
  • The result was a set of rules “that allowed any system to connect to any other system, using any network topology” (Hall 2000)

TCP/IP

  • TCP/IP refers to a family of different protocols
  • Each of these protocols serves a specific purpose

The most widely used protocols include:

  • Address Resolution Protocol (ARP)
  • Domain Name System (DNS)
  • File Transfer Protocol (FTP)
  • Internet Message Access Protocol (IMAP)
  • HyperText Transfer Protocol (HTTP)

HTTP example

  • A user sends a request to the server and gets back a web-page.
  • Today HTTP is usually used in its secure form, HTTPS, where the S stands for secure.

For example: https://lse.co.uk
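
To see the request and the response without depending on a real website, here is a hedged sketch that starts a tiny web server on the local machine and sends it one GET request. Everything in it (the page content, the `PageHandler` class, the local address) is made up for illustration, and it uses plain HTTP rather than HTTPS because HTTPS would also need a certificate.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the same small web page."""

    def do_GET(self):
        self.send_response(200)  # 200 means "OK"
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Server side: port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), PageHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send a GET request and read the response.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read().decode("utf-8")
conn.close()

server.shutdown()
server.server_close()
print(status, content_type)
print(body)
```

A browser does the same thing when you type in an address: it sends a GET request and renders the HTML that comes back.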

Web pages


Once the HTTP request is sent and accepted, we might get a web page

Tools for creating web-pages

There are 3 key languages for building web pages:

  • HTML (HyperText Markup Language) - used to create the “skeleton” of the page
  • CSS (Cascading Style Sheets) - used for styling
  • JavaScript - used for interactivity

During the course you will mostly be working with HTML.

<!DOCTYPE html>
<html>
  <body>

  <h1>My First Heading</h1>
  <p>My first paragraph.</p>

  </body>
</html>
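
Because HTML marks every piece of content with a tag, a program can pull specific pieces out of a page. A minimal sketch using Python's built-in `html.parser` on the page above; the `TextByTag` class is a hypothetical helper written for this example, and it is not necessarily the tool used later in the course.

```python
from html.parser import HTMLParser

HTML_PAGE = """<!DOCTYPE html>
<html>
  <body>
  <h1>My First Heading</h1>
  <p>My first paragraph.</p>
  </body>
</html>"""


class TextByTag(HTMLParser):
    """Collect the text found inside each tag we are interested in."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # the wanted tag we are currently inside, if any
        self.found = []       # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))


parser = TextByTag(["h1", "p"])
parser.feed(HTML_PAGE)
print(parser.found)
```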

Our course


<section id="course-brief" class="level1">
  <h1>📑 Course Brief</h1>
    <p><strong>Focus:</strong> learn how to collect and handle so-called “real data”</p>
    <p><strong>How:</strong> hands-on coding exercises and a group project</p>
</section>


API

  • Application
  • Programming
  • Interface

What is an API?

“APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols.”

Amazon Web Services

Example:

API endpoints

The client sends a request with:

  • Request parameters
  • (usually) API key

Request:

GET api.uber.com/get_trip
?key=BQND7361120
&lat=40.83008
&long=-39.38419
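
The request above is an address followed by `?` and a list of `name=value` parameters joined with `&`. A sketch of building it in Python with the standard `urllib.parse` module; the endpoint and key are the illustrative ones from the slide (not a real Uber API), and the `https://` scheme is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Illustrative endpoint from the slide; the https:// scheme is assumed.
base_url = "https://api.uber.com/get_trip"

# Request parameters, including the (made-up) API key.
params = {
    "key": "BQND7361120",
    "lat": "40.83008",
    "long": "-39.38419",
}

# urlencode joins the parameters with "&" and escapes special characters.
url = f"{base_url}?{urlencode(params)}"
print(url)

# Going the other way: split a URL back into its parameters.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

Nothing is sent over the network here; the code only assembles and takes apart the address a client would request.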

The server sends back a response, which can be:

  • data in JSON, plain text, XML or another format
  • etc.

Response:

{"price":"12.4 pounds",
"length": "22 minutes"}
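
A JSON response like this one maps directly onto a Python dictionary. A minimal sketch using the standard `json` module on the example response above; splitting the values into a number and a unit is an extra step added here for illustration.

```python
import json

# The example response from the slide, as the raw text a client would receive.
raw_response = """{"price":"12.4 pounds",
"length": "22 minutes"}"""

data = json.loads(raw_response)  # JSON text -> Python dictionary
print(data["price"])

# The values arrive as text, so numbers must be extracted before any arithmetic.
price_text, currency = data["price"].split()
price = float(price_text)
minutes = int(data["length"].split()[0])
print(price, currency, minutes)
```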

After the ☕ break:

  • Exploring the web pages and HTML code
  • Sending different requests to web pages
  • API endpoints

References

Hall, Eric. 2000. Internet Core Protocols: The Definitive Guide: Help for Network Administrators. O’Reilly Media, Inc.