DS205 – Advanced Data Manipulation
26 Jan 2026
The Internet
The Web

You can find your machine's IP address with `hostname -I` on Linux/macOS, or `ipconfig` on Windows. DNS maps domain names like wikipedia.org to the corresponding IP addresses so you don’t have to remember them.
“The most influential inventor of the modern world, Sir Tim Berners-Lee is a different kind of visionary. Born in the same year as Bill Gates and Steve Jobs, Berners-Lee famously shared his invention, the World Wide Web, for no commercial reward. Its widespread adoption changed everything, transforming humanity into the first digital species. Through the web, we live, work, dream and connect.
In this intimate memoir, Berners-Lee tells the story of his iconic invention, exploring how it launched a new era of creativity and collaboration while unleashing a commercial race that today imperils democracies and polarizes public debate. As the rapid development of artificial intelligence heralds a new era of innovation, Berners-Lee provides the perfect guide to the crucial decisions ahead – and a gripping, in-the-room account of the rise of the online world.”
(Synopsis from the publisher) Berners-Lee, T. (with Witt, S.). (2025). This is for everyone. Macmillan.
```mermaid
sequenceDiagram
    participant Browser
    participant Server
    Browser->>Server: HTTP Request
    Server-->>Browser: HTTP Response
```
💡 A Web browser sends many requests for one page, one for each resource (HTML, CSS, JavaScript, images, fonts, etc.).
HTTP Response Codes
HTTP responses come with ‘status codes’ that indicate whether the request was successful or not. Typical ones you will encounter are:

| Code | Meaning |
|---|---|
| 200 | OK: the request succeeded |
| 301 | Moved Permanently: the resource has a new URL |
| 403 | Forbidden: the server refuses to serve the request |
| 404 | Not Found |
| 500 | Internal Server Error |

You can find a full list here.
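If you are working in Python, the standard library's `http.HTTPStatus` enum gives you the mapping between codes and their standard reason phrases (stdlib only, nothing to install):

```python
from http import HTTPStatus

# Look up a status code's standard reason phrase
not_found = HTTPStatus(404)
print(not_found.value, not_found.phrase)   # 404 Not Found

ok = HTTPStatus.OK
print(ok.value, ok.phrase)                 # 200 OK
```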
requests.get() Is Not a Browser
The main components of a web page are built with these three languages:
| Language | Role |
|---|---|
| HTML | Structure and meaning |
| CSS | Appearance and layout |
| JavaScript | Behaviour and interactivity |
👉 When you write Markdown and then render it in VS Code or in a browser, that works because software has converted the Markdown into its HTML equivalent.
| Markdown | HTML |
|---|---|
| `**Bold**` | `<b>Bold</b>` |
| `_Italic_` | `<i>Italic</i>` |
| `# Heading` | `<h1>Heading</h1>` |
| `- List item` | `<ul><li>List item</li></ul>` |
| `[Link](https://example.com)` | `<a href="https://example.com">Link</a>` |
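You can try this conversion yourself. A small sketch using the third-party `markdown` package (one of several such converters; note that it emits the semantic `<strong>`/`<em>` tags rather than `<b>`/`<i>`):

```python
# Requires the third-party 'markdown' package: pip install markdown
import markdown

html = markdown.markdown("**Bold** and _Italic_")
print(html)  # <p><strong>Bold</strong> and <em>Italic</em></p>
```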
Why selectors matter for scraping
The same selector syntax that CSS uses to style elements is what we use to find elements when scraping. If you can select it for styling, you can select it for extraction.
| Selector | Meaning |
|---|---|
| `h1` | All `<h1>` elements |
| `.product` | Elements with `class="product"` |
| `#price` | The element with `id="price"` |
| `div.product` | `<div>` elements with class `product` |
| `ul li` | `<li>` elements inside a `<ul>` |
| `a[href]` | `<a>` elements that have an `href` attribute |
Try it in DevTools
Open the Console tab and run, for example, `document.querySelectorAll('.product')`.
This returns all elements matching that selector. The same logic powers Scrapy’s response.css() method.
The scraping problem
If content is loaded by JavaScript, it won’t appear in the HTML that requests.get() returns. You’ll get an empty container where the content should be.
This is why web scraping sometimes requires tools that actually run a browser, such as the Python package Selenium, to reach the content.
**Static (Scrapy works):** `requests.get()` matches what you see on screen.

**Dynamic (may need Selenium):** `requests.get()` shows empty containers or loading spinners.

**The developer’s intent:** They used `class="price"` so CSS could style it green and bold, for example:

£2.50

**Your intent:** You use that same class to find and extract the value.
The code inside the .css() method is a CSS selector that finds the element with the class price and extracts the text content of that element.
Website owners can publish a file at /robots.txt that tells crawlers which parts of the site they’d prefer not to be scraped.
This is a request, not a technical barrier. Your scraper can ignore it. Whether you should ignore it is an ethical question.
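Python's standard library can parse these files for you. A minimal sketch with a made-up robots.txt supplied as a list of lines (`urllib.robotparser`, stdlib only):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as lines for illustration
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

blocked = rp.can_fetch("*", "https://example.com/private/report")
allowed = rp.can_fetch("*", "https://example.com/products")
print(blocked, allowed)  # False True
```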
Our approach in DS205
We respect robots.txt where it exists: a polite scraper is less likely to get blocked.
Data Collection at Scale
Benefits and Advantages
Real-World Applications
Market Research
Academic Research
Financial Analysis
Job Market Analysis
News Monitoring
For more use cases, read Chapter 3 of Ryan Mitchell (2024), Web Scraping with Python (3rd ed.), O’Reilly Media.

Icons are by Flat Icons – Flaticon.

Here is a tutorial by someone who used web scraping to search for jobs.
👤 In the same vein: a former DS205 student used web scraping to build a platform for finding entry-level finance jobs. I have invited him to come and share his experience with you (Weeks 07–08).

👈🏻 This paper from the field of digital humanities offers a valuable perspective on how to think about the Web for archival and research purposes.
While published several years ago and originating outside data science, its core concept (viewing the Web as a structured dataset rather than an unorganized mass of information) matches our approach to web scraping in this course and perhaps data science applications of web scraping more broadly.
Black, M.L. (2016). The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and Beyond through Internet Research. Int. J. Humanit. Arts Comput., 10, 95-109.
👨‍💻 I will be working through the W02-NB01-Lecture-Scrapy.ipynb notebook, available on Nuvolos and for download from the 🖥️ W02 Lecture page.
💻 Tomorrow’s Lab
You’ll practise scraping a UK supermarket website using the techniques from today’s lecture.
The lab notebook includes instructions for setting up a conda environment with the packages you’ll need.
Come prepared to use both Scrapy and Selenium. Some websites require one, some require the other. Part of the skill is figuring out which.
✍️ Problem Set 1
Instructions released after tomorrow’s lab.
I won’t impose a particular file structure just yet. But as you start to write code for your first problem set this week, keep asking yourself:
“How easy would someone else find it to follow my code and understand my file organisation?”
That question will matter more than you might expect!
We will build a collective understanding of how we want our repositories to look and how we want to organise our code.


LSE DS205 (2025/26)