---
title: "💻 Week 02 - Web scraping script"
date: 11 October 2023
subtitle: "2023/24 Autumn Term"
author: Dr. [Ghita Berrada](https://www.lse.ac.uk/DSI/People/Ghita-Berrada), Garima Chaudhary
---

Below is the script that was showcased during the week 2 class. You can try it out by yourselves on Nuvolos or [Google Colab](https://colab.google/).


## Python libraries imports

In this step, you import all the Python libraries you need to execute the code that follows. Some libraries are part of the Python Standard Library (see [here](https://docs.python.org/3.10/library/index.html) for the list of packages included in the standard library in Python 3.10): for example `re`, the library that deals with regular expressions (i.e. string pattern matching), and `datetime`, the library for date and time manipulation. Others are not part of the standard library and need to be installed (using a `pip install` or a `conda install` command) before you import them, otherwise you get an error at import time. These include `requests`, the library designed to send HTTP requests and used in web scraping; `pandas`, the library for dataframe manipulation and one of the main libraries in use in data science; `BeautifulSoup` (installed as `beautifulsoup4`), the library used to parse the scraped HTML; and `spacy`, the natural language processing library used later on for Named Entity Recognition (its English model `en_core_web_sm` also has to be downloaded separately).
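Before running the imports, it can help to check which of these libraries are actually available in your environment. The snippet below is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical, and note that `BeautifulSoup` is imported under the module name `bs4`.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Module names as they appear in the import statements
for name in ["re", "datetime", "requests", "bs4", "pandas", "spacy"]:
    status = "available" if is_installed(name) else "missing: install it first"
    print(f"{name}: {status}")
```

Any module reported as missing has to be installed with `pip install` or `conda install` before the import cell will run.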

In [None]:
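# Import the necessary Python libraries (third-party ones must be installed in your Python environment first)

import requests   # third-party library for sending HTTP requests
from bs4 import BeautifulSoup, Tag   # third-party library for parsing HTML
import pandas as pd   # third-party library for dataframe manipulation
import re   # standard library module for regex (pattern matching)
import datetime   # standard library module for dates and times
import spacy   # third-party NLP library (used later for Named Entity Recognition)

# Load the English model (it must be downloaded first, e.g. with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")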

## Sending an HTTP GET request to Wikipedia

This step is about sending a request for data from the Wikipedia page you are interested in, i.e. the current events page. If your request is successful, you get the response code 200. If not, you get an error code, e.g. 404 (Not Found).
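The meaning of a status code can be looked up with the standard library's `http.HTTPStatus`. This is a small sketch, independent of the scraping code; the helper name `describe_status` is hypothetical. With a real response you could pass it `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found' for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

for code in (200, 403, 404):
    print(describe_status(code))
```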

In [None]:
# URL of the webpage with the ongoing events
url = "https://en.wikipedia.org/wiki/Portal:Current_events"

# Send an HTTP GET request to the URL
response = requests.get(url)
print("response:", response)

What is happening? The `requests` library sends a bare-bones request that looks suspicious to Wikipedia (it looks as if it was generated by a bot), so Wikipedia blocks the request.

How can we solve this?

When you visit a website with Chrome, Safari, or Firefox, your browser automatically sends extra information about itself. For example, it might say: "I am Chrome version 119 running on macOS."

This information is called the "User-Agent".

Many websites check the User-Agent to decide whether to trust the request. If the User-Agent is missing or does not look like a browser (by default, Python's `requests` identifies itself as `python-requests/<version>`), the site may think you are a bot and block you (with an error like 403 Forbidden). So we'll set a browser-like User-Agent to get around this problem.
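To see what `requests` announces about itself, you can inspect its default User-Agent and build a request without sending it. This is a minimal sketch: the URL and the User-Agent string are placeholders, and no network access is needed because the request is only prepared.

```python
import requests

# The User-Agent that requests sends when no headers are supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Prepare (but do not send) a request with a custom header to inspect it.
# Both the URL and the User-Agent string are placeholders.
prepared = requests.Request(
    "GET",
    "https://example.org/",
    headers={"User-Agent": "Mozilla/5.0 (placeholder browser string)"},
).prepare()
print(prepared.headers["User-Agent"])
```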

In [None]:
# Step 1: Choose the web page we want to access
# In this case, it's Wikipedia's "Current events" portal.
url = "https://en.wikipedia.org/wiki/Portal:Current_events"

# Step 2: Add headers so the website accepts our request
#
# When you visit a website with Chrome, Safari, or Firefox,
# your browser automatically sends extra information about itself.
# For example, it might say: 
#   "I am Chrome version 119 running on macOS."
#
# This information is called the "User-Agent".
#
# Many websites check the User-Agent to decide whether
# to trust the request. If it is missing or does not look like
# a browser (by default, 'requests' sends "python-requests/<version>"),
# the site may think you are a bot and block you (with an error like 403 Forbidden).
#
# To avoid this, we set a User-Agent string that looks like
# a real web browser. This way, the website "thinks"
# we are visiting normally, just like a person using Chrome.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/119.0.0.0 Safari/537.36"
}

# Step 3: Send the request to the website
# This asks the server to send us the page's HTML code.
response = requests.get(url, headers=headers)

# Step 4: Print the response object
# This shows the HTTP status code (e.g., 200 = OK, 403 = Forbidden).
# For now, we only care that it returns <Response [200]>,
# which means the request worked.
print(response)

## Scraping the ongoing events column for content

In [None]:
# Check if the request worked (status code 200 means OK)
if response.status_code == 200:
    # Parse the HTML content of the page with BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # This will hold all scraped events as dictionaries
    events = []

    # Each day on the page is inside a <div> with class="current-events-main"
    for day_block in soup.find_all("div", class_="current-events-main"):
        
        # Get the date of the events
        # Some blocks have a <span class="bday"> with the date in YYYY-MM-DD format
        bday = day_block.select_one("span.bday")
        if bday:
            date_iso = bday.get_text(strip=True)  # e.g. "2025-09-27"
        else:
            # If not found, fall back to attributes like aria-label or id
            date_iso = day_block.get("aria-label") or day_block.get("id") or ""

        # The main content (headlines + descriptions) is inside
        # <div class="current-events-content">
        content = day_block.find(class_="current-events-content")
        if content is None:
            continue  # skip if nothing found

        # Go through all direct children of this content block
        for child in content.children:
            # We only care about tags (<ul>, <div>, etc.), not text or whitespace
            if not isinstance(child, Tag):
                continue

            # The events for each category are listed inside <ul> elements
            if child.name == "ul":
                # Loop through each <li> (one event per <li>)
                for li in child.find_all("li", recursive=False):

                    # ----- HEADLINE -----
                    # Try to take the first <a> tag inside the <li> as the headline
                    headline = ""
                    for c in li.contents:
                        if isinstance(c, Tag) and c.name == "a":
                            headline = c.get_text(strip=True)
                            break
                    # If no <a> found, fall back to bold text <b>
                    if not headline:
                        btag = li.find("b")
                        if btag:
                            headline = btag.get_text(strip=True)

                    # ----- DESCRIPTION -----
                    # Some events have extra details in a nested <ul>
                    nested_ul = li.find("ul")
                    if nested_ul:
                        # Combine all nested <li> items into one string
                        desc_parts = [
                            x.get_text(" ", strip=True) for x in nested_ul.find_all("li")
                        ]
                        description = " ".join(desc_parts).strip()
                    else:
                        # Otherwise, just take the full text of the <li>
                        full_text = li.get_text(" ", strip=True)
                        # Remove the headline part from the description if it repeats
                        if headline and full_text.startswith(headline):
                            description = full_text[len(headline):].strip(" \u2013\u2014-:,.()")
                        else:
                            description = full_text

                    # Save this event in our list
                    events.append({
                        "date": date_iso,
                        "headline": headline,
                        "description": description
                    })

    # Convert the list of dictionaries into a Pandas DataFrame (like a table)
    df = pd.DataFrame(events, columns=["date", "headline", "description"])

    # Print a quick summary
    print(f"Extracted {len(df)} events")
    print(df.head(15))  # show first 15 rows

else:
    # If the page couldn't be reached, print the status code
    print(f"HTTP request failed with status code {response.status_code}.")
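The scraping loop relies on `find_all("li", recursive=False)` to separate top-level events from their nested details. The sketch below illustrates the difference on a made-up HTML fragment (not the real Wikipedia markup); the variable name `soup_demo` is chosen so that it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment mimicking the nested list structure
html = """
<ul>
  <li><a href="#">Headline A</a>
    <ul><li>Detail one</li><li>Detail two</li></ul>
  </li>
  <li><a href="#">Headline B</a></li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")
top_ul = soup_demo.find("ul")

# recursive=False keeps only the direct <li> children (the events)
direct_items = top_ul.find_all("li", recursive=False)
# The default (recursive=True) also returns the nested detail items
all_items = top_ul.find_all("li")

print(len(direct_items), len(all_items))
print([li.find("a").get_text(strip=True) for li in direct_items])
```

Without `recursive=False`, every nested detail would be stored as a separate event.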

The code in the small chunk below simply displays the content of the Wikipedia page that you are scraping, i.e. the content that you obtain from this line of code: `soup = BeautifulSoup(response.content, "html.parser")`

In [None]:
print(soup) # Displays the content of the scraped Wikipedia page (HTML format)

```python
#Another way of displaying the dataframe
df
```


## Grouping events with the same `headline`

Our dataframe `df` is pretty much clean already (one row per event per date) and we could leave it as is for further analysis.

But let's try and do some further processing. As you might have noticed, some events sharing a `headline`, e.g. "Gaza war", appear on multiple lines (with different dates and descriptions). We'll try and group all events sharing a `headline` together (so that there's one row per `headline`).

We first start by making the headlines consistent (some might have leading or trailing spaces, or differ only in capitalization):

In [None]:
df['headline'] = df['headline'].str.strip().str.lower()

We then group by headline, keeping the first and last dates at which the event appears (aggregating the dates with `min` and `max`) and merging all event descriptions:


In [None]:
grouped_df = df.groupby('headline').agg({
    'date': ['min', 'max'],  # first and last dates
    'description': lambda x: ' | '.join(x.dropna())  # merge all descriptions
})

# Flatten MultiIndex columns
grouped_df.columns = ['first_date', 'last_date', 'merged_description']
grouped_df = grouped_df.reset_index()

grouped_df
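The same grouping can be tried on a tiny made-up dataframe to see what each aggregation produces. This sketch uses pandas' named aggregation, an alternative syntax that sets the output column names directly, so no flattening step is needed; the rows and the names `toy` and `toy_grouped` are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for the scraped events
toy = pd.DataFrame({
    "headline": ["flood", "election", "flood"],
    "date": ["2025-09-03", "2025-09-01", "2025-09-01"],
    "description": ["Waters recede", "Polls open", "Heavy rain"],
})

toy_grouped = toy.groupby("headline").agg(
    first_date=("date", "min"),   # earliest date per headline
    last_date=("date", "max"),    # latest date per headline
    merged_description=("description", lambda s: " | ".join(s.dropna())),
).reset_index()

print(toy_grouped)
```

Because the dates are strings in `YYYY-MM-DD` format, `min` and `max` compare them alphabetically, which coincides with chronological order for this format.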

## Calculating event duration

`grouped_df` already gives us `first_date` and `last_date`, so calculating the duration of events and adding this information to `grouped_df` is rather straightforward. Note that for the calculations to work we need to make sure `first_date` and `last_date` are converted to `datetime`.

In [None]:
# Convert to datetime if not already
grouped_df['first_date'] = pd.to_datetime(grouped_df['first_date'])
grouped_df['last_date'] = pd.to_datetime(grouped_df['last_date'])

# Calculate duration in days
grouped_df['duration_days'] = (grouped_df['last_date'] - grouped_df['first_date']).dt.days

# Optionally, in years (approximate: assumes 365 days per year)
grouped_df['duration_years'] = grouped_df['duration_days'] / 365
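To see what this calculation does for a single event, here is the same subtraction on two made-up dates using only the standard library's `datetime` module. Subtracting two dates gives a `timedelta`, whose `.days` attribute plays the role of `.dt.days` in pandas.

```python
from datetime import date

first = date.fromisoformat("2025-09-01")
last = date.fromisoformat("2025-09-27")

duration = last - first       # a timedelta object
print(duration.days)          # 26
print(duration.days / 365)    # the same duration in (approximate) years
```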

In [None]:
#print the dataframe with added duration
grouped_df

## Getting events with duration longer than a day

Many events start and end on the same day and so, for them, `duration_days` is 0.

We want to try and find events where `duration_days` is greater than 0:

In [None]:
pos_duration = grouped_df.query('duration_days > 0').copy()  # .copy() so that new columns can be added safely later
pos_duration
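As a side note, `query` is one of two common ways to filter rows; a boolean mask gives the same result. The sketch below uses a made-up dataframe, and the names `toy` and `long_events` are illustrative only. It also shows why taking a `.copy()` of the filtered rows is useful when columns are added afterwards.

```python
import pandas as pd

toy = pd.DataFrame({"headline": ["a", "b", "c"], "duration_days": [0, 3, 0]})

# Two equivalent ways of keeping the rows with a positive duration
via_query = toy.query("duration_days > 0")
via_mask = toy[toy["duration_days"] > 0]
print(via_query.equals(via_mask))

# .copy() gives an independent dataframe, so adding a column later
# does not run into pandas' SettingWithCopyWarning
long_events = via_query.copy()
long_events["is_long"] = True
print(long_events)
```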

## Extracting Event Locations from `merged_description`


In this step, we aim to assign a **single location** to each event from `pos_duration` based on its description. Because the `merged_description` column often contains multiple sentences, news sources, or even multiple locations, we'll use a **simple regex-based approach**. This method is not perfect, but it's useful for teaching basic string processing and the challenges of real-world data.

The process has three main parts:

1. **Initial Extraction**: Try to find common keywords or the first capitalized word that looks like a location.
2. **Mapping / Correction**: Some extracted words may be adjectives (like "Malawian"); we map them to a proper country name.
3. **Standardization**: Capitalize the corrected location for consistency.
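The three parts can be sketched in a few lines on a single made-up sentence. This is a deliberately reduced illustration, not the function used below: the function name `sketch_location`, the two-entry mapping, and the example sentence are all assumptions made for the demo.

```python
import re

# Hypothetical adjective-to-country mapping, for illustration only
ADJECTIVE_TO_COUNTRY = {"malawian": "malawi", "ukrainian": "ukraine"}

def sketch_location(text):
    # 1. Initial extraction: first word matching a known adjective
    match = re.search(r"\b(?:Malawian|Ukrainian)\b", text, re.IGNORECASE)
    if match is None:
        return None
    raw = match.group(0).lower()
    # 2. Mapping / correction: adjective -> country name
    corrected = ADJECTIVE_TO_COUNTRY.get(raw, raw)
    # 3. Standardization: consistent capitalization
    return corrected.title()

print(sketch_location("The Malawian president concedes the election."))
```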


### **Step 1: Define Extraction Function**

In [None]:
def extract_event_location(description):
    """
    Extracts a likely event location from the description.
    1. Try keywords and adjectives like 'North', 'South', 'Malawian', etc. (optionally followed by one more word).
    2. Check for known country names appearing anywhere in the text.
    3. Fallback: first capitalized word (ignoring common stop words)
    """
    if pd.isna(description):
        return None
    
    # Step 1: keyword + adjective matches
    pattern = r'\b(?:North|South|Sri|Philippine|Pakistani|Malawian|Gaza|Ukrainian|Colombian)\b(?:\s+\w+)?'
    match = re.search(pattern, description, re.IGNORECASE)
    if match:
        return match.group(0)
    
    # Step 2: look for country names directly
    countries = ['Malawi', 'Philippines', 'Palestine', 'Pakistan', 'Ukraine', 'Colombia', 'Taiwan', 'Hong Kong', 'China']
    for country in countries:
        if re.search(r'\b' + re.escape(country) + r'\b', description):
            return country
    
    # Step 3: fallback to first capitalized word not in stoplist
    words = re.findall(r'\b[A-Z][a-z]+\b', description)
    stopwords = ['The', 'Former', 'Ten', 'All', 'At', 'War', 'In', 'Philippine', 'Pakistani', 'Malawian', 'Gaza', 'Ukrainian', 'Colombian']
    for w in words:
        if w not in stopwords:
            return w
    
    return None

### **Step 2: Apply Extraction**

In [None]:
# Apply the extraction function to each event
pos_duration['event_location'] = pos_duration['merged_description'].apply(extract_event_location)

At this point, `event_location` contains the **raw extracted location**, which may be an adjective (e.g., "Malawian") or a partial place name (e.g., "Gaza").

Here's what `pos_duration` looks like after this processing:

In [None]:
pos_duration

### **Step 3: Correct Country Names**


Our location extraction works well in some ways:

- **Miners → Colombia** is correctly picked.
- Events like **Gaza war, Malawian election, Typhoon Philippines** are reasonable.

However, the extraction is still imperfect:

- **"Malawian president"** (still contains extra words; ideally we want just "Malawi")
- **"Philippine Daily" / "Philippine Department"** (picked from news sources; we want "Philippines")
- **"Ukrainian territories"** (it's okay-ish, but could be normalized to "Ukraine")
- **"Pakistani Taliban"** (contains a group adjective; ideally mapped to "Pakistan")

These are normal challenges of regex-based extraction on messy descriptions: **regex is simple but brittle** (not to mention that **regex** requires some prior knowledge of the text being parsed).
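The extra words come from the optional `(?:\s+\w+)?` group at the end of the pattern, which also captures whatever word follows the keyword. The sketch below compares a shortened version of the pattern with and without that group, on a made-up sentence.

```python
import re

# Shortened versions of the extraction pattern, for illustration only
with_extra_word = r"\b(?:Malawian|Philippine)\b(?:\s+\w+)?"
keyword_only = r"\b(?:Malawian|Philippine)\b"

# Made-up sentence in which a news source is named before the actual place
text = "According to the Philippine Daily Inquirer, the typhoon hit Visayas."

print(re.search(with_extra_word, text).group(0))  # Philippine Daily
print(re.search(keyword_only, text).group(0))     # Philippine
```

Dropping the optional group removes the extra word, but it would also truncate legitimate two-word names such as "South Korea", which is why a correction step is still needed.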

So, how do we fix the issues we've identified?

We'll **add a post-processing mapping** to clean common messes:

In [None]:
location_mapping = {
    'Malawian president': 'Malawi',
    'Philippine Daily': 'Philippines',
    'Philippine Department': 'Philippines',
    'Pakistani Taliban': 'Pakistan',
    'Ukrainian territories': 'Ukraine',
    'Gaza City': 'Palestine'
}

pos_duration['correct_location'] = pos_duration['event_location'].apply(
    lambda x: location_mapping.get(x, x)
).str.title()
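Two details of this cell are worth trying in isolation with plain Python: `dict.get(key, default)` leaves unmapped values untouched, and `title()` capitalizes every word, which can distort acronyms. The shortened `mapping` below is illustrative only.

```python
mapping = {"Gaza City": "Palestine"}   # shortened, illustrative mapping

for raw in ["Gaza City", "Colombia"]:
    # .get(key, default) returns the default when the key is absent
    print(raw, "->", mapping.get(raw, raw))

# title() capitalizes the first letter of each word and lowercases the rest
print("hong kong".title())
print("USA".title())
```

If acronyms such as "USA" could appear among the extracted locations, they would need their own entries in the mapping or should be excluded from the title-casing step.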

What does our dataframe look like now?

In [None]:
pos_duration

**Takeaway**

* regex **can extract locations roughly**.
* **manual mapping / correction** is needed when data contains news sources, group names, or multiple locations.
* To automate this location extraction more reliably, we could use **Named Entity Recognition** (or **NER**). Instead of relying on fragile regex rules, we can use a pre-trained NER model to detect named entities in text. Entities labeled as `GPE` (Geo-Political Entities) usually correspond to countries, cities, or regions. This approach is much more flexible and can handle multiple locations in a description automatically.

Here's how NER would work:

- just as before, we define a function to extract locations (this time relying on NER instead of regex)

In [None]:
def extract_locations_ner(text):
    """
    Extract GPE entities (cities, countries, regions) from text using spaCy NER.
    Returns the first location found for simplicity.
    """
    if pd.isna(text):
        return None
    
    doc = nlp(text)
    # Collect entities labeled as GPE (Geo-Political Entities)
    gpe_entities = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    
    if gpe_entities:
        # Pick the first one (or could join all for multiple locations)
        return gpe_entities[0]
    return None
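The function above keeps only the first location. If you wanted to keep all of them, one option is to join the unique `GPE` entities into a single string. The sketch below shows that joining step without needing spaCy: the helper name `join_gpe_entities` is hypothetical, and the hand-written `(text, label)` pairs stand in for what `[(ent.text, ent.label_) for ent in doc.ents]` would produce.

```python
def join_gpe_entities(entities, separator="; "):
    """Join the unique GPE texts from (text, label) pairs, or return None if there are none."""
    seen = []
    for text, label in entities:
        if label == "GPE" and text not in seen:
            seen.append(text)   # keep first-appearance order, skip duplicates
    return separator.join(seen) if seen else None

# Hand-written pairs standing in for real NER output
sample_entities = [
    ("Gaza City", "GPE"),
    ("Monday", "DATE"),
    ("Israel", "GPE"),
    ("Gaza City", "GPE"),
]
print(join_gpe_entities(sample_entities))
```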

- then, we apply the function to all dataframe rows

In [None]:
pos_duration['ner_location'] = pos_duration['merged_description'].apply(extract_locations_ner)

- we review our results:

In [None]:
pos_duration

**NER (`ner_location`)**:

   -  Extracts the **first detected named entity labeled as a location (GPE)**.
   -  Picks subregions/cities for some events:

     - "Visayas" for Typhoon in the Philippines
     - "Segovia" for miners
     - "Lakki Marwat District" for Khyber Pakhtunkhwa insurgency
   - Correctly identifies "Gaza City" and "Ukraine" in other cases.
   - But, it sometimes misses expected country names if the description emphasizes local areas first (e.g., "Malawian election" → `None`).

In sum:

   - Regex is **deterministic and controllable** but brittle.
   - NER is **more flexible**, can find multiple granular locations, but may **miss the overall country** if the text emphasizes a city or district.


| Method | Strengths                                      | Weaknesses                                          |
| ------ | ---------------------------------------------- | --------------------------------------------------- |
| Regex  | Predictable, easy to explain                   | Fragile, misses subregions, needs mapping           |
| NER    | Can find cities, districts, multiple locations | May miss the "main" country, requires library/model |


## Dropping and reordering columns

In [None]:
#EXTRA (Just to showcase that the placing of columns can be changed easily)

# Get the current column names
columns = pos_duration.columns.tolist()

# Move the 'correct_location' column right after the 'merged_description' column
columns.remove('correct_location')
columns.insert(columns.index('merged_description') + 1, 'correct_location')

# Reorder the columns in the DataFrame
pos_duration = pos_duration[columns]
pos_duration
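The reordering itself is ordinary list manipulation, so it can be tried on a plain list first. The column names below are a made-up subset used only to show where the moved name ends up.

```python
columns = ["headline", "first_date", "merged_description", "duration_days", "correct_location"]

# Take the name out of the list, then re-insert it right after 'merged_description'
columns.remove("correct_location")
columns.insert(columns.index("merged_description") + 1, "correct_location")
print(columns)
```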

Let's now drop the unnecessary columns (`event_location` and `ner_location`):

In [None]:
pos_duration = pos_duration.drop(columns=['event_location', 'ner_location'])

In [None]:
pos_duration