πŸ’» Week 05 Lab

Dynamic Web Scraping with Selenium

Published: 18 February 2025

πŸ₯… Learning Goals
By the end of this lab, you will learn how to: i) set up Selenium for browser automation, ii) handle dynamic content that Scrapy can’t access, iii) use explicit and implicit waits to handle page loading, iv) combine Selenium with Scrapy for comprehensive web scraping, and v) extract data from JavaScript-rendered pages.

Last Updated: 17 February 18:20

πŸ“Time and Location: Tuesday, 18 February 2025. Check your timetable for the precise time and location of your class.

πŸ“‹ Preparation

Before attending this lab, ensure that:

  1. You have completed the πŸ’» Week 04 lab and understand the basics of Scrapy.

  2. You are caught up with the concepts covered in the πŸ—£οΈ Week 05 Lecture, including XPath selectors and the ipdb debugging tool.

  3. You have installed Selenium in your virtual environment:

    pip install selenium

πŸ›£οΈ Lab Roadmap

While Scrapy is excellent for static content, some websites use JavaScript to load their content dynamically. This is where Selenium comes in: it automates a real browser, allowing us to interact with content that only appears after the JavaScript has run.

Part I: Understanding Browser Automation (20 min)

No need to wait for your class teacher. You can start working on this part on your own.

🎯 ACTION POINTS

  1. Create a new Python script called selenium_demo.py:

    from selenium import webdriver
    
    # Import the By class
    from selenium.webdriver.common.by import By
    
    # Initialize the Google Chrome browser
    # Change to another browser if you prefer:
    # https://selenium-python.readthedocs.io/installation.html#drivers
    driver = webdriver.Chrome()  
    
    # Navigate to a website
    driver.get("https://climateactiontracker.org/countries/")
    
    # Inspect the page
    import ipdb; ipdb.set_trace()


    If you’re working in a cloud environment without a graphical interface (like Nuvolos), use this modified version instead:

     from selenium import webdriver
     from selenium.webdriver.chrome.options import Options
     from selenium.webdriver.common.by import By
    
     # Set up Chrome options for headless operation
     chrome_options = Options()
     chrome_options.add_argument('--headless')  # Run in headless mode
     chrome_options.add_argument('--no-sandbox')  # Required for running in container
     chrome_options.add_argument('--disable-dev-shm-usage')  # Required for running in container
    
     # Initialize the Chrome browser with options
     driver = webdriver.Chrome(options=chrome_options)
    
     # Navigate to a website
     driver.get("https://climateactiontracker.org/countries/")
    
     # Inspect the page
     import ipdb; ipdb.set_trace()
  2. Run the script and observe what happens:

    • A Chrome browser should open automatically
    • It should navigate to the Climate Action Tracker website
    • You should be dropped into an ipdb prompt from which you can inspect the page
  3. Play with the multiple ways to select elements:

    Use find_element to get the equivalent of Scrapy’s response.css or response.xpath. You specify the selector using the By class.

    # Grab the first link on the page using a CSS selector
    driver.find_element(By.CSS_SELECTOR, "a")
    
    # Grab the first link on the page using XPath
    driver.find_element(By.XPATH, "//a[1]")
    
    # Grab the element whose id attribute is "link-id"
    driver.find_element(By.ID, "link-id")
    
    # Grab the first element with the class "link-class"
    driver.find_element(By.CLASS_NAME, "link-class")
    
    # Grab the first link on the page by its tag name
    driver.find_element(By.TAG_NAME, "a")
    
    # Grab the link whose visible text is exactly "link-text"
    driver.find_element(By.LINK_TEXT, "link-text")
  4. Interact with the page:

    You can perform actions on the page such as clicking on an element:

    # Click on the first link
    driver.find_element(By.CSS_SELECTOR, "a").click()

    You can also navigate back to the previous page:

    # Go back to the previous page
    driver.back()

    Or, say, change the zoom level:

    # Change the zoom level
    driver.execute_script("document.body.style.zoom = '200%'")

    Maybe you want to scroll to the bottom of the page:

    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    Or you just want to close the browser:

    # Close the browser
    driver.quit()
  5. Extract information from elements:

    Once you’ve located elements, you can extract various pieces of information:

    # Get element's text content
    element = driver.find_element(By.CSS_SELECTOR, "h1")
    print(element.text)  # Prints the text content
    
    # Get element's HTML
    print(element.get_attribute('outerHTML'))  # Gets full HTML including the element
    print(element.get_attribute('innerHTML'))  # Gets HTML inside the element
    
    # Get specific attributes
    link = driver.find_element(By.CSS_SELECTOR, "a")
    print(link.get_attribute('href'))  # Gets the link URL
    print(link.get_attribute('class'))  # Gets the class name
    
    # Get multiple elements
    elements = driver.find_elements(By.CSS_SELECTOR, "p")
    for element in elements:
        print(element.text)  # Prints text of each paragraph

Experiment with the different ways to interact with the page and try to understand how they work.
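
To tie the pieces of Part I together, here is a minimal sketch that prints the page heading and the URL of every link on the countries page. Note that find_element raises NoSuchElementException when nothing matches, while find_elements simply returns an empty list; the h1 selector here is just illustrative.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException
    
    driver = webdriver.Chrome()
    driver.get("https://climateactiontracker.org/countries/")
    
    # find_element raises NoSuchElementException when nothing matches,
    # so wrap lookups that might fail in try/except
    try:
        heading = driver.find_element(By.CSS_SELECTOR, "h1")
        print(heading.text)
    except NoSuchElementException:
        print("No <h1> found on this page")
    
    # find_elements returns a (possibly empty) list instead of raising
    for link in driver.find_elements(By.CSS_SELECTOR, "a"):
        print(link.get_attribute("href"))
    
    driver.quit()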

Part II: Handling Dynamic Content (20 min)

One of Selenium’s key features is its ability to wait for elements to load.

πŸ—£οΈ TEACHING MOMENT

Your class teacher will demonstrate how to locate elements using Selenium’s various selector methods and discuss the different ways to interact with the page. Follow along and ask questions.

  1. Implement explicit waits:

    Say you are working with a page that shows a loading spinner for a few seconds, and the content only appears once the spinner disappears.

    You can instruct Selenium to wait for the element to be present:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # Wait up to 10 seconds for element to be present
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".some-class"))
    )
  2. Practice with different wait conditions:

    Maybe an element needs to be clickable or visible before you can interact with it:

    # Wait for element to be clickable
    wait.until(EC.element_to_be_clickable((By.ID, "button-id")))
    
    # Wait for element to be visible
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "menu")))
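
Explicit waits target one condition at a time; Selenium also supports implicit waits, which tell the driver to keep polling the page for a set amount of time whenever an element is not immediately found. Also note that wait.until raises a TimeoutException if the condition is never met. A minimal sketch, reusing the driver and wait objects from above (the .results selector is a made-up placeholder):

    from selenium.common.exceptions import TimeoutException
    
    # Implicit wait: applies to every find_element/find_elements call,
    # polling for up to 5 seconds before raising NoSuchElementException
    driver.implicitly_wait(5)
    
    # Explicit waits raise TimeoutException when the condition never holds,
    # so catch it if the element might legitimately be absent
    try:
        element = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".results"))  # hypothetical selector
        )
    except TimeoutException:
        print("Element did not appear within the wait period")

Be aware that the Selenium documentation advises against mixing implicit and explicit waits in the same session, as the combination can produce unpredictable wait times; pick one strategy per script.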

Part III: Super Tech Support (50 min)

The rest of the time will be a 🦸🏻 SUPER TECH SUPPORT session. Use this time to practice web scraping with Scrapy or Selenium (you can embed Selenium in your Scrapy spiders, as sketched below), or to work on your upcoming πŸ“ Problem Set 1.
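
If you want to try embedding Selenium in a Scrapy spider, one common pattern is to let Selenium render the page and then hand the JavaScript-generated HTML back to Scrapy's selectors. Below is a minimal sketch of that pattern, not the only way to combine the two tools; the spider name and the a::attr(href) selector are illustrative. Note that the page is fetched twice here, once by Scrapy and once by the browser, which is acceptable for a first experiment but wasteful at scale.

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    
    class CountriesSpider(scrapy.Spider):
        name = "countries"
        start_urls = ["https://climateactiontracker.org/countries/"]
    
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            self.driver = webdriver.Chrome(options=chrome_options)
    
        def parse(self, response):
            # Render the page with a real browser, then hand the
            # JavaScript-generated HTML back to Scrapy's selectors
            self.driver.get(response.url)
            selector = scrapy.Selector(text=self.driver.page_source)
            for link in selector.css("a::attr(href)").getall():
                yield {"link": link}
    
        def closed(self, reason):
            # Shut down the browser when the spider finishes
            self.driver.quit()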

πŸ“š Resources