💻 Week 05 Lab
Dynamic Web Scraping with Selenium

Last Updated: 17 February 18:20
Time and Location: Tuesday, 18 February 2025. Check your timetable for the precise time and location of your class.
Preparation
Before attending this lab, ensure that:
You have completed the 💻 Week 04 lab and understand the basics of Scrapy.
You are caught up with the concepts covered in the 🗣️ Week 05 Lecture, including XPath Selectors and the debugging tool ipdb.
You have installed Selenium in your virtual environment:
pip install selenium
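To confirm the install worked, you could run a quick check inside the same virtual environment (a small sketch; the exact version number you see depends on when you install):

# Print the installed Selenium version
import selenium
print(selenium.__version__)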
🗣️ Lab Roadmap
While Scrapy is excellent for static content, some websites use JavaScript to load their content dynamically. This is where Selenium comes in: it automates a real browser, allowing us to interact with dynamic content.
Part I: Understanding Browser Automation (20 min)
No need to wait for your class teacher. You can start working on this part on your own.
🎯 ACTION POINTS
Create a new Python script called selenium_demo.py:

from selenium import webdriver
# Import the By class
from selenium.webdriver.common.by import By

# Initialize the Google Chrome browser
# Change to another browser if you prefer:
# https://selenium-python.readthedocs.io/installation.html#drivers
driver = webdriver.Chrome()

# Navigate to a website
driver.get("https://climateactiontracker.org/countries/")

# Inspect the page
import ipdb; ipdb.set_trace()
A "headless" approach
If you're working in a cloud environment without a graphical interface (like Nuvolos), use this modified version instead:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Set up Chrome options for headless operation
chrome_options = Options()
chrome_options.add_argument('--headless')               # Run in headless mode
chrome_options.add_argument('--no-sandbox')             # Required for running in a container
chrome_options.add_argument('--disable-dev-shm-usage')  # Required for running in a container

# Initialize the Chrome browser with options
driver = webdriver.Chrome(options=chrome_options)

# Navigate to a website
driver.get("https://climateactiontracker.org/countries/")

# Inspect the page
import ipdb; ipdb.set_trace()
Run the script and observe what happens:
- A Chrome browser should open automatically
- It should navigate to the Climate Action Tracker website
- You should get an ipdb prompt from which you can inspect the page (a few things to try are sketched just below)
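For instance, at that prompt you could poke at some of the driver's standard properties. A minimal sketch (all three attributes are part of Selenium's regular API; the output depends on the page):

driver.title              # The page's <title> text
driver.current_url        # The URL the browser is currently on
driver.page_source[:500]  # First 500 characters of the rendered HTML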
Play with the multiple ways to select elements:
Use find_element to get the equivalent of Scrapy's response.css or response.xpath. You specify the selector using the By class.

# Grab the first link on the page using a CSS selector
driver.find_element(By.CSS_SELECTOR, "a")

# Grab the first link on the page using XPath
driver.find_element(By.XPATH, "//a[1]")

# Grab an element by its ID
driver.find_element(By.ID, "link-id")

# Grab an element by its class name
driver.find_element(By.CLASS_NAME, "link-class")

# Grab the first link on the page using its tag name
driver.find_element(By.TAG_NAME, "a")

# Grab a link by its visible text
driver.find_element(By.LINK_TEXT, "link-text")
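One gotcha worth knowing: find_element returns only the first match and raises an exception when nothing matches, while find_elements returns a (possibly empty) list. A minimal sketch, assuming the driver from earlier (the "does-not-exist" ID is hypothetical, chosen to trigger the exception):

from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element(By.ID, "does-not-exist")  # Hypothetical ID that is not on the page
except NoSuchElementException:
    print("No element with that ID on this page")

# find_elements never raises: it returns an empty list when nothing matches
links = driver.find_elements(By.CSS_SELECTOR, "a")
print(f"Found {len(links)} links")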
Interact with the page:
You can perform actions on the page such as clicking on an element:
# Click on the first link
driver.find_element(By.CSS_SELECTOR, "a").click()
You can also navigate back to the previous page:
# Go back to the previous page
driver.back()
Or, say, change the zoom level:
# Change the zoom level
driver.execute_script("document.body.style.zoom = '200%'")
Maybe you want to scroll to the bottom of the page:
# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
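execute_script can also send a value from the browser back to Python, which is useful for checks such as how tall the page is before and after new content loads. A small sketch (document.body.scrollHeight is a standard DOM property):

# Read a value computed in the browser back into Python
page_height = driver.execute_script("return document.body.scrollHeight")
print(f"Page height: {page_height}px")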
Or you just want to close the browser:
# Close the browser
driver.quit()
Extract information from elements:
Once youβve located elements, you can extract various pieces of information:
# Get the element's text content
element = driver.find_element(By.CSS_SELECTOR, "h1")
print(element.text)                        # Prints the text content

# Get the element's HTML
print(element.get_attribute('outerHTML'))  # Gets full HTML including the element
print(element.get_attribute('innerHTML'))  # Gets HTML inside the element

# Get specific attributes
link = driver.find_element(By.CSS_SELECTOR, "a")
print(link.get_attribute('href'))          # Gets the link URL
print(link.get_attribute('class'))         # Gets the class name

# Get multiple elements
elements = driver.find_elements(By.CSS_SELECTOR, "p")
for element in elements:
    print(element.text)                    # Prints text of each paragraph
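In an actual scraper you would typically combine these calls to build structured records. A minimal sketch that collects one dictionary per link on the page (the key names text and href are our own choice, not part of the lab):

# Build a list of dicts, one per link on the page
rows = []
for link in driver.find_elements(By.CSS_SELECTOR, "a"):
    rows.append({"text": link.text, "href": link.get_attribute("href")})
print(rows[:5])  # Peek at the first five records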
Experiment with the different ways to interact with the page and try to understand how they work.
Part II: Handling Dynamic Content (20 min)
One of Selenium's key features is its ability to wait for elements to load.
🗣️ TEACHING MOMENT
Your class teacher will demonstrate how to locate elements using Selenium's various selector methods and discuss the different ways to interact with the page. Follow along and ask questions.
Implement explicit waits:
Say you are working with a page that shows a loading spinner for a few seconds, and the content only appears after the spinner disappears.
You can instruct Selenium to wait for the element to be present:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to be present
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".some-class"))
)
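If the element never shows up, wait.until raises a TimeoutException once the 10 seconds run out, so in a real scraper you may want to catch it. A minimal sketch reusing the wait object above:

from selenium.common.exceptions import TimeoutException

try:
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".some-class"))
    )
except TimeoutException:
    print("Element did not appear within 10 seconds")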
Practice with different wait conditions:
Maybe an element needs to be clickable or visible before you can interact with it:
# Wait for an element to be clickable
wait.until(EC.element_to_be_clickable((By.ID, "button-id")))

# Wait for an element to be visible
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "menu")))
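Expected conditions can also wait for something to disappear, which matches the spinner scenario described earlier. A one-line sketch (".spinner" is a hypothetical selector for the loading element):

# Wait for the loading spinner to go away before scraping the content
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".spinner")))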
Part III: Super Tech Support (50 min)
The rest of the time will be a 🦸🏻 SUPER TECH SUPPORT session. Use this time to practice web scraping with Scrapy, Selenium (you can embed Selenium in your Scrapy spiders, as sketched below), or to work on your upcoming Problem Set 1.
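If you want to try embedding Selenium in a Scrapy spider, one common pattern is to let Selenium render the JavaScript and then hand the final HTML to Scrapy's selectors. A minimal sketch using the Climate Action Tracker page from Part I (the spider name and the yielded field name are our own choices, not part of the lab):

import scrapy
from scrapy.selector import Selector
from selenium import webdriver

class CountriesSpider(scrapy.Spider):
    name = "countries"
    start_urls = ["https://climateactiontracker.org/countries/"]

    def parse(self, response):
        # Let a real browser render the JavaScript, then parse the result with Scrapy
        driver = webdriver.Chrome()
        driver.get(response.url)
        selector = Selector(text=driver.page_source)
        driver.quit()
        for href in selector.css("a::attr(href)").getall():
            yield {"link": href}

For simplicity this sketch opens and closes a browser inside parse; in a real spider you would reuse a single driver across requests.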