πŸ’» Week 05 Lab

Dynamic Web Scraping with Selenium

Published: 18 February 2025

πŸ₯… Learning Goals
By the end of this lab, you will learn how to: i) set up Selenium for browser automation, ii) handle dynamic content that Scrapy can’t access, iii) use explicit and implicit waits to handle page loading, iv) combine Selenium with Scrapy for comprehensive web scraping, and v) extract data from JavaScript-rendered pages.

Last Updated: 17 February 18:20

πŸ“Time and Location: Tuesday, 18 February 2025. Check your timetable for the precise time and location of your class.

πŸ“‹ Preparation

Before attending this lab, ensure that:

  1. You have completed the πŸ’» Week 04 lab and understand the basics of Scrapy.

  2. You are caught up with the concepts covered in the πŸ—£οΈ Week 05 Lecture, including XPath selectors and the ipdb debugging tool.

  3. You have installed Selenium in your virtual environment:

    pip install selenium

πŸ›£οΈ Lab Roadmap

While Scrapy is excellent for static content, some websites use JavaScript to load their content dynamically. This is where Selenium comes in: it automates a real browser, allowing us to interact with content that only appears after the JavaScript has run.

Part I: Understanding Browser Automation (20 min)

No need to wait for your class teacher. You can start working on this part on your own.

🎯 ACTION POINTS

  1. Create a new Python script called selenium_demo.py:

    from selenium import webdriver
    
    # Import the By class
    from selenium.webdriver.common.by import By
    
    # Initialize the Google Chrome browser
    # Change to another browser if you prefer:
    # https://selenium-python.readthedocs.io/installation.html#drivers
    driver = webdriver.Chrome()  
    
    # Navigate to a website
    driver.get("https://climateactiontracker.org/countries/")
    
    # Inspect the page
    import ipdb; ipdb.set_trace()


    If you’re working in a cloud environment without a graphical interface (like Nuvolos), use this modified version instead:

     from selenium import webdriver
     from selenium.webdriver.chrome.options import Options
     from selenium.webdriver.common.by import By
    
     # Set up Chrome options for headless operation
     chrome_options = Options()
     chrome_options.add_argument('--headless')  # Run in headless mode
     chrome_options.add_argument('--no-sandbox')  # Required for running in container
     chrome_options.add_argument('--disable-dev-shm-usage')  # Required for running in container
    
     # Initialize the Chrome browser with options
     driver = webdriver.Chrome(options=chrome_options)
    
     # Navigate to a website
     driver.get("https://climateactiontracker.org/countries/")
    
     # Inspect the page
     import ipdb; ipdb.set_trace()
  2. Run the script and observe what happens:

    • A Chrome browser should open automatically
    • It should navigate to the Climate Action Tracker website
    • You should be dropped into an ipdb prompt from which you can inspect the page
  3. Play with the multiple ways to select elements:

    Use find_element to get the equivalent of Scrapy’s response.css or response.xpath. You specify the selector using the By class.

    # Grab the first link on the page using a CSS selector
    driver.find_element(By.CSS_SELECTOR, "a")
    
    # Grab the first link on the page using XPath
    driver.find_element(By.XPATH, "//a[1]")
    
    # Grab the element whose id attribute is "link-id"
    driver.find_element(By.ID, "link-id")
    
    # Grab the first element with the class "link-class"
    driver.find_element(By.CLASS_NAME, "link-class")
    
    # Grab the first link on the page by its tag name
    driver.find_element(By.TAG_NAME, "a")
    
    # Grab the link whose visible text is exactly "link-text"
    driver.find_element(By.LINK_TEXT, "link-text")
  4. Interact with the page:

    You can perform actions on the page such as clicking on an element:

    # Click on the first link
    driver.find_element(By.CSS_SELECTOR, "a").click()

    You can also navigate back to the previous page:

    # Go back to the previous page
    driver.back()

    Or, say, change the zoom level:

    # Change the zoom level
    driver.execute_script("document.body.style.zoom = '200%'")

    Maybe you want to scroll to the bottom of the page:

    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    Or you just want to close the browser:

    # Close the browser
    driver.quit()
  5. Extract information from elements:

    Once you’ve located elements, you can extract various pieces of information:

    # Get element's text content
    element = driver.find_element(By.CSS_SELECTOR, "h1")
    print(element.text)  # Prints the text content
    
    # Get element's HTML
    print(element.get_attribute('outerHTML'))  # Gets full HTML including the element
    print(element.get_attribute('innerHTML'))  # Gets HTML inside the element
    
    # Get specific attributes
    link = driver.find_element(By.CSS_SELECTOR, "a")
    print(link.get_attribute('href'))  # Gets the link URL
    print(link.get_attribute('class'))  # Gets the class name
    
    # Get multiple elements
    elements = driver.find_elements(By.CSS_SELECTOR, "p")
    for element in elements:
        print(element.text)  # Prints text of each paragraph

Experiment with the different ways to interact with the page and try to understand how they work.
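
To tie the pieces of Part I together, here is a minimal sketch that prints the page heading and the URL of every link on the countries page. Note that find_element raises NoSuchElementException when nothing matches, while find_elements simply returns an empty list; the h1 selector here is just illustrative.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException
    
    driver = webdriver.Chrome()
    driver.get("https://climateactiontracker.org/countries/")
    
    # find_element raises NoSuchElementException when nothing matches,
    # so wrap lookups that might fail in try/except
    try:
        heading = driver.find_element(By.CSS_SELECTOR, "h1")
        print(heading.text)
    except NoSuchElementException:
        print("No <h1> found on this page")
    
    # find_elements returns a (possibly empty) list instead of raising
    for link in driver.find_elements(By.CSS_SELECTOR, "a"):
        print(link.get_attribute("href"))
    
    driver.quit()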

Part II: Handling Dynamic Content (20 min)

One of Selenium’s key features is its ability to wait for elements to load.

πŸ—£οΈ TEACHING MOMENT

Your class teacher will demonstrate how to locate elements using Selenium’s various selector methods and discuss the different ways to interact with the page. Follow along and ask questions.

  1. Implement explicit waits:

    Say you are working with a page that shows a loading spinner for a few seconds, and the content only appears once the spinner disappears.

    You can instruct Selenium to wait for the element to be present:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # Wait up to 10 seconds for element to be present
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".some-class"))
    )
  2. Practice with different wait conditions:

    Maybe an element needs to be clickable or visible before you can interact with it:

    # Wait for element to be clickable
    wait.until(EC.element_to_be_clickable((By.ID, "button-id")))
    
    # Wait for element to be visible
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "menu")))
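
Explicit waits target one condition at a time; Selenium also supports implicit waits, which tell the driver to keep polling the page for a set amount of time whenever an element is not immediately found. Also note that wait.until raises a TimeoutException if the condition is never met. A minimal sketch, reusing the driver and wait objects from above (the .results selector is a made-up placeholder):

    from selenium.common.exceptions import TimeoutException
    
    # Implicit wait: applies to every find_element/find_elements call,
    # polling for up to 5 seconds before raising NoSuchElementException
    driver.implicitly_wait(5)
    
    # Explicit waits raise TimeoutException when the condition never holds,
    # so catch it if the element might legitimately be absent
    try:
        element = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".results"))  # hypothetical selector
        )
    except TimeoutException:
        print("Element did not appear within the wait period")

Be aware that the Selenium documentation advises against mixing implicit and explicit waits in the same session, as the combination can produce unpredictable wait times; pick one strategy per script.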

Part III: Super Tech Support (50 min)

The rest of the time will be a 🦸🏻 SUPER TECH SUPPORT session. Use this time to practice web scraping with Scrapy or Selenium (you can embed Selenium in your Scrapy spiders, as sketched below), or to work on your upcoming πŸ“ Problem Set 1.
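
If you want to try embedding Selenium in a Scrapy spider, one common pattern is to let Selenium render the page and then hand the JavaScript-generated HTML back to Scrapy's selectors. Below is a minimal sketch of that pattern, not the only way to combine the two tools; the spider name and the a::attr(href) selector are illustrative. Note that the page is fetched twice here, once by Scrapy and once by the browser, which is acceptable for a first experiment but wasteful at scale.

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    
    class CountriesSpider(scrapy.Spider):
        name = "countries"
        start_urls = ["https://climateactiontracker.org/countries/"]
    
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            self.driver = webdriver.Chrome(options=chrome_options)
    
        def parse(self, response):
            # Render the page with a real browser, then hand the
            # JavaScript-generated HTML back to Scrapy's selectors
            self.driver.get(response.url)
            selector = scrapy.Selector(text=self.driver.page_source)
            for link in selector.css("a::attr(href)").getall():
                yield {"link": link}
    
        def closed(self, reason):
            # Shut down the browser when the spider finishes
            self.driver.quit()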

πŸ“š Resources