💻 Week 04 - Lab Roadmap (90 min)
DS105 - Data for Data Science
This week it is time for you to go beyond your own machine, and even beyond a cloud machine, and visit the Internet!
In today's lab we will get familiar with web scraping, a term that describes the practice of automated data acquisition from the Internet. The lab is split into three parts: classic HTML scraping, collecting data through APIs, and scraping with Selenium. These skills are not only essential for data collection in this course, but will also be exceptionally useful in your own research.
Part 1: Exploring LSE (25 min)
Our first task is to learn a bit more about LSE. LSE has many departments, and one of them is the Department of Methodology. Its staff have experience with a wide range of methods. Today we will create a list of all the methods in which the Methodology staff have expertise.
Main task: create a list of all the key areas of expertise that the Department of Methodology has.
🤔 Stop and think. Before you start following the instructions, can you think of a way to acquire this information using the LSE website?
Let's explore one possible way to do this. We do not provide you with code straight away, as we assume you will be guided during the lab; however, the code is given to you after the action points.
We will first complete a small exercise together in class, and then you will attempt the tasks yourself.
🤝 WORKING TOGETHER
- Go to the Department of Methodology People page and navigate to Academic Staff.
- Inspect the page and find the HTML path to the second heading.
- Send a GET request to the website and get the second heading.
- Print the heading using the language of your choice (a sketch of these steps follows this list).
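Below is a minimal Python sketch of these steps. The choice of h2 as the heading tag is an assumption made for illustration; inspect the page and adjust the tag or index to match what you find.
# import the required libraries
import requests
from bs4 import BeautifulSoup

# send a GET request to the People page
response = requests.get("https://www.lse.ac.uk/Methodology/People")

# parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# collect all headings at one level and print the second one
# (the <h2> tag is an assumption -- confirm it by inspecting the page)
headings = soup.find_all("h2")
print(headings[1].get_text(strip=True))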
Now it is time for you to work independently and solve the task mentioned above.
🎯 ACTION POINTS
- Inspect the page to find the HTML element that allows you to get to each person's individual page.
- Using the language of your choice, create a list of the links to the individual pages of all the Academic Staff Members.
- Using the saved links, go through each of them and extract the Key Areas of Expertise for all the Staff Members.
- Create one list to store all the Areas of Expertise and remove duplicates from it.
Do you find any areas that are relevant to you?
Optional task
Think of a way to scrape and store these data so that each Academic Staff Member is linked to their Key Areas of Expertise (one possible sketch is given after the solution code).
Solution Code
Python users
# import the required libraries
import requests
from bs4 import BeautifulSoup

# sending a request to the site
response_html = requests.get("https://www.lse.ac.uk/Methodology/People")

# parsing the page's content
soup = BeautifulSoup(response_html.text, "html.parser")

# empty list to store links to pages
links_to_staff = []

# subsetting only for academic staff
all_ac_staff = soup.find_all("div", attrs={"class": "accordion__panel"})[0].find_all("a", attrs={"class": "sys_0 sys_t0"})

# iterating through all of the staff, saving the links
for person in all_ac_staff:
    # extracting a link and prepending the base URL
    link = person.get("href")
    link = "https://www.lse.ac.uk" + link
    links_to_staff.append(link)

# empty list to store key areas of expertise
key_exp = []

# iterating through all staff members' pages
for link in links_to_staff:
    # extracting their research interests
    response_html = requests.get(link)
    soup = BeautifulSoup(response_html.text, "html.parser")
    exp = soup.find_all("div", attrs={"class": "peopleContact__address"})[-1].get_text()
    key_exp.append(exp)

# cleaning up the string to create a list of all areas
final_list = ",".join(key_exp).replace(".", " ").replace(",", ";").replace("; ", ";").split(";")
R users
# import required packages
library(rvest)
library(magrittr)
library(stringi)

# read the content of the site
url <- "https://www.lse.ac.uk/Methodology/People"
html <- read_html(url)

# extract the elements related to academic staff only
html_nodes(html, css = ".accordion__panel")[1] %>%
  html_nodes(css = ".sys_0.sys_t0")

# extract links to individual pages
links_to_staff <- html_nodes(html, css = ".accordion__panel")[1] %>%
  html_nodes(css = "a") %>%
  html_attr("href")

# add the base URL to the links
links_to_staff <- paste("https://www.lse.ac.uk", links_to_staff, sep = "")

# write a function that extracts areas of expertise from each page
get_expert <- function(link) {
  url <- link
  html <- read_html(url)
  html_nodes(html, css = ".peopleContact__address") %>%
    tail(1) %>%
    html_text()
}

# apply the function to all pages
result <- sapply(links_to_staff, get_expert)

# clean up the areas of expertise
res <- paste(result, collapse = ";") %>%
  stri_replace_all_fixed(".", " ") %>%
  stri_replace_all_fixed(",", ";") %>%
  stri_replace_all_fixed("; ", ";")

# create a list of all areas
final_list <- stri_split_fixed(res, ";")[[1]]
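For the optional task, one possible approach is to build a dictionary that maps each staff member to their areas of expertise. The minimal Python sketch below reuses links_to_staff from the Python solution above; the idea that the person's name sits in the page's h1 element is our assumption, so inspect an individual page to confirm the right selector.
# a minimal sketch for the optional task: link each staff member to their areas
# (reuses links_to_staff from above; the <h1> selector for the name is an assumption)
import requests
from bs4 import BeautifulSoup

staff_expertise = {}
for link in links_to_staff:
    page = BeautifulSoup(requests.get(link).text, "html.parser")
    name = page.find("h1").get_text(strip=True)
    areas_text = page.find_all("div", attrs={"class": "peopleContact__address"})[-1].get_text()
    # split the raw text into a clean list of areas
    areas = [a.strip(" .") for a in areas_text.replace(".", ",").replace(";", ",").split(",") if a.strip()]
    staff_expertise[name] = areas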
Part 2: Buying tickets (35 min)
After completing the last exercise you might think "Well, yeah, it is useful for my academic endeavours, but not for my day-to-day life." Let us show you how you can solve everyday problems with it!
It turns out that TicketMaster (one of the biggest websites that sells tickets) has its own API. This means that you can automate your ticket search if you wanted to! Let's explore this together.
We do not provide you with code straight away; however, you will find the solutions below.
As before, we will first complete some tasks together to understand how APIs work, and then you will work independently.
🤝 WORKING TOGETHER
- Go to https://developer.ticketmaster.com/ and register an account. You will only need an email address.
- Acquire an API token using your new account. It will be called a Consumer Key in your app's information.
- Make an API call to get all the venues in New York (see the sketch after this list).
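Here is a rough Python sketch of the venues call. Treat the endpoint path and the parameter names as assumptions to verify against the Discovery API documentation, and note that the "_embedded"/"venues" keys describe the usual shape of the response rather than a guarantee.
# a rough sketch of the venues call (verify the endpoint and parameters in the docs)
import requests

api_key = "YOUR_API_KEY"

params = {"keyword": "New York",   # search venues matching "New York" (assumed parameter)
          "countryCode": "US",
          "apikey": api_key}

response = requests.get("https://app.ticketmaster.com/discovery/v2/venues.json",
                        params=params)

# the list of venues usually sits under "_embedded" -> "venues"
venues = response.json().get("_embedded", {}).get("venues", [])
print(len(venues))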
And now, it's time to get tickets!
🎯 ACTION POINTS
- Using the documentation, find all the music events happening in London.
How many events have you found?
Do you think it's all the events happening in London?
Is there a way to show more events?
- Let's make our search a bit narrower. Let's imagine you are coming back from holidays on the 15th of October. Can you find Rock music events in London that are happening after that date? What if you wanted an event that you can get to for less than 30 pounds? Can you find one for yourself?
- Go ahead and try to find London events related to data and data science. Are there any? Extend your search and try again.
- What about family-friendly events in London? Are there any?
Solution Code
Python users
All the tasks above are solved with the same API URL. Here we show the base code once and then, for each task, the parameters that yield the desired results.
Task 3
# import the required libraries
import requests

# saving your API key
api_key = "YOUR_API_KEY"

# setting up the API query parameters (they will be changing)
params = {"classificationName": "music",
          "countryCode": "GB",
          "city": "London",
          "apikey": api_key}

# sending a request to the API
response = requests.get("https://app.ticketmaster.com/discovery/v2/events.json",
                        params=params)

# parse the response
resp_json = response.json()

# extract the events
resp_json["_embedded"]["events"]
Next, we will only provide the query parameters.
Task 3 (extending search)
# setting up the API query parameters (they will be changing)
params = {"classificationName": "music",
          "countryCode": "GB",
          "city": "London",
          "size": 200,  # feel free to change this number
          "page": 1,    # we add pages here to show that you can get more results if needed
          "apikey": api_key}
Task 4
# setting up the API query parameters (they will be changing)
params = {"classificationName": "music",
          "countryCode": "GB",
          "city": "London",
          "genre_name": "Rock",
          "startDateTime": "2022-10-15T00:00:00Z",
          "size": 200,  # feel free to change this number
          "page": 1,    # we add pages here to show that you can get more results if needed
          "apikey": api_key}
This API does not have a price parameter, so you would need to filter the JSON manually.
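A short sketch of that manual filtering is below. The "priceRanges" field is where the API usually reports prices, but it is not present for every event, so treat it as an assumption and check a few events in your own response.
# filter events whose minimum price is below 30 (in the event's currency)
cheap_events = []
for event in resp_json["_embedded"]["events"]:
    for price in event.get("priceRanges", []):   # not every event has priceRanges
        if price.get("min", float("inf")) < 30:
            cheap_events.append(event["name"])
            break

print(cheap_events)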
Task 5
# setting up the API query parameters (they will be changing)
params = {"countryCode": "GB",
          "city": "London",
          "keyword": "data",
          "size": 200,  # feel free to change this number
          "page": 1,    # we add pages here to show that you can get more results if needed
          "apikey": api_key}

# or
params = {"keyword": "data",
          "size": 200,
          "page": 1,
          "apikey": api_key}
Task 6
# setting up the API query parameters (they will be changing)
params = {"countryCode": "GB",
          "city": "London",
          "includeFamily": "only",
          "size": 200,  # feel free to change this number
          "page": 1,    # we add pages here to show that you can get more results if needed
          "apikey": api_key}
R users
All the tasks above are solved with the same API URL. Here we show the base code once and then, for each task, the parameters that yield the desired results.
Task 3
# importing required packages
library("httr")
library("jsonlite")

# saving your API key
api_key <- "YOUR_API_KEY"

# setting up the base URL
base_url <- "https://app.ticketmaster.com/discovery/v2/events.json"

# sending a request
response <- GET(base_url, query = list("classificationName" = "music",
                                       "countryCode" = "GB",
                                       "city" = "London",
                                       "apikey" = api_key))

# parse the response
json <- content(response, "parsed")
Next, we will only provide the query parameters.
Task 3 (extending search)
# sending a request
response <- GET(base_url, query = list("classificationName" = "music",
                                       "countryCode" = "GB",
                                       "city" = "London",
                                       "size" = 200,
                                       "page" = 1,
                                       "apikey" = api_key))
Task 4
# sending a request
response <- GET(base_url, query = list("classificationName" = "music",
                                       "countryCode" = "GB",
                                       "city" = "London",
                                       "genre_name" = "Rock",
                                       "startDateTime" = "2022-10-15T00:00:00Z",
                                       "size" = 200,  # feel free to change this number
                                       "page" = 1,    # we add pages here to show that you can get more results if needed
                                       "apikey" = api_key))
This API does not have a price parameter, so you would need to filter the JSON manually.
Task 5
# sending a request
response <- GET(base_url, query = list("countryCode" = "GB",
                                       "city" = "London",
                                       "keyword" = "data",
                                       "size" = 200,  # feel free to change this number
                                       "page" = 1,    # we add pages here to show that you can get more results if needed
                                       "apikey" = api_key))

# or
response <- GET(base_url, query = list("keyword" = "data",
                                       "size" = 200,  # feel free to change this number
                                       "page" = 1,    # we add pages here to show that you can get more results if needed
                                       "apikey" = api_key))
Task 6
# sending a request
response <- GET(base_url, query = list("countryCode" = "GB",
                                       "city" = "London",
                                       "includeFamily" = "only",
                                       "size" = 200,  # feel free to change this number
                                       "page" = 1,    # we add pages here to show that you can get more results if needed
                                       "apikey" = api_key))
Part 3: Searching for Data events (25 min)
We have now explored some of the key methods of web scraping. However, we haven't yet talked about scenarios where we need to interact with a website to acquire information. This can be done with Selenium, which can interact with websites automatically. Let's see how it works.
🤝 WORKING TOGETHER
- If you are using Python, make sure you have Google Chrome installed and chromedriver downloaded to a folder whose path you know.
If you are using R, make sure you have Mozilla Firefox installed.
- Go to the LSE website.
- Copy the selector path to the search box.
- Using Selenium, type "data" into the box.
- Extract the name of the first programme in the results (see the sketch after this list).
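A minimal Python sketch of this exercise is shown below. All the selectors in it are placeholders we made up for illustration, so replace them with the ones you copy when inspecting the LSE site.
# a minimal sketch of the LSE search exercise (all selectors are placeholders)
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time

# launch the browser (adjust the chromedriver path for your machine)
driver = Chrome("/chromedriver_PATH")
driver.get("https://www.lse.ac.uk")

# find the search box -- replace this placeholder selector with the one you copied
search_box = driver.find_element("css selector", "input[type='search']")
search_box.send_keys("data")
search_box.send_keys(Keys.ENTER)

# give the results page a moment to load
time.sleep(2)

# grab the first result -- again, replace the placeholder selector with the real one
first_result = driver.find_element("css selector", ".search-results a")
print(first_result.text)

driver.close()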
Now, it's time to try things yourself. We haven't found a lot of events about data on TicketMaster. Maybe we can use another platform?
🎯 ACTION POINTS
- Using Selenium, go to the London events page on Eventbrite.
- Navigate to the search box and enter "data" using Selenium.
- Hit enter using Selenium.
- Parse the number of pages of results from the bottom of the page.
- Find a way to go to the next page.
- Write a loop that will go through the next 10 pages and print the date and time of the first event on each page.
Solution Code
Python users
We will present the whole sequence of steps here in one block of code.
# import the required libraries
from selenium import webdriver
from selenium.webdriver import Chrome
import time

# here you specify the path to chromedriver
driver_path = "/chromedriver_PATH"

# launching the browser
driver = Chrome(driver_path)

# saving the link
link = "https://www.eventbrite.co.uk/d/united-kingdom--london/events/"

# navigating to the page
driver.get(link)

# navigating to the search box
search = driver.find_element("css selector", """#global-header > div
    > div.consumer-header__content.consumer-header__desktop.eds-show-up-md
    > div.consumer-header__search > button > div > div > div > div""")

# clicking on the search box
search.click()

# selecting the search input field
inputElement = driver.find_element("css selector", "#search-autocomplete-input")

# inputting "data" into the search box
inputElement.send_keys("data")

# importing common keys
from selenium.webdriver.common.keys import Keys

# asking Selenium to hit enter
inputElement.send_keys(Keys.ENTER)

# extracting the number of pages
n_pages_el = driver.find_element("css selector", """#root > div > div.eds-structure__body > div > div > div > div.eds-fixed-bottom-bar-layout__content
    > div > main > div > div > section.search-base-screen__search-panel
    > footer > div > div > ul > li.eds-pagination__navigation-minimal.eds-l-mar-hor-3""")
print(n_pages_el.text)

for i in range(10):
    # saving the first date on the page
    first_date = driver.find_element("xpath", """//*[@id="root"]/div/div[2]/div/div/div/div[1]/div/main/div/div/section[1]/div[1]/div/ul/li[1]/div/div/div[1]/div/div/div/article/div[2]/div/div/div[1]/div""")
    print(first_date.text)

    # saving the next page button
    next_page = driver.find_element("css selector", "#chevron-right-chunky_svg__eds-icon--chevron-right-chunky_svg")

    # clicking for next page
    next_page.click()

    # wait for 2 seconds so the page can load
    time.sleep(2)

# close the driver
driver.close()
R users
We will present the whole sequence of steps here in one block of code.
# import selenium
library("RSelenium")

# launching the browser
rD <- rsDriver(browser = c("firefox"))
driver <- rD$client

# navigating to the page
url <- "https://www.eventbrite.co.uk/d/united-kingdom--london/events/"
driver$navigate(url)

# navigating to the search box
search <- driver$findElement(using = "css selector", value = '#global-header > div
    > div.consumer-header__content.consumer-header__desktop.eds-show-up-md
    > div.consumer-header__search > button > div > div > div > div')

# clicking on the search box
search$clickElement()

# selecting the search input field
search_field <- driver$findElement(using = "css selector", value = '#search-autocomplete-input')

# inputting "data" into the search box
search_field$sendKeysToElement(list("data"))

# asking Selenium to hit enter
page_body <- driver$findElement("css", "body")
page_body$sendKeysToElement(list(key = "enter"))

# extracting the number of pages
n_pages <- driver$findElement(using = "css selector",
                              value = '#root > div > div.eds-structure__body > div > div > div >
    div.eds-fixed-bottom-bar-layout__content
    > div > main > div > div > section.search-base-screen__search-panel
    > footer > div > div > ul > li.eds-pagination__navigation-minimal.eds-l-mar-hor-3')
n_pages <- n_pages$getElementText()[[1]]
print(n_pages)

for (i in 1:10) {
  # saving the first date on the page
  first_date <- driver$findElement(using = "css selector", value = ".search-main-content__events-list > li:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > article:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2)")
  print(first_date$getElementText()[[1]])

  # saving the next page button
  next_page <- driver$findElement(using = "css selector", value = '#chevron-right-chunky_svg__eds-icon--chevron-right-chunky_svg')

  # clicking for next page
  next_page$clickElement()

  # wait for 2 seconds so the page can load
  Sys.sleep(2)
}