Introduction
In this tutorial we will learn how to scrape data from websites using traditional methods and hidden APIs. Our target website will be the QS World University Rankings; please familiarise yourself with it before proceeding. Let’s assume we would like to scrape the following data from the website:
- University name
- Ranking in 2024
- Overall score
- All scores for the different categories
- City, country and region
These data could be used to analyse a multitude of research questions such as:
- How does the ranking of universities change over time?
- What differences are there between universities in different countries?
- What differences are there between universities in different regions?
- How does the ranking correspond to other metrics such as the number of students, the number of faculty members, the number of international students, etc.?
Different methods of scraping
Traditional methods
Traditional methods of scraping involve using the `requests` library to send HTTP requests to the website and then parsing the HTML response with the `BeautifulSoup` library. This method is very simple and can be used to scrape most websites, provided there are no restrictions placed on your IP address. However, it is not very robust and can break easily if the website changes its HTML structure.
We will be using the `requests` library for our example, but do also give the `httpx` library a try—it supports asynchronous requests and is generally faster than `requests`, but has largely the same syntax. You can install `httpx` with `pip install httpx`.
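For reference, here is a minimal sketch of this traditional request-then-parse pattern. The URL and the CSS selector are purely illustrative placeholders, not the actual QS page structure:

import requests
from bs4 import BeautifulSoup

# fetch a page and parse the returned HTML (illustrative URL, not a real endpoint)
response = requests.get("https://example.com/rankings")
soup = BeautifulSoup(response.text, "html.parser")

# pull the text out of elements matching a hypothetical CSS class
for row in soup.select(".ranking-row"):
    print(row.get_text(strip=True))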
Scraping using traditional methods
It is relatively simple to find endpoints for scraping data, and the output is usually well structured. We have now obtained the ratings for all universities in the 2024 QS ranking, and one of the data points is the URL ‘slug’ for accessing the webpages of individual universities, which is where the scores for each indicator are stored. If we go to one of these pages, we can use the same process of filtering requests to find the endpoint responsible for the data we need. It turns out that the indicator scores are loaded through their own request, which we can track down in exactly the same way.
After identifying the request, we can again copy the cURL command and import it into Postman. We can then export it to Python and iterate through all the universities to get the data we need. Notice that this is a `POST` request, whereas previously we were dealing with `GET` requests.
Your `cURL` command should look something like this:
curl 'https://www.topuniversities.com/qs-profiles/rank-data/513/478/0?_wrapper_format=drupal_ajax' \
-X 'POST' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Accept: application/json, text/javascript, */*; q=0.01' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: en-GB,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Host: www.topuniversities.com' \
-H 'Origin: https://www.topuniversities.com' \
-H 'Content-Length: 1212' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15' \
-H 'Referer: https://www.topuniversities.com/universities/university-oxford' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Cookie: STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A1%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.565217329.1696875467083%22%2C%22views%22%3A7%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A2%2C%22landing%22%3A%22https%3A//www.topuniversities.com/universities/massachusetts-institute-technology-mit%22%2C%22enter_at%22%3A%222023-10-9%7C19%3A17%3A47%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696877223485%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; _ga_16LPMES2GR=GS1.1.1696875469.2.1.1696877221.30.0.0; _ga_5D0D56Z1Z9=GS1.1.1696875469.1.1.1696877221.0.0.0; _ga_YN0B3DGTTZ=GS1.1.1696875469.2.1.1696877221.27.0.0; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696875469.2.1.1696877218.50.0.0; _gid=GA1.2.1105426707.1696875469; _hjAbsoluteSessionInProgress=0; _hjSession_173635=eyJpZCI6ImJkMTA0ZDkwLTZkZmEtNGU4MS1iYWY2LTBmNTJjMmZlZTg2OCIsImNyZWF0ZWQiOjE2OTY4NzU0Njk2ODQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; __hssc=238059679.6.1696875470238; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696875470238.2; _fbp=fb.1.1696348117607.443804054; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _gcl_au=1.1.1619037028.1696348117; _hjIncludedInSessionSample_173635=0; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; _gat_%5Bobject%20Object%5D=1; _gat_UA-223073533-1=1; _gat_UA-37767707-2=1; STYXKEY-user_survey=other; STYXKEY-user_survey_other=; STYXKEY-globaluserUUID=TU-1696875469298-84871097; mktz_sess=sess.2.565217329.1696875467083; __gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348471:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348471:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0' \
-H 'X-Requested-With: XMLHttpRequest' \
--data 'js=true&_drupal_ajax=1&ajax_page_state%5Btheme%5D=tu_d8&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=addtoany%2Faddtoany.front%2Cckeditor_accordion%2Faccordion.frontend%2Cclientside_validation_jquery%2Fcv.jquery.ckeditor%2Cclientside_validation_jquery%2Fcv.jquery.ife%2Cclientside_validation_jquery%2Fcv.jquery.validate%2Cclientside_validation_jquery%2Fcv.pattern.method%2Ccore%2Fdrupal.form%2Ccore%2Fdrupal.states%2Ccore%2Fnormalize%2Ceu_cookie_compliance%2Feu_cookie_compliance_bare%2Cflag%2Fflag.link_ajax%2Cga%2Fanalytics%2Clayout_discovery%2Fonecol%2Cqs_article%2Fqs_article%2Cqs_firebase_sso%2Fsso-lib%2Cqs_firebase_sso%2Fsso-lib-header%2Cqs_flexreg_user_flow%2Fqs_flexreg_user_flow%2Cqs_global_site_search%2Fsearch_header%2Cqs_profiles%2Fhighcharts%2Cqs_profiles%2Fqs_profiles%2Cqs_profiles%2Fqs_profiles_circle%2Cqs_user_profile%2FqsUserProfile%2Csystem%2Fbase%2Ctu_d8%2Fglobal%2Ctu_d8%2Fnode%2Ctu_d8%2Fprofile_header%2Ctu_d8%2Fqna_forums%2Ctu_d8%2Fqs_campus_locations%2Ctu_d8%2Fqs_instant%2Ctu_d8%2Fqs_profile_new%2Ctu_d8%2Fqs_profile_new_datalayer%2Ctu_d8%2Fqs_program_tabs%2Ctu_d8%2Fqs_ranking_chart%2Ctu_d8%2Fqs_related_content%2Ctu_d8%2Fqs_similar_programs%2Cviews%2Fviews.module'
Here is what the request looks like in Postman:
A `POST` request means that the parameters are sent in the body of the request rather than in the URL; however, in our case it seems that the unique ID of each university is sent in the URL. We can try changing the ID to see if we can get data for other universities.
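As a quick illustration of the difference (with placeholder URLs and parameters, not the real QS endpoint): with `requests`, `GET` parameters go into the query string via `params`, while `POST` parameters go into the request body via `data`.

import requests

# GET: parameters are appended to the URL as a query string (placeholder URL)
r = requests.get("https://example.com/api/rankings", params={"page": 1})

# POST: parameters travel in the request body (placeholder URL and payload)
r = requests.post("https://example.com/api/rank-data", data={"core_id": 478})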
= "https://www.topuniversities.com/qs-profiles/rank-data/513/478/0?_wrapper_format=drupal_ajax" url
This is the request sent to get the data for MIT, and in our `qs_ranking_1.json` file we can see that the ID for MIT is 478. Let’s try to change it to 362, which is the ID for LSE.
= "https://www.topuniversities.com/qs-profiles/rank-data/513/362/0?_wrapper_format=drupal_ajax" url
The request is successful and we do indeed get the indicators for LSE. This means we can gather all the unique IDs for all the universities and iterate through them to get the data we need. However, as you might have seen, the data passed in the JSON response is still formatted in HTML, so we might need to do some cleaning before we can put it in a tabular format—this is where traditional scraping methods come in handy.
import json
from pprint import pprint

# open files beginning with `qs_ranking_`
# iterate through the files and append the data to a list
unis = []
for page in range(1, 3):
    with open(f'qs_ranking_{page}.json', 'r') as f:
        unis.extend(json.load(f)['score_nodes'])  # read up on the difference between `extend` and `append`

pprint(unis[:2])
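As an aside on the comment above: `append` adds its argument as a single element, whereas `extend` unpacks an iterable and adds each of its items, which is what we want when merging the `score_nodes` lists from several files. A tiny illustration:

nums = [1, 2]
nums.append([3, 4])   # -> [1, 2, [3, 4]]  (the list itself becomes one element)

nums = [1, 2]
nums.extend([3, 4])   # -> [1, 2, 3, 4]    (each item is added individually)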
Now we have a list of dictionaries, but we need a list of unique IDs. Here they seem to be represented by the `core_id` key.
Let’s create a list of unique IDs to scrape. We can use a `for` loop to iterate through the list of dictionaries and append the IDs to a new list. We can then use the `set` function to deduplicate the list.
uni_ids = []
for uni in unis:
    uni_ids.append(uni['core_id'])

# Deduplicate the list
uni_ids = list(set(uni_ids))

# Let's check the length of the list of unique IDs
print(len(uni_ids))
# and a few items from the list
print(uni_ids[:5])
We can use a similar method to the one we used previously to save the data to JSON files. It is probably a good idea just to save the JSON files for now and then later parse them for useful information—this way we don’t have to worry about getting the right data from the HTML straight away and can instead focus on procuring it first.
This is a working request that we exported as Python code from Postman, based on the `cURL` command above.
import requests

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/410/0?_wrapper_format=drupal_ajax"

payload = "js=true&_drupal_ajax=1&ajax_page_state%5Btheme%5D=tu_d8&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=addtoany%2Faddtoany.front%2Cckeditor_accordion%2Faccordion.frontend%2Cclientside_validation_jquery%2Fcv.jquery.ckeditor%2Cclientside_validation_jquery%2Fcv.jquery.ife%2Cclientside_validation_jquery%2Fcv.jquery.validate%2Cclientside_validation_jquery%2Fcv.pattern.method%2Ccore%2Fdrupal.form%2Ccore%2Fdrupal.states%2Ccore%2Fnormalize%2Ceu_cookie_compliance%2Feu_cookie_compliance_bare%2Cflag%2Fflag.link_ajax%2Cga%2Fanalytics%2Clayout_discovery%2Fonecol%2Cqs_article%2Fqs_article%2Cqs_firebase_sso%2Fsso-lib%2Cqs_firebase_sso%2Fsso-lib-header%2Cqs_flexreg_user_flow%2Fqs_flexreg_user_flow%2Cqs_global_site_search%2Fsearch_header%2Cqs_profiles%2Fhighcharts%2Cqs_profiles%2Fqs_profiles%2Cqs_profiles%2Fqs_profiles_circle%2Cqs_user_profile%2FqsUserProfile%2Csystem%2Fbase%2Ctu_d8%2Fglobal%2Ctu_d8%2Fnode%2Ctu_d8%2Fprofile_header%2Ctu_d8%2Fqna_forums%2Ctu_d8%2Fqs_campus_locations%2Ctu_d8%2Fqs_instant%2Ctu_d8%2Fqs_profile_new%2Ctu_d8%2Fqs_profile_new_datalayer%2Ctu_d8%2Fqs_program_tabs%2Ctu_d8%2Fqs_ranking_chart%2Ctu_d8%2Fqs_related_content%2Ctu_d8%2Fqs_similar_programs%2Cviews%2Fviews.module"

headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Sec-Fetch-Site': 'same-origin',
    'Accept-Language': 'en-GB,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Sec-Fetch-Mode': 'cors',
    'Host': 'www.topuniversities.com',
    'Origin': 'https://www.topuniversities.com',
    'Content-Length': '1212',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15',
    'Referer': 'https://www.topuniversities.com/universities/massachusetts-institute-technology-mit',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Cookie': 'STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A1%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.565217329.1696875467083%22%2C%22views%22%3A2%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A2%2C%22landing%22%3A%22https%3A//www.topuniversities.com/universities/massachusetts-institute-technology-mit%22%2C%22enter_at%22%3A%222023-10-9%7C19%3A17%3A47%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696875480078%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; _ga_16LPMES2GR=GS1.1.1696875469.2.0.1696875478.51.0.0; _ga_5D0D56Z1Z9=GS1.1.1696875469.1.0.1696875478.0.0.0; _ga_YN0B3DGTTZ=GS1.1.1696875469.2.0.1696875478.51.0.0; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696875469.2.1.1696875473.56.0.0; _gid=GA1.2.1105426707.1696875469; __hssc=238059679.1.1696875470238; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696875470238.2; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _fbp=fb.1.1696348117607.443804054; _gat_%5Bobject%20Object%5D=1; _gat_UA-223073533-1=1; _gat_UA-37767707-2=1; _gcl_au=1.1.1619037028.1696348117; _hjAbsoluteSessionInProgress=0; _hjIncludedInSessionSample_173635=0; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_173635=eyJpZCI6ImJkMTA0ZDkwLTZkZmEtNGU4MS1iYWY2LTBmNTJjMmZlZTg2OCIsImNyZWF0ZWQiOjE2OTY4NzU0Njk2ODQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; STYXKEY-globaluserUUID=TU-1696875469298-84871097; mktz_sess=sess.2.565217329.1696875467083; __gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348471:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348471:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0',
    'X-Requested-With': 'XMLHttpRequest'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
The only thing we need to change here is the `core_id` within the URL. We have 1498 unique IDs to get through, so it would be a good idea to save the data with the ID as the filename. We can also add a `sleep` call to avoid overloading the server with requests—but this is optional for now as we will only make a few requests.
# Let's create a function that will save the JSON files for us.
def save_json(data: str, filename: str):
    with open(filename, 'w') as f:
        f.write(data)
It might also be a good idea to save the data in a separate folder. We can use the `os` library to create a folder if it does not exist yet.
import os

# Create a folder called `data` if it does not exist yet
if not os.path.exists('data'):
    os.mkdir('data')
Now we can iterate through the list of unique IDs and save the data to JSON files.
import requests
import time
from tqdm.notebook import tqdm  # a progress bar library
import os  # a library for working with files and folders

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/{}/0?_wrapper_format=drupal_ajax"  # notice the {} in the URL---this is a placeholder for the ID

# Create a folder called `data` if it does not exist yet
if not os.path.exists('data'):
    os.mkdir('data')

for id in tqdm(uni_ids[:5]):
    print(f"Saving data for {id}")
    # make a POST request instead of a GET request
    response = requests.request("POST", url.format(id), headers=headers, data=payload)
    save_json(response.text, f"data/{id}.json")
    time.sleep(1)
Explore the structure of the files you have saved. Within each of them, we can see the HTML for the indicators—it seems to be within the last dictionary in a list. There are also some additional data about historical ratings that you might want to make use of, e.g. for finding out which regions have made the most impressive improvements over the past decade. As we are only interested in the 2024 indicators, however, we can extract the HTML and its associated university ID, then use Scrapy Selectors to parse the HTML and extract the data we need.
Saving the files and parsing them later allows you to separate the data collection and data processing steps. This can be beneficial for a few reasons:
- Reduced server load: by saving the data to files and processing it later, you can avoid overloading the server with too many requests at once. This can help prevent your IP address from being blocked or banned by the server.
- Flexibility: saving the data to files allows you to work with the data at your own pace and on your own schedule. You can parse the data when it’s convenient for you, and you can easily re-parse it if you need to make changes to your parsing code.
- Data backup: saving the data to files provides a backup in case something goes wrong during the parsing process. If your parsing code encounters an error or if your computer crashes, you can simply re-run the parsing code on the saved data rather than having to re-collect the data from scratch.
# Create an empty dictionary to store the university data
uni_data = {}

# Loop through each file in the `data` folder
for filename in os.listdir('data'):
    # Open the file and read in the JSON data
    with open(f'data/{filename}', 'r') as f:
        json_data = json.load(f)

    # Get the ID of the university from the filename
    uni_id = filename.split('.')[0]

    # Add a new key-value pair to the `uni_data` dictionary
    # where the key is the ID and the value is the last dictionary
    # in the list of dictionaries in the JSON data
    uni_data[uni_id] = json_data[-1]['data']
Now that we have the HTML, we can use Scrapy Selectors to parse it and extract the data.
from scrapy import Selector

# Let's create a function that will extract the data we need from the HTML
def extract_data(html: str):
    # define a selector
    sel = Selector(text=html)
    # extract the data from div.circle---this gives you a list of Scrapy selector objects which you can drill down into
    indicators = sel.css('div.circle')
    # create a dictionary to store the data
    indicator_data = {}
    # iterate through the indicators and extract the data
    for indicator in indicators:
        indicator_name = indicator.css('.itm-name::text').get().strip()
        indicator_value = indicator.css('.score::text').get().strip()

        # Store data in a dictionary
        indicator_data[indicator_name] = indicator_value

    return indicator_data

# Let's test the function on the HTML we have saved
extract_data(uni_data['2551'])
# And now we can iterate through the dictionary and extract the data for each university
for id, data in uni_data.items():
    uni_data[id] = extract_data(data)
Now we have a dictionary with the data we need. We can use `pandas` to convert it to a dataframe and save it to a CSV file.
import pandas as pd

df = pd.DataFrame.from_dict(uni_data, orient='index')
df.head()

# and then save it to a CSV file
df.to_csv('qs_indicators.csv')
Now we have a dataframe with all the indicators for all the universities; however, the data are missing the university names, regions, and weblinks that we received with the previous request. We can use the `core_id` to join the two dataframes together. First, let us load the data from the previous request.
import json
import pandas as pd

# Open the json file
with open('qs_ranking_1.json', 'r') as f:
    data = json.load(f)

# get into the `score_nodes` key which contains the required data
data = data['score_nodes']

# convert to a pandas dataframe
unis_df = pd.DataFrame.from_dict(data)

# Merge the two dataframes together
unis_df = unis_df.merge(df, left_on='core_id', right_index=True)

# Save the dataframe to a CSV file
unis_df.to_csv('qs_ranking.csv')
Let’s take a look at the dataframe we have created.
import pandas as pd

unis_df = pd.read_csv('qs_ranking.csv', index_col=0)
unis_df.head()
This data set now contains all the information we need to answer the questions we set out to answer. You can now use the data to create visualisations, perform statistical analysis, or build a machine learning model.
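As a starting point, a quick exploratory sketch might look like the following. Note that the column names `'region'` and `'Overall'` are assumptions for illustration; inspect `unis_df.columns` to see what your merged dataframe actually contains.

import pandas as pd

unis_df = pd.read_csv('qs_ranking.csv', index_col=0)

# the column names below are assumptions---check unis_df.columns first
unis_df['Overall'] = pd.to_numeric(unis_df['Overall'], errors='coerce')

# e.g. average overall score by region, highest first
print(unis_df.groupby('region')['Overall'].mean().sort_values(ascending=False))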
Summary
In this tutorial, we have learned:
- how to use Postman to identify the requests responsible for getting the data we need and how to export them to Python
- how to save the data to files and parse it later
- how to use Scrapy Selectors to parse HTML and extract the data we need
- how to join two dataframes together using a common key
- how to save the data to a CSV file.
Happy scraping!