Scraping using traditional methods vs hidden APIs

Learn about the different methods of scraping data from websites and how to use them.
web scraping
scrapy
hidden apis
Author
Published

22 October 2023

Introduction

In this tutorial, we will learn how to scrape data from websites using both traditional methods and hidden APIs. Our target website will be the QS World University Rankings; please familiarise yourself with it before proceeding. Let’s assume we would like to scrape the following data from the website:

  • University name
  • Ranking in 2024
  • Overall score
  • All scores for the different categories
  • City, country and region

These data could be used to address a multitude of research questions, such as:

  • How does the ranking of universities change over time?
  • What differences are there between universities in different countries?
  • What differences are there between universities in different regions?
  • How does the ranking correspond to other metrics such as the number of students, the number of faculty members, the number of international students, etc.?

Different methods of scraping

Traditional methods

Traditional scraping involves using the requests library to send HTTP requests to a website and then parsing the HTML response with the BeautifulSoup library. This approach is very simple and works for most websites, provided no restrictions are placed on your IP address. However, it is not very robust and breaks easily whenever the website changes its HTML structure.

We will be using the requests library for our example, but do also give the httpx library a try: it supports asynchronous requests, is generally faster than requests, and shares the same syntax. You can install it with pip install httpx.
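To make the traditional approach concrete, here is a minimal sketch of the parse step with BeautifulSoup. The table layout and class names below are invented for illustration; in a real scraper the HTML string would come from `requests.get(...).text` rather than being written by hand.

```python
from bs4 import BeautifulSoup

# In a real scraper you would fetch the page first, e.g.:
#   import requests
#   html = requests.get("https://example.com/rankings").text
# Here we parse a small hand-written snippet so the example is self-contained.
html = """
<table>
  <tr><td class="uni-name">University A</td><td class="score">98.5</td></tr>
  <tr><td class="uni-name">University B</td><td class="score">97.2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    name = tr.find("td", class_="uni-name").get_text(strip=True)
    score = float(tr.find("td", class_="score").get_text(strip=True))
    rows.append({"name": name, "score": score})

print(rows)
```

Note how tightly this code is coupled to the HTML: renaming a single CSS class would break the scraper, which is exactly the fragility described above.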

Hidden APIs

Hidden APIs are APIs that are undocumented and not intended for public use. They are nevertheless publicly accessible and can be used to scrape data from websites. This method is more robust than traditional scraping, as the underlying API changes far less often than the page’s HTML, but it is also more difficult to find the API and work out how to use it.

Inspect Element

Hidden APIs are usually easier to spot if you use the Inspect Element feature of your browser. You can access it by loading the page of interest, right-clicking anywhere on the page and selecting Inspect Element, or by pressing Ctrl+Shift+I on Windows or Cmd+Opt+I on Mac.

Inspect Element gives you access to the DOM tree of the page, which is its HTML structure. The DOM tree appears in the default tab that opens automatically. However, we need the Network tab, which is usually the second tab. The Network tab shows all the requests sent to the website and the responses received. You can filter the requests by selecting XHR, which stands for XMLHttpRequest. Within that tab, our task is to find the endpoint receiving POST or GET requests and reverse-engineer it so we can get the most out of the API with as few calls as possible, and thus a smaller likelihood of getting banned. Keep in mind that sometimes the requests may be sent through AJAX and may not show up under the XHR filter. In that case, you can use Inspect Element to find the endpoint in the JavaScript code by filtering for JS files or documents.

Finding the endpoint

Let’s locate the request responsible for the data we need, which is a table containing all the universities in the rankings and their corresponding rankings, scores and other data, presented in the picture below.

QS World University Rankings

As we can see, the only data we get back is the crest of the university, its name and its overall score. To get all the information for all the universities, we would have to iterate through each page individually and collect the data by hand. But there is good news: the request responsible for the data we need is visible in the Network tab.

Network tab request for structured data

Scroll around and see what data we can obtain from it. Notice how neat and structured these data are: we do not need to parse any HTML to process them.

After this, let’s turn to the Headers subtab and see what is required to duplicate this request on our own.

Headers subtab: this shows us what parameters were used when getting the data from the hidden API

From here, we can see that the request can easily be duplicated. Simply copy the cURL command and look through the parameters. This is exactly what your browser sends when it loads the page, only with the parameters nicely rendered for you in the Network tab.

cURL command

As mentioned on the website, there are 1500 universities represented in the rankings. With only 30 results per page, this means we would have to make 50 requests to get all the data. This is not a lot, but we can do better. Let’s see if we can increase the number of results per page.
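The arithmetic behind that request count is a simple ceiling division:

```python
import math

total_universities = 1500   # as stated on the QS website
per_page = 30               # the page size used by the browser

# Ceiling division: a partially filled final page still costs one request
requests_needed = math.ceil(total_universities / per_page)
print(requests_needed)  # 50
```

Any increase in the page size divides the number of requests accordingly, which is why the next step is worth the effort.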

curl 'https://www.topuniversities.com/rankings/endpoint?nid=3897789&page=1&items_per_page=15&tab=&region=&countries=&cities=&search=&star=&sort_by=&order_by=' \
-X 'GET' \
-H 'Accept: */*' \
-H 'Authorization: Basic cXNyYW5raW5nc2FwaUBxcy5jb206UVNhZG1pbkByMTEyMg==' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: en-GB,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Host: www.topuniversities.com' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15' \
-H 'Referer: https://www.topuniversities.com/university-rankings/world-university-rankings/2024' \
-H 'Connection: keep-alive' \
-H 'Cookie: _fbp=fb.1.1696348117607.443804054; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696348117.1.1.1696348469.50.0.0; _gid=GA1.2.117943117.1696348117; _ga_16LPMES2GR=GS1.1.1696348117.1.1.1696348468.60.0.0; _ga_YN0B3DGTTZ=GS1.1.1696348117.1.1.1696348468.60.0.0; _gat_UA-37767707-2=1; _hjIncludedInSessionSample_173635=0; _hjAbsoluteSessionInProgress=1; _hjFirstSeen=1; _hjSession_173635=eyJpZCI6Ijk0NGFhZWY3LTg5ZGUtNGU3ZC1iMGNkLTZhZjAwNzM0YjQyMCIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; __hssc=238059679.2.1696348118645; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696348118645.1; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _gcl_au=1.1.1619037028.1696348117; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A0%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.3834664399.1696348116025%22%2C%22views%22%3A2%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A1%2C%22landing%22%3A%22https%3A//www.topuniversities.com/university-rankings/world-university-rankings/2024%22%2C%22enter_at%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696348159659%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; mktz_login_58504=true; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; 
sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0; STYXKEY-user_survey=other; STYXKEY-user_survey_other=; STYXKEY-globaluserUUID=TU-1696348117088-20095468; __gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348116:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348116:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; mktz_sess=sess.2.3834664399.1696348116025' \
-H 'Sec-Fetch-Dest: empty' \
-H 'X-Requested-With: XMLHttpRequest'

Postman

To do this, we need to paste the cURL command into a tool like Postman or Insomnia. These tools let us send requests to the website and inspect the response, and they make it easy to change the parameters of a request and see how the response changes.

Postman interface: press Import and paste the cURL command

This is what you should see after importing your cURL command. Play around with ticking and unticking the parameters to see if you can reduce their number and still get the same response. If a request fails, you can always go back to the original cURL command and start again. You can send the request by hitting Cmd/Ctrl+Enter.

Postman interface: this is what you should see after importing your cURL command

The parameter items_per_page allows us to get more data with each request. The parameter page allows us to get to the next page. Try changing items_per_page to 50 and page to 2 and see what happens.

You can delete all the parameters except for nid, items_per_page and page.
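With only those three parameters left, the query string is easy to assemble yourself using the standard library. A minimal sketch:

```python
from urllib.parse import urlencode

base = "https://www.topuniversities.com/rankings/endpoint"
params = {"nid": 3897789, "page": 1, "items_per_page": 30}

# urlencode turns the dictionary into a properly escaped query string
url = f"{base}?{urlencode(params)}"
print(url)
```

In practice you rarely need to do this by hand: requests builds the same query string for you when you pass a params dictionary to requests.get.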

Exporting the request to Python

Once you have a working request, you can export it to Python by clicking the Code button. This will give you a Python code snippet that you can paste into your script or notebook. You will likely have to tweak it a little to make it work and to improve its readability. After that, we need to work out how many pages there are and then iterate through them to get all the data.

Postman interface: exporting your cURL request to Python

Here is the default code we end up with from Postman:


from pprint import pprint 
import requests

url = "https://www.topuniversities.com/rankings/endpoint?nid=3897789&page=1&items_per_page=15"

payload = {}
headers = {
  'Accept': '*/*',
  'Authorization': 'Basic cXNyYW5raW5nc2FwaUBxcy5jb206UVNhZG1pbkByMTEyMg==',
  'Sec-Fetch-Site': 'same-origin',
  'Accept-Language': 'en-GB,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Sec-Fetch-Mode': 'cors',
  'Host': 'www.topuniversities.com',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15',
  'Referer': 'https://www.topuniversities.com/university-rankings/world-university-rankings/2024',
  'Connection': 'keep-alive',
  'Cookie': '_fbp=fb.1.1696348117607.443804054; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696348117.1.1.1696348469.50.0.0; _gid=GA1.2.117943117.1696348117; _ga_16LPMES2GR=GS1.1.1696348117.1.1.1696348468.60.0.0; _ga_YN0B3DGTTZ=GS1.1.1696348117.1.1.1696348468.60.0.0; _gat_UA-37767707-2=1; _hjIncludedInSessionSample_173635=0; _hjAbsoluteSessionInProgress=1; _hjFirstSeen=1; _hjSession_173635=eyJpZCI6Ijk0NGFhZWY3LTg5ZGUtNGU3ZC1iMGNkLTZhZjAwNzM0YjQyMCIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; __hssc=238059679.2.1696348118645; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696348118645.1; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _gcl_au=1.1.1619037028.1696348117; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A0%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.3834664399.1696348116025%22%2C%22views%22%3A2%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A1%2C%22landing%22%3A%22https%3A//www.topuniversities.com/university-rankings/world-university-rankings/2024%22%2C%22enter_at%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696348159659%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; mktz_login_58504=true; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; 
sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0; STYXKEY-user_survey=other; STYXKEY-user_survey_other=; STYXKEY-globaluserUUID=TU-1696348117088-20095468; __gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348116:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348116:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; mktz_sess=sess.2.3834664399.1696348116025',
  'Sec-Fetch-Dest': 'empty',
  'X-Requested-With': 'XMLHttpRequest'
}

response = requests.request("GET", url, headers=headers, data=payload)

# Save the JSON object to a file

with open('qs_ranking.json', 'w') as f:
    # pure text
    f.write(response.text)

# Pretty print the JSON object
# pprint(response.json())

Let’s analyse the request URL and see what parameters are available for us to change and optimise our scraper:

url = "https://www.topuniversities.com/rankings/endpoint?nid=3897789&page=1&items_per_page=15"

The parameters are separated by & and are as follows:

  • nid: this is the ID of the ranking which is likely unique to each year and to your machine.
  • page: this is the page number. We can change this to get the data for other pages.
  • items_per_page: this is the number of items per page. We can change this to get more data with each request.

From the response you receive, it is clear that items_per_page is the parameter we need to change to get more data with each request. You can try increasing it until the request fails: that value is the maximum number of items per page you can get. In our case, it is 1400, which means you can get all the data in just two requests instead of 50.
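Probing for that ceiling by hand is tedious. Here is a hedged sketch that finds it automatically, assuming a hypothetical probe(n) callable that wraps a real request with items_per_page=n and reports whether it succeeded (with a polite time.sleep between calls in real use):

```python
def find_max_page_size(probe, start=30, cap=100_000):
    """Find the largest page size the API accepts.

    `probe(n)` should return True if a request with items_per_page=n
    succeeds, False otherwise (a hypothetical helper you would wire up
    to a real request).
    """
    # Double the page size until we overshoot...
    hi = start
    while hi * 2 <= cap and probe(hi * 2):
        hi *= 2
    lo, hi = hi, hi * 2
    # ...then binary search for the exact boundary.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Simulated API that rejects anything above 1400, like the QS endpoint:
print(find_max_page_size(lambda n: n <= 1400))  # 1400
```

The doubling-then-bisecting strategy keeps the number of probing requests logarithmic, which matters when every call risks a ban.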


{
    "total_record": 1498,
    "current_page": "1",
    "items_per_page": "1400",
    "total_pages": 2,
    "score_nodes": [
        {
            "score_nid": "3921122",
            "nid": "297086",
            "advanced_profile": 0,
            "core_id": "743",
            "title": "Università degli studi di Bergamo",
            "path": "/universities/universita-degli-studi-di-bergamo",
            # ...,
        },
        # ...,
    ]
}

Now that we have a working request, we can iterate through all the pages and get all the data. We can also save the data to a JSON file for later use.


from pprint import pprint
import requests
import json

url = "https://www.topuniversities.com/rankings/endpoint?nid=3897789"

# &page=1&items_per_page=1400
# change URL dynamically
# get number of pages
# iterate through pages
# save data to JSON file

params = {
    'items_per_page': 800, 
    'current_page': 1,
} # these parameters will be changed dynamically by `requests` when you send a GET request

headers = {
  'Accept': '*/*',
  'Authorization': 'Basic cXNyYW5raW5nc2FwaUBxcy5jb206UVNhZG1pbkByMTEyMg==',
  'Sec-Fetch-Site': 'same-origin',
  'Accept-Language': 'en-GB,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Sec-Fetch-Mode': 'cors',
  'Host': 'www.topuniversities.com',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15',
  'Referer': 'https://www.topuniversities.com/university-rankings/world-university-rankings/2024',
  'Connection': 'keep-alive',
  'Cookie': '_fbp=fb.1.1696348117607.443804054; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696348117.1.1.1696348469.50.0.0; _gid=GA1.2.117943117.1696348117; _ga_16LPMES2GR=GS1.1.1696348117.1.1.1696348468.60.0.0; _ga_YN0B3DGTTZ=GS1.1.1696348117.1.1.1696348468.60.0.0; _gat_UA-37767707-2=1; _hjIncludedInSessionSample_173635=0; _hjAbsoluteSessionInProgress=1; _hjFirstSeen=1; _hjSession_173635=eyJpZCI6Ijk0NGFhZWY3LTg5ZGUtNGU3ZC1iMGNkLTZhZjAwNzM0YjQyMCIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; __hssc=238059679.2.1696348118645; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696348118645.1; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _gcl_au=1.1.1619037028.1696348117; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A0%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.3834664399.1696348116025%22%2C%22views%22%3A2%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A1%2C%22landing%22%3A%22https%3A//www.topuniversities.com/university-rankings/world-university-rankings/2024%22%2C%22enter_at%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696348159659%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; mktz_login_58504=true; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; 
sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0; STYXKEY-user_survey=other; STYXKEY-user_survey_other=; STYXKEY-globaluserUUID=TU-1696348117088-20095468; __gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348116:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348116:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; mktz_sess=sess.2.3834664399.1696348116025',
  'Sec-Fetch-Dest': 'empty',
  'X-Requested-With': 'XMLHttpRequest'
}

# payload = {}

for page in range(1, 3): # Do you understand why we need to iterate from 1 to 3 instead of 1 to 2?
    params['current_page'] = page # Change the current page number
    response = requests.request("GET", url, 
                                headers=headers, 
                                # data=payload, 
                                params=params) # Send the request
    # Pretty print the first two items to ensure we are getting the data we need
    # pprint(response.json()['score_nodes'][:2]) 
    with open(f'qs_ranking_{page}.json', 'w', encoding='utf-8') as f: # Save the text to a file, encoded as UTF-8
        f.write(response.text)

Understanding JSON

JSON vs CSV

JSON (JavaScript Object Notation) is a very popular format for storing data. It is a text-based format that is easy for both humans and machines to read and write, flexible enough to store a wide variety of data structures, and ubiquitous in web development, which is why we are using it here. However, it is not the best format for storing tabular data; for that, we would use CSV (comma-separated values) or TSV (tab-separated values). JSON is a hierarchical format, which means it is not straightforward to convert to tabular form. It is possible, though, and we will do it in this tutorial using the pandas library and its json_normalize function.

JSON structure

JSON has a similar structure to a nested dictionary in Python. Let us see a few examples of the same data represented in CSV, JSON and Python.

column1,column2,column3
1,2,3

This CSV represents the following table:

column1 column2 column3
1       2       3

Compare it to the JSON below:


{
    "column1": 1,
    "column2": 2,
    "column3": 3
}

It is identical to a Python dictionary:

{
    "column1": 1,
    "column2": 2,
    "column3": 3
}

Since CSV is a flat file format, it is hard to represent any nested structure with it. JSON and Python dictionaries, on the other hand, can be nested. Let’s see an example of a nested structure, which uses the same syntax as a Python dictionary holding key-value pairs. {#json-example}

{
    "name": "Alex",
    "jobs": ["tutor", "researcher"], # [] used for lists of primitive items (strings, ints, booleans, floats) or nested ones
    "courses": [ # [] used for lists of primitive items or nested ones
      { # {} used for key-value pairs 
        "name": "Data for Data Science",
        "year": 2023, 
        "code": "DS105", 
        "department": "DSI"
      },
      {
        "name": "Public Policy Analysis",
        "year": 2023, 
        "code": "GV263",
        "department": "Government"
      }
    ]
} # do not forget to close the braces and brackets

Compare this to a representation of the same data in CSV: there is no room to represent the nested structure with course codes and departments unless you add more columns, which is not ideal. We could also represent this in a ‘tidy’ format with interrelated tables linked by foreign keys, but that is beyond the scope of this tutorial.

name,jobs,courses
Alex,"tutor, researcher","Data for Data Science, Public Policy Analysis" 

Let’s put this object into Python and see how we can access the data.


data = {
    "name": "Alex", # Note that JSON uses double quotes for strings, whereas Python may use single or double ones (but not both)
    "jobs": ["tutor", "researcher"], # so to avoid confusion, we will use double quotes for strings and keys in Python as well as JSON
    "courses": [
      {
        "name": "Data for Data Science",
        "year": 2023, 
        "code": "DS105", 
        "department": "DSI"
      },
      {
        "name": "Public Policy Analysis",
        "year": 2023, 
        "code": "GV263",
        "department": "Government"
      }
    ]
}
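Accessing each level of that object works with ordinary Python indexing. Here is a self-contained sketch (the dictionary is repeated in compact form so the snippet runs on its own):

```python
data = {
    "name": "Alex",
    "jobs": ["tutor", "researcher"],
    "courses": [
        {"name": "Data for Data Science", "year": 2023, "code": "DS105", "department": "DSI"},
        {"name": "Public Policy Analysis", "year": 2023, "code": "GV263", "department": "Government"},
    ],
}

# Strings, lists and nested dictionaries are all reached with indexing:
print(data["name"])                # Alex
print(data["jobs"][0])             # tutor
print(data["courses"][1]["code"])  # GV263

# Iterating over the nested list of courses:
codes = [course["code"] for course in data["courses"]]
print(codes)                       # ['DS105', 'GV263']
```

The same chained-indexing pattern is how we will pull fields out of the scraped QS response later on.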

Visualising JSON

Sometimes the degree of nestedness is so high that it is a challenge to understand which elements are related to which other ones. To help us with this, we can use a tool called JSONCrack. Simply copy and paste your JSON into the left-hand side (or upload a file) and it will be rendered in a tree-like structure on the right-hand side. You can also collapse and expand the different levels of the tree to make it more comprehensible. Here is what you should see if you paste the JSON above into JSONCrack:

JSONCrack: a tool for visualising JSON

And here is the structure of the JSON we scraped from the QS website:

JSONCrack visualisation for QS rankings

Accessing JSON data

We can use pandas to normalise JSON into a flat structure, but it is best to access the elements we need before doing so. Let’s see how we can access the data we need from the JSON we downloaded.

import json
from pprint import pprint
# Open the json file 
with open('qs_ranking_1.json', 'r') as f:
    data = json.load(f)

# Print the keys available in the JSON
print(data.keys())

# The data about universities are stored in the `score_nodes` key
# Let's print the first two items in the list
pprint(data['score_nodes'][:2])
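Once we have isolated the list under score_nodes, pandas can flatten it into a table with json_normalize. A minimal sketch; the first sample node reuses fields from the response shown earlier, while the second is fabricated for illustration, and the real nodes carry many more keys:

```python
import pandas as pd

score_nodes = [
    {"nid": "297086", "core_id": "743",
     "title": "Università degli studi di Bergamo",
     "path": "/universities/universita-degli-studi-di-bergamo"},
    {"nid": "100000", "core_id": "999",
     "title": "Example University",
     "path": "/universities/example-university"},
]

# json_normalize turns a list of dictionaries into a DataFrame,
# flattening any nested dictionaries into dotted column names
df = pd.json_normalize(score_nodes)
print(df[["core_id", "title"]])
```

From here the usual pandas tooling (filtering, merging, df.to_csv) applies.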

Scraping using traditional methods

It is relatively simple to find endpoints for scraping data, and the output is usually well structured. We have now obtained the rankings for all universities in the 2024 QS ranking, and one of the data points is the URL ‘slug’ for accessing the webpage of each individual university, which is where the scores for each indicator are stored. If we go to one of these pages, we can repeat the same process of filtering requests to find the endpoint responsible for the data we need. It turns out the indicator scores are also served by a hidden endpoint.

Getting the request responsible for getting data on individual indicators

After identifying the request, we can again copy the cURL command and import it into Postman. We can then export it to Python and iterate through all the universities to get the data we need. Notice that this is a POST request, whereas previously we were dealing with GET requests.

Exporting the request to Postman

Your cURL command should look something like this:

curl 'https://www.topuniversities.com/qs-profiles/rank-data/513/478/0?_wrapper_format=drupal_ajax' \
-X 'POST' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Accept: application/json, text/javascript, */*; q=0.01' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: en-GB,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Host: www.topuniversities.com' \
-H 'Origin: https://www.topuniversities.com' \
-H 'Content-Length: 1212' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15' \
-H 'Referer: https://www.topuniversities.com/universities/university-oxford' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Cookie: STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A1%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.565217329.1696875467083%22%2C%22views%22%3A7%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A2%2C%22landing%22%3A%22https%3A//www.topuniversities.com/universities/massachusetts-institute-technology-mit%22%2C%22enter_at%22%3A%222023-10-9%7C19%3A17%3A47%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696877223485%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; _ga_16LPMES2GR=GS1.1.1696875469.2.1.1696877221.30.0.0; _ga_5D0D56Z1Z9=GS1.1.1696875469.1.1.1696877221.0.0.0; _ga_YN0B3DGTTZ=GS1.1.1696875469.2.1.1696877221.27.0.0; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696875469.2.1.1696877218.50.0.0; _gid=GA1.2.1105426707.1696875469; _hjAbsoluteSessionInProgress=0; _hjSession_173635=eyJpZCI6ImJkMTA0ZDkwLTZkZmEtNGU4MS1iYWY2LTBmNTJjMmZlZTg2OCIsImNyZWF0ZWQiOjE2OTY4NzU0Njk2ODQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; __hssc=238059679.6.1696875470238; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696875470238.2; _fbp=fb.1.1696348117607.443804054; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _gcl_au=1.1.1619037028.1696348117; _hjIncludedInSessionSample_173635=0; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; _gat_%5Bobject%20Object%5D=1; _gat_UA-223073533-1=1; _gat_UA-37767707-2=1; STYXKEY-user_survey=other; STYXKEY-user_survey_other=; STYXKEY-globaluserUUID=TU-1696875469298-84871097; mktz_sess=sess.2.565217329.1696875467083; 
__gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348471:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348471:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0' \
-H 'X-Requested-With: XMLHttpRequest' \
--data 'js=true&_drupal_ajax=1&ajax_page_state%5Btheme%5D=tu_d8&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=addtoany%2Faddtoany.front%2Cckeditor_accordion%2Faccordion.frontend%2Cclientside_validation_jquery%2Fcv.jquery.ckeditor%2Cclientside_validation_jquery%2Fcv.jquery.ife%2Cclientside_validation_jquery%2Fcv.jquery.validate%2Cclientside_validation_jquery%2Fcv.pattern.method%2Ccore%2Fdrupal.form%2Ccore%2Fdrupal.states%2Ccore%2Fnormalize%2Ceu_cookie_compliance%2Feu_cookie_compliance_bare%2Cflag%2Fflag.link_ajax%2Cga%2Fanalytics%2Clayout_discovery%2Fonecol%2Cqs_article%2Fqs_article%2Cqs_firebase_sso%2Fsso-lib%2Cqs_firebase_sso%2Fsso-lib-header%2Cqs_flexreg_user_flow%2Fqs_flexreg_user_flow%2Cqs_global_site_search%2Fsearch_header%2Cqs_profiles%2Fhighcharts%2Cqs_profiles%2Fqs_profiles%2Cqs_profiles%2Fqs_profiles_circle%2Cqs_user_profile%2FqsUserProfile%2Csystem%2Fbase%2Ctu_d8%2Fglobal%2Ctu_d8%2Fnode%2Ctu_d8%2Fprofile_header%2Ctu_d8%2Fqna_forums%2Ctu_d8%2Fqs_campus_locations%2Ctu_d8%2Fqs_instant%2Ctu_d8%2Fqs_profile_new%2Ctu_d8%2Fqs_profile_new_datalayer%2Ctu_d8%2Fqs_program_tabs%2Ctu_d8%2Fqs_ranking_chart%2Ctu_d8%2Fqs_related_content%2Ctu_d8%2Fqs_similar_programs%2Cviews%2Fviews.module'

Here is what the request looks like in Postman:

Postman interface: this is what you should see after importing your cURL command for the POST request

A POST request means that the parameters are sent in the body of the request rather than in the URL; however, in our case it seems that the unique ID of each university is sent in the URL. We can try to change the ID to see if we can get data for other universities.

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/478/0?_wrapper_format=drupal_ajax"

This is the request sent to get the data for MIT, and in our qs_ranking_1.json file we can see that the ID for MIT is 478. Let’s try to change it to 362, which is the ID for LSE.

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/362/0?_wrapper_format=drupal_ajax"

The request is successful and we do indeed get the indicators for LSE. This means we can gather the unique IDs of all the universities and iterate through them to get the data we need. However, as you may have noticed, the data in the JSON response is still formatted as HTML, so we might need to do some cleaning before we can put it into a tabular format; this is where traditional scraping methods come in handy.
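As a taste of that cleaning step, here is a sketch that strips the markup from an HTML fragment with BeautifulSoup. The fragment and its class names are invented for illustration; the real response embeds its own markup:

```python
from bs4 import BeautifulSoup

# An HTML-wrapped value of the kind a JSON field might contain:
fragment = '<div class="circle"><span class="uni-rank">=8<sup>th</sup></span></div>'

# get_text collapses the element tree down to its text content
text = BeautifulSoup(fragment, "html.parser").get_text(strip=True)
print(text)  # =8th
```

Applying this to each HTML-bearing field of the JSON response gives us clean values ready for a table.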


import json 
from pprint import pprint

# open files beginning with `qs_ranking_`
# iterate through the files and append the data to a list

unis = []
for page in range(1, 3):
    with open(f'qs_ranking_{page}.json', 'r') as f:
        unis.extend(json.load(f)['score_nodes']) # read up on the difference between `extend` and `append`

pprint(unis[:2])

Now we have a list of dictionaries, but we need a list of unique IDs. Here they appear to be stored under the core_id key.

Let’s create a list of unique IDs to scrape. We can use a for loop to iterate through the list of dictionaries and append the IDs to a new list. We can then use the set function to deduplicate the list.


uni_ids = []

for uni in unis:
    uni_ids.append(uni['core_id'])

# Deduplicate the list
uni_ids = list(set(uni_ids))

# Let's check the length of the list of unique IDs
print(len(uni_ids)) 
# and a few items from the list
print(uni_ids[:5])
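With the IDs in hand, the per-university endpoint URLs can be generated from the request we captured earlier. The 513 path segment and the trailing 0 are copied verbatim from that request; we have not tested whether they vary:

```python
template = ("https://www.topuniversities.com/qs-profiles/rank-data/"
            "513/{core_id}/0?_wrapper_format=drupal_ajax")

# A few IDs mentioned in this tutorial: MIT, LSE and one more
uni_ids = ["478", "362", "743"]
urls = [template.format(core_id=uid) for uid in uni_ids]
print(urls[0])
```

In the real scraper, uni_ids would be the deduplicated list built above rather than this hand-picked sample.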

We can use a method similar to the one we used previously to save the data to JSON files. It is probably a good idea just to save the raw JSON files for now and parse them for useful information later; this way we don’t have to worry about extracting the right data from the HTML straight away and can instead focus on procuring it first.
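That save-first workflow can be sketched as follows. save_raw and url_for are hypothetical helper names, and the commented-out loop assumes headers and payload are set up as in the exported request:

```python
from pathlib import Path

def save_raw(uni_id, text, outdir="indicator_pages"):
    """Save one raw response to disk before any parsing happens."""
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    (out / f"indicators_{uni_id}.json").write_text(text, encoding="utf-8")

# In the real loop you would call the endpoint for each ID, e.g.:
#   for uid in uni_ids:
#       resp = requests.post(url_for(uid), headers=headers, data=payload)
#       save_raw(uid, resp.text)
#       time.sleep(1)  # be polite to the server
save_raw("362", '{"demo": true}')
print(Path("indicator_pages/indicators_362.json").read_text(encoding="utf-8"))
```

Keeping the raw responses on disk means a parsing mistake costs you a re-run of a local loop, not another 1400 network requests.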

This is a working request that we exported as Python code from Postman based on the cURL command above.


import requests

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/410/0?_wrapper_format=drupal_ajax"

payload = "js=true&_drupal_ajax=1&ajax_page_state%5Btheme%5D=tu_d8&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=addtoany%2Faddtoany.front%2Cckeditor_accordion%2Faccordion.frontend%2Cclientside_validation_jquery%2Fcv.jquery.ckeditor%2Cclientside_validation_jquery%2Fcv.jquery.ife%2Cclientside_validation_jquery%2Fcv.jquery.validate%2Cclientside_validation_jquery%2Fcv.pattern.method%2Ccore%2Fdrupal.form%2Ccore%2Fdrupal.states%2Ccore%2Fnormalize%2Ceu_cookie_compliance%2Feu_cookie_compliance_bare%2Cflag%2Fflag.link_ajax%2Cga%2Fanalytics%2Clayout_discovery%2Fonecol%2Cqs_article%2Fqs_article%2Cqs_firebase_sso%2Fsso-lib%2Cqs_firebase_sso%2Fsso-lib-header%2Cqs_flexreg_user_flow%2Fqs_flexreg_user_flow%2Cqs_global_site_search%2Fsearch_header%2Cqs_profiles%2Fhighcharts%2Cqs_profiles%2Fqs_profiles%2Cqs_profiles%2Fqs_profiles_circle%2Cqs_user_profile%2FqsUserProfile%2Csystem%2Fbase%2Ctu_d8%2Fglobal%2Ctu_d8%2Fnode%2Ctu_d8%2Fprofile_header%2Ctu_d8%2Fqna_forums%2Ctu_d8%2Fqs_campus_locations%2Ctu_d8%2Fqs_instant%2Ctu_d8%2Fqs_profile_new%2Ctu_d8%2Fqs_profile_new_datalayer%2Ctu_d8%2Fqs_program_tabs%2Ctu_d8%2Fqs_ranking_chart%2Ctu_d8%2Fqs_related_content%2Ctu_d8%2Fqs_similar_programs%2Cviews%2Fviews.module"
headers = {
  'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
  'Accept': 'application/json, text/javascript, */*; q=0.01',
  'Sec-Fetch-Site': 'same-origin',
  'Accept-Language': 'en-GB,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Sec-Fetch-Mode': 'cors',
  'Host': 'www.topuniversities.com',
  'Origin': 'https://www.topuniversities.com',
  # Content-Length is omitted; requests computes it from the payload automatically
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15',
  'Referer': 'https://www.topuniversities.com/universities/massachusetts-institute-technology-mit',
  'Connection': 'keep-alive',
  'Sec-Fetch-Dest': 'empty',
  'Cookie': 'STYXKEY_first_visit=yes; mktz_ab=%7B%2258504%22%3A%7B%22v%22%3A1%2C%22l%22%3A134687%7D%7D; mktz_client=%7B%22is_returning%22%3A1%2C%22uid%22%3A%22172262619832803285%22%2C%22session%22%3A%22sess.2.565217329.1696875467083%22%2C%22views%22%3A2%2C%22referer_url%22%3A%22%22%2C%22referer_domain%22%3A%22%22%2C%22referer_type%22%3A%22direct%22%2C%22visits%22%3A2%2C%22landing%22%3A%22https%3A//www.topuniversities.com/universities/massachusetts-institute-technology-mit%22%2C%22enter_at%22%3A%222023-10-9%7C19%3A17%3A47%22%2C%22first_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_visit%22%3A%222023-10-3%7C16%3A48%3A36%22%2C%22last_variation%22%3A%22134687%3D1696875480078%22%2C%22utm_source%22%3Afalse%2C%22utm_term%22%3Afalse%2C%22utm_campaign%22%3Afalse%2C%22utm_content%22%3Afalse%2C%22utm_medium%22%3Afalse%2C%22consent%22%3A%22%22%7D; _ga_16LPMES2GR=GS1.1.1696875469.2.0.1696875478.51.0.0; _ga_5D0D56Z1Z9=GS1.1.1696875469.1.0.1696875478.0.0.0; _ga_YN0B3DGTTZ=GS1.1.1696875469.2.0.1696875478.51.0.0; _ga=GA1.2.560061194.1696348117; _ga_8SLQFC5LXV=GS1.2.1696875469.2.1.1696875473.56.0.0; _gid=GA1.2.1105426707.1696875469; __hssc=238059679.1.1696875470238; __hssrc=1; __hstc=238059679.00da05ec1d70ffb25c243f6a9eeb8cce.1696348118645.1696348118645.1696875470238.2; hubspotutk=00da05ec1d70ffb25c243f6a9eeb8cce; _fbp=fb.1.1696348117607.443804054; _gat_%5Bobject%20Object%5D=1; _gat_UA-223073533-1=1; _gat_UA-37767707-2=1; _gcl_au=1.1.1619037028.1696348117; _hjAbsoluteSessionInProgress=0; _hjIncludedInSessionSample_173635=0; _hjSessionUser_173635=eyJpZCI6ImZkNzU2NzkzLTQxODItNTkwOC1iZjkyLTIwMjJiZDRiNDkzNSIsImNyZWF0ZWQiOjE2OTYzNDgxMTc4NjMsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_173635=eyJpZCI6ImJkMTA0ZDkwLTZkZmEtNGU4MS1iYWY2LTBmNTJjMmZlZTg2OCIsImNyZWF0ZWQiOjE2OTY4NzU0Njk2ODQsImluU2FtcGxlIjpmYWxzZSwic2Vzc2lvbml6ZXJCZXRhRW5hYmxlZCI6ZmFsc2V9; STYXKEY-globaluserUUID=TU-1696875469298-84871097; mktz_sess=sess.2.565217329.1696875467083; '
            '__gads=ID=1157ddb27ff3be5a:T=1696348116:RT=1696348471:S=ALNI_Mb-2oHbrUmHI1cdHEugn-QF96rgEw; __gpi=UID=00000c8b9f3cdb41:T=1696348116:RT=1696348471:S=ALNI_MZUO-lvaEUqOejr5VPKMtpI2JY8Xg; sa-user-id=s%253A0-e6c12fca-af8d-42c6-4916-01680edc9fd5.j9WTOoI7TxfVNyC%252Bv%252FdVCmnmXYcLof1jmZyeL3NxcPY; sa-user-id-v2=s%253A5sEvyq-NQsZJFgFoDtyf1aziACI.ry2%252Bk1GJLTDWLdtrd%252B4MEEIItJsZhL7RAAspSMY5bVY; sa-user-id-v3=s%253AAQAKIBbb9JwS4hQF5PN5wvfoh8VY72kjgv3fCqon_R3rCJDAEG4YBCD_7_CoBigBOgRSIcquQgQDXf_T.ZWjQiq%252FulN34TyYO6iMJgqkpq3BX9A%252BFiT3Hft5fcnM; cookie-agreed=2; cookie-agreed-version=1.0.0',
  'X-Requested-With': 'XMLHttpRequest'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
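Before saving thousands of these responses, it is worth understanding their shape: the body is a JSON list of Drupal AJAX "commands", and the HTML we want sits in the `data` field of the last one. A sketch with a fabricated miniature payload (the real responses contain more commands and much more HTML):

```python
import json

# Fabricated miniature of the response structure:
# a list of command dicts, the last of which carries the HTML
sample_response = json.dumps([
    {"command": "settings", "settings": {}},
    {"command": "insert", "data": "<div class='circle'>...</div>"},
])

# Parse the body and pull the HTML out of the last command
commands = json.loads(sample_response)
html = commands[-1]["data"]
print(html)  # <div class='circle'>...</div>
```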

The only thing we need to change here is the core_id within the URL. We have 1,498 unique IDs to get through, so it would be a good idea to save each response with its ID as the filename. We can also add a short sleep between requests to avoid overloading the server—but this is optional for now, as we will only make a few requests.

# Let's create a function that will save the JSON files for us.

def save_json(data: str, filename: str):
    with open(filename, 'w') as f:
        f.write(data)
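A quick check that the helper behaves as expected, using a throwaway file (the filename here is just an example; the function is redefined so the snippet runs on its own):

```python
import os

def save_json(data: str, filename: str):
    """Write a raw JSON string straight to disk."""
    with open(filename, 'w') as f:
        f.write(data)

# Write, read back, and clean up an example file
save_json('{"status": "ok"}', 'example.json')
with open('example.json') as f:
    print(f.read())  # {"status": "ok"}
os.remove('example.json')
```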

It might also be a good idea to save the data in a separate folder. We can use the os library to create a folder if it does not exist yet.

import os

# Create a folder called `data` if it does not exist yet
# (equivalently: os.makedirs('data', exist_ok=True))
if not os.path.exists('data'):
    os.mkdir('data')

Now we can iterate through the list of unique IDs and save the data to JSON files.


import requests
import time
from tqdm.notebook import tqdm # a progress bar library
import os # a library for working with files and folders

url = "https://www.topuniversities.com/qs-profiles/rank-data/513/{}/0?_wrapper_format=drupal_ajax" # notice the {} in the URL---this is a placeholder for the ID

# Create a folder called `data` if it does not exist yet
if not os.path.exists('data'):
    os.mkdir('data')

for uni_id in tqdm(uni_ids[:5]):
    print(f"Saving data for {uni_id}")
    # make a POST request instead of a GET request
    response = requests.request("POST", url.format(uni_id), headers=headers, data=payload)
    save_json(response.text, f"data/{uni_id}.json")
    time.sleep(1)

Explore the structure of the files you have saved. Within each of them, we can see the HTML for the indicators—it seems to be within the last dictionary in a list. There are also some additional data about historical ratings that you might want to make use of in, for example, finding out which regions have seen the most impressive improvements over the past decade. As we are only interested in the 2024 indicators, however, we can extract the HTML and its associated university ID, then use Scrapy Selectors to parse the HTML and extract the data we need.

Saving the files and parsing them later allows you to separate the data collection and data processing steps. This can be beneficial for a few reasons:

  1. Reduced server load: By saving the data to files and processing it later, you can avoid overloading the server with too many requests at once. This can help prevent your IP address from being blocked or banned by the server.

  2. Flexibility: Saving the data to files allows you to work with the data at your own pace and on your own schedule. You can parse the data when it’s convenient for you, and you can easily re-parse the data if you need to make changes to your parsing code.

  3. Data backup: Saving the data to files provides a backup in case something goes wrong during the parsing process. If your parsing code encounters an error or if your computer crashes, you can simply re-run the parsing code on the saved data rather than having to re-collect the data from scratch.
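Because each raw file doubles as a record of progress, an interrupted collection run can also be resumed by skipping the IDs that already have a file on disk. A minimal sketch (the helper name is our own):

```python
import os

def remaining_ids(uni_ids, out_dir='data'):
    """Return only the IDs that do not yet have a saved JSON file,
    so a re-run picks up where the last one stopped."""
    if not os.path.exists(out_dir):
        return list(uni_ids)
    done = {name.split('.')[0] for name in os.listdir(out_dir)
            if name.endswith('.json')}
    return [i for i in uni_ids if str(i) not in done]
```

Calling `remaining_ids(uni_ids)` before the download loop means re-running the notebook never repeats completed requests.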


import json
import os

# Create an empty dictionary to store the university data
uni_data = {}

# Loop through each file in the `data` folder
for filename in os.listdir('data'):
  
  # Open the file and read in the JSON data
  with open(f'data/{filename}', 'r') as f:
    json_data = json.load(f)
  
  # Get the ID of the university from the filename
  uni_id = filename.split('.')[0]
  
  # Add a new key-value pair to the `uni_data` dictionary
  # where the key is the ID and the value is the last dictionary
  # in the list of dictionaries in the JSON data
  uni_data[uni_id] = json_data[-1]['data']

Now that we have the HTML, we can use Scrapy Selectors to parse it and extract the data.


from scrapy import Selector

# Let's create a function that will extract the data we need from the HTML

def extract_data(html: str):
    # define a selector
    sel = Selector(text=html)
    # extract the data from div.circle---this gives you a list of Scrapy selector objects which you can drill down into
    indicators = sel.css('div.circle')
    # create a dictionary to store the data
    indicator_data = {}
    # iterate through the indicators and extract the data
    
    for indicator in indicators:
        indicator_name = indicator.css('.itm-name::text').get().strip()
        indicator_value = indicator.css('.score::text').get().strip()
        
        # Store data in a dictionary
        indicator_data[indicator_name] = indicator_value

    return indicator_data

# Let's test the function on the HTML we have saved

extract_data(uni_data['2551'])  # '2551' is one example ID from the files we saved

# And now we can iterate through the dictionary and extract the data for each university

for uni_id, data in uni_data.items():
    uni_data[uni_id] = extract_data(data)

Now we have a dictionary with the data we need. We can use pandas to convert it to a dataframe and save it to a CSV file.


import pandas as pd

df = pd.DataFrame.from_dict(uni_data, orient='index')
df.head()

# and then save it to a CSV file
df.to_csv('qs_indicators.csv')

Now we have a dataframe with all the indicators for all the universities, however, the data are missing university names, regions, and weblinks that we received with the previous request. We can use the core_id to join the two dataframes together. First, let us load the data from the previous request.


import json
import pandas as pd

# Open the json file
with open('qs_ranking_1.json', 'r') as f:
    data = json.load(f)

# get into the `score_nodes` key which contains required data
data = data['score_nodes']

# convert to a pandas dataframe
unis_df = pd.DataFrame.from_dict(data)

# Merge the two dataframes together
unis_df = unis_df.merge(df, left_on='core_id', right_index=True)

# Save the dataframe to a CSV file
unis_df.to_csv('qs_ranking.csv')
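One thing to watch with this merge: the index of `df` came from filenames, so it is a string, while `core_id` in `unis_df` may be numeric depending on how the JSON was parsed. A dtype mismatch silently produces zero matched rows, so casting both keys to str first is a safe habit. A sketch with fabricated miniature frames:

```python
import pandas as pd

# Fabricated miniature frames illustrating the dtype pitfall:
# numeric core_id on the left, string index on the right
unis_df = pd.DataFrame({"core_id": [410, 411], "title": ["MIT", "Cambridge"]})
df = pd.DataFrame({"Overall": ["100", "99.2"]}, index=["410", "411"])

# Align key types before merging so rows actually match
unis_df["core_id"] = unis_df["core_id"].astype(str)
merged = unis_df.merge(df, left_on="core_id", right_index=True)
print(len(merged))  # 2 rows matched once the types agree
```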

Let’s take a look at the dataframe we have created.


import pandas as pd

unis_df = pd.read_csv('qs_ranking.csv', index_col=0)
unis_df.head()

This data set now contains all the information we need to answer the questions we set out to answer. You can now use the data to create visualisations, perform statistical analysis, or build a machine learning model.
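As a quick illustration of the kind of analysis the merged data supports, a helper like the following would answer the regions question from the introduction. The column names here are assumptions; check them against your own CSV, since they depend on the site's labels:

```python
import pandas as pd

def mean_score_by_region(df: pd.DataFrame,
                         region_col: str = 'region',
                         score_col: str = 'Overall') -> pd.Series:
    """Average overall score per region, highest first.
    Column names are assumptions; adjust to your dataframe."""
    # Scores arrive as strings from the scraped HTML, so coerce first
    scores = pd.to_numeric(df[score_col], errors='coerce')
    return scores.groupby(df[region_col]).mean().sort_values(ascending=False)

# Fabricated miniature for illustration
demo = pd.DataFrame({'region': ['Europe', 'Europe', 'Asia'],
                     'Overall': ['90', '80', '95']})
print(mean_score_by_region(demo))
```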

Summary

In this tutorial, we have learned

  1. how to use Postman to identify the requests responsible for getting the data we need and how to export them to Python
  2. how to save the data to files and how to parse it later
  3. how to use Scrapy Selectors to parse HTML and extract the data we need
  4. how to join two dataframes together using a common key
  5. how to save the data to a CSV file.

Happy scraping!