💻 Week 01 Lab
Exploratory Data Analysis with Pandas: ASCOR Benchmarks

In this lab, you will learn to use pandas functions such as pivot_table() to manipulate and summarise data, and to visualise insights from datasets with a chosen plotting library.
Last updated: 20 January 2025, 18:39
📋 Preparation
Before you come to your first lab session, make sure to follow one of the options below to set up your environment.
✋🏻 Need help? Post a screenshot of the problems you are facing on the #help channel on Slack. I will be monitoring Slack on Monday 20 Jan afternoon and Tuesday 21 Jan morning to help you get set up.
Option 1: Using Nuvolos (Simpler)
If using Nuvolos, you won't need to install anything on your computer. Instead, you will run your code directly from your browser on this cloud-based platform.
Here's how to get started:
Create an account on Nuvolos by following the instructions on the First Time Accessing Nuvolos page.
Log in to Nuvolos, and open the VS Code application.
Navigate to the ascor-dataset-analysis folder by typing the following command in the terminal:

```
cd ascor-dataset-analysis
```
The folder already contains:
- A pre-configured requirements.txt file.
- A data/ folder with the required datasets.
- A Jupyter Notebook template for this lab.

Start coding directly in the pre-configured environment.
Option 2: Running Locally on Your Computer
If you go for this option, you need to be confident in installing software and configuring environments on your computer.
Here's how to get started:
Ensure that you have the following installed:
- VS Code
- Python 3+
Go to the ASCOR website and download the dataset files.
- You will need to unzip the files after downloading them.
- This will produce a folder named TPI ASCOR data - 13012025 (the date may vary).
Download the Jupyter Notebook template below:
Create a folder named ascor-dataset-analysis on your computer and ensure it has this structure:

```
ascor-dataset-analysis/
├── data/
│   └── TPI ASCOR data - 13012025/
│       ├── ASCOR_benchmarks.xlsx
│       ├── ASCOR_indicators.xlsx
│       ├── ...
│       └── ...
├── requirements.txt
└── NB01 - ASCOR Dataset Analysis.ipynb
```
Add the following to your requirements.txt file:

```
# Regular DS stuff
pandas==2.2.3

# Data viz
matplotlib==3.10.0
seaborn==0.13.2
lets-plot==4.5.2

# Stuff VSCode needs
ipykernel>=6.21.2
jupyterlab>=3.6.1
```
Open the folder in VS Code.
Install the required dependencies by running:

```
pip install -r requirements.txt
```
Open the provided Jupyter Notebook template to begin working.
🛣️ Lab Roadmap
Part I: Warm-Up (20 min)
Note to class teachers: Introduce yourself to the students and vice versa. Briefly explain the purpose of today's lab and ask students to independently work on the first part of the lab. Circulate to assist and clarify as needed. Group students who have not done the 📋 Preparation steps so they can help each other.
🎯 ACTION POINTS
Briefly explore the ASCOR data files.
💡 Tip: Open the XLSX files using Microsoft Excel or Google Sheets to get an initial understanding of the data structure.
Open the notebook NB01 - ASCOR Dataset Analysis.ipynb, go to the block of imports, and add the following code:

```python
# IMPORTS
import os
import pandas as pd

# CONSTANTS

# Point to the dataset folder
DATA_PATH = 
```
Complete the DATA_PATH line to specify the path to the dataset folder. That is:

```
ascor-dataset-analysis/
├── data/
│   └── TPI ASCOR data - 13012025/   ⬅️ DATA_PATH should point to here
├── ...
└── NB01 - ASCOR Dataset Analysis.ipynb   ⬅️ You are here
```
Give it a go first before looking at the solution below!
👀 Look at the solution
The best solution is to use the os.path.join() function. This guarantees that others can run our code irrespective of their operating system (macOS, Windows, Linux).

```python
DATA_PATH = os.path.join('.', 'data', 'TPI ASCOR data - 13012025')
```
But this would also work:

```python
DATA_PATH = './data/TPI ASCOR data - 13012025'
```
🤓 Pro-tip: We use only uppercase letters when creating the DATA_PATH variable to signal that it is a constant. Using constants appropriately means you are following best practices from the software world, and it will make us really happy when marking your work.
Now, go to the section # 1. Load the data of the notebook and write the necessary pandas code to load the ASCOR_benchmarks.xlsx file into a DataFrame named benchmarks. Use the DATA_PATH variable to specify the path to the file.

💡 Tip: You learned about the pandas method to read Excel files in the Week 01 Lecture.
Give it a go first before looking at the solution below!
👀 Look at the solution
This is as simple as:
```python
benchmarks = pd.read_excel(os.path.join(DATA_PATH, 'ASCOR_benchmarks.xlsx'))
```
But the first time you run this cell, you might get an error because pandas does not support Excel files by default. You need to install one of the external libraries that pandas uses to read Excel files.
The full solution therefore is to add this line to your requirements.txt file:

```
openpyxl>=3.0
```

Then run pip install -r requirements.txt again, restart the kernel of your notebook, and rerun the notebook from the beginning.
Interlude (10 min)
🧑🏻‍🏫 TEACHING MOMENT
Your class teacher will reinforce these concepts:
- How to specify paths from one file to another.
- Why os.path.join() is a good coding practice.
- When to use constant variables in your code.
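As a quick illustration of the first two points, here is a minimal sketch (the folder name is just the one used in this lab) showing how os.path.join() builds a path with the separator that is correct for the current operating system:

```python
import os

# Constant (uppercase by convention) pointing to the dataset folder
DATA_PATH = os.path.join('.', 'data', 'TPI ASCOR data - 13012025')

# os.path.join() inserts the right separator for the current OS
# ('/' on macOS/Linux, '\' on Windows), so the same code runs anywhere
print(DATA_PATH)
```

Hard-coding '/' or '\' in the path string would break the code on the other operating system; os.path.join() avoids that entirely.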
Part II: Exploratory Data Analysis with pandas (30 min)
Note to instructor: Encourage students to complete foundational tasks independently. Circulate to assist and clarify as needed. Avoid providing solutions immediately. Let students explore and learn from errors first.
Try to use the designated pandas functions to answer the questions below. The class teacher will provide answers at the end of this section.
📝 NOTE: Avoid using ChatGPT/Copilot to get answers straightaway. Use the search bar of the pandas documentation or, if using AI, use it to search for the function you need, not to get the answer to the question.
🎯 ACTION POINTS
Demonstrate that there is a pattern of missing records in the dataset.
It is possible to demonstrate this using just a single line of code involving groupby() and count(). Can you figure out how?

Compile a list of countries that have missing data in the benchmarks dataset. How many countries are in this list?
See if you can find a solution that involves the isnull() and sum() functions. This one doesn't have to be a one-liner.

Prove that the list of countries is the same for all relevant years in the dataset.
🏆 Who will come up with the simplest and most elegant solution to this problem?
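If you are unsure how these functions behave, here is a toy illustration on an invented DataFrame (not the ASCOR data, and not a solution to the tasks above) of the general pattern behind groupby() + count() and isnull() + sum():

```python
import pandas as pd

# A small invented DataFrame with a deliberate missing value
df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year':    [2023, 2024, 2023, 2024],
    'Value':   [1.0, 2.0, None, 4.0],
})

# count() only counts non-missing values, so gaps show up per group:
# Country A has 2 non-missing values, Country B only 1
print(df.groupby('Country')['Value'].count())

# isnull() marks missing values as True; sum() totals them up
print(df['Value'].isnull().sum())  # 1 missing value
```

The same pattern, applied to the real benchmarks DataFrame, is what the action points above are asking you to adapt.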
Interlude (15 min)
🧑🏻‍🏫 TEACHING MOMENT
Give your class teacher your full attention even if you haven't completed all the tasks.
Your class teacher will:
- Show a solution to the first two tasks.
- Invite you to share your solutions to the third task and discuss what makes a solution "simple yet elegant".
Part III: Melting Data for Visualisation (15 min)
This is also a 🧑🏻‍🏫 TEACHING MOMENT. Your class teacher will show you the two pieces of code below and explain why the second one is preferred even though they both achieve the same result:
1️⃣ Visualising the National 1.5C Benchmark for a Specific Country
```python
import matplotlib.pyplot as plt  # usually placed in the imports block at the top

# Filter the data for a specific country
country = 'United Kingdom'

# Filter the data for the 'National 1.5C benchmark' rows
national_benchmarks = benchmarks[benchmarks['Benchmark type'] == 'National 1.5C benchmark']

# Filter the data for the specific country
country_benchmarks = national_benchmarks[national_benchmarks['Country'] == country]

# Plot the data
plt.plot(country_benchmarks.columns[7:], country_benchmarks.iloc[0, 7:], marker='o')
plt.xlabel('Year')
plt.ylabel('MtCO₂e')
plt.title(f'National 1.5C benchmark for {country}')
plt.show()
```
Do you see the problem with this code? It has to do with the fact that we hard-coded the column positions. This is not good practice because if the dataset changes (say, you add another year), you will have to change the code.

🤔 REFLECT: Why is hard-coding a bad practice in this case?
Would using a constant make things any better in this case? For instance:

```python
YEAR_COLS_START = 7

# ...some lines of code later...

plt.plot(country_benchmarks.columns[YEAR_COLS_START:], country_benchmarks.iloc[0, YEAR_COLS_START:], marker='o')
```
What is your take on this?
2️⃣ Same Plot, Better Code
```python
# Filter the data for a specific country
country = 'United Kingdom'

# Filter the data for the 'National 1.5C benchmark' rows
national_benchmarks = benchmarks[benchmarks['Benchmark type'] == 'National 1.5C benchmark']

# Filter the data for the specific country
country_benchmarks = national_benchmarks[national_benchmarks['Country'] == country]

# Reshape the data from wide (one column per year) to long format
melted = country_benchmarks.melt(id_vars=['Id', 'Country', 'Publication date', 'Emissions metric',
                                          'Emissions boundary', 'Units', 'Benchmark type'],
                                 var_name='Year', value_name='MtCO₂e')
```
If you look at the melted DataFrame, you will find that it is more in line with the grammar-of-graphics approach to data visualisation.
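To see what melt() does in isolation, here is a minimal wide-to-long example on an invented DataFrame (the column names and values are illustrative, not the real ASCOR columns):

```python
import pandas as pd

# Wide format: one column per year (invented data)
wide = pd.DataFrame({
    'Country': ['UK'],
    '2023': [100.0],
    '2024': [95.0],
})

# Long format: one row per (Country, Year) pair.
# Columns listed in id_vars are kept as identifiers; every other
# column becomes a (Year, MtCO2e) row.
long_df = wide.melt(id_vars=['Country'], var_name='Year', value_name='MtCO2e')

# Rows are now: ('UK', '2023', 100.0) and ('UK', '2024', 95.0)
print(long_df)
```

Each row of the long DataFrame is one observation, which is exactly the shape that grammar-of-graphics libraries expect.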
We can use the lets-plot package, for example, to plot this data:
```python
from lets_plot import *  # Move this to the top of your notebook

uk_plot = (
    ggplot(melted) +
    geom_line(aes(x='Year', y='MtCO₂e')) +
    ggtitle(f'National 1.5C benchmark for {country}')
)
```
ℹ️ If you come from an R background, you will find this code very familiar. You can keep using this library for the rest of the course if you want.
Even if you were to use matplotlib, you would find that the code is now more readable and maintainable:
```python
# Plot the data
plt.plot(melted['Year'], melted['MtCO₂e'], marker='o')
plt.xlabel('Year')
plt.ylabel('MtCO₂e')
plt.show()
```
This code is better because we have transformed all the year columns into rows. Instead of 2023-2030 column names, we have a single column named Year and a single column named MtCO₂e holding the benchmark values. Even if we add more years to the dataset, this code will still work!
What's Next ⏩
📝 Formative Exercise
Check out the Formative Exercise page, where you will continue this exploration.
Support and Resources
Next Week
Prepare for Week 02: Introduction to APIs, where you will learn to serve insights from your datasets via FastAPI.