πŸ’» Week 01 Lab

Exploratory Data Analysis with Pandas: ASCOR Benchmarks

Published: 20 January 2025

πŸ₯… Learning Goals
By the end of this lab, you should be able to: i) load and explore a real-world dataset using pandas, ii) identify patterns of missing data systematically, iii) use boolean indexing and pivot_table() to manipulate and summarise data, and iv) visualise insights from datasets with a chosen plotting library.

Last updated: 20 January 2025, 18:39

πŸ“‹ Preparation

Before you come to your first lab session, make sure to follow one of the options below to set up your environment.

βœ‹πŸ»Need help? Post a screenshot of problems you are facing on the #help channel on Slack if you encounter any issues. I will be monitoring Slack on Monday 20 Jan afternoon and Tuesday 21 Jan morning to help you get set up.

Option 1: Using Nuvolos (Simpler)

If using Nuvolos, you won’t need to install anything on your computer. Instead, you will run your code directly from your browser on this cloud-based platform.

Here’s how to get started:

  1. Create an account on Nuvolos by following the instructions on the First Time Accessing Nuvolos page.

  2. Log in to Nuvolos, and open the VS Code application.

  3. Navigate to the ascor-dataset-analysis folder by typing the following command in the terminal:

    cd ascor-dataset-analysis
  4. The folder already contains:

    • A pre-configured requirements.txt file.
    • A data/ folder with the required datasets.
    • A Jupyter Notebook template for this lab.
  5. Start coding directly in the pre-configured environment.

Option 2: Running Locally on Your Computer

If you go for this option, you need to be confident in installing software and configuring environments on your computer.

Here’s how to get started:

  1. Ensure that you have the following installed:

    • Python (with pip available from your terminal)
    • Visual Studio Code

  2. Go to the ASCOR website and download the dataset files.

    • You will need to unzip the files after downloading them.
    • This will produce a folder named TPI ASCOR data - 13012025 (the date may vary).
  3. Download the Jupyter Notebook template below:

  4. Create a folder named ascor-dataset-analysis on your computer and ensure it has this structure:

    ascor-dataset-analysis/
    β”œβ”€β”€ data/
    β”‚   β”œβ”€β”€ TPI ASCOR data - 13012025/
    β”‚   β”‚   β”œβ”€β”€ ASCOR_benchmarks.xlsx
    β”‚   β”‚   β”œβ”€β”€ ASCOR_indicators.xlsx
    β”‚   β”‚   β”œβ”€β”€ ...
    β”‚   β”‚   └── ...
    β”œβ”€β”€ requirements.txt
    └── NB01 - ASCOR Dataset Analysis.ipynb
  5. Add the following to your requirements.txt file:

    # Regular DS stuff
    pandas==2.2.3
    
    # Data viz
    matplotlib==3.10.0
    seaborn==0.13.2
    lets-plot==4.5.2
    
    
    # Stuff VSCode needs
    ipykernel>=6.21.2
    jupyterlab>=3.6.1
  6. Open the folder in VS Code.

  7. Install the required dependencies by running:

    pip install -r requirements.txt
  8. Open the provided Jupyter Notebook template to begin working.
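
Before you start, it is worth a quick sanity check that the environment works. Here is a minimal sketch, assuming you kept the folder structure above (the dated folder name may differ on your machine); run it in a notebook cell:

import os
import pandas as pd

# Should print 2.2.3, the version pinned in requirements.txt
print(pd.__version__)

# Confirm the dataset folder is where the notebook expects it
print(os.listdir(os.path.join('.', 'data')))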

πŸ›£οΈ Lab Roadmap

Part I: Warm-Up (20 min)

Note to class teachers: Introduce yourself to the students and vice versa. Briefly explain the purpose of today’s lab and ask students to independently work on the first part of the lab. Circulate to assist and clarify as needed. Group students who have not done the πŸ“‹ Preparation steps so they can help each other.

🎯 ACTION POINTS

  1. Briefly explore the ASCOR data files.

    πŸ’‘ Tip: Open the XLSX files using Microsoft Excel or Google Sheets to get an initial understanding of the data structure.

  2. Open the notebook NB01 - ASCOR Dataset Analysis.ipynb, go to the block of imports, and add the following code:

    # IMPORTS
    import os
    
    import pandas as pd
    
    # CONSTANTS
    
    # Point to the dataset folder
    DATA_PATH = 
  3. Complete the DATA_PATH line to specify the path to the dataset folder.

    That is:

    ascor-dataset-analysis/
    β”œβ”€β”€ data/
    β”‚   β”œβ”€β”€ TPI ASCOR data - 13012025/ ⬅️ DATA_PATH should point to here
    ...
    └── NB01 - ASCOR Dataset Analysis.ipynb ⬅️ You are here

    Give it a go first before looking at the solution below!

πŸ” Look at the solution

The best solution is to use the os.path.join() function. This guarantees that others can run our code irrespective of their operating system (macOS, Windows, Linux).

DATA_PATH = os.path.join('.', 'data', 'TPI ASCOR data - 13012025')

But this would also work:

DATA_PATH = './data/TPI ASCOR data - 13012025'

πŸ€“ Pro-tip: We write DATA_PATH in all uppercase letters to signal that it is a constant, that is, a variable whose value is not meant to change while the program runs. Using constants appropriately in your code means you are engaging with best practices from the software world and will make us really happy when marking your work.
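
If you are curious about what os.path.join() actually produces, try it in a cell (the folder name here is just a hypothetical example; the separator depends on your operating system):

import os

# Prints './data/example' on macOS/Linux, '.\\data\\example' on Windows
print(os.path.join('.', 'data', 'example'))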

  4. Now, go to the section # 1. Load the data of the notebook and write the necessary pandas code to load the ASCOR_benchmarks.xlsx file into a DataFrame named benchmarks. Use the DATA_PATH variable to specify the path to the file.

    πŸ’‘ Tip: You learned about the pandas method to read Excel files in the Week 01 Lecture.

    Give it a go first before looking at the solution below!

πŸ” Look at the solution

This is as simple as:

benchmarks = pd.read_excel(os.path.join(DATA_PATH, 'ASCOR_benchmarks.xlsx'))

But the first time you run this cell, you might get an error: pandas does not ship with Excel support by default, so you need to install one of the external libraries pandas uses to read Excel files.

The full solution therefore is to add this line to your requirements.txt file:

openpyxl>=3.0

Then run pip install -r requirements.txt again, restart the kernel of your notebook, and rerun the notebook from the beginning.
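
Once the file loads without errors, it is a good habit to take a quick first look at what you have, for example:

# A quick first look at the loaded data
print(benchmarks.shape)  # (number of rows, number of columns)
benchmarks.head()        # preview the first five rows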

Interlude (10 min)

πŸ§‘πŸ»β€πŸ« TEACHING MOMENT

Your class teacher will reinforce these concepts:

  • How to specify paths from one file to another.
  • Why os.path.join() is a good coding practice.
  • When to use constant variables in your code.

Part II: Exploratory Data Analysis with pandas (30 min)

Note to instructor: Encourage students to complete foundational tasks independently. Circulate to assist and clarify as needed. Avoid providing solutions immediately. Let students explore and learn from errors first.

Try to use the designated pandas functions to answer the questions below. If any of these functions are unfamiliar, see the generic toy-data illustration after the action points. The class teacher will provide answers at the end of this section.

πŸ“‹ NOTE: Avoid using ChatGPT/Copilot to get answers straightaway. Use the search bar of the pandas documentation or, if using AI, use it to search for the function you need, not to get the answer to the question.

🎯 ACTION POINTS

  1. Demonstrate that there is a pattern of missing records in the dataset.

    It is possible to demonstrate this using just a single line of code involving groupby() and count(). Can you figure out how?

  2. Compile a list of countries that have missing data in the benchmarks dataset. How many countries are in this list?

    See if you can find a solution that involves the isnull() and sum() functions. This one doesn’t have to be a one-liner.

  3. Prove that the list of countries is the same for all relevant years in the dataset.

    πŸ† Who will come up with the simplest and most elegant solution to this problem?

Interlude (15 min)

πŸ§‘πŸ»β€πŸ« TEACHING MOMENT

Give your class teacher your full attention even if you haven’t completed all the tasks.

Your class teacher will:

  • Show a solution to the first two tasks.
  • Invite you to share your solutions to the third task and discuss what makes a solution β€œsimple yet elegant”.

Part III: Melting Data for Visualisation (15 min)

This is also a πŸ§‘πŸ»β€πŸ« TEACHING MOMENT. Your class teacher will show you the two pieces of code below and explain why the second one is preferred, even though both achieve the same result:

1️⃣ Visualising the National 1.5C Benchmark for a Specific Country
import matplotlib.pyplot as plt  # this import belongs at the top of your notebook

# Choose a specific country
country = 'United Kingdom'

# Filter the data for the 'National 1.5C benchmark' rows
national_benchmarks = benchmarks[benchmarks['Benchmark type'] == 'National 1.5C benchmark']

# Filter the data for the specific country
country_benchmarks = national_benchmarks[national_benchmarks['Country'] == country]

# Plot the data
plt.plot(country_benchmarks.columns[7:], country_benchmarks.iloc[0, 7:], marker='o')
plt.xlabel('Year')
plt.ylabel('MtCOβ‚‚e')
plt.title(f'National 1.5C benchmark for {country}')
plt.show()

Do you see the problem with this code? It has to do with the fact that we hard-coded the column positions. This is not good practice: if the dataset changes, say, another year is added, you will have to change the code.

πŸ“‹ REFLECT: Why is hard-coding a bad practice in this case?

Would using a constant make things any better in this case? For instance:

YEAR_COLS_START = 7

...some lines of code later...

plt.plot(country_benchmarks.columns[YEAR_COLS_START:], country_benchmarks.iloc[0, YEAR_COLS_START:], marker='o')

What is your take on this?

2️⃣ Same Plot, Better Code
# Choose a specific country
country = 'United Kingdom'

# Filter the data for the 'National 1.5C benchmark' rows
national_benchmarks = benchmarks[benchmarks['Benchmark type'] == 'National 1.5C benchmark']

# Filter the data for the specific country
country_benchmarks = national_benchmarks[national_benchmarks['Country'] == country]

# Reshape the data from wide to long format
melted = country_benchmarks.melt(id_vars=['Id', 'Country', 'Publication date', 'Emissions metric', 'Emissions boundary', 'Units', 'Benchmark type'], var_name='Year', value_name='MtCOβ‚‚e')

If you look at melted, you will find a DataFrame that is more in line with the grammar-of-graphics approach to data visualisation: one row per observation, one column per variable.

We can use the lets-plot package, for example, to plot this data:

from lets_plot import * # Move this to the top of your notebook

uk_plot = (
    ggplot(melted) +
    geom_line(aes(x='Year', y='MtCOβ‚‚e')) +
    ggtitle(f'National 1.5C benchmark for {country}')
)
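
Note that building uk_plot does not display anything by itself. In a Jupyter notebook, lets-plot charts render once you run the one-time HTML setup and then evaluate the plot object in a cell:

LetsPlot.setup_html()  # one-time setup per notebook, run after the import
uk_plot                # evaluating the plot object displays the chart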

☝️ If you come from an R background, you will find this code very familiar. You can keep using this library for the rest of the course if you want.

Even if you were to use matplotlib, you would find that the code is now more readable and maintainable:

# Plot the data
plt.plot(melted['Year'], melted['MtCOβ‚‚e'], marker='o')
plt.xlabel('Year')
plt.ylabel('MtCOβ‚‚e')
plt.show()

This code is better because we have transformed all the year columns into rows. Instead of 2023-2030 column names, we now have a single column named Year and a single column named MtCOβ‚‚e holding the benchmark values. Even if more years are added to the dataset, this code will still work!
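
One small caveat, in case you hit it later: after melt(), the Year column holds the old column names as strings. If you ever need a genuinely numeric axis (for example, for custom tick spacing), you can convert it:

# Convert Year from strings (the former column names) to integers
melted['Year'] = melted['Year'].astype(int)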

What’s Next ⏩

πŸ“ Formative Exercise

Check out the Formative Exercise page, where you will continue this exploration.


Next Week

Prepare for Week 02: Introduction to APIs, where you will learn to serve insights from your datasets via FastAPI.