🧪 Week 07 Lab

Practice normalising JSON data and using the split -> apply -> combine strategy with groupby

Published: 07 March 2025

🥅 Learning Goals

By the end of this lab, you should be able to:

Use pd.json_normalize() to flatten nested JSON data
Handle columns containing lists using DataFrame.explode()
Apply the split-apply-combine pattern with groupby()
Transform complex data structures into analysis-ready formats

Last Updated: 6 March 2025, 19:00 GMT

📍Time and Location: Friday, 7 March 2025. Check your timetable for the precise time and location of your class.

📋 Preparation

To come prepared to this lab, make sure you have:

Attended the 🗣️ Week 07 lecture
Reviewed the JSON normalization concepts covered in the lecture
Gained basic familiarity with pandas groupby operations

🛣️ Roadmap

Here is how we will achieve the goal for this lab:

Part I: ⚙️ Set Up (10 min)

Option 1: Using Nuvolos (Recommended)

If you are working on Nuvolos, follow these steps:

Navigate to the Week 7 materials in your my-ds105w-notes folder. The notebook is in the Week 07 folder, while the data and figures are in the root directories:

my-ds105w-notes/               # Root directory
├── data/                      # Data directory (at root level)
│   ├── opensanctions/         # Contains the data files for this lab
│   └── ... (other data folders)
│
├── figures/                   # Figures directory (at root level)
│   ├── w07-lab/               # Contains reference figures for this lab
│   └── ... (other figure folders)
│
├── Week 01 - ...
├── Week 02 - ...
└── Week 07 - JSON Normalization and Data Reshaping/
    └── W07-Lab-Notebook.ipynb # Open this notebook to begin

Note: The exact order these folders appear in your file explorer may differ depending on your sorting preferences. The important thing is to locate the W07-Lab-Notebook.ipynb file in the Week 07 folder, and ensure you can access the data in the data/opensanctions/ directory.

Simply open the W07-Lab-Notebook.ipynb file in VS Code to begin working on the lab exercises.
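Because the data lives at the root level rather than inside the Week 07 folder, paths in the notebook need to go up one level. Here is a minimal sketch of that, assuming the notebook's working directory is the Week 07 folder (this depends on your editor settings). The helper name opensanctions_dir is made up for illustration and is not part of the lab notebook.

```python
from pathlib import Path


def opensanctions_dir(notebook_dir: Path) -> Path:
    """Return the data/opensanctions folder, given the folder holding the notebook."""
    # The week folder is a direct child of the root directory,
    # so the data directory is one level up, under data/opensanctions.
    return notebook_dir.parent / "data" / "opensanctions"


# Assumption: the current working directory is the Week 07 folder.
data_dir = opensanctions_dir(Path.cwd())
print(data_dir)

# Only list the contents if the folder is actually there.
if data_dir.is_dir():
    for path in sorted(data_dir.iterdir()):
        print(path.name)
```

If nothing is listed, the working directory is probably not the Week 07 folder, and the relative path needs adjusting.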

Option 2: Download the Lab Files Directly

If you prefer to work on your own machine, you can download the lab files:

After downloading, extract the files and open the W07-Lab-Notebook.ipynb file in your preferred environment.

Part II: 📚 Practice (70-80 min)

💽 DATA SPECIFICATION CARD:
We’re going to use data from the OpenSanctions project. This dataset includes information about individuals and entities that governments and international organizations have sanctioned worldwide. OpenSanctions is operated by a German company, OpenSanctions Datenbanken GmbH, and has received funding from the German Federal Ministry for Education and Research. They offer a paid API for accessing the data, but you can also download the data in bulk for free, for academic and research purposes.
A few things to know about the dataset:
We are focusing on Targets. These are the individuals and entities that have been sanctioned. This dataset includes information about the name, country, and other ‘properties’ of the targets.
We have filtered for Russian Targets. This is in part because Alex, who provided us with the data sample for this lab, is doing a PhD where he focuses on studying Russia, and also because the dataset is large and we want to make it more manageable for this lab.
We are using a small random sample. Again, this is to make the dataset more manageable for this lab. The full dataset is much larger.
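To get a feel for what nested ‘properties’ look like once flattened, here is a minimal sketch using pd.json_normalize(). The two records are invented for illustration; their keys and values are assumptions, and the real lab data may be structured differently.

```python
import pandas as pd

# Hypothetical records, made up for this example (not taken from the lab data).
records = [
    {
        "id": "ex-1",
        "schema": "Person",
        "properties": {"name": ["Example Person"], "country": ["ru"]},
    },
    {
        "id": "ex-2",
        "schema": "Company",
        "properties": {
            "name": ["Example Co", "Example Company LLC"],
            "country": ["ru", "cy"],
        },
    },
]

# Nested dictionaries become columns named with a dot separator by default,
# e.g. the "name" key inside "properties" becomes "properties.name".
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# Lists are NOT flattened: each cell in these columns still holds a list.
print(flat["properties.name"].tolist())
```

Note that json_normalize only unpacks dictionaries here. The list-valued cells are left as they are, which is where DataFrame.explode() comes in.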

Follow the instructions in the lab notebook to complete the exercises.

Notes:

You can work alone or in small groups for this.
If you want, feel free to play a game of 🧑‍✈️ Pilot and 🙋 Copilot(s), as we’ve done in the past.

What the exercises will cover:

Using pd.json_normalize to flatten nested JSON data.
Using DataFrame.explode to expand lists in a column.
Using pd.merge to combine dataframes.
Using the groupby method in pandas.
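As a rough sketch of how explode and merge fit together, the example below expands a list column into one row per value and then attaches a lookup table. Both tables and all their column names are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical table in which the "country" column holds lists.
targets = pd.DataFrame(
    {
        "id": ["ex-1", "ex-2"],
        "country": [["ru"], ["ru", "cy"]],
    }
)

# explode() produces one row per list element and repeats the other columns.
# The original index is repeated too, so we reset it to get a clean 0..n-1 index.
exploded = targets.explode("country").reset_index(drop=True)

# Hypothetical lookup table mapping country codes to names.
country_names = pd.DataFrame(
    {
        "country": ["ru", "cy"],
        "country_name": ["Russia", "Cyprus"],
    }
)

# A left merge keeps every exploded row, even if a code has no match.
merged = pd.merge(exploded, country_names, on="country", how="left")
print(merged)
```

Using how="left" is a design choice: rows with an unknown code stay in the result with a missing country_name instead of silently disappearing.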

📚 References

Here are some useful references for the techniques we’ll be using in this lab:

The pd.json_normalize() function to convert JSON data more easily into tabular format
The DataFrame.explode() method to handle columns whose values are lists
The DataFrame.groupby() function, combined with apply() and agg() to aggregate data
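The difference between agg() and apply() after a groupby() can be sketched as follows. The table is a made-up example, not the lab data, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical long-format table: one row per (target, country) pair.
df = pd.DataFrame(
    {
        "schema": ["Person", "Person", "Company", "Company", "Company"],
        "country": ["ru", "ru", "ru", "cy", "ru"],
        "id": ["ex-1", "ex-2", "ex-3", "ex-3", "ex-4"],
    }
)

# agg(): built-in summaries per group, using named aggregation
# (new_column_name=(source_column, aggregation)).
summary = df.groupby("schema").agg(
    n_rows=("id", "count"),
    n_targets=("id", "nunique"),
)
print(summary)

# apply(): run an arbitrary function on each group.
# Here we select one column first, so the function receives a Series per group
# and returns the most frequent country code in that group.
top_country = df.groupby("schema")["country"].apply(
    lambda codes: codes.value_counts().idxmax()
)
print(top_country)
```

As a rule of thumb, agg() is the better fit when a built-in summary exists, while apply() is the fallback for logic that does not fit a single aggregation.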