🗣️ Week 07 Lecture

JSON Normalization & Data Reshaping

Author

Dr Jon Cardoso-Silva

Published

05 March 2025

🥅 Learning Goals

By the end of this lecture, you should be able to: i) Normalize complex JSON data into analysis-ready tables, ii) Handle nested data structures effectively, iii) Apply the Group → Apply → Combine strategy, iv) Prepare data for visualization using tidy data principles.

📍Time and Location: Thursday, 06 March 2025 from 4-6 pm at MAR.1.04

This week’s lecture will focus on techniques for handling complex, nested data structures - particularly JSON data - and transforming them into analysis-ready formats. These skills are essential for real-world data science work where data rarely comes in the perfect format you need.

📓 Interactive Puzzle-Based Learning

This lecture will follow a unique format:

Puzzle-Solve-Learn Cycle: For each data challenge, you will:
- Work in groups of 3-4 to solve a real-world data puzzle using only the concepts and techniques you’ve learned so far
- Share and compare your solutions with other groups
- Learn a powerful new pandas technique that elegantly solves the problem
Competitive Element: The members of the group whose solutions are best aligned with the DS105 coding philosophy will win Data Science Institute tote bags! 🎁
Hands-On Learning: The focus is on solving problems creatively and getting used to that ambiguous “cozy vs frustrating” feeling that comes with learning new coding techniques. Hopefully, this will help you build intuition for, and a deeper understanding of, the new techniques.

📋 Preparation

Before the lecture

Review the basic pandas operations covered in 🗣️ Week 04 and 🗣️ Week 05
Make sure you can access Nuvolos
Bring your laptop to participate in the interactive puzzles

🎬 Lecture Material

The lecture will be structured around four data puzzles, each introducing a key technique for handling complex data:

📥 Lecture Notebooks

Download the notebooks for today’s lecture:

🧩 The Puzzles

Puzzle 1: “The Split-Apply-Combine Strategy”

Understanding the fundamental pattern for data aggregation
Learning to use groupby(), apply(), and agg() methods
Creating summary statistics by group
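
As a rough illustration of the pattern, here is a minimal sketch on a made-up table. The `plays` DataFrame and its `genre` and `minutes` columns are hypothetical and are not the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy data (not the lecture's dataset)
plays = pd.DataFrame({
    "genre": ["pop", "pop", "rock", "rock", "rock"],
    "minutes": [3.0, 4.0, 5.0, 2.0, 5.0],
})

# Split the rows by genre, apply aggregations to each group,
# and combine the results into one summary table.
summary = (
    plays.groupby("genre")
    .agg(mean_minutes=("minutes", "mean"), n_tracks=("minutes", "count"))
    .reset_index()
)

# apply() accepts any function of a group; here, the range of minutes per genre.
spread = plays.groupby("genre")["minutes"].apply(lambda s: s.max() - s.min())

print(summary)
print(spread)
```

Named aggregation (the `new_name=(column, function)` form) keeps the output column names explicit, which tends to make the summary table easier to read and plot later.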

Puzzle 2: “The Spotify Artist Network”

Working with nested JSON data about artists and their collaborations
Learning to use pd.json_normalize() to flatten nested structures
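
To give a flavour of what flattening looks like, the sketch below runs `pd.json_normalize()` on a small invented structure. The `artists` records and their keys (`stats`, `collaborations`, `with`, `tracks`) are assumptions for illustration and do not reflect the actual Spotify data used in the puzzle.

```python
import pandas as pd

# Hypothetical nested records (not the real Spotify payload)
artists = [
    {
        "name": "Artist A",
        "stats": {"followers": 1000, "popularity": 70},
        "collaborations": [
            {"with": "Artist B", "tracks": 2},
            {"with": "Artist C", "tracks": 1},
        ],
    },
    {
        "name": "Artist B",
        "stats": {"followers": 500, "popularity": 55},
        "collaborations": [{"with": "Artist A", "tracks": 2}],
    },
]

# Nested dictionaries become columns such as stats_followers;
# the list under "collaborations" stays as a single column of lists.
flat = pd.json_normalize(artists, sep="_")

# record_path produces one row per item in the nested list,
# and meta carries the parent artist's name onto each row.
collabs = pd.json_normalize(artists, record_path="collaborations", meta=["name"])

print(flat)
print(collabs)
```

The first call yields one row per artist, while the second yields one row per collaboration, so which form is "tidy" depends on what counts as an observation in your analysis.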

Puzzle 3: “The Netflix Binge”

Handling nested lists within JSON objects
Using DataFrame.explode() to expand list elements into separate rows
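
A minimal sketch of the idea, assuming a hypothetical `shows` table in which each user has a list of watched titles (the names and values are invented, not the puzzle's Netflix data):

```python
import pandas as pd

# Hypothetical data: one row per user, with a list in the "watched" column
shows = pd.DataFrame({
    "user": ["ana", "ben"],
    "watched": [["Show X", "Show Y"], ["Show X"]],
})

# explode() turns each list element into its own row,
# repeating the other columns (here, "user") alongside it.
long = shows.explode("watched").reset_index(drop=True)

print(long)
```

After exploding, each row holds a single user-title pair, which is the shape that `groupby()` and most plotting libraries expect.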

Puzzle 4: “Instagram Analytics”

Transforming multi-level dictionaries with time periods
Using DataFrame.melt() to reshape wide data into long format
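
The wide-to-long reshaping can be sketched as follows. The `raw` dictionary, the account names, and the month keys are hypothetical stand-ins for the multi-level dictionaries in the puzzle.

```python
import pandas as pd

# Hypothetical two-level dictionary: account -> month -> follower count
raw = {
    "acct_a": {"2025-01": 100, "2025-02": 120},
    "acct_b": {"2025-01": 80, "2025-02": 95},
}

# Wide format: one row per account, one column per month
wide = (
    pd.DataFrame.from_dict(raw, orient="index")
    .rename_axis("account")
    .reset_index()
)

# Long (tidy) format: one row per account-month observation
long = wide.melt(id_vars="account", var_name="month", value_name="followers")

print(wide)
print(long)
```

In the long format the time period is a value in a `month` column rather than a column name, which is usually what visualization tools need in order to map it to an axis.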

📋 TAKE NOTE:

For each puzzle, we’ll start with the raw data structure and work toward a tidy, analysis-ready format
The focus is on understanding the conceptual approach, not just memorizing functions
These techniques will be directly applicable to your coursework and future data science projects

📥 Post-Lecture Actions

Review the Jupyter notebooks from today’s lecture (will be shared after class)
Practice with the sample solutions notebook
Read the Tidy Data paper by Hadley Wickham
Use the #help channel on Slack if you need clarification or help