🗣️ Week 07 Lecture

JSON Normalization & Data Reshaping

Author: Dr Jon Cardoso-Silva

Published: 05 March 2025

🥅 Learning Goals
By the end of this lecture, you should be able to:

  • Normalize complex JSON data into analysis-ready tables
  • Handle nested data structures effectively
  • Apply the Group → Apply → Combine strategy
  • Prepare data for visualization using tidy data principles

📍 Time and Location: Thursday, 06 March 2025, 4-6 pm, MAR.1.04

This week’s lecture focuses on techniques for handling complex, nested data structures, particularly JSON data, and transforming them into analysis-ready formats. These skills are essential for real-world data science work, where data rarely arrives in the format you need.

📓 Interactive Puzzle-Based Learning

This lecture will follow a unique format:

  1. Puzzle-Solve-Learn Cycle: For each data challenge, you will:

    • Work in groups of 3-4 to solve a real-world data puzzle using only the concepts and techniques you’ve learned so far
    • Share and compare your solutions with other groups
    • Learn a powerful new pandas technique that elegantly solves the problem
  2. Competitive Element: The members of the group whose solutions are best and most aligned with the DS105 coding philosophy will win Data Science Institute tote bags! 🎁

  3. Hands-On Learning: The focus is on solving problems creatively and on getting used to that ambiguous “cozy vs frustrating” feeling that comes with learning new coding techniques. Hopefully, this will help you build intuition for, and a deeper understanding of, the new techniques.


📋 Preparation

Before the lecture

  • Review the basic pandas operations covered in 🗣️ Week 04 and 🗣️ Week 05
  • Make sure you can access Nuvolos
  • Bring your laptop to participate in the interactive puzzles

🎬 Lecture Material

The lecture is structured around four data puzzles, each introducing a key technique for handling complex data.

📥 Lecture Notebooks

Download the notebooks for today’s lecture:

🧩 The Puzzles

Puzzle 1: “The Split-Apply-Combine Strategy”

  • Understanding the fundamental pattern for data aggregation
  • Learning to use groupby(), apply(), and agg() methods
  • Creating summary statistics by group
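
As a rough illustration of the pattern named above, the sketch below splits a small table by group, applies summary functions, and combines the results. The dataset, the column names (`genre`, `streams`), and the output column names are invented for this example and are not taken from the lecture notebooks.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "genre": ["pop", "pop", "rock", "rock", "rock"],
    "streams": [100, 300, 50, 150, 100],
})

# Split by genre, apply two aggregations, combine into one summary table.
# Named aggregation: output_column=(input_column, function)
summary = df.groupby("genre").agg(
    total_streams=("streams", "sum"),
    mean_streams=("streams", "mean"),
)

# apply() runs an arbitrary function on each group's values;
# here, the range (max minus min) of streams within each genre
spread = df.groupby("genre")["streams"].apply(lambda s: s.max() - s.min())

print(summary)
print(spread)
```

In this sketch, `agg()` covers the common case of standard summary statistics, while `apply()` is kept for logic that does not map onto a built-in aggregation.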

Puzzle 2: “The Spotify Artist Network”

  • Working with nested JSON data about artists and their collaborations
  • Learning to use pd.json_normalize() to flatten nested structures
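
A minimal sketch of what flattening with `pd.json_normalize()` might look like. The records below are made up for illustration (they are not the actual Spotify data used in the puzzle), and the keys `stats`, `collaborations`, `with`, and `track` are assumptions.

```python
import pandas as pd

# Hypothetical records, invented for illustration only
artists = [
    {
        "name": "Artist A",
        "stats": {"followers": 1000, "popularity": 80},
        "collaborations": [
            {"with": "Artist B", "track": "Song 1"},
            {"with": "Artist C", "track": "Song 2"},
        ],
    },
    {
        "name": "Artist B",
        "stats": {"followers": 500, "popularity": 60},
        "collaborations": [{"with": "Artist A", "track": "Song 1"}],
    },
]

# Nested dictionaries become dotted column names such as "stats.followers";
# lists of dictionaries are left as-is in a single column
flat = pd.json_normalize(artists)

# record_path picks the nested list to turn into rows (one per collaboration);
# meta carries parent-level fields along with each row
collabs = pd.json_normalize(artists, record_path="collaborations", meta=["name"])

print(flat.columns.tolist())
print(collabs)
```

The second call is one possible way to obtain an edge list (artist, collaborator, track), which is a convenient shape for network-style analysis.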

Puzzle 3: “The Netflix Binge”

  • Handling nested lists within JSON objects
  • Using DataFrame.explode() to expand list elements into separate rows
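
To illustrate the idea, here is a small sketch of `DataFrame.explode()` on a column that holds lists. The show titles and the `genres` column are hypothetical and do not come from the puzzle's dataset.

```python
import pandas as pd

# Hypothetical data: each row stores a list of genres
shows = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "genres": [["drama", "crime"], ["comedy"]],
})

# explode() creates one row per list element and repeats the other columns;
# the original index is repeated too, so we reset it afterwards
long_shows = shows.explode("genres").reset_index(drop=True)

print(long_shows)
```

After exploding, each row holds a single (title, genre) pair, so counting shows per genre becomes an ordinary `groupby()` operation.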

Puzzle 4: “Instagram Analytics”

  • Transforming multi-level dictionaries with time periods
  • Using DataFrame.melt() to reshape wide data into long format
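
The reshaping described above could be sketched as follows. The nested dictionary, the account names, and the column names `account`, `month`, and `likes` are all assumptions made for this example, not the real Instagram data.

```python
import pandas as pd

# Hypothetical nested dictionary: account -> time period -> likes
metrics = {
    "acct_1": {"2025-01": 10, "2025-02": 15},
    "acct_2": {"2025-01": 20, "2025-02": 25},
}

# Outer keys become rows, inner keys become columns (wide format)
wide = (
    pd.DataFrame.from_dict(metrics, orient="index")
    .rename_axis("account")
    .reset_index()
)

# melt() turns the period columns into (month, likes) pairs (long format)
long_metrics = wide.melt(id_vars="account", var_name="month", value_name="likes")

print(wide)
print(long_metrics)
```

The long format has one observation per row (one account in one month), which matches the tidy data layout that most plotting libraries expect.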

📋 TAKE NOTE:

  • For each puzzle, we’ll start with the raw data structure and work toward a tidy, analysis-ready format
  • The focus is on understanding the conceptual approach, not just memorizing functions
  • These techniques will be directly applicable to your coursework and future data science projects

📥 Post-Lecture Actions

  1. Review the Jupyter notebooks from today’s lecture (will be shared after class)
  2. Practice with the sample solutions notebook
  3. Read the Tidy Data paper by Hadley Wickham
  4. Use the #help channel on Slack if you need clarification or help

📚 Additional Resources