🗓️ Week 08 - Pre-processing and grouping data with pandas, a groupby-apply tutorial

Theme: Cleaning and reshaping data

Author

It’s time to finally understand why we kept insisting that you rewrite your regular for loops as list and dict comprehensions! The pandas apply() method works much like a comprehension: you hand it a function, and pandas calls that function on every element of a Series (or on every row or column of a DataFrame) for you. Code written this way is typically more concise than an explicit for loop, and it is often faster and uses less computer memory too. Functions matter just as much when you want to use groupby() to group data.
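To make the comparison concrete, here is a minimal sketch of the same cleaning step written three ways. The titles and the `clean_title` function are made up for illustration; they are not taken from the lecture notebook.

```python
import pandas as pd

# Hypothetical toy data: titles with stray whitespace and mixed case.
titles = ["  The Matrix ", "Alien  ", " Heat"]


def clean_title(title):
    """Strip surrounding whitespace and convert the title to lowercase."""
    return title.strip().lower()


# 1. A regular for loop: build the result one item at a time.
cleaned_loop = []
for title in titles:
    cleaned_loop.append(clean_title(title))

# 2. A list comprehension: the same logic in a single expression.
cleaned_comprehension = [clean_title(title) for title in titles]

# 3. pandas: apply() calls clean_title on every element of the Series.
cleaned_series = pd.Series(titles).apply(clean_title)

print(cleaned_series.tolist())
```

All three versions produce the same cleaned titles. Notice how the comprehension and apply() share the same shape: "call this function on each item".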

To illustrate these concepts, I will use the IMDb Non-Commercial Datasets this week. These datasets are a collection of data files that contain information about movies, actors, actresses, directors, producers, etc. The data is provided in TSV (tab-separated values) format, which is very similar to CSV, except that the values in each row are separated by tabs rather than commas. Look at the section below, where I explain how to download the data.
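Because TSV is so close to CSV, pandas can read it with the usual `read_csv()` function once you tell it that the separator is a tab. The sketch below uses a tiny made-up sample instead of a real IMDb file: the rows and identifiers are invented, and treating `\N` as the missing-value marker is an assumption based on my understanding of the IMDb files, so check it against the data you download.

```python
import io

import pandas as pd

# Made-up sample in tab-separated format. These are NOT real IMDb rows.
sample_tsv = (
    "tconst\tprimaryTitle\tstartYear\n"
    "tt_demo_1\tExample Movie A\t1999\n"
    "tt_demo_2\tExample Movie B\t\\N\n"
)

# sep="\t" tells pandas that the columns are separated by tabs.
# na_values="\\N" (assumption) converts the literal text \N into NaN.
movies = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(movies)
```

With a real file, you would pass the file path in place of `io.StringIO(sample_tsv)`.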

🎯 Learning Objectives

the concept of ‘tidy’ data
using the apply() method of pandas Series and DataFrames to clean data
the notion of anonymous functions (lambda functions)
grouping data with groupby()
using our custom functions with groupby()
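As a small preview of how the last three objectives fit together, here is a rough sketch on a made-up table. The column names, the values and the `rating_range` helper are all hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, one row per movie.
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "genre": ["Drama", "Comedy", "Drama", "Comedy"],
        "rating": [7.0, 6.0, 9.0, 8.0],
    }
)

# A lambda is an anonymous function: handy for short, one-off operations.
movies["rating_out_of_5"] = movies["rating"].apply(lambda r: r / 2)

# groupby() splits the rows by genre, then a built-in aggregation summarises each group.
mean_rating = movies.groupby("genre")["rating"].mean()


def rating_range(ratings):
    """Return the gap between the highest and lowest rating in a group."""
    return ratings.max() - ratings.min()


# A custom function works too: pandas calls it once per group.
rating_spread = movies.groupby("genre")["rating"].apply(rating_range)

print(mean_rating)
print(rating_spread)
```

Both results are Series indexed by genre, with one value per group.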

📚 PREPARATION

To come well prepared for the lecture, clone the following GitHub repository:

🖇️ LINK TO REPOSITORY

📃 Lecture Schedule

📍Location: Thursday 16 November 2023, 4 pm - 6 pm at CKK.1.04

👨‍🏫 Lecture Material

🎥 Looking for lecture recordings? You can only find those on Moodle, typically a day after the lecture. If you can’t find the recordings, please contact 📧 .

Material

This week’s lecture material is available under this dedicated GitHub repository:

🖇️ LINK TO REPOSITORY

Solutions to the exercises and live demos in the Jupyter Notebook of this lecture will NOT be posted here afterwards. We will create these solutions together during the lecture.