Week 08 - Pre-processing and grouping data with pandas, a groupby-apply tutorial
Theme: Cleaning and reshaping data
It's time to finally understand why we kept insisting that you rewrite your regular for loops into list and dict comprehensions instead! It's because using pandas' apply() method is very similar to using comprehensions, and it's typically faster, more concise and uses less computer memory than using for loops. Functions are also relevant when you want to use groupby() to group data.
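To make the parallel concrete, here is a minimal sketch that performs the same clean-up three ways: with a for loop, with a list comprehension, and with apply() on a pandas Series. The titles and the clean_title helper are made up for illustration; they are not part of the IMDb data or the lecture notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
titles = ["  Toy Story ", "heat", "  JUMANJI"]

# 1. Regular for loop: build the result one element at a time
cleaned_loop = []
for title in titles:
    cleaned_loop.append(title.strip().title())

# 2. List comprehension: same logic in a single expression
cleaned_comp = [title.strip().title() for title in titles]

# 3. pandas apply(): pass a function, pandas calls it on every element
def clean_title(title):
    """Hypothetical helper: trim whitespace and use title case."""
    return title.strip().title()

cleaned_series = pd.Series(titles).apply(clean_title)

print(cleaned_series.tolist())
```

In all three versions the transformation is the same; what changes is who writes the loop. With apply(), you only describe what happens to one element.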
To illustrate these concepts, I will use the IMDb Non-Commercial Datasets this week. These datasets are a collection of data files that contain information about movies, actors, actresses, directors, producers, etc. The data is provided in TSV (tab-separated values) format, which is very similar to CSV. Look at the section below, where I explain how to download the data.
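Since TSV only differs from CSV in its separator, pandas can read it with read_csv() and sep="\t". The sketch below uses a tiny in-memory sample with invented column names rather than a real IMDb file. It also assumes that missing values are written as \N, which is how I understand the IMDb documentation; check the files you download before relying on that.

```python
import io
import pandas as pd

# Hypothetical sample that imitates a TSV layout.
# The real IMDb files have their own column names.
sample_tsv = (
    "title_id\ttitle\tyear\n"
    "t1\tFilm A\t1999\n"
    "t2\tFilm B\t\\N\n"
)

# sep="\t" tells pandas the columns are separated by tabs.
# na_values="\\N" (assumption) treats the literal text \N as missing.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(df)
print(df["year"].isna().sum())  # number of missing years
```

To read a file from disk instead, replace io.StringIO(sample_tsv) with the path to the downloaded file.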
🎯 Learning Objectives
- the concept of "tidy" data
- using pandas' apply() method to clean data
- the notion of anonymous functions (lambda functions)
- grouping data with groupby()
- using our custom functions with groupby()
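As a preview of how these objectives fit together, the sketch below groups a small tidy table (one row per film, one column per variable) and summarises each group three ways: with a built-in aggregation, with a lambda function, and with a custom named function. The data, the column names and the share_over_90 helper are all invented for illustration and may differ from what we build in the lecture.

```python
import pandas as pd

# Hypothetical tidy table: each row is one film, each column one variable
films = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Comedy"],
    "runtime_min": [120, 100, 90, 95, 85],
})

runtime_by_genre = films.groupby("genre")["runtime_min"]

# Built-in aggregation: average runtime per genre
mean_runtime = runtime_by_genre.mean()

# Anonymous (lambda) function: spread between longest and shortest film
runtime_range = runtime_by_genre.apply(lambda s: s.max() - s.min())

# Custom named function: share of films longer than 90 minutes
def share_over_90(runtimes):
    """Hypothetical helper: fraction of values above 90."""
    return (runtimes > 90).mean()

long_share = runtime_by_genre.apply(share_over_90)

print(mean_runtime)
print(runtime_range)
print(long_share)
```

A lambda is convenient for a one-line calculation you will not reuse; a named function is easier to read and test once the logic grows.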
PREPARATION
To come well prepared for the lecture, clone the following GitHub repository:
LINK TO REPOSITORY
Lecture Schedule
Location: Thursday 16 November 2023, 4 pm - 6 pm at CKK.1.04
👨‍🏫 Lecture Material
🎥 Looking for lecture recordings? You can only find those on Moodle, typically a day after the lecture. If you can't find the recordings, please contact 📧 .
Material
This week's lecture material is available under this dedicated GitHub repository:
LINK TO REPOSITORY
Solutions to the exercises and live demos in the Jupyter Notebook of this lecture will NOT be posted here afterwards. We will create these solutions together during the lecture.