Pandas Cookbook
Hi everyone! Eli, your course rep for DS105A 2025/2026 Autumn Term, has compiled a lot of useful pandas functions you might want to use in your assignments (Mini Project I and beyond)! I replicate them all below.
DataFrames
All example code will specify which DataFrame is being used. Any amendments to the DataFrame will be outlined in the examples. Assume pandas has been imported as pd for all code.
DataFrame1
df = pd.DataFrame(
{"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
# Output
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
DataFrame2
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan', 'Feb', 'Feb'],
'product_category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Clothing', 'Clothing', 'Clothing', 'Clothing'],
'revenue': [450, 380, 420, 350, 200, 180, 220, 195]
})
# Output
month product_category revenue
0 Jan Electronics 450
1 Jan Electronics 380
2 Feb Electronics 420
3 Feb Electronics 350
4 Jan Clothing 200
5 Jan Clothing 180
6 Feb Clothing 220
7 Feb Clothing 195
DataFrame3
api_response = {
'store': 'London Branch',
'location': {
'city': 'London',
'postcode': 'SW1A 1AA'
},
'sales': [
{
'date': '2024-01-15',
'product': {
'category': 'Electronics',
'name': 'Laptop'
},
'revenue': 450,
'units': 25
},
{
'date': '2024-01-22',
'product': {
'category': 'Electronics',
'name': 'Phone'
},
'revenue': 380,
'units': 18
}
]
}
Key commands to be familiar with
| Command | What it Does | Pattern |
|---|---|---|
| .loc[] | Pandas' label-based indexer that lets you select specific rows AND columns | df.loc[row_selector, column_selector] |
| .groupby() | Splits the DataFrame into groups, applies a calculation to each group, and returns the combined results | df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function(). Common aggregation functions include .mean(), .sum(), .min(), .max(), .count() |
| .dt | Unlocks datetime-specific operations | df['date'].dt.component. Common components include .dt.year, .dt.month, .dt.day, .dt.hour, .dt.day_name(), .dt.month_name() |
| .sort_values() | Reorders the entire DataFrame based on column values | df.sort_values('column_name', ascending=False) |
| pd.to_datetime() | Converts UNIX timestamps (and other formats) to datetime objects | pd.to_datetime(df['unix_column'], unit='s', utc=True) |
| pd.json_normalize() | Flattens nested JSON into DataFrame columns | pd.json_normalize(api_response['key']) |
Building Multiple Conditions with pandas
DataFrame: DataFrame1
What it does: Translates plain Python and/or logic into pandas' element-wise operators (& and |) for filtering DataFrames.
Two conditions using & / |
Example:
The pattern below is the same for both & (and) and | (or) conditions; both return a Series:
df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"]Explanation / Key Points:
This code creates a boolean mask (a True/False value) for each row:
- df["BBB"] < 25 -> True for rows where column BBB is less than 25.
- df["CCC"] >= -40 -> True for rows where column CCC is greater than or equal to -40.
The two conditions must be wrapped in parentheses because & and | bind more tightly than comparison operators like < and >=; without the parentheses, pandas' boolean logic raises an error.
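On a fresh DataFrame1, only rows 0 and 1 satisfy both conditions, so the & version returns:
df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"]
# Output
0    4
1    5
Name: AAA, dtype: int64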
Two conditions with assignment
What it does: Using the same code and logic as above, we can modify the DataFrame if certain conditions are met.
Example:
df.loc[(df["BBB"] < 25) | (df["CCC"] >= -40), "AAA"] = 999Explanation:
For each row where either "BBB" < 25 or "CCC" >= -40, AAA will be replaced with 999.
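On a fresh copy of DataFrame1, rows 0-2 each satisfy at least one of the two conditions, so only row 3 keeps its original AAA value:
print(df)
# Output
   AAA  BBB  CCC
0  999   10  100
1  999   20   50
2  999   30  -30
3    7   40  -50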
Creating New Columns
DataFrame: DataFrame2
What it does: Adds new columns to DataFrames based on calculations / transformations.
Example:
# Add a profit column, then create a new column from a calculation
df['profit'] = [600, 400, 400, 400, 300, 200, 400, 205]
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
When to Use: Extracting datetime components (e.g. df['year'] = df['date'].dt.year), creating calculated fields, adding new categorical variables.
Conditional Assignment
DataFrame: DataFrame1
What it does: Changes values in one column based on conditions in another column.
Example:
df.loc[df.AAA >= 5, "BBB"] = -1
Explanation:
The df.loc command sets BBB to -1 wherever AAA >= 5.
The condition df.AAA >= 5 creates a True/False mask selecting which rows to modify.
The selected rows are set to -1.
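On a fresh copy of DataFrame1 (AAA is 4, 5, 6, 7), the mask is True for rows 1-3, so the result is:
print(df)
# Output
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50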
When to use:
Creating categorical variables (e.g. for air quality: poor/moderate/good). It is a simpler alternative to np.select() when you only have a few conditions (see the sketch below).
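For comparison, here is a minimal np.select() sketch for the air-quality case; the pm25 column and the thresholds are illustrative, not from the course data:
import numpy as np

# Hypothetical air-quality readings; thresholds are for illustration only
df_air = pd.DataFrame({'pm25': [8, 25, 60]})
conditions = [df_air['pm25'] < 10, df_air['pm25'] < 35]
labels = ['good', 'moderate']
df_air['quality'] = np.select(conditions, labels, default='poor')
# df_air['quality'] is now ['good', 'moderate', 'poor']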
Sorting DataFrames
DataFrame: DataFrame2
What it does: Reorders rows based on column values.
Example:
# Sort the entire DataFrame according to a single column (ascending)
df_sorted = df.sort_values('revenue')
# Sort descending
df_sorted = df.sort_values('revenue', ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(['month', 'revenue'], ascending=[True, False])
# Sort by date (essential for time series plots; the 'date' column is created in the Timeseries section below)
df_sorted = df.sort_values('date')
When to use:
- Essential for time series plots, where dates must be in order.
- Finding top/bottom values in context.
- Enhanced readability.
- Preparing data for sequential operations.
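For reference, the descending sort on DataFrame2 would return the following (note that sort_values keeps the original index labels rather than renumbering the rows):
df.sort_values('revenue', ascending=False).head()
# Output
  month product_category  revenue
0   Jan      Electronics      450
2   Feb      Electronics      420
1   Jan      Electronics      380
3   Feb      Electronics      350
6   Feb         Clothing      220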
Grouping
DataFrame: DataFrame2
Basic Grouping
What it does:
Splits DataFrames, applies a function to each group, returns results.
Example:
df.groupby('month')['revenue'].mean()
Explanation:
All df.groupby commands fit the following pattern:
df.groupby('CATEGORY_COLUMN')['VALUE_COLUMN'].AGGREGATION()
#            how to split       what to measure   how to summarize
In this case, the data is split by 'month' and the 'revenue' column is measured, before being summarized by its mean.
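On DataFrame2 this returns the mean revenue per month (the same numbers reappear in the .agg() example below):
# Output
month
Feb    296.25
Jan    302.50
Name: revenue, dtype: float64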
Counting Group Sizes
What it does: Counts the items in each category.
Example:
df.groupby('product_category').size()
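Each product category appears four times in DataFrame2:
# Output
product_category
Clothing       4
Electronics    4
dtype: int64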
Aggregating with multiple functions
What it does: Calculates several statistics at once for each group. Returns a DataFrame with one column per aggregation function.
Example:
df.groupby('month')['revenue'].agg(['mean', 'max', 'min', 'count'])
# Output
mean max min count
month
Feb 296.25 420 195 4
Jan 302.50 450 180 4
When to use: When you need comprehensive summary statistics.
Grouping by Multiple Columns
What it does: Creates nested groups for detailed analysis.
Example:
df.groupby(['month', 'product_category'])['revenue'].mean()
# Output
month product_category
Feb Clothing 207.5
Electronics 385.0
Jan Clothing 190.0
Electronics 415.0
Name: revenue, dtype: float64
When to use: When you need to break down data by multiple dimensions, e.g. the mean revenue for each category per month.
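A useful follow-up (not in the original notes): .unstack() pivots the inner index level into columns, turning the MultiIndex Series above into a month-by-category table:
df.groupby(['month', 'product_category'])['revenue'].mean().unstack()
# Output
product_category  Clothing  Electronics
month
Feb                  207.5        385.0
Jan                  190.0        415.0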
Multiple Column Aggregations
What it does: Calculates statistics for several columns simultaneously.
Example:
# Add units_sold and profit columns
df['units_sold'] = [25, 18, 22, 16, 45, 38, 50, 42]
df['profit'] = [135, 114, 126, 105, 60, 54, 66, 59]
# Generate statistics for multiple columns
df.groupby('month')[['revenue', 'units_sold', 'profit']].mean()
# Output
revenue units_sold profit
month
Feb 296.25 32.5 89.00
Jan 302.50 31.5 90.75
Explanation: Use double brackets to select multiple columns. This returns a DataFrame with one column per metric and one row per month.
When to use: Comparing multiple metrics across groups.
Resetting Index (after groupby)
DataFrame: DataFrame2
What it does: Converts index back to regular columns (useful after groupby).
Example:
# After groupby, month becomes the index
monthly_avg = df.groupby('month')['revenue'].mean()
print(monthly_avg)
# Output:
month
Feb    296.25
Jan    302.50
Name: revenue, dtype: float64
# Reset index to make 'month' a column again
monthly_avg = monthly_avg.reset_index()
print(monthly_avg)
# Output:
  month  revenue
0   Feb   296.25
1   Jan   302.50
When to use:
- After groupby, before creating plots.
- When you need to use the grouped column in further operations.
- Making results easier to work with.
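An alternative worth knowing (not in the original notes): passing as_index=False to groupby keeps the grouping column as a regular column from the start, skipping the reset_index step:
df.groupby('month', as_index=False)['revenue'].mean()
# Output
  month  revenue
0   Feb   296.25
1   Jan   302.50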
Timeseries
DataFrame: DataFrame2
Converting Unix timestamps to Datetime
What it does: Converts Unix timestamps (seconds since 1 January 1970, UTC) into readable pandas datetime objects.
Example:
# Add a date column to the sales data containing UNIX timestamps
df['unix_timestamp'] = [1705276800, 1705881600, 1707264000, 1708387200, 1704844800, 1706400000, 1707782400, 1708905600]
# Convert to datetime objects
df['date'] = pd.to_datetime(df['unix_timestamp'], unit='s', utc=True)
print(df[['unix_timestamp', 'date', 'revenue']].head())
# Output
unix_timestamp date revenue
0 1705276800 2024-01-15 00:00:00+00:00 450
1 1705881600 2024-01-22 00:00:00+00:00 380
2 1707264000 2024-02-07 00:00:00+00:00 420
3 1708387200 2024-02-20 00:00:00+00:00 350
4 1704844800 2024-01-10 00:00:00+00:00 200
Explanation:
- unit='s' -> the timestamps are in seconds (the most common case). Other options include unit='ms' for milliseconds and unit='us' for microseconds (check the API documentation for your data source).
- utc=True -> tells pandas that the timestamps are in the UTC timezone.
Datetime Accessors
What it does: Extracts specific components from datetime columns.
# Now, we can extract year/month/day
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_name'] = df['date'].dt.day_name()
print(df[['year', 'month', 'day_name']].head())
# Output
year month day_name
0 2024 1 Monday
1 2024 1 Monday
2 2024 2 Wednesday
3 2024 2 Tuesday
4 2024 1 Wednesday
Using Datetime Accessors with groupby
Example
# Group sales by day of the week
df.groupby(df['date'].dt.day_name())['revenue'].mean()
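With the date column created in the Unix-timestamp example above, the eight rows fall on four weekdays (note that groupby sorts the day names alphabetically, not in calendar order):
# Output
date
Monday       341.666667
Sunday       180.000000
Tuesday      285.000000
Wednesday    310.000000
Name: revenue, dtype: float64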
Data In / Out Operations
DataFrame: DataFrame2
Reading and Writing CSV Files
What it does: Saves DataFrames to CSV files and loads them back.
Example: Writing CSV
# Save sales data to a CSV file in a 'data' folder.
df.to_csv('data/sales_data.csv', index=False)
Explanation: The index=False prevents pandas from saving row numbers as a column.
Example: Reading CSV
# Load CSV back into DataFrame
df_loaded = pd.read_csv('data/sales_data.csv')
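One caveat (not in the original notes): read_csv loads datetime columns back as plain strings, so re-parse them on the way in with the parse_dates parameter:
df_loaded = pd.read_csv('data/sales_data.csv', parse_dates=['date'])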
Example: Save only specific columns
df.to_csv('data/revenue_only.csv', columns=['date', 'revenue'], index=False)
JSON Normalization
DataFrame: DataFrame3
What it does: Flattens nested JSON structures (e.g. API responses) into DataFrame columns.
Example
# Print original nested structure of the API response (DataFrame3)
print(api_response['sales'][0])
# Output
{'date': '2024-01-15', 'product': {'category': 'Electronics', 'name': 'Laptop'}, 'revenue': 450, 'units': 25}
# Use json_normalize to flatten the nested data
df = pd.json_normalize(api_response['sales'])
print(df)
# Output
date revenue units product.category product.name
0 2024-01-15 450 25 Electronics Laptop
1 2024-01-22 380 18 Electronics Phone
Explanation: pd.json_normalize() transforms:
1. Nested keys into column names with dots, e.g. product['category'] becomes product.category.
2. Lists of dictionaries into DataFrame rows.
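If you also need top-level fields such as the store name, pd.json_normalize() accepts record_path and meta arguments; a minimal sketch using the same api_response (this goes beyond the original notes):
df = pd.json_normalize(api_response, record_path='sales', meta=['store', ['location', 'city']])
# Adds 'store' and 'location.city' columns alongside the flattened sales records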
Renaming Columns
DataFrame: DataFrame3
What it does: Changes column names to remove nested prefixes / ensure clarity.
Example:
# After json_normalize, two columns are prefaced by 'product.' e.g. 'product.category'. These can be removed:
df.columns = df.columns.str.replace('product.', '', regex=False)  # regex=False so the '.' is matched literally
# Rename specific columns
df = df.rename(columns={'old_name': 'new_name', 'another_old': 'another_new'})
# Rename multiple columns at once
df.columns = ['date', 'income', 'amount', 'category', 'name'] # This must match the number of columns!
Explanation:
df.columns.str.replace('being_replaced', 'replacing_with')
When to use:
- Cleaning up after json_normalize.
- Making column names more readable.
- Following naming conventions.
Selecting Columns
DataFrame: DataFrame2
What it does: Extracts specific columns from a DataFrame.
Example:
# Single column (returns Series)
revenue = df['revenue']
# Multiple columns (returns DataFrame)
subset = df[['month', 'revenue']]
# Drop columns you don't need
df_clean = df.drop(columns=['unix_timestamp', 'extra_column'])
When to use:
- After json_normalize() to keep only relevant data.
- Creating focused analysis datasets.
- Removing unnecessary columns before saving a CSV.
