DS105A – Data for Data Science
🗓️ 30 Oct 2025
Today's topics:

- The .apply() method
- The plot_df pattern for seaborn visualisation

Why this matters: These skills directly support your ✍️ Mini-Project 1 work.

In the 💻 W04 Lab, you explored nested np.where() when classifying weather attributes like temperature and rainfall.
Today, we’ll solve that same problem with a different (cleaner) approach: custom functions and the .apply() method.
The task was to classify weather based on temperature and rainfall into the following categories:
| Category | Description |
|---|---|
| Hot & Dry | temperature > 25°C and rainfall < 1mm |
| Hot & Wet | temperature > 25°C and rainfall >= 1mm |
| Mild & Dry | temperature in 20-25°C and rainfall < 1mm |
| Mild & Wet | temperature in 20-25°C and rainfall >= 1mm |
| Cool | temperature < 20°C and rainfall any |
What if, instead of using nested np.where(), I could more naturally just say:
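Something like this (the function name `classify_weather` is just an illustration, not code from the lab):

```python
classify_weather(temperature=28, rainfall=0.5)
```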
and get this as a response:
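```python
'Hot & Dry'
```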
(a single normal string)
Such that I could apply this to every combination of temperature and rainfall I have in my dataset?
We don’t do for loops anymore 🙃
Back in the days when we used for loops and separate lists/arrays, this would look something like this:
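A sketch of that style, still assuming the hypothetical classify_weather() from above:

```python
# Separate lists, one per attribute
temperatures = [28, 22, 26]
rainfalls = [0.5, 3.0, 0.2]

categories = []
for temp, rain in zip(temperatures, rainfalls):
    categories.append(classify_weather(temperature=temp, rainfall=rain))
```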
Nice!
If we had such a way to classify weather, we could use pandas to apply it to every row in our dataset in a single line of code (instead of a for loop).
A function is a reusable block of code that takes inputs and produces an output. You can invoke it by calling its name with the appropriate inputs.
How to define a function:
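Here is a sketch of what such a function could look like (the name and thresholds follow the classification table above; the exact lecture code may differ):

```python
def classify_weather(temperature, rainfall):
    """Classify weather based on temperature (°C) and rainfall (mm)."""
    if temperature > 25 and rainfall < 1:
        return "Hot & Dry"
    elif temperature > 25:
        return "Hot & Wet"
    elif temperature >= 20 and rainfall < 1:
        return "Mild & Dry"
    elif temperature >= 20:
        return "Mild & Wet"
    else:
        return "Cool"
```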
Key components:
- def: defines the function
- return: produces the output

You always test a function on single values first.
Why test first? Easier to debug a function than nested np.where()!
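For example, try it on values where you already know the right answer:

```python
classify_weather(temperature=28, rainfall=0.5)   # expect 'Hot & Dry'
classify_weather(temperature=22, rainfall=3.0)   # expect 'Mild & Wet'
classify_weather(temperature=15, rainfall=0.0)   # expect 'Cool'
```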
The .apply() method in pandas allows you to apply a function to every element in a Series.
It works kind of like a for loop, but cleaner and more efficient.
It looks like this:
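For instance, with a simple single-argument function (is_hot is illustrative here, matching the column shown in the table below):

```python
def is_hot(temperature):
    return temperature > 25

df['temperature'].apply(is_hot)
```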
The output is a new pandas Series with the same index as the original Series.
That is, something like this:
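Roughly like this (assuming a default integer index):

```text
0     True
1    False
2     True
Name: temperature, dtype: bool
```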
If you assign the output of the .apply() method to a new column in the DataFrame…
using the = operator:
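```python
df['is_hot'] = df['temperature'].apply(is_hot)
```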
alternatively, you can use the .assign() method:
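```python
df = df.assign(is_hot=df['temperature'].apply(is_hot))
```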
| date | temperature | is_hot |
|---|---|---|
| 2024-08-15 | 28 | True |
| 2024-08-16 | 22 | False |
| 2024-08-17 | 26 | True |
Last week, we talked about code that looked like this:
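```python
df[df['temperature'] > 25]
```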
That is, you create a boolean array using a logical condition and then use it to filter the DataFrame.
By the way, sometimes I find it cleaner to split this into two steps:
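For example:

```python
is_hot_day = df['temperature'] > 25   # step 1: build the boolean mask
df[is_hot_day]                        # step 2: use it to filter the rows
```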
It makes it easier to read and debug.
You can also use .apply() to filter data.
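Using the illustrative is_hot() function from before:

```python
df[df['temperature'].apply(is_hot)]
```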
This is equivalent to the code we saw last week.
In this particular case, I think the first approach is easier to read and debug: df[df['temperature'] > 25].
This is because greater than (>) is a simple logical operation that is already vectorised and implemented in the pandas (and numpy) library.
Check the pandas documentation to see if the operation you want to perform is already vectorised.

When you do df[column].apply(function), you are applying the function to every element in the pandas Series.
But if you do df.apply(function), you are applying the function to each dimension (row or column) in the DataFrame.
You can specify an axis argument to control which dimension you want to apply the function to.
axis=0 means “down the rows” (column-wise) and axis=1 means “across columns” (row-wise).
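A tiny illustration of the difference (just to show the axis argument, using the built-in max function):

```python
df[['temperature', 'rainfall']].apply(max, axis=0)   # max of each column
df[['temperature', 'rainfall']].apply(max, axis=1)   # max within each row
```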
Sometimes you just want a quick, inline function for a one-liner. Use lambda.
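For example, the is_hot check as a one-liner:

```python
df['temperature'].apply(lambda x: x > 25)
```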
You can also combine with .assign() for method chaining:
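A sketch of that style (here the lambda receives the whole DataFrame):

```python
df = df.assign(is_hot=lambda d: d['temperature'] > 25)
```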
When logic grows complex, prefer a named def function for readability and testing.
Nested np.where() (W04 Lab):
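A reconstruction of the nested approach, following the classification table above (the exact lab code may differ):

```python
import numpy as np

df['category'] = np.where(
    df['temperature'] > 25,
    np.where(df['rainfall'] < 1, 'Hot & Dry', 'Hot & Wet'),
    np.where(
        df['temperature'] >= 20,
        np.where(df['rainfall'] < 1, 'Mild & Dry', 'Mild & Wet'),
        'Cool'
    )
)
```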
Function + .apply() (Clean):
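One possible version; note the single row parameter, which is why we need axis=1:

```python
def classify_weather(row):
    # `row` is a pandas Series holding one row of the DataFrame
    if row['temperature'] > 25 and row['rainfall'] < 1:
        return 'Hot & Dry'
    elif row['temperature'] > 25:
        return 'Hot & Wet'
    elif row['temperature'] >= 20 and row['rainfall'] < 1:
        return 'Mild & Dry'
    elif row['temperature'] >= 20:
        return 'Mild & Wet'
    else:
        return 'Cool'

df['category'] = df.apply(classify_weather, axis=1)
```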
💭 Note: I used row as the parameter rather than the individual columns.
Extract functions when the logic is complex (e.g. multiple if-elif-else statements).
Use built-in operations when the operation is simple and already vectorised in pandas and numpy.
You might need to use custom functions (def statements) and .apply() in your ✍️ Mini-Project 1, either to filter data based on complex logic or to create classification labels.

(A short intro)
To answer questions like “Is London’s air getting better or worse?” you need to work with dates and times as proper datetime objects.
APIs typically return timestamps as Unix epoch (seconds since 1970):
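For example (an illustrative value):

```python
ts = 1633046400   # seconds since 1 January 1970 (UTC)
```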
Convert to datetime:
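```python
import pandas as pd

pd.to_datetime(ts, unit='s', utc=True)
```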
Now you get readable dates:
2021-10-01 00:00:00+00:00
Once you have datetime objects, you have superpowers!
You can extract components of the datetime object using the .dt accessor:
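For example, assuming df['date'] is already a datetime column:

```python
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek   # Monday=0, Sunday=6
```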
Before:
| date |
|---|
| 2024-08-15 |
| 2024-08-16 |
| 2024-08-17 |
After:
| date | year | month | day | dayofweek |
|---|---|---|---|---|
| 2024-08-15 | 2024 | 8 | 15 | 3 (Thursday) |
| 2024-08-16 | 2024 | 8 | 16 | 4 (Friday) |
| 2024-08-17 | 2024 | 8 | 17 | 5 (Saturday) |
I really like this RealPython tutorial Using Python datetime to Work With Dates and Times. Give it a read!

Most of the features of Python's built-in datetime module are also available in the pandas library.
This pandas documentation page is also a good resource.



After the break:
- groupby() method: the split-apply-combine strategy
- plot_df pattern: prepare data, then visualise
Very often, we need to calculate summary statistics for groups of data instead of for the entire dataset.
For example, you might want to calculate the average temperature for each month in a year.
The pandas library provides a method called groupby() to help you do precisely this:
Before (raw data):
| date | year | temperature |
|---|---|---|
| 2021-01-15 | 2021 | 5 |
| 2021-06-15 | 2021 | 22 |
| 2022-01-15 | 2022 | 6 |
| 2022-06-15 | 2022 | 24 |
What pandas will do:
- Split the rows into groups based on the year column.
- Calculate the mean for the entire temperature column for each year.

After:
| year | temperature |
|---|---|
| 2021 | 13.5 |
| 2022 | 15.0 |
Basic pattern:
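Using the example above:

```python
df.groupby('year')['temperature'].mean()
```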
Common aggregation functions:
- .mean() - average
- .median() - middle value
- .sum() - total
- .max() - maximum
- .min() - minimum
- .count() - number of items

When chaining multiple operations, split them across lines:
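A sketch of the style we mean:

```python
summary = (
    df
    .groupby('year')['temperature']
    .mean()
    .reset_index()
)
```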
Why? Each operation is on its own line, making the transformation clear and debuggable.
Alternative (harder to read):
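```python
summary = df.groupby('year')['temperature'].mean().reset_index()
```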
or, say:
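One guess at another hard-to-scan style, using backslash line continuations:

```python
summary = df.groupby('year')['temperature'] \
            .mean() \
            .reset_index()
```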
Here is an example of grouping by (year, month) combination:
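This assumes the year and month columns were already extracted with the .dt accessor:

```python
monthly_avg = (
    df
    .groupby(['year', 'month'])['temperature']
    .mean()
    .reset_index()
)
```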

At this stage in the course, we only expect you to create simple line and bar plots, but we do have expectations for how you prepare your data (the plot_df pattern) and how you title your charts (narrative titles).
Provided you have a DataFrame plot_df which contains the columns date and rain_mm, this is the line that would produce a simple line plot:
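```python
# assuming seaborn was imported as sns
sns.lineplot(data=plot_df, x='date', y='rain_mm')
```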
🙃 Can you see why this plot is alright but not GREAT (as we want it to be in DS105)?
There’s the imports and config…
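Typically something like this (your config may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")   # a common default theme
```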
And the actual code to create the plot:
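One possible version (the title and labels here are illustrative):

```python
fig, ax = plt.subplots(figsize=(10, 4))
sns.lineplot(data=plot_df, x='date', y='rain_mm', ax=ax)
ax.set_title("Rainfall over time")   # descriptive, not yet a narrative title
ax.set_xlabel("Date")
ax.set_ylabel("Rainfall (mm)")
plt.show()
```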
In this course, seaborn is our default because it forces you to think about your data as a table that has everything you need. You decide your variables (columns), build the table, then map columns to aesthetics.
The plot_df pattern. Let’s agree on a convention 🤝:
We will always create a plot_df first with all the columns we need for the visualisation and only then call matplotlib/seaborn.
Why? This way you can check if plot_df is correct before calling seaborn.
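For example, a sketch assuming year and month columns derived from the date column:

```python
plot_df = (
    df
    .groupby(['year', 'month'])['rain_mm']
    .mean()
    .reset_index()
)

plot_df.head()   # sanity-check the table before plotting
```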
A plot from that plot_df would look like this:
This one shows an average value over time (aggregated by year+month).
We won’t memorise all seaborn syntax. Instead, know the main plot types and what they are for, and look up the details when you need them.
Plot types:
- sns.barplot(): Compare categories
- sns.lineplot(): Show trends over time
- sns.scatterplot(): Relationships between variables
- sns.boxplot(): Distributions within groups
- sns.heatmap(): Patterns across two dimensions

💡 For Mini-Project: You’ll have to create 2 insights. Choose plot types based on what pattern you’re showing.
Bad title: “Temperature by Month”
Good title: “London summers are getting hotter”
The difference: Good titles state your finding, not describe what’s in the chart.
For Mini-Project 1: Each of your 2 insights needs a narrative title that communicates your analytical conclusion.
Resources:
- #help on Slack

Looking ahead: Week 06 (Reading Week) is focus time for Mini-Project 1 completion.
LSE DS105A (2025/26)