Solutions to W04 Formative and analysis of your submissions
2023/24 Autumn Term
This page contains an analysis of your submissions for the W04 Formative.
More than showing possible solutions to the problem set, I also want to show you how I would run an in-depth data-driven analysis of your submissions. Here I treat the files you submitted as data, and I use some of the tools we have learned in class to analyse them. This is also an example of the kind of analysis you could do in your own future notebooks.
I will use some more advanced Terminal scripting and pandas functions we haven't seen in class yet, but it might be useful to you nonetheless.
Submission statistics
| Enrolled in DS105A | Accepted Assignment via GitHub | Submitted Formative | % |
|---|---|---|---|
| 67 | 62 | 48 | 71.6% |
Number of late submissions: \(\frac{2}{48} \approx 4\%\).
Overview
Number of submissions
After downloading all of your submissions to a local folder using Git commands, I checked the number of people who accepted the W04 formative assignment:
ls . | wc -l
62
(The pipe | symbol is used in the Terminal to pass the output of one command as input to another command. The wc command counts the number of lines, words, and characters of any text passed to it.)
Great! There are 62 GitHub repos.
Now I am curious about your Jupyter Notebooks. How many of you submitted a Jupyter Notebook?
Number of Jupyter Notebooks
To solve this, I ran the advanced script below, written in awk (a programming language useful for the terminal), to create a CSV file. This script extracts your unique IDs (equivalent to your GitHub usernames) and the names of all Jupyter notebooks (ipynb files) found in your GitHub repositories, separates them by commas, and saves the text to userids_and_notebooks.csv:
ls -lth */*.ipynb | awk '{ split($0, a, "/"); split(a[1], b, "ds105a-2023-w04-formative-"); print("\"", b[2], "\",\"", a[2], "\"") } ' > userids_and_notebooks.csv
The * symbol is a wildcard that denotes any text. That is, the first */ indicates that I want to match any directory, and the second *.ipynb indicates that I want to match any file that ends with .ipynb.
💡 If you read our A tutorial on HTML, CSS & principles of web scraping notebook in the Week 04 Appendix page deeply, you might find that this wildcard is also present in CSS selectors!
I then used pandas to read the CSV file and convert it to a DataFrame object, a data structure that is very useful for data analysis (see the Week 04 lecture).
My original CSV file didn't have a header, so I added the column names username and filename. Then I used the function strip() to get rid of the whitespace in all the columns.
import pandas as pd
# header=None because the CSV file has no header row
df = pd.read_csv("userids_and_notebooks.csv", header=None)
df.columns = ["username", "filename"]
df["username"] = df["username"].str.strip()
df["filename"] = df["filename"].str.strip()
How many lines are in this DataFrame?
print("Number of rows in the DataFrame:", len(df))
Number of rows in the DataFrame: 58
Oh! So not everyone submitted a Jupyter Notebook. There are fewer rows than the number of people who accepted the assignment.
Number of unique usernames
Let me check how many unique usernames there are in the DataFrame. pandas has a useful method called unique() that returns the unique values in a column.
print("Number of people who added a Jupyter Notebook to their GitHub repository:", len(df["username"].unique()))
Number of people who added a Jupyter Notebook to their GitHub repository: 48
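As a side note: pandas also offers Series.nunique(), which returns the count of unique values directly, so you don't need the len(...unique()) combination. A minimal sketch with made-up usernames (not the real submission data):

```python
import pandas as pd

# Made-up usernames, just to illustrate
df = pd.DataFrame({"username": ["ana", "ben", "ana", "cara"]})

# len(unique()) and nunique() agree; nunique() is just more direct
print(len(df["username"].unique()))  # 3
print(df["username"].nunique())      # 3
```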
All of this tells me that:
- 62 people accepted the assignment
- But only 48 people added a Jupyter Notebook to their GitHub repository! The remaining 14 probably stopped at the first step of the assignment or forgot to add a Jupyter Notebook to their GitHub repository.
- In total I found 58 Jupyter Notebooks, which means that some of the 48 people added more than one Jupyter Notebook to their GitHub repository.
How can I find out? I can use groupby() to group the rows by username and then count the number of rows in each group. (We will learn about groupby() and value_counts() in the future, after Reading Week.)
print("Some of you had more than one Jupyter Notebook in your GitHub repository")
df.groupby(['username']).count()['filename'].value_counts()
Some of you had more than one Jupyter Notebook in your GitHub repository
1 40
2 8
Name: filename, dtype: int64
How many people adhered to the filename convention?
If I invert the order of the columns, I can see a list of most common filenames of the Jupyter Notebooks you submitted. I can also see how many of you used the filename convention I suggested in the assignment.
df.groupby(['filename']).count()['username'].sort_values(ascending=False)
The majority! Some of you added extra spaces, others preferred to separate words with underscores or dashes. All of these are fine. As long as you are consistent with your naming convention, you will be fine.
Notebooks with names like Untitled-1.ipynb are less ideal, though! I also noticed that some of you identified yourself in the filename. While this is fine when working on a real project, for the purpose of the individual assignments in this course, it is better to keep your identity anonymous.
filename
NB01 - Initial Data Analysis.ipynb              37
Untitled-1.ipynb                                 4   * not ideal. Most likely, this is an empty notebook that you created by mistake
NB01 - Analysis of Family Tree.ipynb             3
NB01_Initial_Data_Analysis.ipynb                 2
XXXXX - Initial Data Analysis.ipynb              1   * XXXXX represents the candidate number (not part of the requirement on this assignment)
Initial Data Analysis.ipynb                      1
LSE_DS105A_W04_lecture.ipynb                     1   * you probably used it as a template. Fine, but remember to remove it from GitHub before the final push
NB01 - Initial Data Analysis 1.ipynb             1   * the 1 is not necessary
NB01 - Initial Data Analysis LATEST .ipynb       1   * LATEST is not necessary if you use GitHub history to track your changes
NB01 - Initial Data Analysis".ipynb              1   * not sure what happened here
NB01 -Initial-Data-Analysis.ipynb                1
NB01- Initial Data Analysis.ipynb                1
NB1                                              1
formative1.ipynb                                 1
jupiterXXXXXX.ipynb                              1   * XXXXXX had your username. Not ideal, as you are no longer anonymous
Name: username, dtype: int64
Solutions & Common mistakes
Part 1: Create synthetic family data (10 fake marks)
Part 1 was straightforward. All you needed to do was copy and paste the Python code and run it to produce a JSON file.
Part 2: Add a notebook + initial setup (10 fake marks)
Part 2 instructions were to create a Jupyter Notebook and add some initial setup in a very precise way:
- The first thing in your Jupyter notebook should be a Markdown cell with your candidate number.
- The following cells would be alternating Markdown and Code cells.
The #1 common mistake I saw here was confusion between Markdown and Code cells.
What is the difference?
Part 3: Answer the following questions (80 fake marks)
This was the core of the assignment, and it required knowledge of Python dictionaries and lists, as well as the ability to use for or while loops to iterate over the data. Inevitably, you would have to search for information on the internet to solve some of the problems. This is fine and encouraged, as it is precisely how developers and data scientists work in the real world.
The #1 common mistake I saw here was that many of you wanted to jump straight to the final solution without breaking the problem down into smaller, more manageable problems. I wrote the lengthy section below to help take you step-by-step through the process of solving challenging problems like this one.
Before answering the questions, read the data
The #1 thing you should do as soon as you acquire a new dataset is to read it. This is a good practice because it allows you to understand the data you are working with and to identify potential problems.
How do I read a JSON file in Python? With the json library:
import json
with open("companies_and_families.json", "r") as file:
    data = json.load(file)
OK, now what is the type of data?
type(data)
dict
It is a pure Python dictionary! This is good, because we know how to work with Python dictionaries. Let's see what keys this dictionary has:
data.keys()
dict_keys(['companies', 'families'])
Now you can start exploring.
Q1: How many companies are in the dataset? (5 fake marks)
You know the data dictionary contains a key called ['companies'], but what type of data is it?
type(data["companies"])
list
It is a simple Python list! You should now know that you can use the len() function to count the number of elements in a list:
len(data["companies"])
Q2: How many family trees are there? (5 fake marks)
Similarly, the solution to Q2 is:
len(data["families"])
Q3: How many unique names appear in the first family tree? (10 fake marks)
This requires a bit more work.
Because the family tree is unique to each one of you (it is generated randomly every time you run the script), I can't give you a solution that works for everyone. Instead, I will show you how I would approach this problem in a way that works for any family tree.
The first thing I'd advise you to do is write down all the steps you need to take to solve this problem. This is a good practice because it helps you break down a complex problem into smaller, more manageable problems:
- Take a look at the first family tree to see what it looks like
- Understand where in the family tree the names are stored
- Find a way to extract the names
- Check if my solution above works for other family trees and adjust it if necessary
- Find a way to loop over ALL family trees with a for loop, all while keeping track of the names you have already seen (the question asks for unique names)
- Check your solutions by inspecting some of the trees manually
Now, let's take baby steps.
1. Take a look at the first family tree to see what it looks like
Remember from the Week 04 lecture how to navigate elements in a nested dictionary or nested list? You can use the square brackets [] to navigate to the element you want. For example, to access the first family tree, you can do:
data["families"][0]
Here I used [0] because what is inside data["families"] is a list, and lists must be accessed by their index. The first element of a list has index 0, the second element has index 1, and so on. If the type of data["families"] was a dictionary, I would have used the key to access the element I want.
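The difference between the two access styles can be seen in a tiny standalone example (toy data, reusing the names from the first family tree):

```python
# A list is accessed by position (0-based index)
partners = ["Maria Moody", "Becky Klein"]
print(partners[0])  # Maria Moody

# A dictionary is accessed by key, not by position
person = {"name": "Maria Moody", "spouse": "Becky Klein"}
print(person["name"])  # Maria Moody
```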
2. Understand where in the family tree the names are stored
The code above returned:
{'Partner 1': {'name': 'Maria Moody',
'work': {'company': 'Diaz Group',
'role': 'Admin Coordinator',
'job_type': 'Admin'},
'spouse': 'Becky Klein'},
'Partner 2': {'name': 'Becky Klein',
'work': {'company': 'Rasmussen PLC',
'role': 'Data Engineering Manager',
'job_type': 'Data Engineer'},
'spouse': 'Maria Moody'}}
This is a dictionary with two keys: Partner 1 and Partner 2. Each of these keys has another dictionary as its value. This is a nested dictionary or, in other words, there is a dictionary inside another dictionary.
3. Find a way to extract the names
The most relevant thing here is that names are stored as values of the name key. So, to access the name of Partner 1, I can do:
data["families"][0]["Partner 1"]["name"]
returning:
'Maria Moody'
In this case, I can easily extract both names by doing:
name1 = data["families"][0]["Partner 1"]["name"]
name2 = data["families"][0]["Partner 2"]["name"]
But you don't want to save each individual name to its own Python variable. The beauty of programming lies in finding ways to automate repetitive tasks.
Using your knowledge of Python dictionaries to automate the solution above
To improve the solution above, you would need to know that ["Partner 1"] and ["Partner 2"] are the only keys in the dictionary, and that you can retrieve the values of all keys in a dictionary using the values() method.
data["families"][0].values()
dict_values([{'name': 'Maria Moody', 'work': {'company': 'Diaz Group', 'role': 'Admin Coordinator', 'job_type': 'Admin'}, 'spouse': 'Becky Klein'},
{'name': 'Becky Klein', 'work': {'company': 'Rasmussen PLC', 'role': 'Data Engineering Manager', 'job_type': 'Data Engineer'}, 'spouse': 'Maria Moody'}])
The dict_values is a fancy way of saying that the output is a list that came from a dictionary. I can convert it to a pure list to avoid confusion:
list(data["families"][0].values())
[{'name': 'Maria Moody', 'work': {'company': 'Diaz Group', 'role': 'Admin Coordinator', 'job_type': 'Admin'}, 'spouse': 'Becky Klein'},
{'name': 'Becky Klein', 'work': {'company': 'Rasmussen PLC', 'role': 'Data Engineering Manager', 'job_type': 'Data Engineer'}, 'spouse': 'Maria Moody'}]
That's simpler.
How then can I make use of it? Well, since this is a list, I know I can iterate over it using a for loop.
A for loop to save the day
# I start with an empty list
# I will later append the names to this list
all_names = []

# I iterate over the list of dictionaries
for person in data["families"][0].values():
    # I append the name of each person to the list
    all_names.append(person["name"])

# I print the list to check if it worked
print(all_names)
['Maria Moody', 'Becky Klein']
4. Check if my solution above works for other family trees and adjust it if necessary
How do I know if the above is all I need to extract names from any family tree? I can check by running the code above for other family trees.
all_names = []

# I chose family 1 instead of family 0 this time
for person in data["families"][1].values():
    all_names.append(person["name"])

print(all_names)
['Cheryl Olson', 'Nancy Moreno']
To make sure this is correct, let me look at the family tree in its original format:
data["families"][1]
{'Partner 1': {'name': 'Cheryl Olson',
'work': None,
'spouse': 'Nancy Moreno',
'children': [{'name': 'Jennifer Mcdonald',
'work': {'company': 'Patel-Gonzalez',
'role': 'Lead Data Scientist',
'job_type': 'Data Scientist'},
'spouse': 'Philip Rodriguez'}]},
'Partner 2': {'name': 'Nancy Moreno',
'work': {'company': 'Fields Inc',
'role': 'Junior Data Engineer',
'job_type': 'Data Engineer'},
'spouse': 'Cheryl Olson',
'children': [{'name': 'Jennifer Mcdonald',
'work': {'company': 'Patel-Gonzalez',
'role': 'Lead Data Scientist',
'job_type': 'Data Scientist'},
'spouse': 'Philip Rodriguez'}]}}
Wait! There's more stuff here. These people have children. I need to count them too.
The question you should ask yourself right now is the following: how do I know when a nested dictionary is over? How do I know when I have reached the end of the tree? This is a very important question that comes up frequently when dealing with JSON as well as HTML files.
The answer is that you need to look at the structure of the data. In this case, I can see that the children are stored as a list of dictionaries. So, I need to iterate over this list and extract the names of the children.
I would go back to my previous cell of code and edit it to look like this:
all_names = []

# I iterate over the list of dictionaries
for person in data["families"][1].values():
    all_names.append(person["name"])

    # I check if the person has children
    if "children" in person.keys():
        # I iterate over the list of children
        for child in person["children"]:
            all_names.append(child["name"])

all_names
['Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Jennifer Mcdonald']
5. Find a way to loop over ALL family trees with a for loop, all while keeping track of the names you have already seen (the question asks for unique names)
The above works for both family trees 0 and 1. But what about the other family trees? What if I have a family tree that has 100 levels of nested dictionaries? Would I have to write 100 if and for loops?
NO! Remember: you should always try to automate repetitive tasks.
How can I rewrite the code above to make it work for any family tree? I could use a while loop instead of a for loop. A while loop keeps running as long as a certain condition holds, and is therefore useful when I don't know in advance how many elements I need to iterate over. In this case, I want to run the loop until I have reached the end of the tree.
all_names = []

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.append(person["name"])

    current_person = person

    # Here I want to write a while loop
    # that runs until I have reached the end of the tree
    # I know I have reached the end of the tree when
    # the person does not have children
    while "children" in current_person.keys():
        for child in current_person["children"]:
            all_names.append(child["name"])

            # Change the current person to the child
            current_person = child

all_names
The above works perfectly when I select family 0 (curr_family = 0), but it only kinda works for family 1. The name of Jennifer Mcdonald appears twice in the list. This is because she is a child of both her parents.
['Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Jennifer Mcdonald']
Before I can proceed, I need to deal with this case of repeated names. One thing I could do is: whenever I am about to append a child's name to the list, check if the name is already in the list. If it is, I don't append it. If it is not, I append it.
if child["name"] not in all_names:
    all_names.append(child["name"])
This would be fine. But there is a better way: I can use a Python set instead of a list. A set is a data structure that only stores unique values. If I try to add a value that is already in the set, it will not be added again. This is exactly what I want.
# all_names is no longer a list
all_names = set()

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.add(person["name"])

    current_person = person

    while "children" in current_person.keys():
        for child in current_person["children"]:
            # Instead of .append(), sets use .add()
            all_names.add(child["name"])
            current_person = child

all_names
which correctly prints out:
{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno'}
The curly braces, {}, are a sign that this is a set, not a list (which would be enclosed in square brackets, []).
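One caveat worth flagging (my addition, beyond what the assignment needed): writing {} on its own creates an empty dictionary, not an empty set, which is why the code above starts with set():

```python
empty_braces = {}  # this is an empty dict...
empty_set = set()  # ...so an empty set needs the set() constructor

print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>
```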
Looking back at this family tree, I also notice that Jennifer has a spouse! I need to add their name to the list too. I can do this by adding another if statement to my code:
all_names = set()

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.add(person["name"])

    current_person = person

    while "children" in current_person.keys():
        for child in current_person["children"]:
            all_names.add(child["name"])

            # Add the spouse's name to the list
            if "spouse" in child.keys():
                all_names.add(child["spouse"])
            current_person = child

all_names
{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Philip Rodriguez'}
Great! Am I done?
Well, you can check by running the code above for different family trees manually (change curr_family to 0, 2, 3, etc.) until you are confident that it works for family trees regardless of how nested they are. (It does.)
Use functions!
Instead of editing the curr_family variable, it would be nice to wrap the code above inside a custom function. In Python, functions are created with the def keyword and provide a way to reuse code. In this example, the perfect function would be one that takes a family tree as input and returns a set with all the names in that family tree.
def get_names_in_family(family_tree):
    all_names = set()

    for person in family_tree.values():
        all_names.add(person["name"])

        current_person = person

        while "children" in current_person.keys():
            for child in current_person["children"]:
                all_names.add(child["name"])

                # Add the spouse's name to the list
                if "spouse" in child.keys():
                    all_names.add(child["spouse"])
                current_person = child

    return all_names
How do I test it? You can just call the function with the family tree you want to test it with:
get_names_in_family(data["families"][0])
{'Becky Klein', 'Maria Moody'}
get_names_in_family(data["families"][1])
{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Philip Rodriguez'}
Q4. Compute the number of unique individuals named per family. (20 fake marks)
If you have the function above, this becomes trivial. You just need to iterate over all family trees and count the number of unique names in each family tree.
for family in data["families"]:
    print(len(get_names_in_family(family)))
Improve the solution above
The solution above is fine because we just asked you to compute, not to save the results. But what if we wanted to save the results? How would you do it?
# I start with an empty list
all_family_sizes = []

# I iterate over all family trees
for family in data["families"]:
    all_family_sizes.append(len(get_names_in_family(family)))
Improve it further with list comprehension
The solution above is fine, but it is a bit verbose: I create an empty list and then append to it one element at a time. Python has a more concise idiom for exactly this pattern: list comprehension.
all_family_sizes = [len(get_names_in_family(family)) for family in data["families"]]
Better yet: use a dictionary to store the results
Comprehensions are not just for lists; you can create dictionaries with them too. Instead of brackets [], you use curly braces {} and, importantly, you need to provide a key and a value for each element in the dictionary.
# Saving the same results above as a dictionary instead
all_family_sizes = {i: len(get_names_in_family(family)) for i, family in enumerate(data["families"])}
If you have never heard of enumerate before, it is a very useful function that returns two things at once: the index of the element and the element itself. In this case, I am using it to get the index of each family tree and save it as the key of the dictionary.
This is very useful for later, when I want to use pandas to create a DataFrame object with the results.
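To make enumerate() concrete, here is a standalone sketch with made-up family sizes (the numbers are invented; in the real code they would come from get_names_in_family()), including the conversion to pandas hinted at above:

```python
import pandas as pd

fake_sizes = [2, 4, 3]  # pretend these were computed from the family trees

# enumerate() yields (index, element) pairs, perfect for building keyed results
all_family_sizes = {i: size for i, size in enumerate(fake_sizes)}
print(all_family_sizes)  # {0: 2, 1: 4, 2: 3}

# A dictionary converts directly to a pandas Series: keys become the index
sizes = pd.Series(all_family_sizes, name="family_size")
print(sizes)
```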
Q5. Compute the number of unique individuals named per company. (20 fake marks)
The reasoning is similar to the two questions above. A good approach is to create a function that goes over the family tree and keeps track not just of the names of people, but also of the companies they work for. I will jump straight to the final solution:
# This function would substitute the get_names_in_family() function above
def get_names_and_companies_in_family(family_tree):
    all_names = set()

    # Let's keep companies as a dictionary where the key is the company name
    all_companies = {}

    for person in family_tree.values():
        all_names.add(person["name"])

        # Does this person have a work entry?
        if "work" in person.keys():
            # Some people have a work entry but it is empty
            if person["work"] is not None:
                # Have I seen this company before?
                if person["work"]["company"] in all_companies.keys():
                    # If yes, I add the person's name to the list of people who work for this company
                    all_companies[person["work"]["company"]].append(person["name"])
                else:
                    # If not, I create a new key with the company name and add the person's name to the list
                    all_companies[person["work"]["company"]] = [person["name"]]

        current_person = person

        while "children" in current_person.keys():
            for child in current_person["children"]:
                all_names.add(child["name"])

                # Same thing I did to the parent
                # This piece of code is repeated and could be further improved
                # by being turned into a separate function
                if "work" in child.keys():
                    if child["work"] is not None:
                        if child["work"]["company"] in all_companies.keys():
                            all_companies[child["work"]["company"]].append(child["name"])
                        else:
                            all_companies[child["work"]["company"]] = [child["name"]]

                if "spouse" in child.keys():
                    all_names.add(child["spouse"])
                current_person = child

    # I am returning TWO objects: a set with all names and a dictionary with all companies
    return all_names, all_companies
See what happens when I run it for the first family tree:
get_names_and_companies_in_family(data["families"][0])
({'Becky Klein', 'Maria Moody'},
 {'Diaz Group': ['Maria Moody'],
  'Rasmussen PLC': ['Becky Klein']})
The object above is a Python tuple. It is a data structure similar to a list, but it is immutable (you can't change it). It is useful when you want to return multiple objects from a function.
How can I get just the names and just the companies information? I can unpack the tuple into two variables:
names, companies = get_names_and_companies_in_family(data["families"][0])
The get_names_and_companies_in_family() function would be a solution for Q3 too.
Solution
Now that I have this function, how do I use it to solve Q5? I can use a for loop to iterate over all companies and count the number of unique names in each company.
# I start with an empty dictionary
# This dictionary will have the company name as key and the number of unique names as value
num_people_per_company = {}

# I iterate over all family trees
for family in data["families"]:
    # I use the function above to get the names and companies in each family tree
    names, companies = get_names_and_companies_in_family(family)

    # I iterate over all companies in the dictionary
    for company in companies.keys():
        # I check if the company is already in the dictionary
        if company in num_people_per_company.keys():
            # If yes, I add the number of unique names to the existing value
            num_people_per_company[company] += len(companies[company])
        else:
            # If not, I create a new key with the company name and add the number of unique names
            num_people_per_company[company] = len(companies[company])

print(num_people_per_company)
which returns something like:
{'Diaz Group': 80, 'Rasmussen PLC': 87, 'Patel-Gonzalez': 97, 'Fields Inc': 83, 'Sampson-Shaw': 80, 'Rivera and Sons': 81, 'Parrish-Cruz': 67, 'Walker, Harris and Johnson': 73, 'Peterson PLC': 74, 'Smith, Harper and Garza': 78, 'Hale, Dunn and Graham': 74, 'Matthews, Kennedy and James': 66, 'Doyle-Olsen': 88, 'Sherman-Washington': 105, 'Ortiz Ltd': 90, 'Rodriguez, Fox and Gaines': 62, 'Taylor LLC': 67, 'Foley-Villanueva': 72, 'Thomas, Gallagher and Vazquez': 92, 'Lester Group': 86}
💡 AN EXERCISE TO THE READER: How would you convert the dictionary above to pandas?
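(If you want to check your answer afterwards: one possible route, sketched here with a shortened, hard-coded version of the dictionary above, is that pd.Series accepts a dictionary directly, with the keys becoming the index.)

```python
import pandas as pd

# A shortened stand-in for the full num_people_per_company dictionary
num_people_per_company = {"Diaz Group": 80, "Rasmussen PLC": 87, "Patel-Gonzalez": 97}

company_sizes = pd.Series(num_people_per_company, name="num_people")
print(company_sizes.sort_values(ascending=False))
```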
Q6. Write a custom function called get_company_size that takes a company name as an argument and returns the number of people working in that company. Show that it works. (40 fake marks)
This solution could go in many different directions. I could simply use the dictionary I created in Q5 to solve this question.
def get_company_size(company_name):
    return num_people_per_company[company_name]
But note that the function assumes that a dictionary named num_people_per_company already exists; it is not created inside the function. This is not ideal because it means the function is not self-contained: it depends on an external variable.
Creating a self-contained function
We would need to adapt the function above to make it self-contained. It is okay to rely on other functions; it is just not a good idea to rely on external variables.
def get_company_size(company_name, data):
    num_people_per_company = {}

    for family in data["families"]:
        # I can reuse the function I created above
        names, companies = get_names_and_companies_in_family(family)

        for company in companies.keys():
            if company in num_people_per_company.keys():
                num_people_per_company[company] += len(companies[company])
            else:
                num_people_per_company[company] = len(companies[company])

    return num_people_per_company[company_name]
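To address the "show that it works" part of Q6, you could call the function on a company name you know exists in your data, e.g. get_company_size("Diaz Group", data). Below is a minimal self-contained sketch: the tiny data dictionary and the simplified helper are my own inventions (partners only, no nested children), just so the example runs on its own.

```python
def get_names_and_companies_in_family(family_tree):
    """Simplified helper: looks at partners only (no nested children)."""
    all_names = set()
    all_companies = {}
    for person in family_tree.values():
        all_names.add(person["name"])
        if person.get("work") is not None:
            company = person["work"]["company"]
            all_companies.setdefault(company, []).append(person["name"])
    return all_names, all_companies

def get_company_size(company_name, data):
    num_people_per_company = {}
    for family in data["families"]:
        _, companies = get_names_and_companies_in_family(family)
        for company, people in companies.items():
            num_people_per_company[company] = num_people_per_company.get(company, 0) + len(people)
    return num_people_per_company[company_name]

# A tiny made-up dataset with two families
data = {"families": [
    {"Partner 1": {"name": "A", "work": {"company": "Acme"}},
     "Partner 2": {"name": "B", "work": None}},
    {"Partner 1": {"name": "C", "work": {"company": "Acme"}},
     "Partner 2": {"name": "D", "work": {"company": "Globex"}}},
]}

print(get_company_size("Acme", data))    # 2
print(get_company_size("Globex", data))  # 1
```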
Deep data-driven analysis
Here, I treat your submissions as data and show how you could use some of the tools we have learned in class to analyse it.
Jupyter Notebooks are plain text files
Here is a little secret: Jupyter notebooks are just text files with a .ipynb extension.
The fact that clicking on an ipynb file in VS Code (or JupyterLab) renders the notebook in a nice format is just a feature of VS Code (or JupyterLab). In other words, just like HTML code is translated into a nice webpage by your browser, Jupyter notebooks are translated into a nice format by VS Code (or JupyterLab).
As you saw in the Week 04 Lecture, we can use bash's head command to print out just the first few lines of a text file. Let's try it with a Jupyter notebook:
!head -n 30 "ds105a-2023-w04-formative-<USERNAME>/NB01 - Initial Data Analysis.ipynb"
which outputs:
{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"id": "3a1b7d8a",
"metadata": {},
"outputs": [],
"source": [
"#setup\n",
"import json\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "af684478",
"metadata": {},
"outputs": [],
Does this look similar to something you have seen before? Yes! This is JSON! Jupyter Notebooks are JSON files.
💡 The only way you will fully understand the rest of this analysis is if you try to replicate what I did with your own Jupyter Notebook. Try repurposing the code below to read your own Jupyter Notebook as a JSON file into a Python dictionary (see also the Week 04 Lecture notebook for an example).
Exploring the metadata in your Jupyter Notebooks
When I load one of your Jupyter Notebooks as a JSON file into Python with the json library, I can see that it contains four different keys: cells, metadata, nbformat, and nbformat_minor.
import json
with open("ds105a-2023-w04-formative-<USERNAME>/NB01 - Initial Data Analysis.ipynb") as f:
    nb_as_json = json.load(f)

print(nb_as_json.keys())
Further exploring this dictionary, I see that the nb_as_json["metadata"] key contains a dictionary with a lot of information about the notebook, such as the version of Python you used to run it.
nb_as_json["metadata"]["language_info"]["version"]
If I go one step further, I can read ALL Jupyter notebooks and summarise the Python version used by all of you.
But how do I read all Jupyter Notebooks at the same time?
(You won't be able to replicate this section because you only have your own notebook.)
At the start of this document you saw that I used the * wildcard in bash to match multiple files. It turns out that I can do the same with Python's glob library.
import glob
# This matches all files that end with .ipynb
path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)
As this returns a list with the paths to all files, I can now iterate over this list using a for loop and load each file as a JSON file.
Putting things together
import json
import glob
python_versions = []
path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)

for path_to_notebook in path_to_all_notebooks:
    with open(path_to_notebook, "r") as file:
        nb_as_json = json.load(file)

    # Some metadata does not have a "version" key, so I need to check if it exists first
    if "version" in nb_as_json["metadata"]["language_info"].keys():
        python_versions.append(nb_as_json["metadata"]["language_info"]["version"])
Finally, I can use pandas to create a DataFrame object and count the number of times each Python version was used.
import pandas as pd
df = pd.DataFrame(python_versions, columns=["python_version"])

df["python_version"].value_counts()
python_version
3.11.4 21
3.11.5 12
3.12.0 6
3.11.6 4
3.10.8 2
3.9.6 2
3.10.7 1
3.11.0 1
3.9.13 1
3.9.18 1
3.9.7 1
dtype: int64
Most of you are using Python 3.11, one of the latest versions, and no one is using any super old version of Python. This is great!
More than just extracting your Python version, I want to read the content of what you wrote in your Jupyter Notebooks. It is now easy to adapt the code above to read all your Jupyter Notebooks as JSON files and store them in a list.
import glob
path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)
all_json_notebooks = []

for path_to_notebook in path_to_all_notebooks:
    if "Untitled" not in path_to_notebook and "lecture" not in path_to_notebook:
        with open(path_to_notebook, "r") as file:
            nb_as_json = json.load(file)

        all_json_notebooks.append(nb_as_json)

print("I just read all ", len(all_json_notebooks), " notebooks.")
I just read all 52 notebooks.
Requirement 1: add your candidate number to the first cell
Now that I have read all your notebooks into Python as a list of dictionaries, I can navigate to the ["cells"] key (a list) of each notebook, take the first cell ([0]) and read its content (["source"]):
# I create an empty list to store it
first_cells = []

# I iterate over all notebooks and append the first cell to the list
for notebook in all_json_notebooks:
    first_cells.append(notebook["cells"][0]["source"])

print("Just checking that I read the first cells of all ", len(first_cells), " notebooks.")
Just checking that I read the first cells of all 52 notebooks.
The length of this list matches the number of notebooks I read, so I know that I have read all the notebooks correctly.
Practising list comprehension
I could make the code above neater and shorter by using list comprehension:
# I achieve the same thing as above in just one line of code
first_cells = [notebook["cells"][0]["source"] for notebook in all_json_notebooks]
What to do right after collecting data in a list?
Just browse through it! See what you can find. This is solid advice for any data analysis.
A good example of someone who followed the instructions to the letter
The first element of my list shows that this person correctly added their candidate number to the first cell:
first_cells[0]
['78182']
Note: the output is a list of size 1. The candidate number is the first (and only) element of this list.
Oops! Someone forgot to add their candidate number
first_cells[3]
['##1. Setup\n',
 '\n',
 'import json\n',
 'import pandas as pd\n',
 'from pprint import pprint\n',
 '\n',
 '\n',
 '## 2. Read the data\n',
 '\n',
 '#file = open("companies_and_families.json", "r")\n',
 '#data = json.load(file)\n',
 '#pprint(data)\n',
 '\n']
This shows that the first cell of this personβs notebook contains code to import libraries and read the data, but no candidate number.
A closer look into this notebook also reveals that all of the code above was put together into a single Code cell¹. The assignment required the information to be split into multiple cells, some Markdown and some Code.
Summarising cell types
I want to know how many of you added a first cell of the type Markdown:
first_cell_type = [notebook["cells"][0]["cell_type"] for notebook in all_json_notebooks]

# Convert the list above to a pandas DataFrame so I can do useful things with it
df = pd.DataFrame(first_cell_type, columns=["first_cell_type"])

df["first_cell_type"].value_counts()
returning:
first_cell_type
markdown 33
code 19
dtype: int64
That said, some of those who created the first cell as a Code cell did add their candidate number as commented Python code, which is fine.
Number of lines
Perhaps the ultimate test to see if you followed the instructions is to count the number of lines in the first cell. If you followed the instructions, the first cell should have only one line.
number_lines_first_cell = [len(notebook["cells"][0]["source"]) for notebook in all_json_notebooks]

df = pd.DataFrame(number_lines_first_cell, columns=["number_lines"])

df["number_lines"].value_counts()
OK! The majority of you did it correctly.
1 43
3 3
13 1
6 1
7 1
0 1
41 1
113 1
Name: number_lines, dtype: int64
I won't go further into the analysis of the following cells; this is already a long document. But I hope you can see how you could use a similar approach if you were to analyse JSON files in the future.
Footnotes
¹ I can further confirm this by running all_json_notebooks[3]["cells"][0]["cell_type"], which returns "code".