🧐 Solutions to W04 Formative and analysis of your submissions

2023/24 Autumn Term

This page contains an analysis of your submissions for the ✏️ W04 Formative.

More than just showing possible solutions to the problem set, I also want to show you how I would run an in-depth, data-driven analysis of your submissions. Here I treat the files you submitted as data, and I use some of the tools we have learned in class to analyse them. This is also an example of the kind of analysis you could do in your future notebooks.

I will use some more advanced Terminal scripting and pandas functions we haven’t seen in class yet, but they might be useful to you nonetheless.

📊 Submission statistics

| Enrolled in DS105A | Accepted Assignment via GitHub | Submitted Formative | % (Submitted / Enrolled) |
|--------------------|--------------------------------|---------------------|--------------------------|
| 67                 | 62                             | 48                  | 71.6%                    |

Number of late submissions: \(\frac{2}{48} \approx 4\%\).

📋 Overview

Number of submissions

I downloaded all of your submissions to a local folder using Git commands. Then I checked the number of people who accepted the W04 formative assignment:

ls . | wc -l
62

(The pipe | symbol is used in the Terminal to pass the output of one command as input to another command. The wc command counts the number of lines, words, and characters of any text passed to it.)

Great! There are 62 GitHub repos.
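(If you wanted to do the same check in Python rather than in the Terminal, a rough equivalent would be the sketch below, assuming you run it from the same folder:)

import os

# Each GitHub repo was downloaded as one folder in the current directory,
# so counting the entries here mirrors `ls . | wc -l`
print(len(os.listdir(".")))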

Now I am curious about your Jupyter Notebooks. How many of you submitted a Jupyter Notebook?

Number of Jupyter Notebooks

To solve this, I ran the advanced script below, written in awk (a programming language useful for the terminal), to create a CSV file. The script extracts your unique IDs (equivalent to your GitHub usernames) and the names of all Jupyter Notebooks (.ipynb files) found in your GitHub repositories, and saves this comma-separated text to userids_and_notebooks.csv

ls -lth */*.ipynb | awk '{ split($0, a, "/"); split(a[1], b, "ds105a-2023-w04-formative-"); print("\"", b[2], "\",\"", a[2], "\"") } ' > userids_and_notebooks.csv
Tip

The * symbol is a wildcard that denotes any text. That is, the first */ indicates that I want to match any directory, and the second *.ipynb indicates that I want to match any file that ends with .ipynb.

💡 If you read our 📚 A tutorial on HTML, CSS & principles of web scraping notebook in the 🔖 Week 04 Appendix page carefully, you might notice that this wildcard also appears in CSS selectors!
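If the awk one-liner looks intimidating, here is a rough Python sketch that produces an equivalent CSV (a sketch only: it assumes the same folder layout, and uses pathlib’s glob() for the wildcard matching):

import csv
from pathlib import Path

with open("userids_and_notebooks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Match every .ipynb file that sits one level inside a submission folder
    for nb in Path(".").glob("*/*.ipynb"):
        # Folder names look like 'ds105a-2023-w04-formative-<username>'
        username = nb.parent.name.replace("ds105a-2023-w04-formative-", "")
        writer.writerow([username, nb.name])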

I then used pandas to read the CSV file and convert it to a DataFrame object, a data structure that is very useful for data analysis (See Week 04 lecture).

My original CSV file didn’t have a header, so I told pandas not to treat the first row as one (header=None) and added the column names myself: username and filename. Then I used the strip() method to get rid of the extra whitespace in both columns.

import pandas as pd

df = pd.read_csv("userids_and_notebooks.csv", header=None)
df.columns = ["username", "filename"]

# Strip the extra whitespace introduced by the awk one-liner
df["username"] = df["username"].str.strip()
df["filename"] = df["filename"].str.strip()

How many rows are in this DataFrame?

print("Number of rows in the DataFrame:", len(df))
Number of rows in the DataFrame: 58

Oh! So not everyone submitted a Jupyter Notebook. There are fewer rows than the number of people who accepted the assignment.

Number of unique usernames

Let me check how many unique usernames there are in the DataFrame. pandas has a useful method called unique() that returns the unique values in a column.

print("Number of people who added a Jupyter Notebooks to their GitHub repository:", len(df["username"].unique()))
Number of people who added a Jupyter Notebooks to their GitHub repository: 48

All of this tells me that:

  • 62 people accepted the assignment
  • But only 48 people added a Jupyter Notebook to their GitHub repository! The remaining 14 probably stopped at the first step of the assignment or forgot to add a Jupyter Notebook to their GitHub repository.
  • In total I found 58 Jupyter Notebooks, which means that some of the 48 people added more than one Jupyter Notebook to their GitHub repository.

How can I find out? I can use groupby() to group the rows by username and then count the number of rows in each group. (We will learn about groupby() and value_counts() in the future, after Reading Week)

print("Some of you had more than one Jupyter Notebook in your GitHub repository πŸ€”")
df.groupby(['username']).count()['filename'].value_counts()
Some of you had more than one Jupyter Notebook in your GitHub repository πŸ€”
1    40
2     8
Name: filename, dtype: int64

How many people adhered to the filename convention?

If I invert the order of the columns, I can see the most common filenames among the Jupyter Notebooks you submitted. I can also see how many of you used the filename convention I suggested in the assignment.

df.groupby(['filename']).count()['username'].sort_values(ascending=False)

The majority! Some of you added extra spaces, others preferred to separate words with underscores or dashes. All of these are fine. As long as you are consistent with your naming convention, you will be fine.

Notebooks with names like Untitled-1.ipynb are less ideal, though! I also noticed that some of you identified yourself in the filename. While this is fine when working on a real project, for the purpose of the individual assignments in this course, it is better to keep your identity anonymous.

filename
NB01 - Initial Data Analysis.ipynb            37
Untitled-1.ipynb                               4 * not ideal. Most likely, this is an empty notebook that you created by mistake
NB01 - Analysis of Family Tree.ipynb           3
NB01_Initial_Data_Analysis.ipynb               2
XXXXX - Initial Data Analysis.ipynb            1 XXXXX represents the candidate number (not a requirement of this assignment)
Initial Data Analysis.ipynb                    1
LSE_DS105A_W04_lecture.ipynb                   1 * you probably used it as a template. Fine, but remember to remove from GitHub before the final push
NB01 - Initial Data Analysis 1.ipynb           1 * the 1 is not necessary
NB01 - Initial Data Analysis LATEST .ipynb     1 * LATEST is not necessary if you use GitHub history to track your changes
NB01 - Initial Data Analysis".ipynb "          1 * not sure what happened here
NB01-Initial-Data-Analysis.ipynb               1
NB1 - Initial Data Analysis.ipynb              1
formative1.ipynb                               1
jupiterXXXXXX.ipynb                            1 * XXXXXX was your username. Not ideal, as you are no longer anonymous
Name: username, dtype: int64

πŸ“ Solutions & Common mistakes

Part 1: Create synthetic family data (10 fake marks)

Part 1 was straightforward. All you needed to do was copy and paste the Python code and run it to produce a JSON file.

Part 2: Add a notebook + initial setup (10 fake marks)

Part 2 instructions were to create a Jupyter Notebook and add some initial setup in a very precise way:

  • The first thing in your Jupyter notebook should be a Markdown cell with your candidate number.
  • The following cells should then alternate between Markdown and Code cells.

The most common mistake I saw here was confusion between Markdown and Code cells.

What is the difference? Markdown cells contain formatted text (headings, lists, links, etc.), while Code cells contain Python code that the notebook actually executes.

Part 3: Answer the following questions (80 fake marks)

This was the core of the assignment, and required knowledge of Python dictionaries and lists, as well as the ability to use for or while loops to iterate over the data. Inevitably, you would have to search for information on the internet to solve some of the problems. This is fine and encouraged, as it is precisely how developers and data scientists work in the real world.

The most common mistake I saw here was that many of you wanted to jump straight to the final solution without breaking down the problem into smaller, more manageable problems. I wrote the lengthy section below to take you step by step through the process of solving challenging problems like this one.

Before answering the questions, read the data

The #1 thing you should do as soon as you acquire a new dataset is to read it. This is a good practice because it allows you to understand the data you are working with and to identify potential problems.

How do I read a JSON file in Python? With the json library:

import json

with open("companies_and_families.json", "r") as file:
    data = json.load(file)

OK, now what is the type of data?

type(data)
dict

It is a pure Python dictionary! This is good, because we know how to work with Python dictionaries. Let’s see what keys this dictionary has:

data.keys()
dict_keys(['companies', 'families'])

Now you can start exploring.

Q1: How many companies are in the dataset? (5 fake marks)

You know the data dictionary contains a key called "companies", but what type of data is stored under it?

type(data["companies"])
list

It is a simple Python list! You should now know that you can use the len() function to count the number of elements in a list:

len(data["companies"])

Q2: How many family trees are there? (5 fake marks)

Similarly, the solution to Q2 is:

len(data["families"])

Q3: How many unique names appear in the first family tree? (10 fake marks)

This requires a bit more work.

Because the family tree is unique to each one of you (it is generated randomly every time you run the script), I can’t give you a solution that works for everyone. Instead, I will show you how I would approach this problem in a way that would work for any family tree.

The first thing I’d advise you do is write down all the steps you need to take to solve this problem. This is a good practice because it helps you to break down a complex problem into smaller, more manageable problems:

  1. Take a look at the first family tree to see what it looks like
  2. Understand where in the family tree the names are stored
  3. Find a way to extract the names
  4. Check if my solution above works for other family trees and adjust it if necessary
  5. Find a way to loop over ALL family trees with a for loop, all while keeping track of the names you have already seen (the question asks for unique names)
  6. Check your solutions by inspecting some of the trees manually

Now, let’s take baby steps.

1. Take a look at the first family tree to see what it looks like

Remember from Week 04 lecture how to navigate elements in a nested dictionary or nested list? You can use the square brackets [] to navigate to the element you want. For example, to access the first family tree, you can do:

data["families"][0]

Here I used [0] because what is inside data["families"] is a list, and lists must be accessed by their index. The first element of a list has index 0, the second element has index 1, and so on. If data["families"] were a dictionary instead, I would have used a key to access the element I want.
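A toy example of the difference (made-up data, just for illustration):

toy_list = ["first", "second"]
toy_dict = {"Partner 1": "first", "Partner 2": "second"}

print(toy_list[0])            # lists are accessed by integer position
print(toy_dict["Partner 1"])  # dictionaries are accessed by key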

2. Understand where in the family tree the names are stored

The code above returned:

{'Partner 1': {'name': 'Maria Moody',
  'work': {'company': 'Diaz Group',
   'role': 'Admin Coordinator',
   'job_type': 'Admin'},
  'spouse': 'Becky Klein'},
 'Partner 2': {'name': 'Becky Klein',
  'work': {'company': 'Rasmussen PLC',
   'role': 'Data Engineering Manager',
   'job_type': 'Data Engineer'},
  'spouse': 'Maria Moody'}}

This is a dictionary with two keys: Partner 1 and Partner 2. Each of these keys has another dictionary as its value. This is a nested dictionary; in other words, there is a dictionary inside another dictionary.

3. Find a way to extract the names

The most relevant thing here is that names are stored as values of the name key. So, to access the name of Partner 1, I can do:

data["families"][0]["Partner 1"]["name"]

returning:

'Maria Moody'

In this case, I can easily extract both names by doing:

name1 = data["families"][0]["Partner 1"]["name"]
name2 = data["families"][0]["Partner 2"]["name"]

But you don’t want to save each individual name to its own Python variable. The beauty of programming lies in finding ways to automate repetitive tasks.

Using your knowledge of Python dictionaries to automate the solution above

To improve the solution above, you would need to know that ["Partner 1"] and ["Partner 2"] are the only keys in the dictionary and that you can retrieve the values of all keys in a dictionary using the values() method.

data["families"][0].values()
dict_values([{'name': 'Maria Moody', 'work': {'company': 'Diaz Group', 'role': 'Admin Coordinator', 'job_type': 'Admin'}, 'spouse': 'Becky Klein'}, 
             {'name': 'Becky Klein', 'work': {'company': 'Rasmussen PLC', 'role': 'Data Engineering Manager', 'job_type': 'Data Engineer'}, 'spouse': 'Maria Moody'}])

The dict_values wrapper is just a fancy way of saying that the output is a list-like object that came from a dictionary. I can convert it to a pure list to avoid confusion:

list(data["families"][0].values())
[{'name': 'Maria Moody', 'work': {'company': 'Diaz Group', 'role': 'Admin Coordinator', 'job_type': 'Admin'}, 'spouse': 'Becky Klein'}, 
 {'name': 'Becky Klein', 'work': {'company': 'Rasmussen PLC', 'role': 'Data Engineering Manager', 'job_type': 'Data Engineer'}, 'spouse': 'Maria Moody'}]

That’s simpler.

How then can I make use of it? Well, since this is a list, I know I can iterate over it with a for loop.

A for loop to save the day

# I start with an empty list
# I will later append the names to this list
all_names = []

# I iterate over the list of dictionaries
for person in data["families"][0].values():
    # I append the name of each person to the list
    all_names.append(person["name"])

# I print the list to check if it worked
print(all_names)
['Maria Moody', 'Becky Klein']

4. Check if my solution above works for other family trees and adjust it if necessary

How do I know if the above is all I need to extract names from any family tree? I can check by running the code above for other family trees.

all_names = []

# I chose family 1 instead of family 0 this time
for person in data["families"][1].values():
    all_names.append(person["name"])

print(all_names)
['Cheryl Olson', 'Nancy Moreno']

To make sure this is correct, let me look at the family tree in its original format:

data["families"][1]
{'Partner 1': {'name': 'Cheryl Olson',
  'work': None,
  'spouse': 'Nancy Moreno',
  'children': [{'name': 'Jennifer Mcdonald',
    'work': {'company': 'Patel-Gonzalez',
     'role': 'Lead Data Scientist',
     'job_type': 'Data Scientist'},
    'spouse': 'Philip Rodriguez'}]},
 'Partner 2': {'name': 'Nancy Moreno',
  'work': {'company': 'Fields Inc',
   'role': 'Junior Data Engineer',
   'job_type': 'Data Engineer'},
  'spouse': 'Cheryl Olson',
  'children': [{'name': 'Jennifer Mcdonald',
    'work': {'company': 'Patel-Gonzalez',
     'role': 'Lead Data Scientist',
     'job_type': 'Data Scientist'},
    'spouse': 'Philip Rodriguez'}]}}

Wait! There’s more stuff here. These people have children. I need to count them too.

The question you should ask yourself right now is the following: how do I know when a nested dictionary is over? How do I know when I have reached the end of the tree? This is a very important question that comes up frequently when dealing with JSON as well as HTML files.

The answer is that you need to look at the structure of the data. In this case, I can see that the children are stored as a list of dictionaries. So, I need to iterate over this list and extract the names of the children.

I would go back to my previous cell of code and edit it to look like this:

all_names = []

# I iterate over the list of dictionaries
for person in data["families"][1].values():
    all_names.append(person["name"])

    # I check if the person has children
    if "children" in person.keys():
        # I iterate over the list of children
        for child in person["children"]:
            all_names.append(child["name"])

# I print the list to check if it worked
print(all_names)
['Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Jennifer Mcdonald']

5. Find a way to loop over ALL family trees with a for loop, all while keeping track of the names you have already seen (the question asks for unique names)

The above works for both family trees 0 and 1. But what about the other family trees? What if I have a family tree that has 100 levels of nested dictionaries? Would I have to write 100 if and for loops?

NO! Remember: you should always try to automate repetitive tasks.

How can I rewrite the code above to make it work for any family tree? I could use a while loop instead of a for loop. A while loop keeps running for as long as a certain condition holds, and is therefore useful when I don’t know in advance how many elements I need to iterate over. In this case, I want to keep looping until I have reached the end of the tree.
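Before applying this idea to the tree, here is a minimal while loop just to illustrate the mechanics (a toy example):

countdown = 3

while countdown > 0:   # keeps running for as long as this condition is true
    print(countdown)
    countdown -= 1     # without this line, the loop would never end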

all_names = []

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.append(person["name"])

    current_person = person

    # Here I want to write a while loop
    # that runs until I have reached the end of the tree
    # I know I have reached the end of the tree when
    # the person does not have children
    while "children" in current_person.keys():
        for child in current_person["children"]:
            all_names.append(child["name"])

            # Change the current person to the child
            current_person = child

all_names

The above works perfectly when I select family 0 (curr_family = 0), but it only kinda works for family 1. The name of Jennifer Mcdonald appears twice in the list. This is because she is listed as a child of both her parents 🙃.

['Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Jennifer Mcdonald']

Before I can proceed, I need to deal with this case of repeated names. One thing I could do is: whenever I am about to append a child’s name to the list, check if the name is already in the list. If it is, I don’t append it. If it is not, I append it.

if child["name"] not in all_names:
    all_names.append(child["name"])

This would be fine. But there is a better way. I can use a Python set instead of a list. A set is a data structure that only stores unique values. If I try to add a value that is already in the set, it will not be added. This is exactly what I want.
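A tiny demonstration of this behaviour:

unique_names = set()
unique_names.add("Jennifer Mcdonald")
unique_names.add("Jennifer Mcdonald")  # adding a duplicate has no effect

print(unique_names)  # {'Jennifer Mcdonald'}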

# all_names is no longer a list
all_names = set()

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.add(person["name"])

    current_person = person

    while "children" in current_person.keys():
        for child in current_person["children"]:
            # Instead of .append(), sets use .add()
            all_names.add(child["name"])
            current_person = child

all_names

which correctly prints out:

{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno'}

The curly braces, {}, are a sign that this is a set not a list (which would be enclosed in square brackets, []).

Looking back at this family tree, I also notice that Jennifer has a spouse! I need to add their name to the list too. I can do this by adding another if statement to my code:

all_names = set()

curr_family = 1

for person in data["families"][curr_family].values():
    all_names.add(person["name"])

    current_person = person

    while "children" in current_person.keys():
        for child in current_person["children"]:
            all_names.add(child["name"])

            # Add the spouse's name to the list
            if "spouse" in child.keys():
                all_names.add(child["spouse"])
            current_person = child

all_names
{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Philip Rodriguez'}

Great! Am I done?

Well, you can check by running the code above for different family trees manually (change curr_family to 0, 2, 3, etc.) until you are confident that it works for family trees regardless of how nested they are. (It does.)

Use functions!

Instead of editing the curr_family variable, it would be nicer to wrap the code above inside a custom function. In Python, functions are created with the def keyword and provide a way to reuse code. In this example, the perfect function would be one that takes a family tree as input and returns a set with all the names in that family tree.

def get_names_in_family(family_tree):
    all_names = set()

    for person in family_tree.values():
        all_names.add(person["name"])

        current_person = person

        while "children" in current_person.keys():
            for child in current_person["children"]:
                all_names.add(child["name"])

                # Add the spouse's name to the list
                if "spouse" in child.keys():
                    all_names.add(child["spouse"])
                current_person = child

    return all_names

How do I test it? Just call the function with the family tree you want to test:

get_names_in_family(data["families"][0])
{'Becky Klein', 'Maria Moody'}
get_names_in_family(data["families"][1])
{'Cheryl Olson', 'Jennifer Mcdonald', 'Nancy Moreno', 'Philip Rodriguez'}

Q4. Compute the number of unique individuals named per family. (20 fake marks)

If you have the function above, this becomes trivial. You just need to iterate over all family trees and count the number of unique names in each family tree.

for family in data["families"]:
    print(len(get_names_in_family(family)))

Improve the solution above

The solution above is fine because we just asked you to compute, not to save the results. But what if we wanted to save the results? How would you do it?

# I start with an empty list
all_family_sizes = []

# I iterate over all family trees
for family in data["families"]:
    all_family_sizes.append(len(get_names_in_family(family)))

Improve it further with list comprehension

The solution above is fine, but it is not very elegant. It is also not very efficient because I am creating an empty list and then appending to it. There is a better way to do this: list comprehension.

all_family_sizes = [len(get_names_in_family(family)) for family in data["families"]]

Better yet: use a dictionary to store the results

Comprehensions are not just for lists; you can create dictionaries with them too. Instead of brackets [], you use curly braces {} and, importantly, you need to provide a key and a value for each element in the dictionary.

# Saving the same results above as a dictionary instead
all_family_sizes = {i: len(get_names_in_family(family)) for i, family in enumerate(data["families"])}

If you have never heard of enumerate before, it is a very useful function that returns two things at once: the index of the element and the element itself. In this case, I am using it to get the index of each family tree and save it as the key of the dictionary.
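A quick illustration of enumerate, with made-up values:

for i, company in enumerate(["Diaz Group", "Fields Inc"]):
    print(i, company)

# 0 Diaz Group
# 1 Fields Inc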

This is very useful for later, when I want to use pandas to create a DataFrame object with the results.

Q5. Compute the number of unique individuals named per company. (20 fake marks)

The reasoning is similar to the two questions above. I would do well to create a function that goes over the family tree and keeps track not just of people’s names, but also of the companies they work for. I will jump straight to the final solution:

# This function would substitute the get_names_in_family() function above
def get_names_and_companies_in_family(family_tree):
    all_names = set()

    # Let's keep companies as a dictionary where the key is the company name
    all_companies = {}

    for person in family_tree.values():
        all_names.add(person["name"])

        # Does this person have a work entry?
        if "work" in person.keys():
            # Some people have a work entry but it is empty
            if person["work"] is not None:
                # Have I seen this company before?
                if person["work"]["company"] in all_companies.keys():
                    # If yes, I add the person's name to the list of people who work for this company
                    all_companies[person["work"]["company"]].append(person["name"])
                else:
                    # If not, I create a new key with the company name and add the person's name to the list
                    all_companies[person["work"]["company"]] = [person["name"]]

        current_person = person

        while "children" in current_person.keys():
            for child in current_person["children"]:
                all_names.add(child["name"])
                
                # Same thing I did to the parent
                # This piece of code is repeated and could be further improved 
                # by being turned into a separate function
                if "work" in child.keys():
                    if child["work"] is not None:
                        if child["work"]["company"] in all_companies.keys():
                            all_companies[child["work"]["company"]].append(child["name"])
                        else:
                            all_companies[child["work"]["company"]] = [child["name"]]

                if "spouse" in child.keys():
                    all_names.add(child["spouse"])
                current_person = child

    # I am returning TWO objects: a set with all names and a dictionary with all companies
    return all_names, all_companies

See what happens when I run it for the first family tree:

get_names_and_companies_in_family(data["families"][0])
({'Becky Klein', 'Maria Moody'},
 {'Diaz Group': ['Maria Moody'],
  'Rasmussen PLC': ['Becky Klein']})

The object above is a Python tuple. It is a data structure that is similar to a list, but it is immutable (you can’t change it). It is useful when you want to return multiple objects from a function.

How can I get just names and just the companies information? I can use the same trick I used before with the enumerate function and unpack the tuple into two variables:

names, companies = get_names_and_companies_in_family(data["families"][0])

The get_names_and_companies_in_family() function would also be a valid solution for Q3.

Solution

Now that I have this function, how do I use it to solve Q5? I can use a for loop to iterate over all companies and count the number of unique names in each company.

# I start with an empty dictionary
# This dictionary will have the company name as key and the number of unique names as value
num_people_per_company = {}

# I iterate over all family trees
for family in data["families"]:
    # I use the function above to get the names and companies in each family tree
    names, companies = get_names_and_companies_in_family(family)

    # I iterate over all companies in the dictionary
    for company in companies.keys():
        # I check if the company is already in the dictionary
        if company in num_people_per_company.keys():
            # If yes, I add the number of unique names to the existing value
            num_people_per_company[company] += len(companies[company])
        else:
            # If not, I create a new key with the company name and add the number of unique names
            num_people_per_company[company] = len(companies[company])


print(num_people_per_company)

which returns something like:

{'Diaz Group': 80, 'Rasmussen PLC': 87, 'Patel-Gonzalez': 97, 'Fields Inc': 83, 'Sampson-Shaw': 80, 'Rivera and Sons': 81, 'Parrish-Cruz': 67, 'Walker, Harris and Johnson': 73, 'Peterson PLC': 74, 'Smith, Harper and Garza': 78, 'Hale, Dunn and Graham': 74, 'Matthews, Kennedy and James': 66, 'Doyle-Olsen': 88, 'Sherman-Washington': 105, 'Ortiz Ltd': 90, 'Rodriguez, Fox and Gaines': 62, 'Taylor LLC': 67, 'Foley-Villanueva': 72, 'Thomas, Gallagher and Vazquez': 92, 'Lester Group': 86}

💡 AN EXERCISE TO THE READER: How would you convert the dictionary above to pandas?
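In case you want to check your attempt at this exercise, one possible approach (a sketch) is to hand the dictionary straight to pd.Series, which turns the keys into the index:

import pandas as pd

# Dictionary keys become the index, values become the data
company_sizes = pd.Series(num_people_per_company, name="num_people")
print(company_sizes.sort_values(ascending=False).head())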

Q6 Write a custom function called get_company_size that takes a company name as an argument and returns the number of people working in that company. Show that it works. (40 fake marks)

This solution could go in many different directions. I could simply use the dictionary I created in Q5 to solve this question.

def get_company_size(company_name):
    return num_people_per_company[company_name]

But note that the function assumes that this dictionary with the name num_people_per_company already exists and is not created inside the function. This is not ideal because it means that the function is not self-contained. It depends on an external variable.

Creating a self-contained function

We would need to adapt the function above to make it self-contained. It is okay to rely on other functions, just not a good idea to rely on external variables.

def get_company_size(company_name, data):
    num_people_per_company = {}

    for family in data["families"]:
        # I can reuse the function I created above
        names, companies = get_names_and_companies_in_family(family)

        for company in companies.keys():
            if company in num_people_per_company.keys():
                num_people_per_company[company] += len(companies[company])
            else:
                num_people_per_company[company] = len(companies[company])

    return num_people_per_company[company_name]
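To show that it works, as the question asked, I can call the function with one of the company names (the exact number will vary with your randomly generated data):

print(get_company_size("Diaz Group", data))  # with my data, this prints 80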

🧐 Deep data-driven analysis

Here, I treat your submissions as data and show how you could use some of the tools we have learned in class to analyse it.

Jupyter Notebooks are plain text files

Here is a little secret: Jupyter notebooks are just text files with a .ipynb extension.

The fact that clicking on an .ipynb file in VS Code (or in JupyterLab) renders the notebook in a nice format is just a feature of VS Code (or JupyterLab). In other words, just like HTML code is translated into a nice webpage by your browser, Jupyter Notebooks are translated into a nice format by VS Code (or JupyterLab).

As you saw in Week 04 Lecture, we can use bash’s head command to print out just the first few lines of a text file. Let’s try it with a Jupyter notebook:

!head -n 30 "ds105a-2023-w04-formative-<USERNAME>/NB01 - Initial Data Analysis.ipynb"

which outputs:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3a1b7d8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#setup\n",
    "import json\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "af684478",
   "metadata": {},
   "outputs": [],

Look similar to something you have seen before? Yes! This is JSON! Jupyter Notebooks are JSON files.

💡 The only way you will fully understand the rest of this analysis is if you try to replicate what I did on your own Jupyter Notebook. Try repurposing the code below to read your own Jupyter Notebook as a JSON file into a Python dictionary (see also the Week 04 Lecture notebook for an example).

Exploring the metadata in your Jupyter Notebooks

When I load one of your Jupyter Notebooks as a JSON file into Python with the json library, I can see that it contains four different keys: cells, metadata, nbformat, and nbformat_minor.

import json

with open("ds105a-2023-w04-formative-<USERNAME>/NB01 - Initial Data Analysis.ipynb") as f:
    nb_as_json = json.load(f)

print(nb_as_json.keys())

Further exploring this dictionary, I see that the nb_as_json["metadata"] key contains a dictionary with a lot of information about the notebook, such as what version of Python you used to run the notebook.

nb_as_json["metadata"]["language_info"]["version"]

If I go one step further, I can read ALL Jupyter Notebooks and summarise the Python version used by all of you.

But how do I read all Jupyter Notebooks at the same time?

(You won’t be able to replicate this section because you only have your own notebook)

At the start of this document you saw that I used the * wildcard in bash to match multiple files. It turns out that I can do the same with Python’s glob library.

import glob

# This matches every .ipynb file inside the submission folders (recursively)
path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)

As this returns a list with the path to all files, I can now iterate over this list using a for loop and load each file as a JSON file.

Putting things together

import json
import glob

python_versions = []
path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)

for path_to_notebook in path_to_all_notebooks:
    with open(path_to_notebook, "r") as file:
        nb_as_json = json.load(file)

        # Some metadata does not have a "version" key, so I need to check if it exists first
        if 'version' in nb_as_json["metadata"]["language_info"].keys():
            python_versions.append(nb_as_json["metadata"]["language_info"]["version"])


Finally, I can use pandas to create a DataFrame object and count the number of times each Python version was used.

import pandas as pd

df = pd.DataFrame(python_versions, columns=["python_version"])

df["python_version"].value_counts()
python_version
3.11.4            21
3.11.5            12
3.12.0             6
3.11.6             4
3.10.8             2
3.9.6              2
3.10.7             1
3.11.0             1
3.9.13             1
3.9.18             1
3.9.7              1
dtype: int64

Most of you are using Python 3.11, one of the latest versions, and no one is using any super old version of Python. This is great!

More than just extracting your Python version, I want to read the content of what you wrote in your Jupyter Notebooks. It is now easy to adapt the code above to read all your Jupyter Notebooks as JSON files and store them in a list.


import json
import glob

path_to_all_notebooks = glob.glob("ds105a-2023-w04-formative-*/**/*.ipynb", recursive=True)
all_json_notebooks = []

for path_to_notebook in path_to_all_notebooks:
    # Skip empty 'Untitled' notebooks and copies of the lecture notebook
    if "Untitled" not in path_to_notebook and "lecture" not in path_to_notebook:
        with open(path_to_notebook, "r") as file:
            nb_as_json = json.load(file)
            all_json_notebooks.append(nb_as_json)

print("I just read all ", len(all_json_notebooks), " notebooks.")
I just read all  52  notebooks.

Requirement 1: add your candidate number to the first cell

Now that I have read all your notebooks into a Python list of dictionaries, I can navigate to the ["cells"] key (a list) of each notebook, take its first cell ([0]) and read its content (["source"]):

# I create an empty list to store it
first_cells = []

# I iterate over all notebooks and append the first cell to the list
for notebook in all_json_notebooks:
    first_cells.append(notebook["cells"][0]["source"])

print("Just checking that I read the first cells of all ", len(first_cells), " notebooks.")
Just checking that I read the first cells of all  52  notebooks.

The length of this list matches the number of notebooks I read, so I know that I have read all the notebooks correctly.

Practising list comprehension

I could make the code above neater and shorter by using list comprehension:

# I achieve the same thing as above in just one line of code
first_cells = [notebook["cells"][0]["source"] for notebook in all_json_notebooks]

What to do right after collecting data in a list?

Just browse through it! See what you can find. This is solid advice for any data analysis.

A good example of someone who followed the instructions to the letter

The first element of my list shows that this person correctly added their candidate number to the first cell:

first_cells[0]
['78182']

Note: the output is a list of size 1. The candidate number is the first (and only) element of this list.

Oops! Someone forgot to add their candidate number

first_cells[3]
['##1. Setup\n',
 '\n',
 'import json\n',
 'import pandas as pd\n',
 'from pprint import pprint\n',
 '\n',
 '\n',
 '## 2. Read the data\n',
 '\n',
 '#file = open("companies_and_families.json", "r")\n',
 '#data = json.load(file)\n',
 '#pprint(data)\n',
 '\n']

This shows that the first cell of this person’s notebook contains code to import libraries and read the data, but no candidate number.

A closer look into this notebook also reveals that all of the code above was put together into a single Code cell¹. The assignment required the information to be split into multiple cells, some of which would be Markdown and some Code.

Summarising cell types

I want to know how many of you added a first cell of the type Markdown:

first_cell_type = [notebook["cells"][0]["cell_type"] for notebook in all_json_notebooks]

# Convert the list above to a pandas DataFrame so I can do useful things with it
df = pd.DataFrame(first_cell_type, columns=["first_cell_type"])

df["first_cell_type"].value_counts()

returning:

first_cell_type
markdown           33
code               19
dtype: int64

But I saw that some of those who created the first cell as a Code cell had in fact added their candidate number as a Python comment, which is fine.

Number of lines

Perhaps the ultimate test to see if you followed the instructions is to count the number of lines in the first cell. If you followed the instructions, the first cell should have only one line.

number_lines_first_cell = [len(notebook["cells"][0]["source"]) for notebook in all_json_notebooks]

df = pd.DataFrame(number_lines_first_cell, columns=["number_lines"])

df["number_lines"].value_counts()

OK! The majority of you did it correctly.

1      43
3       3
13      1
6       1
7       1
0       1
41      1
113     1
Name: number_lines, dtype: int64

I won’t go further into the analysis of the remaining cells; this is already a long document. But I hope you can see how you could use a similar approach if you were to analyse JSON files in the future.

Footnotes

  1. I can further confirm this by running all_json_notebooks[3]["cells"][0]["cell_type"], which returns "code". ↩︎