✏️ W04 Formative - Fake it ’till you make it

2023/24 Autumn Term

Author

⏲️ Due Date:

🎯 Main Objectives:

Important

Please submit your work even if you didn’t get very far with the Python code. As this is a formative assignment, it won’t be graded, and you can still benefit from learning how to use GitHub effectively.

👉 Note: This assignment will count towards your final class grade if you are a General Course or Exchange student.

📚 Preparation

We will use a feature of GitHub called GitHub Classroom as the place to store your submissions. You will need a GitHub account to do this.

  1. In case you haven’t done it already, please follow the steps in the 📚 Week 03 Lab - Preparation page. (It’s the same prep you would have done for )

  2. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link. Do not share this link with anyone outside this course!

  3. Click on the link, sign in to GitHub and then click on the green button Accept this assignment.

  4. You will be redirected to a new private repository created just for you. The repository will be named ds105a-2023-formative1--yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.

  5. Many of you might still be catching up with Python and GitHub so it is understandable if you struggle a bit with this first coding exercise. Your submission will still count as completed (important for General Course and Exchange students) even if you don’t manage to complete all the questions.

  6. Create your own Jupyter notebook with your answers.

  7. Try to create separate headers and code chunks for each question. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.

  8. Use the #help-assessments channel on Slack liberally if you get stuck. We will give priority to questions asked on these public channels over private DMs.

“What do I submit?”

Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five-digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

  • A Python script file called generate_fake_data.py.

  • The JSON file produced by the script above, which should be called companies_and_families.json.

  • A Jupyter Notebook file called NB01 - Initial Data Analysis.ipynb that contains your analysis of the JSON, as specified below. Your candidate number must appear in the first cell of this notebook.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment.

✔️ How we will grade your work

We won’t! This is formative. But you will get feedback on your answers. It won’t be super detailed at this stage, but it should give you an idea of how you are doing.

👉 Note: Completing this assignment will count towards your final class grade if you are a General Course or Exchange student. It will still count as submitted even if you submit just a few coding responses.

📚 Tasks

The questions below will build on principles from the 👨🏻‍🏫 W03 lecture and the code you wrote as part of the 💻 W03 Lab.

Part 1: Create synthetic family data (10 fake marks)

  1. Create a generate_fake_data.py script using the following code:
import json
import random
import numpy as np

from faker import Faker

fake = Faker()

all_names = set()
has_spouse = set()

def generate_company():
    return {
        "name": fake.company(),
        "mission": fake.bs(),
        "catch_phrase": fake.catch_phrase()
    }

# Generate a limited set of companies
companies = [generate_company() for _ in range(20)]
company_names = [company["name"] for company in companies]

# Define role hierarchies for different job types
role_hierarchies = {
    "Data Scientist": ["Junior Data Scientist", "Data Scientist", "Senior Data Scientist", "Lead Data Scientist", "Data Science Manager"],
    "Data Engineer": ["Junior Data Engineer", "Data Engineer", "Senior Data Engineer", "Lead Data Engineer", "Data Engineering Manager"],
    "Admin": ["Admin Assistant", "Admin Coordinator", "Admin Manager", "Admin Director"]
} 

def generate_person():
    
    employment_status = np.random.choice([True, False], p=[0.8, 0.2])
    work_data = None
    if employment_status:
        job_type = random.choice(list(role_hierarchies.keys()))
        work_data = {
            "company": random.choice(company_names),
            "role": random.choice(role_hierarchies[job_type]),
            "job_type": job_type
        }

    name = fake.name()
    while name in all_names:
        name = fake.name()


    all_names.add(name)
    person = {
        "name": name,
        "work": work_data
    }
    return person



def generate_family_tree(depth, avg_num_children=2):

    if depth == 0:
        return None, None
    elif depth == 1:
        return generate_person(), None
    else:
        partner1, partner2 = generate_person(), generate_person()

        # If this same name has already been used as someone's spouse, generate a new person
        while partner1["name"] in has_spouse:
            partner1 = generate_person()
        while partner2["name"] in has_spouse:
            partner2 = generate_person()
        
        # Add them to the has_spouse set
        has_spouse.add(partner1["name"])
        has_spouse.add(partner2["name"])
        
        partner1["spouse"] = partner2["name"]
        partner2["spouse"] = partner1["name"]
        
        for _ in range(random.randint(0, avg_num_children)):
            child, _ = generate_family_tree(depth-1, random.randint(0, avg_num_children))
            if child:
                partner1["children"] = partner1.get("children", [])
                partner2["children"] = partner2.get("children", [])

                partner1["children"].append(child)
                partner2["children"].append(child)
        
        return partner1, partner2

def generate_families(num_families, avg_family_depth):
    families = []

    for _ in range(num_families):
        family_depth = int(np.random.exponential(avg_family_depth))
        partner1, partner2 = generate_family_tree(family_depth)

        if partner1 and partner2:
            families.append({"Partner 1": partner1, "Partner 2": partner2})

    return families

if __name__ == '__main__':

    num_families = 500
    avg_family_depth = 10
    families = generate_families(num_families, avg_family_depth)

    data = {
        "companies": companies,  # Include company details here
        "families": families
    }
    
    with open('companies_and_families.json', 'w') as f:
        json.dump(data, f, indent=4)
  2. Run the script on the Terminal:

    python generate_fake_data.py

    You should see a JSON file called companies_and_families.json in your folder. Open it with VS Code to take a look at the data.

  3. Commit and push your changes to GitHub.
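When you open the JSON file, it helps to know what shape to expect. The sketch below is a hand-written miniature that uses the same keys as the script (`companies`, `families`, `Partner 1`, `Partner 2`, `name`, `work`, `spouse`, `children`). The people and the company in it are made up for illustration; your real file will contain Faker-generated values and many more records.

```python
import json

# Hypothetical miniature of the structure written by generate_fake_data.py.
# All names and the company below are invented for illustration only.
child = {"name": "Casey Doe", "work": None}

partner1 = {
    "name": "Alex Doe",
    "work": {
        "company": "Acme Ltd",
        "role": "Data Engineer",
        "job_type": "Data Engineer",
    },
    "spouse": "Blake Roe",   # spouses are stored as a name string
    "children": [child],     # the same child is listed under both partners
}
partner2 = {
    "name": "Blake Roe",
    "work": None,            # None becomes null in JSON (not employed)
    "spouse": "Alex Doe",
    "children": [child],
}

toy = {
    "companies": [
        {"name": "Acme Ltd", "mission": "made-up mission", "catch_phrase": "made-up phrase"}
    ],
    "families": [{"Partner 1": partner1, "Partner 2": partner2}],
}

print(json.dumps(toy, indent=4))
```

Two details are worth noticing: a person without a job has `"work": null`, and each child record appears twice, once under each partner.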

Part 2: Add a notebook + initial setup (10 fake marks)

  1. Create a new Jupyter Notebook called NB01 - Initial Data Analysis.ipynb and add your candidate number to the first cell.

  2. Add a new markdown cell and add the following header:

    # 1. Setup

  3. Then, add a new code cell to keep all the imports you will need for this notebook. You will need at least the following:

    import json
    import pandas as pd

    If you feel the need to add more later, come back to this cell and add them.

  4. Add a new markdown cell and add the following header:

    # 2. Read the data

  5. Create a new code cell and read the JSON file into a Python dictionary called simply data.

  6. Save the notebook, then commit and push your changes to GitHub.
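If you are unsure how to read a JSON file into a dictionary, here is one possible sketch. The helper name `read_json` is our own choice for illustration, not something the assignment requires. So that the example runs on its own, it first writes a tiny stand-in file to a temporary folder; in your notebook you would pass the path to companies_and_families.json instead.

```python
import json
import tempfile
from pathlib import Path


def read_json(path):
    """Read a JSON file and return its contents as Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in file so the sketch is self-contained (assumption: in the notebook
# you would call read_json("companies_and_families.json") directly).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(
        json.dumps({"companies": [], "families": []}), encoding="utf-8"
    )
    data = read_json(sample_path)

# The top-level object is a dictionary with two keys
print(type(data).__name__, list(data.keys()))
```

Using `with open(...)` closes the file automatically, even if reading fails partway through.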

Part 3: Answer the following questions (80 fake marks)

Now, you get to decide how to organise the rest of the notebook! Add as many markdown and code cells as you need to answer the questions below.

Answer each question by using Python code on the data dictionary you created in the previous section. You don’t need to use pandas for this part (but it’s ok if you do).

  1. How many companies are there in the dataset? (5 fake marks)

  2. How many family trees are there? (5 fake marks)

  3. How many unique names appear in the first family tree? (10 fake marks)

  4. Compute the number of unique individuals named per family. (20 fake marks)

  5. For each company, compute how many people work in that company. (20 fake marks)

  6. Write a custom function called get_company_size that takes a company name as an argument and returns the number of people working in that company. Show that it works. (40 fake marks)

  7. Save the notebook, commit and push your changes to GitHub.
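Several of the questions above require visiting every person in a nested tree. One common way to do that is recursion. The sketch below shows the idea on a made-up toy record; `iter_people` is a hypothetical helper name, and this is only one possible approach that you would need to adapt to the real data.

```python
def iter_people(person):
    """Yield a person record and, recursively, every descendant record."""
    yield person
    # .get() returns an empty list when a person has no "children" key
    for child in person.get("children", []):
        yield from iter_people(child)


# Toy record invented for illustration (not taken from the generated file)
toy_person = {
    "name": "Alex Doe",
    "children": [
        {"name": "Casey Doe", "children": [{"name": "Drew Doe"}]},
        {"name": "Emery Doe"},
    ],
}

# A set keeps each name only once, even if a record is visited more than once
names = {p["name"] for p in iter_people(toy_person)}
print(sorted(names))
# ['Alex Doe', 'Casey Doe', 'Drew Doe', 'Emery Doe']
```

In the real file, remember that spouses are stored as plain name strings under the `spouse` key and that the same child appears under both partners, so think about how to avoid counting anyone twice.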

Feel like you can do more? Try to calculate statistics about the different job roles and types.