πŸ“ W11+1 Summative

2023/24 Autumn Term

Author

💡 NOTE: This time, you are not asked to write code as part of the assignment. If you choose to do so, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

⏲️ Due Date:

⚠️ NOTE: This is in the middle of the day, not midnight! ⚠️

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

⚖️ Assignment Weight:

This assignment is worth 30% of your final grade in this course.

Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

πŸ“ Instructions

πŸ‘‰ Read it carefully, as some details might change from one assignment to another.

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled πŸ“ W11+1 Summative. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub, and then click on the green Accept this assignment button.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202a-2023-w11p1-summative-yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number.

    For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

    ---
    title: "DS202A - W11+1 Summative"
    author: <CANDIDATE_NUMBER>
    output: html
    self-contained: true
    ---

    Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

    ---
    title: "DS202A - W11+1 Summative"
    author: 12345
    output: html
    self-contained: true
    ---
  7. Fill out the .qmd file with your answers. This time, you are not required to write code. You can still get a high grade if you provide correct and deeply insightful answers and your notebook is nicely formatted.

    • Use headers and code chunks to keep your work organized. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
  8. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

    • If you added any code, ensure your .qmd code is reproducible. If we were to restart R and RStudio and run your notebook, it should run without errors, and we should get the same results as you did.

    • If you choose to add code, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

  9. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

  10. Read the section How to get help and collaborate with others at the end of this document.
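For the reproducibility requirement in step 8, a common source of run-to-run differences is random initialisation. The sketch below is only an illustration on hypothetical toy data (the names `toy_data` and `cluster_with_seed` are not part of the assignment); it shows that fixing the seed with `set.seed()` just before a call to base R's `kmeans()` gives identical cluster assignments on repeated runs.

```r
# Hypothetical toy data: 100 points in two dimensions (not the OkCupid data).
set.seed(2023)
toy_data <- matrix(rnorm(200), ncol = 2)

# k-means starts from randomly chosen centres, so the seed is fixed
# immediately before the call to make the result repeatable.
cluster_with_seed <- function(data, seed) {
  set.seed(seed)
  kmeans(data, centers = 3, nstart = 10)$cluster
}

first_run  <- cluster_with_seed(toy_data, seed = 42)
second_run <- cluster_with_seed(toy_data, seed = 42)

# Same data and same seed: the two assignment vectors are identical.
print(identical(first_run, second_run))
```

If you add code that relies on randomness, placing a single `set.seed()` call near the top of the relevant chunk is usually enough for a restarted R session to reproduce your results.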

“What do I submit?”

You will submit two files:

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file rendered from the Quarto markdown file.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

πŸ—„οΈ The data

This assignment will use the OkCupid dataset (Kim and Escobedo-Land 2015) first introduced in your πŸ›£οΈ Week 10 lab. The dataset contains information about 59,946 users of the dating site OkCupid and includes demographic and lifestyle data, as well as text answers to questions asked by OkCupid.

πŸ“‹ Your Tasks

What do we need from you?

Context

Imagine we are interested in understanding the user profiles on OkCupid, focusing on specific variables:

Demographic variables:

  • sex
  • orientation
  • status
  • age
  • height

Attitudinal variables:

  • drinks
  • smokes
  • drugs

We represented the categorical variables as ordered factors and then converted them into numbers. Here are the factor levels we used to categorize the variables:

  • status: c("not_informed", "single", "available", "seeing someone", "married")
  • drinks: c("not_informed", "not_at_all", "rarely", "socially", "often", "very_often", "desperately")
  • smokes: c("not_informed", "no", "sometimes", "when drinking", "yes", "trying to quit")
  • drugs: c("not_informed", "never", "sometimes", "often")
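As a rough illustration of this encoding, the sketch below converts a few hypothetical `status` values into an ordered factor using the levels listed above, then into integers. The input vector and the choice to map missing values to "not_informed" are assumptions made for the example; the exact preprocessing code used for the assignment is not shown here.

```r
# Levels exactly as listed above; their order determines the integer codes.
status_levels <- c("not_informed", "single", "available", "seeing someone", "married")

# Hypothetical raw values, including one missing entry.
status_raw <- c("single", "available", NA, "married")

# Assumption: missing values are recoded as "not_informed".
status_raw[is.na(status_raw)] <- "not_informed"

# Build an ordered factor, then take its integer codes.
status_factor <- factor(status_raw, levels = status_levels, ordered = TRUE)
status_num <- as.integer(status_factor)

print(status_num)
# [1] 2 3 1 5
```

Under this scheme a `status` value of 2 corresponds to "single", because "single" is the second level in the vector.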

This whole process leads to a data frame that looks like the following:

sex   orientation  status  age  height  drinks  smokes  drugs
male  straight     2       22   190.50  4       3       2
male  straight     2       36   177.80  5       2       3
male  straight     3       37   172.72  4       2       1
male  straight     2       22   180.34  4       2       1
male  straight     2       30   167.64  4       2       2
male  straight     2       28   170.18  4       2       1

To identify groups of users with similar profiles, we create the recipe below before applying any clustering algorithm:

base_recipe <-
  recipe(~., data = df_okcupid) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_factor_predictors()) %>%
  prep()

Part 1: Clustering and PCA (60 marks)

Question 1

Value: 7 marks

Describe the recipe outlined above. What is the purpose of each step? Is there anything you would modify? (~ 100 words should be enough)

Question 2

Value: 8 marks

If you decided to apply k-means clustering to the data after using this recipe, how would you go about choosing the number of clusters? Please explain your reasoning. (~ 100 words should be enough)

Question 3

Value: 15 marks

How would you determine what each cluster means? In other words, how would you go about describing the most typical user profile within each cluster? Please provide an explanation for your approach. (~ 200 words should be enough)

Question 4

Value: 30 marks

Now suppose you added an extra step to the base_recipe, to calculate the principal components of the data:

pca_recipe <-
    base_recipe %>%
    step_pca(all_predictors(), num_comp = Inf, keep_original_cols=TRUE) %>%
    prep()

If you plot the cumulative proportion of variance explained by each principal component, you will see the following:

Now answer the following questions: (max. 300 words)

  • Why are there 9 principal components?
  • If you were to use k-means clustering on the data now, how many principal components would you consider using? Give reasons for your choice.
  • How would you change the approach to interpreting the clusters? Would you still use the same method you proposed in Question 3? Why or why not?
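For reference, the cumulative proportion of variance explained mentioned in Question 4 can be computed from the standard deviations of the principal components. The sketch below is a minimal illustration using base R's `prcomp()` on the built-in `USArrests` dataset, which stands in for the prepared OkCupid data purely as an assumption for the example. It is not the code used to produce the plot in this assignment.

```r
# USArrests ships with base R; it is a stand-in dataset for this sketch only.
pca_fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance captured by each component is its squared standard deviation.
component_variance <- pca_fit$sdev^2

# Proportion of total variance per component, then the running total.
var_explained <- component_variance / sum(component_variance)
cum_var_explained <- cumsum(var_explained)

print(round(cum_var_explained, 3))
```

Plotting `cum_var_explained` against the component index gives a curve that rises to 1 at the last component, which is the kind of plot the question refers to.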

Part 2: Text mining (40 marks)

On top of the numeric and categorical variables, the dataset also contains ‘essays’ written by the users in their profiles. These essays are text answers to the following questions asked by OkCupid:

  • essay0: My self summary
  • essay1: What I’m doing with my life
  • essay2: I’m really good at
  • essay3: The first thing people usually notice about me
  • essay4: Favorite books, movies, show, music, and food
  • essay5: The six things I could never do without
  • essay6: I spend a lot of time thinking about
  • essay7: On a typical Friday night I am
  • essay8: The most private thing I am willing to admit
  • essay9: You should message me if…

Note: The essays cannot be linked to the users’ profile data we have been analysing, as the researchers who supplied the data have intentionally shuffled them. This was done to prevent the possibility of identifying individual users. You can read more about it in the codebook provided alongside this data on GitHub.
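To see why shuffling breaks the link between essays and profiles, consider the small sketch below. It uses a hypothetical four-row data frame and base R's `sample()` to permute one column independently of the others. This is only an assumption about what such a shuffle might look like; the actual procedure is described in the codebook.

```r
# Hypothetical toy profiles; the essay texts are placeholders.
toy_profiles <- data.frame(
  age = c(22, 36, 37, 30),
  essay0 = c("essay A", "essay B", "essay C", "essay D"),
  stringsAsFactors = FALSE
)

# Permute the essay column on its own, leaving the other columns in place.
set.seed(7)
shuffled <- toy_profiles
shuffled$essay0 <- sample(toy_profiles$essay0)

# The same essays are still present, but no longer tied to their original rows.
same_essays <- setequal(shuffled$essay0, toy_profiles$essay0)
print(same_essays)
```

The practical consequence is that the essays can be analysed as a corpus in their own right, but any row-wise relationship between an essay and variables such as `age` or `sex` carries no information.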

Now answer the following questions: (~ 200-500 words)

  • How would you go about analysing the text data? What would you do first?
  • What interesting research questions would you pose to the data? Why?
  • What techniques (supervised or unsupervised) would you use to analyse the text data? Why?

✔️ How we will grade your work

Expect to score around 70/100 if your answers are correct. Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score; you need to add unique insights or analyses to get a distinction - I cannot tell you what these are, but these should be things that make us go, “wow, that’s a great idea! I hadn’t thought of that”.

Part 1: Clustering and PCA (60 marks)

Here is a rough rubric of how we will grade your answers:

Question 1

  • 0 marks: No answer was provided, or the response is irrelevant to the question.
  • 3 marks: The answer is somewhat correct but lacks precision. It is not well-explained.
  • 5 marks: The answer is accurate, precise, and well-explained.
  • 7 marks: Besides being correct, precise, and well-explained, the answer offers a distinct insight or perspective.

Question 2

  • 0 marks: No answer was provided, or the response is irrelevant to the question.
  • 4 marks: The answer is somewhat correct but lacks precision. It is not well-explained.
  • 6 marks: The answer is accurate, precise, and well-explained.
  • 8 marks: Besides being correct, precise, and well-explained, the answer offers a distinct insight or perspective.

Question 3

  • 0 marks: No answer was provided, or the response is irrelevant to the question.
  • 8 marks: The answer is somewhat correct but lacks precision. It is not well-explained.
  • 12 marks: The answer is accurate, precise, and well-explained.
  • 15 marks: Besides being correct, precise, and well-explained, the answer offers a distinct insight or perspective.

Question 4

  • 0 marks: No answer was provided, or the response is irrelevant to the question.
  • 15 marks: The answer is somewhat correct but lacks precision. It is not well-explained.
  • 22 marks: The answer is accurate, precise, and well-explained.
  • 30 marks: Besides being correct, precise, and well-explained, the answer offers a distinct insight or perspective.

Part 2: Text mining (40 marks)

Here is a rough rubric of how we will grade your answers:

  • 0 marks: No answer was provided, or the response is irrelevant to the question.
  • 10 marks: The answer is somewhat correct but lacks precision. It is not well-explained.
  • 20 marks: The answer is accurate, precise, and well-explained.
  • 30 marks: Besides being correct, precise, and well-explained, the answer offers a distinct insight or perspective.

How to get help and collaborate with others

🙋 Getting help

You can post general clarifying questions on Slack.

For example, you can ask:

  • “Where do I find material that compares different clustering techniques?”
  • “I came across the term ‘loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”

You won’t be penalized for posting something on Slack that violates this principle without realizing it. Don’t worry; we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others; pretty much the same rules as above.

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT when doing this research and search online for help. If you use them, however minimally, you are asked to report which AI tool you used and add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.

References

Kim, Albert Y., and Adriana Escobedo-Land. 2015. “OkCupid Data for Introductory Statistics and Data Science Courses.” Journal of Statistics Education 23 (2): 5. https://doi.org/10.1080/10691898.2015.11889737.