πŸ“ Spring Term Summative

2023/24 Winter Term

Author

Dr. Jon Cardoso-Silva

πŸ’‘ NOTE: This time, you are not asked to write code as part of the assignment. If you choose to do so, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

⏲️ Due Date: Thursday, April 25th, 5pm

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to πŸ“§

βš–οΈ Assignment Weight:

This assignment is worth 30% of your final grade in this course.


Do you know your CANDIDATE NUMBER? You will need it.

β€œYour candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

πŸ“ Instructions

πŸ‘‰ Read these instructions carefully, as some details might change from one assignment to another.

  1. Go to our Slack workspace’s #announcements channel to find a GitHub Classroom link entitled πŸ“ W11 Summative. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub, and then click on the green Accept this assignment button.

  3. You will be redirected to a new private repository created just for you, named ds202w-2024-w11-summative-yourusername, where yourusername is your GitHub username. It will contain a README.md file with a copy of these instructions.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER.

    For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

    ---
    title: "DS202W - W11 Summative"
    author: <CANDIDATE_NUMBER>
    output: html
    self-contained: true
    ---

    Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

    ---
    title: "DS202W - W11 Summative"
    author: 12345
    output: html
    self-contained: true
    ---
  7. Fill out the .qmd file with your answers. This time, you are not required to write code. Still, you should provide a nicely formatted notebook.

    • Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
  8. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

    • If you added any code, ensure your .qmd code is reproducible. If we were to restart R and RStudio and run your notebook, it should run without errors, and we should get the same results as you did.

    • If you choose to add code, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

  9. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

  10. Read the section How to get help and collaborate with others at the end of this document.

β€œWhat do I submit?”

You will submit two files:

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file render of the Quarto markdown file.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

πŸ—„οΈ The Data

For this assignment, we will use data from Reddit, a social media platform that resembles a vast forum. Reddit contains various communities, called subreddits, where users engage in discussions and share content, often anonymously. Reddit uses upvotes and downvotes to rank content. This mechanism shapes the visibility of posts and comments, making it a crucial part of the platform’s culture.

Selected Rankings

In this task, we will focus on data from two Reddit rankings:

  • The top ranking features the most upvoted posts.
  • The controversial ranking highlights posts that attracted a large number of both upvotes and downvotes.

The data provided contains the top 1000 posts from the top ranking and the top 1000 from the controversial ranking, both covering the past year (mid-March 2023 to mid-March 2024).

Data Overview

Below is a brief description of the data you will be working with. The data is separated into two files: one containing the posts and another containing the comments.

Note: We do not recommend storing the data in your GitHub repository, as it may be too large for version control. Consider using a .gitignore file to exclude the data from your repository.
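A minimal .gitignore sketch for this (the patterns are assumptions; adjust them to match the actual names and location of the data files on your machine):

```
# Keep the Reddit data files out of version control
# (assumed patterns -- adjust to your filenames)
*.csv
data/
```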

Reddit Posts

A CSV file includes the top 1000 posts from the top ranking and the top 1000 from the controversial ranking over the past year (mid-March 2023 to mid-March 2024). The file contains columns such as:

  • ranking_type: The ranking the post comes from (top or controversial).
  • post_id: The unique identifier of the post.
  • title: The title of the post.
  • permalink: The URL of the post.
  • post_hint: The type of content the post contains (e.g., image, link, self).
  • url: The URL of the content the post contains.
  • created_utc: The post’s creation time (in Unix time).
  • selftext: The text of the post (if any).
  • ups: The number of upvotes the post received.
  • upvote_ratio: The proportion of all votes on the post that were upvotes.
  • score: The post’s score (upvotes minus downvotes).
  • subreddit: The subreddit from which the post comes.
  • subreddit_subscribers: The number of subscribers to the subreddit from which the post comes.
  • over_18: Whether the post is marked as NSFW (Not Safe For Work).
  • num_comments: The number of comments the post received.
  • is_original_content: Whether the post is original content.
  • author: The username of the author of the post.
  • edited: Whether the post was edited after being created.

Reddit Comments

A CSV file containing the top-level comments (if any) found on the posts in the file above. The columns in this file are:

  • post_id: The unique identifier of the post.
  • id: The unique identifier of the comment.
  • permalink: The URL of the comment.
  • author: The username of the author of the comment.
  • created_utc: When the comment was created (in Unix time).
  • body: The text of the comment.
  • edited: Whether the comment was edited after being created.
  • gilded: Whether the comment was gilded (i.e., received a reward from another user).
  • ups: The number of upvotes the comment received.
  • num_reports: The number of times other users reported the comment to the moderators.
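In both files, created_utc is Unix time, i.e., seconds since 1970-01-01 00:00:00 UTC. A minimal R sketch of the conversion, using an arbitrary sample value:

```r
# created_utc is stored as Unix time (seconds since the epoch, UTC)
created_utc <- 1700000000  # arbitrary sample value for illustration

# Convert to a human-readable UTC date-time
created_at <- as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC")
format(created_at, "%Y-%m-%d %H:%M:%S")  # "2023-11-14 22:13:20"
```

The same call works on a whole column, e.g. inside a dplyr mutate().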

πŸ“‹ Your Tasks

What do we need from you?

Context

While we provide data, we will not specify the insights we seek in some questions. Instead, we will task you with proposing your approach to the data. This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

πŸ’‘ Remember: if you decide to write R code, please ensure your code confirms, reinforces, or complements your answers and that it aligns with the style of code we practiced throughout the course. Adding code just for the sake of it will not help you get a higher grade.

Part 1: Supervised Learning (30 marks)

Suppose we want to create a model that, given a post, can predict whether it belongs to the top or controversial ranking based on its content and the comments it received, irrespective of when it was posted.

  • How would you create the dataset for this task?
  • Which technique(s) from the course would you use to address this research question?
  • And how would you interpret the results?

Part 2: Similarity (30 marks)

Suppose we want to calculate the similarity between the posts we have in our dataset based on the combination of the content of the posts and the comments they received.

How would you approach this task?

Part 3: Unsupervised Learning (40 marks)

Now, propose one compelling research question that a social scientist could investigate with this dataset using the unsupervised learning methods covered in this course, and explain how you would answer it.

βœ”οΈ How we will grade your work

In line with typical LSE expectations, a score of 70% or above is considered a distinction. This means that you can expect to score around 70% if you provide adequate answers to all questions, in line with the learning outcomes of the course and the instructions provided.

Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score. You need to add unique insights or analyses to get a distinction - we cannot tell you what these are, but these should be things that make us go, β€œwow, that’s a great idea! I hadn’t thought of that”.

Here is a rough rubric of how we will grade your answers. Note that the rigour of our marking varies with the expected difficulty of the question; this is reflected in the marking rubric below.

Part 1: Supervised Learning (30 marks)

  • >22 marks: Besides being correct, precise, greatly formatted, and well-explained, you provided code that confirms, reinforces, or complements your answers. The code aligns with the style of code we practiced throughout the course; your plots are beyond the basic ones, and you provided a compelling interpretation of the results.

  • 22 marks: Your response is well-structured and covers all the questions concisely. We see a well-justified process of transforming the original data into a suitable dataset for supervised learning (or maybe you demonstrated it directly with well-documented code). The choice of algorithm is logical, focusing on its application to the dataset rather than its technical details (we already know how the algorithms work). Your approach to interpreting the results directly addresses the research question, avoiding generic explanations. The markdown formatting is also excellent.

  • 17-21: The response is accurate and relevant. Clearly, you have thought of how the techniques apply to this dataset. However, there are minor slips in some areas. For instance, you might have missed a minor step in the process, some justifications are vague or unclear, or the formatting could be improved.

  • 13-16: Your response missed some crucial details from the instructions. The answer is somewhat correct but lacks precision. It is not well-explained. The response is somewhat generic and could apply to any dataset. The markdown formatting is somewhat poor.

  • 12 marks: A pass. We recognise a few keywords, and although the response is somewhat correct, it lacks precision. The answer is so generic that it could apply to any dataset. Your response resembles language copy-pasted from the web or an AI chatbot.

  • <12 marks: No answer was provided, or the response is very inadequate.

Part 2: Similarity (30 marks)

  • >22 marks: Your response, besides being accurate, well-explained, effectively formatted, and concise, positively surprised us. That might be because you devised a custom similarity measure logically justified for this dataset, or because you actually coded it and provided excellent plots and a compelling interpretation of the results. Either the mathematical formulation you provided or the plots you created offered deep insights into the dynamics of the dataset.

  • 22 marks: Your response is well-structured and covers all the questions concisely. We see a well-justified process of calculating the similarity between the posts, or maybe you demonstrated it directly with well-documented code. You chose a similarity measure that is logical, focusing on its application to the dataset rather than its technical details (we already know how the distances work). Your approach to interpreting the results directly addresses the research question, avoiding generic explanations. The markdown formatting is also excellent.

  • 17-21: The response is accurate and relevant. Clearly, you have thought of how the similarity metrics apply to this particular dataset. However, there are minor slips in some areas. For instance, some justifications are vague or unclear, or the formatting could be improved.

  • 13-16: Your response missed some crucial details from the instructions. The answer is somewhat correct but lacks precision. It is not well-explained. The response is somewhat generic and could apply to any dataset. The markdown formatting is somewhat poor.

  • 12 marks: A pass. We recognise a few keywords, and although the response is somewhat correct, it lacks precision. The answer is so generic that it could apply to any dataset. Your response resembles language copy-pasted from the web or an AI chatbot.

  • <12 marks: No answer was provided, or the response is very inadequate.

Part 3: Unsupervised Learning (40 marks)

  • >29 marks: Besides being correct, precise, greatly formatted, and well-explained, you provided code that confirms, reinforces, or complements your answers. The code aligns with the style of code we practiced throughout the course; your plots are beyond the basic ones, and you provided a compelling interpretation of the results.

  • 29 marks: Your response is well-structured and covers all the questions concisely. We see a clear explanation of how the question you’re asking fits an unsupervised learning model rather than a supervised learning one and what you would do differently process-wise from what you did in Part 1 (Supervised Learning) (maybe you even included snippets of code to showcase the difference). The choice of algorithm is logical, focusing on its application to the dataset rather than its technical details (we already know how the algorithms work). Your approach to interpreting the results directly addresses the research question, avoiding generic explanations. The markdown formatting is also excellent.

  • 23-28: The response is accurate and relevant. Clearly, you know in which cases you should apply unsupervised learning techniques to this dataset and have come up with interesting questions to match. However, there are minor slips in some areas. For instance, you might have missed a minor step in your explanation of the process, some justifications are vague or unclear, or the formatting could be improved.

  • 17-22: Your response missed some crucial details from the instructions. The answer is somewhat correct but lacks precision. It is not well-explained. The response is somewhat generic and could apply to any dataset. The markdown formatting is somewhat poor.

  • 16 marks: A pass. The response is somewhat correct but it lacks precision. The answer is so generic that it could apply to any dataset and doesn’t offer any real insights. Your response resembles language copy-pasted from the web or an AI chatbot.

  • <16 marks: No answer was provided, or the response is very inadequate or irrelevant to the question.

How to get help and collaborate with others

πŸ™‹ Getting help

You can post general clarifying questions on Slack.

For example, you can ask:

  • β€œWhere do I find material that compares different clustering techniques?”
  • β€œI came across the term β€˜loadings’ when reading about PCA in the textbook, but I don’t fully understand it. Does anyone have a good alternative resource about it?”

You won’t be penalised for unknowingly posting something on Slack that violates this principle. Don’t worry; we will delete your message and let you know.

πŸ‘― Collaborating with others

You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you cannot share or copy code from others β€” pretty much the same rules as above.

πŸ€– Using AI help?

You can use Generative AI tools such as ChatGPT when doing this research, and you can search online for help. If you use such a tool, however minimally, you are asked to report which AI tool you used and to add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate inefficient or outdated code that does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see πŸ€– Our Generative AI policy.