πŸ“ IRDAP Exam

2023/24 Summer Term

Author

Dr. Ghita Berrada

Published

19 Aug 2024

⏲️ Due Date: Tuesday, August 20th 2024 at 11:59:59 am (midday London time)

If you update your files on GitHub after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

⚖️ Assignment Weight:

This assignment is worth 40% of your final grade in this course.


Do you know your CANDIDATE NUMBER? You will need it.

“Your candidate number is a unique five digit number that ensures that your work is marked anonymously. It is different to your student number and will change every year. Candidate numbers can be accessed using LSE for You.”

Source: LSE

πŸ“ Instructions

πŸ‘‰ Read it carefully, as some details might change from one assignment to another.

  1. Go to our (i.e the DS202W’s) Slack workspace’s #irdap channel to find a GitHub Classroom link entitled πŸ“ IRDAP Exam. Do not share this link with anyone outside this course!

  2. Click on the link, sign in to GitHub, and then click on the green Accept this assignment button.

  3. You will be redirected to a new private repository created just for you. The repository will be named ds202a_w-2024-irdap-exam-yourusername, where yourusername is your GitHub username. The repository will be private and will contain a README.md file with a copy of these instructions.

  4. Recall your LSE CANDIDATE NUMBER. You will need it in the next step.

  5. Create a <CANDIDATE_NUMBER>.qmd file with your answers, replacing the text <CANDIDATE_NUMBER> with your actual LSE candidate number.

    For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  6. Then, replace whatever is between the --- lines at the top of your newly created .qmd file with the following:

    ---
    title: "DS202A/W - IRDAP Exam"
    author: <CANDIDATE_NUMBER>
    output: html
    self-contained: true
    ---

    Once again, replace the text <CANDIDATE_NUMBER> with your actual LSE CANDIDATE NUMBER. For example, if your candidate number is 12345, then your .qmd file should start with:

    ---
    title: "DS202A/W - IRDAP Exam"
    author: 12345
    output: html
    self-contained: true
    ---
  7. Fill out the .qmd file with your answers. Make sure you provide a nicely formatted notebook.

    • Use headers and code chunks to keep your work organised. This will make it easier for us to grade your work. Learn more about the basics of markdown formatting here.
  8. Once done, click on the Render button at the top of the .qmd file. This will create an .html file with the same name as your .qmd file. For example, if your .qmd file is named 12345.qmd, then the .html file will be named 12345.html.

    • If you added any code, ensure your .qmd code is reproducible. If we were to restart R and RStudio and run your notebook, it should run without errors, and we should get the same results as you did.

    • If you choose to add code, please ensure your code confirms, reinforces, or complements your answers. Adding code just for the sake of it will not help you get a higher grade.

  9. Push both files to your GitHub repository. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

  10. Read the section How to get help and collaborate with others at the end of this document.

“What do I submit?”

You will submit two files:

  • A Quarto markdown file with the following naming convention: <CANDIDATE_NUMBER>.qmd, where <CANDIDATE_NUMBER> is your candidate number. For example, if your candidate number is 12345, then your file should be named 12345.qmd.

  • An HTML file render of the Quarto markdown file.

You don’t need to click to submit anything. Your assignment will be automatically submitted when you commit AND push your changes to GitHub. You can push your changes as many times as you want before the deadline. We will only grade the last version of your assignment. Not sure how to use Git on your computer? You can always add the files via the GitHub web interface.

🗄️ Get the data
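If you prefer the command line to the GitHub web interface, the commit-and-push step might look like the sketch below. It is only an illustration: `submit_exam` is a hypothetical helper name, the commit message is arbitrary, and the function assumes you run it from the root of your exam repository, with a remote called `origin` pointing at GitHub (the usual default after `git clone`).

```bash
# Hypothetical helper: stage, commit, and push the two exam files.
# ASSUMPTIONS: run from the repository root; the remote is named 'origin'.
submit_exam() {
  local candidate="$1"
  local qmd="${candidate}.qmd"
  local html="${candidate}.html"

  # Refuse to continue if either required file is missing.
  for f in "$qmd" "$html"; do
    if [ ! -f "$f" ]; then
      echo "Missing file: $f" >&2
      return 1
    fi
  done

  # Stage only the two files that make up the submission.
  git add "$qmd" "$html" || return 1

  # Commit only when something is actually staged.
  if git diff --cached --quiet; then
    echo "Nothing new to commit."
  else
    git commit -m "Submit IRDAP exam answers" || return 1
  fi

  # Push only if a remote named 'origin' is configured.
  if git remote get-url origin >/dev/null 2>&1; then
    git push origin HEAD
  else
    echo "No 'origin' remote configured; nothing was pushed." >&2
  fi
}
```

With a candidate number of 12345, you would call `submit_exam 12345` after rendering. Checking the repository page on GitHub afterwards is still the safest way to confirm that both files arrived.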

What data will you be using?

You will be using two distinct datasets for this exam.

Part 1

Your dataset comes from the UCI Machine Learning Repository, where it was made public. It was created to better understand absenteeism at work. It consists of absence records from July 2007 to July 2010 at a courier company in Brazil and contains data about the company's employees (e.g. demographic features, and the reasons for and length of each absence). To learn more about the dataset and the context of the absenteeism prediction problem, you could take a look at (Ferreira et al. 2018) (accessible here).

Preparation

  1. Download the data by clicking on the button below.

You’ll need to unzip the archive to access the dataset (and its documentation).

Part 2

In this part, you’ll be using the LIAR dataset, made publicly available here. It is a dataset for fake news detection that contains 12.8 thousand short, decade-old statements made in various contexts and collected from politifact.com. Each statement was labeled by hand for honesty, topic, context/place, speaker, status, party, and past date. Statements in the dataset are labeled as true, mostly-true, half-true, barely-true, false or pants-fire (depending on their degree of veracity). The dataset comes in the form of a zip archive containing 4 files:

  • a README file that constitutes the dataset documentation
  • a file called train.tsv which is the training set
  • a file called test.tsv which is the test set
  • and a file called valid.tsv which is the validation set
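Before loading the files into R, a quick command-line check of each split can confirm that the archive was unzipped correctly. The sketch below tallies the labels in one file. `count_labels` is a hypothetical helper name, and the sketch assumes that the label sits in the second tab-separated column and that the files have no header row; check both against the README before trusting the counts.

```bash
# Hypothetical helper: tally how many statements carry each label in one split.
# ASSUMPTIONS: the label is the 2nd tab-separated column and there is no
# header row; confirm both against the dataset README.
count_labels() {
  local tsv="$1"
  if [ ! -f "$tsv" ]; then
    echo "File not found: $tsv" >&2
    return 1
  fi
  # Extract column 2, then count each distinct value, most frequent first.
  cut -f2 "$tsv" | sort | uniq -c | sort -rn
}
```

For instance, `count_labels train.tsv` prints one line per label, each preceded by its count. Running it on all three files gives a first impression of how the labels are spread across the splits.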

Preparation

Click on the button below to download the dataset:

📋 Your Tasks

What do we actually want from you?

Context

While we provide data, we will not specify the insights we seek in some questions. Instead, we will task you with proposing your approach to the data/question. This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

💡 Remember: if you decide to write R code, please ensure your code confirms, reinforces, or complements your answers and that it aligns with the style of code we practiced throughout the course. Adding code just for the sake of it will not help you get a higher grade.

Always focus on the quality of your explanations/justification of your modeling choices over the quantity of code: code by itself is never enough!

Part 1: Gaining insights about absenteeism at work… (55 marks) 🤔

All the questions in this part relate to the Absenteeism at Work Dataset

Question 1 (15 marks)

Your task is to predict absenteeism at work using the Absenteeism at Work dataset provided. The dataset does not include a ready-made target variable for a classification model.

Q1: How would you construct a target variable for a classification model? Justify your choices. Do your choices have any downsides?

Question 2 (15 marks)

Q2: Can you build a baseline model to predict absenteeism at work? How would you evaluate your model? Explain each of your choices.

Question 3 (10 marks)

Q3: How would you improve your model from question 2? Explain your reasoning.

Question 4 (15 marks)

Consider the original dataset again (i.e. prior to the creation of the target variable). Can we gain other insights on absenteeism at work from the dataset as is?

Part 2: Fake news… (45 marks) ⭐

Your dataset, in this part, is the LIAR dataset

Question 5 (45 marks)

Q5: Given the dataset, the main question you aim to answer here is: “What differentiates fake statements from true ones?”. Propose at least one plausible approach to this question that makes use of the dataset.


Finally:

Well done on getting this far! 🎉

Q: How do you plan on rewarding yourself (as you should!) after completing this exam?

How to get help and collaborate with others

🙋 What if I am confused?

  • This is a test. Certain questions are intentionally open-ended and somewhat vague. Part of the assignment involves deciphering what we want from you.

  • We will assess your ability to identify which of the concepts you learned in class are relevant to a given problem at hand and how to apply them to solve it. Strive to achieve that neat balance of conciseness and completeness.

  • However, if you feel that a question is too ambiguous, please send an e-mail or a private Slack message to . If we deem your question valid, we will post a clarification on the public Slack channel. If you don’t get a response, assume that the question is not ambiguous and you should proceed with your best judgement.

👯 Can I get help from others?

⚠️ For this assignment, we expect you to work independently and refrain from discussing it with your peers.

⚠️ This task presents a significant challenge, and your grade will greatly benefit from your personal creativity and originality. It is crucial that your responses are distinct and unique compared to your classmates’ work.

  • Nonetheless, you are allowed to use the internet, refer to course materials, and even use Generative AI tools when working on your answers. Again, aim for originality: do not let these resources dictate your answers.

  • Consult 🤖 Our Generative AI policy to see examples of how to report the use of Generative AI tools in your work.

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT when doing this research and search online for help. If you use such a tool, however minimally, you are asked to report which tool you used and to add an extra section to your notebook explaining how much you used it.

Note that while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. Another problem is that they tend to create formulaic and repetitive responses, thus limiting your chances of getting a high mark. When it comes to coding, these tools tend to generate code that is inefficient or outdated and does not follow the principles we teach in this course.

To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.

References

Ferreira, Ricardo Pinto, Andréa Martiniano, Domingos Napolitano, Edquel Bueno Prado Farias, and Renato José Sassi. 2018. “Artificial Neural Network and Their Application in the Prediction of Absenteeism at Work.” International Journal of Recent Scientific Research 9 (1): 2332–34. http://dx.doi.org/10.24327/ijrsr.2018.0901.1447.