✍️ Coursework (Formative)

Published: 06 March 2023

🎯 OBJECTIVE: Write a 1,500-word essay using Quarto Markdown and Zotero. (See 💻 Week 08 lab)

DUE DATE: 16 March 2023.

🗺️ Context

Data science draws on several academic disciplines and industry practices. It combines mathematics, statistics, computer science and, importantly, some domain knowledge. Data scientists often have to consult scholarly papers related to the datasets at hand to inform their analysis.

Scenario

Consider the following scenario: You are a data scientist working at a think tank. You are working as part of a team on a project evaluating the uses of personal data in different contexts.

You have been tasked with providing the team with insights about three aspects:

  1. the stated use of personal data in a specific context;

  2. the range of techniques that would be applied to the personal data in this specific context and with what goals in mind;

  3. any potential issues you can identify in the use of personal data in this way.

Different members of your team have been assigned different cases where personal data is being collected and analysed to generate insights.

Your task

The case you have been assigned to is “The use of supervised and unsupervised learning techniques in the detection of credit card fraud.”

Specifically, you have been asked to consider the three elements above about some of the techniques put forward in this paper:

Carcillo, F., Le Borgne, Y.-A., Caelen, O., Kessaci, Y., Oblé, F., & Bontempi, G. (2019). Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences, 557, 317–331. – (Carcillo et al. 2019)

In addition to this key resource, you have also dug out an old report from the World Economic Forum that discusses ‘personal data’ as a new ‘asset class’. The report covers different types of personal data, classification and valuation processes, and broader issues of potential concern with the use of personal data:

World Economic Forum. (2011). Personal Data: The Emergence of a New Asset Class. – (World Economic Forum 2011)

📝 Instructions

What we are looking for

Now here is what you need to do:

Read Carcillo et al.’s (2019) article about supervised and unsupervised learning techniques. Use the WEF report about personal data and the concepts and techniques you’ve learned from DS101L to consider the following questions about the article:

Data collection:

  • The WEF report identifies different types of personal data according to whether the data was collected with or without the individual’s knowledge. Note some of these types, then consider where the data that is the subject of your case falls, based on evidence from the journal article.
  • What are some of the issues that might arise from data collected in this way?
  • Are there any other considerations that might override these concerns?

Data classification, data types and data cleaning:

  • Note the different types of personal data identified in the WEF report in terms of the actual information collected. Compare these with the types of data identified in the article that discusses your case. Consider questions of data type, structured and unstructured data.
  • You might want to try to identify the different labels and variables that have been assigned to these types of data in the article. For example, is all the data numeric/quantitative?
  • What ‘personal data’ is collected about the cardholders?
  • What are ‘global’ and ‘local’ approaches in terms of ‘granularity’?
  • What are the pros and cons associated with each?
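
To make the ‘global’ versus ‘local’ distinction more concrete, here is a minimal Python sketch. It is not the method from the article: it assumes a simple z-score of the transaction amount as the outlier score, and the card IDs and amounts are made up for illustration. The ‘global’ score compares a transaction with all transactions; the ‘local’ score compares it only with the same cardholder’s history.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy data: (card_id, amount). Not taken from the article.
transactions = [
    ("A", 10.0), ("A", 12.0), ("A", 11.0), ("A", 60.0),
    ("B", 480.0), ("B", 500.0), ("B", 520.0), ("B", 510.0),
]


def z_scores(values):
    """Return how many standard deviations each value is from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def global_scores(rows):
    """Global granularity: score each amount against ALL transactions."""
    return z_scores([amount for _, amount in rows])


def local_scores(rows):
    """Local granularity: score each amount against the same card's history."""
    by_card = defaultdict(list)
    for idx, (card, amount) in enumerate(rows):
        by_card[card].append((idx, amount))
    scores = [0.0] * len(rows)
    for items in by_card.values():
        card_z = z_scores([amount for _, amount in items])
        for (idx, _), z in zip(items, card_z):
            scores[idx] = z
    return scores


if __name__ == "__main__":
    g = global_scores(transactions)
    l = local_scores(transactions)
    for (card, amount), gz, lz in zip(transactions, g, l):
        print(f"card={card} amount={amount:7.2f} global_z={gz:+.2f} local_z={lz:+.2f}")
```

On this toy data, the 60.00 purchase on card A looks unremarkable globally because card B’s much larger amounts dominate the overall picture, but it stands out locally against card A’s usual spending. As a rough rule of thumb (an assumption to test against the article, not a finding from it), a local view captures individual habits but needs enough history per card, while a global view has more data behind it but ignores personal context.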

Data analysis:

This is the final part of your report back to the wider team.

  • You are to provide an outline of the two techniques used in the article: supervised and unsupervised learning.
  • You don’t need to describe all the computations and equations in detail but the team is looking to understand a bit about the ways that each approach treats the data and with what outcomes. E.g. How is suspicious activity identified from the data in each approach? What different variables have been identified for observation?
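
If it helps to see the difference between the two families of techniques in code, the following is a minimal Python sketch under stated assumptions. The data, the feature names (amount, hour of day) and the two toy methods (a nearest-centroid classifier and a z-score outlier score) are all hypothetical stand-ins chosen for brevity; they are not the models or variables used in the article. The point is only that the supervised step needs the fraud labels, while the unsupervised step never looks at them.

```python
from statistics import mean, pstdev

# Hypothetical labelled history: (amount, hour_of_day, is_fraud).
history = [
    (12.0, 10, 0), (25.0, 13, 0), (18.0, 18, 0), (30.0, 12, 0),
    (400.0, 3, 1), (380.0, 2, 1),
]


def fit_centroids(rows):
    """Supervised step: use the is_fraud labels to average each class."""
    centroids = {}
    for label in (0, 1):
        members = [(a, h) for a, h, y in rows if y == label]
        centroids[label] = (
            mean(a for a, _ in members),
            mean(h for _, h in members),
        )
    return centroids


def predict(centroids, amount, hour):
    """Assign the label of the closest class centroid (squared distance)."""
    def dist(c):
        return (amount - c[0]) ** 2 + (hour - c[1]) ** 2
    return min(centroids, key=lambda label: dist(centroids[label]))


def outlier_score(rows, amount):
    """Unsupervised step: ignore labels, measure distance from the typical amount."""
    amounts = [a for a, _, _ in rows]
    return abs(amount - mean(amounts)) / pstdev(amounts)


if __name__ == "__main__":
    centroids = fit_centroids(history)
    for amount, hour in [(390.0, 4), (20.0, 11)]:
        label = predict(centroids, amount, hour)
        score = outlier_score(history, amount)
        print(f"amount={amount} hour={hour} predicted_fraud={label} outlier_score={score:.2f}")
```

In your essay, you would describe in plain words what the article’s own models take as input and what they produce; the sketch is only a way to check your intuition about labelled versus unlabelled learning.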

Structure of the Essay

  1. Your essay must be written in Quarto Markdown. You can use the template provided in the lab. Also check the readings listed under Weeks 7 & 8 of the 📔 Syllabus.

  2. Feel free to modify the layout and aesthetics of the template. You can also add images, tables, bullet points, etc. to your essay.

  3. On top of the two main references, you must also include at least 5 other references. You must cite these references in your markdown using Zotero (revisit 💻 Week 08 lab).

    • Any ideas, arguments or results that are not your own must be cited in the references.

  4. Do not exceed 1,500 words (bibliographical references and the (optional) generative AI section do not count).

  5. Write clearly; do not hide your thoughts behind jargon. You are not writing an academic article. Your essay is emulating a communication you would send to work colleagues who have very different educational backgrounds.

  6. Do not plagiarise. It is not that difficult to spot that someone copied content from other sources and, frankly, it is very embarrassing if you get caught. Here is the link to the LSE regulation on plagiarism.

    • You are allowed to use Generative AI to help you write your essay, but you must report the AI tool you used and the extent to which you used it. Read more about Generative AI in the section below.

  7. Make sure you address all the questions.

🤖 Using AI help?

You are allowed to use Generative AI tools such as ChatGPT to help you write your essay. If you do use one, however minimally, you are asked to report the AI tool you used and add an extra section to your essay to explain the extent to which you used it (this won’t count towards the word limit).

Note that, while these tools can be helpful, they tend to generate responses that sound convincing but are not necessarily correct. They also tend to generate responses that are formulaic and repetitive, which limits your chances of getting a high mark.

In effect, you are asked to explain the following:

  • What AI tool did you use?
  • How did you use it? For example, did you use it to generate ideas, write a draft, proofread your essay, etc.?
  • How much of your essay was written by the AI tool? For example, did you feed it the entire prompt and it wrote the entire essay? Or did you feed it guided questions?
  • If you didn’t edit the AI tool’s output, what was the output like? For example, did it produce a coherent essay?
  • What did you do to make sure that the AI tool did not produce gibberish and that the essay was not formulaic?
  • Importantly, how did you ensure that the essay did not contain any plagiarism?

✅ Submission

  • Render your Quarto Markdown file to HTML
  • ⚠️ IMPORTANT ⚠️: Rename your HTML to DS101L-2023-formative-essay-<CANDIDATE_NUMBER>.html, replacing <CANDIDATE_NUMBER> with your candidate number. For example: DS101L-2023-formative-essay-123456.html
  • Upload this file to Moodle under the appropriate assignment.
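
The renaming step is easy to get wrong by hand, so here is a minimal Python sketch that builds the required file name and copies the rendered HTML under it. It assumes you have already rendered your essay (for example with `quarto render essay.qmd --to html`, where `essay.qmd` is a hypothetical file name) and that candidate numbers contain digits only, as in the example above. The helper names are made up for illustration.

```python
import shutil
from pathlib import Path


def submission_filename(candidate_number: str) -> str:
    """Build the file name required by the submission instructions."""
    candidate_number = candidate_number.strip()
    # Assumption: candidate numbers are digits only, as in the example 123456.
    if not candidate_number.isdigit():
        raise ValueError(f"Unexpected candidate number: {candidate_number!r}")
    return f"DS101L-2023-formative-essay-{candidate_number}.html"


def copy_for_submission(rendered_html: str, candidate_number: str) -> Path:
    """Copy the rendered HTML next to the original, under the required name."""
    src = Path(rendered_html)
    dst = src.with_name(submission_filename(candidate_number))
    shutil.copyfile(src, dst)
    return dst


if __name__ == "__main__":
    # "essay.html" is a hypothetical name for your rendered file.
    print(submission_filename("123456"))
```

Calling `copy_for_submission("essay.html", "123456")` would leave your original file untouched and create the renamed copy alongside it, ready to upload.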

✋ Getting Help

  • If you have any questions about the assignment, please post them in the #assignments channel on Slack.
  • Book office hours.
  • Organise a study group with your classmates.

📑 Marking Scheme

(You will be graded as if this were a summative assessment.)

80% or above

  • Shows a strong element of independent critical analysis of social or controversial issues
  • Not formulaic
  • Good structure to essay and interesting to read
  • Sophisticated arguments demonstrating critical engagement with the subject
  • Demonstrates very good knowledge and familiarity with the data science concepts related to the essay prompt
  • Good use of markdown and formatting
  • Good use of Zotero references

60% to 79%

  • A structured essay that is well-presented and concluded
  • Logical paragraphing
  • Ability to present arguments critically and assess alternate views
  • Substantive engagement with issues or problems raised
  • Some original elements
  • Good use of Zotero references
  • Good use of markdown and formatting

40% to 59%

  • Some understanding of the subject matter but poor presentation
  • Simplistic arguments
  • Some inaccuracy or irrelevance
  • Inadequate use of Zotero references
  • Inadequate use of markdown and formatting

20% to 39%

  • Very weak answers
  • Poorly written and lacking relevance, accuracy or substance
  • Barely attempted question

References

Carcillo, Fabrizio, Yann-Aël Le Borgne, Olivier Caelen, Yacine Kessaci, Frédéric Oblé, and Gianluca Bontempi. 2019. “Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection.” Information Sciences 557: 317–31. https://doi.org/10.1016/j.ins.2019.05.042.
World Economic Forum. 2011. “Personal Data: The Emergence of a New Asset Class.” World Economic Forum. https://www3.weforum.org/docs/WEF_ITTC_PersonalDataNewAsset_Report_2011.pdf.