🗓️ Week 10
Case study: Use of LLMs in sensitive contexts

DS101 – Fundamentals of Data Science

Dr. Ghita Berrada

LSE Data Science Institute

02 Dec 2024

Recap from last week: What is unstructured data

Unstructured data is any type of data that lacks pre-defined formats and organization.
Structured data is collected following a known method and stored with a known schema (e.g., a table format); unstructured data is everything else.
Unstructured data includes:
- Audio
- Video
- Text content, often unformatted for structured data collection, including text messages, emails, social media, business documents and news
Unstructured data is reusable:
- one piece of unstructured data can be examined in different ways
- the person collecting the data does not have to define ahead of time what information needs to be collected from it
Because unstructured data can serve multiple purposes, including ones not foreseen when it was collected, it can raise privacy concerns for the user.

⏭️ Recap from last week’s Project Gutenberg case study

Tokenization:

Tokenization is the process of splitting text into smaller units called tokens (most often words or sentences). In our case study, we split our text into words.
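As a rough illustration of the idea, the sketch below splits a short sample string into word tokens and sentence tokens using only regular expressions. The function names and the sample sentence are assumptions made for this example, not the code used in the case study; dedicated NLP libraries handle many edge cases (abbreviations, hyphenation, non-English text) that this sketch ignores.

```python
import re


def tokenize_words(text):
    """Lowercase the text and return runs of letters/apostrophes as word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tokenize_sentences(text):
    """Split the text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical sample text, chosen only for illustration.
sample = "Call me Ishmael. Some years ago, I went to sea."

print(tokenize_words(sample))
print(tokenize_sentences(sample))
```

Word-level tokens are usually the starting point for counting and vectorizing; sentence-level tokens are more useful when the unit of analysis is a statement rather than a term.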

Lemmatization:

The process of lemmatization groups together all inflections/variants of a word into a single form to more easily analyze each word. For instance, if you lemmatize “program”, “programming”, and “programmed” they would all be converted to “program”. This makes certain types of analysis much easier.
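To make the "program" example concrete, here is a deliberately tiny lookup-based sketch. The `LEMMA_TABLE` dictionary is a toy assumption covering only a few word forms; real lemmatizers rely on a full dictionary and usually on part-of-speech information as well.

```python
from collections import Counter

# Toy lookup table (an assumption for illustration only).
LEMMA_TABLE = {
    "programming": "program",
    "programmed": "program",
    "programs": "program",
}


def lemmatize(tokens):
    """Map each token to its lemma; tokens not in the table are left unchanged."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


tokens = ["program", "programming", "programmed"]
lemmas = lemmatize(tokens)

print(lemmas)
# Counting after lemmatization groups all variants under one entry.
print(Counter(lemmas))
```

After lemmatization the three variants are counted as one word, which is why frequency-based analyses become easier.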

Stopwords:

Stopwords are words that do not add much meaning to a sentence. They can usually be ignored without sacrificing the meaning of the sentence. They tend to be the most common words in the language (e.g., in English, “a”, “the”, “is”, “are”, etc.) but can also be task-specific (e.g., the word “immigration” in a corpus entirely about the topic of immigration).
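A minimal sketch of stopword removal, assuming the text has already been tokenized into lowercase words. The `STOPWORDS` set below contains only the four English examples mentioned above, and the `extra_stopwords` parameter is a hypothetical way of adding task-specific stopwords; NLP libraries ship much longer lists.

```python
# Only the four example stopwords from the text; real lists are much longer.
STOPWORDS = {"a", "the", "is", "are"}


def remove_stopwords(tokens, extra_stopwords=()):
    """Drop general stopwords plus any task-specific ones passed by the caller."""
    blocked = STOPWORDS | set(extra_stopwords)
    return [token for token in tokens if token not in blocked]


# Hypothetical token list for illustration.
tokens = ["the", "immigration", "bill", "is", "a", "priority"]

general = remove_stopwords(tokens)
task_specific = remove_stopwords(tokens, extra_stopwords={"immigration"})

print(general)
print(task_specific)
```

The second call shows the task-specific case: in a corpus entirely about immigration, the word itself carries little information and can be filtered out too.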

Analysing the news: a case study

We use the Python package newspaper3k to extract the most recent UK-related news articles from several newspapers:

The Guardian
BBC
The Sun
The Telegraph
The Independent
The Financial Times
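The sketch below shows one way article extraction with newspaper3k might look. It is not the code from the course material: the function name, the `limit` parameter and the example URL in the comment are assumptions, and the package has to be installed separately (`pip install newspaper3k`). Running it requires network access, and some sites may block or limit scraping.

```python
def fetch_recent_articles(source_url, limit=5):
    """Return up to `limit` articles from a news site as dicts with title, text and url."""
    # Imported inside the function so the module loads even without the package.
    import newspaper
    from newspaper.article import ArticleException

    # memoize_articles=False asks newspaper3k not to skip articles seen in earlier runs.
    paper = newspaper.build(source_url, memoize_articles=False)

    results = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except ArticleException:
            # Skip articles that could not be downloaded or parsed.
            continue
        results.append({"title": article.title, "text": article.text, "url": article.url})
    return results


# Example call (the URL is an assumption, not taken from the course material):
# articles = fetch_recent_articles("https://www.theguardian.com/uk-news")
```

The returned texts form a small corpus that can then be tokenized, cleaned and vectorized.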

Head to the course website to download the material needed to follow this second demo (news case study materials).

Introducing the concept of corpus

What is a corpus?

Simply put, a corpus is a collection of texts. It can be monolingual or multilingual.
The plural of corpus is corpora.
You can find some examples of publicly available (downloadable) corpora here.

The concept of Term frequency-inverse document frequency (TF-IDF)

TF-IDF is a text vectorization method that transforms text into a usable numeric vector, i.e., it assigns a number to each word in a text.
Each number is a function of the word (term) frequency and inverse document frequency:
- term frequency (TF) is how often a word appears in a document, divided by how many words there are. Term frequency measures how common a word is.
- inverse document frequency (IDF) measures how rare a word is across the corpus: \(IDF(t) = \log(\textrm{Total number of documents} / \textrm{Number of documents with term t in it})\). The base of the logarithm is a convention; the example uses base 10.

Example:

Consider a document containing 100 words where the word cooker appears 5 times. The term frequency (i.e., TF) for cooker is then (5 / 100) = 0.05.
Now, assume we have 10 million documents and the word cooker appears in one thousand of these. Then, the inverse document frequency (i.e., IDF), computed with a base-10 logarithm, is log10(10,000,000 / 1,000) = 4. (With the natural logarithm, the value would be about 9.21.)
Therefore, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
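The arithmetic of the cooker example can be checked with a few lines of Python. This is a sketch with hypothetical function names; it uses a base-10 logarithm because that is the base that reproduces the IDF value of 4 in the example.

```python
import math


def term_frequency(term_count, total_words):
    """Share of the document's words that are the given term."""
    return term_count / total_words


def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


tf = term_frequency(5, 100)                           # cooker: 5 of 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # 1,000 of 10 million documents
tfidf = tf * idf

print(round(tf, 2), round(idf, 2), round(tfidf, 2))
```

A word that appears in every document would get an IDF of log10(1) = 0, so its TF-IDF weight would be 0 however often it occurs.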

To learn more about TF-IDF, see this link or this one

Once vectorized, the text can be used as input to ML algorithms (supervised or unsupervised)!
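To show what "vectorized" means in practice, the sketch below turns a toy three-document corpus into TF-IDF vectors in plain Python. The corpus and the function name are assumptions for illustration, and the natural logarithm is used here; libraries such as scikit-learn offer a `TfidfVectorizer` that applies additional smoothing and normalization, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors): one TF-IDF weight per vocabulary word per document."""
    # Naive whitespace tokenization; assumes no document is empty.
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical toy corpus.
corpus = ["the cooker is hot", "the judge is fair", "the cooker is new"]
vocabulary, vectors = tfidf_vectors(corpus)

print(vocabulary)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

Each document becomes a list of numbers of the same length, which is the shape most ML algorithms expect. Words that occur in every document ("the", "is") receive a weight of 0, while words unique to one document receive the highest weights.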

Other Python libraries for NLP

This week’s main case study: the use of LLMs (e.g., ChatGPT) in sensitive contexts

How LLMs work

Before we start this case, let’s have a quick look at how LLMs work:

Today’s study material

Step 1 (~35-40 min)

Read through the following articles:

Luke Taylor (2023). “Colombian judge says he used ChatGPT in ruling”. The Guardian – (Taylor 2023)
Amy Guthrie (2023). “Colombian Judge Uses AI Tool ChatGPT in Court Ruling”. Law.com International – (Amy Guthrie 2023)
Benjamin Weiser (2023). “Here’s What Happens When Your Lawyer Uses ChatGPT”. The New York Times – (Weiser 2023) (You can also access this article through LSE Library Search)
Dan Milmo (2023). “Two US lawyers fined for submitting fake court citations from ChatGPT”. The Guardian – (Milmo 2023)
Simon Gibson (2023). “Cautioning the legal sector: AI is a tool, not a panacea”. Legal Futures – (Simon Gibson 2023)
Hibaq Farah (2023). “Court of appeal judge praises ‘jolly useful’ ChatGPT after asking it for legal summary”. The Guardian – (Hibaq Farah 2023)
Pranshu Verma and Will Oremus (2023). “How lawyers used ChatGPT and got in trouble”. The Washington Post – (Pranshu Verma and Will Oremus 2023)
Leyland Cecco (2024). “Canada lawyer under fire for submitting fake cases created by AI chatbot”. The Guardian – (Leyland Cecco 2024)
Josh Taylor (2023). “AMA calls for stronger AI regulations after doctors use ChatGPT to write medical notes”. The Guardian – (Josh Taylor 2023)
Josh Taylor (2024). “AI ban ordered after child protection worker used ChatGPT in Victorian court case”. The Guardian – (Josh Taylor 2024)

Within your respective tables, discuss your first impressions of the articles.

Guiding questions for today’s discussion

What are the articles about?
What do the cases have in common? What are the differences?
What do these cases tell you about the reliability of ChatGPT?
What risks are associated with its use?
Who is accountable/responsible for decisions made using ChatGPT?
How do we mitigate the risks posed by ChatGPT use in sensitive contexts?

References

Amy Guthrie. 2023. “Colombian Judge Uses AI Tool ChatGPT in Court Ruling.” Law.com International. https://www.law.com/international-edition/2023/02/08/colombian-judge-uses-ai-tool-chatgpt-in-court-ruling/?slreturn=20241202-41222.

Hibaq Farah. 2023. “Court of Appeal Judge Praises ‘Jolly Useful’ ChatGPT After Asking It for Legal Summary.” The Guardian. https://www.theguardian.com/technology/2023/sep/15/court-of-appeal-judge-praises-jolly-useful-chatgpt-after-asking-it-for-legal-summary.

Josh Taylor. 2023. “AMA Calls for Stronger AI Regulations After Doctors Use ChatGPT to Write Medical Notes.” The Guardian. https://www.theguardian.com/technology/2023/jul/27/chatgpt-health-industry-hospitals-ai-regulations-ama.

———. 2024. “AI Ban Ordered After Child Protection Worker Used ChatGPT in Victorian Court Case.” The Guardian. https://www.theguardian.com/australia-news/2024/sep/26/victoria-child-protection-chat-gpt-ban-ovic-report-ntwnfb.

Jurafsky, D., and J. H. Martin. 2023. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition draft. Prentice Hall Series in Artificial Intelligence. Pearson Prentice Hall. https://web.stanford.edu/~jurafsky/slp3/.

Leyland Cecco. 2024. “Canada Lawyer Under Fire for Submitting Fake Cases Created by AI Chatbot.” The Guardian. https://www.theguardian.com/world/2024/feb/29/canada-lawyer-chatgpt-fake-cases-ai.

Milmo, Dan. 2023. “Two US Lawyers Fined for Submitting Fake Court Citations from ChatGPT.” The Guardian (London). https://www.theguardian.com/technology/2023/jun/23/two-us-lawyers-fined-submitting-fake-court-citations-chatgpt.

Pranshu Verma, and Will Oremus. 2023. “How Lawyers Used ChatGPT and Got in Trouble.” The Washington Post. https://www.washingtonpost.com/technology/2023/11/16/chatgpt-lawyer-fired-ai/.

Simon Gibson. 2023. “Cautioning the Legal Sector: AI Is a Tool, Not a Panacea.” Legal Futures. https://www.legalfutures.co.uk/blog/cautioning-the-legal-sector-ai-is-a-tool-not-a-panacea.

Taylor, Luke. 2023. “Colombian Judge Says He Used ChatGPT in Ruling.” The Guardian. https://www.theguardian.com/technology/2023/feb/03/colombia-judge-chatgpt-ruling.

Weiser, Benjamin. 2023. “Here’s What Happens When Your Lawyer Uses ChatGPT.” The New York Times. https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html.
