DS101 – Fundamentals of Data Science
06 Oct 2025

Important
🥐 Important releases this week
Your first summative is now available!
Deadline: Week 04 class (October 21st)
📒 A dataset is a collection of data
But not just any collection…
Think of a library without organization:
Now imagine it organized:
A dataset is the same: Raw data becomes useful when properly structured
The reality check: ~80% of your time will be in data preparation!

UK MP Donations Database - A real-world example we’ll explore today
Single files or file collections
Think: spreadsheets, CSVs, JSON files, or a folder of logs.
Security: Depends on the device and file permissions; no automatic encryption.
Best for: small datasets, prototypes, one-off analyses.
Local databases
Think: SQLite, MySQL, PostgreSQL.
Security: Fully under your control — you manage access, backups, and encryption. Can be very secure if properly maintained.
Best for: structured, relational data, analytical workflows, multi-user local projects.
Cloud storage
Think: AWS S3, Google Cloud Storage, Azure Blob, BigQuery.
Security: Providers handle encryption, redundancy, and network security, but you rely on their policies and proper configuration (permissions, keys, access rules). Shared responsibility model applies.
Best for: massive datasets, team collaborations, projects needing scalable compute.
📒 Structured data fits neatly into tables with rows and columns
Key characteristics:
Note
Coming later: We’ll cover unstructured data (images, text, audio) in Week 09
Let’s examine the MP donations dataset:
| Date | Member | Entity | Entity Category | Value (in £) | Nature |
|---|---|---|---|---|---|
| 09/12/2024 | John Milne | Isabella Tree | Individual | 2,500.00 | cash |
| 09/12/2024 | Robert Jenrick | Colin Moynihan | Individual | 2,500.00 | cash |
| 09/12/2024 | Mr James Cleverly | IPGL (HOLDINGS) LIMITED | Company | 10,000.00 | cash |
| 04/12/2024 | Jeremy Corbyn | We Deserve Better | Unincorporated Association | 5,000.00 | cash |
| 09/11/2024 | Richard Baker | Community Union | Trade Union | 4,000.00 | cash |
Question: What data types do you see here?
Numeric data
String/Categorical data
Date/Time data
Different data types unlock different operations:
Numbers → compute averages, sums, ranges (e.g., sales totals)
Dates → sort chronologically, calculate durations (e.g., time between events)
Categories → represent a fixed set of labels (e.g., country, gender, product type)
Text → search, extract, or match patterns (e.g., find all “error” messages)
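A minimal sketch of the four cases above, using only Python's standard library. The sample values are hypothetical and only loosely modelled on the donations table; the variable names are illustrative.

```python
from collections import Counter
from datetime import date

# Hypothetical sample values, one list per data type.
values = [2500.00, 2500.00, 10000.00]                              # numbers
days = [date(2024, 12, 9), date(2024, 11, 9), date(2024, 12, 4)]   # dates
categories = ["Individual", "Individual", "Company"]               # categories
notes = ["cash donation", "error: missing entity", "cash donation"]  # free text

# Numbers: arithmetic works directly.
total = sum(values)
average = total / len(values)

# Dates: find the earliest and latest, then measure the duration between them.
earliest, latest = min(days), max(days)
span_in_days = (latest - earliest).days

# Categories: count how often each label occurs.
counts = Counter(categories)

# Text: search for a pattern.
errors = [note for note in notes if "error" in note]

# The same dates stored as DD/MM/YYYY text sort alphabetically,
# so 4 December lands before 9 November.
as_text = sorted(["09/12/2024", "09/11/2024", "04/12/2024"])

print(total, average, span_in_days, counts, errors, as_text)
```

The last line is the point of the exercise: the operations you get depend on the type the value is stored as, not on what it looks like.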
Get the type wrong, and the logic breaks:
"2,500" stored as text → can’t calculate totals
"15/01/2023" stored as text → won’t sort chronologically
"France" stored as text instead of category → wastes memory, harder to group
"London" as a number → average of cities? nonsense!
Efficiency matters too:
City (string)
| City |
|---|
| “London” |
| “Paris” |
| “London” |
As plain text:
City (categorical)
| City |
|---|
| London 🏙️ |
| Paris 🗼 |
| Berlin 🐻 |
As categorical:
0, 1, 0, 2
{0: "London", 1: "Paris", 2: "Berlin"}
The building blocks:
| Bits | Value |
|---|---|
| 0000 | 0 |
| 0001 | 1 |
| 0010 | 2 |
| 0011 | 3 |
| 0100 | 4 |
| 0101 | 5 |
| 0110 | 6 |
| 0111 | 7 |
| Bits | Value |
|---|---|
| 1000 | 8 |
| 1001 | 9 |
| 1010 | 10 |
| 1011 | 11 |
| 1100 | 12 |
| 1101 | 13 |
| 1110 | 14 |
| 1111 | 15 |
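The two tables above can be reproduced in a few lines. This is a small sketch; `to_bits` is a hypothetical helper name, not something from the course materials.

```python
def to_bits(value: int, width: int = 4) -> str:
    """Return value as a zero-padded binary string of the given width."""
    if not 0 <= value < 2 ** width:
        raise ValueError(f"{value} does not fit in {width} bits")
    return format(value, f"0{width}b")


# Print a few rows of the table: bit pattern -> value.
for n in (0, 5, 10, 15):
    print(to_bits(n), "->", n)

# Going the other way: read a bit pattern as a number.
print(int("1111", 2))

# n bits can represent 2**n distinct values.
print({width: 2 ** width for width in (1, 4, 8, 16)})
```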
| Unit | Abbr. | Size | Real-world example |
|---|---|---|---|
| bit | b | 1 bit | A single light switch |
| Byte | B | 8 bits | One character (‘A’) |
| Kilobyte | KB | 1,024 bytes | A paragraph of text |
| Megabyte | MB | 1,024 KB | A high-res photo |
| Gigabyte | GB | 1,024 MB | An HD movie |
| Terabyte | TB | 1,024 GB | 500 hours of HD video |
| Petabyte | PB | 1,024 TB | All of Netflix’s content |
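To get a feel for the units, here is a rough sketch that converts a byte count into the largest sensible unit, using the 1,024 multiplier from the table. `human_size` is a hypothetical helper.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_size(n_bytes: float) -> str:
    """Format a byte count using the 1,024-based units from the table."""
    size = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the number is below 1,024,
        # or at the largest unit we know about.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(human_size(500))
print(human_size(1536))
print(human_size(1024 ** 3))
```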
Whole numbers (ℤ): no decimal points
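Many languages and file formats store whole numbers in a fixed number of bits, which limits the range they can hold. The sketch below computes those ranges and shows what happens one step past the limit. Python's own `int` grows as needed, so `ctypes.c_int32` is used here only to mimic a 32-bit signed integer; `int_range` is a hypothetical helper.

```python
import ctypes


def int_range(bits: int, signed: bool = True) -> tuple[int, int]:
    """Smallest and largest value a fixed-width integer can hold."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print(int_range(8, signed=False))
print(int_range(16))
print(int_range(32))

# Add 1 to the largest 32-bit signed value and store it in 32 bits:
# the value wraps around to the most negative number.
largest = int_range(32)[1]
wrapped = ctypes.c_int32(largest + 1).value
print(largest, "+ 1 ->", wrapped)
```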

Source: (Baraniuk 2015)
Ariane 5 Rocket (1996):

Source: (Baraniuk 2015)
Boeing 787 Dreamliner (2015):

Source: (Baraniuk 2015)
Patriot Missile System (1991):

Source: (Baraniuk 2015)
Key lessons:

Source: (BBC News 2014)
What happened (2014):
YouTube’s fix:
Key lessons:
Real numbers (ℝ): includes decimals
float (32-bit)
double (64-bit)
Example of rounding errors:
Why? Binary can’t represent 0.1 exactly (like 1/3 in decimal = 0.333…)
Practical impact: Never compare floats with ==. Instead use:
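In Python, one common option is `math.isclose`, or an explicit tolerance check. A minimal sketch (the tolerance `1e-9` is an arbitrary choice for illustration):

```python
import math

total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# Compare with a tolerance instead of exact equality.
print(math.isclose(total, 0.3))  # True
print(abs(total - 0.3) < 1e-9)   # True
```

Which tolerance is appropriate depends on the scale of your data, so treat the default as a starting point rather than a rule.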
Character (char)
'B', '!', '🎉'
String
"Data Science is fascinating"
Quote Flexibility in Programming
Why triple quotes are your friend:
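A short Python sketch of the quoting options (the example strings are made up for illustration):

```python
# Single and double quotes produce the same string.
single = 'Data Science is fascinating'
double = "Data Science is fascinating"
assert single == double

# Use one quote style to hold the other without escaping.
apostrophe = "It's a 'tidy' dataset"
speech = 'She said "hello"'

# Triple quotes let a string span several lines without writing \n by hand.
escaped = "Line one\nLine two"
triple = """Line one
Line two"""
assert escaped == triple

print(apostrophe)
print(speech)
print(triple)
```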
The ASCII table encodes 128 characters using 7 bits
| Dec | Char | Description |
|---|---|---|
| 32 | space | Space |
| 33 | ! | exclamation mark |
| 48 | 0 | zero |
| 49 | 1 | one |
| 65 | A | Uppercase A |
| 66 | B | Uppercase B |
| 97 | a | Lowercase a |
| 98 | b | Lowercase b |
Key insight:
Every character you type is converted to numbers!
Example: “Hi” = 72, 105 in ASCII (H = 72, i = 105)
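You can check this yourself with Python's built-in `ord` and `chr`. A minimal sketch:

```python
text = "Hi"

# Character -> number.
codes = [ord(ch) for ch in text]
print(codes)  # [72, 105]

# Number -> 7-bit pattern, as in the ASCII table.
bits = [format(code, "07b") for code in codes]
print(bits)  # ['1001000', '1101001']

# Number -> character, to get the text back.
restored = "".join(chr(code) for code in codes)
print(restored)  # Hi
```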
| Dec | Binary | Char | Description |
|---|---|---|---|
| 0 | 0000000 | NUL | Null |
| 1 | 0000001 | SOH | Start of Header |
| 2 | 0000010 | STX | Start of Text |
| 3 | 0000011 | ETX | End of Text |
| 4 | 0000100 | EOT | End of Transmission |
| 5 | 0000101 | ENQ | Enquiry |
| 6 | 0000110 | ACK | Acknowledge |
| 7 | 0000111 | BEL | Bell |
| 8 | 0001000 | BS | Backspace |
| 9 | 0001001 | HT | Horizontal Tab |
| 10 | 0001010 | LF | Line Feed |
| 11 | 0001011 | VT | Vertical Tab |
| 12 | 0001100 | FF | Form Feed |
| 13 | 0001101 | CR | Carriage Return |
| 14 | 0001110 | SO | Shift Out |
| 15 | 0001111 | SI | Shift In |
| 16 | 0010000 | DLE | Data Link Escape |
| 17 | 0010001 | DC1 | Device Control 1 |
| 18 | 0010010 | DC2 | Device Control 2 |
| 19 | 0010011 | DC3 | Device Control 3 |
| 20 | 0010100 | DC4 | Device Control 4 |
| 21 | 0010101 | NAK | Negative Acknowledge |
| 22 | 0010110 | SYN | Synchronize |
| 23 | 0010111 | ETB | End of Transmission Block |
| 24 | 0011000 | CAN | Cancel |
| 25 | 0011001 | EM | End of Medium |
| 26 | 0011010 | SUB | Substitute |
| 27 | 0011011 | ESC | Escape |
| 28 | 0011100 | FS | File Separator |
| 29 | 0011101 | GS | Group Separator |
| 30 | 0011110 | RS | Record Separator |
| 31 | 0011111 | US | Unit Separator |
| Dec | Binary | Char | Description |
|---|---|---|---|
| 32 | 0100000 | space | Space |
| 33 | 0100001 | ! | exclamation mark |
| 34 | 0100010 | " | double quote |
| 35 | 0100011 | # | number |
| 36 | 0100100 | $ | dollar |
| 37 | 0100101 | % | percent |
| 38 | 0100110 | & | ampersand |
| 39 | 0100111 | ' | single quote |
| 40 | 0101000 | ( | left parenthesis |
| 41 | 0101001 | ) | right parenthesis |
| 42 | 0101010 | * | asterisk |
| 43 | 0101011 | + | plus |
| 44 | 0101100 | , | comma |
| 45 | 0101101 | - | minus |
| 46 | 0101110 | . | period |
| 47 | 0101111 | / | slash |
| 48 | 0110000 | 0 | zero |
| 49 | 0110001 | 1 | one |
| 50 | 0110010 | 2 | two |
| 51 | 0110011 | 3 | three |
| 52 | 0110100 | 4 | four |
| 53 | 0110101 | 5 | five |
| 54 | 0110110 | 6 | six |
| 55 | 0110111 | 7 | seven |
| 56 | 0111000 | 8 | eight |
| 57 | 0111001 | 9 | nine |
| 58 | 0111010 | : | colon |
| 59 | 0111011 | ; | semicolon |
| 60 | 0111100 | < | less than |
| 61 | 0111101 | = | equality sign |
| 62 | 0111110 | > | greater than |
| 63 | 0111111 | ? | question mark |
| Dec | Binary | Char | Description |
|---|---|---|---|
| 64 | 1000000 | @ | at sign |
| 65 | 1000001 | A | |
| 66 | 1000010 | B | |
| 67 | 1000011 | C | |
| 68 | 1000100 | D | |
| 69 | 1000101 | E | |
| 70 | 1000110 | F | |
| 71 | 1000111 | G | |
| 72 | 1001000 | H | |
| 73 | 1001001 | I | |
| 74 | 1001010 | J | |
| 75 | 1001011 | K | |
| 76 | 1001100 | L | |
| 77 | 1001101 | M | |
| 78 | 1001110 | N | |
| 79 | 1001111 | O | |
| 80 | 1010000 | P | |
| 81 | 1010001 | Q | |
| 82 | 1010010 | R | |
| 83 | 1010011 | S | |
| 84 | 1010100 | T | |
| 85 | 1010101 | U | |
| 86 | 1010110 | V | |
| 87 | 1010111 | W | |
| 88 | 1011000 | X | |
| 89 | 1011001 | Y | |
| 90 | 1011010 | Z | |
| 91 | 1011011 | [ | left square bracket |
| 92 | 1011100 | \ | backslash |
| 93 | 1011101 | ] | right square bracket |
| 94 | 1011110 | ^ | caret / circumflex |
| 95 | 1011111 | _ | underscore |
| Dec | Binary | Char | Description |
|---|---|---|---|
| 96 | 1100000 | ` | grave / accent |
| 97 | 1100001 | a | |
| 98 | 1100010 | b | |
| 99 | 1100011 | c | |
| 100 | 1100100 | d | |
| 101 | 1100101 | e | |
| 102 | 1100110 | f | |
| 103 | 1100111 | g | |
| 104 | 1101000 | h | |
| 105 | 1101001 | i | |
| 106 | 1101010 | j | |
| 107 | 1101011 | k | |
| 108 | 1101100 | l | |
| 109 | 1101101 | m | |
| 110 | 1101110 | n | |
| 111 | 1101111 | o | |
| 112 | 1110000 | p | |
| 113 | 1110001 | q | |
| 114 | 1110010 | r | |
| 115 | 1110011 | s | |
| 116 | 1110100 | t | |
| 117 | 1110101 | u | |
| 118 | 1110110 | v | |
| 119 | 1110111 | w | |
| 120 | 1111000 | x | |
| 121 | 1111001 | y | |
| 122 | 1111010 | z | |
| 123 | 1111011 | { | left curly bracket |
| 124 | 1111100 | \| | vertical bar |
| 125 | 1111101 | } | right curly bracket |
| 126 | 1111110 | ~ | tilde |
| 127 | 1111111 | DEL | delete |
      |\      _,,,---,,_
ZZZzz /,`.-'`'    -.  ;-;;,_
     |,4-  ) )-,_. ,\ (  `'-'
    '---''(_/--'  `-'\_)
Sleeping cat by Felix Lee
      ;)( ;
     :----:     o8Oo./
    C|====| ._o8o8o8Oo_.
     |    |  \========/
     `----'   `------'
Coffee by Hayley Jane Wakenshaw
Other encoding standards:
Encoding Mismatches
Ever seen text like “NestlÃ©” instead of “Nestlé”?
That’s an encoding mismatch between UTF-8 and Latin-1!
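The mismatch is easy to reproduce: write the text as UTF-8 bytes, then read those bytes as if they were Latin-1. A minimal sketch:

```python
original = "Nestlé"

# Written out as UTF-8, "é" becomes two bytes.
raw = original.encode("utf-8")
print(raw)  # b'Nestl\xc3\xa9'

# Read back with the wrong encoding, each byte becomes its own character.
garbled = raw.decode("latin-1")
print(garbled)  # NestlÃ©

# If you know what went wrong, the damage can be reversed.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Nestlé
```

The safest habit is to state the encoding explicitly whenever you open a file, rather than relying on a default.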
Emojis are part of the Unicode standard (UTF-8 is one way of encoding them) and can be used in data analysis!
See the complete list of emojis on the Unicode website

Reference: (Felbo et al. 2017)

Research shows emojis improve sentiment analysis accuracy!
Date formats vary globally:
DD-MM-YYYY (UK: 06-10-2025)
MM-DD-YYYY (US: 10-06-2025)
YYYY-MM-DD (ISO: 2025-10-06) ✓ Recommended
YYYYMMDD (Compact: 20251006)
Timestamps add time and timezone:
2025-10-06T14:30:00+01:00
Important
Always use ISO 8601 format (YYYY-MM-DD) when possible!
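A sketch of why the format matters, using Python's `datetime` module. All four spellings of today's date parse to the same value, but only if you tell the parser which format to expect.

```python
from datetime import date, datetime, timezone

# The same day written four ways.
uk = datetime.strptime("06-10-2025", "%d-%m-%Y").date()
us = datetime.strptime("10-06-2025", "%m-%d-%Y").date()
compact = datetime.strptime("20251006", "%Y%m%d").date()
iso = date.fromisoformat("2025-10-06")
assert uk == us == compact == iso

# Read the UK string with the US format and you silently get 10 June.
misread = datetime.strptime("06-10-2025", "%m-%d-%Y").date()
print(misread.isoformat())  # 2025-06-10

# ISO 8601 strings also sort chronologically even when stored as plain text.
print(sorted(["2025-10-06", "2024-12-09", "2025-01-15"]))

# A full timestamp carries its UTC offset, so it can be converted reliably.
stamp = datetime.fromisoformat("2025-10-06T14:30:00+01:00")
print(stamp.astimezone(timezone.utc).isoformat())  # 2025-10-06T13:30:00+00:00
```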

Reference: (Oren 2019)
The issue:
YY not YYYY
31/12/99 → 01/01/00
The result:
More on the topic:

Reference: (Gibbs 2014a)
Unix timestamps count seconds since January 1, 1970
32-bit systems will overflow on: January 19, 2038 at 03:14:07 UTC
Solution: Upgrade to 64-bit systems (happening now)
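The overflow date can be derived rather than memorised. The sketch below adds the largest 32-bit signed value, in seconds, to the Unix epoch, and then shows where a wrapped-around counter would land.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value a signed 32-bit counter can hold.
max_int32 = 2 ** 31 - 1
last_moment = EPOCH + timedelta(seconds=max_int32)
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00

# One second later a 32-bit counter wraps to its most negative value,
# which corresponds to a date in 1901.
wrapped = EPOCH + timedelta(seconds=-(2 ** 31))
print(wrapped.isoformat())  # 1901-12-13T20:45:52+00:00
```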
The challenge:
The solution: Use specialized libraries
lubridate package (R)
datetime module (Python)
These handle the complexity for you!
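For instance, Python's `datetime` already knows about leap years and month lengths, so you never need to count days by hand. A small sketch with dates chosen for illustration:

```python
from datetime import date, timedelta

# 2024 is a leap year, 2025 is not: the library accounts for 29 February.
leap_gap = (date(2024, 3, 1) - date(2024, 2, 28)).days
normal_gap = (date(2025, 3, 1) - date(2025, 2, 28)).days
print(leap_gap, normal_gap)  # 2 1

# Adding 30 days across a month and year boundary.
due = date(2024, 12, 9) + timedelta(days=30)
print(due.isoformat())  # 2025-01-08
```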


Coming up after the break:
Comma-Separated Values - the workhorse of data exchange
Characteristics:
File extension: .csv
Why use CSV?
Let’s see how our MP donations look as CSV:
Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash
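A minimal sketch of reading this CSV with Python's built-in `csv` module. The text is held in a string here so the example is self-contained; with a real file you would pass an open file object instead.

```python
import csv
import io
from datetime import datetime

CSV_TEXT = """Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash
"""

# DictReader uses the header row as keys for every record.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# CSV has no data types: every field arrives as a string.
print(type(rows[0]["Value"]))

# Convert explicitly before doing arithmetic or date logic.
total = sum(float(row["Value"]) for row in rows)
dates = [datetime.strptime(row["Date"], "%d/%m/%Y").date() for row in rows]

print(total)
print(min(dates).isoformat())
```

The explicit conversions are the price of CSV's simplicity: the file says nothing about which column is a number or a date.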
Live demo: Let’s open the MP donation database →
Let’s open Camden open data →
eXtensible Markup Language
Key features:
Tags written inside < >
File extension: .xml
Note
Did you know? Microsoft Office files (.docx, .xlsx, .pptx) are actually ZIP files containing XML!
Try it: Rename a .docx to .zip and extract it
Real example: Let’s explore Camden open data XML →
HyperText Markup Language - like XML but for web pages
JavaScript Object Notation - flexible and powerful
Why JSON is popular:
Flat structure - Simple and direct
JSON can handle complex hierarchies:
Nested structure
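A sketch of the two shapes using Python's `json` module. The field names (`member`, `donations`, `donor` and so on) are hypothetical choices for illustration, not the schema of the real donations database.

```python
import json

# Flat: one record, one level of key-value pairs.
flat = {
    "date": "2024-12-09",
    "member": "John Milne",
    "entity": "Isabella Tree",
    "value": 2500.0,
}

# Nested: a member holds a list of donations, each with its own donor object.
nested = {
    "member": "John Milne",
    "donations": [
        {
            "date": "2024-12-09",
            "value": 2500.0,
            "donor": {"name": "Isabella Tree", "category": "Individual"},
        }
    ],
}

# Serialise to text and read it back.
text = json.dumps(nested, indent=2)
restored = json.loads(text)
assert restored == nested

print(text)
print(restored["donations"][0]["donor"]["name"])
```

Unlike CSV, JSON keeps the distinction between numbers and strings, so `value` comes back as a number without extra conversion.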
Same three donations, different formats:
XML (verbose):
<donations>
  <donation>
    <date>09/12/2024</date>
    <member>John Milne</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>Robert Jenrick</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>James Cleverly</member>
    <value>10000</value>
  </donation>
</donations>
367 characters
Why this matters: At 1 million records, JSON saves ~99MB in storage/bandwidth
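You can measure the difference yourself. The sketch below parses the three donations from XML with the standard library, writes the same records as JSON, and compares the lengths. The exact saving depends on field names, whitespace, and how the JSON is laid out, so treat the numbers as indicative only.

```python
import json
import xml.etree.ElementTree as ET

XML_TEXT = """<donations>
  <donation>
    <date>09/12/2024</date>
    <member>John Milne</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>Robert Jenrick</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>James Cleverly</member>
    <value>10000</value>
  </donation>
</donations>"""

# Turn each <donation> element into a dictionary of tag -> text.
root = ET.fromstring(XML_TEXT)
records = [
    {child.tag: child.text for child in donation}
    for donation in root.findall("donation")
]

# Write the same records as JSON.
json_text = json.dumps(records)

print(len(XML_TEXT), "characters as XML")
print(len(json_text), "characters as JSON")
```

XML repeats every field name twice (opening and closing tag), which is where most of the extra characters come from.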
For sharing data, prefer:
CSV
JSON
XML
Plain text
❌ Avoid proprietary formats for sharing:
.xls, .sas7bdat, .rdata, .rds, software-specific formats (limit accessibility)
Why? Open formats ensure your data remains accessible regardless of software availability
Building a tidy dataset is crucial for everything that follows!
🤔 Discussion: What’s wrong with this spreadsheet for data analysis?
Take 30 seconds to think about it…
Excel is great for:
But dangerous for:
Let’s see why with some real examples…
Time for reflection! (10 minutes total)

Source: (Vincent 2020)
What happened:
The scale:
The fix: 27 genes officially renamed (MARCH1 → MARCHF1)

Source: (Hern 2020)
The incident:
Impact: Lives at risk

Source: (“The Excel Depression” 2013)
Reinhart-Rogoff spreadsheet error:
Lesson: Always check your formulas!
Source: (Kwak 2013)

JP Morgan VaR Model (2012)
Lesson:
Even sophisticated financial models fail on fragile, manually managed software. Critical calculations need robust, auditable systems.

Source: (Beales 2013)
A pattern of Excel-related failures:
2007 CPDOs (i.e., constant proportion debt obligations): Moody’s coding error inflated structured finance ratings (pre-crisis)
2012 London Whale: JPMorgan’s Excel-based risk model failures ($6.2B loss)
2013 Reinhart-Rogoff: Spreadsheet error undermined influential austerity research
The pattern: Office tools used as substitutes for proper systems and critical thinking

Excel is NOT for:
Excel IS fine for:
Just know its limitations!
The “Excel horror stories” show one type of problem…
But there’s another kind of “messy”:
Data that’s technically correct but structured in ways that make analysis difficult or impossible
This is where “tidy data” principles become essential
A specific structure that makes analysis easier
Warning
Real-world data commonly violates these rules! Examples:
Messy (wide format):
| Patient | Baseline_BP | Week1_BP | Week2_BP |
|---------|-------------|----------|----------|
| John | 140 | 135 | 130 |
| Mary | 150 | 148 | 145 |
Problem: Time periods are columns, not observations. Hard to plot or calculate trends.
Tidy (long format):
| Patient | Week | Blood_Pressure |
|---------|----------|----------------|
| John | Baseline | 140 |
| John | Week1 | 135 |
| John | Week2 | 130 |
| Mary | Baseline | 150 |
| Mary | Week1 | 148 |
| Mary | Week2 | 145 |
Now we can easily: group by patient, plot BP over time, calculate average change
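The reshaping from wide to long can be sketched in plain Python. `to_long` is a hypothetical helper written for this example; in practice a library function such as `melt` in pandas does the same job.

```python
# The wide table: one row per patient, one column per time point.
wide = [
    {"Patient": "John", "Baseline_BP": 140, "Week1_BP": 135, "Week2_BP": 130},
    {"Patient": "Mary", "Baseline_BP": 150, "Week1_BP": 148, "Week2_BP": 145},
]


def to_long(rows, id_column="Patient", suffix="_BP"):
    """Turn each measurement column into its own row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {
                    id_column: row[id_column],
                    "Week": column.removesuffix(suffix),
                    "Blood_Pressure": value,
                }
            )
    return tidy


tidy = to_long(wide)
for record in tidy:
    print(record)

# With one observation per row, grouping by patient is straightforward.
by_patient = {}
for record in tidy:
    by_patient.setdefault(record["Patient"], []).append(record["Blood_Pressure"])

# Change from the first to the last measurement, then the average change.
change = {patient: bp[-1] - bp[0] for patient, bp in by_patient.items()}
average_change = sum(change.values()) / len(change)
print(change, average_change)
```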
“Tidy” isn’t always optimal
Sometimes you need different structures for:
The key: Understand why tidy data is useful, then make informed choices about when to deviate
Further reading: Chapter 12, Section 12.7 in (Wickham and Grolemund 2016), (Leek 2016)
| Feature | Flat File | Database |
|---|---|---|
| Structure | Simple, unstructured format with records stored in plain text files | Highly structured format with tables, rows, and columns |
| Organization | Records are stored in a single file or multiple files, but there is no relationship between them | Records are organized into tables, with relationships established between tables through keys and indexes |
| Access | Data is accessed by reading the entire file sequentially, making it less efficient for large datasets | Data is accessed using SQL (Structured Query Language) commands, which are more efficient for large datasets and allow for more advanced querying and manipulation |
| Scalability | Limited scalability, as adding more data requires creating new files or appending to existing ones | High scalability, as data can be easily added, updated, and deleted without affecting the overall structure of the database |
| Security | Limited built-in security features, and data is vulnerable to corruption or loss if the file is not backed up properly | Built-in security features, such as user accounts and permissions, and data is protected against corruption or loss through automatic backups and transaction logs |
| Examples | CSV, TSV files | SQL databases, like MySQL, Oracle, SQL Server |
Source: DatabaseTown
Rule of thumb: Start with CSV, graduate to databases as complexity grows
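To see what "graduating" looks like, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table and column names are assumptions made for this example, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# The five donations from earlier, with dates rewritten in ISO format.
rows = [
    ("2024-12-09", "John Milne", "Individual", 2500.0),
    ("2024-12-09", "Robert Jenrick", "Individual", 2500.0),
    ("2024-12-09", "Mr James Cleverly", "Company", 10000.0),
    ("2024-12-04", "Jeremy Corbyn", "Unincorporated Association", 5000.0),
    ("2024-11-09", "Richard Baker", "Trade Union", 4000.0),
]

conn = sqlite3.connect(":memory:")

# Unlike CSV, the table declares a type for each column.
conn.execute(
    "CREATE TABLE donations "
    "(date TEXT, member TEXT, entity_category TEXT, value REAL)"
)
conn.executemany("INSERT INTO donations VALUES (?, ?, ?, ?)", rows)

# Ask a question in SQL instead of looping over the file.
totals = conn.execute(
    "SELECT entity_category, SUM(value) AS total "
    "FROM donations "
    "GROUP BY entity_category "
    "ORDER BY total DESC, entity_category"
).fetchall()
conn.close()

for category, total in totals:
    print(category, total)
```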
Data types matter: Understand bits, bytes, and how data is represented
Choose the right format: CSV for simplicity, JSON for flexibility
Dates are tricky: Use ISO 8601 format and proper libraries
Excel has limits: Great for exploration, dangerous for production
Tidy data enables analysis: Structure your data for the tools you’ll use
Context matters: Rules are guidelines, not absolute laws
Next week:
Explore:
Practice:
Essential reading:

Office hours: Check StudentHub/course website for times
Discussion forum: Post questions on Slack for peer support or for support from teachers
Email or Slack private messages: For personal queries (prefer Slack as there is some latency in responding to emails)
Remember: Your first summative is due Week 04 (October 21st)
LSE DS101 2025/26 Autumn Term