DS101 – Fundamentals of Data Science
06 Oct 2025
Important
🥐 Important releases this week
Your first summative is now available!
Deadline: Week 04 class (October 21st)
📒 A dataset is a collection of data
But not just any collection…
Think of a library without organization:
Now imagine it organized:
A dataset is the same: Raw data becomes useful when properly structured
The reality check: ~80% of your time will be spent on data preparation!
UK MP Donations Database - A real-world example we’ll explore today
Single files or file collections
Think: spreadsheets, CSVs, JSON files, or a folder of logs.
Security: Depends on the device and file permissions; no automatic encryption.
Best for: small datasets, prototypes, one-off analyses.
Local databases
Think: SQLite, MySQL, PostgreSQL.
Security: Fully under your control — you manage access, backups, and encryption. Can be very secure if properly maintained.
Best for: structured, relational data, analytical workflows, multi-user local projects.
Cloud storage
Think: AWS S3, Google Cloud Storage, Azure Blob, BigQuery.
Security: Providers handle encryption, redundancy, and network security, but you rely on their policies and proper configuration (permissions, keys, access rules). Shared responsibility model applies.
Best for: massive datasets, team collaborations, projects needing scalable compute.
📒 Structured data fits neatly into tables with rows and columns
Key characteristics:
Note
Coming later: We’ll cover unstructured data (images, text, audio) in Week 09
Let’s examine the MP donations dataset:
Date | Member | Entity | Entity Category | Value (in £) | Nature |
---|---|---|---|---|---|
09/12/2024 | John Milne | Isabella Tree | Individual | 2,500.00 | cash |
09/12/2024 | Robert Jenrick | Colin Moynihan | Individual | 2,500.00 | cash |
09/12/2024 | Mr James Cleverly | IPGL (HOLDINGS) LIMITED | Company | 10,000.00 | cash |
04/12/2024 | Jeremy Corbyn | We Deserve Better | Unincorporated Association | 5,000.00 | cash |
09/11/2024 | Richard Baker | Community Union | Trade Union | 4,000.00 | cash |
Question: What data types do you see here?
Numeric data
String/Categorical data
Date/Time data
Different data types unlock different operations:
Numbers → compute averages, sums, ranges (e.g., sales totals)
Dates → sort chronologically, calculate durations (e.g., time between events)
Categories → represent a fixed set of labels (e.g., country, gender, product type)
Text → search, extract, or match patterns (e.g., find all “error” messages)
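A minimal Python sketch of these operations on the five MP donation rows shown earlier. The variable names are illustrative, and the standard library stands in for whatever tool you end up using.

```python
from collections import Counter
from datetime import datetime

# (date, entity category, value) for the five donations, as raw text
rows = [
    ("09/12/2024", "Individual", "2,500.00"),
    ("09/12/2024", "Individual", "2,500.00"),
    ("09/12/2024", "Company", "10,000.00"),
    ("04/12/2024", "Unincorporated Association", "5,000.00"),
    ("09/11/2024", "Trade Union", "4,000.00"),
]

# Numbers: drop the thousands separator, then convert, so sums work
values = [float(v.replace(",", "")) for _, _, v in rows]
total = sum(values)  # 24000.0

# Dates: parse DD/MM/YYYY so sorting is chronological
dates = sorted(datetime.strptime(d, "%d/%m/%Y").date() for d, _, _ in rows)
earliest = dates[0]  # 2024-11-09

# Sorting the same dates as text puts 4 December before 9 November
earliest_as_text = sorted(d for d, _, _ in rows)[0]  # '04/12/2024'

# Categories: count how often each label occurs
category_counts = Counter(c for _, c, _ in rows)
```

The same column gives a different "earliest" date depending on whether it is treated as text or as a date.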
Get the type wrong and the logic breaks:
"2,500" stored as text → can’t calculate totals
"15/01/2023" stored as text → won’t sort chronologically
"France" stored as text instead of category → wastes memory, harder to group
"London" stored as a number → average of cities? Nonsense!
Efficiency matters too:
City (string)
City |
---|
“London” |
“Paris” |
“London” |
As plain text:
City (categorical)
City |
---|
London 🏙️ |
Paris 🗼 |
Berlin 🐻 |
As categorical:
0, 1, 0, 2
{0: "London", 1: "Paris", 2: "Berlin"}
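The encoding above can be sketched in a few lines of Python. This assumes a four-row column (London, Paris, London, Berlin), which is what the codes 0, 1, 0, 2 imply. It only illustrates the idea and is not how any particular library implements categoricals.

```python
cities = ["London", "Paris", "London", "Berlin"]

# Give each distinct label an integer code, in order of first appearance
code_of = {}
codes = []
for city in cities:
    if city not in code_of:
        code_of[city] = len(code_of)
    codes.append(code_of[city])

# Lookup table to turn codes back into labels
labels = {code: city for city, code in code_of.items()}

print(codes)   # [0, 1, 0, 2]
print(labels)  # {0: 'London', 1: 'Paris', 2: 'Berlin'}
```

Each label is stored once in the lookup table, and every row keeps only a small integer.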
The building blocks:
Bits | Value |
---|---|
0000 | 0 |
0001 | 1 |
0010 | 2 |
0011 | 3 |
0100 | 4 |
0101 | 5 |
0110 | 6 |
0111 | 7 |
Bits | Value |
---|---|
1000 | 8 |
1001 | 9 |
1010 | 10 |
1011 | 11 |
1100 | 12 |
1101 | 13 |
1110 | 14 |
1111 | 15 |
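The two tables above can be reproduced with Python's built-in formatting. The helper `distinct_values` is a hypothetical name added for illustration.

```python
# Render 0..15 as 4-bit binary strings and check the round trip
for value in range(16):
    bits = format(value, "04b")    # e.g. 10 -> "1010"
    assert int(bits, 2) == value   # parsing base 2 recovers the number


def distinct_values(n_bits: int) -> int:
    """Number of different values that n_bits bits can represent."""
    return 2 ** n_bits


print(format(10, "04b"))    # 1010
print(distinct_values(4))   # 16
print(distinct_values(8))   # 256
```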
Unit | Abbr. | Size | Real-world example |
---|---|---|---|
bit | b | 1 bit | A single light switch |
Byte | B | 8 bits | One character (‘A’) |
Kilobyte | KB | 1,024 bytes | A paragraph of text |
Megabyte | MB | 1,024 KB | A high-res photo |
Gigabyte | GB | 1,024 MB | An HD movie |
Terabyte | TB | 1,024 GB | 500 hours of HD video |
Petabyte | PB | 1,024 TB | All of Netflix’s content |
Whole numbers (ℤ): no decimal points
Source: (Baraniuk 2015)
Ariane 5 Rocket (1996):
Source: (Baraniuk 2015)
Boeing 787 Dreamliner (2015):
Source: (Baraniuk 2015)
Patriot Missile System (1991):
Source: (Baraniuk 2015)
Key lessons:
Source: (BBC News 2014)
What happened (2014):
YouTube’s fix:
Key lessons:
Real numbers (ℝ): includes decimals
float (32-bit)
double (64-bit)
Example of rounding errors:
Why? Binary can’t represent 0.1 exactly (like 1/3 in decimal = 0.333…)
Practical impact: Never compare floats with ==. Compare within a small tolerance instead.
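A short Python sketch of the rounding problem and a tolerance-based comparison. `math.isclose` is one standard-library option; the hand-written tolerance of `1e-9` is an arbitrary choice for illustration.

```python
import math

total = 0.1 + 0.2
print(total)          # 0.30000000000000004
print(total == 0.3)   # False

# Compare within a tolerance instead of testing exact equality
print(math.isclose(total, 0.3))   # True
print(abs(total - 0.3) < 1e-9)    # True, the same idea written by hand
```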
Character (char)
'B', '!', '🎉'
String
"Data Science is fascinating"
Quote Flexibility in Programming
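A small sketch in Python (an assumption here, since quoting rules differ between languages) of how the three quote styles behave.

```python
single = 'Data Science is fascinating'
double = "Data Science is fascinating"
assert single == double  # the quote style does not change the string

# Pick the style that avoids escaping the quotes inside the text
apostrophe = "It's a dataset"
quoted = 'She said "tidy data"'

# Triple quotes let a string span several lines without typing \n
multi_line = """Date,Member
09/12/2024,John Milne"""

print(multi_line.count("\n"))  # 1
```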
Why triple quotes are your friend:
\ns)
The ASCII table encodes 128 characters using 7 bits
Dec | Char | Description |
---|---|---|
32 | space | Space |
33 | ! | exclamation mark |
48 | 0 | zero |
49 | 1 | one |
65 | A | Uppercase A |
66 | B | Uppercase B |
97 | a | Lowercase a |
98 | b | Lowercase b |
Key insight:
Every character you type is converted to numbers!
Example: “Hi” = 72, 105 in ASCII (H = 72, i = 105)
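In Python, the built-ins `ord` and `chr` expose this mapping directly. A minimal sketch:

```python
message = "Hi"

codes = [ord(ch) for ch in message]         # character -> number
bits = [format(c, "07b") for c in codes]    # number -> 7-bit binary
restored = "".join(chr(c) for c in codes)   # number -> character

print(codes)     # [72, 105]
print(bits)      # ['1001000', '1101001']
print(restored)  # Hi
```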
Dec | Binary | Char | Description |
---|---|---|---|
0 | 0000000 | NUL | Null |
1 | 0000001 | SOH | Start of Header |
2 | 0000010 | STX | Start of Text |
3 | 0000011 | ETX | End of Text |
4 | 0000100 | EOT | End of Transmission |
5 | 0000101 | ENQ | Enquiry |
6 | 0000110 | ACK | Acknowledge |
7 | 0000111 | BEL | Bell |
8 | 0001000 | BS | Backspace |
9 | 0001001 | HT | Horizontal Tab |
10 | 0001010 | LF | Line Feed |
11 | 0001011 | VT | Vertical Tab |
12 | 0001100 | FF | Form Feed |
13 | 0001101 | CR | Carriage Return |
14 | 0001110 | SO | Shift Out |
15 | 0001111 | SI | Shift In |
16 | 0010000 | DLE | Data Link Escape |
17 | 0010001 | DC1 | Device Control 1 |
18 | 0010010 | DC2 | Device Control 2 |
19 | 0010011 | DC3 | Device Control 3 |
20 | 0010100 | DC4 | Device Control 4 |
21 | 0010101 | NAK | Negative Acknowledge |
22 | 0010110 | SYN | Synchronize |
23 | 0010111 | ETB | End of Transmission Block |
24 | 0011000 | CAN | Cancel |
25 | 0011001 | EM | End of Medium |
26 | 0011010 | SUB | Substitute |
27 | 0011011 | ESC | Escape |
28 | 0011100 | FS | File Separator |
29 | 0011101 | GS | Group Separator |
30 | 0011110 | RS | Record Separator |
31 | 0011111 | US | Unit Separator |
Dec | Binary | Char | Description |
---|---|---|---|
32 | 0100000 | space | Space |
33 | 0100001 | ! | exclamation mark |
34 | 0100010 | " | double quote |
35 | 0100011 | # | number |
36 | 0100100 | $ | dollar |
37 | 0100101 | % | percent |
38 | 0100110 | & | ampersand |
39 | 0100111 | ' | single quote |
40 | 0101000 | ( | left parenthesis |
41 | 0101001 | ) | right parenthesis |
42 | 0101010 | * | asterisk |
43 | 0101011 | + | plus |
44 | 0101100 | , | comma |
45 | 0101101 | - | minus |
46 | 0101110 | . | period |
47 | 0101111 | / | slash |
48 | 0110000 | 0 | zero |
49 | 0110001 | 1 | one |
50 | 0110010 | 2 | two |
51 | 0110011 | 3 | three |
52 | 0110100 | 4 | four |
53 | 0110101 | 5 | five |
54 | 0110110 | 6 | six |
55 | 0110111 | 7 | seven |
56 | 0111000 | 8 | eight |
57 | 0111001 | 9 | nine |
58 | 0111010 | : | colon |
59 | 0111011 | ; | semicolon |
60 | 0111100 | < | less than |
61 | 0111101 | = | equality sign |
62 | 0111110 | > | greater than |
63 | 0111111 | ? | question mark |
Dec | Binary | Char | Description |
---|---|---|---|
64 | 1000000 | @ | at sign |
65 | 1000001 | A | |
66 | 1000010 | B | |
67 | 1000011 | C | |
68 | 1000100 | D | |
69 | 1000101 | E | |
70 | 1000110 | F | |
71 | 1000111 | G | |
72 | 1001000 | H | |
73 | 1001001 | I | |
74 | 1001010 | J | |
75 | 1001011 | K | |
76 | 1001100 | L | |
77 | 1001101 | M | |
78 | 1001110 | N | |
79 | 1001111 | O | |
80 | 1010000 | P | |
81 | 1010001 | Q | |
82 | 1010010 | R | |
83 | 1010011 | S | |
84 | 1010100 | T | |
85 | 1010101 | U | |
86 | 1010110 | V | |
87 | 1010111 | W | |
88 | 1011000 | X | |
89 | 1011001 | Y | |
90 | 1011010 | Z | |
91 | 1011011 | [ | left square bracket |
92 | 1011100 | \ | backslash |
93 | 1011101 | ] | right square bracket |
94 | 1011110 | ^ | caret / circumflex |
95 | 1011111 | _ | underscore |
Dec | Binary | Char | Description |
---|---|---|---|
96 | 1100000 | ` | grave accent |
97 | 1100001 | a | |
98 | 1100010 | b | |
99 | 1100011 | c | |
100 | 1100100 | d | |
101 | 1100101 | e | |
102 | 1100110 | f | |
103 | 1100111 | g | |
104 | 1101000 | h | |
105 | 1101001 | i | |
106 | 1101010 | j | |
107 | 1101011 | k | |
108 | 1101100 | l | |
109 | 1101101 | m | |
110 | 1101110 | n | |
111 | 1101111 | o | |
112 | 1110000 | p | |
113 | 1110001 | q | |
114 | 1110010 | r | |
115 | 1110011 | s | |
116 | 1110100 | t | |
117 | 1110101 | u | |
118 | 1110110 | v | |
119 | 1110111 | w | |
120 | 1111000 | x | |
121 | 1111001 | y | |
122 | 1111010 | z | |
123 | 1111011 | { | left curly bracket |
124 | 1111100 | \| | vertical bar |
125 | 1111101 | } | right curly bracket |
126 | 1111110 | ~ | tilde |
127 | 1111111 | DEL | delete |
|\ _,,,---,,_
ZZZzz /,`.-'`' -. ;-;;,_
|,4- ) )-,_. ,\ ( `'-'
'---''(_/--' `-'\_)
Sleeping cat by Felix Lee
;)( ;
:----: o8Oo./
C|====| ._o8o8o8Oo_.
| | \========/
`----' `------'
Coffee by Hayley Jane Wakenshaw
Other encoding standards:
Encoding Mismatches
Ever seen text like “NestlÃ©” instead of “Nestlé”?
That’s an encoding mismatch between UTF-8 and Latin-1!
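The mismatch can be reproduced in Python by encoding with one standard and decoding with the other. This is a sketch of the mechanism only; real files usually need the correct encoding passed when they are opened.

```python
original = "Nestlé"

utf8_bytes = original.encode("utf-8")    # é becomes two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("latin-1")   # each byte is read as its own character
print(garbled)                           # NestlÃ©

# Undoing the wrong step recovers the text
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)                          # Nestlé
```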
Emojis are part of the Unicode standard, so UTF-8 text can carry them and they can be used in data analysis!
See the complete list of emojis on the Unicode website
Reference: (Felbo et al. 2017)
Research shows emojis improve sentiment analysis accuracy!
Date formats vary globally:
DD-MM-YYYY (UK: 06-10-2025)
MM-DD-YYYY (US: 10-06-2025)
YYYY-MM-DD (ISO: 2025-10-06) ✓ Recommended
YYYYMMDD (Compact: 20251006)
Timestamps add time and timezone:
2025-10-06T14:30:00+01:00
Important
Always use ISO 8601 format (YYYY-MM-DD) when possible!
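A short sketch with Python's `datetime` module showing why the ambiguity matters: the same text is a different day under the UK and US conventions, while an ISO 8601 string needs no format hint.

```python
from datetime import date, datetime

text = "06-10-2025"
as_uk = datetime.strptime(text, "%d-%m-%Y").date()   # read as day-month-year
as_us = datetime.strptime(text, "%m-%d-%Y").date()   # read as month-day-year
print(as_uk.isoformat())   # 2025-10-06
print(as_us.isoformat())   # 2025-06-10

# ISO 8601 strings parse without a format string
iso_day = date.fromisoformat("2025-10-06")
stamp = datetime.fromisoformat("2025-10-06T14:30:00+01:00")
print(stamp.utcoffset())   # 1:00:00
```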
Reference: (Oren 2019)
The issue:
YY not YYYY
31/12/99 → 01/01/00
The result:
More on the topic:
Reference: (Gibbs 2014a)
Unix timestamps count seconds since January 1, 1970
32-bit systems will overflow on: January 19, 2038 at 03:14:07 UTC
Solution: Upgrade to 64-bit systems (happening now)
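The overflow date can be checked with a few lines of Python, assuming the counter is a signed 32-bit integer (the usual case for this problem).

```python
from datetime import datetime, timedelta, timezone

max_int32 = 2**31 - 1   # largest signed 32-bit value: 2147483647
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The last moment a signed 32-bit second counter can represent
last_moment = epoch + timedelta(seconds=max_int32)
print(last_moment.isoformat())   # 2038-01-19T03:14:07+00:00

# A signed 64-bit counter has vastly more room
max_int64 = 2**63 - 1
print(max_int64 > max_int32)     # True
```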
The challenge:
The solution: Use specialized libraries
lubridate package (R)
datetime module (Python)
These handle the complexity for you!
Coming up after the break:
Comma-Separated Values - the workhorse of data exchange
Characteristics:
File extension: .csv
Why use CSV?
Let’s see how our MP donations look as CSV:
Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash
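A minimal sketch of reading this CSV with Python's standard `csv` module. The text is held in a string here so the example is self-contained; with a real file you would pass an open file object instead.

```python
import csv
import io
from datetime import datetime

csv_text = """Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash
"""

# DictReader uses the header row as keys; every field arrives as text
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Convert columns to the types the analysis needs
for row in rows:
    row["Value"] = float(row["Value"])
    row["Date"] = datetime.strptime(row["Date"], "%d/%m/%Y").date()

total = sum(row["Value"] for row in rows)
largest = max(rows, key=lambda row: row["Value"])
print(total)               # 24000.0
print(largest["Member"])   # Mr James Cleverly
```

CSV carries no type information, so the conversion step is always your job.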
Live demo: Let’s open the MP donation database →
Let’s open Camden open data →
eXtensible Markup Language
Key features:
< >
.xml
Note
Did you know? Microsoft Office files (.docx, .xlsx, .pptx) are actually ZIP files containing XML!
Try it: Rename a .docx to .zip and extract it
Real example: Let’s explore Camden open data XML →
HyperText Markup Language - like XML but for web pages
JavaScript Object Notation - flexible and powerful
Why JSON is popular:
Flat structure - Simple and direct
JSON can handle complex hierarchies:
Nested structure
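A sketch of the two shapes using Python's `json` module. The field names and the grouping of donations under a member are hypothetical, chosen only to illustrate flat versus nested structure.

```python
import json

# Flat: every field sits at the top level
flat = {"date": "2024-12-09", "member": "John Milne", "value": 2500}

# Nested: objects and lists inside objects (hypothetical grouping)
nested = {
    "member": "John Milne",
    "donations": [
        {
            "date": "2024-12-09",
            "entity": {"name": "Isabella Tree", "category": "Individual"},
            "value": 2500,
        }
    ],
}

text = json.dumps(nested, indent=2)   # Python objects -> JSON text
restored = json.loads(text)           # JSON text -> Python objects

category = restored["donations"][0]["entity"]["category"]
print(category)   # Individual
```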
Same three donations, different formats:
XML (verbose):
<donations>
<donation>
<date>09/12/2024</date>
<member>John Milne</member>
<value>2500</value>
</donation>
<donation>
<date>09/12/2024</date>
<member>Robert Jenrick</member>
<value>2500</value>
</donation>
<donation>
<date>09/12/2024</date>
<member>James Cleverly</member>
<value>10000</value>
</donation>
</donations>
367 characters
Why this matters: At 1 million records, JSON saves ~99MB in storage/bandwidth
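The size difference can be explored with a small Python sketch that parses the XML above and writes the same records as JSON. Whitespace is stripped from both versions so only the markup is compared; the counts it prints will therefore differ from the figure quoted above, which depends on formatting.

```python
import json
import xml.etree.ElementTree as ET

xml_text = (
    "<donations>"
    "<donation><date>09/12/2024</date><member>John Milne</member><value>2500</value></donation>"
    "<donation><date>09/12/2024</date><member>Robert Jenrick</member><value>2500</value></donation>"
    "<donation><date>09/12/2024</date><member>James Cleverly</member><value>10000</value></donation>"
    "</donations>"
)

# Read each <donation> element into a dictionary
records = [
    {
        "date": d.findtext("date"),
        "member": d.findtext("member"),
        "value": int(d.findtext("value")),
    }
    for d in ET.fromstring(xml_text).findall("donation")
]

# Serialise the same records as compact JSON and compare sizes
json_text = json.dumps(records, separators=(",", ":"))
print(len(xml_text), len(json_text))
```

XML repeats every field name in an opening and a closing tag, which is where most of the extra characters come from.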
For sharing data, prefer:
CSV
JSON
XML
Plain text
❌ Avoid proprietary formats for sharing:
.xls, .sas7bdat, .rdata, .rds, software-specific formats (limit accessibility)
Why? Open formats ensure your data remains accessible regardless of software availability
Building a tidy dataset is crucial for everything that follows!
🤔 Discussion: What’s wrong with this spreadsheet for data analysis?
Take 30 seconds to think about it…
Excel is great for:
But dangerous for:
Let’s see why with some real examples…
Time for reflection! (10 minutes total)
Source: (Vincent 2020)
What happened:
The scale:
The fix: 27 genes officially renamed (MARCH1 → MARCHF1)
Source: (Hern 2020)
The incident:
Impact: Lives at risk
Source: (“The Excel Depression” 2013)
Reinhart-Rogoff spreadsheet error:
Lesson: Always check your formulas!
Source: (Kwak 2013)
JP Morgan VaR Model (2012)
Lesson:
Even sophisticated financial models fail on fragile, manually managed software. Critical calculations need robust, auditable systems.
Source: (Beales 2013)
A pattern of Excel-related failures:
2007 CPDOs (i.e., constant proportion debt obligations): Moody’s coding error inflated structured finance ratings (pre-crisis)
2012 London Whale: JPMorgan’s Excel-based risk model failures ($6.2B loss)
2013 Reinhart-Rogoff: Spreadsheet error undermined influential austerity research
The pattern: Office tools used as substitutes for proper systems and critical thinking
Excel is NOT for:
Excel IS fine for:
Just know its limitations!
The “Excel horror stories” show one type of problem…
But there’s another kind of “messy”:
Data that’s technically correct but structured in ways that make analysis difficult or impossible
This is where “tidy data” principles become essential
A specific structure that makes analysis easier
Warning
Real-world data commonly violates these rules! Examples:
Messy (wide format):
| Patient | Baseline_BP | Week1_BP | Week2_BP |
|---------|-------------|----------|----------|
| John | 140 | 135 | 130 |
| Mary | 150 | 148 | 145 |
Problem: Time periods are columns, not observations. Hard to plot or calculate trends.
Tidy (long format):
| Patient | Week | Blood_Pressure |
|---------|----------|----------------|
| John | Baseline | 140 |
| John | Week1 | 135 |
| John | Week2 | 130 |
| Mary | Baseline | 150 |
| Mary | Week1 | 148 |
| Mary | Week2 | 145 |
Now we can easily: group by patient, plot BP over time, calculate average change
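The reshaping from wide to long can be sketched in plain Python. Libraries offer dedicated functions for this (for example `melt` in pandas or `pivot_longer` in tidyr); the version below only shows the logic, and the helper `readings` is a hypothetical name.

```python
wide = [
    {"Patient": "John", "Baseline_BP": 140, "Week1_BP": 135, "Week2_BP": 130},
    {"Patient": "Mary", "Baseline_BP": 150, "Week1_BP": 148, "Week2_BP": 145},
]

# One output row per (patient, time point): the column name becomes a value
tidy = [
    {
        "Patient": row["Patient"],
        "Week": column.replace("_BP", ""),
        "Blood_Pressure": reading,
    }
    for row in wide
    for column, reading in row.items()
    if column != "Patient"
]


def readings(patient):
    """Blood pressure readings for one patient, in time order."""
    return [r["Blood_Pressure"] for r in tidy if r["Patient"] == patient]


john_change = readings("John")[-1] - readings("John")[0]
print(len(tidy))     # 6
print(john_change)   # -10
```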
“Tidy” isn’t always optimal
Sometimes you need different structures for:
The key: Understand why tidy data is useful, then make informed choices about when to deviate
Further reading: Chapter 12, Section 12.7 in (Wickham and Grolemund 2016), (Leek 2016)
Feature | Flat File | Database |
---|---|---|
Structure | Simple, unstructured format with records stored in plain text files | Highly structured format with tables, rows, and columns |
Organization | Records are stored in a single file or multiple files, but there is no relationship between them | Records are organized into tables, with relationships established between tables through keys and indexes |
Access | Data is accessed by reading the entire file sequentially, making it less efficient for large datasets | Data is accessed using SQL (Structured Query Language) commands, which are more efficient for large datasets and allow for more advanced querying and manipulation |
Scalability | Limited scalability, as adding more data requires creating new files or appending to existing ones | High scalability, as data can be easily added, updated, and deleted without affecting the overall structure of the database |
Security | Limited built-in security features, and data is vulnerable to corruption or loss if the file is not backed up properly | Built-in security features, such as user accounts and permissions, and data is protected against corruption or loss through automatic backups and transaction logs |
Examples | CSV, TSV files | SQL databases, like MySQL, Oracle, SQL Server |
Source: DatabaseTown
Rule of thumb: Start with CSV, graduate to databases as complexity grows
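A minimal sketch of the database side of this comparison, using Python's built-in `sqlite3` module and an in-memory database. The table and column names are assumptions made for illustration, and dates are stored as ISO 8601 text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute(
    "CREATE TABLE donations (date TEXT, member TEXT, category TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?, ?)",
    [
        ("2024-12-09", "John Milne", "Individual", 2500.0),
        ("2024-12-09", "Robert Jenrick", "Individual", 2500.0),
        ("2024-12-09", "Mr James Cleverly", "Company", 10000.0),
        ("2024-12-04", "Jeremy Corbyn", "Unincorporated Association", 5000.0),
        ("2024-11-09", "Richard Baker", "Trade Union", 4000.0),
    ],
)

# SQL asks for the answer directly instead of looping over a file
totals = conn.execute(
    "SELECT category, SUM(value) FROM donations "
    "GROUP BY category ORDER BY SUM(value) DESC"
).fetchall()
conn.close()

print(totals[0])   # ('Company', 10000.0)
```

For five rows a CSV is simpler; the database starts to pay off when there are several related tables or many users.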
Data types matter: Understand bits, bytes, and how data is represented
Choose the right format: CSV for simplicity, JSON for flexibility
Dates are tricky: Use ISO 8601 format and proper libraries
Excel has limits: Great for exploration, dangerous for production
Tidy data enables analysis: Structure your data for the tools you’ll use
Context matters: Rules are guidelines, not absolute laws
Next week:
Explore:
Practice:
Essential reading:
Office hours: Check StudentHub/course website for times
Discussion forum: Post questions on Slack for peer support or for support from teachers
Email or Slack private messages: For personal queries (prefer Slack as there is some latency in responding to emails)
Remember: Your first summative is due Week 04 (October 21st)
LSE DS101 2025/26 Autumn Term