🗓️ Week 02
Basic blocks of data: getting familiar with the most common data types and file formats

DS101 – Fundamentals of Data Science

Dr. Ghita Berrada

LSE Data Science Institute

06 Oct 2025

Quick announcements

Important

🥐 Important releases this week

Your first summative is now available!

Deadline: Week 04 class (October 21st)

What Actually Is a Dataset?

The Simple Answer

📒 A dataset is a collection of data

But not just any collection…

Think of a library without organization:

Books scattered randomly on shelves
No way to find a specific title
Can’t tell fiction from non-fiction
Impossible to analyze the collection

Now imagine it organized:

Each book has a catalog entry (metadata)
Systematic arrangement by subject/author
Anyone can locate what they need
You can answer questions like “How many books from the 1990s?”

A dataset is the same: Raw data becomes useful when properly structured

The Data Science Pipeline

The reality check: ~80% of your time will be in data preparation!

Real Dataset Example

UK MP Donations Database - A real-world example we’ll explore today

Where Can Datasets Live?

Storage Options

Single files or file collections

Think: spreadsheets, CSVs, JSON files, or a folder of logs.

✅ Easy to share: Drag-and-drop between computers or email it.
✅ Version control-friendly: Works well with Git or similar tools.
✅ Portable: No database software needed.
⚠️ Limited scale: Hard to manage hundreds of thousands of records efficiently.
⚠️ No built-in query engine: Searching/filtering can be slow on big datasets.
⚠️ Minimal relational support: Linking datasets requires manual joins.

Security: Depends on the device and file permissions; no automatic encryption.

Best for: small datasets, prototypes, one-off analyses.

Storage Options

Local databases

Think: SQLite, MySQL, PostgreSQL.

✅ Fast queries: Optimized for filtering, sorting, and aggregations.
✅ Supports relationships: Foreign keys, joins, and indexing.
✅ Handles large datasets: Millions of rows without breaking a sweat.
⚠️ Requires setup: Installation, configuration, schema design.
⚠️ Mostly single-machine: Sharing requires exporting or server setup.
⚠️ Maintenance overhead: Backups, updates, and security patches.

Security: Fully under your control — you manage access, backups, and encryption. Can be very secure if properly maintained.

Best for: structured, relational data, analytical workflows, multi-user local projects.

Storage Options

Cloud storage

Think: AWS S3, Google Cloud Storage, Azure Blob, BigQuery.

✅ Highly scalable: From megabytes to petabytes.
✅ Accessible anywhere: Collaborate in real time across locations.
✅ Integrated ecosystem: Often comes with analytics, versioning, and sharing tools.
⚠️ Dependent on internet: Offline access is limited.
⚠️ Cost can grow: Storage + bandwidth + query costs add up.
⚠️ Latency: Small queries may be slower than local access.

Security: Providers handle encryption, redundancy, and network security, but you rely on their policies and proper configuration (permissions, keys, access rules). Shared responsibility model applies.

Best for: massive datasets, team collaborations, projects needing scalable compute.

Storage Options

Understanding Data Structure

Structured Data Defined

📒 Structured data fits neatly into tables with rows and columns

Key characteristics:

Follows a predefined schema
Organized in a consistent format
Tabular form — rows as records and columns as variables — is what defines structured data
Easy to query and analyze

Note

Coming later: We’ll cover unstructured data (images, text, audio) in Week 09

Anatomy of Tabular Data

Let’s examine the MP donations dataset:

Date	Member	Entity	Entity Category	Value (in £)	Nature
09/12/2024	John Milne	Isabella Tree	Individual	2,500.00	cash
09/12/2024	Robert Jenrick	Colin Moynihan	Individual	2,500.00	cash
09/12/2024	Mr James Cleverly	IPGL (HOLDINGS) LIMITED	Company	10,000.00	cash
04/12/2024	Jeremy Corbyn	We Deserve Better	Unincorporated Association	5,000.00	cash
09/11/2024	Richard Baker	Community Union	Trade Union	4,000.00	cash

View full dataset

Question: What data types do you see here?

Spotting Numeric Data

Date	Member	Entity	Entity Category	Value (in £)	Nature
09/12/2024	John Milne	Isabella Tree	Individual	2,500.00	cash
09/12/2024	Robert Jenrick	Colin Moynihan	Individual	2,500.00	cash
09/12/2024	Mr James Cleverly	IPGL (HOLDINGS) LIMITED	Company	10,000.00	cash
04/12/2024	Jeremy Corbyn	We Deserve Better	Unincorporated Association	5,000.00	cash
09/11/2024	Richard Baker	Community Union	Trade Union	4,000.00	cash

Numeric data

Numbers we can calculate with

Identifying Categorical Data

Date	Member	Entity	Entity Category	Value (in £)	Nature
09/12/2024	John Milne	Isabella Tree	Individual	2,500.00	cash
09/12/2024	Robert Jenrick	Colin Moynihan	Individual	2,500.00	cash
09/12/2024	Mr James Cleverly	IPGL (HOLDINGS) LIMITED	Company	10,000.00	cash
04/12/2024	Jeremy Corbyn	We Deserve Better	Unincorporated Association	5,000.00	cash
09/11/2024	Richard Baker	Community Union	Trade Union	4,000.00	cash

String/Categorical data

Text representing categories or labels

Recognizing Temporal Data

Date	Member	Entity	Entity Category	Value (in £)	Nature
09/12/2024	John Milne	Isabella Tree	Individual	2,500.00	cash
09/12/2024	Robert Jenrick	Colin Moynihan	Individual	2,500.00	cash
09/12/2024	Mr James Cleverly	IPGL (HOLDINGS) LIMITED	Company	10,000.00	cash
04/12/2024	Jeremy Corbyn	We Deserve Better	Unincorporated Association	5,000.00	cash
09/11/2024	Richard Baker	Community Union	Trade Union	4,000.00	cash

Date/Time data

Temporal information (spoiler: this gets complicated!)

Why Care About Data Types?

Different data types unlock different operations:

Numbers → compute averages, sums, ranges (e.g., sales totals)
Dates → sort chronologically, calculate durations (e.g., time between events)
Categories → represent a fixed set of labels (e.g., country, gender, product type)
- Easier to group, count, or model
- More memory-efficient than storing them as plain text
Text → search, extract, or match patterns (e.g., find all “error” messages)

Get the type wrong — and the logic breaks:

"2,500" stored as text → can’t calculate totals
"2023-01-15" as text → won’t sort chronologically
"France" stored as text instead of category → wastes memory, harder to group
"London" as a number → average of cities? nonsense!

Efficiency matters too:

Numeric and categorical types use less space than text
Proper types make storage and computation faster

Example: Strings vs. Categorical Data

City (string)

City
“London”
“Paris”
“London”

As plain text:

Each value stored as a separate string in memory
Grouping or comparison scans the full text repeatedly
Less efficient for grouping and analysis

City (categorical)

City
London 🏙️
Paris 🗼
Berlin 🐻

As categorical:

Internally stored as integer codes: 0, 1, 0, 2
Labels mapped: {0: "London", 1: "Paris", 2: "Berlin"}
Faster grouping, filtering, and memory savings

Deep Dive: Data Types

The Foundation: Bits and Bytes

The building blocks:

A bit = 0 or 1
With n bits → 2ⁿ possible values
4 bits = 16 combinations →

Bits	Value
0000	0
0001	1
0010	2
0011	3
0100	4
0101	5
0110	6
0111	7

Bits	Value
1000	8
1001	9
1010	10
1011	11
1100	12
1101	13
1110	14
1111	15

Data Size Units

Unit	Abbr.	Size	Real-world example
bit	b	1 bit	A single light switch
Byte	B	8 bits	One character (‘A’)
Kilobyte	KB	1,024 bytes	A paragraph of text
Megabyte	MB	1,024 KB	A high-res photo
Gigabyte	GB	1,024 MB	An HD movie
Terabyte	TB	1,024 GB	500 hours of HD video
Petabyte	PB	1,024 TB	All of Netflix’s content

Numbers: Integers

Whole numbers (ℤ): no decimal points

Examples: -2, -1, 0, +1, +2
Size matters for range:
- 8-bit: -128 to +127
- 16-bit: -32,768 to +32,767
- 32-bit: ±2.1 billion
- 64-bit: ±9.2 quintillion

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Ariane 5 Rocket (1996):

Horizontal velocity calculated as 64-bit number
Converted to 16-bit signed integer → overflow
Guidance system crashed 39s after launch
Cost: rocket exploded, $370 million loss
Video explanation

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Boeing 787 Dreamliner (2015):

Engine control uptime stored as 32-bit signed integer (hundredths of a second)
After 248 days → counter reached max (2,147,483,647) → failsafe triggered
Engines could automatically shut down
Mitigation: periodic reboots and software fix

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Patriot Missile System (1991):

Clock measured in tenths of seconds, converted to floating-point
Small rounding errors accumulated over 100+ hours
Mis-tracked incoming Scud missile
Consequence: missile hit US Army barracks, 28 killed, 98 injured

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Key lessons:

Data type limits can crash systems or cause critical errors
Overflow or precision drift is subtle but dangerous
Design for scale, and monitor counters/precision closely

Gangnam Style Breaks YouTube

Source: (BBC News 2014)

What happened (2014):

YouTube view counter: 32-bit signed integer
Maximum: 2,147,483,647 views
Gangnam Style exceeded this
Counter displayed negative values!

YouTube’s fix:

Upgraded to 64-bit integers
New maximum: 9,223,372,036,854,775,807
(Over 9 quintillion views)

Key lessons:

Design systems for unexpected scale
Monitor approaching limits
Plan for growth beyond estimates
Users notice when counters break!

Numbers: Floating-Point

Real numbers (ℝ): includes decimals

Examples: -232.32, 3.14159, 1,402.92
Usually represented as:
- float (32-bit)
- double (64-bit)
Important: These are approximations, not exact values

Example of rounding errors:

>>> 0.1 + 0.2
0.30000000000000004  # Not 0.3!

>>> 0.1 + 0.2 == 0.3
False  # Dangerous for comparisons!

Why? Binary can’t represent 0.1 exactly (like 1/3 in decimal = 0.333…)

Practical impact: Never compare floats with ==. Instead use:

abs(a - b) < 0.0001  # Check if "close enough"

Text: Characters and Strings

Character (char)

A single symbol: 'B', '!', '🎉'
Wrapped in quotes

String

A sequence of characters
Example: "Data Science is fascinating"
Also wrapped in quotes

String Quote Options

Quote Flexibility in Programming

Single vs. double quotes – pick your flavor!

Most languages (Python, JavaScript, Ruby) happily accept either:

name1 = 'Alice'
name2 = "Bob"  # Both valid – no judgment here!

⚠️ Watch out: some languages like C treat them differently ('A' vs "A").

String Quote Options

Quote Flexibility in Programming

Triple quotes in Python – the “multiline superhero”:

query = """
SELECT name, age
FROM students
WHERE grade > 80
"""

poem = '''
Roses are red,
Violets are blue,
Data is great,
And Python is too!
'''

Can store big text blocks without crying over line breaks or escaping quotes.

String Quote Options

Quote Flexibility in Programming

Why triple quotes are your friend:

Keep formatting intact (no messy \ns)
No need to escape quotes inside (your keyboard will thank you)
Ideal for SQL, documentation, or fancy ASCII art
Python loves them for docstrings – even your future self will thank you! 😎

How Text Becomes Binary: ASCII

The ASCII table encodes 128 characters using 7 bits

Dec	Char	Description
32	space	Space
33	!	exclamation mark
48	0	zero
49	1	one
65	A	Uppercase A
66	B	Uppercase B
97	a	Lowercase a
98	b	Lowercase b

Key insight:

Every character you type is converted to numbers!

Example: “Hi” = 72 + 105 in ASCII

The ASCII table

Dec	Binary	Char	Description
0	0000000	NUL	Null
1	0000001	SOH	Start of Header
2	0000010	STX	Start of Text
3	0000011	ETX	End of Text
4	0000100	EOT	End of Transmission
5	0000101	ENQ	Enquiry
6	0000110	ACK	Acknowledge
7	0000111	BEL	Bell
8	0001000	BS	Backspace
9	0001001	HT	Horizontal Tab
10	0001010	LF	Line Feed
11	0001011	VT	Vertical Tab
12	0001100	FF	Form Feed
13	0001101	CR	Carriage Return
14	0001110	SO	Shift Out
15	0001111	SI	Shift In
16	0010000	DLE	Data Link Escape
17	0010001	DC1	Device Control 1
18	0010010	DC2	Device Control 2
19	0010011	DC3	Device Control 3
20	0010100	DC4	Device Control 4
21	0010101	NAK	Negative Acknowledge
22	0010110	SYN	Synchronize
23	0010111	ETB	End of Transmission Block
24	0011000	CAN	Cancel
25	0011001	EM	End of Medium
26	0011010	SUB	Substitute
27	0011011	ESC	Escape
28	0011100	FS	File Separator
29	0011101	GS	Group Separator
30	0011110	RS	Record Separator
31	0011111	US	Unit Separator

Dec	Binary	Char	Description
32	0100000	space	Space
33	0100001	!	exclamation mark
34	0100010	”	double quote
35	0100011	#	number
36	0100100	$	dollar
37	0100101	%	percent
38	0100110	&	ampersand
39	0100111	’	single quote
40	0101000	(	left parenthesis
41	0101001	)	right parenthesis
42	0101010	*	asterisk
43	0101011	+	plus
44	0101100	,	comma
45	0101101	-	minus
46	0101110	.	period
47	0101111	/	slash
48	0110000	0	zero
49	0110001	1	one
50	0110010	2	two
51	0110011	3	three
52	0110100	4	four
53	0110101	5	five
54	0110110	6	six
55	0110111	7	seven
56	0111000	8	eight
57	0111001	9	nine
58	0111010	:	colon
59	0111011	;	semicolon
60	0111100	<	less than
61	0111101	=	equality sign
62	0111110	>	greater than
63	0111111	?	question mark

Dec	Binary	Char	Description
64	1000000	@	at sign
65	1000001	A
66	1000010	B
67	1000011	C
68	1000100	D
69	1000101	E
70	1000110	F
71	1000111	G
72	1001000	H
73	1001001	I
74	1001010	J
75	1001011	K
76	1001100	L
77	1001101	M
78	1001110	N
79	1001111	O
80	1010000	P
81	1010001	Q
82	1010010	R
83	1010011	S
84	1010100	T
85	1010101	U
86	1010110	V
87	1010111	W
88	1011000	X
89	1011001	Y
90	1011010	Z
91	1011011	[	left square bracket
92	1011100	\	backslash
93	1011101	]	right square bracket
94	1011110	^	caret / circumflex
95	1011111	_	underscore

Dec	Binary	Char	Description
96	1100000	`	grave / accent
97	1100001	a
98	1100010	b
99	1100011	c
100	1100100	d
101	1100101	e
102	1100110	f
103	1100111	g
104	1101000	h
105	1101001	i
106	1101010	j
107	1101011	k
108	1101100	l
109	1101101	m
110	1101110	n
111	1101111	o
112	1110000	p
113	1110001	q
114	1110010	r
115	1110011	s
116	1110100	t
117	1110101	u
118	1110110	v
119	1110111	w
120	1111000	x
121	1111001	y
122	1111010	z
123	1111011	{	left curly bracket
124	1111100
125	1111101	}	right curly bracket
126	1111110	~	tilde
127	1000001	DEL	delete

ASCII Art: When Text Becomes Visual

See “The surprisingly rich history of ASCII art” article

      |\      _,,,---,,_
ZZZzz /,`.-'`'    -.  ;-;;,_
     |,4-  ) )-,_. ,\ (  `'-'
    '---''(_/--'  `-'\_)

Sleeping cat by Felix Lee

  ;)( ;
 :----:     o8Oo./
C|====| ._o8o8o8Oo_.
 |    |  \========/
 `----'   `------'

Coffee by Hayley Jane Wakenshaw

Beyond ASCII: Modern Text Encoding

Other encoding standards:

UTF-8 (most common today)
- Supports all languages
- Backward compatible with ASCII
UTF-16, UTF-32
ISO-8859-1 (Latin-1)

Encoding Mismatches

Ever seen text like “NestlÃ©” instead of “Nestlé”?

That’s an encoding mismatch between UTF-8 and Latin-1!

Emojis Are Text Too! 😊

Emojis are part of the UTF-8 standard and can be used in data analysis!

Emojis in Data Science

Reference: (Felbo et al. 2017)

Research shows emojis improve sentiment analysis accuracy!

Dates and Times: Deceptively Complex

Date formats vary globally:

DD-MM-YYYY (UK: 06-10-2025)
MM-DD-YYYY (US: 10-06-2025)
YYYY-MM-DD (ISO: 2025-10-06) ✓ Recommended
YYYYMMDD (Compact: 20251006)

Timestamps add time and timezone:

2025-10-06T14:30:00+01:00

Important

Always use ISO 8601 format (YYYY-MM-DD) when possible!

The Y2K Problem

Reference: (Oren 2019)

The issue:

Dates stored as YY not YYYY
31/12/99 → 01/01/00
Is that 1900 or 2000?

The result:

Billions spent on fixes
Global panic (largely avoided due to preparation)

The Next Date Crisis: Y2038

Reference: (Gibbs 2014a)

Unix timestamps count seconds since January 1, 1970

32-bit systems will overflow on: January 19, 2038 at 03:14:07 UTC

Solution: Upgrade to 64-bit systems (happening now)

Working with Dates in Practice

The challenge:

Global format variations
Mixed formats in same dataset
Time zones and daylight saving
Leap years and calendar quirks

The solution: Use specialized libraries

R: lubridate package
Python: datetime module

These handle the complexity for you!

Break Time! ☕

Coming up after the break:

File formats demystified
Tidy data principles
Excel horror stories
Hands-on examples

Common File Formats

CSV: The Universal Format

Comma-Separated Values - the workhorse of data exchange

Characteristics:

Plain text format (.csv)
First row = column headers
Commas separate values (usually!)
Human-readable

Why use CSV?

Works everywhere
Simple to inspect
Version control friendly
No vendor lock-in

CSV in Action

Let’s see how our MP donations look as CSV:

Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash

Live demo: Let’s open the MP donation database →

Let’s open Camden open data →

XML: Structured and Semantic

eXtensible Markup Language

Key features:

Uses tags: < >
Self-describing
Hierarchical structure
Industry standard
Files end with .xml

<?xml version="1.0" encoding="UTF-8"?>
<donation>
  <date>09/12/2024</date>
  <member>John Milne</member>
  <entity>Isabella Tree</entity>
  <category>Individual</category>
  <value currency="GBP">2500.00</value>
  <nature>cash</nature>
</donation>

XML Powers Office Documents

Note

Did you know? Microsoft Office files (.docx, .xlsx, .pptx) are actually ZIP files containing XML!

Try it: Rename a .docx to .zip and extract it

Real example: Let’s explore Camden open data XML →

HTML: The Web’s Language

HyperText Markup Language - like XML but for web pages

<!DOCTYPE html>
<html>
<head>
  <title>Data Science</title>
</head>
<body>
  <h1>Welcome!</h1>
  <p>This is HTML</p>
</body>
</html>

Every website uses HTML

Structured like XML
Designed for browsers
Can contain data
Scrape-able (with permission!)

JSON: The Modern Standard

JavaScript Object Notation - flexible and powerful

{
  "MP_donations": [
    {
      "date": "09/12/2024",
      "member": "John Milne",
      "entity": "Isabella Tree",
      "category": "Individual",
      "value": 2500.00,
      "nature": "cash"
    }
  ]
}

Why JSON is popular:

Web APIs use it extensively
Native to JavaScript
Supports nested structures
Human-readable and machine-parseable

JSON: Visualized

Flat structure - Simple and direct

JSON: Nested Structures

JSON can handle complex hierarchies:

{
  "MP_donations": [
    {
      "date": "09/12/2024",
      "nature": "cash",
      "donations": [
        {
          "member": "John Milne",
          "entity": "Isabella Tree",
          "category": "Individual",
          "value": 2500.00
        },
        {
          "member": "Robert Jenrick",
          "entity": "Colin Moynihan",
          "category": "Individual",
          "value": 2500.00
        }
      ]
    }
  ]
}

Nested JSON Visualized

Nested structure

Grouping related data

JSON vs XML: Compactness Compared

Same three donations, different formats:

XML (verbose):

<donations>
  <donation>
    <date>09/12/2024</date>
    <member>John Milne</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>Robert Jenrick</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>James Cleverly</member>
    <value>10000</value>
  </donation>
</donations>

367 characters

JSON (compact):

{
  "donations": [
    {
      "date": "09/12/2024",
      "member": "John Milne",
      "value": 2500
    },
    {
      "date": "09/12/2024",
      "member": "Robert Jenrick",
      "value": 2500
    },
    {
      "date": "09/12/2024",
      "member": "James Cleverly",
      "value": 10000
    }
  ]
}

268 characters (27% smaller)

Why this matters: At 1 million records, JSON saves ~99MB in storage/bandwidth

File Format Best Practices

For sharing data, prefer:

CSV

Simple tabular data (single table, flat structure)
Best for: Spreadsheet-like data, time series, survey results
Example: Sales records, sensor data, student grades

JSON

Nested/hierarchical data with relationships
Best for: API responses, configuration files, complex data structures
Example: Social media posts with comments, product catalogs with variants

XML

Industry-mandated standards and document exchange
Best for: Healthcare (HL7), finance (FpML), government data interchange
Example: Legal documents, RSS feeds, SOAP web services

Plain text

Always human-readable, future-proof
Best for: Documentation, logs, simple configurations

❌ Avoid proprietary formats for sharing:
.xls, .sas7bdat, .rdata, .rds, software-specific formats (limit accessibility)

Why? Open formats ensure your data remains accessible regardless of software availability

Tidy Data Philosophy

Returning to Our Pipeline

Building a tidy dataset is crucial for everything that follows!

What Makes Data Messy?

🤔 Discussion: What’s wrong with this spreadsheet for data analysis?

Take 30 seconds to think about it…

The Excel Problem

Excel is great for:

Quick data entry
Simple calculations
Basic visualization
Non-technical users

But dangerous for:

Data science workflows
Reproducible analysis
Large datasets
Collaborative projects

Let’s see why with some real examples…

Group Discussion Activity

Time for reflection! (10 minutes total)

Break into assigned groups (~5 min reading)
- Article 1: Gene name errors
- Article 2: COVID data loss
- Article 3: Economic policy impact
- Article 4: London Whale
- Article 5: Microsoft and global finance
Group summary (~5 min discussion)
- What went wrong?
- What were the consequences?
- How could it have been prevented?

Horror Story #1: Gene Names

Source: (Vincent 2020)

What happened:

Excel auto-converted gene names to dates
MARCH1 (short for “Membrane Associated Ring-CH-Type Finger 1”) → 1-Mar
SEPT2 (short for “Septin 2”)→ 2-Sep

The scale:

2016 study (Ziemann, Eren, and El-Osta 2016): ~20% of 3,597 genomics papers affected by Excel gene name conversion
2021 follow-up (Abeysooriya et al. 2021): Still 30.9% of 11,117 papers had errors
Problem persisted despite awareness!

The fix: 27 genes officially renamed (MARCH1 → MARCHF1)

Horror Story #2: COVID-19 Data Loss

Source: (Hern 2020)

The incident:

UK Public Health England used Excel as exchange format with local trusts
Hit the 65,536 row limit
Lost 16,000 COVID cases
Delayed contact tracing

Impact: Lives at risk

Horror Story #3: Economic Policy

Source: (“The Excel Depression” 2013)

Reinhart-Rogoff spreadsheet error:

Influential economics paper
Original results showed average real economic growth slows (a 0.1% decline) when a country’s debt rises to more than 90% of gross domestic product
Influenced austerity policies globally
But, Excel formula excluded 5 rows (i.e 5 countries: Australia, Austria, Belgium, Canada and Denmark)
Wrong conclusions about debt/GDP (when that error was corrected, the “0.1% decline” data became a 2.2% average increase in economic growth)

Lesson: Always check your formulas!

Horror Story #4: The London Whale

Source: (Kwak 2013)

Horror Story #4: The London Whale

JP Morgan VaR Model (2012)

Built entirely in Excel
Manual copy-paste & fragile formulas
Formula error: divided by sum instead of average → VaR understated
Model review flagged issues but flaws never fixed
Resulted in $6.2 billion trading loss

Lesson:

Even sophisticated financial models fail on fragile, manually managed software. Critical calculations need robust, auditable systems.

Horror Story #5: Microsoft and Finance

Source: (Beales 2013)

A pattern of Excel-related failures:

2007 CPDOs (i.e Constant proportion debt obligations): Moody’s coding error inflated structured finance ratings (pre-crisis)
2012 London Whale: JPMorgan’s Excel-based risk model failures ($6.2B loss)
2013 Reinhart-Rogoff: Spreadsheet error undermined influential austerity research

The pattern: Office tools used as substitutes for proper systems and critical thinking

The Excel Takeaway

Excel is NOT for:

Serious data science
Reproducible analysis
Large-scale data processing
Production systems

Excel IS fine for:

Quick data exploration
Small personal datasets
Initial data entry
Simple calculations

Just know its limitations!

Beyond Excel Horror: Real Messiness

The “Excel horror stories” show one type of problem…

But there’s another kind of “messy”:

Data that’s technically correct but structured in ways that make analysis difficult or impossible

This is where “tidy data” principles become essential

What Is Tidy Data?

A specific structure that makes analysis easier

Credit: Allison Horst Stats Illustrations

The Three Rules of Tidy Data

Each variable forms a column
- One column = one type of measurement
- Example violation: Multiple years as separate columns
Each observation forms a row
- One row = one distinct data point
- Example violation: Multiple measurements in one cell
Each type of observational unit forms a table
- Related data stays together
- Example violation: Mixing patient data with hospital data

The Three Rules of Tidy Data

Warning

Real-world data commonly violates these rules! Examples:

Column headers are values, not variable names (e.g., “2020”, “2021”, “2022”)
Multiple variables in one column (e.g., “Country_Year”)
Multiple types of observations in same table (e.g., mixing individual and summary statistics)

Tidy vs. Messy: Example

Messy (wide format):

| Patient | Baseline_BP | Week1_BP | Week2_BP |
|---------|-------------|----------|----------|
| John    | 140         | 135      | 130      |
| Mary    | 150         | 148      | 145      |

Problem: Time periods are columns, not observations. Hard to plot or calculate trends.

Tidy (long format):

| Patient | Week     | Blood_Pressure |
|---------|----------|----------------|
| John    | Baseline | 140            |
| John    | Week1    | 135            |
| John    | Week2    | 130            |
| Mary    | Baseline | 150            |
| Mary    | Week1    | 148            |
| Mary    | Week2    | 145            |

Now we can easily: group by patient, plot BP over time, calculate average change

Important Caveat: Context Matters

“Tidy” isn’t always optimal

Sometimes you need different structures for:

Performance (e.g., matrix operations in ML)
Specialized domains (e.g., genomics formats)
Specific algorithms (e.g., sparse matrices)

The key: Understand why tidy data is useful, then make informed choices about when to deviate

Further reading: Chapter 12, Section 12.7 in (Wickham and Grolemund 2016), (Leek 2016)

Flat Files vs. Databases

Feature	Flat File	Database
Structure	Simple, unstructured format with records stored in plain text files	Highly structured format with tables, rows, and columns
Organization	Records are stored in a single file or multiple files, but there is no relationship between them	Records are organized into tables, with relationships established between tables through keys and indexes
Access	Data is accessed by reading the entire file sequentially, making it less efficient for large datasets	Data is accessed using SQL (Structured Query Language) commands, which are more efficient for large datasets and allow for more advanced querying and manipulation
Scalability	Limited scalability, as adding more data requires creating new files or appending to existing ones	High scalability, as data can be easily added, updated, and deleted without affecting the overall structure of the database
Security	Limited built-in security features, and data is vulnerable to corruption or loss if the file is not backed up properly	Built-in security features, such as user accounts and permissions, and data is protected against corruption or loss through automatic backups and transaction logs
Examples	CSV, TSV files	SQL databases, like MySQL, Oracle, SQL Server

Source: DatabaseTown

Rule of thumb: Start with CSV, graduate to databases as complexity grows

Key Takeaways

Data types matter: Understand bits, bytes, and how data is represented
Choose the right format: CSV for simplicity, JSON for flexibility
Dates are tricky: Use ISO 8601 format and proper libraries
Excel has limits: Great for exploration, dangerous for production
Tidy data enables analysis: Structure your data for the tools you’ll use
Context matters: Rules are guidelines, not absolute laws

Looking Ahead

Next week:

What is computational thinking?
What do we mean by algorithms?

Resources & Further Reading

Explore:

Practice:

Resources & Further Reading

Essential reading:

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann. 2021. “Gene Name Errors: Lessons Not Learned.” PLoS Computational Biology 17 (7): e1008984.

Bailey, David H., and Jonathan Borwein (Jon). 2013. “The Reinhart-Rogoff Error – or How Not to Excel at Economics. The Conversation.” April 22, 2013. http://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646.

Baraniuk, Chris. 2015. “The Number Glitch That Can Lead to Catastrophe.” BBC Future, May. https://www.bbc.com/future/article/20150505-the-numbers-that-lead-to-disaster.

BBC News. 2014. “Gangnam Style Music Video ’Broke’ YouTube View Limit.” BBC News, December. https://www.bbc.co.uk/news/world-asia-30288542.

Beales, Richard. 2013. “Blame Microsoft.” New York Times (Online). 2013. https://www.proquest.com/blogs-podcasts-websites/blame-microsoft/docview/2215333604/se-2.

Felbo, Bjarke, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. “Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1615–25. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1169.

Gibbs, Samuel. 2014a. “Is the Year 2038 Problem the New Y2K Bug?” The Guardian (London). https://www.theguardian.com/technology/2014/dec/17/is-the-year-2038-problem-the-new-y2k-bug.

———. 2014b. “Y2K Bug Triggers Army Conscription Notices Sent to 14,000 Dead Men.” The Guardian (London). https://www.theguardian.com/technology/2014/jul/11/y2k-bug-us-army-conscription-1800s-pennsylvania.

Hern, Alex. 2020. “Covid: How Excel May Have Caused Loss of 16,000 Test Results in England.” The Guardian. https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england.

Kwak, James. 2013. “The Importance of Excel.” "The Baseline Scenario" Blog, February. https://baselinescenario.com/2013/02/09/the-importance-of-excel/.

Leek, Jeff. 2016. “Non-Tidy Data.” "Simply Statistics" Blog. https://simplystatistics.org/posts/2016-02-17-non-tidy-data/.

Oren, Nir. 2019. “If You Think the Millennium Bug Was a Hoax, Here Comes a History Lesson.” The Conversation. http://theconversation.com/if-you-think-the-millennium-bug-was-a-hoax-here-comes-a-history-lesson-129042.

Reuters. 1999. “Nuclear ’Glitches’ in Japan.” Wired, December. https://www.wired.com/1999/12/nuclear-glitches-in-japan/.

Ted Bridis. 2000. “’Business as Usual’ in Y2K. Standard-Times.” January 3, 2000. https://www.southcoasttoday.com/story/news/nation-world/2000/01/03/business-as-usual-in/50500031007/.

“The Excel Depression.” 2013. New York Times (Online). https://gate3.library.lse.ac.uk/login?url=https://www.proquest.com/docview/2215248827?pq-origsite=primo&accountid=9630.

Thomas, Martyn. 2019. “The Millennium Bug Was Real – and 20 Years Later We Face the Same Threats.” The Guardian (Online). https://www.theguardian.com/commentisfree/2019/dec/31/millennium-bug-face-fears-y2k-it-systems.

Uenuma, Francine. 2019. “Years Later, the Y2K Bug Seems Like a Joke–Because Those Behind the Scenes Took It Seriously.” Time. https://time.com/5752129/y2k-bug-history/.

Vincent, James. 2020. “Scientists Rename Human Genes to Stop Microsoft Excel from Misreading Them as Dates.” The Verge, August. https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly. https://r4ds.had.co.nz/.

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1): 1–3.

Questions?

Office hours: Check StudentHub/course website for times

Discussion forum: Post questions on Slack for peer support or for support from teachers

Email or Slack private messages: For personal queries (prefer Slack as there is some latency in responding to emails)

Remember: Your first summative is due Week 04 (October 21st)

🗓️ Week 02 Basic blocks of data: getting familiar with the most common data types and file formats

Quick announcements

The menu for today

What Actually Is a Dataset?

The Simple Answer

The Data Science Pipeline

Real Dataset Example

Where Can Datasets Live?

Storage Options

Storage Options

Storage Options

Storage Options

Understanding Data Structure

Structured Data Defined

Anatomy of Tabular Data

Spotting Numeric Data

Identifying Categorical Data

Recognizing Temporal Data

Why Care About Data Types?

Example: Strings vs. Categorical Data

Deep Dive: Data Types

The Foundation: Bits and Bytes

Data Size Units

Numbers: Integers

When Numbers Hit Their Limits

When Numbers Hit Their Limits

When Numbers Hit Their Limits

When Numbers Hit Their Limits

Gangnam Style Breaks YouTube

Numbers: Floating-Point

Text: Characters and Strings

String Quote Options

String Quote Options

String Quote Options

How Text Becomes Binary: ASCII

The ASCII table

ASCII Art: When Text Becomes Visual

Beyond ASCII: Modern Text Encoding

Emojis Are Text Too! 😊

Emojis in Data Science

Dates and Times: Deceptively Complex

The Y2K Problem

The Next Date Crisis: Y2038

Working with Dates in Practice

Break Time! ☕

Common File Formats

CSV: The Universal Format

CSV in Action

XML: Structured and Semantic

XML Powers Office Documents

HTML: The Web’s Language

JSON: The Modern Standard

JSON: Visualized

JSON: Nested Structures

Nested JSON Visualized

JSON vs XML: Compactness Compared

File Format Best Practices

Tidy Data Philosophy

Returning to Our Pipeline

What Makes Data Messy?

The Excel Problem

Group Discussion Activity

Horror Story #1: Gene Names

Horror Story #2: COVID-19 Data Loss

Horror Story #3: Economic Policy

Horror Story #4: The London Whale

Horror Story #4: The London Whale

Horror Story #5: Microsoft and Finance

The Excel Takeaway

Beyond Excel Horror: Real Messiness

What Is Tidy Data?

The Three Rules of Tidy Data

The Three Rules of Tidy Data

Tidy vs. Messy: Example

Important Caveat: Context Matters

Flat Files vs. Databases

Key Takeaways

Looking Ahead

Resources & Further Reading

Resources & Further Reading

🗓️ Week 02
Basic blocks of data: getting familiar with the most common data types and file formats