🗓️ Week 02
Basic blocks of data: getting familiar with the most common data types and file formats

DS101 – Fundamentals of Data Science

06 Oct 2025

Quick announcements

Sign in here

Important

🥐 Important releases this week

Your first summative is now available!

Deadline: Week 04 class (October 21st)

The menu for today

  1. Understanding datasets - What makes data “organised”?
  2. Data types demystified - From bits to text to dates
  3. File formats explained - CSV, JSON, XML and friends
  4. Tidy data principles - Making data analysis-ready
  5. Real-world cautionary tales - When data goes wrong

What Actually Is a Dataset?

The Simple Answer

📒 A dataset is a collection of data

But not just any collection…

Think of a library without organization:

  • Books scattered randomly on shelves
  • No way to find a specific title
  • Can’t tell fiction from non-fiction
  • Impossible to analyze the collection

Now imagine it organized:

  • Each book has a catalog entry (metadata)
  • Systematic arrangement by subject/author
  • Anyone can locate what they need
  • You can answer questions like “How many books from the 1990s?”

A dataset is the same: Raw data becomes useful when properly structured

The Data Science Pipeline

start Start gather Gather data   start->gather store Store it somewhere gather->store       clean Clean & pre-process store->clean       build Build a dataset clean->build       eda Exploratory analysis build->eda ml Machine learning eda->ml       insight Extract insights ml->insight       communicate Share results insight->communicate       end End communicate->end

The reality check: ~80% of your time will be in data preparation!

Real Dataset Example

UK MP Donations Database - A real-world example we’ll explore today

Where Can Datasets Live?

Storage Options

Single files or file collections

Think: spreadsheets, CSVs, JSON files, or a folder of logs.

  • ✅ Easy to share: Drag-and-drop between computers or email it.
  • ✅ Version control-friendly: Works well with Git or similar tools.
  • ✅ Portable: No database software needed.
  • ⚠️ Limited scale: Hard to manage hundreds of thousands of records efficiently.
  • ⚠️ No built-in query engine: Searching/filtering can be slow on big datasets.
  • ⚠️ Minimal relational support: Linking datasets requires manual joins.

Security: Depends on the device and file permissions; no automatic encryption.

Best for: small datasets, prototypes, one-off analyses.

Storage Options

Local databases

Think: SQLite, MySQL, PostgreSQL.

  • ✅ Fast queries: Optimized for filtering, sorting, and aggregations.
  • ✅ Supports relationships: Foreign keys, joins, and indexing.
  • ✅ Handles large datasets: Millions of rows without breaking a sweat.
  • ⚠️ Requires setup: Installation, configuration, schema design.
  • ⚠️ Mostly single-machine: Sharing requires exporting or server setup.
  • ⚠️ Maintenance overhead: Backups, updates, and security patches.

Security: Fully under your control — you manage access, backups, and encryption. Can be very secure if properly maintained.

Best for: structured, relational data, analytical workflows, multi-user local projects.

Storage Options

Cloud storage

Think: AWS S3, Google Cloud Storage, Azure Blob, BigQuery.

  • ✅ Highly scalable: From megabytes to petabytes.
  • ✅ Accessible anywhere: Collaborate in real time across locations.
  • ✅ Integrated ecosystem: Often comes with analytics, versioning, and sharing tools.
  • ⚠️ Dependent on internet: Offline access is limited.
  • ⚠️ Cost can grow: Storage + bandwidth + query costs add up.
  • ⚠️ Latency: Small queries may be slower than local access.

Security: Providers handle encryption, redundancy, and network security, but you rely on their policies and proper configuration (permissions, keys, access rules). Shared responsibility model applies.

Best for: massive datasets, team collaborations, projects needing scalable compute.

Storage Options

Understanding Data Structure

Structured Data Defined

📒 Structured data fits neatly into tables with rows and columns

Key characteristics:

  • Follows a predefined schema
  • Organized in a consistent format
  • Tabular form — rows as records and columns as variables — is what defines structured data
  • Easy to query and analyze

Note

Coming later: We’ll cover unstructured data (images, text, audio) in Week 09

Anatomy of Tabular Data

Let’s examine the MP donations dataset:

Date Member Entity Entity Category Value (in £) Nature
09/12/2024 John Milne Isabella Tree Individual 2,500.00 cash
09/12/2024 Robert Jenrick Colin Moynihan Individual 2,500.00 cash
09/12/2024 Mr James Cleverly IPGL (HOLDINGS) LIMITED Company 10,000.00 cash
04/12/2024 Jeremy Corbyn We Deserve Better Unincorporated Association 5,000.00 cash
09/11/2024 Richard Baker Community Union Trade Union 4,000.00 cash

View full dataset

Question: What data types do you see here?

Spotting Numeric Data

Date Member Entity Entity Category Value (in £) Nature
09/12/2024 John Milne Isabella Tree Individual 2,500.00 cash
09/12/2024 Robert Jenrick Colin Moynihan Individual 2,500.00 cash
09/12/2024 Mr James Cleverly IPGL (HOLDINGS) LIMITED Company 10,000.00 cash
04/12/2024 Jeremy Corbyn We Deserve Better Unincorporated Association 5,000.00 cash
09/11/2024 Richard Baker Community Union Trade Union 4,000.00 cash

Numeric data

  • Numbers we can calculate with

Identifying Categorical Data

Date Member Entity Entity Category Value (in £) Nature
09/12/2024 John Milne Isabella Tree Individual 2,500.00 cash
09/12/2024 Robert Jenrick Colin Moynihan Individual 2,500.00 cash
09/12/2024 Mr James Cleverly IPGL (HOLDINGS) LIMITED Company 10,000.00 cash
04/12/2024 Jeremy Corbyn We Deserve Better Unincorporated Association 5,000.00 cash
09/11/2024 Richard Baker Community Union Trade Union 4,000.00 cash

String/Categorical data

  • Text representing categories or labels

Recognizing Temporal Data

Date Member Entity Entity Category Value (in £) Nature
09/12/2024 John Milne Isabella Tree Individual 2,500.00 cash
09/12/2024 Robert Jenrick Colin Moynihan Individual 2,500.00 cash
09/12/2024 Mr James Cleverly IPGL (HOLDINGS) LIMITED Company 10,000.00 cash
04/12/2024 Jeremy Corbyn We Deserve Better Unincorporated Association 5,000.00 cash
09/11/2024 Richard Baker Community Union Trade Union 4,000.00 cash

Date/Time data

  • Temporal information (spoiler: this gets complicated!)

Why Care About Data Types?

Different data types unlock different operations:

  • Numbers → compute averages, sums, ranges (e.g., sales totals)

  • Dates → sort chronologically, calculate durations (e.g., time between events)

  • Categories → represent a fixed set of labels (e.g., country, gender, product type)

    • Easier to group, count, or model
    • More memory-efficient than storing them as plain text
  • Text → search, extract, or match patterns (e.g., find all “error” messages)

Get the type wrong — and the logic breaks:

  • "2,500" stored as text → can’t calculate totals
  • "2023-01-15" as text → won’t sort chronologically
  • "France" stored as text instead of category → wastes memory, harder to group
  • "London" as a number → average of cities? nonsense!

Efficiency matters too:

  • Numeric and categorical types use less space than text
  • Proper types make storage and computation faster

Example: Strings vs. Categorical Data

City (string)

City
“London”
“Paris”
“London”

As plain text:

  • Each value stored as a separate string in memory
  • Grouping or comparison scans the full text repeatedly
  • Less efficient for grouping and analysis

City (categorical)

City
London 🏙️
Paris 🗼
Berlin 🐻

As categorical:

  • Internally stored as integer codes: 0, 1, 0, 2
  • Labels mapped: {0: "London", 1: "Paris", 2: "Berlin"}
  • Faster grouping, filtering, and memory savings

Deep Dive: Data Types

The Foundation: Bits and Bytes

The building blocks:

  • A bit = 0 or 1
  • With n bits → 2n possible values
  • 4 bits = 16 combinations →
Bits Value
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
Bits Value
1000 8
1001 9
1010 10
1011 11
1100 12
1101 13
1110 14
1111 15

Data Size Units

Unit Abbr. Size Real-world example
bit b 1 bit A single light switch
Byte B 8 bits One character (‘A’)
Kilobyte KB 1,024 bytes A paragraph of text
Megabyte MB 1,024 KB A high-res photo
Gigabyte GB 1,024 MB An HD movie
Terabyte TB 1,024 GB 500 hours of HD video
Petabyte PB 1,024 TB All of Netflix’s content

Numbers: Integers

Whole numbers (ℤ): no decimal points

  • Examples: -2, -1, 0, +1, +2
  • Size matters for range:
    • 8-bit: -128 to +127
    • 16-bit: -32,768 to +32,767
    • 32-bit: ±2.1 billion
    • 64-bit: ±9.2 quintillion

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Ariane 5 Rocket (1996):

  • Horizontal velocity calculated as 64-bit number
  • Converted to 16-bit signed integer → overflow
  • Guidance system crashed 39s after launch
  • Cost: rocket exploded, $370 million loss
  • Video explanation

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Boeing 787 Dreamliner (2015):

  • Engine control uptime stored as 32-bit signed integer (hundredths of a second)
  • After 248 days → counter reached max (2,147,483,647) → failsafe triggered
  • Engines could automatically shut down
  • Mitigation: periodic reboots and software fix

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Patriot Missile System (1991):

  • Clock measured in tenths of seconds, converted to floating-point
  • Small rounding errors accumulated over 100+ hours
  • Mis-tracked incoming Scud missile
  • Consequence: missile hit US Army barracks, 28 killed, 98 injured

When Numbers Hit Their Limits

Source: (Baraniuk 2015)

Key lessons:

  • Data type limits can crash systems or cause critical errors
  • Overflow or precision drift is subtle but dangerous
  • Design for scale, and monitor counters/precision closely

Gangnam Style Breaks YouTube

Source: (BBC News 2014)

What happened (2014):

  • YouTube view counter: 32-bit signed integer
  • Maximum: 2,147,483,647 views
  • Gangnam Style exceeded this
  • Counter displayed negative values!

YouTube’s fix:

  • Upgraded to 64-bit integers
  • New maximum: 9,223,372,036,854,775,807
  • (Over 9 quintillion views)

Key lessons:

  • Design systems for unexpected scale
  • Monitor approaching limits
  • Plan for growth beyond estimates
  • Users notice when counters break!

Numbers: Floating-Point

Real numbers (ℝ): includes decimals

  • Examples: -232.32, 3.14159, 1,402.92
  • Usually represented as:
    • float (32-bit)
    • double (64-bit)
  • Important: These are approximations, not exact values

Example of rounding errors:

>>> 0.1 + 0.2
0.30000000000000004  # Not 0.3!

>>> 0.1 + 0.2 == 0.3
False  # Dangerous for comparisons!

Why? Binary can’t represent 0.1 exactly (like 1/3 in decimal = 0.333…)

Practical impact: Never compare floats with ==. Instead use:

abs(a - b) < 0.0001  # Check if "close enough"

Text: Characters and Strings

Character (char)

  • A single symbol: 'B', '!', '🎉'
  • Wrapped in quotes

String

  • A sequence of characters
  • Example: "Data Science is fascinating"
  • Also wrapped in quotes

String Quote Options

Quote Flexibility in Programming

Single vs. double quotes – pick your flavor!

  • Most languages (Python, JavaScript, Ruby) happily accept either:
name1 = 'Alice'
name2 = "Bob"  # Both valid – no judgment here!
  • ⚠️ Watch out: some languages like C treat them differently ('A' vs "A").

String Quote Options

Quote Flexibility in Programming

Triple quotes in Python – the “multiline superhero”:

query = """
SELECT name, age
FROM students
WHERE grade > 80
"""

poem = '''
Roses are red,
Violets are blue,
Data is great,
And Python is too!
'''
  • Can store big text blocks without crying over line breaks or escaping quotes.

String Quote Options

Quote Flexibility in Programming

Why triple quotes are your friend:

  • Keep formatting intact (no messy \ns)
  • No need to escape quotes inside (your keyboard will thank you)
  • Ideal for SQL, documentation, or fancy ASCII art
  • Python loves them for docstrings – even your future self will thank you! 😎

How Text Becomes Binary: ASCII

The ASCII table encodes 128 characters using 7 bits

Dec Char Description
32 space Space
33 ! exclamation mark
48 0 zero
49 1 one
65 A Uppercase A
66 B Uppercase B
97 a Lowercase a
98 b Lowercase b

Key insight:

Every character you type is converted to numbers!

Example: “Hi” = 72 + 105 in ASCII

The ASCII table

Dec Binary Char Description
0 0000000 NUL Null
1 0000001 SOH Start of Header
2 0000010 STX Start of Text
3 0000011 ETX End of Text
4 0000100 EOT End of Transmission
5 0000101 ENQ Enquiry
6 0000110 ACK Acknowledge
7 0000111 BEL Bell
8 0001000 BS Backspace
9 0001001 HT Horizontal Tab
10 0001010 LF Line Feed
11 0001011 VT Vertical Tab
12 0001100 FF Form Feed
13 0001101 CR Carriage Return
14 0001110 SO Shift Out
15 0001111 SI Shift In
16 0010000 DLE Data Link Escape
17 0010001 DC1 Device Control 1
18 0010010 DC2 Device Control 2
19 0010011 DC3 Device Control 3
20 0010100 DC4 Device Control 4
21 0010101 NAK Negative Acknowledge
22 0010110 SYN Synchronize
23 0010111 ETB End of Transmission Block
24 0011000 CAN Cancel
25 0011001 EM End of Medium
26 0011010 SUB Substitute
27 0011011 ESC Escape
28 0011100 FS File Separator
29 0011101 GS Group Separator
30 0011110 RS Record Separator
31 0011111 US Unit Separator
Dec Binary Char Description
32 0100000 space Space
33 0100001 ! exclamation mark
34 0100010 double quote
35 0100011 # number
36 0100100 $ dollar
37 0100101 % percent
38 0100110 & ampersand
39 0100111 single quote
40 0101000 ( left parenthesis
41 0101001 ) right parenthesis
42 0101010 * asterisk
43 0101011 + plus
44 0101100 , comma
45 0101101 - minus
46 0101110 . period
47 0101111 / slash
48 0110000 0 zero
49 0110001 1 one
50 0110010 2 two
51 0110011 3 three
52 0110100 4 four
53 0110101 5 five
54 0110110 6 six
55 0110111 7 seven
56 0111000 8 eight
57 0111001 9 nine
58 0111010 : colon
59 0111011 ; semicolon
60 0111100 < less than
61 0111101 = equality sign
62 0111110 > greater than
63 0111111 ? question mark
Dec Binary Char Description
64 1000000 @ at sign
65 1000001 A
66 1000010 B
67 1000011 C
68 1000100 D
69 1000101 E
70 1000110 F
71 1000111 G
72 1001000 H
73 1001001 I
74 1001010 J
75 1001011 K
76 1001100 L
77 1001101 M
78 1001110 N
79 1001111 O
80 1010000 P
81 1010001 Q
82 1010010 R
83 1010011 S
84 1010100 T
85 1010101 U
86 1010110 V
87 1010111 W
88 1011000 X
89 1011001 Y
90 1011010 Z
91 1011011 [ left square bracket
92 1011100 \ backslash
93 1011101 ] right square bracket
94 1011110 ^ caret / circumflex
95 1011111 _ underscore
Dec Binary Char Description
96 1100000 ` grave / accent
97 1100001 a
98 1100010 b
99 1100011 c
100 1100100 d
101 1100101 e
102 1100110 f
103 1100111 g
104 1101000 h
105 1101001 i
106 1101010 j
107 1101011 k
108 1101100 l
109 1101101 m
110 1101110 n
111 1101111 o
112 1110000 p
113 1110001 q
114 1110010 r
115 1110011 s
116 1110100 t
117 1110101 u
118 1110110 v
119 1110111 w
120 1111000 x
121 1111001 y
122 1111010 z
123 1111011 { left curly bracket
124 1111100
125 1111101 } right curly bracket
126 1111110 ~ tilde
127 1000001 DEL delete

ASCII Art: When Text Becomes Visual

      |\      _,,,---,,_
ZZZzz /,`.-'`'    -.  ;-;;,_
     |,4-  ) )-,_. ,\ (  `'-'
    '---''(_/--'  `-'\_) 

Sleeping cat by Felix Lee

  ;)( ;
 :----:     o8Oo./
C|====| ._o8o8o8Oo_.
 |    |  \========/
 `----'   `------'

Coffee by Hayley Jane Wakenshaw

Beyond ASCII: Modern Text Encoding

Other encoding standards:

  • UTF-8 (most common today)
    • Supports all languages
    • Backward compatible with ASCII
  • UTF-16, UTF-32
  • ISO-8859-1 (Latin-1)

Encoding Mismatches

Ever seen text like “Nestlé” instead of “Nestlé”?

That’s an encoding mismatch between UTF-8 and Latin-1!

Emojis Are Text Too! 😊

Emojis are part of the UTF-8 standard and can be used in data analysis!

Emojis in Data Science

Reference: (Felbo et al. 2017)

Research shows emojis improve sentiment analysis accuracy!

Dates and Times: Deceptively Complex

Date formats vary globally:

  • DD-MM-YYYY (UK: 06-10-2025)
  • MM-DD-YYYY (US: 10-06-2025)
  • YYYY-MM-DD (ISO: 2025-10-06) ✓ Recommended
  • YYYYMMDD (Compact: 20251006)

Timestamps add time and timezone:

  • 2025-10-06T14:30:00+01:00

Important

Always use ISO 8601 format (YYYY-MM-DD) when possible!

The Y2K Problem

Reference: (Oren 2019)

The issue:

  • Dates stored as YY not YYYY
  • 31/12/9901/01/00
  • Is that 1900 or 2000?

The result:

  • Billions spent on fixes
  • Global panic (largely avoided due to preparation)

More on the topic:

The Next Date Crisis: Y2038

Reference: (Gibbs 2014a)

Unix timestamps count seconds since January 1, 1970

32-bit systems will overflow on: January 19, 2038 at 03:14:07 UTC

Solution: Upgrade to 64-bit systems (happening now)

Working with Dates in Practice

The challenge:

  • Global format variations
  • Mixed formats in same dataset
  • Time zones and daylight saving
  • Leap years and calendar quirks

The solution: Use specialized libraries

  • R: lubridate package
  • Python: datetime module

These handle the complexity for you!

Break Time! ☕

Coming up after the break:

  • File formats demystified
  • Tidy data principles
  • Excel horror stories
  • Hands-on examples

Common File Formats

CSV: The Universal Format

Comma-Separated Values - the workhorse of data exchange

Characteristics:

  • Plain text format (.csv)
  • First row = column headers
  • Commas separate values (usually!)
  • Human-readable

Why use CSV?

  • Works everywhere
  • Simple to inspect
  • Version control friendly
  • No vendor lock-in

CSV in Action

Let’s see how our MP donations look as CSV:

Date,Member,Entity,Entity Category,Value,Nature
09/12/2024,John Milne,Isabella Tree,Individual,2500.00,cash
09/12/2024,Robert Jenrick,Colin Moynihan,Individual,2500.00,cash
09/12/2024,Mr James Cleverly,IPGL HOLDINGS LIMITED,Company,10000.00,cash
04/12/2024,Jeremy Corbyn,We Deserve Better,Unincorporated Association,5000.00,cash
09/11/2024,Richard Baker,Community Union,Trade Union,4000.00,cash

Live demo: Let’s open the MP donation database

Let’s open Camden open data

XML: Structured and Semantic

eXtensible Markup Language

Key features:

  • Uses tags: < >
  • Self-describing
  • Hierarchical structure
  • Industry standard
  • Files end with .xml
<?xml version="1.0" encoding="UTF-8"?>
<donation>
  <date>09/12/2024</date>
  <member>John Milne</member>
  <entity>Isabella Tree</entity>
  <category>Individual</category>
  <value currency="GBP">2500.00</value>
  <nature>cash</nature>
</donation>

XML Powers Office Documents

Note

Did you know? Microsoft Office files (.docx, .xlsx, .pptx) are actually ZIP files containing XML!

Try it: Rename a .docx to .zip and extract it

Real example: Let’s explore Camden open data XML →

HTML: The Web’s Language

HyperText Markup Language - like XML but for web pages

<!DOCTYPE html>
<html>
<head>
  <title>Data Science</title>
</head>
<body>
  <h1>Welcome!</h1>
  <p>This is HTML</p>
</body>
</html>

Every website uses HTML

  • Structured like XML
  • Designed for browsers
  • Can contain data
  • Scrape-able (with permission!)

JSON: The Modern Standard

JavaScript Object Notation - flexible and powerful

{
  "MP_donations": [
    {
      "date": "09/12/2024",
      "member": "John Milne",
      "entity": "Isabella Tree",
      "category": "Individual",
      "value": 2500.00,
      "nature": "cash"
    }
  ]
}

Why JSON is popular:

  • Web APIs use it extensively
  • Native to JavaScript
  • Supports nested structures
  • Human-readable and machine-parseable

JSON: Visualized

Flat structure - Simple and direct

JSON: Nested Structures

JSON can handle complex hierarchies:

{
  "MP_donations": [
    {
      "date": "09/12/2024",
      "nature": "cash",
      "donations": [
        {
          "member": "John Milne",
          "entity": "Isabella Tree",
          "category": "Individual",
          "value": 2500.00
        },
        {
          "member": "Robert Jenrick",
          "entity": "Colin Moynihan",
          "category": "Individual",
          "value": 2500.00
        }
      ]
    }
  ]
}

Nested JSON Visualized

Nested structure

  • Grouping related data

JSON vs XML: Compactness Compared

Same three donations, different formats:

XML (verbose):

<donations>
  <donation>
    <date>09/12/2024</date>
    <member>John Milne</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>Robert Jenrick</member>
    <value>2500</value>
  </donation>
  <donation>
    <date>09/12/2024</date>
    <member>James Cleverly</member>
    <value>10000</value>
  </donation>
</donations>

367 characters

JSON (compact):

{
  "donations": [
    {
      "date": "09/12/2024",
      "member": "John Milne",
      "value": 2500
    },
    {
      "date": "09/12/2024",
      "member": "Robert Jenrick",
      "value": 2500
    },
    {
      "date": "09/12/2024",
      "member": "James Cleverly",
      "value": 10000
    }
  ]
}

268 characters (27% smaller)

Why this matters: At 1 million records, JSON saves ~99MB in storage/bandwidth

File Format Best Practices

For sharing data, prefer:

CSV

  • Simple tabular data (single table, flat structure)
  • Best for: Spreadsheet-like data, time series, survey results
  • Example: Sales records, sensor data, student grades

JSON

  • Nested/hierarchical data with relationships
  • Best for: API responses, configuration files, complex data structures
  • Example: Social media posts with comments, product catalogs with variants

XML

  • Industry-mandated standards and document exchange
  • Best for: Healthcare (HL7), finance (FpML), government data interchange
  • Example: Legal documents, RSS feeds, SOAP web services

Plain text

  • Always human-readable, future-proof
  • Best for: Documentation, logs, simple configurations

Avoid proprietary formats for sharing:
.xls, .sas7bdat, .rdata, .rds, software-specific formats (limit accessibility)

Why? Open formats ensure your data remains accessible regardless of software availability

Tidy Data Philosophy

Returning to Our Pipeline

start Start gather Gather data   start->gather store Store it somewhere gather->store       clean Clean & pre-process store->clean       build Build a dataset clean->build       eda Exploratory analysis build->eda ml Machine learning eda->ml       insight Extract insights ml->insight       communicate Share results insight->communicate       end End communicate->end

Building a tidy dataset is crucial for everything that follows!

What Makes Data Messy?

🤔 Discussion: What’s wrong with this spreadsheet for data analysis?

Take 30 seconds to think about it…

The Excel Problem

Excel is great for:

  • Quick data entry
  • Simple calculations
  • Basic visualization
  • Non-technical users

But dangerous for:

  • Data science workflows
  • Reproducible analysis
  • Large datasets
  • Collaborative projects

Let’s see why with some real examples…

Group Discussion Activity

Time for reflection! (10 minutes total)

  1. Break into assigned groups (~5 min reading)
  2. Group summary (~5 min discussion)
    • What went wrong?
    • What were the consequences?
    • How could it have been prevented?

Horror Story #1: Gene Names

Source: (Vincent 2020)

What happened:

  • Excel auto-converted gene names to dates
  • MARCH1 (short for “Membrane Associated Ring-CH-Type Finger 1”) → 1-Mar
  • SEPT2 (short for “Septin 2”)→ 2-Sep

The scale:

The fix: 27 genes officially renamed (MARCH1 → MARCHF1)

Horror Story #2: COVID-19 Data Loss

Source: (Hern 2020)

The incident:

  • UK Public Health England used Excel as exchange format with local trusts
  • Hit the 65,536 row limit
  • Lost 16,000 COVID cases
  • Delayed contact tracing

Impact: Lives at risk

Horror Story #3: Economic Policy

Reinhart-Rogoff spreadsheet error:

  • Influential economics paper
  • Original results showed average real economic growth slows (a 0.1% decline) when a country’s debt rises to more than 90% of gross domestic product
  • Influenced austerity policies globally
  • But, Excel formula excluded 5 rows (i.e 5 countries: Australia, Austria, Belgium, Canada and Denmark)
  • Wrong conclusions about debt/GDP (when that error was corrected, the “0.1% decline” data became a 2.2% average increase in economic growth)

Lesson: Always check your formulas!

Horror Story #4: The London Whale

Source: (Kwak 2013)

Horror Story #4: The London Whale

JP Morgan VaR Model (2012)

  • Built entirely in Excel
  • Manual copy-paste & fragile formulas
  • Formula error: divided by sum instead of average → VaR understated
  • Model review flagged issues but flaws never fixed
  • Resulted in $6.2 billion trading loss

Lesson:

Even sophisticated financial models fail on fragile, manually managed software. Critical calculations need robust, auditable systems.

Horror Story #5: Microsoft and Finance

Source: (Beales 2013)

A pattern of Excel-related failures:

  1. 2007 CPDOs (i.e Constant proportion debt obligations): Moody’s coding error inflated structured finance ratings (pre-crisis)

  2. 2012 London Whale: JPMorgan’s Excel-based risk model failures ($6.2B loss)

  3. 2013 Reinhart-Rogoff: Spreadsheet error undermined influential austerity research

The pattern: Office tools used as substitutes for proper systems and critical thinking

The Excel Takeaway

Excel is NOT for:

  • Serious data science
  • Reproducible analysis
  • Large-scale data processing
  • Production systems

Excel IS fine for:

  • Quick data exploration
  • Small personal datasets
  • Initial data entry
  • Simple calculations

Just know its limitations!

Beyond Excel Horror: Real Messiness

The “Excel horror stories” show one type of problem…

But there’s another kind of “messy”:

Data that’s technically correct but structured in ways that make analysis difficult or impossible

This is where “tidy data” principles become essential

What Is Tidy Data?

A specific structure that makes analysis easier

Credit: Allison Horst Stats Illustrations

The Three Rules of Tidy Data

  1. Each variable forms a column
    • One column = one type of measurement
    • Example violation: Multiple years as separate columns
  2. Each observation forms a row
    • One row = one distinct data point
    • Example violation: Multiple measurements in one cell
  3. Each type of observational unit forms a table
    • Related data stays together
    • Example violation: Mixing patient data with hospital data

The Three Rules of Tidy Data

Warning

Real-world data commonly violates these rules! Examples:

  • Column headers are values, not variable names (e.g., “2020”, “2021”, “2022”)
  • Multiple variables in one column (e.g., “Country_Year”)
  • Multiple types of observations in same table (e.g., mixing individual and summary statistics)

Tidy vs. Messy: Example

Messy (wide format):

| Patient | Baseline_BP | Week1_BP | Week2_BP |
|---------|-------------|----------|----------|
| John    | 140         | 135      | 130      |
| Mary    | 150         | 148      | 145      |

Problem: Time periods are columns, not observations. Hard to plot or calculate trends.

Tidy (long format):

| Patient | Week     | Blood_Pressure |
|---------|----------|----------------|
| John    | Baseline | 140            |
| John    | Week1    | 135            |
| John    | Week2    | 130            |
| Mary    | Baseline | 150            |
| Mary    | Week1    | 148            |
| Mary    | Week2    | 145            |

Now we can easily: group by patient, plot BP over time, calculate average change

Important Caveat: Context Matters

“Tidy” isn’t always optimal

Sometimes you need different structures for:

  • Performance (e.g., matrix operations in ML)
  • Specialized domains (e.g., genomics formats)
  • Specific algorithms (e.g., sparse matrices)

The key: Understand why tidy data is useful, then make informed choices about when to deviate

Further reading: Chapter 12, Section 12.7 in (Wickham and Grolemund 2016), (Leek 2016)

Flat Files vs. Databases

Feature Flat File Database
Structure Simple, unstructured format with records stored in plain text files Highly structured format with tables, rows, and columns
Organization Records are stored in a single file or multiple files, but there is no relationship between them Records are organized into tables, with relationships established between tables through keys and indexes
Access Data is accessed by reading the entire file sequentially, making it less efficient for large datasets Data is accessed using SQL (Structured Query Language) commands, which are more efficient for large datasets and allow for more advanced querying and manipulation
Scalability Limited scalability, as adding more data requires creating new files or appending to existing ones High scalability, as data can be easily added, updated, and deleted without affecting the overall structure of the database
Security Limited built-in security features, and data is vulnerable to corruption or loss if the file is not backed up properly Built-in security features, such as user accounts and permissions, and data is protected against corruption or loss through automatic backups and transaction logs
Examples CSV, TSV files SQL databases, like MySQL, Oracle, SQL Server

Source: DatabaseTown

Rule of thumb: Start with CSV, graduate to databases as complexity grows

Key Takeaways

  1. Data types matter: Understand bits, bytes, and how data is represented

  2. Choose the right format: CSV for simplicity, JSON for flexibility

  3. Dates are tricky: Use ISO 8601 format and proper libraries

  4. Excel has limits: Great for exploration, dangerous for production

  5. Tidy data enables analysis: Structure your data for the tools you’ll use

  6. Context matters: Rules are guidelines, not absolute laws

Looking Ahead

Next week:

  • What is computational thinking?
  • What do we mean by algorithms?

Resources & Further Reading

Explore:

Practice:

Resources & Further Reading

Essential reading:

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann. 2021. “Gene Name Errors: Lessons Not Learned.” PLoS Computational Biology 17 (7): e1008984.
Bailey, David H., and Jonathan Borwein (Jon). 2013. “The Reinhart-Rogoff Error – or How Not to Excel at Economics. The Conversation.” April 22, 2013. http://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646.
Baraniuk, Chris. 2015. “The Number Glitch That Can Lead to Catastrophe.” BBC Future, May. https://www.bbc.com/future/article/20150505-the-numbers-that-lead-to-disaster.
BBC News. 2014. “Gangnam Style Music Video ’Broke’ YouTube View Limit.” BBC News, December. https://www.bbc.co.uk/news/world-asia-30288542.
Beales, Richard. 2013. “Blame Microsoft.” New York Times (Online). 2013. https://www.proquest.com/blogs-podcasts-websites/blame-microsoft/docview/2215333604/se-2.
Felbo, Bjarke, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. “Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1615–25. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1169.
Gibbs, Samuel. 2014a. “Is the Year 2038 Problem the New Y2K Bug?” The Guardian (London). https://www.theguardian.com/technology/2014/dec/17/is-the-year-2038-problem-the-new-y2k-bug.
———. 2014b. “Y2K Bug Triggers Army Conscription Notices Sent to 14,000 Dead Men.” The Guardian (London). https://www.theguardian.com/technology/2014/jul/11/y2k-bug-us-army-conscription-1800s-pennsylvania.
Hern, Alex. 2020. “Covid: How Excel May Have Caused Loss of 16,000 Test Results in England.” The Guardian. https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england.
Kwak, James. 2013. “The Importance of Excel.” "The Baseline Scenario" Blog, February. https://baselinescenario.com/2013/02/09/the-importance-of-excel/.
Leek, Jeff. 2016. “Non-Tidy Data.” "Simply Statistics" Blog. https://simplystatistics.org/posts/2016-02-17-non-tidy-data/.
Oren, Nir. 2019. “If You Think the Millennium Bug Was a Hoax, Here Comes a History Lesson.” The Conversation. http://theconversation.com/if-you-think-the-millennium-bug-was-a-hoax-here-comes-a-history-lesson-129042.
Reuters. 1999. “Nuclear ’Glitches’ in Japan.” Wired, December. https://www.wired.com/1999/12/nuclear-glitches-in-japan/.
Ted Bridis. 2000. “’Business as Usual’ in Y2K. Standard-Times.” January 3, 2000. https://www.southcoasttoday.com/story/news/nation-world/2000/01/03/business-as-usual-in/50500031007/.
“The Excel Depression.” 2013. New York Times (Online). https://gate3.library.lse.ac.uk/login?url=https://www.proquest.com/docview/2215248827?pq-origsite=primo&accountid=9630.
Thomas, Martyn. 2019. “The Millennium Bug Was Real – and 20 Years Later We Face the Same Threats.” The Guardian (Online). https://www.theguardian.com/commentisfree/2019/dec/31/millennium-bug-face-fears-y2k-it-systems.
Uenuma, Francine. 2019. “Years Later, the Y2K Bug Seems Like a Joke–Because Those Behind the Scenes Took It Seriously.” Time. https://time.com/5752129/y2k-bug-history/.
Vincent, James. 2020. “Scientists Rename Human Genes to Stop Microsoft Excel from Misreading Them as Dates.” The Verge, August. https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly. https://r4ds.had.co.nz/.
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1): 1–3.

Questions?

Office hours: Check StudentHub/course website for times

Discussion forum: Post questions on Slack for peer support or for support from teachers

Email or Slack private messages: For personal queries (prefer Slack as there is some latency in responding to emails)

Remember: Your first summative is due Week 04 (October 21st)