🗓️ Week 02
Data types and the concept of tidy data

DS101 – Fundamentals of Data Science

02 Oct 2023

Dataset

📒 A dataset is simply a collection of data.

We need organised datasets

start Start gather Gather data   start->gather store Store it          somewhere gather->store       clean Clean &         pre-process store->clean       build Build a dataset clean->build       eda Exploratory     data analysis build->eda ml Machine learning eda->ml       insight Obtain    insights ml->insight       communicate Communicate results          insight->communicate       end End communicate->end

An example

Now we will start distinguishing:​

  • the types of data we can find in a dataset;​
  • the best practices to assemble a dataset.​

Structured data

📒 Structured data is data that can be organised in a tabular format.

  • Data that adheres to a pre-defined data model. ​
  • As the name suggests, it has a structure.
  • Tabular data is a common data structure ⏭️

Example

London Air air pollution monitoring dataset

Site Species ReadingDateTime Value Units Provisional or Ratified
BL0 PM2.5 19/08/2023 20:00 5.9 ug m-3 P
BL0 PM2.5 19/08/2023 21:00 6.1 ug m-3 P
BL0 PM2.5 19/08/2023 22:00 5.8 ug m-3 P
BL0 PM2.5 19/08/2023 23:00 4.9 ug m-3 P
BL0 SO2 19/08/2023 20:00 1.5 ug m-3 P
BL0 SO2 19/08/2023 21:00 0.9 ug m-3 P
BL0 SO2 19/08/2023 22:00 1.3 ug m-3 P
BL0 SO2 19/08/2023 23:00 1.3 ug m-3 P
BL0 NO2 19/08/2023 20:00 10.3 ug m-3 P
BL0 NO2 19/08/2023 21:00 13 ug m-3 P
BL0 NO2 19/08/2023 22:00 14.7 ug m-3 P
BL0 NO2 19/08/2023 23:00 14.2 ug m-3 P

🔗Link to data

Example (cont.)

Data type: numeric

Site Species ReadingDateTime Value Units Provisional or Ratified
BL0 PM2.5 19/08/2023 20:00 5.9 ug m-3 P
BL0 PM2.5 19/08/2023 21:00 6.1 ug m-3 P
BL0 PM2.5 19/08/2023 22:00 5.8 ug m-3 P
BL0 PM2.5 19/08/2023 23:00 4.9 ug m-3 P
BL0 SO2 19/08/2023 20:00 1.5 ug m-3 P
BL0 SO2 19/08/2023 21:00 0.9 ug m-3 P
BL0 SO2 19/08/2023 22:00 1.3 ug m-3 P
BL0 SO2 19/08/2023 23:00 1.3 ug m-3 P
BL0 NO2 19/08/2023 20:00 10.3 ug m-3 P
BL0 NO2 19/08/2023 21:00 13 ug m-3 P
BL0 NO2 19/08/2023 22:00 14.7 ug m-3 P
BL0 NO2 19/08/2023 23:00 14.2 ug m-3 P

Example (cont.)

Data type: string (text) but could be treated as categorical

Site Species ReadingDateTime Value Units Provisional or Ratified
BL0 PM2.5 19/08/2023 20:00 5.9 ug m-3 P
BL0 PM2.5 19/08/2023 21:00 6.1 ug m-3 P
BL0 PM2.5 19/08/2023 22:00 5.8 ug m-3 P
BL0 PM2.5 19/08/2023 23:00 4.9 ug m-3 P
BL0 SO2 19/08/2023 20:00 1.5 ug m-3 P
BL0 SO2 19/08/2023 21:00 0.9 ug m-3 P
BL0 SO2 19/08/2023 22:00 1.3 ug m-3 P
BL0 SO2 19/08/2023 23:00 1.3 ug m-3 P
BL0 NO2 19/08/2023 20:00 10.3 ug m-3 P
BL0 NO2 19/08/2023 21:00 13 ug m-3 P
BL0 NO2 19/08/2023 22:00 14.7 ug m-3 P
BL0 NO2 19/08/2023 23:00 14.2 ug m-3 P

Example (cont.)

Data type: date

Site Species ReadingDateTime Value Units Provisional or Ratified
BL0 PM2.5 19/08/2023 20:00 5.9 ug m-3 P
BL0 PM2.5 19/08/2023 21:00 6.1 ug m-3 P
BL0 PM2.5 19/08/2023 22:00 5.8 ug m-3 P
BL0 PM2.5 19/08/2023 23:00 4.9 ug m-3 P
BL0 SO2 19/08/2023 20:00 1.5 ug m-3 P
BL0 SO2 19/08/2023 21:00 0.9 ug m-3 P
BL0 SO2 19/08/2023 22:00 1.3 ug m-3 P
BL0 SO2 19/08/2023 23:00 1.3 ug m-3 P
BL0 NO2 19/08/2023 20:00 10.3 ug m-3 P
BL0 NO2 19/08/2023 21:00 13 ug m-3 P
BL0 NO2 19/08/2023 22:00 14.7 ug m-3 P
BL0 NO2 19/08/2023 23:00 14.2 ug m-3 P

Basic data types

Bits and Bytes

  • Basic unit of data:​ a bit
  • Either a 0 or 1​
  • With \(n\) bits, one can represent \(2^n\) combinations
  • For example, with \(n=4\) bits we can represent \(16\) unique values ➡️
Bits Value
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
Bits Value
1000 8
1001 9
1010 10
1011 11
1100 12
1101 13
1110 14
1111 15​

Units of data

Unit of measurement​ Abbr. Conversion​
bit ​ b ​ 1 bit​
nibble​ -​ 4 bits (or half a byte)​
Byte​ B​ 8 bits​
Kilobyte​ KB​ 1024 bytes​
Megabyte​ MB​ 1024 kilobytes (or 1048576 bytes)​
Gigabyte​ GB​ 1024 megabytes​
Terabyte​ TB​ 1024 gigabytes​
Petabyte​ PB​ 1024 terabytes​
Exabyte EB 1024 petabytes
Zettabyte ZB 1024 exabytes
Yottabyte YB 1024 zettabytes

How are numbers represented?

Whole Numbers (\(\mathbb{Z}\))

  • Integers (int): Numbers that do not contain a decimal digit
  • Examples: \(-2, -1, 0, +1, +2\)
  • What range of numbers do you need?
    • 8-bits (\(2^8\) numbers): range from \(-128\) to \(+127\)
    • 16-bits (\(2^{16}\) numbers): range from \(-32,678\) to \(+32,767\)
    • 32-bits (\(2^{32}\) numbers): range from \(-2,147,483,648\) to \(+2,147,483,647\)
    • 64-bits (\(2^{64}\) numbers): range from \(-9,223,372,036,854,775,808\) to \(+9,223,372,036,854,775,807\)

What could possibly go wrong?

Source: (Baraniuk 2015)

Source: (BBC News 2014)

In practice, it is unlikely you will ever have to think about it!

Other types of numbers

Real Numbers (\(\mathbb{R}\)), either Rational (\(\mathbb{Q}\)) or Irrational (\(\mathbb{Q'}\))

  • In computing, these are represented by Floating-point numbers

    • Unlike the sets in mathematics, these are almost always approximate numbers
  • Random examples:

    • \(-232.32\)
    • \(+3.14159\)
    • \(0\)
    • \(+1,402.9202818779\)
  • Most commonly, represented by 32-bits (float) or 64-bits (double)

    • In the programming language R, these are called numeric
    • In the programming language Python, the most common type for such numbers is float (represented by 64-bits and not 32 bits unlike as in other languages e.g C). For more on Python floats, see 🔗this page

Text (data types)

  • Character (char):
    • a single character, as a represented letter, number, punctuation mark, emoji, symbol, etc.
    • Generally expressed wrapped by single quotes (') or double (") quotes in most tools and programming languages.
    • Examples: 'B', 'a', '!', '.'
  • String
    • a sequence of characters together. Raw text.
    • Typically represented under quotes, too.
    • Examples: "The text inside these quotes is a string."

Text (data types)

  • How is text represented in binary language?
    • The ASCII table is one of early standards
    • It encodes characters using 7-bits; therefore, it can represent a total of 128 unique characters

The ASCII table

Dec Binary Char Description
0 0000000 NUL Null
1 0000001 SOH Start of Header
2 0000010 STX Start of Text
3 0000011 ETX End of Text
4 0000100 EOT End of Transmission
5 0000101 ENQ Enquiry
6 0000110 ACK Acknowledge
7 0000111 BEL Bell
8 0001000 BS Backspace
9 0001001 HT Horizontal Tab
10 0001010 LF Line Feed
11 0001011 VT Vertical Tab
12 0001100 FF Form Feed
13 0001101 CR Carriage Return
14 0001110 SO Shift Out
15 0001111 SI Shift In
16 0010000 DLE Data Link Escape
17 0010001 DC1 Device Control 1
18 0010010 DC2 Device Control 2
19 0010011 DC3 Device Control 3
20 0010100 DC4 Device Control 4
21 0010101 NAK Negative Acknowledge
22 0010110 SYN Synchronize
23 0010111 ETB End of Transmission Block
24 0011000 CAN Cancel
25 0011001 EM End of Medium
26 0011010 SUB Substitute
27 0011011 ESC Escape
28 0011100 FS File Separator
29 0011101 GS Group Separator
30 0011110 RS Record Separator
31 0011111 US Unit Separator
Dec Binary Char Description
32 0100000 space Space
33 0100001 ! exclamation mark
34 0100010 double quote
35 0100011 # number
36 0100100 $ dollar
37 0100101 % percent
38 0100110 & ampersand
39 0100111 single quote
40 0101000 ( left parenthesis
41 0101001 ) right parenthesis
42 0101010 * asterisk
43 0101011 + plus
44 0101100 , comma
45 0101101 - minus
46 0101110 . period
47 0101111 / slash
48 0110000 0 zero
49 0110001 1 one
50 0110010 2 two
51 0110011 3 three
52 0110100 4 four
53 0110101 5 five
54 0110110 6 six
55 0110111 7 seven
56 0111000 8 eight
57 0111001 9 nine
58 0111010 : colon
59 0111011 ; semicolon
60 0111100 < less than
61 0111101 = equality sign
62 0111110 > greater than
63 0111111 ? question mark
Dec Binary Char Description
64 1000000 @ at sign
65 1000001 A
66 1000010 B
67 1000011 C
68 1000100 D
69 1000101 E
70 1000110 F
71 1000111 G
72 1001000 H
73 1001001 I
74 1001010 J
75 1001011 K
76 1001100 L
77 1001101 M
78 1001110 N
79 1001111 O
80 1010000 P
81 1010001 Q
82 1010010 R
83 1010011 S
84 1010100 T
85 1010101 U
86 1010110 V
87 1010111 W
88 1011000 X
89 1011001 Y
90 1011010 Z
91 1011011 [ left square bracket
92 1011100 \ backslash
93 1011101 ] right square bracket
94 1011110 ^ caret / circumflex
95 1011111 _ underscore
Dec Binary Char Description
96 1100000 ` grave / accent
97 1100001 a
98 1100010 b
99 1100011 c
100 1100100 d
101 1100101 e
102 1100110 f
103 1100111 g
104 1101000 h
105 1101001 i
106 1101010 j
107 1101011 k
108 1101100 l
109 1101101 m
110 1101110 n
111 1101111 o
112 1110000 p
113 1110001 q
114 1110010 r
115 1110011 s
116 1110100 t
117 1110101 u
118 1110110 v
119 1110111 w
120 1111000 x
121 1111001 y
122 1111010 z
123 1111011 { left curly bracket
124 1111100
125 1111101 } right curly bracket
126 1111110 ~ tilde
127 1000001 DEL delete

ASCII art (?)

You can make “art” with it

🔗 link

Examples from ASCII Art website:

  • Coffee and cereal by Hayley Jane Wakenshaw
  ;)( ;
 :----:     o8Oo./
C|====| ._o8o8o8Oo_.
 |    |  \========/
 `----'   `------'
  • Cat by Felix Lee (?)
      |\      _,,,---,,_
ZZZzz /,`.-'`'    -.  ;-;;,_
     |,4-  ) )-,_. ,\ (  `'-'
    '---''(_/--'  `-'\_)  Felix Lee 
  • Art (sunset) by Joan G. Stark
                   \       /            _\/_
                     .-'-.              //o\  _\/_
  _  ___  __  _ --_ /     \ _--_ __  __ _ | __/o\\ _
=-=-_=-=-_=-=_=-_= -=======- = =-=_=-=_,-'|"'""-|-,_ 
 =- _=-=-_=- _=-= _--=====- _=-=_-_,-"          |
jgs=- =- =-= =- = -  -===- -= - ."

Other text formats

ASCII is not the only standard. There are other ways to encode text using binary.

Below is a non-comprehensive list of other text (encoding):

  • Unicode
  • UTF-8 (Most common of all)
  • UTF-16
  • UTF-32
  • ISO-8859-1
  • Latin-1

Note

💡 You might have come across encoding mismatches before if you ever opened a file and the text looked like this:

“Nestlé and Mötley Crüe”

Where it should have read

“Nestlé and Mötley Crüe”

Emojis 😀

Emojis are text! They are part of UTF-8

Emojis can aid sentiment analysis

Dates and timestamps

  • Date: a calendar date
  • Format examples:
    • DD-MM-YYYY
    • DD/MM/YY
    • MM-DD-YYYY
    • YYYY-MM-DD
    • YYYYMMDD
  • Time: the time of day
    • Example of format: HH:mm:SS
  • Timestamp: Date + Time + marker of timezone
    • Example: 2022-01-11T18:49:05+00:00

What can go wrong?

  • Say we adopt DD/MM/YY
  • The next day after 31/12/99 is 01/01/00
  • But is it:
    • 31 December 1899 → 01 January 1900, or
    • 31 December 1999 → 01 January 2000?

The Millenium bug

After the Y2K (i.e Millenium bug), the Year 2038 bug?

Reference: (Gibbs 2014a)

Dates and data pre-processing

In sum:

  • Handling dates can be very frustrating!
  • Each country has their own favourite date format
  • If date is collected unstructured (as strings), there can be a mix of formats in the same file. Etc.

  • There are tools to help with that in all major programming languages related to data science:
    • lubridate package in R
    • datetime module in Python

Time for a break 🍵

After the break:

  • Common file formats
  • The concept of “tidy data”
  • ✍️ This week’s coursework

Common File Formats

CSV (Comma-Separated Values)

📄 CSV files are some of the simplest and most common formats to export rectangular data from databases and spreadsheets.

XML (eXtensible Markup Language):

📄 XML files are made of tags demarcated by the characters: < and >.

XML Example

  • It tells you the standard version and text encoding.​
  • Data is wrapped within tags​
  • Flexible

You can find yet another example of XML in Camden open data ➡️

HTML

Websites are encoded as HTML, a format similar to XML.

Example ➡️

JSON (JavaScript Object Notation)

📄 JSON files are also saved in text format but structured as name-value pairs. Filenames end with .json

{
  "London Air Pollution Data": 
  [
  {"Site": "BLO",
  "Species": "NO2",
  "ReadingDateTime": "19/08/2023 20:00",
  "Value": 10.3,
  "Units": "ug m-3",
  "Provisional or Ratified" : "P"},
  {
    "Site": "BLO",
    "Species": "SO2",
    "ReadingDateTime": "19/08/2023 20:00",
    "Value": 1.5,
    "Units": "ug m-3",
    "Provisional or Ratified": "P"
  },
  {
    "Site": "BLO",
    "Species": "PM2.5",
    "ReadingDateTime": "19/08/2023 20:00",
    "Value": 5.9,
    "Units": "ug m-3",
    "Provisional or Ratified": "P"
  }
  ]
}

If we were to visualise it:

Another way to write the same JSON

📄 The previous JSON file had a flat structure but JSON files can have more flexible, complex, nested structures.

{
{
  "London Air Pollution Data": 
    [
      {
        "Site": "BLO",
        "Measurements":
        [
          {
            "Species": "NO2",
            "ReadingDateTime": "19/08/2023 20:00",
            "Value": 10.3,
            "Units": "ug m-3",
            "Provisional or Ratified": "P"
          },
          { 
            "Species": "SO2",
            "ReadingDateTime": "19/08/2023 20:00",
            "Value": 1.5,
            "Units": "ug m-3",
            "Provisional or Ratified": "P"
          },
          {
            "Species": "PM2.5",
            "ReadingDateTime": "19/08/2023 20:00",
            "Value": 5.9,
            "Units": "ug m-3",
            "Provisional or Ratified": "P"
          }
        ]
      }
    ]
}

If we were to visualise this (nested) JSON?

Final note about file formats

  • If you are producing or sharing data, use open text formats like CSV, XML or JSON​
  • All major software packages and programming languages can parse these​
  • It is easy to inspect for errors, if necessary, manually​
  • Avoid proprietary, binary file formats:​
    • Tabular data: closed-source Excel .xls
    • Software-specific: .rdata, .rds, SAS7BDAT

Definition

.json, .html, .xml, .csv, etc. are called file extensions.

Tidy Data

Remember, we need organised datasets

💡 In fact, more than organised, we often need datasets to be tidy

start Start gather Gather data   start->gather store Store it          somewhere gather->store       clean Clean &         pre-process store->clean       build Build a dataset clean->build       eda Exploratory     data analysis build->eda ml Machine learning eda->ml       insight Obtain    insights ml->insight       communicate Communicate results          insight->communicate       end End communicate->end

Structured data can still be messy

What is wrong with using the spreadsheet below for data analysis ❓

Excel horror stories

By the way, although Excel can be useful, it can also be quite dangerous.​

(Be aware of its limitations!!)​

Source: Vincent (2020)

  • some 27 human genes have been renamed as Microsoft Excel kept misreading their symbols as dates e.g MARCH1 (short for “Membrane Associated Ring-CH-Type Finger 1”) renamed as MARCHF1 (Excel would convert value to date (1-Mar))
  • A study from 2016 (Ziemann, Eren, and El-Osta (2016)) reviewed the genetic data shared alongside 3,597 published papers and found that about one-fifth had been affected by Excel gene name conversion errors.
  • A second study (Abeysooriya et al. (2021)) shows the same problem to still persist years later: gene name errors were found in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists screened.

Excel horror stories

By the way, although Excel can be useful, it can also be quite dangerous.​

(Be aware of its limitations!!)​

Source: Hern (2020)

Excel horror stories

By the way, although Excel can be useful, it can also be quite dangerous.​

(Be aware of its limitations!!)​

Source: Kwak (2013)

Source: Beales (2013)

For more, read Excel horror stories

Excel horror stories

Excel is not a good tool for real data science and data analysis/manipulation. ​

But it is fine to use it to explore data if you are aware of its pitfalls.​

Messy Data

But I want to highlight a different kind of “messiness” today:​

  • In this context, messy data is data formatted in such a way that a lot of preprocessing is required before it can be used with:
    • statistical tools, or
    • Machine Learning packages​
  • Most collected data will be messy, but you can make it tidy.​

Caveat : Tidy data but not always!!

  • The pejorative term “messy” was used to refer to non-tidy data and to mean something to avoid at all costs.
  • That’s quite an oversimplification: there are many useful and well-founded data structures that are not tidy data. On the whole, there are two main reasons to use non-tidy data structures rather than tidy data structures:
    • alternative representations that may have substantial performance or space advantages (e.g matrice representation for algorithms optimized to work on matrices e.g PCA, single value decomposition or association rule mining)
    • specialised fields that have evolved their own conventions for storing data that may be quite different to the conventions of tidy data (e.g genomics).

For more on this, see Chapter 12, Section 12.7 in (Wickham and Grolemund 2016) and (Leek 2016).

✍️ Your formative

  • How you will put this knowledge to work this week

References

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann. 2021. “Gene Name Errors: Lessons Not Learned.” PLoS Computational Biology 17 (7): e1008984.
Baraniuk, Chris. 2015. “The Number Glitch That Can Lead to Catastrophe.” BBC Future, May. https://www.bbc.com/future/article/20150505-the-numbers-that-lead-to-disaster.
BBC News. 2014. “Gangnam Style Music Video ’Broke’ YouTube View Limit.” BBC News, December. https://www.bbc.co.uk/news/world-asia-30288542.
Beales, Richard. 2013. “Blame Microsoft.” New York Times (Online). 2013. https://www.proquest.com/blogs-podcasts-websites/blame-microsoft/docview/2215333604/se-2.
Felbo, Bjarke, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. “Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1615–25. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1169.
Gibbs, Samuel. 2014a. “Is the Year 2038 Problem the New Y2K Bug?” The Guardian (London). https://www.theguardian.com/technology/2014/dec/17/is-the-year-2038-problem-the-new-y2k-bug.
———. 2014b. “Y2K Bug Triggers Army Conscription Notices Sent to 14,000 Dead Men.” The Guardian (London). https://www.theguardian.com/technology/2014/jul/11/y2k-bug-us-army-conscription-1800s-pennsylvania.
Hern, Alex. 2020. “Covid: How Excel May Have Caused Loss of 16,000 Test Results in England.” The Guardian. https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england.
Kwak, James. 2013. “The Importance of Excel.” "The Baseline Scenario" Blog, February. https://baselinescenario.com/2013/02/09/the-importance-of-excel/.
Leek, Jeff. 2016. “Non-Tidy Data.” "Simply Statistics" Blog. https://simplystatistics.org/posts/2016-02-17-non-tidy-data/.
Oren, Nir. 2019. “If You Think the Millennium Bug Was a Hoax, Here Comes a History Lesson.” The Conversation. http://theconversation.com/if-you-think-the-millennium-bug-was-a-hoax-here-comes-a-history-lesson-129042.
Thomas, Martyn. 2019. “The Millennium Bug Was Real – and 20 Years Later We Face the Same Threats.” The Guardian (Online). https://www.theguardian.com/commentisfree/2019/dec/31/millennium-bug-face-fears-y2k-it-systems.
Uenuma, Francine. 2019. “Years Later, the Y2K Bug Seems Like a Joke–Because Those Behind the Scenes Took It Seriously.” Time. https://time.com/5752129/y2k-bug-history/.
Vincent, James. 2020. “Scientists Rename Human Genes to Stop Microsoft Excel from Misreading Them as Dates.” The Verge, August. https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly. https://r4ds.had.co.nz/.
Williams, Martyn. 2000a. “Computer Problems Hit Three Nuclear Plants in Japan.” CNN. https://edition.cnn.com/2000/TECH/computing/01/03/japan.nukes.y2k.idg/index.html.
———. 2000b. “Y2K Bug Hits Heating System in Korean Apartments.” CNN. https://edition.cnn.com/2000/TECH/computing/01/03/korea.heat.y2k.idg/index.html.
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1): 1–3.