🗓️ Week 02
Data types and the concept of tidy data

DS101 – Fundamentals of Data Science

23 Jan 2023

Dataset

📒 A dataset is simply a collection of data.

We need organised datasets

start Start gather Gather data   start->gather store Store it          somewhere gather->store       clean Clean &         pre-process store->clean       build Build a dataset clean->build       eda Exploratory     data analysis build->eda ml Machine learning eda->ml       insight Obtain    insights ml->insight       communicate Communicate results          insight->communicate       end End communicate->end

An example

Now we will start distinguishing:​

  • the types of data we can find in a dataset;​
  • the best practices to assemble a dataset.​

Structured data

📒 Structured data is data that can be organised in a tabular format.

  • Data that adheres to a pre-defined data model. ​
  • As the name suggests, it has a structure.
  • Tabular data is a common data structure ⏭️

Example

Camden crime dataset

Category Street ID Street Name Outcome Category Outcome Date
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Other theft 2343955 Holborn (lu Station) Under investigation 01/11/2022
Theft from the person 2343569 Euston (station) Under investigation 01/11/2022
Shoplifting 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343101 Camden Town (lu Station) Under investigation 01/11/2022
Violence and sexual offences 2343568 Euston (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343164 Chancery Lane (lu Station) Under investigation 01/11/2022
Public order 2343569 Euston (station) Under investigation 01/11/2022
Other theft 2343569 Euston (station) Under investigation 01/11/2022
Theft from the person 2343955 Holborn (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Robbery 2343623 Finchley Road (lu Station) Under investigation 01/11/2022
Theft from the person 1678851 On or near Shelton Street Investigation complete; no suspect identified 01/11/2022

Example (cont.)

Data type: numeric

Category Street ID Street Name Outcome Category Outcome Date
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Other theft 2343955 Holborn (lu Station) Under investigation 01/11/2022
Theft from the person 2343569 Euston (station) Under investigation 01/11/2022
Shoplifting 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343101 Camden Town (lu Station) Under investigation 01/11/2022
Violence and sexual offences 2343568 Euston (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343164 Chancery Lane (lu Station) Under investigation 01/11/2022
Public order 2343569 Euston (station) Under investigation 01/11/2022
Other theft 2343569 Euston (station) Under investigation 01/11/2022
Theft from the person 2343955 Holborn (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Robbery 2343623 Finchley Road (lu Station) Under investigation 01/11/2022
Theft from the person 1678851 On or near Shelton Street Investigation complete; no suspect identified 01/11/2022

Example (cont.)

Data type: string (text) but could be treated as categorical

Category Street ID Street Name Outcome Category Outcome Date
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Other theft 2343955 Holborn (lu Station) Under investigation 01/11/2022
Theft from the person 2343569 Euston (station) Under investigation 01/11/2022
Shoplifting 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343101 Camden Town (lu Station) Under investigation 01/11/2022
Violence and sexual offences 2343568 Euston (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343164 Chancery Lane (lu Station) Under investigation 01/11/2022
Public order 2343569 Euston (station) Under investigation 01/11/2022
Other theft 2343569 Euston (station) Under investigation 01/11/2022
Theft from the person 2343955 Holborn (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Robbery 2343623 Finchley Road (lu Station) Under investigation 01/11/2022
Theft from the person 1678851 On or near Shelton Street Investigation complete; no suspect identified 01/11/2022

Example (cont.)

Data type: date

Category Street ID Street Name Outcome Category Outcome Date
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Other theft 2343955 Holborn (lu Station) Under investigation 01/11/2022
Theft from the person 2343569 Euston (station) Under investigation 01/11/2022
Shoplifting 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343101 Camden Town (lu Station) Under investigation 01/11/2022
Violence and sexual offences 2343568 Euston (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Public order 2343164 Chancery Lane (lu Station) Under investigation 01/11/2022
Public order 2343569 Euston (station) Under investigation 01/11/2022
Other theft 2343569 Euston (station) Under investigation 01/11/2022
Theft from the person 2343955 Holborn (lu Station) Under investigation 01/11/2022
Other theft 2344108 Kings Cross (station) Under investigation 01/11/2022
Robbery 2343623 Finchley Road (lu Station) Under investigation 01/11/2022
Theft from the person 1678851 On or near Shelton Street Investigation complete; no suspect identified 01/11/2022

Basic data types

Bits and Bytes

  • Basic unit of data:​ a bit
  • Either a 0 or 1​
  • With \(n\) bits, one can represent \(n^2\) combinations
  • For example, with \(n=4\) bits we can represent \(16\) unique values ➡️
Bits Value
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
Bits Value
1000 8
1001 9
1010 10
1011 11
1100 12
1101 13
1110 14
1111 15​

Units of data

Unit of measurement​ Abbr. Conversion​
bit ​ b ​ 1 bit​
nibble​ -​ 4 bits (or half a byte)​
Byte​ B​ 8 bits​
Kilobyte​ KB​ 1024 bytes​
Megabyte​ MB​ 1024 kilobytes (or 1048576 bytes)​
Gigabyte​ GB​ 1024 megabytes​
Terabyte​ TB​ 1024 gigabytes​
Petabyte​ PB​ 1024 terabytes​

How are numbers represented?

Whole Numbers (\(\mathbb{Z}\))

  • Integers (int): Numbers that do not contain a decimal digit
  • Examples: \(-2, -1, 0, +1, +2\)
  • What range of numbers do you need?
    • 8-bits (\(2^8\) numbers): range from \(-128\) to \(+127\)
    • 16-bits (\(2^{16}\) numbers): range from \(-32,678\) to \(+32,767\)
    • 32-bits (\(2^{32}\) numbers): range from \(-2,147,483,648\) to \(+2,147,483,647\)

What could possibly go wrong?

In practice, it is unlikely you will ever have to think about it!

Other types of numbers

Real Numbers (\(\mathbb{R}\)), either Rational (\(\mathbb{Q}\)) or Irrational (\(\mathbb{Q'}\))

  • In computing, these are represented by Floating-point numbers

    • Different to the sets in mathematics, these are almost always approximate numbers
  • Random examples:

    • \(-232.32\)
    • \(+7.18\)
    • \(0\)
    • \(+1,402.9202818779\)
  • Most commonly, represented by 32-bits (float) or 64-bits (double)

    • In the programming language R, these are called numeric

Text (data types)

  • Character (char):
    • a single character, as a represented letter, number, punctuation mark, emoji, symbol, etc.
    • Generally expressed wrapped by single quotes (') or double (") quotes in most tools and programming languages.
    • Examples: 'B', 'a', '!', '.'
  • String
    • a sequence of characters together. Raw text.
    • Typically represented under quotes, too.
    • Examples: "The text inside these quotes is a string."

Text (data types)

  • How is text represented in binary language?
    • The ASCII table is one of early standards
    • It encodes characters using 7-bits; therefore, it can represent a total of 128 unique characters

The ASCII table

Dec Binary Char Description
0 0000000 NUL Null
1 0000001 SOH Start of Header
2 0000010 STX Start of Text
3 0000011 ETX End of Text
4 0000100 EOT End of Transmission
5 0000101 ENQ Enquiry
6 0000110 ACK Acknowledge
7 0000111 BEL Bell
8 0001000 BS Backspace
9 0001001 HT Horizontal Tab
10 0001010 LF Line Feed
11 0001011 VT Vertical Tab
12 0001100 FF Form Feed
13 0001101 CR Carriage Return
14 0001110 SO Shift Out
15 0001111 SI Shift In
16 0010000 DLE Data Link Escape
17 0010001 DC1 Device Control 1
18 0010010 DC2 Device Control 2
19 0010011 DC3 Device Control 3
20 0010100 DC4 Device Control 4
21 0010101 NAK Negative Acknowledge
22 0010110 SYN Synchronize
23 0010111 ETB End of Transmission Block
24 0011000 CAN Cancel
25 0011001 EM End of Medium
26 0011010 SUB Substitute
27 0011011 ESC Escape
28 0011100 FS File Separator
29 0011101 GS Group Separator
30 0011110 RS Record Separator
31 0011111 US Unit Separator
Dec Binary Char Description
32 0100000 space Space
33 0100001 ! exclamation mark
34 0100010 double quote
35 0100011 # number
36 0100100 $ dollar
37 0100101 % percent
38 0100110 & ampersand
39 0100111 single quote
40 0101000 ( left parenthesis
41 0101001 ) right parenthesis
42 0101010 * asterisk
43 0101011 + plus
44 0101100 , comma
45 0101101 - minus
46 0101110 . period
47 0101111 / slash
48 0110000 0 zero
49 0110001 1 one
50 0110010 2 two
51 0110011 3 three
52 0110100 4 four
53 0110101 5 five
54 0110110 6 six
55 0110111 7 seven
56 0111000 8 eight
57 0111001 9 nine
58 0111010 : colon
59 0111011 ; semicolon
60 0111100 < less than
61 0111101 = equality sign
62 0111110 > greater than
63 0111111 ? question mark
Dec Binary Char Description
64 1000000 @ at sign
65 1000001 A
66 1000010 B
67 1000011 C
68 1000100 D
69 1000101 E
70 1000110 F
71 1000111 G
72 1001000 H
73 1001001 I
74 1001010 J
75 1001011 K
76 1001100 L
77 1001101 M
78 1001110 N
79 1001111 O
80 1010000 P
81 1010001 Q
82 1010010 R
83 1010011 S
84 1010100 T
85 1010101 U
86 1010110 V
87 1010111 W
88 1011000 X
89 1011001 Y
90 1011010 Z
91 1011011 [ left square bracket
92 1011100 \ backslash
93 1011101 ] right square bracket
94 1011110 ^ caret / circumflex
95 1011111 _ underscore
Dec Binary Char Description
96 1100000 ` grave / accent
97 1100001 a
98 1100010 b
99 1100011 c
100 1100100 d
101 1100101 e
102 1100110 f
103 1100111 g
104 1101000 h
105 1101001 i
106 1101010 j
107 1101011 k
108 1101100 l
109 1101101 m
110 1101110 n
111 1101111 o
112 1110000 p
113 1110001 q
114 1110010 r
115 1110011 s
116 1110100 t
117 1110101 u
118 1110110 v
119 1110111 w
120 1111000 x
121 1111001 y
122 1111010 z
123 1111011 { left curly bracket
124 1111100
125 1111101 } right curly bracket
126 1111110 ~ tilde
127 1000001 DEL delete

ASCII art (?)

You can make art (I guess) with it

🔗 link

Examples from ASCII Art website:

  • Coffee and cereal by Hayley Jane Wakenshaw
  ;)( ;
 :----:     o8Oo./
C|====| ._o8o8o8Oo_.
 |    |  \========/
 `----'   `------'
  • Castles (unknown)
                                                |>>>
                                                |
                                            _  _|_  _
                                           |;|_|;|_|;|
                                           \\.    .  /
                                            \\:  .  /
                                             ||:   |
                                             ||:.  |
                                             ||:  .|
                                             ||:   |       \,/
                                             ||: , |            /`\
                                             ||:   |
                                             ||: . |
              __                            _||_   |
     ____--`~    '--~~__            __ ----~    ~`---,              ___
-~--~                   ~---__ ,--~'                  ~~----_____-~'   `~----~~

Other text formats

ASCII is not the only standard. There are other ways to encode text using binary.

Below is a non-comprehensive list of other text (encoding):

  • Unicode
  • UTF-8 (Most common of all)
  • UTF-16
  • UTF-32
  • ISO-8859-1
  • Latin-1

Note

💡 You might have come across encoding mismatches before if you ever opened a file and the text looked like this:

“Nestlé and Mötley Crüe”

Where it should have read

“Nestlé and Mötley Crüe”

Emojis 😀

Emojis are text! They are part of UTF-8

Emojis can aid sentiment analysis

Dates and timestamps

  • Date: a calendar date
  • Format examples:
    • DD-MM-YYYY
    • DD/MM/YY
    • MM-DD-YYYY
    • YYYY-MM-DD
    • YYYYMMDD
  • Time: the time of day
    • Example of format: HH:mm:SS
  • Timestamp: Date + Time + marker of timezone
    • Example: 2022-01-11T18:49:05+00:00

What can go wrong?

  • Say we adopt DD/MM/YY
  • The next day after 31/12/99 is 01/01/00
  • But is it:
    • 31 December 1899 → 01 January 1900, or
    • 31 December 1999 → 01 January 2000?

The Millenium bug

Dates and data pre-processing

In sum:

  • Handling dates can be very frustrating!
  • Each country has their own favourite date format
  • If date is collected unstructured (as strings), there can be a mix of formats in the same file. Etc.

  • There are tools to help with that in all major programming languages related to data science:
    • lubridate package in R
    • datetime module in Python

Time for coffee ☕

After the break:

  • Common file formats
  • The concept of “tidy data”
  • ✍️ This week’s coursework

Common File Formats

CSV (Comma-Separated Values)

📄 CSV files are some of the simplest and most common formats to export rectangular data from databases and spreadsheets.

  • Filenames end with .csv
  • The first line represents the columns​
  • A comma separates values
    • or at least that is how it should be
  • Let me show you an example from Camden open data ➡️

XML (eXtensible Markup Language):

📄 XML files are made of tags demarcated by the characters: < and >.

XML Example

  • It tells you the standard version and text encoding.​
  • Data is wrapped within tags​
  • Flexible

HTML

Websites are encoded as HTML, a format similar to XML.

Example ➡️

JSON (JavaScript Object Notation)

📄 JSON files are also saved in text format but structured as name-value pairs. Filenames end with .json

{
  "squadName": "Super hero squad",
  "homeTown": "Metro City",
  "formed": 2016,
  "secretBase": "Super tower",
  "active": true,
  "members": [
    {
      "name": "Molecule Man",
      "age": 29,
      "secretIdentity": "Dan Jukes",
      "powers": [
        "Radiation resistance",
        "Turning tiny",
        "Radiation blast"
      ]
    },
    {
      "name": "Madame Uppercut",
      "age": 39,
      "secretIdentity": "Jane Wilson",
      "powers": [
        "Million tonne punch",
        "Damage resistance",
        "Superhuman reflexes"
      ]
    }
  ]
}

If we were to visualise it:

Final note about file formats

  • If you are producing or sharing data, use open text formats like CSV, XML or JSON​
  • All major software packages and programming languages can parse these​
  • It is easy to inspect for errors, if necessary, manually​
  • Avoid proprietary, binary file formats:​
    • Tabular data: closed-source Excel .xls
    • Software-specific: .rdata, .rds, SAS7BDAT

Tidy Data

Remember, we need organised datasets

💡 In fact, more than organised, we often need datasets to be tidy

start Start gather Gather data   start->gather store Store it          somewhere gather->store       clean Clean &         pre-process store->clean       build Build a dataset clean->build       eda Exploratory     data analysis build->eda ml Machine learning eda->ml       insight Obtain    insights ml->insight       communicate Communicate results          insight->communicate       end End communicate->end

Structured data can still be messy

What is wrong with using the spreadsheet below for data analysis ❓

Excel horror stories

By the way, although Excel can be useful, it might also be dangerous.​

(Be aware of its limitations)​

Excel horror stories

Excel is not a good tool for real data science. ​

But it is OK to use it to explore data if you know its pitfalls.​

Messy Data

But I want to highlight a different kind of “messiness” today:​

  • In this context, messy data is data formatted that requires a lot of preprocessing before it can be used with:
    • stats tools, or
    • Machine Learning packages​
  • Most collected data will be messy, but you can make it tidy.​

✍️ Your formative

  • How you will put this knowledge to work this week

References

Baraniuk, Chris. 2015. “The Number Glitch That Can Lead to Catastrophe.” BBC Future, May. https://www.bbc.com/future/article/20150505-the-numbers-that-lead-to-disaster.
Felbo, Bjarke, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. “Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1615–25. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1169.
Oren, Nir. 2019. “If You Think the Millennium Bug Was a Hoax, Here Comes a History Lesson.” The Conversation. http://theconversation.com/if-you-think-the-millennium-bug-was-a-hoax-here-comes-a-history-lesson-129042.