🗓️ Week 01 – Day 02
Data types and common file formats

09 Jul 2024

Datasets

📒 A dataset is simply a collection of data.

But just having a bunch of data is not enough…

Important

Soon, you will start collecting data. If you’re not careful with how you organise it, you will end up with a mess that will be difficult to work with.

Garbage In -> Garbage Out

We need organised datasets

An example

Now we will start distinguishing:

the types of data we can find in a dataset;
the best practices to assemble a dataset.

Data Types

Structured
Unstructured
Semi-structured

Structured Data

📒 Structured data is data that adheres to a pre-defined data model.

As the name suggests, it has a structure.
Tabular data is a common data structure ⏭️

Example

Camden crime dataset

Category	Street ID	Street Name	Outcome Category	Outcome Date
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Other theft	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Theft from the person	2343569	Euston (station)	Under investigation	01/11/2022
Shoplifting	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343101	Camden Town (lu Station)	Under investigation	01/11/2022
Violence and sexual offences	2343568	Euston (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343164	Chancery Lane (lu Station)	Under investigation	01/11/2022
Public order	2343569	Euston (station)	Under investigation	01/11/2022
Other theft	2343569	Euston (station)	Under investigation	01/11/2022
Theft from the person	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Robbery	2343623	Finchley Road (lu Station)	Under investigation	01/11/2022
Theft from the person	1678851	On or near Shelton Street	Investigation complete; no suspect identified	01/11/2022

Example (cont.)

Data type: numeric^(*)

Category	Street ID	Street Name	Outcome Category	Outcome Date
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Other theft	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Theft from the person	2343569	Euston (station)	Under investigation	01/11/2022
Shoplifting	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343101	Camden Town (lu Station)	Under investigation	01/11/2022
Violence and sexual offences	2343568	Euston (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343164	Chancery Lane (lu Station)	Under investigation	01/11/2022
Public order	2343569	Euston (station)	Under investigation	01/11/2022
Other theft	2343569	Euston (station)	Under investigation	01/11/2022
Theft from the person	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Robbery	2343623	Finchley Road (lu Station)	Under investigation	01/11/2022
Theft from the person	1678851	On or near Shelton Street	Investigation complete; no suspect identified	01/11/2022

* In R, use as.integer(df$column) or as.numeric(df$column) depending on the type of number you need.
* In Python, use df['column'].astype(int) or df['column'].astype(float).

Example (cont.)

Data type: string but could be treated as categorical^(*)

Category	Street ID	Street Name	Outcome Category	Outcome Date
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Other theft	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Theft from the person	2343569	Euston (station)	Under investigation	01/11/2022
Shoplifting	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343101	Camden Town (lu Station)	Under investigation	01/11/2022
Violence and sexual offences	2343568	Euston (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343164	Chancery Lane (lu Station)	Under investigation	01/11/2022
Public order	2343569	Euston (station)	Under investigation	01/11/2022
Other theft	2343569	Euston (station)	Under investigation	01/11/2022
Theft from the person	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Robbery	2343623	Finchley Road (lu Station)	Under investigation	01/11/2022
Theft from the person	1678851	On or near Shelton Street	Investigation complete; no suspect identified	01/11/2022

* In R, use as.factor(df$column) to convert a column to a categorical type.
* In Python, use df['column'].astype('category').

Example (cont.)

Data type: date^(*)

Category	Street ID	Street Name	Outcome Category	Outcome Date
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Other theft	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Theft from the person	2343569	Euston (station)	Under investigation	01/11/2022
Shoplifting	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343101	Camden Town (lu Station)	Under investigation	01/11/2022
Violence and sexual offences	2343568	Euston (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Public order	2343164	Chancery Lane (lu Station)	Under investigation	01/11/2022
Public order	2343569	Euston (station)	Under investigation	01/11/2022
Other theft	2343569	Euston (station)	Under investigation	01/11/2022
Theft from the person	2343955	Holborn (lu Station)	Under investigation	01/11/2022
Other theft	2344108	Kings Cross (station)	Under investigation	01/11/2022
Robbery	2343623	Finchley Road (lu Station)	Under investigation	01/11/2022
Theft from the person	1678851	On or near Shelton Street	Investigation complete; no suspect identified	01/11/2022

* The best way to deal with date-time data in R is to use the lubridate package. Base R date-time functions are NOT GREAT.
* In Python, use the pd.to_datetime(df['column']) function (but you might need to specify the format).

Text (data types)

How is text represented in binary language?
- The ASCII table is one of early standards
- It encodes characters using 7-bits; therefore, it can represent a total of 128 unique characters

The ASCII table

Dec	Binary	Char	Description
0	0000000	NUL	Null
1	0000001	SOH	Start of Header
2	0000010	STX	Start of Text
3	0000011	ETX	End of Text
4	0000100	EOT	End of Transmission
5	0000101	ENQ	Enquiry
6	0000110	ACK	Acknowledge
7	0000111	BEL	Bell
8	0001000	BS	Backspace
9	0001001	HT	Horizontal Tab
10	0001010	LF	Line Feed
11	0001011	VT	Vertical Tab
12	0001100	FF	Form Feed
13	0001101	CR	Carriage Return
14	0001110	SO	Shift Out
15	0001111	SI	Shift In
16	0010000	DLE	Data Link Escape
17	0010001	DC1	Device Control 1
18	0010010	DC2	Device Control 2
19	0010011	DC3	Device Control 3
20	0010100	DC4	Device Control 4
21	0010101	NAK	Negative Acknowledge
22	0010110	SYN	Synchronize
23	0010111	ETB	End of Transmission Block
24	0011000	CAN	Cancel
25	0011001	EM	End of Medium
26	0011010	SUB	Substitute
27	0011011	ESC	Escape
28	0011100	FS	File Separator
29	0011101	GS	Group Separator
30	0011110	RS	Record Separator
31	0011111	US	Unit Separator

Dec	Binary	Char	Description
32	0100000	space	Space
33	0100001	!	exclamation mark
34	0100010	”	double quote
35	0100011	#	number
36	0100100	$	dollar
37	0100101	%	percent
38	0100110	&	ampersand
39	0100111	’	single quote
40	0101000	(	left parenthesis
41	0101001	)	right parenthesis
42	0101010	*	asterisk
43	0101011	+	plus
44	0101100	,	comma
45	0101101	-	minus
46	0101110	.	period
47	0101111	/	slash
48	0110000	0	zero
49	0110001	1	one
50	0110010	2	two
51	0110011	3	three
52	0110100	4	four
53	0110101	5	five
54	0110110	6	six
55	0110111	7	seven
56	0111000	8	eight
57	0111001	9	nine
58	0111010	:	colon
59	0111011	;	semicolon
60	0111100	<	less than
61	0111101	=	equality sign
62	0111110	>	greater than
63	0111111	?	question mark

Dec	Binary	Char	Description
64	1000000	@	at sign
65	1000001	A
66	1000010	B
67	1000011	C
68	1000100	D
69	1000101	E
70	1000110	F
71	1000111	G
72	1001000	H
73	1001001	I
74	1001010	J
75	1001011	K
76	1001100	L
77	1001101	M
78	1001110	N
79	1001111	O
80	1010000	P
81	1010001	Q
82	1010010	R
83	1010011	S
84	1010100	T
85	1010101	U
86	1010110	V
87	1010111	W
88	1011000	X
89	1011001	Y
90	1011010	Z
91	1011011	[	left square bracket
92	1011100	\	backslash
93	1011101	]	right square bracket
94	1011110	^	caret / circumflex
95	1011111	_	underscore

Dec	Binary	Char	Description
96	1100000	`	grave / accent
97	1100001	a
98	1100010	b
99	1100011	c
100	1100100	d
101	1100101	e
102	1100110	f
103	1100111	g
104	1101000	h
105	1101001	i
106	1101010	j
107	1101011	k
108	1101100	l
109	1101101	m
110	1101110	n
111	1101111	o
112	1110000	p
113	1110001	q
114	1110010	r
115	1110011	s
116	1110100	t
117	1110101	u
118	1110110	v
119	1110111	w
120	1111000	x
121	1111001	y
122	1111010	z
123	1111011	{	left curly bracket
124	1111100
125	1111101	}	right curly bracket
126	1111110	~	tilde
127	1000001	DEL	delete

Other text formats

ASCII is not the only standard. There are other ways to encode text using binary.

Below is a non-comprehensive list of other text (encoding):

Unicode
UTF-8 (Most common of all)
UTF-16
UTF-32
ISO-8859-1
Latin-1

Note

💡 You might have come across encoding mismatches before if you ever opened a file and the text looked like this:

“NestlÃ© and MÃ¶tley CrÃ¼e”

Where it should have read

“Nestlé and Mötley Crüe”

Emojis 😀

Emojis are text! They are part of UTF-8

Structured data can still be messy

In this context, messy data refers to data that requires extensive preprocessing before it can be effectively used in:
- Exploratory data analysis,
- Statistical tools, or
- Training machine learning algorithms.
It is important to establish standards to simplify our work.
One such standard for data handling is that of tidy data ➡️

What this means for you

Most data you will collect will come “messy”
- but you can tidy it up using tidyverse (R) or pandas (Python)

You will be asked over and over and over again to make your data tidy!

tidy data will be a requirement for your coursework

Note

📘 Read more about tidy data

Wickham, H. . (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. (Wickham 2014)

What about Unstructured Data?

📒 Unstructured data does NOT adhere to a pre-defined data model.

Rule of thumb: it is not easy to think of this type of data as a table.
Let’s see a few examples ➡️

Examples

Pure “raw” text

Photos

Videos

Text in scanned images

Semi-structured data

For example, e-mails:

Parts of the data have a structure
But the main content itself is usually just raw text

Note

Have you heard of the Enron e-mail scandal story before?

Common File Formats

I will show you how the following file formats are structured and how to read them in R:

CSV
XML
JSON

CSV

CSV stands for Comma Separated Values

CSV - Example

Which means a table like this:

Category	Street ID	Street Name	Outcome Category	Outcome Date	Location
Other theft	1489515	Kings Cross (station)	Status update unavailable	01/08/2017	(51.5318, -0.123189)
Anti-social behaviour	960522	On or near Wellesley Place			(51.528169, -0.131558)
Theft from the person	965233	On or near Avenue Road	Investigation complete; no suspect identified	01/08/2015	(51.542741, -0.174124)
Anti-social behaviour	960974	On or near Birkenhead Street			(51.529611, -0.121652)
Drugs	972275	On or near Oakeshott Avenue	Offender given a drugs possession warning	01/06/2015	(51.565544, -0.149851)
Anti-social behaviour	965090	On or near Hawley Mews			(51.542507, -0.146589)
Violence and sexual offences	967816	On or near Maitland Park Villas	Under investigation	01/06/2017	(51.548264, -0.156365)
Vehicle crime	967555	On or near Ferncroft Avenue	Investigation complete; no suspect identified	01/02/2016	(51.558504, -0.191468)
Theft from the person	965140	On or near Nightclub	Under investigation	01/04/2018	(51.541352, -0.14629)
Other theft	965052	On or near Brocas Close	Investigation complete; no suspect identified	01/05/2016	(51.543954, -0.165408)

CSV - Example

Becomes this:

Category,Street ID,Street Name,Outcome Category,Outcome Date,Location
Other theft,1489515,Kings Cross (station),Status update unavailable,01/08/2017,"(51.5318, -0.123189)"
Anti-social behaviour,960522,On or near Wellesley Place,,,"(51.528169, -0.131558)"
Theft from the person,965233,On or near Avenue Road,Investigation complete; no suspect identified,01/08/2015,"(51.542741, -0.174124)"
Anti-social behaviour,960974,On or near Birkenhead Street,,,"(51.529611, -0.121652)"
Drugs,972275,On or near Oakeshott Avenue,Offender given a drugs possession warning,01/06/2015,"(51.565544, -0.149851)"
Anti-social behaviour,965090,On or near Hawley Mews,,,"(51.542507, -0.146589)"
Violence and sexual offences,967816,On or near Maitland Park Villas,Under investigation,01/06/2017,"(51.548264, -0.156365)"
Vehicle crime,967555,On or near Ferncroft Avenue,Investigation complete; no suspect identified,01/02/2016,"(51.558504, -0.191468)"
Theft from the person,965140,On or near Nightclub,Under investigation,01/04/2018,"(51.541352, -0.14629)"
Other theft,965052,On or near Brocas Close,Investigation complete; no suspect identified,01/05/2016,"(51.543954, -0.165408)"

CSV - Example

CSV files encode tabular data, so it is easy to read them into a data frame.

library(readr)

# tidyverse using readr
filename <- "data/On_Street_Crime_In_Camden_Map.csv"
csv_data <- read_csv(filename)

Python

# pandas
import pandas as pd

filename = "data/On_Street_Crime_In_Camden_Map.csv"
csv_data = pd.read_csv()

Tip

CSV - Inspecting data types (R)

You can use str(csv_data) to inspect data types.
On base R, you will get something like this:

'data.frame':   376132 obs. of  20 variables:
 $ Category        : chr  "Other theft" "Anti-social behaviour" "Theft from the person" "Anti-social behaviour" ...
 $ Street.ID       : int  1489515 960522 965233 960974 972275 965090 967816 967555 965140 965052 ...
 $ Street.Name     : chr  "Kings Cross (station)" "On or near Wellesley Place" "On or near Avenue Road" "On or nea. 
 $ Context         : logi  NA NA NA NA NA NA ...
 $ Outcome.Category: chr  "Status update unavailable" "" "Investigation complete; no suspect identified" "" ...
 $ Outcome.Date    : chr  "01/08/2017" "" "01/08/2015" "" ...
 $ Service         : chr  "British Transport Police" "Police Force" "Police Force" "Police Force" ...
 $ Location.Subtype: chr  "Station" "" "" "" ...
 $ ID              : int  64777250 51520755 42356413 59431385 41931981 51522064 58014826 46231592 64334450 4850859.
 $ Persistent.ID   : chr  "" "" "915131bf174019fd2fcf5aa4af305f7b2b34a763d8fcb0ccc4050503b82fa0fb" "" ...
 $ Epoch           : chr  "01/04/2017" "01/09/2016" "01/07/2015" "01/08/2017" ...
 $ Ward.Code       : chr  "E05000143" "E05000143" "E05000144" "E05000141" ...
 $ Ward.Name       : chr  "St Pancras and Somers Town" "St Pancras and Somers Town" "Swiss Cottage" "King's Cross".
 $ Easting         : num  530277 529707 526717 530390 528336 ...
 $ Northing        : num  183101 182683 184228 182861 186806 ...
 $ Longitude       : num  -0.123 -0.132 -0.174 -0.122 -0.15 ...
 $ Latitude        : num  51.5 51.5 51.5 51.5 51.6 ...
 $ Spatial.Accuracy: chr  "This is only an approximation of where the crime happened" "This is only an approximati.
 $ Last.Uploaded   : chr  "11/07/2018" "11/07/2018" "05/05/2016" "03/11/2017" ...
 $ Location        : chr  "(51.5318, -0.123189)" "(51.528169, -0.131558)" "(51.542741, -0.174124)" "(51.529611, -0.

R is pretty good at guessing data types, but it won’t get things like dates right and it won’t always use the most efficient data type (e.g. int vs num).

CSV - Inspecting data types (tidyverse)

If you read the data frame with tidyverse, str(csv_data) will look differently:

spc_tbl_ [376,132 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Category        : chr [1:376132] "Other theft" "Anti-social behaviour" "Theft from the person" "Anti-social beh.
 $ Street ID       : num [1:376132] 1489515 960522 965233 960974 972275 ...
 $ Street Name     : chr [1:376132] "Kings Cross (station)" "On or near Wellesley Place" "On or near Avenue Road" .
 $ Context         : logi [1:376132] NA NA NA NA NA NA ...
 $ Outcome Category: chr [1:376132] "Status update unavailable" NA "Investigation complete; no suspect identified".
 $ Outcome Date    : chr [1:376132] "01/08/2017" NA "01/08/2015" NA ...
 $ Service         : chr [1:376132] "British Transport Police" "Police Force" "Police Force" "Police Force" ...    
 $ Location Subtype: chr [1:376132] "Station" NA NA NA ...
 $ ID              : num [1:376132] 64777250 51520755 42356413 59431385 41931981 ...
 $ Persistent ID   : chr [1:376132] NA NA "915131bf174019fd2fcf5aa4af305f7b2b34a763d8fcb0ccc4050503b82fa0fb" NA ...
 $ Epoch           : chr [1:376132] "01/04/2017" "01/09/2016" "01/07/2015" "01/08/2017" ...
 $ Ward Code       : chr [1:376132] "E05000143" "E05000143" "E05000144" "E05000141" ...
 $ Ward Name       : chr [1:376132] "St Pancras and Somers Town" "St Pancras and Somers Town" "Swiss Cottage" "Kin.
 $ Easting         : num [1:376132] 530277 529707 526717 530390 528336 ...
 $ Northing        : num [1:376132] 183101 182683 184228 182861 186806 ...
 $ Longitude       : num [1:376132] -0.123 -0.132 -0.174 -0.122 -0.15 ...
 $ Latitude        : num [1:376132] 51.5 51.5 51.5 51.5 51.6 ...
 $ Spatial Accuracy: chr [1:376132] "This is only an approximation of where the crime happened" "This is only an a.
 $ Last Uploaded   : chr [1:376132] "11/07/2018" "11/07/2018" "05/05/2016" "03/11/2017" ...
 $ Location        : chr [1:376132] "(51.5318, -0.123189)" "(51.528169, -0.131558)" "(51.542741, -0.174124)" "(51..
 - attr(*, "spec")=
  .. cols(
  ..   Category = col_character(),
  ..   `Street ID` = col_double(),
  ..   `Street Name` = col_character(),
  ..   Context = col_logical(),
  ..   `Outcome Category` = col_character(),
  ..   `Outcome Date` = col_character(),
  ..   Service = col_character(),
  ..   `Location Subtype` = col_character(),
  ..   ID = col_double(),
  ..   `Persistent ID` = col_character(),
  ..   Epoch = col_character(),
  ..   `Ward Code` = col_character(),
  ..   `Ward Name` = col_character(),
  ..   Easting = col_double(),
  ..   Northing = col_double(),
  ..   Longitude = col_double(),
  ..   Latitude = col_double(),
  ..   `Spatial Accuracy` = col_character(),
  ..   `Last Uploaded` = col_character(),
  ..   Location = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

CSV - Inspecting data types (tidyverse)

In sum:

the output is more verbose
tidyverse didn’t use int even in the most explicit cases (e.g. Street ID)
- it used num instead
similarly, it didn’t use factors for obviously categorical variables
(e.g. Outcome Category)
it didn’t decode the dates (e.g. Outcome Date)

But I really want to draw your attention to the following line:

spc_tbl_ [376,132 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

which indicates that this is a tibble object.

From the `tibble` documentation:

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.

Tibbles are data.frames that are lazy and surly:

they do less (i.e. they don’t change variable names or types, and don’t do partial matching)

and complain more (e.g. when a variable does not exist).

This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

If you are new to tibbles, the best place to start is the tibbles chapter in R for data science (Wickham and Grolemund 2016, chap. 10).

CSV - Inspecting data types (Python)

You can use csv_data.dtypes to inspect data types.
In pandas, you will get something like this:

Category             object
Street ID             int64
Street Name          object
Context             float64
Outcome Category     object
Outcome Date         object
Service              object
Location Subtype     object
ID                    int64
Persistent ID        object
Epoch                object
Ward Code            object
Ward Name            object
Easting             float64
Northing            float64
Longitude           float64
Latitude            float64
Spatial Accuracy     object
Last Uploaded        object
Location             object
dtype: object

CSV - Inspecting data types (`df.info()`)

In sum:

the output is more verbose
pandas didn’t use the most memory-efficient int
- it used int64 even for small ranges
similarly, it didn’t use category for obviously categorical variables
(e.g. Outcome Category)
it didn’t decode the dates (e.g. Outcome Date)

Category             object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Category          419273 non-null  object 
 1   Street ID         419273 non-null  int64  
 2   Street Name       419273 non-null  object 
 3   Context           0 non-null       float64
 4   Outcome Category  324353 non-null  object 
 5   Outcome Date      324353 non-null  object 
 6   Service           419273 non-null  object 
 7   Location Subtype  15697 non-null   object 
 8   ID                419273 non-null  int64  
 9   Persistent ID     298195 non-null  object 
 10  Epoch             419273 non-null  object 
 11  Ward Code         419273 non-null  object 
 12  Ward Name         419273 non-null  object 
 13  Easting           419273 non-null  float64
 14  Northing          419273 non-null  float64
 15  Longitude         419273 non-null  float64
 16  Latitude          419273 non-null  float64
 17  Spatial Accuracy  419273 non-null  object 
 18  Last Uploaded     419273 non-null  object 
 19  Location          419273 non-null  object 
dtypes: float64(5), int64(2), object(13)
memory usage: 64.0+ MB

Our approach to data processing

This philosophy aligns with the principles of this course.
We want you using the most appropriate data types and addressing issues proactively early on.
Even if they are not super necessary (i.e. you don’t have Big Data)
From a pedagogical standpoint, it is crucial for us to teach you the best practices.
Let me be more specific ➡️

Principle 1: Ensure data types make sense (R)

Reflect on the `character` variables (R)

How many unique values each of them have?

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

🗣️ CLASS DISCUSSION

What does this summary tell us?

Column	# Distinct	Percentage
Persistent ID	257160	68.37%
Location	2917	0.78%
Street Name	1311	0.35%
Outcome Date	101	0.03%
Epoch	100	0.03%
Ward Code	38	0.01%
Outcome Category	27	0.01%
Ward Name	27	0.01%
Last Uploaded	27	0.01%
Category	14	0.00%
Location Subtype	3	0.00%
Service	2	0.00%
Spatial Accuracy	1	0.00%

🛣️ First, let’s take a little detour to understand the code above.

🛣️ Detour: Code explainer

First, we select only the character variables.

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

We end up with these columns:

"Category"        
"Street Name"     
"Outcome Category"
"Outcome Date"    
"Service"
"Location Subtype"
"Persistent ID"
"Epoch"
"Ward Code"
"Ward Name"
"Spatial Accuracy"
"Last Uploaded"
"Location"

🛣️ Detour: Code explainer (R)

Then we summarise each selected column using the across() function.

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

This returns the following tibble:

# A tibble: 1 x 13
  Category `Street Name` `Outcome Category` `Outcome Date` Service `Location Subtype` `Persistent ID` Epoch `Ward Code` `Ward Name`
     <int>         <int>              <int>          <int>   <int>              <int>           <int> <int>       <int>       <int>
1       14          1311                 27            101       2                  3          257160   100          38          27
# i 3 more variables: `Spatial Accuracy` <int>, `Last Uploaded` <int>, Location <int>

🛣️ Detour: Code explainer (R)

We use the pivot_longer() function from the tidyr package to ‘rotate’ the data:

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

Columns becomes rows:

# A tibble: 13 x 2
   Column           `# Distinct`
   <chr>                   <int>
 1 Category                   14
 2 Street Name              1311
 3 Outcome Category           27
 4 Outcome Date              101
 5 Service                     2
 6 Location Subtype            3
 7 Persistent ID          257160
 8 Epoch                     100
 9 Ward Code                  38
10 Ward Name                  27
11 Spatial Accuracy            1
12 Last Uploaded              27
13 Location                 2917

🛣️ Detour: Code explainer (R)

Now it’s easy to add a new column with the percentages using mutate:

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

This produces:

# A tibble: 13 x 3
   Column           `# Distinct` Percentage
   <chr>                   <int> <chr>     
 1 Category                   14 0.00%
 2 Street Name              1311 0.35%
 3 Outcome Category           27 0.01%
 4 Outcome Date              101 0.03%
 5 Service                     2 0.00%
 6 Location Subtype            3 0.00%
 7 Persistent ID          257160 68.37%
 8 Epoch                     100 0.03%
 9 Ward Code                  38 0.01%
10 Ward Name                  27 0.01%
11 Spatial Accuracy            1 0.00%
12 Last Uploaded              27 0.01%
13 Location                 2917 0.78%

🛣️ Detour: Code explainer (R)

To communicate the results better, we sort the rows by # Distinct:

csv_data %>%
  select(where(is.character)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), 
               names_to = "Column", 
               values_to = "# Distinct") %>%
  mutate(Percentage=sprintf("%.2f%%",(`# Distinct`/dim(csv_data)[1])*100)) %>%
  arrange(desc(`# Distinct`))

which leads us to our desired output:

# A tibble: 13 x 3
   Column           `# Distinct` Percentage
   <chr>                   <int> <chr>
 1 Persistent ID          257160 68.37%
 2 Location                 2917 0.78%
 3 Street Name              1311 0.35%
 4 Outcome Date              101 0.03%
 5 Epoch                     100 0.03%
 6 Ward Code                  38 0.01%
 7 Outcome Category           27 0.01%
 8 Ward Name                  27 0.01%
 9 Last Uploaded              27 0.01%
10 Category                   14 0.00%
11 Location Subtype            3 0.00%
12 Service                     2 0.00%
13 Spatial Accuracy            1 0.00%

Peek at the data

🗣️ How would you change the type of these columns?

Persistent ID	Location	Street Name	Service	Spatial Accuracy
NA	(51.5318, -0.123189)	Kings Cross (station)	British Transport Police	This is only an approx
NA	(51.528169, -0.131558)	On or near Wellesley Place	Police Force	This is only an approx
915131bf174019fd2fcf5aa4af305f7b2b34a763d8fcb0ccc4050503b82fa0fb	(51.542741, -0.174124)	On or near Avenue Road	Police Force	This is only an approx
NA	(51.529611, -0.121652)	On or near Birkenhead Street	Police Force	This is only an approx
bd5bef6ee7b3711e69ecfc40c1c256d45336f23aeda337b24616efed2eafddf1	(51.565544, -0.149851)	On or near Oakeshott Avenue	Police Force	This is only an approx
NA	(51.542507, -0.146589)	On or near Hawley Mews	Police Force	This is only an approx
32d289676240e51a8f0adcb1fd796547820cd59d5dcdeb6e1f6678d21dce9865	(51.548264, -0.156365)	On or near Maitland Park Villas	Police Force	This is only an approx
6149304809fb465afd840bd6a35bdfac1a06915d2bcb2baa7361719e5467d8bd	(51.558504, -0.191468)	On or near Ferncroft Avenue	Police Force	This is only an approx
34795948a0731cd1f445b59eb237e93146a784f108a81bceccfea8e2ffcd91d5	(51.541352, -0.14629)	On or near Nightclub	Police Force	This is only an approx
05d796c77702b5cd2f5c03c32ff51c49c95a7ecc0f4f404a701e141ece858160	(51.543954, -0.165408)	On or near Brocas Close	Police Force	This is only an approx
NA	(51.5325, -0.12601)	St Pancras International (station)	British Transport Police	This is only an approx
da582a168109e9c5e41f06add7c629083f0d5cd9ed0da5679732cce1fab9f6fe	(51.524465, -0.124891)	On or near Marchmont Street	Police Force	This is only an approx
8c71420dfdcee3dc390e78791189d931f72ed677a510f377e274dfd0cf205c29	(51.545363, -0.133277)	On or near Petrol Station	Police Force	This is only an approx
8421b68d78559638e4bb243feb673a076b145a44e9da98cd91f289560509b109	(51.541133, -0.192229)	On or near Quex Road	Police Force	This is only an approx
adef0ec085796c97a9dbca411af4b41adc33838025e704862cbb56580558827a	(51.545807, -0.162579)	On or near Supermarket	Police Force	This is only an approx
f374661faf45420007111e26473b5812c0ad22326e1e13aeece0c862467d3dda	(51.54288, -0.147425)	On or near Petrol Station	Police Force	This is only an approx
29b4b7f0a8e2a31a5be0dafe0003b25d57ab0b41695f92064f6d185e9d78919a	(51.523469, -0.126114)	On or near Parking Area	Police Force	This is only an approx
5416cc3cb894805a81221f43a42a47042a7c5baadf0871a43d04d8023e337718	(51.552811, -0.135698)	On or near Montpelier Grove	Police Force	This is only an approx
4e70cb3468ee72ab29bbadfb8141a6ea2ed04f1c2470d0890d1df6c4eeaaa500	(51.540012, -0.151997)	On or near Edis Street	Police Force	This is only an approx
3dc9881558b7c0d56141a49baf1afd1ce06abe0b6a78d393c972df50e240b714	(51.545174, -0.194145)	On or near Gladys Road	Police Force	This is only an approx
e7f17dc1694fd5e79b0845f155bcf8d1f06cf849e182dc4c263982b57e0f12d9	(51.548622, -0.20086)	On or near Barlow Road	Police Force	This is only an approx
ecc285067c5653ee74b31b52d94b5ff25b7086773bf9f8f4d427685259462888	(51.544104, -0.149754)	On or near Mead Close	Police Force	This is only an approx
f22771d69e7ab3c163b1e095d6af93af2ec93d4979262f591f0116e3951bfeb3	(51.538704, -0.141509)	On or near Greenland Place	Police Force	This is only an approx
152d6dc7d7830074f58a93b37547e514b52f56bc032a03f72a01e99cf9739b93	(51.539681, -0.142998)	On or near Nightclub	Police Force	This is only an approx
042b4e6d44c1dc6b57b7fbdb992ec7bc36ee9640084ad44221e486c7e18e329d	(51.518544, -0.119008)	On or near Yorkshire Grey Yard	Police Force	This is only an approx

Convert to categorical (`factor`) (R)

Normally, whenever the number of unique values is small, consider converting the variable to a factor.
But pay attention to the context.
- Never convert dates to factor!
- Also probably not a good idea to convert ID variables to factor!

Object sizes after conversion (R)

How many bytes does the csv_data object occupy in memory?

format(object.size(csv_data), units = "auto")

returns:

"99.5 Mb"

And after converting selected columns to categorical?

selected_columns <- c("Spatial Accuracy", "Service", "Location Subtype", 
                      "Category", "Ward Name", "Outcome Category", "Ward Code")

csv_data <- csv_data %>%
  mutate(across(all_of(selected_columns), factor))

format(object.size(csv_data), units = "auto")

returns:

"89.4 Mb"

Tip

A reduction of $10.1\%$ might not seem significant for a ‘small’ dataset, but remember: we’re building the habit of using the most efficient and most appropriate data types.

Use the `forcats` package (R)

The forcats package has useful functions for working with factors.

Example: Counting the number of occurrences of each level with fct_count()

library(forcats) # or library(tidyverse)
fct_count(csv_data$Service)

yields:

# A tibble: 2 x 2
  f                             n
  <fct>                     <int>
1 British Transport Police  21110
2 Police Force             355022

Or in case you need that inside a pipe chain for some reason:

csv_data %>% 
  select(Service) %>% 
  pull() %>%
  forcats::fct_count()

yields the same:

# A tibble: 2 x 2
  f                             n
  <fct>                     <int>
1 British Transport Police  21110
2 Police Force             355022

Convert to `Date` objects (R)

In base R:

# It works fine, but it is a bit slow
tmp_date_example <- 
  as.Date(csv_data$`Outcome Date`, format = "%d/%m/%Y")


class(tmp_date_example)

returns:

"Date"

In tidyverse:

library(lubridate) 

# It is MUCH faster and more flexible
tmp_date_example <- dmy(csv_data$`Outcome Date`)

class(tmp_date_example)

also produces the same data type:

"Date"

Welcome to yet another tidyverse package: lubridate.

What can you do with lubridate? (R)

This time we didn’t save much on memory, but we gained a lot in terms of flexibility.

selected_columns <- c("Last Uploaded", "Epoch", "Outcome Date")

csv_data <- 
  csv_data %>%
  mutate(across(all_of(selected_columns), lubridate::dmy))

format(object.size(csv_data), units = "auto")

returns the same as before

"89.4 Mb"

➡️ Let’s check out lubridate’s cheat sheet to see what we can do with dates: lubridate

Finally, let’s spot truly int variables (R)

Important

is.integer() DOES NOT test if the number can be safely converted to an integer.

We can read about it in the relevant R documentation:

?is.integer

Using their suggested made-up function to identify integer columns:

is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

csv_data %>%
  summarise(across(where(is.numeric), ~all(is.wholenumber(.))))

we obtain:

# A tibble: 1 x 6
  `Street ID` ID    Easting Northing Longitude Latitude
  <lgl>       <lgl> <lgl>   <lgl>    <lgl>     <lgl>
1 TRUE        TRUE  FALSE   FALSE    FALSE     FALSE

Finally, let’s spot truly int variables

Let’s then convert the necessary columns to integer:

csv_data <- 
  csv_data %>%
  mutate(across(all_of(c("Street ID", "ID")), as.integer))

This leads to a further ~3% reduction in memory usage:

format(object.size(csv_data), units = "auto")

“86.5 Mb”

Our current data frame

str(csv_data)

returns:

tibble [376,132 x 20] (S3: tbl_df/tbl/data.frame)
 $ Category        : Factor w/ 14 levels "Anti-social behaviour",..: 7 1 12 1 5 1 14 13 12 7 ...
 $ Street ID       : int [1:376132] 1489515 960522 965233 960974 972275 965090 967816 967555 965140 965052 ...
 $ Street Name     : chr [1:376132] "Kings Cross (station)" "On or near Wellesley Place" "On or near Avenue Road" "On or near Birkenhead Street" ...
 $ Context         : logi [1:376132] NA NA NA NA NA NA ...
 $ Outcome Category: Factor w/ 26 levels "Action to be taken by another organisation",..: 23 NA 9 NA 14 NA 26 9 26 9 ...
 $ Outcome Date    : Date[1:376132], format: "2017-08-01" NA "2015-08-01" NA ...
 $ Service         : Factor w/ 2 levels "British Transport Police",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ Location Subtype: Factor w/ 2 levels "London Underground Station",..: 2 NA NA NA NA NA NA NA NA NA ...
 $ ID              : int [1:376132] 64777250 51520755 42356413 59431385 41931981 51522064 58014826 46231592 64334450 48508596 ...
 $ Persistent ID   : chr [1:376132] NA NA "915131bf174019fd2fcf5aa4af305f7b2b34a763d8fcb0ccc4050503b82fa0fb" NA ...
 $ Epoch           : Date[1:376132], format: "2017-04-01" "2016-09-01" "2015-07-01" "2017-08-01" ...
 $ Ward Code       : Factor w/ 38 levels "E05000128","E05000129",..: 16 16 17 14 10 3 9 6 3 1 ...
 $ Ward Name       : Factor w/ 27 levels "Belsize","Bloomsbury",..: 25 25 26 19 13 5 12 9 5 1 ...
 $ Easting         : num [1:376132] 530277 529707 526717 530390 528336 ...
 $ Northing        : num [1:376132] 183101 182683 184228 182861 186806 ...
 $ Longitude       : num [1:376132] -0.123 -0.132 -0.174 -0.122 -0.15 ...
 $ Latitude        : num [1:376132] 51.5 51.5 51.5 51.5 51.6 ...
 $ Spatial Accuracy: Factor w/ 1 level "This is only an approximation of where the crime happened": 1 1 1 1 1 1 1 1 1 1 ...
 $ Last Uploaded   : Date[1:376132], format: "2018-07-11" "2018-07-11" "2016-05-05" "2017-11-03" ...
 $ Location        : chr [1:376132] "(51.5318, -0.123189)" "(51.528169, -0.131558)" "(51.542741, -0.174124)" "(51.529611, -0.121652)" ...

Principle 1: Ensure data types make sense (Python)

Reflect on the `object` variables (Python)

How many unique values each of them have?

(
  df_street_crime
    .select_dtypes(include=['object'])
    .nunique()
    .sort_values(ascending=False)
    .assign(Percentage=lambda x: x/x.sum())
    .to_frame(name='# Distinct')
    .style.format({'Percentage': '{:.2%}'})
)

🗣️ CLASS DISCUSSION

What does this summary tell us?

Column	# Distinct	Percentage
Persistent ID	257160	68.37%
Location	2917	0.78%
Street Name	1311	0.35%
Outcome Date	101	0.03%
Epoch	100	0.03%
Ward Code	38	0.01%
Outcome Category	27	0.01%
Ward Name	27	0.01%
Last Uploaded	27	0.01%
Category	14	0.00%
Location Subtype	3	0.00%
Service	2	0.00%
Spatial Accuracy	1	0.00%

🛣️ First, let’s take a little detour to understand the code above.

🛣️ Detour: Code explainer (Python)

First, we select only the object variables.

(
  df_street_crime
    .select_dtypes(include=['object'])
    .nunique()
    .sort_values(ascending=False)
    .assign(Percentage=lambda x: x/x.sum())
    .to_frame(name='# Distinct')
    .style.format({'Percentage': '{:.2%}'})
)

We end up with these columns:

"Category"        
"Street Name"     
"Outcome Category"
"Outcome Date"    
"Service"
"Location Subtype"
"Persistent ID"
"Epoch"
"Ward Code"
"Ward Name"
"Spatial Accuracy"
"Last Uploaded"
"Location"

🛣️ Detour: Code explainer (Python)

Then we count the number of unique values each selected column using the .nunique() method and sort them in descending order:

(
  df_street_crime
    .select_dtypes(include=['object'])
    .nunique()
    .sort_values(ascending=False)
    .assign(Percentage=lambda x: x/x.sum())
    .to_frame(name='# Distinct')
    .style.format({'Percentage': '{:.2%}'})
)

This returns the following table:

Persistent ID       286221
Location              2936
Street Name           1317
Outcome Date           111
Epoch                  111
Ward Code               38
Last Uploaded           36
Ward Name               27
Outcome Category        26
Category                14
Service                  2
Location Subtype         2
Spatial Accuracy         1
dtype: int64

🛣️ Detour: Code explainer (Python)

We use the .assign() method to calculate the percentage of each column using an anonymous lambda function:

(
  df_street_crime
    .select_dtypes(include=['object'])
    .nunique()
    .sort_values(ascending=False)
    .assign(Percentage=lambda x: x/x.sum())
    .to_frame(name='# Distinct')
    .style.format({'Percentage': '{:.2%}'})
)

🛣️ Detour: Code explainer (Python)

We then convert the result to an html-formatted data frame using the .to_frame() method and format the Percentage column to shorten the decimal places:

(
  df_street_crime
    .select_dtypes(include=['object'])
    .nunique()
    .sort_values(ascending=False)
    .assign(Percentage=lambda x: x/x.sum())
    .to_frame(name='# Distinct')
    .style.format({'Percentage': '{:.2%}'})
)

This returns the following table:

	# Distinct	Percentage
Persistent ID	286221	98.41%
Location	2936	1.01%
Street Name	1317	0.45%
Outcome Date	111	0.04%
Epoch	111	0.04%
Ward Code	38	0.01%
Last Uploaded	36	0.01%
Ward Name	27	0.01%
Outcome Category	26	0.01%
Category	14	0.00%
Service	2	0.00%
Location Subtype	2	0.00%
Spatial Accuracy	1	0.00%

Convert to categorical (`category`) (Python)

Normally, whenever the number of unique values is small, consider converting the variable to a category.
But pay attention to the context.
- Never convert dates to category!
- Also probably not a good idea to convert ID variables to category!

Object sizes after conversion (Python)

How many bytes (rounded to MB) does the df_street_crime object occupy in memory?

df_street_crime.info(memory_usage='deep')

returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Category          419273 non-null  object 
 1   Street ID         419273 non-null  int64  
# ... and so on
 18  Last Uploaded     419273 non-null  object 
 19  Location          419273 non-null  object 
dtypes: float64(5), int64(2), object(13)
memory usage: 403.6 MB

And after converting selected columns to categorical?

cat_columns = ["Spatial Accuracy", "Service", 
              "Location Subtype", "Category", 
              "Ward Name", "Outcome Category", 
              "Ward Code"]
df_street_crime[cat_columns] = df_street_crime[cat_columns] \
                                  .astype('category')
df_street_crime.info(memory_usage='deep')

returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Category          419273 non-null  category
# ... and so on
 19  Location          419273 non-null  object  
dtypes: category(7), float64(5), int64(2), object(6)
memory usage: 204.7 MB

Tip

We see a substantial reduction: almost half the memory usage! But even if it were only a few MB, we’re building the habit of using the most efficient and most appropriate data types.

Counting the number of occurrences (Python)

The .value_counts() method is useful for counting the number of occurrences of each category in a column.

df_street_crime['Service'].value_counts()

returns:

Service
Police Force                392645
British Transport Police     26628
Name: count, dtype: int64

Note

.method() refers to a function that belongs to an object. In this case, value_counts() is a method of the Series or a DataFrame object.

Convert to `datetime64` objects (Python)

We can convert the Outcome Date column to a datetime64 object.

df_street_crime["Outcome Date"] = pd.to_datetime(df_street_crime["Outcome Date"])

so that

df_street_crime.info(memory_usage='deep')

returns:

# ... other stuff
 19  Location          419273 non-null  object        
dtypes: category(7), datetime64[ns](1), float64(5), int64(2), object(5)
memory usage: 184.3 MB

That is, we saved another 20 MB from just one column!

Convert to `datetime64` objects (Python)

If we do the same for the other datetime columns:

date_cols = ["Last Uploaded", "Epoch", "Outcome Date"]
df_street_crime[date_cols] = df_street_crime[date_cols].apply(pd.to_datetime, 
                                                              format='%d/%m/%Y', 
                                                              errors='coerce')
df_street_crime.info(memory_usage='deep')

we get:

dtypes: category(7), datetime64[ns](3), float64(5), int64(2), object(3)
memory usage: 137.1 MB

We have now saved another 50 MB from the other two columns.

What can you do with `pd.to_datetime()`? (Python)

You can convert a wide range of date formats to datetime64 objects, even if they are not in the standard YYYY-MM-DD format.

pd.to_datetime('2021-01-01')

returns:

Timestamp('2021-01-01 00:00:00')

as does

pd.to_datetime('2021/01/01')

or even

pd.to_datetime('1 Jan 2021')

Finally, let’s spot which `int` variables can be safely converted (Python)

df_street_crime.select_dtypes(include=['number']).columns

returns

Index(['Street ID', 'Context', 'ID', 'Easting', 'Northing', 'Longitude',
       'Latitude'],
      dtype='object')

Finally, let’s spot which `int` variables can be safely converted (Python)

import numpy as np

def check_int_conversion(col):
    col_min = df_street_crime[col].min()
    col_max = df_street_crime[col].max()
    
    dtypes = [
        (np.int8, 'int8'),
        (np.int16, 'int16'),
        (np.int32, 'int32'),
        (np.int64, 'int64')
    ]
    
    for dtype, dtype_name in dtypes:
        if col_min > np.iinfo(dtype).min and col_max < np.iinfo(dtype).max:
            print( f"{col} can be safely downcast to {dtype_name}")
            return col
    print( f"{col} cannot be safely downcast to int")

num_cols = df_street_crime.select_dtypes(include=['int']).columns

Finally, let’s spot which `int` variables can be safely converted (Python)

# downcast int columns

safe_int_cols = [check_int_conversion(col) for col in num_cols]
df_street_crime[safe_int_cols] = df_street_crime[safe_int_cols].apply(pd.to_numeric,
                                                                      downcast='integer')

# downcast float columns
safe_float_cols = df_street_crime.select_dtypes(include=['float']).columns
df_street_crime[safe_float_cols] = df_street_crime[safe_float_cols].apply(pd.to_numeric, downcast='float')

df_street_crime.info(memory_usage='deep')

returns:

# ... other stuff
 19  Location          419273 non-null  object        
dtypes: category(7), datetime64[ns](3), float32(3), float64(2), int32(2), object(3)
memory usage: 129.1 MB

This is a marginal improvement compared to categorical variables, but it’s still a good practice to follow when you are working with larger datasets.

Our current data frame (Python)

df_street_crime.info(memory_usage='deep')

returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Category          419273 non-null  category      
 1   Street ID         419273 non-null  int32         
 2   Street Name       419273 non-null  object        
 3   Context           0 non-null       float32       
 4   Outcome Category  324353 non-null  category      
 5   Outcome Date      324353 non-null  datetime64[ns]
 6   Service           419273 non-null  category      
 7   Location Subtype  15697 non-null   category      
 8   ID                419273 non-null  int32         
 9   Persistent ID     298195 non-null  object        
 10  Epoch             419273 non-null  datetime64[ns]
 11  Ward Code         419273 non-null  category      
 12  Ward Name         419273 non-null  category      
 13  Easting           419273 non-null  float64       
 14  Northing          419273 non-null  float64       
 15  Longitude         419273 non-null  float32       
 16  Latitude          419273 non-null  float32       
 17  Spatial Accuracy  419273 non-null  category      
 18  Last Uploaded     275763 non-null  datetime64[ns]
 19  Location          419273 non-null  object        
dtypes: category(7), datetime64[ns](3), float32(3), float64(2), int32(2), object(3)
memory usage: 129.1 MB

Principle 2: Drop unnecessary columns (R)

Some columns are obviously not useful (R)

Every single row in the data frame has

Spatial Accuracy == "This is only an approximation of where the crime happened"

But it’s not the only one!

csv_data %>% 
  summarise(across(everything(), n_distinct)) %>% 
  pivot_longer(everything(), names_to = "Column", values_to = "# Distinct") %>% 
  filter(`# Distinct` == 1)

Column Context is also meaningless:

# A tibble: 2 x 2
  Column           `# Distinct`
  <chr>                   <int>
1 Context                     1
2 Spatial Accuracy            1

Some columns are obviously not useful (R)

Every single row in the data frame has

csv_data %>% select(Context) %>% distinct()

yields

# A tibble: 1 x 1
  Context
  <lgl>  
1 NA

Let’s drop them:

csv_data <- csv_data %>% select(-`Spatial Accuracy`, `Context`)

Yay! A few more Mb saved!

format(object.size(csv_data), units = "auto")

"83.7 Mb"

Others are context-dependent (R)

I don’t care much about Persistent ID
- It looks like it’s an internal database ID thing
I couldn’t find any documentation about Epoch
If I don’t plan to plot maps, I can drop Location.
- I could always rebuild it from Latitude and Longitude if I need it later.

csv_data %>% 
  select(Latitude, Longitude, Location) %>% 
  head()

# A tibble: 6 x 3
  Latitude Longitude Location
     <dbl>     <dbl> <chr>
1 51.5318  -0.123189 (51.5318, -0.123189)  
2 51.52817 -0.131558 (51.528169, -0.131558)
3 51.54274 -0.174124 (51.542741, -0.174124)
4 51.52961 -0.121652 (51.529611, -0.121652)
5 51.56554 -0.149851 (51.565544, -0.149851)
6 51.54251 -0.146589 (51.542507, -0.146589)

Drop’em (R)

csv_data <- 
  csv_data %>% 
  select(-c(`Persistent ID`, `Epoch`, `Location`))

Now check out those savings!

format(object.size(csv_data), units = "auto")

"31.7 Mb"

Look at our beautiful data frame: (R)

str(csv_data)

tibble [376,132 x 15] (S3: tbl_df/tbl/data.frame)
 $ Category        : Factor w/ 14 levels "Anti-social behaviour",..: 7 1 12 1 5 1 14 13 12 7 ...
 $ Street ID       : int [1:376132] 1489515 960522 965233 960974 972275 965090 967816 967555 965140 965052 ...
 $ Street Name     : chr [1:376132] "Kings Cross (station)" "On or near Wellesley Place" "On or near Avenue Road" "On or near Birkenhead Street" ...
 $ Outcome Category: Factor w/ 26 levels "Action to be taken by another organisation",..: 23 NA 9 NA 14 NA 26 9 26 9 ...
 $ Outcome Date    : Date[1:376132], format: "2017-08-01" NA "2015-08-01" NA ...
 $ Service         : Factor w/ 2 levels "British Transport Police",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ Location Subtype: Factor w/ 2 levels "London Underground Station",..: 2 NA NA NA NA NA NA NA NA NA ...
 $ ID              : int [1:376132] 64777250 51520755 42356413 59431385 41931981 51522064 58014826 46231592 64334450 48508596 ...
 $ Ward Code       : Factor w/ 38 levels "E05000128","E05000129",..: 16 16 17 14 10 3 9 6 3 1 ...
 $ Ward Name       : Factor w/ 27 levels "Belsize","Bloomsbury",..: 25 25 26 19 13 5 12 9 5 1 ...
 $ Easting         : num [1:376132] 530277 529707 526717 530390 528336 ...
 $ Northing        : num [1:376132] 183101 182683 184228 182861 186806 ...
 $ Longitude       : num [1:376132] -0.123 -0.132 -0.174 -0.122 -0.15 ...
 $ Latitude        : num [1:376132] 51.5 51.5 51.5 51.5 51.6 ...
 $ Last Uploaded   : Date[1:376132], format: "2018-07-11" "2018-07-11" "2016-05-05" "2017-11-03" ...

Principle 2: Drop unnecessary columns (Python)

Some columns are obviously not useful (Python)

Every single row in the data frame has

Spatial Accuracy
This is only an approximation of where the crime happened    419273
Name: count, dtype: int64

But it’s not the only one!

df_street_crime['Context'].value_counts()

Column Context is also meaningless:

Series([], Name: count, dtype: int64)

Let’s drop them:

df_street_crime.drop(columns=['Context', 'Spatial Accuracy'], inplace=True)

Yay! A few more MB saved!

df_street_crime.info(memory_usage='deep')

dtypes: category(6), datetime64[ns](3), float32(2), float64(2), int32(2), object(3)
memory usage: 127.1 MB

Others are context-dependent (Python)

I don’t care much about Persistent ID
- It looks like it’s an internal database ID thing
I couldn’t find any documentation about Epoch
If I don’t plan to plot maps, I can drop Location.
- I could always rebuild it from Latitude and Longitude if I need it later.

df_street_crime[['Latitude', 'Longitude', 'Location']].head()

Latitude	Longitude	Location
51.5318	-0.123189	(51.5318, -0.123189)
51.52817	-0.131558	(51.528169, -0.131558)
51.54274	-0.174124	(51.542741, -0.174124)
51.52961	-0.121652	(51.529611, -0.121652)
51.56554	-0.149851	(51.565544, -0.149851)
51.54251	-0.146589	(51.542507, -0.146589)

Drop’em (Python)

df_street_crime.drop(columns=['Epoch', 'Location', 'Persistent ID'],  inplace=True)

Now check out those savings!

df_street_crime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Category          419273 non-null  category      
# ... other stuff 
 14  Last Uploaded     275763 non-null  datetime64[ns]
dtypes: category(6), datetime64[ns](2), float32(2), float64(2), int32(2), object(1)
memory usage: 54.3 MB

Look at our beautiful data frame: (Python)

df_street_crime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 419273 entries, 0 to 419272
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Category          419273 non-null  category      
 1   Street ID         419273 non-null  int32         
 2   Street Name       419273 non-null  object        
 3   Outcome Category  324353 non-null  category      
 4   Outcome Date      324353 non-null  datetime64[ns]
 5   Service           419273 non-null  category      
 6   Location Subtype  15697 non-null   category      
 7   ID                419273 non-null  int32         
 8   Ward Code         419273 non-null  category      
 9   Ward Name         419273 non-null  category      
 10  Easting           419273 non-null  float64       
 11  Northing          419273 non-null  float64       
 12  Longitude         419273 non-null  float32       
 13  Latitude          419273 non-null  float32       
 14  Last Uploaded     275763 non-null  datetime64[ns]
dtypes: category(6), datetime64[ns](2), float32(2), float64(2), int32(2), object(1)
memory usage: 54.3 MB

A quick look at JSON and XML

JSON

JSON stands for JavaScript Object Notation

It looks like this:

{
  "meta" : {
    "view" : {
      "id" : "893b-tp33",
      "name" : "On Street Crime In Camden Map",
      "assetType" : "map",
      "createdAt" : 1432283465,
      "description" : "Data provided by the Police API (https://data.police.uk/docs/).",
      "displayType" : "map",
      "downloadCount" : 185,
      
      ...

      "viewType" : "tabular",
      "approvals" : [ {
        "reviewedAt" : 1440144247,
        "state" : "approved",
        "submissionId" : 247349,
        "submissionDetails" : {
          "permissionType" : "READ"
        },
      } ],
    },
  },
}

XML

XML stands for Extensible Markup Language
It looks like this:

<response>
  <row>
    <row _id="row-rbgw-qsj4.fr34" _uuid="00000000-0000-0000-0C14-790FCD1D8E66" _position="0" _address="https://opendata.camden.gov.uk/resource/_893b-tp33/row-rbgw-qsj4.fr34">
      <category>Other theft</category>
      <street_id>1489515</street_id>
      <street_name>Kings Cross (station)</street_name>
      <outcome_category>Status update unavailable</outcome_category>
      <outcome_date>2017-08-01T00:00:00</outcome_date>
      <service>British Transport Police</service>
      <location_subtype>Station</location_subtype>
      <id>64777250</id>
      <epoch>2017-04-01T00:00:00</epoch>
      <ward_code>E05000143</ward_code>
      <ward_name>St Pancras and Somers Town</ward_name>
      <easting>530277.37</easting>
      <northing>183101.39</northing>
      <longitude>-0.123189</longitude>
      <latitude>51.5318</latitude>
      <spatial_accuracy>This is only an approximation of where the crime happened</spatial_accuracy>
      <last_uploaded>2018-07-11T00:00:00</last_uploaded>
      <location latitude="51.5318" longitude="-0.123189"/>
    </row>
    <row>
      ...
    </row>
    ...
  </row>
</response>

Other file formats

Parquet
AVRO
ORC

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol [CA]: O’Reilly. https://r4ds.had.co.nz/.

🗓️ Week 01 – Day 02Data types and common file formats

Datasets

We need organised datasets

An example

Data Types

Structured Data

Example

Example (cont.)

Example (cont.)

Example (cont.)

Text (data types)

The ASCII table

Other text formats

Emojis 😀

Structured data can still be messy

What this means for you

What about Unstructured Data?

Examples

Semi-structured data

Common File Formats

CSV

CSV - Example

CSV - Example

CSV - Example

CSV - Inspecting data types (R)

CSV - Inspecting data types (tidyverse)

CSV - Inspecting data types (tidyverse)

From the tibble documentation:

CSV - Inspecting data types (Python)

CSV - Inspecting data types (df.info())

Our approach to data processing

Principle 1: Ensure data types make sense (R)

Reflect on the character variables (R)

🛣️ Detour: Code explainer

🛣️ Detour: Code explainer (R)

🛣️ Detour: Code explainer (R)

🛣️ Detour: Code explainer (R)

🛣️ Detour: Code explainer (R)

Peek at the data

Convert to categorical (factor) (R)

Object sizes after conversion (R)

Use the forcats package (R)

Convert to Date objects (R)

What can you do with lubridate? (R)

Finally, let’s spot truly int variables (R)

Finally, let’s spot truly int variables

Our current data frame

Principle 1: Ensure data types make sense (Python)

Reflect on the object variables (Python)

🛣️ Detour: Code explainer (Python)

🛣️ Detour: Code explainer (Python)

🛣️ Detour: Code explainer (Python)

🛣️ Detour: Code explainer (Python)

Convert to categorical (category) (Python)

Object sizes after conversion (Python)

Counting the number of occurrences (Python)

Convert to datetime64 objects (Python)

Convert to datetime64 objects (Python)

What can you do with pd.to_datetime()? (Python)

Finally, let’s spot which int variables can be safely converted (Python)

Finally, let’s spot which int variables can be safely converted (Python)

Finally, let’s spot which int variables can be safely converted (Python)

Our current data frame (Python)

Principle 2: Drop unnecessary columns (R)

Some columns are obviously not useful (R)

Some columns are obviously not useful (R)

Others are context-dependent (R)

Drop’em (R)

Look at our beautiful data frame: (R)

Principle 2: Drop unnecessary columns (Python)

Some columns are obviously not useful (Python)

Others are context-dependent (Python)

Drop’em (Python)

Look at our beautiful data frame: (Python)

A quick look at JSON and XML

JSON

XML

Other file formats

References

🗓️ Week 01 – Day 02
Data types and common file formats

From the `tibble` documentation:

CSV - Inspecting data types (`df.info()`)

Reflect on the `character` variables (R)

Convert to categorical (`factor`) (R)

Use the `forcats` package (R)

Convert to `Date` objects (R)

Reflect on the `object` variables (Python)

Convert to categorical (`category`) (Python)

Convert to `datetime64` objects (Python)

Convert to `datetime64` objects (Python)

What can you do with `pd.to_datetime()`? (Python)

Finally, let’s spot which `int` variables can be safely converted (Python)

Finally, let’s spot which `int` variables can be safely converted (Python)

Finally, let’s spot which `int` variables can be safely converted (Python)