DS202 - Data Science for Social Scientists
Dr. Jon Cardoso-Silva
01 November 2022
OBJECTIVE: Support with R programming skills
# Say we have a vector of categorical strings
sample_vector <- c("blue", "green", "blue", "red")
typeof(sample_vector)
[1] "character"
# Factor is a built-in R feature
# This is the best way of representing categorical variables
factor(sample_vector)
[1] blue green blue red
Levels: blue green red
You can force an order:
[1] blue green blue red
Levels: red green blue
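The output above is consistent with passing an explicit `levels` argument to `factor()`. A minimal sketch, assuming that is how the order was forced (the variable name `ordered_colours` is only illustrative):

```r
# Same vector as in the example above
sample_vector <- c("blue", "green", "blue", "red")

# The order of `levels` decides the order of the categories
ordered_colours <- factor(sample_vector, levels = c("red", "green", "blue"))
ordered_colours

# Internally each element is stored as the position of its level
as.integer(ordered_colours)
```

With this order, "red" is level 1, "green" is level 2 and "blue" is level 3.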
They have the same length
Elements at the same index represent the same "day"
Now, put it all together:
Keyword: Logical Operators
& stands for an AND operation
| stands for an OR operation
! stands for a NOT operation
Read more about it here
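As a small illustration of these operators, here is a sketch with two hypothetical vectors, `colours_a` and `colours_b`, of the same length (the names and values are assumptions made for this example):

```r
# Two hypothetical vectors; the same index represents the same "day"
colours_a <- c("blue", "red", "green", "green", "blue", "red")
colours_b <- c("red", "red", "blue", "blue", "blue", "green")

# Comparisons are element-wise and return logical vectors
is_blue_a <- colours_a == "blue"
is_blue_b <- colours_b == "blue"

both_blue   <- is_blue_a & is_blue_b   # AND: blue in both vectors
either_blue <- is_blue_a | is_blue_b   # OR: blue in at least one vector
not_blue_a  <- !is_blue_a              # NOT: anything but blue in colours_a

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
n_both <- sum(both_blue)
n_both
```

Here only the fifth position is blue in both vectors, so `n_both` is 1.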
If I wanted to do it all in a single line:
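A possible one-liner, sketched with the same kind of hypothetical vectors (`colours_a` and `colours_b` are illustrative names, not taken from the original):

```r
colours_a <- c("blue", "red", "green", "green", "blue", "red")
colours_b <- c("red", "red", "blue", "blue", "blue", "green")

# Compare, combine with AND, and count, all in one expression
n_both_blue <- sum(colours_a == "blue" & colours_b == "blue")
n_both_blue
```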
If I have the same info but now represented as a data frame, how would I count the number of blues in common?
# A random dataframe
df <- data.frame(colourA=c("blue", "red", "green", "green", "blue", "red"),
colourB=c("red", "red", "blue", "blue", "blue", "green"))
df
  colourA colourB
1    blue     red
2     red     red
3   green    blue
4   green    blue
5    blue    blue
6     red   green
You can access each column by using the $ operator:
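A minimal sketch of column access with `$`, reusing the `df` defined above to count the blues in common (the name `blues_in_common` is only illustrative):

```r
df <- data.frame(colourA = c("blue", "red", "green", "green", "blue", "red"),
                 colourB = c("red", "red", "blue", "blue", "blue", "green"))

# Each column comes back as a plain vector
df$colourA
df$colourB

# Rows where both columns are "blue"
blues_in_common <- sum(df$colourA == "blue" & df$colourB == "blue")
blues_in_common
```

Only row 5 has "blue" in both columns, so the count is 1.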
# A random dataframe
df1 <- data.frame(observation=c(1, 2, 3, 4, 5, 6),
colour=c("blue", "red", "green", "green", "blue", "red"))
df1
  observation colour
1           1   blue
2           2    red
3           3  green
4           4  green
5           5   blue
6           6    red
# A random dataframe
df2 <- data.frame(observation=c(1, 2, 3, 4, 5, 5, 6),
colour=c("red", "red", "blue", "blue", "red", "blue", "green"))
df2
  observation colour
1           1    red
2           2    red
3           3   blue
4           4   blue
5           5    red
6           5   blue
7           6  green
First, let's calculate whether there was at least one "blue" in each observation.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  observation
1           1
2           2
3           3
4           4
5           5
6           5
7           6
The pipe
I could do exactly the same thing using the pipe %>%
  observation
1           1
2           2
3           3
4           4
5           5
6           5
7           6
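The original calls are not shown here, but the printed column is consistent with selecting `observation` from `df2`. As a sketch under that assumption, the same operation with and without the pipe:

```r
library(dplyr)

df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# Without the pipe: the dataframe is the first argument
without_pipe <- select(df2, observation)

# With the pipe: the left-hand side becomes the first argument of select()
with_pipe <- df2 %>% select(observation)

identical(without_pipe, with_pipe)
```

The pipe pays off when several steps are chained, because the code then reads from left to right in the order the steps are applied.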
Check the idea of group_by (a tidyverse feature).
summarise and n() are typically used together with groupings (group_by); without a grouping, they treat the whole dataframe as a single group.
# How many colours are there, per observation?
df2 %>% group_by(observation) %>% summarise(count=n())
# A tibble: 6 × 2
  observation count
        <dbl> <int>
1           1     1
2           2     1
3           3     1
4           4     1
5           5     2
6           6     1
Which observations in df2 have at least one colour "blue"?
Which observations in df1 have at least one colour "blue"?
Both dataframes now have the same number of rows, representing the same "observations", and both have a column called has_blue. I can compare both like this:
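The code for these steps is not shown here, so the following is only a sketch of one way to do it. It assumes `has_blue` is computed with `any()` inside `summarise()`; the names `df1_blue`, `df2_blue` and `n_both_blue` are illustrative.

```r
library(dplyr)

df1 <- data.frame(observation = c(1, 2, 3, 4, 5, 6),
                  colour = c("blue", "red", "green", "green", "blue", "red"))
df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# One row per observation: TRUE if any of its colours is "blue"
df1_blue <- df1 %>% group_by(observation) %>% summarise(has_blue = any(colour == "blue"))
df2_blue <- df2 %>% group_by(observation) %>% summarise(has_blue = any(colour == "blue"))

# Both summaries have 6 rows in the same order, so they can be compared element-wise
n_both_blue <- sum(df1_blue$has_blue & df2_blue$has_blue)
n_both_blue
```

Observation 5 is the only one with a "blue" in both dataframes, so the count is 1.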
Keyword: Logical Operators
& stands for an AND operation
| stands for an OR operation
! stands for a NOT operation
Read more about it here
Useful if the two dataframes are not aligned
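The original does not show which function is meant here. One option that handles dataframes whose rows are in a different order is a join on the shared key. A sketch with dplyr's `inner_join`, using two small hypothetical dataframes (`blue_a` and `blue_b` are made up for this example):

```r
library(dplyr)

# Hypothetical per-observation summaries, listed in different orders
blue_a <- data.frame(observation = c(1, 2, 3), has_blue = c(TRUE, FALSE, TRUE))
blue_b <- data.frame(observation = c(3, 1, 2), has_blue = c(TRUE, FALSE, FALSE))

# Match rows by observation instead of by position
joined <- inner_join(blue_a, blue_b, by = "observation", suffix = c("_a", "_b"))
joined

# Now the two has_blue columns refer to the same observation in every row
n_both_blue <- sum(joined$has_blue_a & joined$has_blue_b)
n_both_blue
```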
mutate
  observation colour is_blue
1           1    red   FALSE
2           2    red   FALSE
3           3   blue    TRUE
4           4   blue    TRUE
5           5    red   FALSE
6           5   blue    TRUE
7           6  green   FALSE
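The table above is consistent with a `mutate()` call that compares `colour` to "blue". A minimal sketch, assuming that is how `is_blue` was created (the name `flagged` is only illustrative):

```r
library(dplyr)

df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# mutate() adds a column computed row by row from existing columns
flagged <- df2 %>% mutate(is_blue = colour == "blue")
flagged
```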
Note: mutate will add a new column, but it will NOT update the original dataframe. If you want to re-use the new column, you have to save the new dataframe:
  observation colour
1           1    red
2           2    red
3           3   blue
4           4   blue
5           5    red
6           5   blue
7           6  green
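A short sketch of that behaviour, assuming the same `is_blue` column as before (`df2_flagged` is a hypothetical name for the saved copy):

```r
library(dplyr)

df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# The result is printed and then discarded because it is not assigned
df2 %>% mutate(is_blue = colour == "blue")

# df2 itself still has only its two original columns
names(df2)

# Saving the result under a new name keeps the extra column
df2_flagged <- df2 %>% mutate(is_blue = colour == "blue")
names(df2_flagged)
```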
If I want to update the SAME dataframe, I have to reassign it (using <-)
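A minimal sketch of the reassignment, again assuming an `is_blue` column:

```r
library(dplyr)

df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# Overwrite df2 with the mutated version of itself
df2 <- df2 %>% mutate(is_blue = colour == "blue")

# The new column is now part of df2
names(df2)
```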
By manual inspection:
I will use iris as an example:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
Generate a histogram of Petal.Length:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
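The message above is what `geom_histogram()` prints when no bin setting is given. A minimal sketch of a plot that would produce it (the name `hist_plot` is only illustrative):

```r
library(ggplot2)

# Petal.Length on the x axis; geom_histogram() counts the rows per bin
hist_plot <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram()

hist_plot
```

Passing `binwidth` (or `bins`) to `geom_histogram()` sets the bins explicitly and silences the message.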
How do I colour the histogram according to the Species?
g <- (
ggplot(iris, aes(x=Petal.Length, fill=Species))
+ geom_histogram()
# Customize
+ theme_minimal()
)
g
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Each geom_ "listens" to a different set of aesthetics. (Check Chapter 3 of R for Data Science for more info.)
For example, geom_point with its default shape does not respond to fill:
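As a sketch of that point, the plot below maps Species to `colour`, which `geom_point` does respond to with its default shape. The choice of Petal.Width for the y axis is an assumption made for this example.

```r
library(ggplot2)

# With the default point shape, `fill` has no visible effect;
# `colour` is the aesthetic that changes how the points look
point_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  theme_minimal()

point_plot
```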
Places to find colours: https://www.color-hex.com/color-palettes/popular.php
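Hex codes from a site like this can be passed to `scale_fill_manual()`. A minimal sketch; the three hex values below are arbitrary choices for illustration, not a palette taken from the original:

```r
library(ggplot2)

# One hex code per species (arbitrary example values)
species_colours <- c(setosa = "#1b9e77", versicolor = "#d95f02", virginica = "#7570b3")

manual_plot <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(bins = 30) +
  scale_fill_manual(values = species_colours) +
  theme_minimal()

manual_plot
```

Naming the elements of the vector ties each colour to a specific species instead of relying on the order of the levels.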
You can use built-in palettes, you just need to know their names/numbers.
Check the documentation for the different settings: https://ggplot2.tidyverse.org/reference/scale_brewer.html
To understand which colour palettes are available, check: https://colorbrewer2.org/
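A short sketch of a built-in ColorBrewer palette applied to the histogram from earlier. The palette name "Set2" is just one example of the names listed on those pages:

```r
library(ggplot2)

brewer_plot <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(bins = 30) +
  # `palette` takes a ColorBrewer palette name (or number)
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

brewer_plot
```

For aesthetics mapped to `colour` instead of `fill`, the matching function is `scale_colour_brewer()`.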
Useful when you want to plot two charts in the same image.
Observation: you might need to combine (append) the two dataframes first. Use the tidyverse function bind_rows (similar to base R's rbind, but it matches columns by name)
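A sketch of appending the two dataframes from earlier with `bind_rows`. The `.id` column name `source` and the use of `facet_wrap()` to draw one panel per dataframe are assumptions for this example, since the original plotting code is not shown.

```r
library(dplyr)
library(ggplot2)

df1 <- data.frame(observation = c(1, 2, 3, 4, 5, 6),
                  colour = c("blue", "red", "green", "green", "blue", "red"))
df2 <- data.frame(observation = c(1, 2, 3, 4, 5, 5, 6),
                  colour = c("red", "red", "blue", "blue", "red", "blue", "green"))

# Stack the rows; `.id` records which dataframe each row came from
combined <- bind_rows(list(df1 = df1, df2 = df2), .id = "source")

# One panel per original dataframe, side by side in the same image
facet_plot <- ggplot(combined, aes(x = colour)) +
  geom_bar() +
  facet_wrap(~ source)

facet_plot
```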
tidyverse
tidyverse is a set of R packages that provide functions and facilities for working with data. I find tidyverse more intuitive than base R, and there's an entire book available for free online (R for Data Science) that contains a lot of helpful tutorials about tidyverse. Let me point to a few specific chapters: