💻 Week 02 - Lab Roadmap (90 min)
DS202 - Data Science for Social Scientists
This week, we will recap some basic R commands for social data science and then apply them to a practical case. We will learn about data structures and some simple data visualisation skills in R.
It is expected that R has been downloaded locally. We recommend that you run R within an integrated development environment (IDE) such as RStudio, which can be freely downloaded.
Step 1: Basic commands (15 min)
We will follow the instructions below step by step together while answering whatever questions you might encounter along the way.
Open R or RStudio. You can run the following commands either in an R script or in the console window.
Create a vector of numbers with the function c() and name it x. When we type x, it gives us back the vector:
> x <- c(1, 3, 2, 5)
> x
[1] 1 3 2 5
Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:
> x = c(1, 3, 2, 5)
Check the length of vector x using the length() function:
> length(x)
[1] 4
Create a matrix of numbers with the function matrix() and name it y. When we type y, it gives us back the matrix:
> y <- matrix(data = c(1:16), nrow = 4, ncol = 4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
If you want to learn about the meaning of arguments such as nrow or ncol:
> ?matrix
Select one element in the matrix y:
> y[2,3]
[1] 10
The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column.
Select multiple rows and columns at a time in the matrix y:
> y[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
> y[1:3, 2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
> y[1:2, ]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
> y[-c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16
No index for the columns or the rows indicates that R should include all columns or all rows, respectively. The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.
Check the number of rows and columns in a matrix:
> dim(y)
[1] 4 4
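As a quick check on the indexing rules above, the sketch below rebuilds the same 4 x 4 matrix under a separate name (m, chosen here only so that it does not overwrite y) and uses dim() to confirm the shape of each subset. The other variable names are illustrative as well.

```r
# Rebuild the 4 x 4 matrix from the earlier step under a new name
m <- matrix(data = 1:16, nrow = 4, ncol = 4)

# Blank column index: keep every column of rows 1 and 2
top_rows <- m[1:2, ]
dim(top_rows)

# Negative row index: drop rows 1 and 3, keep the rest
kept_rows <- m[-c(1, 3), ]
dim(kept_rows)

# Negative indices work for columns too: drop column 4
no_last_col <- m[, -4]
dim(no_last_col)
```

Comparing the result of dim() before and after subsetting is a cheap way to confirm that an index expression kept what you intended.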
Generate a vector of random normal variables:
> set.seed(1303)
> x <- rnorm(50)
> y <- x + rnorm(50, mean = 50, sd = .1)
> cor(x, y)
[1] 0.9942128
By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered as illustrated above.
Each time we call the function rnorm(), we will get a different answer. However, sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. We use set.seed() throughout the labs whenever we perform calculations involving random quantities.
Let's check some descriptive statistics of these vectors:
> mean(y)
[1] 50.18446
> var(y)
[1] 0.8002002
> sqrt(var(y))
[1] 0.8945391
> sd(y)
[1] 0.8945391
> cor(x, y)
[1] 0.9942128
The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function. The cor() function computes the correlation between vectors x and y.
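To see what set.seed() does in practice, the following minimal sketch draws from rnorm() twice with the same seed and once more without resetting it. The variable names are illustrative; the seed value is the one used in this lab.

```r
# Same seed, same draws
set.seed(1303)
first_draw <- rnorm(5)
set.seed(1303)
second_draw <- rnorm(5)
identical(first_draw, second_draw)

# Without resetting the seed, the next call continues the random stream,
# so the numbers differ from the first draw
third_draw <- rnorm(5)
identical(first_draw, third_draw)

# sd() agrees with the square root of var()
all.equal(sd(first_draw), sqrt(var(first_draw)))
```

Setting the seed once at the top of a script is usually enough to make the whole script reproducible, provided the commands are run in the same order.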
Step 2: Graphics (15 min)
We will create plots in R and save them to files.
Produce a scatterplot between two vectors of numbers using the function plot():
> set.seed(1303)
> x <- rnorm(100)
> y <- rnorm(100)
> plot(x, y)
> plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", main = "Plot of X vs Y")
By default, the output plot will show in the Plots window in the lower-right corner of RStudio.
Save the scatterplot in a pdf or a jpeg file:
> pdf("Figure.pdf")
> plot(x, y, col = "green")
> dev.off()
null device
          1
To create a jpeg, we use the function jpeg() instead of pdf(). The function dev.off() indicates to R that we are done creating the plot.
Produce a contour plot (like a topographical map) to represent 3-dimensional data using the function contour():
> x <- seq(1, 10)
> y <- x
> f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels = 45, add = TRUE)
> fa <- (f - t(f)) / 2
> contour(x, y, fa, nlevels = 15)
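The saving and contour steps can be combined. The sketch below writes the contour plot to a JPEG file using jpeg() and dev.off(). The file name Contour.jpeg and the use of R's temporary directory are assumptions made for illustration, and jpeg() is only available if your R installation was built with JPEG support.

```r
# Data for the contour plot, as in the step above
x <- seq(1, 10)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

# Hypothetical output file in R's temporary directory
out_file <- file.path(tempdir(), "Contour.jpeg")

jpeg(out_file)      # open a JPEG graphics device
contour(x, y, f)    # draw into the file instead of the Plots window
dev.off()           # close the device so that the file is written

# Confirm that the file was created
file.exists(out_file)
```

While a file device such as jpeg() or pdf() is open, nothing appears in the Plots window; the plot only becomes a usable file once dev.off() has been called.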
The image() function works the same way as contour(). Explore it if you are interested.
Using the ggplot2 package for graphics:
In R, tabular data is usually stored in a structure called a data frame. A data frame can be seen as a 2-dimensional table consisting of rows and columns and their values. These values can be of different types, such as numeric, character or logical. However, all values within a single column must have the same data type.
We can use the open-source data visualisation package ggplot2 to construct aesthetic mappings based on our data.
- Since the tidyverse library includes ggplot2, installing tidyverse gives you access to ggplot2. It can be installed with:
> install.packages("tidyverse")
- Alternatively, the ggplot2 package can be installed on its own:
> install.packages("ggplot2")
Once the installation is complete, load the package into your R session:
> library(ggplot2)
There are some ready-made datasets to play with in the ggplot2 package. Let's explore and plot a dataset called diamonds, which shows the prices and some features of over 50,000 diamonds. You can explore the meanings of the variables with the ?diamonds command.
Please type:
> View(diamonds)
The View() function can be used to view the dataset in a spreadsheet-like window.
We can plot this dataset with the variables of our choice:
> ggplot(diamonds[1:50, ], aes(x = carat, y = price)) + geom_point() + geom_text(label = diamonds[1:50, ]$cut)
The x and y arguments in aes() set the axes, which here are the carat and the price of each diamond. diamonds is the data frame used in the plot, and we used only the first 50 rows for a clearer visualisation. geom_point() draws each observation as a point and geom_text() adds the labels. With the $ sign, you can access a column in your dataset.
We can also plot a histogram showing price:
> ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 100)
This time the whole dataset is used for the visualisation. For more detailed information and some examples, you can use ?ggplot and ?aes.
Step 3: Loading data (15 min)
Now, we will learn how to import a data set into R and explore it. For this lab session, we will use a ready-to-use dataset, Auto, from the book "An Introduction to Statistical Learning, with Applications in R". With the package ISLR2, we can use all the datasets in the book.
First, we need to install ISLR2 into our R environment for future use.
> install.packages("ISLR2")
To use the ISLR2 package and its datasets in our analyses, we need to load it in each R session with:
> library(ISLR2)
That's it! We can now use all the datasets by calling them by their names. The package includes numerous datasets and you can explore them with R.
The Auto dataset is ready to be used in the analysis. You can explore the dataset by using:
> View(Auto)
> head(Auto)
The head() function can also be used to view the first few rows of the data.
You may want to save this dataset on your local computer, which is useful if you want to make changes to it in future analyses. To save a dataset as a csv file:
> write.csv(DataFrameName, file = "Path to save the DataFrame//File Name.csv", row.names = FALSE)
The option row.names = FALSE omits the row names when saving the dataset. In this case, it will remove the basic incremental indexes such as 1, 2, … from the data. A detailed explanation of write.csv and its options can be found by typing ?write.csv.
To use the dataset in the future, you need to load it into a data frame by importing the csv file.
We will load this dataset into a data frame called Auto. The data frame name can be changed; however, we recommend names that are readable and easy to understand.
> Auto <- read.csv("C://Users//LSE//Desktop//Test//autodataset.csv", na.strings = "?")
Using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.
You can check the dataset:
> View(Auto)
> head(Auto)
Deal with the missing data by removing rows with missing observations:
> Auto <- na.omit(Auto)
> dim(Auto)
[1] 392   9
The function dim() checks the size of the data frame.
Produce a numerical summary of each variable in the data frame:
> summary(Auto)
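The save, load and clean steps in this section can be sketched end to end on a tiny made-up data frame, so that the example runs without the Auto file. The data frame cars_raw, its values and the file name cars_example.csv are all assumptions made for illustration; the file is written to R's temporary directory rather than to a fixed path.

```r
# Made-up data: "?" marks a missing horsepower value
cars_raw <- data.frame(
  name       = c("car a", "car b", "car c"),
  mpg        = c(18, 25, 31),
  horsepower = c("130", "?", "65"),
  stringsAsFactors = FALSE
)

# Hypothetical file location in R's temporary directory
csv_path <- file.path(tempdir(), "cars_example.csv")
write.csv(cars_raw, file = csv_path, row.names = FALSE)

# Read the file back, treating "?" as a missing value
cars <- read.csv(csv_path, na.strings = "?")
sum(is.na(cars$horsepower))

# Drop incomplete rows, then check the size and summarise
cars <- na.omit(cars)
dim(cars)
summary(cars)
```

Without na.strings = "?", the horsepower column would be read as text because of the question mark, and na.omit() would not remove that row.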
Step 4: Practical exercises (45 min)
Step 4: Practical exercises (in pairs)
So far, we have learnt some basic commands in R. In this practical case, we will continue with the data set Auto studied in Step 3. Make sure that the missing values have been removed from the data.
Six questions are listed below. Try to answer them in pairs using R commands. We will go over the solutions once everyone has finished.
🎯 Questions
Which of the predictors are quantitative, and which are qualitative?
What is the range of each quantitative predictor? (Hint: you can answer this using the range() function.)
What is the mean and standard deviation of each quantitative predictor?
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
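As a hint for the kind of commands these questions call for (not a solution), the sketch below uses R's built-in mtcars dataset instead of Auto. The choice of columns, the row range and the use of sapply() and pairs() are illustrative assumptions; you will need to decide which columns and rows apply to Auto.

```r
# Built-in example data; the exercise itself uses Auto
data(mtcars)

# A few quantitative columns, chosen for illustration only
quant_cols <- c("mpg", "hp", "wt")

# range() applied to each selected column; each column of the result is (min, max)
sapply(mtcars[, quant_cols], range)

# Mean and standard deviation per column
sapply(mtcars[, quant_cols], mean)
sapply(mtcars[, quant_cols], sd)

# Drop a block of rows with a negative index, here rows 5 to 10
mtcars_subset <- mtcars[-(5:10), ]
dim(mtcars_subset)

# Scatterplot matrix of the selected columns
pairs(mtcars[, quant_cols])
```

The parentheses in -(5:10) matter: -5:10 would mean the sequence from -5 to 10, which is not a valid index.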