💻 Week 02 - Lab Roadmap (90 min)
DS202 - Data Science for Social Scientists
This week, we will recap some basic R commands for social data science and then apply these commands to a practical case. We will learn about data structures and some simple data visualisation skills in R.
It is expected that R has been downloaded locally. We recommend that you run R within an integrated development environment (IDE) such as RStudio, which can be freely downloaded.
You might want to install R and RStudio before you come to the lab. The instructions in “Before we start” (part of the Data analysis and R programming course by Laurent Gatto) might help you.
Also, do not forget that as a DSI student, you have access to a premium license to Dataquest’s R courses. Read the access instructions on Moodle carefully!
Step 1: Basic commands (15 min)
We will follow the instructions below step by step together while answering whatever questions you might encounter along the way.
- Open R or RStudio. You can run the following commands either in an R script or in the console window.
- Create a vector of numbers with the function c() and name it x. When we type x, it gives us back the vector:

      > x <- c(1, 3, 2, 5)
      > x
      [1] 1 3 2 5

  Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:

      > x = c(1, 3, 2, 5)
- Check the length of vector x using the length() function:

      > length(x)
      [1] 4
- Create a matrix of numbers with the function matrix() and name it y. When we type y, it gives us back the matrix:

      > y <- matrix(data = c(1:16), nrow = 4, ncol = 4)
      > y
           [,1] [,2] [,3] [,4]
      [1,]    1    5    9   13
      [2,]    2    6   10   14
      [3,]    3    7   11   15
      [4,]    4    8   12   16

  If you want to learn about the meaning of some arguments like nrow or ncol:

      > ?matrix
- Select one element in the matrix y:

      > y[2, 3]
      [1] 10

  The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column.
- Select multiple rows and columns at a time in the matrix y:

      > y[c(1, 3), c(2, 4)]
           [,1] [,2]
      [1,]    5   13
      [2,]    7   15
      > y[1:3, 2:4]
           [,1] [,2] [,3]
      [1,]    5    9   13
      [2,]    6   10   14
      [3,]    7   11   15
      > y[1:2, ]
           [,1] [,2] [,3] [,4]
      [1,]    1    5    9   13
      [2,]    2    6   10   14
      > y[-c(1, 3), ]
           [,1] [,2] [,3] [,4]
      [1,]    2    6   10   14
      [2,]    4    8   12   16

  Leaving the row index or the column index empty tells R to include all rows or all columns, respectively. A negative sign - in the index tells R to keep all rows or columns except those indicated in the index.
- Check the number of rows and columns in a matrix:

      > dim(y)
      [1] 4 4
- Generate a vector of random normal variables:

      > set.seed(1303)
      > x <- rnorm(50)
      > y <- x + rnorm(50, mean = 50, sd = .1)
      > cor(x, y)
      [1] 0.9942128

  By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered as illustrated above.

  Each time we call the function rnorm(), we will get a different answer. However, sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. We use set.seed() throughout the labs whenever we perform calculations involving random quantities.
- Let’s check some descriptive statistics of these vectors:

      > mean(y)
      [1] 50.18446
      > var(y)
      [1] 0.8002002
      > sqrt(var(y))
      [1] 0.8945391
      > sd(y)
      [1] 0.8945391
      > cor(x, y)
      [1] 0.9942128

  The mean() and var() functions compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() gives the standard deviation, or we can simply use the sd() function. The cor() function computes the correlation between the vectors x and y.
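As a brief illustration of a few points from this step, the sketch below uses base R only. The object names m_col, m_row, a and b are illustrative and not part of the lab. It shows that matrix() fills column by column unless byrow = TRUE is given, that a negative index drops a column, and that resetting the seed reproduces the same random draws.

```r
# matrix() fills column by column by default; byrow = TRUE fills row by row
m_col <- matrix(1:6, nrow = 2)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_col[1, ]   # first row: 1 3 5
m_row[1, ]   # first row: 1 2 3

# A negative index drops elements instead of selecting them
m_col[, -1]  # all rows, every column except the first

# Resetting the seed before each call reproduces the same random numbers
set.seed(1303)
a <- rnorm(5)
set.seed(1303)
b <- rnorm(5)
identical(a, b)  # TRUE

# sd() is the square root of var()
all.equal(sd(a), sqrt(var(a)))  # TRUE
```

Without the second call to set.seed(), b would generally differ from a, because each call to rnorm() advances the random number generator.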
Step 2: Graphics (15 min)
We will plot and save plots in R.
- Produce a scatterplot between two vectors of numbers using the function plot():

      > set.seed(1303)
      > x <- rnorm(100)
      > y <- rnorm(100)
      > plot(x, y)
      > plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", main = "Plot of X vs Y")

  By default, the output plot will show in the Plots window in the lower right corner.
- Save the scatterplot in a pdf or a jpeg file:

      > pdf("Figure.pdf")
      > plot(x, y, col = "green")
      > dev.off()
      null device
                1

  To create a jpeg, we use the function jpeg() instead of pdf(). The function dev.off() indicates to R that we are done creating the plot.
- Produce a contour plot (like a topographical map) to represent three-dimensional data using the function contour():

      > x <- seq(1, 10)
      > y <- x
      > f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
      > contour(x, y, f)
      > contour(x, y, f, nlevels = 45, add = TRUE)
      > fa <- (f - t(f)) / 2
      > contour(x, y, fa, nlevels = 15)

  The image() function works the same way as contour(). Explore it if you are interested.
- Using the ggplot2 package for graphics:

  In R, tabular data is stored in a structure called a dataframe. A dataframe can be seen as a 2-dimensional table consisting of rows and columns and their values. These values can be of different types such as numeric, character or logical. However, all values within a single column must have the same data type.

  We can use the open-source data visualization package ggplot2 to construct aesthetic mappings based on our data. Since the tidyverse library includes ggplot2, installing tidyverse gives you access to ggplot2. It can be installed with:

      > install.packages("tidyverse")

  Alternatively, the ggplot2 package can be installed on its own:

      > install.packages("ggplot2")

  After the installation is completed, the package should be loaded into the R environment:

      > library(ggplot2)

  The ggplot2 package ships with some ready-made datasets to play with. Let’s explore and plot a dataset called diamonds, which shows the prices and some features of over 50000 diamonds. You can explore the meanings of the variables with the ?diamonds command.

  Please type:

      > View(diamonds)

  The View() function can be used to view the dataset in a spreadsheet-like window.

  We can plot this dataset with the variables we are interested in:

      > ggplot(diamonds[1:50, ], aes(x = carat, y = price)) + geom_point() + geom_text(label = diamonds[1:50, ]$cut)

  The x and y arguments in aes set the axes, here the carat and the price of each diamond. diamonds is the dataframe used in the plot, and we used only the first 50 rows for a clear visualisation. geom_point draws each observation as a point and geom_text adds the labels. With the $ sign, you can access a column in your dataset.

  We can also plot a histogram showing price:

      > ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 100)

  This time the whole dataset is used for the visualisation. For more detailed information and some examples, you can use ?ggplot and ?aes.
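To make the description of dataframes above concrete, here is a minimal sketch in base R. It uses a small made-up dataframe called toy, with hypothetical columns carat, cut and certified; it is not the diamonds data. It shows that each column holds a single type, that $ extracts a column, and that rows are indexed as in a matrix.

```r
# A small, made-up dataframe: one data type per column
toy <- data.frame(
  carat     = c(0.23, 0.21, 0.29),
  cut       = c("Ideal", "Premium", "Good"),
  certified = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text columns as character in any R version
)

sapply(toy, class)  # numeric, character, logical
toy$carat           # access a single column with $
toy[1:2, ]          # first two rows, all columns

# Mixing types in one vector coerces everything to a common type
class(c(0.23, "Ideal"))  # "character"
```

The last line hints at why a column cannot mix types: R would silently convert the numbers to text.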
Creating a heatmap with the ggplot2 package:
This time we will create a dummy dataframe with country names, a time period and a random GDP for each country.

    countries <- c("Canada", "France", "Greece", "Libya", "Malta")
    years <- c(2012:2021)

Let’s gather them together and see what our dataframe looks like:

    data <- expand.grid(Country = countries, Year = years)
    data

expand.grid creates a dataframe from all combinations of the supplied vectors.
To create a random GDP for each country and each year, and to add these values to our dataframe as a GDP column:

    gdps <- runif(50, min = 20000, max = 500000)
    data$GDP = gdps

runif generates a given number of random values between the min and max values from a uniform distribution. Since we have 5 countries and 10 years, we generated 50 random GDP values.
To check the data and the type of the variable data:

    View(data)
    class(data)

We can now plot a very basic heatmap:

    ggplot(data, aes(Year, Country, fill = GDP)) + geom_tile()

To create a heatmap, our dataframe should have three columns: two for the axes and one for the value of each cell. aes defines the x and y axes and the variable that fills each cell. geom_tile draws the heatmap as a grid of rectangles and offers several options. For detailed information, see ?geom_tile.
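The following sketch, which uses base R only, checks the structure that expand.grid() and runif() produce before any plotting. The seed value 1 and the name grid are arbitrary choices made here for illustration; they do not come from the lab.

```r
countries <- c("Canada", "France", "Greece", "Libya", "Malta")
years <- 2012:2021

# One row per (Country, Year) combination: 5 * 10 = 50 rows
grid <- expand.grid(Country = countries, Year = years)
nrow(grid)           # 50
table(grid$Country)  # each country appears 10 times

# Seed chosen arbitrarily so the random GDP values are reproducible
set.seed(1)
grid$GDP <- runif(nrow(grid), min = 20000, max = 500000)
range(grid$GDP)      # always within [20000, 500000]
str(grid)            # Country is a factor, Year is integer, GDP is numeric
```

Using nrow(grid) instead of a hard-coded 50 keeps the number of random values in step with the grid if countries or years are added later.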
Step 3: Loading data (15 min)
Now, we will learn how to import a data set into R and explore it. For this lab session, we will use the ready-to-use dataset Auto from the book “An Introduction to Statistical Learning, with Applications in R”. The package ISLR2 gives us access to all the datasets in the book.
- First, we need to install ISLR2 into our R environment for future use.

      > install.packages("ISLR2")
- To use the ISLR2 package and its datasets in our analyses, we need to load it in each R session with:

      > library(ISLR2)

  That’s it! We can now use all the datasets by calling them by their names. The package includes numerous datasets and you can explore them with R.
- The Auto dataset is now ready to be used in the analysis. You can explore the dataset by using:

      > View(Auto)
      > head(Auto)

  The head() function shows the first few rows of the data.
- You may want to save this dataset on your local computer, which is useful if you plan to modify it in future analyses. To save a dataset as a csv file:

      > write.csv(DataFrameName, file = "Path to save the DataFrame//File Name.csv", row.names = FALSE)

  The option row.names = FALSE leaves out the row names when the dataset is saved. In this case, it removes the basic incremental indexes such as 1, 2, … from the data. A detailed explanation of write.csv and its options can be found by typing ?write.csv

  Example: You need to include the path where you would like to save the dataset on your computer. For example, if you work in a folder called Test on your desktop on a Windows machine, the code is:

      > write.csv(Auto, "C://Users//LSE//Desktop//Test//autodataset.csv", row.names = FALSE)
- To use the dataset in the future, you need to load it into a dataframe by importing the csv file.

  We will load this dataset into a dataframe called Auto. You can choose any name for the dataframe, but a readable, meaningful name is best.

      > Auto <- read.csv("C://Users//LSE//Desktop//Test//autodataset.csv", na.strings = "?")

  Using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.

  You can check the dataset:

      > View(Auto)
      > head(Auto)
- Deal with the missing data by removing rows with missing observations:

      > Auto <- na.omit(Auto)
      > dim(Auto)
      [1] 392 9

  The function dim() checks the size of the data frame.
- Produce a numerical summary of each variable in the data frame:

      > summary(Auto)
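As a sketch of the save-and-reload cycle described above, the example below writes a small made-up dataframe to a temporary file, so it does not depend on any particular folder on your computer. The names cars_demo, csv_path, reloaded and cleaned are hypothetical, and the data is not the Auto data. The na = "?" argument of write.csv is used here only to produce a file in which missing values appear as question marks.

```r
# Made-up data with one missing horsepower value
cars_demo <- data.frame(
  mpg        = c(18, 15, 25),
  horsepower = c(130, NA, 95)
)

# tempfile() returns a path in a temporary folder; use your own path in practice
csv_path <- tempfile(fileext = ".csv")

# na = "?" writes missing values as "?"; row.names = FALSE drops the 1, 2, 3 index
write.csv(cars_demo, file = csv_path, row.names = FALSE, na = "?")

# na.strings = "?" turns each "?" back into NA, so the column stays numeric
reloaded <- read.csv(csv_path, na.strings = "?")
sum(is.na(reloaded$horsepower))  # 1
is.numeric(reloaded$horsepower)  # TRUE

# na.omit() drops every row that contains a missing value
cleaned <- na.omit(reloaded)
dim(cleaned)                     # 2 2
```

If na.strings = "?" were left out, read.csv would read the horsepower column as text rather than as numbers, because "?" is not a valid number.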
Step 4: Practical exercises (45 min)
Step 4: Practical exercises (in pairs)
So far, we have learnt some basic commands in R. In this practical case, we will continue with the data set Auto studied in Step 3. Make sure that the missing values have been removed from the data.
Six questions are listed below. Try to answer these questions in pairs using R commands. We will go over the solutions once everyone has finished these questions.
🎯 Questions
- Which of the predictors are quantitative, and which are qualitative? 
- What is the range of each quantitative predictor? (hint: You can answer this using the range() function)
- What is the mean and standard deviation of each quantitative predictor? 
- Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains? 
- Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings. 
- Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
