πŸ’» Week 02 - Lab Roadmap (90 min)

DS202 - Data Science for Social Scientists

Author

Yijun Wang/Mustafa Can Ozkan/Dr. Jon Cardoso-Silva

Published

03 October 2022

This week, we will recap some basic R commands for social data science and then apply these commands to a practical case. We will learn about data structures and some simple data visualisation skills in R.

It is expected that R has been downloaded locally. We recommend that you run R within an integrated development environment (IDE) such as RStudio, which can be freely downloaded.

You should install R and RStudio before you come to the lab. We think the instructions in β€œBefore we start” (part of the Data analysis and R programming course by Laurent Gatto) might help you.

Also, do not forget that as a DSI student, you have access to a premium license to Dataquest’s R courses. Read the access instructions on Moodle carefully!

Step 1: Basic commands (15 min)

We will follow the instructions below step by step together while answering whatever questions you might encounter along the way.

  1. Open R or RStudio. You can run the following commands either in an R script or in the console window.

  2. Create a vector of numbers with the function c() and name it x. When we type x, it gives us back the vector:

    > x <- c(1, 3, 2, 5)
    > x
    [1] 1 3 2 5

    Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:

    > x = c(1, 3, 2, 5)
  3. Check the length of vector x using the length() function:

    > length(x)
    [1] 4
  4. Create a matrix of numbers with the function matrix() and name it y. When we type y, it gives us back the matrix:

    > y <- matrix(data = c(1:16), nrow = 4, ncol = 4)
    > y
         [,1] [,2] [,3] [,4]
    [1,]    1    5    9   13
    [2,]    2    6   10   14
    [3,]    3    7   11   15
    [4,]    4    8   12   16

    If you want to learn about the meaning of some arguments like nrow or ncol:

    > ?matrix
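
    For instance, one of the arguments documented there is byrow, which fills the matrix by rows instead of by columns. A quick illustration of what it does (the small 2x2 matrix here is just for the example):

    > matrix(1:4, nrow = 2, byrow = TRUE)
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4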
  5. Select one element in the matrix y:

    > y[2,3]
    [1] 10

    The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column.

  6. Select multiple rows and columns at a time in the matrix y:

    > y[c(1, 3), c(2, 4)]
         [,1] [,2]
    [1,]    5   13
    [2,]    7   15
    > y[1:3, 2:4]
         [,1] [,2] [,3]
    [1,]    5    9   13
    [2,]    6   10   14
    [3,]    7   11   15
    > y[1:2, ]
         [,1] [,2] [,3] [,4]
    [1,]    1    5    9   13
    [2,]    2    6   10   14
    > y[-c(1, 3), ]
         [,1] [,2] [,3] [,4]
    [1,]    2    6   10   14
    [2,]    4    8   12   16

    No index for the columns or the rows indicates that R should include all columns or all rows, respectively. The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.

  7. Check the number of rows and columns in a matrix:

    > dim(y)
    [1] 4 4
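
    You can also get these two numbers separately (a small optional check):

    > nrow(y)
    [1] 4
    > ncol(y)
    [1] 4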
  8. Generate a vector of random normal variables:

    > set.seed(1303)
    > x <- rnorm(50)
    > y <- x + rnorm(50, mean = 50, sd = .1)
    > cor(x, y)
    [1] 0.9942128

    By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered as illustrated above.

    Each time we call the function rnorm(), we will get a different answer. However, sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. We use set.seed() throughout the labs whenever we perform calculations involving random quantities.
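
    To see this reproducibility in action, here is a minimal check (the seed value 42 and the variable names first_draw and second_draw are our own arbitrary choices):

    > set.seed(42)
    > first_draw <- rnorm(5)
    > set.seed(42)                          # resetting to the same seed...
    > second_draw <- rnorm(5)
    > identical(first_draw, second_draw)    # ...reproduces exactly the same numbers
    [1] TRUE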

  9. Let’s check some descriptive statistics of these vectors:

    > mean(y)
    [1] 50.18446
    > var(y)
    [1] 0.8002002
    > sqrt(var(y))
    [1] 0.8945391
    > sd(y)
    [1] 0.8945391
    > cor(x, y)
    [1] 0.9942128

    The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation, or we can simply use the sd() function. The cor() function computes the correlation between the vectors x and y.

Step 2: Graphics (15 min)

We will plot and save plots in R.

  1. Produce a scatterplot between two vectors of numbers using the function plot():

    > set.seed(1303)
    > x <- rnorm(100)
    > y <- rnorm(100)
    > plot(x, y)
    > plot(x, y, xlab = "this is the x-axis",
           ylab = "this is the y-axis",
           main = "Plot of X vs Y")

    By default, the output plot will show in the Plots pane in the lower right corner of RStudio.

  2. Save the scatterplot in a pdf or a jpeg file:

    > pdf("Figure.pdf")
    > plot(x, y, col = "green")
    > dev.off()
    null device
            1    

    To create a jpeg, we use the function jpeg() instead of pdf(). The function dev.off() indicates to R that we are done creating the plot.
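
    For example, a minimal sketch of the jpeg equivalent of the code above (the file name Figure.jpeg is our own choice):

    > jpeg("Figure.jpeg")
    > plot(x, y, col = "green")
    > dev.off()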

  3. Produce a contour plot (like a topographical map) to represent three-dimensional data using the function contour():

    > x <- seq(1, 10)
    > y <- x
    > f <- outer(x, y, function (x, y) cos(y) / (1 + x^2))
    > contour(x, y, f)
    > contour(x, y, f, nlevels = 45, add = T)
    > fa <- (f - t(f)) / 2
    > contour(x, y, fa, nlevels = 15)

    The image() function works the same way as contour(). Explore it if you are interested.
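
    If you want to try it, here is a minimal sketch using the same fa matrix from above (the colour palette is just one possible choice):

    > image(x, y, fa)
    > image(x, y, fa, col = heat.colors(12))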

  4. Using the ggplot2 package for graphics:

    In R, tabular data is stored in a structure called a data frame. A data frame can be seen as a two-dimensional table consisting of rows, columns and their values. These values might be of different types, such as numeric, character or logical; however, each column must contain exactly one data type.

    We can use the open-source data visualisation package ggplot2 to construct aesthetic mappings based on our data.

    • Since tidyverse library includes ggplot2, if you install tidyverse you will have access to ggplot2; installation can be done;
    > install.packages("tidyverse")
    • Alternatively, ggplot2 package can be installed
    > install.packages("ggplot2")

    After the installation is completed, the package should be loaded into your R session:

    > library(ggplot2)

    The ggplot2 package comes with some ready-made datasets to play with. Let’s explore and plot a dataset called diamonds, which contains the prices and some features of over 50,000 diamonds. You can explore the meanings of the variables with the ?diamonds command.

    Please type:

    > View(diamonds)

    The View() function can be used to view it in a spreadsheet-like window.

    We can now plot this dataset with the desired variables:

    > ggplot(diamonds[1:50, ], aes(x = carat, y = price)) +
        geom_point() +
        geom_text(label = diamonds[1:50, ]$cut)

    x and y inside aes() define the axes, which here are the carat and the price of each diamond. diamonds is the data frame used in the plot, and we used only its first 50 rows for a clearer visualisation. geom_point() plots the data as points and geom_text() adds the labels (here, the cut of each diamond). With the $ sign, you can access a single column of your data frame (see the short example below).

    We can also plot a histogram showing the distribution of price:

    > ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 100)

    This time the whole dataset is used for the visualisation. For more detailed information and some examples, you can use ?ggplot and ?aes.
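
    Before moving on, two small optional checks related to the explanations above: str() shows the type of each column of the data frame (the point mentioned earlier in this step), and the $ sign extracts a single column as a vector:

    > str(diamonds)          # prints one line per column with its type (output not shown here)
    > head(diamonds$price)
    [1] 326 326 327 334 335 336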

Creating a heatmap with the ggplot2 package:

This time we will create a dummy data frame with country names, a time period and random GDP values for each country.

countries <- c("Canada", "France","Greece","Libya","Malta")
years <- c(2012:2021)

Let’s gather them together and see what our data frame looks like:

data <- expand.grid(Country=countries, Year=years)
data

expand.grid creates a data frame from all combinations of the supplied vectors.

To create random GDP values for each country and each year, and to add them to our data frame as a GDP column:

gdps <- runif(50, min = 20000, max = 500000)
data$GDP <- gdps

runif generates a given number of random values between the min and max values, drawn from a uniform distribution. Since we have 5 countries and 10 years, we generated 50 random GDP values.

To check the data and type of the variable data:

View(data)
class(data)

We can now plot a very basic heatmap:

ggplot(data, aes(Year, Country, fill = GDP)) + geom_tile()

To create a heatmap, our data frame should look like a tabular dataset with three columns. aes() defines the x and y axes and the values filling each (x, y) pair in the heatmap. geom_tile() draws the heatmap as rectangular tiles and offers various styling options. For detailed information, see ?geom_tile.
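
As a small illustration of those options (the colour choices and the title are arbitrary), a slightly more polished version of the same heatmap:

ggplot(data, aes(Year, Country, fill = GDP)) +
  geom_tile(colour = "white") +                                  # white borders between tiles
  scale_fill_gradient(low = "lightyellow", high = "darkred") +   # custom colour scale
  labs(title = "GDP by country and year (dummy data)")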

Step 3: Loading data (15 min)

Now, we will learn how to import a data set into R and explore it. For this lab session, we will use the ready-to-use Auto dataset from the book β€œAn Introduction to Statistical Learning, with Applications in R”. With the ISLR2 package, we can use all the datasets in the book.

  1. First, we need to install ISLR2 into our R environment for future use.

    > install.packages("ISLR2")
  2. To use the ISLR2 package and its datasets in our analyses, we need to load it in each R session with:

    > library(ISLR2)

    That’s it! We can now use any of the datasets by calling them by name. The package includes numerous datasets and you can explore them in R.
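
    For instance, an optional way to list every dataset shipped with the package:

    > data(package = "ISLR2")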

  3. The Auto dataset is now ready to be used in the analysis. You can explore the dataset by using:

    > View(Auto)
    > head(Auto)

    The head() function can also be used to view the first few rows of the data.

  4. You may want to save this dataset on your local computer, which is useful if you want to make changes to it for future analyses. To save a dataset as a csv file:

    > write.csv(DataFrameName, file="Path to save the DataFrame//File Name.csv", row.names = FALSE)

    The option row.names = FALSE drops the row names when saving the dataset. In this case, it removes the basic incremental indexes such as 1, 2, … from the data. A detailed explanation of write.csv and its options can be found by typing ?write.csv.

    Example

    You need to include the path where you would like to save the dataset on your computer. For example, if you work in a folder called Test on your desktop on a Windows machine, the code would be:

    > write.csv(Auto, "C:Users//LSE//Desktop//Test//autodataset.csv", row.names = FALSE )
  5. To use the dataset in the future, you need to load it into a data frame by importing the csv file.

    We will load this dataset into a data frame called Auto. The data frame name can be anything you like; however, we prefer names that are understandable and readable.

    > Auto <- read.csv("C://Users//LSE//Desktop//Test//autodataset.csv", na.strings = "?")

    Using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.

    You can check the dataset:

    > View(Auto)
    > head(Auto)
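
    As an optional sanity check related to the na.strings option above, you can count how many missing values each column contains (the output depends on the file you loaded, so it is not shown here):

    > colSums(is.na(Auto))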
  6. Deal with the missing data by removing rows with missing observations:

    > Auto <- na.omit(Auto)
    > dim(Auto)
    [1] 392 9

    The dim() function checks the size (number of rows and columns) of the data frame.

  7. Produce a numerical summary of each variable in the particular data frame:

    > summary(Auto)
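
    You can also summarise a single column by combining summary() with the $ sign seen earlier, for example:

    > summary(Auto$mpg)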
Step 4: Practical exercises (45 min, in pairs)

So far, we have learnt some basic commands in R. In this practical case, we will continue with the Auto data set studied in Step 3. Make sure that the missing values have been removed from the data.

Six questions are listed below. Try to answer these questions in pairs using R commands. We will go over the solutions once everyone has finished them.

🎯 Questions

  1. Which of the predictors are quantitative, and which are qualitative?

  2. What is the range of each quantitative predictor? (hint: You can answer this using the range() function)

  3. What is the mean and standard deviation of each quantitative predictor?

  4. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

  5. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

  6. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.