Machine Learning Cheatsheet

DS202 Blog Post

week09
tidymodels
cheatsheet
Author

Xiaowei Gao | Dr. Jon Cardoso-Silva

Published

27 November 2022

“How do I do ... using tidyverse/tidymodels?” This page has you covered.

Let me know if you would like me to add something here.

⚙️ Setup

What packages do you need?

In most cases, all you need is:

library(tidyverse)
library(tidymodels) 
📢 Important

Keep in mind that tidymodels is just a convenient wrapper for a multitude of other Machine Learning algorithms that exist in a myriad of other R packages. If the algorithm you want to use is not installed by default, you might get an error message saying that you need to install.packages(...) a package.

You can browse all supported algorithms on this page.

🧰 Part I Preprocessing data for ML using tidymodels

We will use the following important concepts from tidymodels:

A recipe is the tidymodels way of specifying how to preprocess and transform data. Recipes can help automate the data cleaning and preprocessing process, making it easier and more efficient for researchers to work with their data.

The recipes package provides a set of functions for defining and applying recipes to data sets. You will find that the functions of this package can be especially useful in social science research, where data sets often have missing or inconsistent values, and researchers need to create new variables or features for analysis.

A workflow is the tidymodels way of specifying how to perform complete data analysis, from data cleaning and preprocessing to modelling and visualization. Workflows can help researchers keep track of all the steps involved in their data analysis, and can help ensure that their analysis is transparent, replicable, and well-documented.

Workflows can be especially useful in social science research, where researchers often need to perform complex data analyses and reproduce their results, as they streamline the data analysis process and help ensure the reproducibility of results.

How recipes and workflows work together

Recipes can be used as a building block within workflows to specify the data preprocessing and transformation steps necessary for a particular analysis. In other words, recipes provide the “ingredients” for the data analysis, while workflows provide the overall plan for the analysis.

In a typical data science pipeline (workflow), you might want to perform all of the steps below:

  • read in data
  • clean and preprocess the data
  • create new features
  • fit a model to the pre-processed data
  • visualise the results.

Each of these steps would be specified by one or more recipes while the overall step-by-step procedure listed above would be accomplished by the workflows package.

By breaking down the analysis into smaller, more manageable steps, researchers can more easily understand and troubleshoot each step, and can more easily reproduce the entire analysis if necessary. Additionally, by using workflows to document the entire analysis process, researchers can ensure that their analyses are transparent and well-documented.
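
To make this concrete, here is a minimal end-to-end sketch of such a pipeline. The file name, the outcome column name and the specific preprocessing steps are just assumptions for illustration, not part of any particular dataset used in this course:

library(tidyverse)
library(tidymodels)

# read in data (hypothetical file)
df <- read_csv("my_data.csv")

# clean/preprocess the data and create new features
df_recipe <- recipe(outcome ~ ., data = df) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# bundle the preprocessing and the model into a workflow
wf <- workflow() %>%
  add_recipe(df_recipe) %>%
  add_model(linear_reg())

# fit the model to the pre-processed data
fitted_wf <- fit(wf, data = df)

# visualise the results (predicted vs observed)
augment(fitted_wf, df) %>%
  ggplot(aes(outcome, .pred)) +
  geom_point()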

1.1 How to create a recipe

Suppose we have two preprocessing steps to perform on our data set before training an algorithm. We can create a recipe and make use of the ever-so-helpful pipe operator %>%:

#create the recipe for further processing
df_recipe <-
  recipe(formula, data= data) %>%
  step_{FILLIN}() %>% # imagine you have a recipe step here
  step_{FILLIN}() # imagine you have another recipe step here

# 📢 note that the order you write the steps matters.
  • formula should have two sides connected by a tilde (~). The left side is the column to be predicted, and the right side contains all the predictors.

  • data is the dataset you will work with.

  • step_{FILLIN} is the actual data processing step. This can include things like missing data imputation, normalisation, filtering, etc. You can find all the available steps and pre-processing choices on the Step Function Reference page.

But this is not all. After you have created the recipe, you need to apply it to the data. This is done by using the prep() and bake() functions. Take a look at the example below for a better understanding.

Example of a recipe

Suppose we have a table as follows containing information about students, including their age, gender, and test scores.

age gender score
18 M 85
19 F 90
20 F 75
21 M 80
22 F 95
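
For reference, this toy dataset can be created as a tibble called data (the object name assumed by the code that follows):

# toy dataset used throughout this example
data <- tibble(
  age    = c(18, 19, 20, 21, 22),
  gender = c("M", "F", "F", "M", "F"),
  score  = c(85, 90, 75, 80, 95)
)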

💡 Now suppose we want to create a new variable that indicates whether each student’s test score is above or below the average score. The code below shows how to do this using a recipe, and then how to apply the recipe to the data.


# Create a recipe
df_recipe <- recipe(score ~ ., data = data) %>%
  step_mutate(
    above_average = ifelse(score > mean(score), "yes", "no")
  )

# Apply the recipe to the data
data_transformed <- df_recipe %>% 
  prep() %>% 
  bake(new_data = data)

# View the transformed data
data_transformed

This is what the code above does:

  1. We use recipe() to specify the formula and data.

  2. We use the step_mutate() function to create a new variable called above_average based on the score variable, then specify that the above_average variable should take the value yes if the score is above the average, and no otherwise.

  3. We apply the df_recipe to the data using the prep() and bake() functions. The prep() function prepares the recipe for use, while the bake() function applies the recipe to the data, creating the above_average variable.

  4. We can now see the transformed data in the data_transformed variable.

💡 TIPS

This is just a simple example of how a recipe can be used to preprocess and transform data in the tidyverse. In practice, recipes can be much more complex and can be used to handle missing values, standardize variables, and perform a wide range of data preprocessing tasks.

About roles

🎯 This section is meant just to deepen your understanding of recipes; in practice, you might not need to tweak the roles of variables

When finishing the recipe, each variable will be given a role, typically predictor or outcome. This comes from the fact that recipes are mostly used in conjunction with workflows in a predictive modelling context. In this typical scenario, it is important to distinguish between the variables that will be used to predict the outcome variable (i.e., predictors) and the variable that we are trying to predict (i.e., the outcome or response variable).

When you create a recipe object, it automatically detects and assigns roles to each variable based on the formula and its data type. By default, any variable that is not the outcome variable in the formula is assigned the predictor role. Categorical variables are assigned the predictor role by default too, but you can specify that a categorical variable should be used as the outcome variable by setting its role to outcome.

You can manually change the role of each variable to suit your model if you want, and you can give roles any name you like, not just predictor and outcome (see the sketch below).
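
For example, here is a sketch of changing a role with update_role(). The student_id column is hypothetical (it is not part of the toy dataset above); the idea is simply to keep an identifier column around without treating it as a predictor:

# keep a hypothetical identifier column without using it as a predictor
df_recipe <- recipe(score ~ ., data = data) %>%
  update_role(student_id, new_role = "id") %>%
  step_normalize(all_numeric_predictors())

# summary() of a recipe shows the role assigned to each variable
summary(df_recipe)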

1.2 How to set up a workflow

Now let’s create a workflow to bind the modelling and preprocessing objects together using the workflow() function. It will make your code more logical and easier to read.

This is how you can create a workflow:

# the basic code
model_workflow <- workflow() %>%
    add_recipe(df_recipe) %>%
    add_model(model)

💡 Tip: The order in which add_recipe() and add_model() appear is up to you; arrange them in whatever order best matches your questions and working steps.

Important

Keep in mind that a recipe and a workflow do not change your dataset by themselves: nothing happens until you supply the input data and trigger the process by fitting the model or making predictions.

Example of workflow

Using the same imaginary dataset, let’s apply a linear regression to help us examine whether the age and gender affect students’ scores.

# Define a workflow
student_workflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(recipe(score ~ ., data = data) %>%
    step_mutate(
      above_average = ifelse(score > mean(score), "yes", "no")
    ) %>%
    step_dummy(gender) %>%
    step_scale(all_numeric_predictors()) %>%
    step_center(all_numeric_predictors())
  )

# Fit the workflow to the data
fitted_workflow <- student_workflow %>% fit(data)

# Summarize the underlying linear model
fitted_workflow %>% 
  extract_fit_engine() %>% 
  summary()
  1. We define a workflow using the workflow() function. The workflow includes a linear regression model specification (linear_reg() with the lm engine) and a recipe that preprocesses the data by creating a new variable above_average (step_mutate), dummy-encoding gender (step_dummy), and scaling and centring the numeric predictors (step_scale and step_center).

  2. We then fit the workflow to the data using the fit() function.

  3. We finally extract the underlying model with extract_fit_engine() and summarize the results using the summary() function.
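
If you prefer a tidy table of coefficients rather than the base-R summary, a possible alternative (using functions from the workflows and broom packages) is:

fitted_workflow %>%
  extract_fit_parsnip() %>% # pull out the parsnip model fit
  tidy()                    # coefficient estimates as a tibble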

🧰 Part II Resampling

To build the most accurate model possible, we split the data into training and testing sets, and we can also use cross-validation or bootstrap methods to validate our model.

Main Functions from the rsample package:

Split the data into training and testing parts

#Split the data into training and testing parts
# Tweak the `prop` argument to change the proportion of training/testing data
data_split <- initial_split(data, prop = 3/4) 

# Fit your model with this dataset
data_train <- training(data_split) 

#only use it when evaluating your models
data_test <- testing(data_split)

Create a cross-validation object with 10 folds

df_folds <- vfold_cv(data_train, v = 10)

df_folds

Apply the cross-validation object

  • Fit the model within each of the folds
# Create a custom metrics function based on your question
test_metrics <- metric_set(roc_auc, sens, spec, accuracy)

# Fit resamples
dt_rs_model <- wf %>% # wf means the workflow you created for model
  fit_resamples(resamples = df_folds,
                metrics = test_metrics)
  • Extract the performance metrics for each fold
# View performance metrics
dt_rs_model %>% 
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy") %>% # use filter if we want to focus on one metric
  select(-.estimator)
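
If instead you want the metrics averaged across the folds, leave summarize at its default (TRUE):

# average of each metric across the 10 folds (plus a standard error)
dt_rs_model %>% 
  collect_metrics()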

Construct a set of bootstrap replicates of the data

df_boot <- bootstraps(data, times = 10)

df_boot

Apply the bootstrap object

  • Fit the model within each of the bootstrap resamples

# Create a custom metrics function based on your question
test_metrics <- metric_set(roc_auc, sens, spec)

# Fit resamples
dt_rs_model <- wf %>% # wf means the workflow you created for model
  fit_resamples(resamples = df_boot,
                metrics = test_metrics)
  • Extract the performance metrics for each resample

As before, pass summarize = FALSE to collect_metrics() to see the individual performance metrics for each bootstrap resample.

# View performance metrics
dt_rs_model %>% 
  collect_metrics(summarize = FALSE)

🧰 Part III Algorithms

3.1 Regression

We use regression when we want to predict a numeric variable (for example, number of accidents, house price, etc.). If the target variable is not numeric, you will get an error or, worse, counterintuitive results.

📌 In 3.1, we deliberately do not use workflows, to demonstrate an alternative way of fitting models for your reference. We use workflows in 3.2.

How to train:

model <-
    linear_reg() %>%
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))

Get a summary of the fitted model:

summary(model$fit)
Make Predictions


You can use your model to predict the outcome of a given data set. We use the augment() function for that:

df_augmented <- augment(model, df)
Keep in mind that the data frame MUST contain the same columns used to train the model. It is also common to “augment” the same data that was used for training the model. This is how we check if our model fits the data well.
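
One way to quantify how well the model fits, sketched here under the assumption that df_augmented still contains the outcome column Today, is to compare the observed values with the .pred column created by augment():

# regression metrics (rmse, rsq, mae) from the augmented data frame
df_augmented %>% 
  metrics(truth = Today, estimate = .pred)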
Try a different implementation


By default, linear_reg() runs the same lm() algorithm we learned about in 💻 Week 03 - Lab but if you want you can use an alternative implementation from another R package. For example, to use a Bayesian implementation of linear regression (from stan), use:

model <-
    linear_reg() %>%
    set_engine("stan") %>% 
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))
Read about the alternatives in the linear_reg() documentation.
Diagnostic Plots


How well did the model fit the data?

par(mfrow = c(2, 2))
plot(model$fit)

How to train:

model <-
    decision_tree() %>%
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))

Get a summary of the fitted model:

summary(model$fit)

Plot the model

library(rpart.plot)

rpart.plot(model$fit, roundint=FALSE)
Make Predictions


You can use your model to predict the outcome of a given data set. We use the augment() function for that:

df_augmented <- augment(model, df)
Keep in mind that the data frame MUST contain the same columns used to train the model. It is also common to “augment” the same data that was used for training the model. This is how we check if our model fits the data well.
Try a different implementation


By default, decision_tree() runs the algorithm contained in the rpart package. This is the same package we learned about in 💻 Week 07 - Lab, but if you want you can use an alternative implementation from another R package. For example, to use C5.0 (note that the C5.0 engine only supports classification, so it needs a categorical outcome):

model <-
    decision_tree() %>%
    set_engine("C5.0") %>% 
    set_mode("classification") %>%
    fit(Direction ~ ., data=ISLR2::Smarket %>% select(-Today))
Read about the alternatives in the decision_tree() documentation.
Change parameters


There are three main parameters you can tweak when using the rpart engine:

  • cost_complexity
  • tree_depth
  • min_n

tidymodels might use default values or attempt to guess the best values for the parameters. If you want to choose parameter values explicitly, pass those to the decision_tree() function. For example:

model <-
    decision_tree(tree_depth = 3) %>%
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))

Note that the availability of parameters changes from engine to engine. The decision tree in the C5.0 engine, for example, has one tuning parameter, min_n. Always check the documentation!

How to train:

model <-
    svm_rbf() %>%
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))

Get a summary of the fitted model:

model
Make Predictions


You can use your model to predict the outcome of a given data set. We use the augment() function for that:

df_augmented <- augment(model, df)
Keep in mind that the data frame MUST contain the same columns used to train the model. It is also common to “augment” the same data that was used for training the model. This is how we check if our model fits the data well.
Plot the decision space

THIS ONLY WORKS WITH TWO PREDICTORS!

Step 1: Check the min and max values of the two SELECTED predictors

Replace <predictor1> and <predictor2> with the name of your selected predictors.

data %>% 
    select(c(<predictor1>, <predictor2>)) %>%
    summary()

Identify the min and max values of each predictor.

Step 2: Create a simulated dataset

You will need to find a suitable step_val1 and step_val2. Play with different values until you find one that you like.

sim.data <- 
  crossing(<predictor1>   = seq(min_val_predictor1, max_val_predictor1, step_val1),
           <predictor2>   = seq(min_val_predictor2, max_val_predictor2, step_val2))

Step 3: Run the fitted model on this simulated data

sim.data <- augment(model, sim.data)

Step 4: Run the fitted model on the data used to train the model

plot_df <- augment(model, data) 

Step 5: Build the plot

Remember to replace <predictor1> and <predictor2> with the name of your selected predictors.

g <- (
  plot_df %>%   
    ggplot()
  
    ## Tile the background of the plot with SVM predictions
    + geom_tile(data = sim.data, aes(x=<predictor1>, y=<predictor2>, fill = .pred), alpha = 0.45)
  
    ## Actual data
    + geom_point(aes(x=<predictor1>, y=<predictor2>), size=2.5, stroke=0.95)
  
    ## Define X and Os
    + scale_shape_manual(values = c(4, 1))
    + scale_fill_viridis_c()
    + scale_color_manual(values=c("black", "red"))
    + scale_alpha_manual(values=c(0.1, 0.7))
    
    ## (OPTIONAL) Customizing the colours and theme of the plot
    + theme_minimal()
    + theme(panel.grid = element_blank(), 
            legend.position = 'bottom', 
            plot.title = element_text(hjust = 0.5))
)

g
Try a different kernel


Simply replace svm_rbf() with one of the other kernels available in tidymodels (svm_poly() or svm_linear()).

Note that the availability of parameters changes from kernel to kernel. Always check the documentation!
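
For instance, a sketch of the same training call with a polynomial kernel (the degree value is just an illustrative choice):

model <-
    svm_poly(degree = 2) %>% # polynomial kernel instead of RBF
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))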

Change parameters


There are two main parameters you can tweak when using the svm_rbf() function:

  • cost
  • rbf_sigma

tidymodels might use default values or attempt to guess the best values for the parameters. If you want to choose parameter values explicitly, pass those to the svm_rbf() function. For example:

model <-
    svm_rbf(cost = 1, rbf_sigma = 0.2) %>%
    set_mode("regression") %>%
    fit(Today ~ ., data=ISLR2::Smarket %>% select(-Direction))

Note that the availability of parameters changes from kernel to kernel. Always check the documentation!

3.2 Classification

When the outcome is categorical, represented as a class label that can be binary (e.g., “spam email” or “not spam email”) or grouped into ordered categories (e.g., “high-risk street” or “low-risk street”), we use classification models. In a classification problem, the goal is to train a model on a labelled dataset so that it can accurately predict the class labels for new, unseen data. This can be done using a variety of algorithms such as Decision Trees, Support Vector Machines (SVMs), Naive Bayes, and Logistic Regression.

For Decision Trees and Support Vector Machines (SVMs), the detailed code was introduced in the Regression part above. The significant change is to use set_mode("classification") and a categorical outcome. You can double-check with Main Functions.

📒 This version of the code for Decision Trees and Support Vector Machines (SVMs) does not use workflows – it is an alternative

How to train:

model <-
    decision_tree() %>%
    set_mode("classification") %>%
    fit(Direction ~ ., data=ISLR2::Smarket %>% select(-Today))

Get a summary of the fitted model:

summary(model$fit)

Plot the model

library(rpart.plot)

rpart.plot(model$fit, roundint=FALSE)
Make Predictions


You can use your model to predict the outcome of a given data set. We use the augment() function for that:

df_augmented <- augment(model, df)
Keep in mind that the data frame MUST contain the same columns used to train the model. It is also common to “augment” the same data that was used for training the model. This is how we check if our model fits the data well.

How to train:

model <-
    svm_rbf() %>%
    set_mode("classification") %>%
    fit(Direction ~ ., data=ISLR2::Smarket %>% select(-Today))

Get a summary of the fitted model:

model
Make Predictions


You can use your model to predict the outcome of a given data set. We use the augment() function for that:

df_augmented <- augment(model, df)
Keep in mind that the data frame MUST contain the same columns used to train the model. It is also common to “augment” the same data that was used for training the model. This is how we check if our model fits the data well.

First, we have to specify which type of model we want to work with.

log_model <- # your model specification
  logistic_reg() %>%  # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification") # model mode

Then, we shall organize a workflow.

wf <- 
  workflow() %>% 
  add_model(log_model) %>% 
  add_recipe(df_recipe)

#show details of our workflow
wf
Train the model


This is the training process, which fits our model and produces the results.

log_fit <- 
  wf %>% 
  fit(data = data_train)


#check the model's information
log_fit %>% 
    extract_fit_engine() %>%
    summary()
Make Predictions


We use the augment() function again in this step, this time with the testing dataset to make the predictions.

#make predictions
df_augmented <- augment(log_fit, data_test)

#show the details 
df_augmented
Try a different implementation


The default engine for this model is glm, but you can also choose from six other engines. To use another engine, simply change the argument in set_engine(engine = "glm").

show_engines("logistic_reg")

#> # A tibble: 7 × 2
#>   engine    mode          
#>   <chr>     <chr>         
#> 1 glm       classification
#> 2 glmnet    classification
#> 3 LiblineaR classification
#> 4 spark     classification
#> 5 keras     classification
#> 6 stan      classification
#> 7 brulee    classification

Detailed code for each engine can be found in the official documentation, e.g. Logistic regression via glmnet.
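
As an illustration, here is a hedged sketch of switching to the glmnet engine (this requires the glmnet package to be installed, and the penalty/mixture values below are arbitrary examples):

log_model <-
  logistic_reg(penalty = 0.01, mixture = 1) %>% # penalised (lasso) logistic regression
  set_engine("glmnet") %>%
  set_mode("classification")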

Following the same recipe-and-workflow idea, we now organise the data and prepare the model for Naive Bayes. For the data, we still use df_recipe.

#prepare the model (naive_Bayes() comes from the discrim package)
nb_model <- # your model specification
  naive_Bayes() %>%  # model type
  set_engine(engine = "naivebayes") %>%  # model engine
  set_mode("classification") # model mode

# workflow for the model
wf <- 
  workflow() %>% 
  add_model(nb_model) %>% 
  add_recipe(df_recipe)

#show details of our workflow
wf  
Train the model and make prediction


nb_fit <- 
  wf %>% 
  fit(data = data_train)


#check the model's information
nb_fit %>% 
    extract_fit_engine()

We use augment() again for the prediction.

#make predictions
df_augmented <- augment(nb_fit, data_test)

#show the details 
df_augmented
Try a different implementation


By default, naive_Bayes() uses set_engine(engine = "naivebayes"), but other engines are also available; check the naive_Bayes() documentation for the alternatives.

🎯 Evaluation of the classification model

For classification problems, we use different assessment methods than those used for regression, most notably the confusion matrix. These metrics reveal how well the model fits the data.

df_augmented %>% conf_mat(truth = , estimate =)
  • truth refers to the column containing the true (observed) class labels in the dataset you used when fitting the model.

  • estimate refers to the column containing the predicted classes; its name normally starts with .pred_ (see the concrete sketch below).
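
A concrete sketch, assuming the true labels live in a column called Direction and the hard class predictions created by augment() are in .pred_class:

df_augmented %>% 
  conf_mat(truth = Direction, estimate = .pred_class)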

You might want to plot the confusion matrix and also get the recall/precision/F1 score results.

#confusion matrix
df_augmented %>% 
    conf_mat(truth = , estimate =) %>%
    autoplot(type = 'heatmap')

# precision and recall
tibble(
  "precision" = 
     precision(df_augmented, truth = , estimate =) %>%
     select(.estimate),
  "recall" = 
     recall(df_augmented, truth = , estimate = ) %>%
     select(.estimate)
) %>%
  unnest() %>%
  kable()

# F1 score
df_augmented %>%
  f_meas(truth = , estimate = ) %>%
  select(-.estimator) %>%
  kable()

Another way is the ROC curve and AUC (the area under the curve).

# for the roc plot
# note: for roc_curve() the estimate must be the predicted class probability column
# (e.g. a column starting with .pred_), not the hard class prediction
df_augmented %>% 
  roc_curve(truth = , estimate =) %>% 
  autoplot()

# for the auc number (again, supply the predicted class probability column)
df_augmented %>% 
  roc_auc(truth = , estimate =) 

Also, with the metric_set() function, we can easily compute a whole set of evaluation metrics at once.

metricsets <- metric_set(accuracy, recall, roc_auc, f_meas)
df_augmented %>% metricsets(truth = , estimate =)

3.3 Clustering

Clustering is a form of unsupervised machine learning in which observations are grouped into clusters based on similarities in their data values, or features; in a clustering model, the label is the cluster to which the observation is assigned, based purely on its features.

We introduce how to run the K-Means clustering method in R. Not all the arguments and elements of the function are fully explored here, so see the Tutorial Resources and the Main Function documentation for more details.

K-Means Clustering Method

  • Main Function kmeans

    Required library

    library(cluster)    # clustering algorithms
    library(factoextra) # clustering algorithms & visualization
Train the model and make prediction


# train the k-means model
km_model <- kmeans(x = , centers = , nstart = , iter.max = )

# show the details of the model
km_model

#summarize the k-means model

summary(km_model)
  • In the kmeans function, only x (the data) and centers are required; specifically, centers defines the number of clusters you want to obtain (a concrete sketch follows this list).

  • the fitted kmeans object (summarised by the summary() function) normally contains 8 types of information:

    • cluster contains information about each row of data,
    • centers, withinss, and size contain local information about each cluster
    • totss, tot.withinss, betweenss, and iter contain global information about the clustering process
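
As a concrete, illustrative sketch on a built-in dataset, scaling the variables first because k-means is sensitive to their units:

km_model <- kmeans(scale(USArrests), centers = 4, nstart = 25)

km_model$size         # number of observations in each cluster
km_model$tot.withinss # total within-cluster sum of squares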

Then, you can use the trained Kmeans model to predict the cluster of a given data set.

#make predictions
df_augmented <- augment(km_model, data)

#show the details 
summary(df_augmented)
Explore the clusters


Visualize the clusters and assign a unique colour to each cluster.

df_augmented %>%
  ggplot(aes(vector1, vector2, color = .cluster)) +
  geom_point()

Visualize the centroids of each cluster.

df_centroids <- as_tibble(km_model$centers)
df_centroids$cluster_id <- factor(seq(1, nrow(df_centroids)))

df_augmented %>%
  ggplot(aes(vector1, vector2, color = .cluster)) +
  geom_point() +
  geom_point(data = df_centroids, aes(color = cluster_id), size = , shape = "X") 
  • vector1 and vector2 refer to the names of the columns used as the input data for clustering
Define the number of clusters


The key element is the number of clusters. We will show you one possible strategy to automate the process of finding an optimal number of clusters based on the total within-cluster sum of squares (tot.withinss). The resulting elbow plot gives us a direct visual cue.

# create a function
kmean_wss <- function(k) {
    kmeans(data, k)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 10
k.values <- 1:10
# extract wss for 1-10 clusters
wss_values <- map_dbl(k.values, kmean_wss)

# Plot the elbow curve
plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

The fviz_nbclust() function (from factoextra) offers an easier way to produce the elbow plot.

fviz_nbclust(df, kmeans, method = "wss")

3.4 Dimensionality Reduction

Dimension reduction is the process of reducing the number of variables (also referred to as features or dimensions) to a smaller set of variables called principal variables. It can be a good choice when you suspect there are “too many” variables. The main property of principal variables is that they preserve, to some extent, the structure and information carried by the original variables.

💡 PCA “rearranges” the original data matrix, producing a new data matrix in which all features are, by construction, uncorrelated with each other.

Training PCA

Prepare a recipe for the PCA training.

pca_recipe <-
  # First we specify a recipe of what data we are using
  recipe(formula, data = ) %>%
 
  # PCA requires that data have the same distribution
  # This can be achieved by normalizing the data (mean=0 and std=1)
  step_normalize(all_predictors()) %>% 
  
  # This is where we tell the recipe to run PCA and return 9 Principal Components
  step_pca(all_predictors(), num_comp= 9)

# pca_recipe created a recipe, but it didn't run any of those steps.
# To train the PCA, we have to prepare the recipe -- with prep()
pca_prep <- prep(pca_recipe)

Now that we have prepared our recipe, let’s bake it using our ingredients (the data):

new_df <- bake(pca_prep, new_data = data)

Look at the variance explained by each PC

Use the summary() function. You will see the Standard deviation, Proportion of Variance and Cumulative Proportion for each component. Normally, we would want to preserve around 70% of the variance in the data.

summary(pca_prep$steps[[2]]$res)
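
An alternative, assuming your version of recipes supports the type = "variance" option of the tidy() method for step_pca, is to pull the variance table in tidy form:

# variance explained by each component, in tidy form (step 2 is the PCA step)
tidy(pca_prep, number = 2, type = "variance")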

Another way to choose how many components to keep based on the variance explained is to set threshold = 0.7 in step_pca().

recipe(~., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.7) %>% # here is the threshold
  prep() %>%
  bake(new_data = NULL)

For text mining, we often use Latent Semantic Analysis (LSA) for dimensionality reduction. It analyses a set of documents and the terms within them to find common or divergent concepts related to the documents and terms. We can extend this definition of LSA to use it as a method for classifying documents into different topics.

Train the LSA

Before using LSA, you should first create a document-feature matrix (dfm).
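
Here is a hedged sketch of building such a document-feature matrix with quanteda; the data object and its post column are assumptions carried over from the visualisation code further below:

library(quanteda)

# tokenise the (hypothetical) text column and build a document-feature matrix
dfm_pac <- data$post %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("english"))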

library("quanteda.textmodels")

df_lsa <- textmodel_lsa(dfm_pac, nd=2)$docs %>% #nd is the number of dimensions
    as.data.frame()

Visualize the results

We only consider two dimensions here.

library(plotly)

plot_ly(data = bind_cols(df_lsa, data), 
        x = ~V1, 
        y = ~V2, 
        type="scatter", 
        mode="markers", 
        text=~paste('Doc ID:', docname, '\nDescription:\n', post))

📚 Resources

  • 💡 Exploratory Data Analysis with R This tutorial contains the basic skills to use R for data processing and modelling. You can start with this to get familiar with R and RStudio, but you might see different code patterns there, as it mainly shows base R code rather than tidyverse. Overall, this tutorial is a good place to start, as it is easy to follow.

  • 💡 Tidymodels and Tidy Modeling with R

    These two tutorials are super useful and important.

    They contain almost all the models you might need from the tidymodels packages. They also cover recipes for data wrangling and workflows for model construction.

  • 💡 Lisa Lendway’s COMP/STAT 112 website-tidyverse (with code examples and videos)

    This online resource intuitively shows how the code works on our data and models.