🗓️ Week 08:
Ensemble methods and unsupervised learning I

Main theme: Unsupervised learning

2023-11-17

⏪ Recap time: Bias-variance trade-off

  • Bias:

    • the difference between the average prediction of our model and the correct value we are trying to predict.
    • A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both training and test data.
  • Variance:

    • the variability of the model's prediction for a given data point. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before: such models perform very well on training data but have high error rates on test data.

Image source: https://towardsdatascience.com/what-bias-variance-bulls-eye-diagram-really-represent-ff6fb9670993

Image source: https://hshan0103.medium.com/understanding-bias-variance-trade-off-from-learning-curve-a64b4223bb02

⏪ Recap time: Decision trees

We mentioned previously that single decision trees had a tendency to overfit.

Common tree parameters:

These parameters define the end condition for building a new tree and are usually tuned to increase accuracy and prevent overfitting.

  • Max. depth: how tall a tree can grow
    • Usually want < 10
    • Sometimes defined by number of leaves
  • Max. features: how many features can be used to build a given tree
    • features are randomly selected from total set
    • The tree doesn’t need to use all of the available features
  • Min. samples per leaf: how many samples are required to make a new leaf
    • Usually want < 1% of data
    • Sometimes defined by samples per split

⏪ Recap time: Another solution for overfitting - Ensemble methods

We can tweak parameters to avoid overfitting or we can do something else…

So, what is better than a single tree?

Answer: Ensembles of trees!

Random forest: principle

  1. Step 1: In a Random Forest model, a subset of data points and a subset of features are selected to construct each decision tree. In other words, \(n\) data points and \(m\) features are randomly selected out of the dataset to construct each of the \(B\) trees that will compose the forest.

  2. Step 2: Individual decision trees are constructed for each sample.

  3. Step 3: Each decision tree will generate an output.

  4. Step 4: The final output of the forest is obtained through majority voting (for classification) or averaging of tree predictions (for regression).

Check (Breiman 2001) for more details

Random Forest vs Decision Trees

  • Random forests are built from subsets of the data and their final output is based on averaging or majority voting, so they are less prone to overfitting than decision trees
  • Random forests are comparatively slower to run
  • A random forest randomly selects observations and features, builds many decision trees, and aggregates their results; it does not produce an explicit set of rules, whereas a decision tree does formulate rules to make predictions. This makes random forests less interpretable than decision trees.
  • Random forests are more robust to the curse of dimensionality than single decision trees, since each tree is trained on only a subset of the data/features
  • Random forests maintain diversity because not all attributes are considered when building each decision tree (though this might not hold in all cases)

Boosted trees

Boosting (Freund and Schapire 1997; Friedman 2001) is a method for combining many weak learners (trees) into a strong classifier.

Usually:

  • Each tree is created iteratively
  • Each tree's output \(h_t(x)\) is given a weight \(w_t\) depending on its accuracy
  • The output of the ensemble of trees is the weighted sum: \(\hat{y}=\sum_t w_t h_t(x)\) (see the small numeric sketch after this list)
  • After each iteration, each data sample is given a weight based on its misclassification: the more often a data sample is misclassified, the more important it becomes
  • The goal is to minimize an objective function: \(O(x)=\sum_i l(\hat{y_i},y_i)+\sum_t \Omega(f_t)\)
    • \(l(\hat{y_i},y_i)\) is the loss function — the distance between the truth and the prediction of the \(i\)-th sample
    • \(\Omega(f_t)\) is the regularization function — it penalizes the complexity of the \(t\)-th tree
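To make the weighted sum concrete, here is a tiny numeric sketch in R (the weak-learner outputs and weights below are made up for illustration):

# three hypothetical weak-learner outputs h_t(x) for one sample
h <- c(0.9, 0.4, 0.7)
# hypothetical weights w_t reflecting each tree's accuracy
w <- c(0.5, 0.2, 0.3)

# ensemble prediction: y_hat = sum_t w_t * h_t(x)
y_hat <- sum(w * h)
y_hat   # 0.74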

Types of boosting

There are many different ways of iteratively adding learners to minimize a loss function.

Some of the most common are :

  • AdaBoost
  • Gradient Boosting
    • Uses gradient descent to create new learners
    • the loss function is differentiable
    • See (Friedman 2001) for details
  • XGBoost
    • “eXtreme Gradient Boosting”
    • Type of gradient boosting
    • Has become very popular in data science competitions
    • See (Chen and Guestrin 2016) for details
  • CatBoost
    • “Categorical boosting”
    • Another type of gradient boosting
    • Also popular in data science competitions
    • See (Prokhorenkova et al. 2018) for details

Boosting: common parameters

  • Loss function: defines the distance between ground truth and predictions
    • Use binary logistic loss for two-class problems
  • Learning rate: sets the rate at which data weights are adjusted after each iteration
    • Smaller is better but slower
    • Somewhere around 0.1
  • Subsample size: number of samples to train each new tree
    • Data samples are randomly selected each iteration
  • Number of trees: How many total trees to create
    • This is the same as the number of iterations
    • Usually more is better, but might lead to overfitting
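As a rough illustration of how these parameters map onto a parsnip boost_tree() specification with the xgboost engine (the values below are placeholders for illustration, not the settings used later in this lecture):

library(tidymodels)

# hypothetical specification illustrating the parameters listed above
boost_params_spec <- boost_tree(
  trees = 500,        # number of trees = number of boosting iterations
  tree_depth = 4,     # maximum depth of each individual tree
  learn_rate = 0.1,   # learning rate
  sample_size = 0.8   # proportion of samples used to train each new tree
) %>%
  set_engine("xgboost") %>%   # for two-class problems, xgboost uses the binary logistic loss
  set_mode("classification")

boost_params_spec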

Pros and cons of using boosted trees

Benefits:

  • Fast: Both training and prediction are fast
  • Easy to tune
  • Not sensitive to scale
  • The features can be a mix of categorical and continuous data
  • Good performance: Training on the residuals gives very good accuracy
  • Lots of available software
    • Boosted tree algorithms are very commonly used
    • There is a lot of well supported, well tested software available.

Problems:

  • Sensitive to overfitting and noise
    • Should always cross-validate!
    • Modern software libraries have tools to avoid overfitting

Example: Building trees on the Boston dataset

The Boston dataset is part of the ISLR2 library and contains information collected by the US Census Service about housing in the Boston area (Massachusetts). It is composed of 13 variables and 506 observations (for details see here).

Training a simple regression tree on Boston

We train a regression tree to predict the variable medv (i.e. the median value of owner-occupied homes in $1000s); a sketch of the corresponding source code is given after the boosted tree source code below.

  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.78

Training a random forest

  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        3.52

Training a boosted tree

  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        3.34

Feature importance per type of tree

[Feature importance plots (produced with vip) for the XGBoost, random forest, and decision tree models]

Source code: Common to all trees

library(tidymodels)
library(ISLR2)
library(rpart.plot)
library(vip)

#loading data
data("Boston", package = "MASS")

Boston <- as_tibble(Boston)
set.seed(1234)

#splitting data in training and test set
Boston_split <- initial_split(Boston)

Boston_train <- training(Boston_split)
Boston_test <- testing(Boston_split)

Source code: Random Forest


#forest of trees specification
# (mtry = .cols() uses all predictors at each split, i.e. bagged trees;
#  a smaller mtry would give a "classic" random forest)
bagging_spec <- rand_forest(mtry = .cols()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("regression")

# fitting model and getting metrics
bagging_fit <- fit(bagging_spec, medv ~ ., data = Boston_train)
augment(bagging_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred)

# plotting ground truth against model predictions
augment(bagging_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

#plotting feature importance
vip(bagging_fit)

Source code: Boosted tree


# boosted trees specification

boost_spec <- boost_tree(trees = 5000, tree_depth = 4) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# fitting model and getting metrics
boost_fit <- fit(boost_spec, medv ~ ., data = Boston_train)
augment(boost_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred)

#plotting ground truth against model predictions
augment(boost_fit, new_data = Boston_test) %>%
  ggplot(aes(medv, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

#plotting feature importance
vip(boost_fit)
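Source code: Decision tree (sketch)

The RMSE for the single regression tree reported earlier comes from code that is not shown on these slides; a minimal sketch in the same tidymodels style, assuming the rpart engine with default hyperparameters (the exact RMSE will depend on these choices), would be:

# single regression tree specification (sketch; rpart engine and defaults assumed)
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# fitting model and getting metrics
tree_fit <- fit(tree_spec, medv ~ ., data = Boston_train)
augment(tree_fit, new_data = Boston_test) %>%
  rmse(truth = medv, estimate = .pred)

# plotting the fitted tree and its feature importance
rpart.plot(extract_fit_engine(tree_fit))
vip(tree_fit)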

Introduction to a new paradigm: unsupervised learning

What is Unsupervised Learning

Unsupervised learning algorithms:

  • rely on unlabeled data
  • learn hidden patterns in the data without human intervention

Types of Unsupervised Learning

Clustering

  • Consists of dividing data points into clusters, i.e. groups of points that are similar to each other and dissimilar to data points belonging to other clusters
  • Examples:
    • segmenting patients with a particular condition (e.g. cancer) into groups depending on their response to particular treatment drugs (aim: treatment personalisation)
    • customer segmentation according to common traits and/or purchasing behaviour (aim: guiding marketing and business strategies)

Association rule mining

  • A family of algorithms that try to find rules describing the probability of co-occurrence of items in a collection (e.g. Apriori, Eclat, FP-Growth); a small sketch using the arules package follows this list
  • E.g.:
    • Given a collection of supermarket purchase records/transactions, can we tell how likely a customer is to buy milk if he/she has bought bread?
    • Can we learn how likely a series of medical diagnoses are, given the symptoms to which they are associated?
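As a hedged sketch of the supermarket example, using the arules package and its built-in Groceries transactions dataset (neither of which is part of this lecture's code):

library(arules)

# Groceries: a built-in set of real supermarket transactions
data("Groceries")

# mine association rules with minimum support and confidence thresholds
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

# inspect the rules with the highest lift ("if X then Y" patterns)
inspect(head(sort(rules, by = "lift"), 5))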

Types of Unsupervised Learning

Dimensionality reduction

  • Techniques that reduce the number of features, or dimensions, in a dataset (prior to applying a classification or regression algorithm)

    Example:

    • Principal component analysis (PCA - see 🗓️ Week 09)
    • or Singular value decomposition (SVD)

Anomaly detection

  • Techniques that discover unusual data points in a dataset, i.e. points whose patterns deviate from the normal patterns in the data
  • E.g.:
    • Fraud detection (credit card fraud)
    • Cyberattack detection
    • Also useful in predicting whether a patient has or doesn’t have a disease

Unsupervised learning: where is it used in practice?

  • Market basket analysis: The goal here is to understand customer purchasing patterns; it involves analyzing data, e.g. purchase history, to reveal product groupings and products likely to be purchased together.
  • Recommendation engines: Using data such as past purchasing or browsing behaviour, past product/service ratings, or the behaviour of similar users, unsupervised learning algorithms suggest products, services, or information that a user might like or find interesting.
  • Medical imaging: Unsupervised learning algorithms are used for tasks such as image detection, classification, and segmentation, which are essential in radiology and pathology for diagnosing patients quickly and accurately.
  • Computer vision: Visual perception tasks such as object recognition rely on unsupervised learning algorithms.
  • Natural language processing (NLP): Google News uses unsupervised learning to categorize articles based on the same story from various news outlets; for example, the results of the football transfer window would all be categorized under football.
  • Anomaly detection: The idea is to find atypical patterns in the data, i.e. anomalies. Anomalies can be anything from a disease/medical diagnosis, a pattern of fraud (e.g. credit card fraud, insurance fraud, money laundering), security breaches (e.g. cyberattacks), or faulty equipment.
  • Genetic research: Clustering algorithms, in particular hierarchical clustering, are often used to analyze DNA patterns and reveal evolutionary relationships.

Similarity

  • Notion central to clustering and anomaly detection
  • Quantifies how data samples are related or close to each other
  • Often expressed as a number between 0 and 1
  • The higher it is, the closer the points are together, i.e “the more similar” they are
  • Based on a distance function

Similarity

Examples of common distance/similarity functions:

Real-valued data:

  • Euclidean distance: \(d(A,B)=\sqrt{\sum_{i=1}^n |A_i-B_i|^2}\)

  • Cosine similarity: \(S_C(A,B)=\frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}\) (the corresponding cosine distance is \(1-S_C(A,B)\))

where A and B are two n-dimensional vectors of attributes, and \(A_i\) and \(B_i\) are the \(i\)-th components of vectors A and B respectively
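A small base-R sketch of these two measures, using made-up vectors:

# two hypothetical 4-dimensional attribute vectors
A <- c(1, 2, 3, 4)
B <- c(2, 2, 1, 0)

# Euclidean distance
euclidean <- sqrt(sum((A - B)^2))

# cosine similarity
cosine <- sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))

euclidean
cosine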

String similarity:

  • Hamming distance: given two strings of equal length, it is the number of positions at which the corresponding symbols (i.e. letters, bits, or decimal digits) differ, e.g. the Hamming distance between:

    • 0000 and 1111 is 4
    • cat and car is 1
    • call and cold is 2
  • Levenshtein distance: given two strings, it is the minimal number of edits (substitutions, deletions, insertions) needed to turn one string into the other, e.g.:

    The Levenshtein distance between kitten and sitting is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits:

    1. kitten → sitten (substitution of “s” for “k”),
    2. sitten → sittin (substitution of “i” for “e”),
    3. sittin → sitting (insertion of “g” at the end).
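A quick base-R sketch: the Hamming count is computed by hand, and the Levenshtein distance uses base R's adist():

# Hamming distance: positions at which two equal-length strings differ
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}
hamming("cat", "car")       # 1
hamming("call", "cold")     # 2

# Levenshtein (edit) distance via base R's adist()
adist("kitten", "sitting")  # 3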

Similarity

Examples of common distance/similarity functions:

Set similarity:

A set is a collection of items with no order or repetition. Sets are typically used to represent relationships or associations between objects or even people.

  • Jaccard index: it is often used in recommendation systems and social media analysis and measures the similarity between two sets as the number of items present in both sets relative to the number of items in either set:

    \(J(A,B)=\frac{|A\cap B|}{|A\cup B|}\)
    where A and B are two sets, \(|A\cap B|\) is the number of items included in both sets and \(|A\cup B|\) the number of items in their union
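A minimal base-R sketch of the Jaccard index, with made-up sets:

# two hypothetical sets of purchased items
A <- c("bread", "milk", "eggs")
B <- c("bread", "butter", "milk", "jam")

# Jaccard index: |A ∩ B| / |A ∪ B|
jaccard <- length(intersect(A, B)) / length(union(A, B))
jaccard   # 2 shared items out of 5 distinct items = 0.4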

Density

\(\epsilon\)-neighbourhoods:

  • given \(\epsilon>0\) and a point \(p\), the \(\epsilon\)-neighbourhood of \(p\) is the set of points that are at distance at most \(\epsilon\) away from \(p\)

Example:
(Images source: https://domino.ai/blog/topology-and-density-based-clustering)

100 points scattered on the [1,3]x[2,4] grid. We pick (3,2) as our point \(p\)

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.5\): 31 points are in the neighbourhood

We draw the neighbourhood of \(p\) with radius \(\epsilon=0.15\): 3 points are in the neighbourhood

Density (continued)

Density:

  • Roughly defined, \(\text{density}=\frac{\text{"mass"}}{\text{"volume"}}\)
  • If we go back to our previous example, the "mass" would be the number of points in an \(\epsilon\)-neighbourhood centered at \(p\), and the "volume" would be the area of the disc of radius \(\epsilon\) centered at \(p\)

Example:

If we take the figure above, the local density estimation at \(p=(3,2)\) is \(\frac{31}{0.25\pi} \approx 39.5\)

  • This number is meaningless on its own
  • If we calculate the densities of all points in our dataset, we could say that points within the same neighbourhood that have similar local densities belong to the same cluster \(\rightarrow\) this is the general principle of density-based clustering methods, e.g. DBSCAN (see (Ester, Kriegel, and Xu 1996) for more details on the DBSCAN algorithm)
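A small sketch reproducing this kind of local-density calculation on simulated data (the points are randomly generated, so the counts will differ from the 31 quoted above):

set.seed(42)

# 100 random points on the [1,3] x [2,4] grid
pts <- data.frame(x = runif(100, 1, 3), y = runif(100, 2, 4))
p <- c(3, 2)    # reference point
eps <- 0.5      # neighbourhood radius

# points within distance eps of p (the eps-neighbourhood of p)
d <- sqrt((pts$x - p[1])^2 + (pts$y - p[2])^2)
mass <- sum(d <= eps)

# local density estimate: mass / area of the disc of radius eps
density_p <- mass / (pi * eps^2)
mass
density_p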

An example of clustering algorithm: k-means

[Step-by-step illustration of k-means clustering shown on the slides]

k-means in more mathematical terms

K-means clustering uses iterative refinement to produce a final result.

Algorithm inputs: the number of clusters \(k\) and the data set.

  • The data set is a collection of features for each data point.
  • The algorithm starts with initial estimates for the \(k\) centroids, which can either be randomly generated or randomly selected from the data set.
  • The algorithm then iterates between two steps:
  1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if \(C\) is the set of centroids \(c_i\), then each data point \(x\) is assigned to a cluster based on

\(\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2\)

where \(dist(\cdot)\) is the standard (L2) Euclidean distance. Let \(S_i\) denote the set of data points assigned to the \(i\)-th cluster centroid.

k-means in more mathematical terms (continued)

  2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid’s cluster.

\(c_i=\frac{1}{|S_i|}\sum_{x \in S_i} x\)

The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
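For concreteness, here is a bare-bones base-R sketch of the two steps (assignment and centroid update); it does not handle empty clusters and is only meant to mirror the description above; in practice you would use the built-in kmeans(), as in the example that follows:

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # initial centroids: k points sampled from the data
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # assignment step: nearest centroid by squared Euclidean distance
    d2 <- sapply(seq_len(k), function(i) {
      rowSums((X - matrix(centroids[i, ], nrow(X), ncol(X), byrow = TRUE))^2)
    })
    cluster <- max.col(-d2)  # index of the smallest distance per row
    # update step: each centroid becomes the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(i) {
      colMeans(X[cluster == i, , drop = FALSE])
    }))
    if (all(abs(new_centroids - centroids) < 1e-8)) break  # stopping criterion
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# usage on a toy two-dimensional dataset
res <- simple_kmeans(iris[, 1:2], k = 3)
table(res$cluster)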

How do we choose k in k-means?

The “elbow” method

The elbow method, based on the total within-cluster sum of squares, is a heuristic used to determine the optimal number of clusters for k-means clustering.

Idea behind elbow method:

  • run k-means clustering on dataset for a range of values of \(k\) (number of clusters)
  • and for each \(k\) calculate the sum of squared distances of each point from its closest centroid (SSE): the elbow point is the point on the plot of SSE against the number of clusters (\(k\)) where the change in SSE begins to level off, indicating that adding more clusters does not improve the model much.

The “elbow” method (continued)

The steps to perform the elbow method are:

  • Select a range of \(k\) values, generally from 1 to 10 or the square root of the number of observations/data points in the dataset.
  • Run k-means for each \(k\) value and calculate the SSE (sum of squared distances of each point from its closest centroid).
  • Plot the SSE for each \(k\) value: the point on the plot where the SSE starts to decrease at a slower rate is the elbow point, and the corresponding number of clusters is the optimal value for \(k\)

Source: (Dangeti 2017)

The silhouette score

  • The silhouette value/score measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation)
  • Value between -1 and +1: a high value is desirable and indicates that the point is placed in the correct cluster. If many points have negative Silhouette values, it may indicate that we have created too many or too few clusters.
  • The silhouette value \(s(i)\) for each data point \(i\) is defined as follows: \(s(i)=\frac{b(i)-a(i)}{\max(a(i), b(i))} \textrm{ if } |C_i|>1 \textrm{ and } s(i)=0 \textrm{ if } |C_i|=1\) (note that the silhouette value is 0 for clusters with a single element)

where \(a(i)\) is the measure of similarity of the point \(i\) to its own cluster, measured as the average distance of \(i\) from the other points in its cluster, i.e.:
for each data point \(i \in C_i\) (data point \(i\) in cluster \(C_i\)), \(a(i)=\frac{1}{|C_i|-1}\sum_{j\in C_i, j\neq i} d(i,j)\)

and \(b(i)\) is the measure of dissimilarity of \(i\) from the points of other clusters, i.e. the smallest mean distance from \(i\) to the points of any other cluster:
for each data point \(i \in C_i\), \(b(i)=\min_{C_j\neq C_i}\frac{1}{|C_j|}\sum_{j\in C_j}d(i,j)\)

\(d(i,j)\) is the distance between points \(i\) and \(j\), with the Euclidean Distance generally used as the distance metric.

Average silhouette method

  • Compute the average silhouette score over all data points for a range of values of \(k\) (number of clusters)
  • Plot the average silhouette score for each \(k\) value: the \(k\) with the highest average silhouette score is the optimal number of clusters we are looking for

Source: https://uc-r.github.io/kmeans_clustering
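A hedged sketch of the average silhouette method, using the cluster package's silhouette() on a toy dataset (the dataset and the range of k are placeholders):

library(cluster)

X <- scale(iris[, 1:4])   # toy numeric data
d <- dist(X)              # pairwise Euclidean distances

# average silhouette width for k = 2..10
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# the optimal k is the one maximising the average silhouette width
plot(2:10, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
which.max(avg_sil) + 1    # +1 because the range starts at k = 2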

Other approaches

  • Calinski-Harabasz method:

    • It relies on computing the Calinski-Harabasz index \(CH\), i.e. the ratio of the between-cluster variance (the between-cluster sum of squares, BCSS) to the within-cluster variance (the within-cluster sum of squares, WCSS), for a range of values of \(k\):

    \(CH=\frac{BCSS/(k-1)}{WCSS/(N-k)}=\frac{\sum_{i=1}^k n_i||c_i-c||^2/(k-1)}{\sum_{i=1}^k\sum_{x\in C_i}||x-c_i||^2/(N-k)}\)

    where \(n_i\) and \(c_i\) are the number of points and the centroid of the \(i\)-th cluster respectively, \(c\) is the global centroid, \(N\) is the total number of points and \(\{x_1,\dots,x_N\}\) are the data points that compose the dataset.

    • We then plot the index against \(k\): a higher value of the CH index means the clusters are dense and well separated. There is no "acceptable" cut-off value, so we choose the \(k\) that gives a peak, or at least an abrupt elbow, on the line plot of the CH index.

See this link or this link for more approaches to determine \(k\)
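The CH index can also be computed directly from a fitted kmeans object, since its betweenss and tot.withinss components correspond to BCSS and WCSS; a minimal sketch on toy data (assumed here, not part of the lecture code):

X <- scale(iris[, 1:4])   # toy numeric data
N <- nrow(X)

# CH index for k = 2..10, from kmeans' between/within sums of squares
ch_index <- sapply(2:10, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (N - k))
})

plot(2:10, ch_index, type = "b", xlab = "k", ylab = "Calinski-Harabasz index")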

Clustering with k-means: an example

Exploring a dataset on employment and demographics from #TidyTuesday (dataset here and original blogpost describing this application here).

We’re exploring employment by race and gender

  • The data is first tidied up to remove missing values and to make sure the variables used (i.e. the proportions of each category who are Asian, Black, or women, and the total number of people in each category) are on the same scale
  • We then apply k-means with an arbitrary (at this point) choice of 3 clusters

We then apply the elbow method to figure out the optimal \(k\)

Not exactly conclusive but 4-5 seem best. Let’s try it with 4!

Source code


library(tidyverse)
library(tidymodels) # for kmeans
library(plotly) #for plotting

# downloading, loading and tidying the data 
employed <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/employed.csv")
employed_tidy <- employed %>%
  filter(!is.na(employ_n)) %>%
  group_by(occupation = paste(industry, minor_occupation), race_gender) %>%
  summarise(n = mean(employ_n)) %>%
  ungroup()

# printing out the data
head(employed_tidy)
summary(employed_tidy)

Source code

# scaling the data variables
employment_demo <- employed_tidy %>%
  filter(race_gender %in% c("Women", "Black or African American", "Asian")) %>%
  pivot_wider(names_from = race_gender, values_from = n, values_fill = 0) %>%
  janitor::clean_names() %>%
  left_join(employed_tidy %>%
              filter(race_gender == "TOTAL") %>%
              select(-race_gender) %>%
              rename(total = n)) %>%
  filter(total > 1e3) %>%
  mutate(across(c(asian, black_or_african_american, women), ~ . / (total)),
         total = log(total),
         across(where(is.numeric), ~ as.numeric(scale(.)))
  ) %>%
  mutate(occupation = snakecase::to_snake_case(occupation))

employment_demo

Source code

#creating 3 clusters
employment_clust <- kmeans(select(employment_demo, -occupation), centers = 3)
summary(employment_clust)
augment(employment_clust, employment_demo) %>%
  ggplot(aes(total, black_or_african_american, color = .cluster)) +
  geom_point()

#elbow method
kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(select(employment_demo, -occupation), .x)),
    glanced = map(kclust, glance)
  )

kclusts %>%
  unnest(cols = c(glanced)) %>%
  ggplot(aes(k, tot.withinss)) +
  geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
  geom_point(size = 2, color = "midnightblue")

Source code

#creating 4 clusters
final_clust <- kmeans(select(employment_demo, -occupation), centers = 4)

#visualising the clusters
p <- augment(final_clust, employment_demo) %>%
  ggplot(aes(total, women, color = .cluster, name = occupation)) +
  geom_point()

ggplotly(p, height = 500)

k-means

Pros:

  • Simple
  • Fast
  • Reasonably effective

Cons:

  • Random initialization does not guarantee optimal clustering. This can be mitigated with cleverer initialization (e.g. k-means++, see (Arthur and Vassilvitskii 2007) for details)
  • Requires a priori selection of the number of clusters \(k\), which is not straightforward
  • Assumes clusters are spherical in shape
  • Sensitive to outliers
  • Cannot be used with arbitrary distances

Anomaly detection

  • Aim: detect abnormal patterns in the data
  • Based on the notions of similarity/density we discussed previously, can you guess how this goal is accomplished?
  • Clustering is a useful tool here as well: anomalies are points that are far from all other points, or that lie in areas of low density
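As a rough sketch of this idea (not an algorithm covered in the lecture), one can flag as anomalies the points that lie farthest from their assigned k-means centroid:

set.seed(123)

# toy data: two dense clusters plus a few scattered outliers
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2),
           matrix(runif(10, -4, 9), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 25)

# distance of each point to its assigned centroid
d <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))

# flag the 2% of points farthest from their centroid as anomalies
anomalous <- d > quantile(d, 0.98)
plot(X, col = km$cluster, pch = ifelse(anomalous, 17, 1))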

What’s next?

  • More unsupervised learning

    • some more clustering: DBSCAN
    • anomaly detection
    • dimensionality reduction: PCA

References

Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–35. SODA. New Orleans, Louisiana: Society for Industrial and Applied Mathematics. https://dl.acm.org/doi/10.5555/1283383.1283494.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. San Francisco, California, USA: ACM. https://doi.org/10.1145/2939672.2939785.
Dangeti, Pratap. 2017. Statistics for Machine Learning: Techniques for Exploring Supervised, Unsupervised, and Reinforcement Learning Models with Python and R. Birmingham, UK: Packt Publishing.
Ester, Martin, Hans-Peter Kriegel, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon.
Freund, Yoav, and Robert E Schapire. 1997. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55 (1): 119–39. https://doi.org/10.1006/jcss.1997.1504.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5). https://doi.org/10.1214/aos/1013203451.
Prokhorenkova, Liudmila, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. “CatBoost: Unbiased Boosting with Categorical Features.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf.