Site icon DataFlair

Clustering in R – A Survival Guide on Cluster Analysis in R for Beginners!

We offer you a brighter future with FREE online courses - Start Now!!

Previously, we had a look at graphical data analysis in R, now, it’s time to study the cluster analysis in R. We will first learn about the fundamentals of R clustering, then proceed to explore its applications, various methodologies such as similarity aggregation and also implement the Rmap package and our own K-Means clustering algorithm in R.

What is Clustering in R?

Clustering is a technique of data segmentation that partitions the data into several groups based on their similarity. 

Basically, we group the data through a statistical operation. These smaller groups that are formed from the bigger data are known as clusters. These cluster exhibit the following properties:

Clustering is the most widespread and popular method of Data Analysis and Data Mining. It used in cases where the underlying input data has a colossal volume and we are tasked with finding similar subsets that can be analysed in several ways.

For example – A marketing company can categorise their customers based on their economic background, age and several other factors to sell their products, in a better way.

Get a deep insight into Descriptive Statistics in R

Applications of Clustering in R

Applications of R clustering are as follows:

In different fields, R clustering has different names, such as:

Technology is evolving rapidly!
Stay updated with DataFlair on WhatsApp!!

To define the correct criteria for clustering and making use of efficient algorithms, the general formula is as follows:

Bn(number of partitions for n objects)>exp(n)

You can determine the complexity of clustering by the number of possible combinations of objects. The complexity of the cluster depends on this number.

The basis for joining or separating objects is the distance between them. These distances are dissimilarity (when objects are far from each other) or similarity (when objects are close by).

Methods for Measuring Distance between Objects

For calculating the distance between the objects in K-means, we make use of the following types of methods:

In general, for an n-dimensional space, the distance is

The squares of the inertia are the weighted sum mean of squares of the interval of the points from the centre of the assigned cluster whose sum is calculated.

We perform the calculation of the Sum of Squares of Clusters on their centres as follows:

Total Sum of Squares (I) = Between-Cluster Sum of Squares (IR) + Within-Cluster Sum of Squares (IA)

The above formula is known as the Huygens’s Formula.

The Between-Cluster Sum of squares is calculated by evaluating the square of difference from the centre of gravity from each cluster and their addition.

We perform the calculation of the Within-Cluster Sum of squares through the process of the unearthing of the square of difference from centre of gravity for each given cluster and their addition within the single cluster. With the diminishing of the cluster, the population becomes better.

R-squared (RSQ) delineates the proportion of the sum of squares that are present in the clusters. The closer proportion is to 1, better is the clustering. However, one’s aim is not the maximisation of the costs as the result would lead to a greater number of clusters. Therefore, we require an ideal R2 that is closer to 1 but does not create many clusters. As we move from k to k+1 clusters, there is a significant increase in the value of  R2

Some of the properties of efficient clustering are:

Note: In the case of correct clustering, either IR is large or IA is small while calculating the sum of squares.

Clustering is only restarted after we have performed data interpretation, transformation as well as the exclusion of the variables. While excluding the variable, it is simply not taken into account during the operation of clustering. This variable becomes an illustrative variable.

Wait! Have you checked – Data Types in R Programming

Agglomerative Hierarchical Clustering

In the Agglomerative Hierarchical Clustering (AHC), sequences of nested partitions of n clusters are produced. The nested partitions have an ascending order of increasing heterogeneity. We use AHC if the distance is either in an individual or a variable space. The distance between two objects or clusters must be defined while carrying out categorisation.

The algorithm for AHC is as follows:

AHC generates a type of tree called dendrogram. After splitting this dendrogram, we obtain the clusters.

Hierarchical Clustering is most widely used in identifying patterns in digital images, prediction of stock prices, text mining, etc. It is also used for researching protein sequence classification.

1. Main Distances

In this case, the minimum distance between the points of different clusters is supposed to be greater than the maximum points that are present in the same cluster. The distance between the points of distance clusters is supposed to be higher than the points that are present in the same cluster.

2. Density Estimation 

In density estimation, we detect the structure of the various complex clusters. The three methods for estimating density in clustering are as follows:

You must definitely explore the Graphical Data Analysis with R

Clustering by Similarity Aggregation

Clustering by Similarity Aggregation is known as relational clustering which is also known by the name of Condorcet method.
With this method, we compare all the individual objects in pairs that help in building the global clustering. The principle of equivalence relation exhibits three properties – reflexivity, symmetry and transitivity.

This type of clustering algorithm makes use of an intuitive approach. A pair of individual values (A,B) are assigned to the vectors m(A,B) and d(A,B). Both A and B possess the same value in m(A,B) whereas in the case of d(A,B), they exhibit different values.

The two individuals A and B follow the Condorcet Criterion as follows:

c(A, B) = m(A, B)-d(A, B)

For an individual A and cluster S, the Condorcet criterion is as follows:

c(A,S) = Σic(A,Bi)

The summation overall is the Bi ∈ S.

With the previous conditions, we start by constructing clusters that place each individual A in cluster S. In this cluster c(A,S), A is the largest and has the least value of 0.

In the next step, we calculate global Condorcet criterion through a summation of individuals present in A as well as the cluster SA which contains them.

Note: Several iterations follow until we reach the specified largest number of iterations or the global Condorcet criterion no more improves.

K-Means Clustering in R

One of the most popular partitioning algorithms in clustering is the K-means cluster analysis in R. It is an unsupervised learning algorithm. It tries to cluster data based on their similarity. Also, we have specified the number of clusters and we want that the data must be grouped into the same clusters. The algorithm assigns each observation to a cluster and also finds the centroid of each cluster.

The K-means Algorithm:

This algorithm works in these steps:

1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2D space.

2. Assign each data point to a cluster: Let’s assign three points in cluster 1 using red colour and two points in cluster 2 using yellow colour (as shown in the image).

3. Compute cluster centroids: The centroid of data points in the red cluster is shown using the red cross and those in a yellow cluster using a yellow cross.

4. Re-assignment of points to their closest cluster in centroid: Red clusters contain data points that are assigned to the bottom even though it’s closer to the centroid of the yellow cluster. Thus, we assign that data point into a yellow cluster.

5. Re-compute cluster centroids: Now, re-computing the centroids for both the clusters. We perform the repetition of step 4 and 5 and until that time no more improvements can be performed. We’ll repeat the 4th and 5th steps until we’ll reach global optima. This continues until no more switching is possible. Then it will mark the termination of the algorithm if not mentioned.

We will now understand the k-means algorithm with the following example:

> #Author DataFlair
> library(tidyverse)
> library(cluster)
> library(factoextra)
> library(gridExtra)
> data('USArrests') 
> d_frame <- USArrests
> d_frame <- na.omit(d_frame)  #Removing the missing values
> d_frame <- scale(d_frame)
> head(d_frame)

Output:

> kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25)
> str(kmeans2)

Output:

> fviz_cluster(kmeans2, data = d_frame)

Output:

> kmeans3 <- kmeans(d_frame, centers = 3, nstart = 25)  #DataFlair
> kmeans4 <- kmeans(d_frame, centers = 4, nstart = 25)  
> kmeans5 <- kmeans(d_frame, centers = 5, nstart = 25)  
> #Comparing the Plots
> plot1 <- fviz_cluster(kmeans2, geom = "point", data = d_frame) + ggtitle("k = 2")
> plot2 <- fviz_cluster(kmeans3, geom = "point", data = d_frame) + ggtitle("k = 3")
> plot3 <- fviz_cluster(kmeans4, geom = "point", data = d_frame) + ggtitle("k = 4")
> plot4 <- fviz_cluster(kmeans5, geom = "point", data = d_frame) + ggtitle("k = 5")
> grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)

Output:

Cyber Profiling with K-Means Clustering

Conventionally, in order to hire employees, companies would perform a manual background check. This type of check was time-consuming and could no take many factors into consideration. However, with the help of machine learning algorithms, it is now possible to automate this task and select employees whose background and views are homogeneous with the company. In cases like these cluster analysis methods like the k-means can be used to segregate candidates based on their key characteristics.

With the new approach towards cyber profiling, it is possible to classify the web-content using the preferences of the data user. This preference is taken into consideration as an initial grouping of the input data such that the resulting cluster will provide the profile of the users. The data is retrieved from the log of web-pages that were accessed by the user during their stay at the institution.

Summary

In the R clustering tutorial, we went through the various concepts of clustering in R. We also studied a case example where clustering can be used to hire employees at an organisation. We went through a short tutorial on K-means clustering.

The upcoming tutorial for our R DataFlair Tutorial Series – Classification in R

If you have any question related to this article, feel free to share with us in the comment section below.

Exit mobile version