
R Clustering Tutorial – R Cluster Analysis


1. Objective

In this tutorial, we will discuss clustering in R in detail: its goals, types, uses, and applications, along with the common families of clustering algorithms and k-means clustering in R. We also use images and graphs to illustrate the algorithms for a clearer understanding.

So, let’s start the R Clustering tutorial.


2. What is R Cluster Analysis?

First of all, let us see what clustering is.
Clustering can be considered the most important unsupervised learning problem: it deals with finding structure in a collection of unlabeled data.
“It is the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Before learning about R clustering, let us revise our concepts of Introduction to R programming language.

3. R Clustering – Goals

The goal is to determine the intrinsic grouping in a set of unlabeled data. The problem, however, is deciding what constitutes a good clustering: it can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering.

4. Types of R Clustering

i. Hard Clustering

In this, each data point either belongs to a cluster completely or does not belong to it at all.

ii. Soft Clustering

In this, instead of putting each data point into exactly one cluster, we assign each data point a probability of belonging to each cluster.

5. Requirements for R clustering

The main requirements that a clustering algorithm should meet are:

6. Applications of R Clustering

We can apply it in many fields:

7. Problems with R Clustering


There are some problems with clustering. We will discuss a few of them:

8. Types of R Clustering Algorithm

Let’s look at some of them in detail:

i. Distribution models

These models are based on how probable it is that all data points in a cluster belong to the same distribution. They often suffer from overfitting.
For Example:
Model-based clustering: Rather than using a heuristic approach to construct clusters, it assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters.
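To make the EM idea concrete, here is a minimal base-R sketch of EM for a two-component 1-D Gaussian mixture (packages such as mclust automate this and also select the number of clusters; the data and initial values below are made up for illustration):

```r
# Minimal EM sketch for a two-component 1-D Gaussian mixture (base R only).
set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))  # two overlapping groups

# Initial guesses for the model parameters
mu <- c(min(x), max(x)); sigma <- c(1, 1); pi_k <- c(0.5, 0.5)

for (iter in 1:50) {
  # E-step: posterior probability that each point belongs to component 1
  d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  r  <- d1 / (d1 + d2)  # "responsibility" of component 1 for each point

  # M-step: re-estimate parameters from the soft assignments
  mu    <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
             sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
  pi_k  <- c(mean(r), mean(1 - r))
}

cluster <- ifelse(r > 0.5, 1, 2)  # hard labels from the soft responsibilities
```

Note that the E-step produces soft (probabilistic) assignments, which is exactly the "soft clustering" idea described earlier; the final line converts them to hard labels.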

ii. Connectivity models

These are based on the notion that data points closer together in data space exhibit more similarity to each other than points farther apart.
These models can follow two approaches:

For Example:
Hierarchical clustering: It creates a hierarchy of clusters and presents the hierarchy in a dendrogram. In this method, we do not need to specify the number of clusters at the beginning.

iii. Density Models

These models search the data space for areas of varying density of data points. They isolate the different density regions and assign the data points within a region to the same cluster.
For Example:
Density-based R Clustering: It creates clusters according to a density measurement: a cluster is a region with a higher density of points than the rest of the dataset. Density in data space is the measure.

iv. Centroid models

These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters.
For Example:
K-means clustering: It is also referred to as flat clustering. It requires the number of clusters as an input, but its performance is faster than hierarchical clustering. The distance from the mean value (centroid) of each cluster is the measure.
Now I will be taking you through three of the most popular algorithms for R Clustering in detail:

a. K-means Clustering in R

The most common partitioning method is k-means cluster analysis. It is an unsupervised learning algorithm that tries to cluster data based on similarity. We specify the number of clusters, and the algorithm groups the data into that many clusters, assigning each observation to a cluster and finding the centroid of each cluster.
The K-means algorithm:


This algorithm works in these 5 steps:
1. Specify the desired number of clusters K: let us choose K = 2 for these 5 data points in 2-D space.

2. Assign each data point to a cluster:
Let’s randomly assign three points to cluster 1 (shown in red) and two points to cluster 2 (shown in yellow).

3. Compute cluster centroids:
The centroid of the data points in the red cluster is shown by a red cross, and that of the points in the yellow cluster by a yellow cross.

4. Re-assign each point to the closest cluster centroid:
The data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the yellow cluster, so we re-assign it to the yellow cluster.

5. Re-compute cluster centroids: now re-compute the centroids for both clusters.
Repeat steps 4 and 5 until no improvements are possible, that is, until the cluster assignments stop changing. When there is no further switching of data points, the algorithm terminates. Note that k-means is only guaranteed to converge to a local optimum, not necessarily the global one.

b. DBSCAN R Clustering

DBSCAN was introduced in Ester et al. (1996). It can identify clusters of any shape in a dataset containing noise and outliers. It is a technique that partitions data into groups with similar characteristics, but it does not need the number of those groups to be specified in advance.
“The idea behind this approach is derived from a human intuitive clustering method.”
1. Keywords
model, Clustering
2. Usage
dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, …)
# S3 method for dbscan_fast
predict(object, newdata = NULL, data, …)
3. Arguments
x
It is a data matrix or a dist object.
4. eps
It defines the size of the epsilon neighborhood.
5. minPts
It is the minimum number of points required in the eps region. The default is 5 points.
6. weights
Numeric; weights for the data points. They are only needed to perform weighted clustering.
7. borderPoints
Logical; should border points be assigned to clusters? The default is TRUE for regular DBSCAN. If FALSE, border points are considered noise.
8. object
It is a DBSCAN clustering object.
9. data
The data used to create the DBSCAN clustering object.
10. newdata
New data points for which we want to predict cluster membership.

Additional arguments are passed on to the fixed-radius nearest-neighbor search algorithm.
R packages based on density-based algorithms:

11. Why DBSCAN?
Partitioning methods such as k-means work well only for compact and well-separated clusters, and they are severely affected by the presence of noise and outliers. DBSCAN overcomes these limitations.

12. DBSCAN Algorithm in R
This algorithm works on a parametric approach. We use two parameters in this algorithm: eps (the neighborhood radius) and minPts (the minimum number of points required to form a dense region).


Once we define these parameters, the algorithm divides the data points into three types:
Core points: A point p is a core point if at least minPts points are within distance eps of it (including p itself).
Border points: A point q is a border point if it lies within distance eps of a core point but is not itself a core point.
Outliers: All points not reachable from any core point are outliers (noise).

The steps in DBSCAN are simple once these definitions are in place:

Repeat these steps until all the points are assigned either to a cluster or to noise (outliers).
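The steps above can be sketched in base R. This is a deliberately simple, unoptimized implementation for illustration only; the dbscan package's dbscan() function does the same job far more efficiently. The data below are made up: two tight groups plus one obvious outlier. Label 0 means noise.

```r
# Minimal base-R sketch of DBSCAN (illustrative, not efficient)
simple_dbscan <- function(x, eps, minPts) {
  n <- nrow(x)
  dmat <- as.matrix(dist(x))          # pairwise Euclidean distances
  labels <- rep(0L, n)                # 0 = unassigned / noise
  visited <- rep(FALSE, n)
  cl <- 0L
  for (i in seq_len(n)) {
    if (visited[i]) next
    visited[i] <- TRUE
    nbrs <- which(dmat[i, ] <= eps)   # neighborhood includes the point itself
    if (length(nbrs) < minPts) next   # not a core point; may stay noise
    cl <- cl + 1L                     # start a new cluster from this core point
    labels[i] <- cl
    queue <- setdiff(nbrs, i)
    while (length(queue) > 0) {
      j <- queue[1]; queue <- queue[-1]
      if (!visited[j]) {
        visited[j] <- TRUE
        jn <- which(dmat[j, ] <= eps)
        # expand the cluster only through core points
        if (length(jn) >= minPts) queue <- c(queue, setdiff(jn, which(labels > 0)))
      }
      if (labels[j] == 0L) labels[j] <- cl  # core or border point joins the cluster
    }
  }
  labels
}

set.seed(7)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
           c(10, 10))                 # one obvious outlier
labels <- simple_dbscan(x, eps = 1, minPts = 4)
```

The two dense groups are discovered as separate clusters without specifying their number in advance, and the isolated point keeps label 0 (noise).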

13. Advantages and Disadvantages of Density-based R Clustering
Advantages:

Disadvantages:

14. Limitations of DBSCAN
It is sensitive to the choice of eps. In particular, if clusters have different densities, two problems arise:
If eps is too small, sparser clusters will be classified as noise.
If eps is too large, denser clusters may be merged together.

c.  Hierarchical R Clustering

It is an algorithm that builds a hierarchy of clusters. It starts with every data point assigned to a cluster of its own; then, at each step, the two nearest clusters are merged into one. The algorithm terminates when only a single cluster is left.
1. Characteristics of R Hierarchical Clustering

2. Two important things that you should know about hierarchical clustering in R are:

3. There are many metrics for deciding the closeness of two clusters:
Euclidean distance: ||a − b||2 = √(Σi (ai − bi)²)
Squared Euclidean distance: ||a − b||2² = Σi (ai − bi)²
Manhattan distance: ||a − b||1 = Σi |ai − bi|
Maximum distance: ||a − b||∞ = maxi |ai − bi|
Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
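In base R, hierarchical clustering is a two-step process: dist() computes one of the metrics above (Euclidean by default), and hclust() builds the hierarchy, which can then be drawn as a dendrogram and cut into a chosen number of groups. A short example on the built-in iris dataset:

```r
# Hierarchical clustering with base R on the numeric columns of iris
data <- iris[, 1:4]

d   <- dist(data, method = "euclidean")  # also try "manhattan" or "maximum"
fit <- hclust(d, method = "complete")    # complete-linkage agglomeration

plot(fit)                    # draw the dendrogram
groups <- cutree(fit, k = 3) # cut the tree into 3 clusters
table(groups)                # cluster sizes
```

Note that the number of clusters is chosen only at the cutree() stage, after the full hierarchy has been built, which is the key practical difference from k-means.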

So, this was all in R Clustering. Hope you like our explanation.

9. Conclusion – Clustering in R

In conclusion, we have studied R clustering and its cluster analysis algorithms in detail. We also saw the uses, types, and advantages of clustering in R, and covered its applications, which should clarify why clustering in R is worth studying. Still, if you have any query regarding R clustering, ask in the comment tab.

