Principal Components and Factor Analysis in R

1. Objective

In this R tutorial, we will learn what Principal Components and Factor Analysis in R mean. After that, we will move on to their components, methods, and functions.

2. Introduction to Principal components and Factor Analysis in R

Principal component and factor analysis in R are multivariate analysis methods. Their aim is to reveal systematic covariation among a group of variables. The analysis can be motivated in many different ways, such as describing the basic anomaly patterns that appear in spatial data sets.

The analysis is always performed on a symmetric correlation or covariance matrix, which means the data must be numeric.

3. What are Principal components in R?

It is a normalized linear combination of the original predictors in a data set. We can write the first principal component as:

Z₁ = φ₁₁X₁ + φ₂₁X₂ + φ₃₁X₃ + … + φₚ₁Xₚ

Where,
Z₁ is the first principal component.
φ₁₁, φ₂₁, …, φₚ₁ is the loading vector of the first principal component. The loadings are constrained so that their sum of squares equals 1, since arbitrarily large loadings could inflate the variance. The loading vector defines the direction along which the data vary the most, i.e. the line in p-dimensional space that is closest to the n observations, where closeness is measured by the average squared Euclidean distance.
X₁, …, Xₚ are the normalized predictors, each with mean 0 and standard deviation 1.
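
As a quick check of this formula, the following sketch (using the built-in USArrests data set, chosen here purely for illustration) reproduces the first principal component score by hand from the scaled predictors and the first loading vector:

X <- scale(USArrests)               # normalized predictors (mean 0, standard deviation 1)
pca <- prcomp(USArrests, scale. = TRUE)
phi1 <- pca$rotation[, 1]           # loading vector of the first principal component
sum(phi1^2)                         # the sum of squared loadings equals 1
Z1_manual <- X %*% phi1             # Z1 = phi11*X1 + phi21*X2 + ... + phip1*Xp
head(cbind(Z1_manual, pca$x[, 1]))  # matches the scores computed by prcomp()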

4. Why Use Principal Components Analysis?

The main aim of principal components analysis is to reveal hidden structure in a data set. In doing so, we may be able to:
a. Identify how different variables work together to create the dynamics of the system.
b. Reduce the dimensionality of the data.
c. Decrease redundancy in the data.
d. Filter some of the noise in the data.
e. Compress the data.
f. Prepare the data for further analysis with other techniques.

5. Functions to perform principal component analysis in R

a. prcomp() (stats)
b. princomp() (stats)
c. PCA() (FactoMineR)
d. dudi.pca() (ade4)
e. acp() (amap)

6. Methods for Principal Component Analysis in R

There are two methods for principal component analysis in R:

a. Spectral decomposition

It examines the covariances/correlations between variables. The function princomp() uses this approach.

b. Singular value decomposition

It examines the covariances/correlations between individuals. The functions prcomp() and PCA() use this approach.
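
The relationship between the two methods can be sketched directly with base R's eigen() and svd() functions, which princomp() and prcomp() rely on internally (the built-in USArrests data is used here only as an illustrative example):

X <- scale(USArrests)               # center and scale the variables
eig <- eigen(cor(USArrests))        # spectral decomposition of the correlation matrix
eig$values                          # variances explained by the components
eig$vectors                         # loadings
sv <- svd(X)                        # singular value decomposition of the scaled data
sv$d^2 / (nrow(X) - 1)              # same variances as eig$values
sv$v                                # same loadings, possibly up to a sign flip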


7. prcomp() and princomp() functions

The simplified format of these two functions is:
prcomp(x, scale = FALSE)
princomp(x, cor = FALSE, scores = TRUE)
Arguments for prcomp()
a. x: a numeric matrix or data frame.
b. scale: a logical value. It indicates whether the variables should be scaled to have unit variance before the analysis.
Arguments for princomp()
a. x: a numeric matrix or data frame.
b. cor: a logical value. If TRUE, the analysis uses the correlation matrix rather than the covariance matrix, which amounts to centering and scaling the data first.
c. scores: a logical value. If TRUE, the coordinates of each observation on each principal component are calculated.
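
As a small illustration of these arguments (again with the built-in USArrests data, chosen only as an example), both functions can be run and their output inspected as follows:

res_prcomp <- prcomp(USArrests, scale. = TRUE)                  # standardize, then decompose
res_princomp <- princomp(USArrests, cor = TRUE, scores = TRUE)  # use the correlation matrix
summary(res_prcomp)         # proportion of variance explained by each component
res_prcomp$rotation         # loadings (one column per principal component)
head(res_prcomp$x)          # component scores from prcomp()
head(res_princomp$scores)   # component scores from princomp()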

 

8. Factor Analysis in R

Exploratory Factor Analysis, or simply Factor Analysis, is a technique used to identify the latent relational structure among a set of variables. Using this technique, the variance of a large number of variables can be explained with the help of a smaller number of underlying factors.

Let us understand factor analysis through the following example – 

Assume an instance of a demographics-based survey, say a survey about the number of dropouts in academic institutions. It is observed that the number of dropouts is much greater at higher levels of institutions: the number of high-school dropouts is much higher than that of junior school, and similarly the number of college dropouts is much higher than that of high school. In this case, the driving factor behind the dropouts is the increase in academic difficulty. But besides this, there can be many other factors, like financial background, localities with a higher pupil-teacher ratio, and even gender in the most remote parts. Since there are multiple factors that contribute towards the dropout rate, we have to define the variables in a structured and well-defined manner. The main principle of factor analysis is to assign weights to these variables based on the influence each underlying factor has.

With factor analysis, we are able to assess factors that are hidden from plain observation but are reflected in the observed variables of the data. We transform our dataset into an equal number of new variables, each a combination of the current ones, without removing or adding any information. Transforming the variables in the directions given by the eigenvectors then helps us determine the influential factors: a factor whose eigenvalue is greater than 1 explains more variance than a single original variable. The factors are arranged in decreasing order of their variances, so the first factor has a higher variance than the second, and so on. The weights with which the original variables contribute to a factor are known as ‘factor loadings’.
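
The eigenvalue rule mentioned above can be checked directly in R. A minimal sketch (using the built-in mtcars data simply as a stand-in) might look like this:

cm <- cor(mtcars)            # correlation matrix of the variables
ev <- eigen(cm)$values       # eigenvalues = variances of the candidate factors
ev
sum(ev > 1)                  # number of factors with more variance than an original variable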

Now, let us take a practical example of factor analysis. We will use the bfi dataset from the psych package, which comprises personality items. There are about 2800 observations, and the items group into five main factors – A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism), O (Openness). We will implement factor analysis to assess the association of the variables with each factor.

library(psych)                                            # Author DataFlair
dataset_bfi <- bfi                                        # Loading the dataset
dataset_bfi <- dataset_bfi[complete.cases(dataset_bfi), ] # Removing rows with missing values
cor_mat <- cor(dataset_bfi)                               # Creating the correlation matrix
FactorLoading <- fa(r = cor_mat, nfactors = 6)            # Extracting 6 factors from the correlation matrix
FactorLoading

From the above output, we observe that the factor corresponding to N (Neuroticism) accounts for the highest variance. We can infer that neuroticism is the most prominent trait among the respondents in our data.
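
To make the factor structure easier to read, the loadings of the FactorLoading object created above can also be drawn as a path diagram with the psych package's fa.diagram() function:

fa.diagram(FactorLoading)    # items grouped under the factor they load on most strongly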

9. Conclusion

We have studied principal component and factor analysis in R, along with their usage, functions, and components, and worked through a factor analysis example on the bfi dataset.
Hope you enjoyed the learning!
