Data Manipulation In R Tutorial | R Processing Data
1. Data Manipulation In R
This tutorial covers data manipulation and data processing in the R programming language. We review three groups of operations: subsetting, sorting, and merging. We also look at data structures in R, how to create subsets, the sample() command, and ways to create subgroups or bins of data. Finally, we cover methods for aggregating, sorting, ordering, and traversing data.
So, let’s start Data Manipulation in R.
2. What is Data Manipulation in R?
Data manipulation is the process of organising the data held in R's data structures so that it is ready for further analysis and visualisation.
Before we start playing with data in R, let us see how to import data in R and ways to export data from R to different external sources like SAS, SPSS, text file or CSV file.
One of the most important aspects of computing with data is the ability to manipulate it, which enables its subsequent analysis and visualization. Let us see a few basic data structures in R:
a. Vectors in R
A vector is an ordered container of primitive elements and is used for 1-dimensional data.
Types – integer, numeric, logical, character, complex
b. Matrices in R
A matrix is a rectangular collection of elements and is useful when all the data is of a single class, such as numeric or character.
Dimensions – a matrix has two; arrays extend this to three or more.
c. Lists in R
A list is an ordered container for arbitrary elements and is used for higher-dimensional data, such as the customer information of an organization. When data cannot be represented as an array or a data frame, a list is the best choice, because lists can contain all kinds of other objects, including other lists or data frames. In that sense, they are very flexible.
d. Data frames
Data frames are two-dimensional containers for records and variables and are used for representing data from spreadsheets and similar sources. A data frame is similar to a single table in a database.
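As a minimal sketch of the four structures described above, the following creates one of each. The object names and values are made up for illustration.

```r
# A vector: one-dimensional, every element has the same type
v <- c(2.5, 3.1, 4.8)

# A matrix: two-dimensional, every element has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A list: elements may be of any type, including other lists
l <- list(name = "Acme", orders = c(12, 7), active = TRUE)

# A data frame: rows are records, columns are variables
df <- data.frame(id = 1:3, score = c(90, 85, 77))

class(v)    # type of the vector
dim(m)      # rows and columns of the matrix
names(l)    # element names of the list
str(df)     # compact overview of the data frame
```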
3. Creating Subsets of Data in R
Data sizes keep growing, and analysing a complete data set can be very time-consuming. So the data is divided into smaller samples and the analysis is done on those samples. The process of creating these samples is called subsetting.
Different methods of subsetting in R are:
- $: The dollar sign operator selects a single element of data by name. When used with a data frame, it returns the named column, which is usually a vector.
- [[: Similar to $, the double square brackets operator also returns a single element, but it offers the flexibility of referring to the element by position as well as by name. It can be used for data frames and lists.
- [: The single square bracket operator returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.
For example, to retrieve the first 5 rows and all columns of the built-in iris dataset, use the command below:
> data(iris)
> iris[1:5, ]
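The three subsetting operators can be compared side by side on the same built-in iris dataset. This is only a sketch; the variable names on the left-hand side are chosen for illustration.

```r
data(iris)

# $ returns one column by name (a numeric vector here)
sl <- iris$Sepal.Length

# [[ returns one column, by name or by position
sl_by_name <- iris[["Sepal.Length"]]
sl_by_pos  <- iris[[1]]

# [ returns multiple elements; the result is still a data frame
first_rows <- iris[1:5, c("Sepal.Length", "Species")]

# A logical vector inside [ keeps only the rows where it is TRUE
setosa <- iris[iris$Species == "setosa", ]

identical(sl, sl_by_pos)   # the same column, reached two ways
dim(first_rows)
nrow(setosa)
```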
4. sample() command in R
As we have seen, samples are created from data for analysis. To create a sample, use the sample() command and specify the number of values to draw.
For example, to simulate 10 rolls of a die, use the command below:
> sample(1:6, 10, replace=TRUE)
sample() produces different random values each time it is called, which makes test code hard to reproduce. If you set a seed value first, the sample() command returns the same sequence of values on every run. The exact values drawn for a given seed can differ between R versions.
The seed value is the starting point for the random number generator. It defines both the initialization of the generator and the path that the generated sequence will follow.
Let us see how a seed value is used.
> set.seed(100)
> sample(1:5, 10, replace = TRUE) #DataFlair
[1] 2 2 3 1 3 3 5 2 3 1
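To check that a seed really makes the draws repeatable, a small sketch like the following resets the seed before each call and compares the results. The seed value 100 is arbitrary.

```r
set.seed(100)
first <- sample(1:6, 10, replace = TRUE)

set.seed(100)   # reset the generator to the same starting point
second <- sample(1:6, 10, replace = TRUE)

identical(first, second)   # TRUE: same seed, same draws

# The same idea works for drawing random rows from a data frame
set.seed(100)
picked <- iris[sample(nrow(iris), 5), ]
```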
Now let’s move ahead in the R Data Manipulation tutorial with Applications of Subsetting Data.
5. Applications of Subsetting Data
Let us now see a few applications of subsetting data in R:
a. Duplicate data can be removed during analysis using the duplicated() function in R
The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value.
For every value that repeats an earlier value in the sample, TRUE is returned.
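A minimal sketch of duplicated() on a small made-up vector shows both the logical result and how to use it to drop the repeats:

```r
x <- c(3, 1, 3, 2, 1)

dups <- duplicated(x)   # FALSE FALSE TRUE FALSE TRUE
uniq <- x[!dups]        # 3 1 2: the first occurrence of each value is kept

dups
uniq
```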
b. Missing data can be identified using complete.cases() function in R
During analysis, any row with missing data can be identified and removed as shown below:
The complete.cases() command in R is used to find rows which are complete. It returns a logical vector with the value TRUE for rows that are complete, and FALSE for rows that have some NA values. We will first create our data and store it in a csv file as follows:
#Author DataFlair
data <- read.table(header=TRUE, text='
 subject sex size
       1   M    7
       2   F   NA
       3   F    9
       4   M   11
')
write.csv(data, "/home/dataflair/table.csv", row.names=FALSE)
Rows which have NA values can be removed using na.omit() function as below:
> file <- read.csv("/home/dataflair/table.csv")
> na.omit(file)
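The same table can also be built directly in memory to see what complete.cases() returns. This sketch assumes a data frame named survey (a hypothetical name) holding the same values as the table above.

```r
# Same values as the table above, built without a file
survey <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  size    = c(7, NA, 9, 11)
)

ok <- complete.cases(survey)   # TRUE FALSE TRUE TRUE
complete_rows <- survey[ok, ]  # keeps subjects 1, 3 and 4

ok
complete_rows
```

Indexing with the logical vector keeps the same rows that na.omit() would keep.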
Next topic in Data Manipulation in R Tutorial is Adding Fields to Data.
6. Adding Calculated Fields to Data
After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.
Let us see this with an example: calculating the ratio between the length and the width of the sepals in the iris dataset.
The command for this is:
> data(iris)
> x <- iris$Sepal.Length / iris$Sepal.Width
> head(x) #Author DataFlair
a. with() function in R
To reduce the amount of typing and make the code more readable, we can use the with() function as below:
> y <- with(iris, Sepal.Length / Sepal.Width) #Author DataFlair
> head(y)
This gives the same output as above with less typing.
b. within() function in R
Let us now see the use of the within() function for the same task:
iris <- within(iris, ratio <- Sepal.Length / Sepal.Width)
iris
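The with() function allows you to refer to columns inside a data frame without explicitly using the dollar sign or even the name of the data frame itself.
with() and within() are closely related but not identical: with() returns the value of the evaluated expression, while within() returns a modified copy of the data frame.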
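The difference in return values can be seen in a short sketch. It works on a copy of iris (named flowers here, a hypothetical name) so that the built-in dataset is left unchanged.

```r
data(iris)
flowers <- iris   # work on a copy

# with() evaluates the expression and returns its value: a numeric vector
r <- with(flowers, Sepal.Length / Sepal.Width)

# within() returns a copy of the data frame with the new column added
flowers2 <- within(flowers, ratio <- Sepal.Length / Sepal.Width)

is.numeric(r)
"ratio" %in% names(flowers2)
```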
7. Creating Subgroups or Bins of Data
Statisticians often draw histograms to investigate their data. Because grouping values into bins is so common in statistics, R provides functions for it.
a. cut() function in R
The cut() function groups the values of a variable into larger bins. Given a number of bins, it creates intervals of equal width and classifies each element into its appropriate bin.
Let us see how cut() works in R with an example:
> #Author DataFlair
> frost <- c(1,2,3)
> cut(frost, 3, include.lowest=TRUE)
> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
This gives the result as a factor with three levels. By default, the cut() function labels each bin with its interval in mathematical notation, such as (a,b]. You can supply your own label names through the labels argument.
The result shows three labels in the output.
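cut() also accepts explicit break points instead of a number of bins. The sketch below applies it to the Frost column of the built-in state.x77 matrix; the break points 0, 50, 100 and 200 are arbitrary values picked for illustration.

```r
# Frost column of the built-in state.x77 matrix, one value per US state
frost <- state.x77[, "Frost"]

# Explicit break points define the bin edges
bins <- cut(frost,
            breaks = c(0, 50, 100, 200),
            include.lowest = TRUE,
            labels = c("Low", "Med", "High"))

levels(bins)
head(bins)
```

With include.lowest=TRUE the value 0 falls into the first bin instead of becoming NA.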
b. table() function in R
To count the number of observations in each level of a factor, we can use the table() command as below:
> inp <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High")) #Author DataFlair
> table(inp)
The result is a table containing the number of elements in each level of the factor. Now let us see how to combine and merge datasets in R.
8. Combining and Merging Datasets in R
If you want to combine data from different sources in R, you can combine different sets of data in three ways:
a. By Adding Columns using cbind() in R
If the two sets of data have the same number of rows, and the order of the rows is identical, then adding columns makes sense. This can be done by using the data.frame() or cbind() function.
b. By Adding Rows using rbind() function in R
If both sets of data have the same columns and you want to add rows to the bottom, use rbind().
c. By Combining Data With Different Shapes using merge() function in R
The merge() function combines data based on common columns as well as rows. In database language, this is usually called joining data.
The merge() function is useful for combining existing data sets. You can use merge() to combine data only when certain matching conditions are satisfied.
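As a small sketch of the first two approaches, the following uses made-up data frames to add a column with cbind() and to append rows with rbind():

```r
a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(y = c("p", "q", "r"))

# cbind(): same number of rows, in the same order, so columns line up
wide <- cbind(a, b)      # 3 rows; columns id, x, y

# rbind(): same columns, so the new rows go at the bottom
more <- data.frame(id = 4:5, x = c(40, 50))
tall <- rbind(a, more)   # 5 rows; columns id, x

dim(wide)
dim(tall)
```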
The next function in this R Data Manipulation tutorial is merge().
9. merge() Function in R
The merge() function is used to combine data frames. Let us see this with an example:
#Author DataFlair
every.states <- as.data.frame(state.x77)
every.states$Name <- rownames(state.x77)
rownames(every.states) <- NULL
str(every.states)
#Creating a subset of freezing states
freezing.states <- every.states[every.states$Frost>150 , c("Name", "Frost")]
freezing.states
#Creating a subset of big states
big.states <- every.states[every.states$Area>=100000 , c("Name", "Area")]
big.states
#Using the merge function
merge(freezing.states, big.states)
The last command creates a data frame that contains only the states that are both cold and large.
Let us see different types of merge().
The merge() function allows four ways of combining data:
a. Natural join in R
To keep only rows that match from the data frames, specify the argument all=FALSE
b. Full outer join in R
To keep all rows from both data frames, specify all=TRUE
c. Left outer join in R
To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE
d. Right outer join in R
To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE
The merge() function takes several arguments, including the following:
- x: A data frame
- y: A data frame
- by, by.x, by.y: Names of the columns common to both x and y. By default, it uses columns with common names between the two data frames.
- all, all.x, all.y: Logical values that specify the type of merge. The default value is all = FALSE
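The four join types can be compared on two small made-up data frames that share a key column. The names x, y, key, vx and vy are chosen for illustration.

```r
x <- data.frame(key = c("a", "b", "c"), vx = 1:3)
y <- data.frame(key = c("b", "c", "d"), vy = 4:6)

inner <- merge(x, y, by = "key")                 # keys b, c
full  <- merge(x, y, by = "key", all = TRUE)     # keys a, b, c, d
left  <- merge(x, y, by = "key", all.x = TRUE)   # keys a, b, c
right <- merge(x, y, by = "key", all.y = TRUE)   # keys b, c, d

full   # unmatched cells are filled with NA
```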
Any doubt yet in Data Manipulation in R? Please Comment.
10. match() function in R
The R match() function returns the matching positions of two vectors or, more specifically, the positions of the first matches of one vector in the second vector.
> ind <- match(freezing.states$Name, big.states$Name) #DataFlair
> ind
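On a small made-up lookup vector, the behaviour of match() is easy to follow: each value gets its position in the second vector, or NA when it is not found.

```r
lookup <- c("a", "b", "c")
wanted <- c("b", "z", "a")

idx <- match(wanted, lookup)   # 2 NA 1
present <- wanted %in% lookup  # TRUE FALSE TRUE

idx
present
```

The %in% operator is a convenient companion when only a found/not found answer is needed.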
11. Sorting and Ordering Data in R using sort() and order()
A common task in data analysis and reporting is sorting information. You can answer many everyday questions with sorted tables of data that tell you the best or worst of specific things; for example, parents want to know which school in their area is the best, and businesses need to know the most productive factories or the most lucrative sales areas.
Let us first create a data frame called some.states and then sort it.
#Author DataFlair
some.states <- data.frame(Region = state.region, state.x77)
some.states <- some.states[1:10, 1:3]
sort(some.states$Population)                    #Command to sort Population in ascending order
sort(some.states$Population, decreasing=TRUE)   #Command to sort Population in descending order
order.pop <- order(some.states$Population)      #Another way of sorting
some.states[order.pop, ]                        #In ascending order
order(some.states$Population, decreasing=TRUE)  #Descending Order
This is how we use the order() and sort() functions.
Let us move on to another data manipulation method in R, traversing data.
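The key difference is that sort() returns the sorted values, while order() returns the positions that would sort them, which is what you need to reorder a whole data frame. A minimal sketch with made-up values:

```r
pop <- c(B = 300, A = 100, C = 200)

sorted <- sort(pop)   # the values themselves: A = 100, C = 200, B = 300
pos <- order(pop)     # the positions that sort them: 2 3 1

# order() can reorder every column of a data frame at once
df <- data.frame(name = c("B", "A", "C"), pop = c(300, 100, 200))
by_pop <- df[order(df$pop, decreasing = TRUE), ]

sorted
pos
by_pop
```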
12. Traversing Data with the apply() Function in R
To traverse data, R uses the apply family of functions. Which function to use, and what it returns, depends on the data structure being traversed.
a. Array or matrix
The apply() function traverses either the rows or columns of a matrix, applies a function to each resulting vector, and returns a vector of summarized results.
b. List
The lapply() function traverses a list, applies a function to each element, and returns a list of the results. lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. Sometimes it is possible to simplify the resulting list into a matrix or vector.
The apply() function is used as below:
apply(X, MARGIN, FUN, ...)
The apply() function takes four arguments as below:
- X: This is the data—an array (or matrix)
- MARGIN: This is a numeric vector that indicates the dimension over which to traverse—1 means rows and 2 means columns
- FUN: This is the function to apply (for example, sum or mean)
- … (dots): If the FUN function requires any additional arguments, they can be added here.
In essence, the apply() function lets us run a function over the rows, columns, or individual entries of a matrix or data frame. If MARGIN=1, the function accepts each row of X as a vector argument and returns a vector of the results. Similarly, if MARGIN=2, the function acts on the columns of X. With MARGIN=c(1,2), the function is applied to every entry of X.
Let us now discuss the variations of the apply() function:
a. lapply() function in R
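It traverses a list and returns a list, as described above.
b. sapply() function in R
It works on a list or vector and simplifies the result to a vector or matrix where possible; otherwise it returns a list.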
c. tapply() function in R
We use it to create tabular summaries of data. This function takes three arguments:
- X: Refers to a vector
- INDEX: Refers to a factor or list of factors
- FUN: Refers to a function
An illustrative example
#Author DataFlair
#Create the matrix
m <- matrix(c(seq(from=-98,to=100,by=2)),nrow=10,ncol=10)
# Return the product of each of the rows
apply(m,1,prod)
# Return the sum of each of the columns
apply(m,2,sum)
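The other members of the apply family can be sketched in the same way. The list of scores below is made up; the tapply() call uses the built-in iris dataset.

```r
scores <- list(math = c(80, 90), art = c(70, 75, 95))

as_list <- lapply(scores, mean)   # a list with one mean per element
as_vec  <- sapply(scores, mean)   # the same result as a named numeric vector

# tapply(): summarize a vector within the groups defined by a factor
data(iris)
by_species <- tapply(iris$Sepal.Length, iris$Species, mean)

# apply() with MARGIN = c(1, 2) visits every entry of a matrix
small <- matrix(1:4, nrow = 2)
scaled <- apply(small, c(1, 2), function(v) v * 10)

as_vec
by_species
scaled
```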
13. Formula Interface in R
The R formula interface allows you to concisely specify which columns to use when fitting a model, as well as the behavior of the model.
You need the operators when you start building models. Formula notation refers to statistical formulae, as opposed to mathematical formulae. The formula operator + means to include a column, not to mathematically add two columns together.
| Operator | Example | Meaning |
|---|---|---|
| ~ | y ~ x | Model y as a function of x |
| + | y ~ a + b | Include columns a as well as b |
| - | y ~ a - b | Include a but exclude b |
| : | y ~ a : b | Estimate the interaction of a and b |
| * | y ~ a * b | Include columns as well as their interaction (that is, y ~ a + b + a:b) |
| \| | y ~ a \| b | Estimate y as a function of a conditional on b |
The table above shows the meanings of the different operators in the formula interface.
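A few of these operators can be tried on the built-in iris dataset. This is only a sketch: the choice of columns is arbitrary, and aggregate() and lm() are just two of the base R functions that accept a formula.

```r
data(iris)

# ~ : mean sepal length as a function of species
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# + : include two columns as predictors in a linear model
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# * : expands to both columns plus their interaction
star_terms <- attr(terms(y ~ a * b), "term.labels")   # "a" "b" "a:b"

means
coef(fit)
star_terms
```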
14. Variables in R
When reshaping data, the variables in a data set fall into two types:
a. Identifier variables in R
Identifier, or ID, variables act as the keys that identify the observations.
b. Measured variables in R
Measured variables hold the values that were measured for each observation.
15. Reshape2 Package in R
Base R has a function, reshape(), that works fine for reshaping longitudinal data.
The problem of data reshaping is far more generic than simply dealing with longitudinal data. The reshape2 package addresses this with several functions that convert data between long and wide format.
> install.packages("reshape2")
> library(reshape2)
The reshape2 package is based on two key functions:
- melt() takes wide-format data and melts it into long-format data.
- The cast functions, dcast() for data frames and acast() for vectors, matrices and arrays, take long-format data and cast it into wide-format data.
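A round trip from wide to long and back might look like the sketch below. It assumes the reshape2 package is installed (the code is skipped otherwise), and the data frame and its column names are made up for illustration.

```r
# Runs only if reshape2 is installed
if (requireNamespace("reshape2", quietly = TRUE)) {
  # Wide format: one row per id, one column per quarter
  wide <- data.frame(id = 1:2, q1 = c(5, 6), q2 = c(7, 8))

  # melt(): id is the identifier variable, q1 and q2 are measured variables
  long <- reshape2::melt(wide, id.vars = "id",
                         variable.name = "quarter", value.name = "sales")

  # dcast(): back to one column per quarter
  back <- reshape2::dcast(long, id ~ quarter, value.var = "sales")

  print(long)
  print(back)
}
```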
So, this was all on Data Manipulation in R. Hope you like our explanation.
16. Conclusion – Data Manipulation in R
Hence, in this tutorial on Data Manipulation in R, we discussed creating subsets of data in R. Moreover, we saw the sample() command in R, applications of subsetting data, creating subgroups or bins of data, combining and merging datasets in R, and much more. If you liked this R Data Manipulation tutorial, please comment.