Manipulating and processing data in R


1. Objectives

In this R programming tutorial, we will learn how to manipulate and process data in R. We will cover the basic data structures in R, the three subset operators, and the sample() command. We will then see how to create subgroups or bins of data, combine and merge data sets, sort and order data, traverse data with the apply functions, and use the formula interface.

2. Manipulating and processing data in R

Data structures provide the way to represent data in data analytics. We can manipulate data in R for analysis and visualization.

Before we start working with data in R, it helps to know how to import data into R and how to export it to external formats such as SAS, SPSS, text, or CSV files.

One of the most important aspects of computing with data in R is the ability to manipulate it for subsequent analysis and visualization. Let us look at a few basic data structures in R:

a. Vectors in R

A vector is an ordered container of primitive elements and is used for one-dimensional data.

Types – integer, numeric, logical, character, complex

b. Matrices in R

A matrix is a rectangular collection of elements and is useful when all the data is of a single class, such as numeric or character.

Dimensions – a matrix has two; an array generalizes this to three or more.

c. Lists in R

A list is an ordered container for arbitrary elements and is used for data that does not fit a regular shape, such as the customer information of an organization. When data cannot be represented as an array or a data frame, a list is the best choice, because a list can contain all kinds of other objects, including other lists or data frames.

d. Data frames

A data frame is a two-dimensional container for records (rows) and variables (columns) and is used to represent data such as spreadsheets. It is similar to a single table in a database.
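
As a minimal sketch, the four structures can be created as follows. All object names and values here are made up for illustration.

```r
# A vector: one-dimensional, every element has the same type
v <- c(1.5, 2.5, 4)

# A matrix: two-dimensional, every element has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A list: elements can be of any type, including other lists
l <- list(name = "Ann", scores = v, nested = list(a = 1))

# A data frame: columns are equal-length vectors that may differ in type
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

class(v)   # "numeric"
dim(m)     # 2 3
names(l)   # "name" "scores" "nested"
str(df)    # compact summary of the columns and their types
```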

3. Creating Subsets of Data in R

Data sets keep growing, and analyzing a complete data set can be very time-consuming. So the data is often divided into smaller pieces that are analyzed separately. The process of selecting part of a data set is called subsetting.

Different methods of subsetting in R are:

a. $

The dollar sign operator selects a single element of data by name. When you use this operator with a data frame, the result is a single column, which is usually a vector.

b. [[

Similar to $, the double square brackets operator also returns a single element, but it lets you refer to the element by position as well as by name. It can be used for data frames and lists.

c. [

The single square bracket operator in R returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.

For example, to retrieve the first five rows and all columns of the built-in data set iris, use the command below:

> iris[1:5, ]
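
The three operators can be compared side by side on iris. This is only a sketch, and the variable names are arbitrary.

```r
# $ extracts one column by name; for iris the result is a numeric vector
len1 <- iris$Sepal.Length

# [[ extracts one column by name or by position
len2 <- iris[["Sepal.Length"]]
len3 <- iris[[1]]

# [ can return several rows and columns; the result is still a data frame
first_rows <- iris[1:5, c("Sepal.Length", "Species")]

# A logical vector inside [ keeps the rows where the condition is TRUE
setosa <- iris[iris$Species == "setosa", ]

identical(len1, len2)  # TRUE
dim(first_rows)        # 5 2
nrow(setosa)           # 50
```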

4. sample() command in R

As we have seen, samples are created from data for analysis. To create a sample, use the sample() command and specify the number of values to draw.

For example, to simulate 10 rolls of a die, use the command below:

> sample(1:6, 10, replace=TRUE)

It gives output as:

[1] 2 2 5 3 5 3 5 6 3 5

By design, sample() produces different random values each time it is called. This is a problem for test code, where results need to be reproducible. If you set a seed value with set.seed() first, sample() produces the same sequence of values on every run.

The seed value is the starting point of the random number generator. It defines both the initial state of the generator and the sequence of values that follows.

Let us see how a seed value is used.

> set.seed(1) # set the seed for the random number generator
> sample(1:6, 10, replace=TRUE)

This gives output such as the following (the exact values depend on the R version, because R 3.6.0 changed the default sampling method):

[1] 2 3 4 6 2 6 6 4 4 1
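
A short sketch of why the seed matters: resetting the same seed before each call reproduces the same draws. The seed value 42 and the variable names are arbitrary choices.

```r
set.seed(42)
first <- sample(1:6, 10, replace = TRUE)

set.seed(42)
second <- sample(1:6, 10, replace = TRUE)

identical(first, second)  # TRUE: same seed, same draws

# sample() can also pick random rows from a data frame
rows <- sample(nrow(iris), 5)   # 5 distinct row numbers
iris_sample <- iris[rows, ]
nrow(iris_sample)               # 5
```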

5. Applications of Subsetting Data

Let us now see a few applications of subsetting data in R:

a. Duplicate data can be removed during analysis using the duplicated() function in R

The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value:

> duplicated(c(1,2,1,3,1,4))

This gives output as below:

[1] FALSE FALSE TRUE FALSE TRUE FALSE

TRUE is returned for every value that duplicates an earlier one.
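
To actually remove the duplicates, the logical vector can be negated and used as a subset index. The sketch below uses a made-up data frame to show the same idea for whole rows.

```r
x <- c(1, 2, 1, 3, 1, 4)

# Keep only the first occurrence of each value
unique_x <- x[!duplicated(x)]
unique_x   # 1 2 3 4

# Hypothetical data frame with one repeated row
df <- data.frame(id = c(1, 2, 2, 3), value = c("a", "b", "b", "c"))

# duplicated() on a data frame compares whole rows
df_unique <- df[!duplicated(df), ]
nrow(df_unique)   # 3
```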

b. Missing data can be identified using the complete.cases() function in R

During analysis, rows with missing data can be identified and removed as below:

The complete.cases() command in R finds the rows that are complete. It returns a logical vector with the value TRUE for rows that are complete and FALSE for rows that have any NA values.

Rows that have NA values can be removed using the na.omit() function as below:

> row_name <- na.omit(file_name) # file_name is a data frame; row_name receives its complete rows
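
As a small sketch on a made-up data frame, both approaches keep the same rows:

```r
# Hypothetical data with two incomplete rows
df <- data.frame(name   = c("a", "b", "c", "d"),
                 height = c(1.6, NA, 1.8, 1.7),
                 weight = c(60, 70, NA, 65))

complete.cases(df)   # TRUE FALSE FALSE TRUE

# Two ways to keep only the complete rows
clean1 <- df[complete.cases(df), ]
clean2 <- na.omit(df)

nrow(clean1)   # 2
nrow(clean2)   # 2
```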

6. Adding Calculated Fields to Data

After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.

Let us see data manipulation in R with the help of an example: calculating the ratio between the length and width of the sepals in iris.

The command for this is:

> x <- iris$Sepal.Length / iris$Sepal.Width

> head(x) # display the first six elements of the result

It gives the output as:

[1] 1.457143 1.633333 1.468750 1.483871 1.388889 1.384615

Let us discuss some variations of the operations performed on data frames in R.

a. with() function in R

To reduce the amount of typing and make code more readable, the with() command is used as below:

> y <- with(iris, Sepal.Length / Sepal.Width) # ratio between the length and width of the sepals, using with()

> head(y)

This gives the same output as above with less typing.

b. within() function in R

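Let us now see the use of the within() function for the same task:

> iris <- within(iris, ratio <- Sepal.Length / Sepal.Width)

Both with() and within() allow you to refer to columns inside a data frame without explicitly using the dollar sign or even the name of the data frame itself.

They are not interchangeable, however: with() returns the value of the expression, whereas within() returns a modified copy of the data frame, here with the new column ratio added.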
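
The return values of the two functions can be checked directly. This sketch works on a copy named flowers (an arbitrary name) so that the built-in iris data set stays untouched.

```r
flowers <- iris

# with() returns the value of the expression: here, a numeric vector
ratio_vec <- with(flowers, Sepal.Length / Sepal.Width)

# within() returns a copy of the data frame with the new column added
flowers <- within(flowers, ratio <- Sepal.Length / Sepal.Width)

is.numeric(ratio_vec)         # TRUE
is.data.frame(flowers)        # TRUE
"ratio" %in% names(flowers)   # TRUE
```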

7. Creating Subgroups or Bins of Data

Statisticians often draw histograms to investigate their data. Because grouping values into bins is so common in statistics, R has some functions for it.

a. cut() function in R

The cut() function groups the values of a variable into larger bins. Given a number of bins, it creates bins of equal width and classifies each element into its appropriate bin.

Let us see how cut() works with an example, where frost is a numeric vector:

> cut(frost, 3, include.lowest=TRUE)

This gives the result as a factor with three levels.

By default, the cut() function creates interval labels for the bins, such as (a,b]. You can also provide the label names yourself.

Let us see this with the help of an example:

> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))

The result now uses the three labels as factor levels.

b. table() function in R

To count the number of observations in each level of a factor, the table() command can be used as below:

> x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))

> table(x)

The result is a table containing the number of elements in each level of the factor.
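
The text does not show where frost comes from. The sketch below assumes it is the Frost column of the built-in state.x77 matrix, which is an assumption made only so that the example runs on its own.

```r
# Assumption: frost is the Frost column of the built-in state.x77 matrix
frost <- state.x77[, "Frost"]

# Three equal-width bins with custom labels
bins <- cut(frost, 3, include.lowest = TRUE, labels = c("Low", "Med", "High"))

levels(bins)   # "Low" "Med" "High"
table(bins)    # number of states in each bin

# Explicit break points can be given instead of a number of bins
# (0, 60, 120 and 200 are arbitrary cut points chosen for illustration)
custom <- cut(frost, breaks = c(0, 60, 120, 200), include.lowest = TRUE)
table(custom)
```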

8. Combining and Merging Datasets in R

If you want to combine data from different sources in R, you can do so in three ways:

a. By Adding Columns using cbind() in R

If the two sets of data have the same number of rows, and the order of the rows is identical, then adding columns makes sense. This can be done by using the data.frame() or cbind() function.

b. By Adding Rows using rbind() function in R

If both sets of data have the same columns and you want to add rows to the bottom, use rbind().

c. By Combining Data With Different Shapes using merge() function in R

The merge() function combines data by matching rows on the values of common columns. In database language, this is usually called joining data.

Use merge() when the two data sets do not line up row by row and rows should be combined only where the values in the matching columns agree.
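
The first two approaches can be sketched with small made-up data frames. All names and values are hypothetical.

```r
a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(y = c("p", "q", "r"))

# cbind(): same number of rows, columns are placed side by side
wide <- cbind(a, b)
names(wide)   # "id" "x" "y"

# rbind(): same columns, rows are appended at the bottom
more <- data.frame(id = 4:5, x = c(40, 50))
long <- rbind(a, more)
nrow(long)    # 5
```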

9. merge() Function in R

The merge() function is used to combine data frames. Let us see this with an example:

> merge(cold.states, large.states)

This command creates a data frame of the states that are both cold and large, with the columns Name, Frost and Area.

Let us see the different types of merge.

The merge() function allows four ways of combining data:

a. Natural join in R

To keep only rows that match from the data frames, specify the argument all=FALSE

b. Full outer join in R

To keep all rows from both data frames, specify all=TRUE

c. Left outer join in R

To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE

d. Right outer join in R

To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE

The merge() function takes a number of arguments, including:

  • x: A data frame
  • y: A data frame
  • by, by.x, by.y: Names of the columns common to both x and y. By default, it uses columns with common names between the two data frames.
  • all, all.x, all.y: Logical values that specify the type of merge. The default value is all = FALSE
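
The four join types can be compared on two small data frames. This is a sketch with made-up data: the state names A to D and all values are hypothetical and are not the cold.states and large.states data used above.

```r
left  <- data.frame(Name = c("A", "B", "C"), Frost = c(100, 120, 150))
right <- data.frame(Name = c("B", "C", "D"), Area  = c(500, 600, 700))

inner  <- merge(left, right)                 # natural join: B, C
full   <- merge(left, right, all = TRUE)     # full outer join: A, B, C, D
l_join <- merge(left, right, all.x = TRUE)   # left outer join: A, B, C
r_join <- merge(left, right, all.y = TRUE)   # right outer join: B, C, D

nrow(inner)    # 2
nrow(full)     # 4
l_join$Area    # NA for A, which has no match in right
```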

10. match() function in R

The match() function returns the positions of the first matches of the elements of one vector in a second vector.

> index <- match(cold.states$Name, large.states$Name)

For each state name in cold.states, this command looks up its position in large.states, which shows which cold states also occur in the data frame large.states. Names without a match give NA.

> index

It gives output as:

[1] 1 4 NA NA 5 6 NA NA NA NA NA
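
The same behaviour can be seen on two short character vectors. The vectors below are made up for illustration and do not reproduce the cold.states and large.states data.

```r
cold  <- c("Alaska", "Maine", "Montana", "Vermont")
large <- c("Alaska", "California", "Montana", "Texas")

# Position of each element of cold within large, NA when absent
idx <- match(cold, large)
idx                  # 1 NA 3 NA

# Use the non-NA positions to keep the names found in both vectors
cold[!is.na(idx)]    # "Alaska" "Montana"

# %in% gives the same information as a logical vector
cold %in% large      # TRUE FALSE TRUE FALSE
```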

11. Sorting and Ordering Data in R using sort() and order()

A common task in data analysis and reporting is sorting information. You can answer many everyday questions with sorted tables of data that tell you the best or worst of specific things; for example, parents want to know which school in their area is the best, and businesses need to know the most productive factories or the most lucrative sales areas.

Let us first create a data frame and then sort it.

> some.states <- data.frame(Region = state.region, state.x77)

This is the command to create the data frame some.states.

> some.states <- some.states[1:10, 1:3]

This keeps only the first ten rows and the first three columns.

By default, sort() sorts in ascending order.

> sort(some.states$Population) # sort Population in ascending order

> sort(some.states$Population, decreasing=TRUE) # sort Population in descending order

This is how a vector can be sorted in R.

Data frames can be sorted with order() as below:

> order.pop <- order(some.states$Population)

The order() function returns the row positions of some.states arranged so that Population is in ascending order.

To sort the data frame in ascending order, use these positions as the row index:

> some.states[order.pop, ]

To sort in descending order, specify decreasing=TRUE:

> some.states[order(some.states$Population, decreasing=TRUE), ]

This is how the order() and sort() functions are used.
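
The difference between the two functions is easiest to see on a tiny example: sort() returns the sorted values, while order() returns the positions that would sort them. The data frame below is made up for illustration.

```r
df <- data.frame(state = c("A", "B", "C", "D"),
                 pop   = c(300, 100, 400, 200))

sort(df$pop)            # 100 200 300 400

ord <- order(df$pop)
ord                     # 2 4 1 3

# Index the rows with the positions to sort the whole data frame
sorted      <- df[ord, ]
sorted_desc <- df[order(df$pop, decreasing = TRUE), ]
```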

12. Traversing Data with the apply() Function in R

To traverse data, R uses the apply family of functions. Which function to use, and what it returns, depends on the data structure being traversed.

a. Array or matrix

The apply() function traverses either the rows or columns of a matrix, applies a function to each resulting vector, and returns a vector of summarized results.

b. List

The lapply() function traverses a list, applies a function to each element, and returns a list of the results: a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. When the resulting list can be simplified into a vector or matrix, sapply() does so.

The apply() function is used as below:

apply(X, MARGIN, FUN, ...)

The apply() function takes four arguments as below:

  • X: This is the data—an array (or matrix)
  • MARGIN: This is a numeric vector that indicates the dimension over which to traverse—1 means rows and 2 means columns
  • FUN: This is the function to apply (for example, sum or mean)
  • … (dots): If the FUN function requires any additional arguments, they can be added here.

In essence, the apply() function lets us apply a function across a matrix or data frame. If MARGIN=1, the function receives each row of X as a vector argument, and apply() returns a vector of the results. Similarly, if MARGIN=2, the function acts on the columns of X. When MARGIN=c(1,2), the function is applied to every entry of X.

Let us now discuss the variations of the apply() function:

a. lapply() function in R

We have already seen it above.

b. sapply() function in R

It works on a list or vector and returns a vector or matrix when the results can be simplified, and a list otherwise.

c. tapply() function in R

It is used to create tabular summaries of data. This function takes three arguments:

  • X: Refers to a vector
  • INDEX: Refers to a factor or list of factors
  • FUN: Refers to a function

An illustrative example

Consider the code below:

# Create the matrix

m <- matrix(seq(from=-98, to=100, by=2), nrow=10, ncol=10)

# Return the product of each of the rows

apply(m,1,prod)

# Return the sum of each of the columns

apply(m,2,sum)
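
The list and factor variants described in this section can be sketched in the same way. The list scores is made up for illustration; the tapply() call uses the built-in iris data.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply() always returns a list
as_list <- lapply(scores, mean)     # list with a = 2 and b = 15

# sapply() simplifies the result to a named vector when it can
as_vector <- sapply(scores, mean)   # named numeric vector: a = 2, b = 15

# tapply() summarizes a vector for each level of a factor:
# here, the mean sepal length of each species
by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
by_species
```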

13. Introduction to the Formula Interface in R

The R formula interface allows you to concisely specify which columns to use when fitting a model, as well as the behavior of the model.

You need the operators when you start building models. Formula notation refers to statistical formulae, as opposed to mathematical formulae. The formula operator + means to include a column, not to mathematically add two columns together.

Operator   Example      Meaning
~          y ~ x        Model y as a function of x
+          y ~ a + b    Include columns a as well as b
-          y ~ a - b    Include a but exclude b
:          y ~ a : b    Estimate the interaction of a and b
*          y ~ a * b    Include columns as well as their interaction (that is, y ~ a + b + a:b)
|          y ~ a | b    Estimate y as a function of a conditional on b

The table above shows the meaning of each operator in the formula interface.
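
One way to check how R reads a formula is to ask terms() for the terms it expands to. This is a sketch: the symbols y, a and b are placeholders that do not need to exist as variables, and the lm() call on iris is just one possible model that accepts a formula.

```r
f_plus  <- y ~ a + b
f_star  <- y ~ a * b
f_minus <- y ~ a + b - b

attr(terms(f_plus), "term.labels")    # "a" "b"
attr(terms(f_star), "term.labels")    # "a" "b" "a:b"
attr(terms(f_minus), "term.labels")   # "a"

# The same notation selects the columns used when fitting a model
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
names(coef(fit))
```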

14. Variables in R

For reshaping purposes, the variables of a data set fall into two types:

a. Identifier variables in R

Identifier, or ID, variables identify the observations. They act as the keys of the data set.

b. Measured variables in R

These represent the measurements taken on each observation.

15. Getting started with reshape2 Package in R

Base R has a function, reshape(), that works fine for reshaping longitudinal data.

The problem of data reshaping is far more generic than simply dealing with longitudinal data, however. The reshape2 package provides several functions to convert data between long and wide format.

> install.packages("reshape2") # install the reshape2 package

> library("reshape2") # load the reshape2 package

The reshape2 package is based on two key functions:

  • melt() takes wide-format data and melts it into long-format data.
  • dcast() (or acast(), for arrays) takes long-format data and casts it into wide-format data.
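
A round trip between the two formats might look as follows. This sketch assumes the reshape2 package is installed, and the guard skips the example when it is not. The data frame and its column names are made up for illustration.

```r
# Assumption: reshape2 is installed; otherwise the block is skipped
if (requireNamespace("reshape2", quietly = TRUE)) {
  # Wide format: one row per id, one column per measured variable
  wide <- data.frame(id = 1:2, height = c(1.6, 1.8), weight = c(60, 75))

  # melt(): id stays as the identifier, the measured columns are stacked
  long <- reshape2::melt(wide, id.vars = "id")
  print(long)   # columns: id, variable, value

  # dcast(): spread the variable column back into separate columns
  back <- reshape2::dcast(long, id ~ variable, value.var = "value")
  print(back)   # columns: id, height, weight
}
```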

