Descriptive Statistics in R | Obtain Summary Statistics in R

1. Objective – Descriptive Statistics in R

In our previous blog on R, we discussed Arguments in R. Today's blog gives a detailed description of Descriptive Statistics in R, also known as Summary Statistics in R. We will cover various R commands, including summary, names, apply, the simple and complex cumulative commands, and summary statistics for matrix objects.

So, let’s start Descriptive Statistics in R Tutorial.


2. What are Summary Statistics/Descriptive Statistics?

Data gathered for an analysis is useful only when it is represented so that everyone can easily understand it and use it for decision making. Thus, after analyzing data, summarizing it plays a vital role.
We can summarize data in several ways, either as text or as a pictorial representation.
Below are the ways of summarizing data in R:

  • Descriptive/Summary Statistics – Descriptive Statistics in R (Summary statistics) are the first figures used to represent nearly every dataset. They also form the foundation for much more complicated computations and analyses. Thus, in spite of being composed of simple methods, they are essential to the analysis process.
  • Tabulation – Representing the analyzed data in tabular form for easy understanding.
  • Graphical – Representing the data graphically.

In this Descriptive Statistics in R Tutorial, we will now see the Summary commands in R.

3. Summary Commands in R

Whenever you start working on a data set, you need an overview of what you are dealing with. There are a few ways of doing this:
As we saw in an earlier session, the ls() command lists the named objects that you have, so you can start with ls().
Once you know which objects are available, you can type the name of an object to view its contents. However, if the object contains a lot of data, the display may be quite large and you may want a more concise method to examine objects.
You could use the str() command, which is designed to help you examine the structure of a data object rather than to provide a statistical summary. It reports the number of rows and columns in the data and shows the first values of each column next to its name.
To get a quick statistical summary of a data object, you can use the summary() command.
The output of summary() depends on the object you are looking at. For numeric data it reports values such as the minimum, the maximum, the mean, and the median.
For example, suppose you have the data below:
S.No.   Item      Quantity
1       Pen       5
2       Pencil    10
3       Rubber    12
The str() command gives output describing:

3 obs. of 2 variables
Item: Pen Pencil Rubber
Quantity: 5 10 12

The summary() command gives output of the following form for Quantity:
Min: 5
Max: 12
Mean: 9

The summary() command is therefore more useful when you want values such as the minimum, maximum, and mean. It works for both matrix and data frame objects by summarizing the columns rather than the rows.
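The small table above can be rebuilt as a data frame to compare the two commands. This is only a sketch: the object name items and the column types are assumptions made for illustration.

```r
# Rebuild the example table as a data frame (the name 'items' is illustrative).
items <- data.frame(
  Item = c("Pen", "Pencil", "Rubber"),
  Quantity = c(5, 10, 12)
)

str(items)                  # structure: 3 obs. of 2 variables
summary(items)              # per-column summary of the whole data frame

# Summary of the numeric column only.
qty_summary <- summary(items$Quantity)
qty_summary[["Min."]]       # 5
qty_summary[["Max."]]       # 12
qty_summary[["Mean"]]       # 9
```

Individual values can be pulled out of the summary by name, as the last three lines show.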

4. Name Commands in R

The names command and its variants are used to find or add names to the rows and columns of data structures.
A few of these commands are explained below:

  • names() – Works on list or data frame objects. It gets or sets the names of the columns of a data frame or of the elements of a list.
  • row.names() – Works on matrix or data frame objects and gets or sets row names.
  • rownames() – Works on matrix or data frame objects and is used to give names to rows.
  • colnames() – Works on matrix or data frame objects and is used to give names to columns.
  • dimnames() – Gets the row and column names of matrix or data frame objects, i.e., the names along each dimension (use dim() to see the dimensions themselves).

rownames() and row.names() return the same values for data frames and matrices; the only difference is that where there are no names, rownames() prints NULL (as does colnames()), whereas row.names() returns it invisibly.
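A short sketch of the naming commands follows. The data frame df, its values, and the row names are invented purely for illustration.

```r
# A small illustrative data frame (values are made up for this sketch).
df <- data.frame(len = c(3, 8), sp = c(12, 16), alg = c(9, 11))

names(df)                          # column names: "len" "sp" "alg"
rownames(df) <- c("Taw", "Ouse")   # set the row names
rownames(df)                       # "Taw" "Ouse"
colnames(df)                       # same as names(df) for a data frame
dimnames(df)                       # list: row names, then column names

# A matrix created without names has no row names to report.
m <- matrix(1:4, nrow = 2)
rownames(m)                        # NULL
```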
Descriptive statistics is used to analyze data in various types of industries, such as education, information technology, entertainment, retail, agriculture, transport, sales and marketing, psychology, demography, and advertising. In a broader sense, it is used as a tool to interpret and analyze data. For example, with the help of descriptive statistics, a production engineer can uncover the truth behind breakdowns of motors and a manager can supervise the quality of the production process.

5. Summarizing Samples in R Programming Language

When you have repeated measurements, you generally want to summarize the data with measures such as the average. R provides a variety of commands that operate on samples. These samples of data might be individual vectors, or they may be columns in a data frame or part of a matrix or list.
For example, suppose a survey is conducted to find the average weight of people living in a country. As it is not possible to weigh every person in the country, sample data from a few thousand individuals is collected. The average weight of the people in the sample is expected to be close to the average weight of the entire population of that country.
A variety of simple summary statistics can be applied to a vector of numbers. The two kinds of summary commands are:

  • Commands for Single Value Results – Produce single value as a result.
  • Commands for Multiple Value Result – Produce multiple results as an output.

6. Summary Commands with Single Value Results in R

Many summary commands produce a single value as output. A few of them are listed below:

  • max(x, na.rm = FALSE) – Shows the maximum value. By default, NA values are not removed. If there are NA values, NA is returned unless na.rm = TRUE is used.
  • min(x, na.rm = FALSE) – Shows the minimum value in a vector. If there are NA values, NA is returned unless na.rm = TRUE is used.
  • length(x) – Gives the length of the vector, including NA values. The na.rm = instruction does not work with this command.
  • sum(x, na.rm = FALSE) – Shows the sum of the vector elements
  • mean(x, na.rm = FALSE) – Shows the arithmetic mean
  • median( x, na.rm = FALSE) – Shows the median value of the vector
  • sd(x, na.rm = FALSE) – Shows the standard deviation
  • var(x, na.rm = FALSE) – Shows the variance
  • mad(x, na.rm = FALSE) – Shows the median absolute deviation

Various commands operate on a vector of values to return a simple result; however, if NA items are present, the final value will also be NA. For most commands, you can ensure that any NA items are ignored by adding the na.rm = TRUE instruction to the command. You then get a “proper” result.
Note: Many summarizing commands use the na.rm instruction to drop NA items from the summary; however, this is not universal. The length() command, for example, does not use na.rm.
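The effect of na.rm can be seen on a short vector with one missing value. The vector x below is an invented example, not data from this tutorial.

```r
x <- c(4, 8, NA, 15, 16)    # illustrative vector with one missing value

max(x)                      # NA, because an NA is present
max(x, na.rm = TRUE)        # 16
min(x, na.rm = TRUE)        # 4
sum(x, na.rm = TRUE)        # 43
mean(x, na.rm = TRUE)       # 10.75
median(x, na.rm = TRUE)     # 11.5
sd(x, na.rm = TRUE)         # standard deviation of the four known values

length(x)                   # 5, the NA is counted
sum(!is.na(x))              # 4, the number of non-missing values
```

Since length() has no na.rm instruction, counting the non-missing values with sum(!is.na(x)) is one common workaround.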

7. R Summary Commands Producing Multiple Results

We have seen commands that produce a single value. Let us now see commands that produce several values.
A few such commands are listed below:

  • log(dataset) – Shows the log value of each element.
  • summary(dataset) – As seen earlier, shows a summary of the dataset such as the maximum value, minimum value, and mean.
  • quantile() – Shows the quantiles; by default the 0%, 25%, 50%, 75%, and 100% quantiles. You can select other quantiles as well.

The quantile() command produces multiple results by default. You can alter the default to produce quantiles for a single probability or several (in any order). The names of the selected quantiles are displayed as percentage labels; you can suppress them with the names = FALSE instruction. If the data contains NA items, you must remove them using the na.rm = TRUE instruction; otherwise, you get an error message.
The command allows other instructions as follows:

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE)

x in the command is the data object you wish to examine.
The probs = instruction enables you to select one or several quantiles to display, defaulting to 0, 0.25, and so on. This is what the seq(0, 1, 0.25) command does: it sets a start of 0, an end of 1, and a step of 0.25, which is the same as c(0, 0.25, 0.5, 0.75, 1). The names = instruction tells R whether it should display the names of the quantiles produced. With this, we now move on to R Cumulative Statistics.
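The instructions described above can be tried on a short numeric vector. The values here are illustrative, and quantile() is left at its default calculation method.

```r
x <- c(3, 5, 7, 5, 3, 2, 6)

quantile(x)                                        # 0%, 25%, 50%, 75%, 100%
quantile(x, probs = 0.9)                           # a single quantile
quantile(x, probs = c(0.75, 0.1))                  # several, in any order
quantile(x, probs = c(0.25, 0.75), names = FALSE)  # no percentage labels

# With an NA present, na.rm = TRUE is needed to avoid an error.
x_na <- c(x, NA)
quantile(x_na, na.rm = TRUE)
```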

8. R Cumulative Statistics

Cumulative statistics in R are applied sequentially to a series of values. For example, they can be used to track the interest received on an investment.
When the data consists of interest payments received, the cumulative sum is a running total of the interest received up to each payment. The commands that calculate cumulative statistics are of two types.

  • Simple cumulative commands – Need only the name of the object.
  • Complex cumulative commands – Should be used in combination with other commands to produce more useful results.

9. Simple Cumulative Commands in R

These commands need only the name of the object. Cumulative commands produce an accurate result when applied to a vector of numeric data. If applied to character data, however, the result is populated with NA items rather than meaningful values.
If a numeric vector contains NA, the cumulative command works up to the first NA and returns NA for every position from there on.
Below are some commands that return cumulative values:

  • cumsum(x) – The cumulative sum of a vector
  • cummax(x) – The cumulative maximum value
  • cummin(x) – The cumulative minimum value
  • cumprod(x) – The cumulative product

Let us see this with an example:
If data2 is a vector as below:
[1] 3 5 7 5 3 2 6
If you want to find the cumulative sum, it can be done as below:

> cumsum(data2)

This gives output as below:
[1] 3 8 15 20 23 25 31
Let us look at a data sample that includes NA items:

> dat.na

[1] 2 5 4 NA 7 NA

> cumprod(dat.na)

It gives output as below:
[1] 2 10 40 NA NA NA
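The remaining simple cumulative commands behave the same way. A brief sketch using the same two vectors as above:

```r
data2  <- c(3, 5, 7, 5, 3, 2, 6)
dat.na <- c(2, 5, 4, NA, 7, NA)

cummax(data2)    # 3 5 7 7 7 7 7
cummin(data2)    # 3 3 3 3 3 2 2
cumprod(data2)   # 3 15 105 525 1575 3150 18900

# Everything from the first NA onwards becomes NA.
cumsum(dat.na)   # 2 7 11 NA NA NA
```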
Now let's jump directly to R complex cumulative commands in this Descriptive Statistics in R Tutorial.

10. R Complex Cumulative Commands

Cumulative commands can be combined with other commands to produce additional useful results, for example, the running mean.
The basic arithmetic mean is the sum divided by the number of observations. To obtain the running mean, you therefore need the cumulative sum and the cumulative number of observations.
The seq() command, whose main purpose is to generate sequences of values, can ease cumulative calculations: it can create an index running from 1 to the number of values in a sample.
Let us see the use of the seq() command on data2 above:

> seq(along = data2)

[1] 1 2 3 4 5 6 7
The same result can be generated using the seq_along() command, but in a faster manner.
We can combine the cumsum() and seq() commands as below:

> cumsum(data2) / seq(along = data2)

It gives output as:
[1] 3.000000 4.000000 5.000000 5.000000 4.600000 4.166667 4.428571
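The running mean can also be written with seq_along(). This is a minimal sketch; the name running_mean is chosen here for illustration.

```r
data2 <- c(3, 5, 7, 5, 3, 2, 6)

# Cumulative sum divided by the cumulative number of observations.
running_mean <- cumsum(data2) / seq_along(data2)
round(running_mean, 3)

# The last running mean is the ordinary mean of the whole vector.
all.equal(running_mean[length(data2)], mean(data2))
```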

11. Descriptive Statistics in R for Data Frames

Summarizing a single vector of data is a simple and straightforward process: you can apply the summarizing command directly. More complicated data objects are more demanding and require some workarounds.
A few generic commands for data frames are listed below:

  • max(frame) – Returns the largest value in the entire data frame
  • min(frame) – Returns the smallest value in the entire data frame
  • sum(frame) – Returns the sum of the entire data frame
  • fivenum(frame) – Returns the Tukey summary values for the entire data frame
  • length(frame) – Returns the number of columns in the data frame
  • summary(frame) – Returns the summary for each column

You can also extract a single vector (a column) from your data frame and summarize it. This approach will not work for the rows of a data frame.
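The generic commands can be tried on a small all-numeric data frame. The object frame and its values below are invented for illustration.

```r
# Illustrative data frame with two numeric columns.
frame <- data.frame(a = c(1, 4, 9), b = c(2, 6, 3))

max(frame)        # 9, largest value in the whole data frame
min(frame)        # 1, smallest value in the whole data frame
sum(frame)        # 25, sum over every column
length(frame)     # 2, the number of columns
summary(frame)    # one summary per column

# Summarizing a single extracted column.
mean(frame$b)     # mean of column b
```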

12. Special Summary Commands in R

There are two types of special summary commands:

  • Row summary commands – Work with row data. The two commands here are rowMeans() and rowSums().
  • Column summary commands – Work with column data. The two commands here are colMeans() and colSums().

Now let's move ahead with R Row Summary Commands in this Descriptive Statistics in R Tutorial.

13. R Row Summary Commands

The row summary commands in R work with row data.
The rowMeans() command gives the mean of the values in each row, while the rowSums() command gives the sum of the values in each row.

> rowMeans(fw)         # issues the rowMeans() command

Taw    Torridge    Ouse      # row names
5.5     14.0       10.0      # mean of row values

> rowSums(fw)

Taw    Torridge    Ouse
11      28         20          # sum of row values
In the above example, each row has a row name, which is displayed. If there were no row names, the result would be displayed as a simple vector of values, as below:

[1] 5.5 14.0 10.0

14. Column Summary Commands in R

These R commands work with column data.

> colMeans(fw)         # issues the colMeans() command

len      sp       alg      # column names
5.5     14.0      10.0     # mean of column values

> colSums(fw)

len     sp      alg
11      28      20       # sum of column values
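The fw object itself is not listed in this tutorial, so the sketch below uses a hypothetical stand-in called fw_demo with invented values to show all four commands on one data frame.

```r
# Hypothetical data: two measurements for each of three rivers.
fw_demo <- data.frame(
  count = c(9, 25, 15),
  speed = c(2, 3, 5),
  row.names = c("Taw", "Torridge", "Ouse")
)

rowMeans(fw_demo)   # Taw 5.5, Torridge 14, Ouse 10
rowSums(fw_demo)    # Taw 11, Torridge 28, Ouse 20
colMeans(fw_demo)   # one mean per column
colSums(fw_demo)    # count 49, speed 10
```

The row commands return one value per row, labelled with the row names; the column commands return one value per column, labelled with the column names.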

15. The apply() Command in R for Summaries

The colMeans() and rowSums() commands are quick alternatives to the more general apply() command.
The apply() command enables applying a function to the rows or columns of a matrix or data frame. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. The general form of the command is:

apply(X, MARGIN, FUN, ...)

X specifies the matrix or data frame.
The MARGIN argument is either 1 or 2, where 1 is for rows and 2 is for columns. You replace the FUN part with the function you want to apply.
You can also add additional instructions if they are appropriate to the function you are applying. For example, you might add the na.rm = TRUE instruction as follows:

> apply(fw, 1, mean, na.rm = TRUE)

The output of the preceding command:
Taw  Torridge  Ouse   Exe…
5.5    14.0      10.0      5.5…
The above case displays the row names of the original data frame.
If the data frame has no row names set, the result is a plain vector of values, as given below:
[1] 20 21 22 23 21 21 19 16 16 21 21 26 21 20 19 18 17 19 21 21 22……
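A minimal sketch of apply() with an extra instruction follows. The matrix m and its values are invented, and one NA is included deliberately to show what na.rm = TRUE changes.

```r
# Illustrative matrix, filled column by column, with one missing value.
m <- matrix(
  c(9, 25, NA, 2, 3, 5),
  nrow = 3,
  dimnames = list(c("Taw", "Torridge", "Ouse"), c("count", "speed"))
)

apply(m, 1, mean)                  # row means; Ouse is NA
apply(m, 1, mean, na.rm = TRUE)    # Taw 5.5, Torridge 14, Ouse 5
apply(m, 2, max, na.rm = TRUE)     # column maxima: count 25, speed 5

# apply() with mean over rows matches rowMeans().
all.equal(apply(m, 1, mean, na.rm = TRUE), rowMeans(m, na.rm = TRUE))
```

Unlike rowMeans() and colMeans(), apply() accepts any function, which is why max could be used for the columns here.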
Next in Descriptive Statistics in R is Summary for Matrix Objects in R.

16. Descriptive Statistics in R for Matrix Objects

A matrix may look like a data frame, but it is not one. In a matrix object, the data is displayed in rows and columns although it is stored as a single vector.
The following example shows a matrix made up of numeric values relating to observations of common British birds in various habitats:
> bird

              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

With a data frame, you can use $ to extract data, but you cannot extract parts of a matrix using $. You can use square brackets to retrieve any row or column:

> mean(bird[,2])
[1] 5.333333
> mean(bird[2,])
[1] 5.8

The first example returns the mean of the second column, while the next returns the mean of the second row. The colMeans() and rowSums() commands used earlier also work on matrices.
The apply() command works equally well for a matrix as it does for data frame objects. An example of using the apply() command on a matrix is as follows:
> apply(bird, 2, median)
Garden   Hedgerow   Parkland   Pasture   Woodland
32.5     3.0       7.0        1.0       1.0

In this case, we extract the median values for the columns of the matrix. The result can also be restricted to specific elements.
To do so, append square brackets after the command. For example:
> apply(bird, 1, median)[1:2]                      # displays only the first and second items
Blackbird Chaffinch
       10         3
> apply(bird, 1, median)[c(1, 2, 4)]               # selects the first, second and fourth items
    Blackbird     Chaffinch House Sparrow
           10             3             8
> apply(bird, 1, median)[c(1, 2, 'Robin')]         # implies that you cannot mix numbers and text
 <NA>  <NA> Robin
   NA    NA     2
> apply(bird, 1, median)[c('Blackbird', 'Robin')]  # selects the results by row name
Blackbird     Robin
       10         2
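To reproduce the matrix examples, the bird object can be rebuilt as sketched below. The counts are read off the table shown earlier; building it with matrix() and byrow = TRUE is an assumption about how such an object might be created, not necessarily how the original was made.

```r
# Rebuild the bird matrix row by row from the table above.
bird <- matrix(
  c(47, 10, 40, 2, 2,
    19,  3,  5, 0, 2,
    50,  0, 10, 7, 0,
    46, 16,  8, 4, 0,
     9,  3,  0, 0, 2,
     4,  0,  6, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Blackbird", "Chaffinch", "Great Tit",
      "House Sparrow", "Robin", "Song Thrush"),
    c("Garden", "Hedgerow", "Parkland", "Pasture", "Woodland")
  )
)

mean(bird[, 2])                                   # mean of the Hedgerow column
mean(bird[2, ])                                   # mean of the Chaffinch row
apply(bird, 2, median)                            # column medians
apply(bird, 1, median)[c("Blackbird", "Robin")]   # row medians by name
```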


17. Descriptive Statistics in R for Lists

List objects do not work in the same manner as matrix or data frame objects. The summary commands mentioned above fail to work on them, so a different approach is required. The following example shows a list made up of two numeric vectors of unequal length:

> grass.l
$mow
[1] 12 15 17 11 15
$unmow
[1] 8 9 7 9

To find out how many elements are in the list, you can use the length() command as follows:
> length(grass.l)
[1] 2

To manipulate each element of the list, you can use the $ syntax as follows:
> mean(grass.l$mow)
[1] 14
> max(grass.l$unmow)
[1] 9

Using the $ syntax is tedious for more than one or two elements. Instead, you can use a special version of the apply command that works specifically on list objects.
The lapply() command stands for “list apply” and is very easy to use: you name the list and the function you want to apply to each list element, as shown in the following example:
> lapply(grass.l, mean, na.rm = TRUE)
$mow
[1] 14
$unmow
[1] 8.25

Here the extra instruction na.rm = TRUE ensures that NA items are removed before the mean() command is applied. The result is a list, similar to the original object. To produce more compact output, you can use the sapply() command as follows:
> sapply(grass.l, mean, na.rm = TRUE)
mow unmow
14.00 8.25

These two functions (lapply and sapply) work in a similar way, traversing a set of data such as a list or vector and calling the specified function for each item. The difference is that sapply() simplifies its result to a vector or matrix where possible. This makes further manipulation easier, because a vector or matrix is a bit easier to deal with than a list.
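The difference in return type can be checked directly. This sketch rebuilds the grass.l list from the values shown above; the names res_list, res_vec, and res_mat are chosen here for illustration.

```r
grass.l <- list(mow = c(12, 15, 17, 11, 15), unmow = c(8, 9, 7, 9))

res_list <- lapply(grass.l, mean)   # a list with elements mow and unmow
res_vec  <- sapply(grass.l, mean)   # a named numeric vector: 14.00 and 8.25

is.list(res_list)                   # TRUE
is.list(res_vec)                    # FALSE

# When the function returns more than one value per element,
# sapply() simplifies the result to a matrix.
res_mat <- sapply(grass.l, range)   # rows: min and max; columns: mow, unmow
res_mat
```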

18. Conclusion – Descriptive Statistics in R

Hence, in this tutorial on R Descriptive Statistics, we discussed the meaning of Descriptive Statistics in R. We also learned about the different summary commands and how they apply to data frames, matrices, and lists in R. Still, if you have any doubt regarding Descriptive Statistics in R, ask in the comment tab.
