Graphical Data Analysis With R Programming Language

1. Objective – R Graphical Data Analysis

This blog introduces graphical data analysis with R in detail. We discuss the types of plots you can draw in R, including histograms, index plots, time series plots, and pie charts. We also look at saving graphics to a file in R and at selecting the appropriate graph.

So, let’s start Graphical Data Analysis with R.

2. What is Graphical Data Analysis With R?

Much of statistical analysis is based on numerical techniques, such as confidence intervals, hypothesis testing, regression analysis, and so on. In many cases, these techniques rest on assumptions about the data being used. One way to check whether data conform to these assumptions is graphical data analysis with R, as a graph can provide many insights into the properties of the plotted dataset.
Graphs are also useful for non-numerical data, such as colors, flavors, and brand names. When numerical measures are difficult or impossible to compute, graphs play an important role.
One aim of statistical computing in R is to produce high-quality graphics.
The various types of plots drawn in R are:

  • Plots with single variables – You can plot a graph for a single variable.
  • Plots with multiple variables – You can plot a graph with multiple variables.
  • Special plots – R has both low-level and high-level graphics facilities for building these.

i. Plots for a Single Variable

You may need to plot a single variable in graphical data analysis with R, for example, the daily sales values of a particular product over a period of time. You can also plot a time series of month-by-month sales.
The choice of plots is more restricted when you have just one variable to plot. R offers the following plotting functions for single variables:

  • hist(y) – Histograms to show a frequency distribution
  • plot(y) – Index plots to show the values of y in sequence
  • plot.ts(y) – Time series plots
  • pie(x) – Compositional plots like pie diagrams
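
As a rough sketch of the four functions listed above, the following draws each plot type side by side. The data (`y` and `shares`) are made up purely for illustration and do not come from this tutorial.

```r
# Synthetic data, invented for this illustration only
set.seed(1)
y <- rnorm(100, mean = 50, sd = 10)            # 100 made-up measurements
shares <- c(arts = 30, science = 45, commerce = 25)

op <- par(mfrow = c(2, 2))                     # 2 x 2 grid of panels
hist(y, main = "Histogram")                    # frequency distribution
plot(y, main = "Index plot")                   # values of y in sequence
plot.ts(y, main = "Time series plot")          # points joined in order
pie(shares, main = "Pie chart")                # proportions of a whole
par(op)                                        # restore previous settings
```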

The types of plots available in R are:

  • Histograms – Used to display the mode, spread, and symmetry of a set of data.
  • Index Plots – Here, the plot function takes a single argument. This kind of plot is especially useful for error checking.
  • Time Series Plots – When a time series is complete (it has no missing values), the time series plot can be used to join the dots in an ordered set of y values.
  • Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.

A common mistake among beginners is to confuse histograms and bar charts. Histograms have the response variable on the x-axis, and the y-axis shows the frequency of different values of the response. In contrast, a bar chart has the response variable on the y-axis and a categorical explanatory variable on the x-axis. Let us now study Histograms for performing Graphical analysis with R.

a. Histograms

Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is used to plot histograms.
The x-axis is divided into intervals, called bins, into which the values of the response variable are distributed and then counted. Histograms are tricky because the picture you see depends on subjective judgments about exactly where to put the bin margins. Wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion.
Narrow bins tend to produce multimodality (several peaks), whereas broad bins tend to produce unimodality (a single peak). When the bin widths differ, the default in R is to convert the counts into densities.
The convention adopted in R for showing bin boundaries is to employ square and round brackets, so that:

  • [a,b) means ‘greater than or equal to a but less than b’ (square then round)
  • (a,b] means ‘greater than a but less than or equal to b’ (round then square)

You need to take care that the bins can accommodate both your minimum and maximum values.
The cut() function takes a continuous vector and cuts it up into bins that can then be used for counting.
By default, the hist() function in R does not take your advice about the number or width of bars: a single number passed as breaks is treated only as a suggestion. To control the bins exactly, supply a vector of break points. Doing so helps when you want to view several histograms over the same range at once. For small integer data, you can have one bin for each value.
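
The points about explicit bins and cut() can be sketched as follows. This is a minimal illustration on made-up data; the names `brk`, `h`, `bins`, and `counts` are chosen here and are not part of the tutorial.

```r
set.seed(42)
y <- runif(200, min = 0, max = 10)     # made-up continuous data

# Explicit break points: eleven boundaries give ten bins of width 1
brk <- 0:10
h <- hist(y, breaks = brk, main = "Bins of width 1")

# cut() assigns each value to a bin; right = FALSE gives [a,b) intervals
bins <- cut(y, breaks = brk, right = FALSE)
counts <- table(bins)                  # number of values in each bin
levels(bins)[1]                        # label of the first bin: "[0,1)"
```

Supplying the same `brk` vector to several hist() calls keeps their x-axes and bins comparable.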
In R's negative binomial functions (for example dnbinom), the parameter k is known as size and the mean is known as mu.
Drawing histograms of continuous variables is a more challenging task than drawing histograms of integer data. The problem is one of density estimation, which is an important issue for statisticians. To deal with it, you can approximate the continuous model by a discrete one, using a linear approximation to evaluate the density at specified points.
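The choice of bandwidth is a compromise between smoothing enough to remove insignificant bumps and not smoothing so much that real peaks disappear. In R, the density() function chooses a default bandwidth, which you can override through its bw argument.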
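
To see the bandwidth compromise in practice, here is a small sketch using density() on a made-up two-peaked sample. The bandwidth values 0.1 and 3 are arbitrary choices for illustration, not recommendations.

```r
set.seed(7)
# Made-up sample with two real peaks, near 0 and near 5
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

d_default <- density(y)              # bandwidth chosen by R's default rule
d_narrow  <- density(y, bw = 0.1)    # small bandwidth: many spurious bumps
d_wide    <- density(y, bw = 3)      # large bandwidth: real peaks smoothed away

hist(y, freq = FALSE, main = "Effect of bandwidth")
lines(d_default, col = "black")
lines(d_narrow,  col = "red")
lines(d_wide,    col = "blue")
```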

b. Index Plots

Index plots can be used for plotting single samples. Here the plot function takes a single argument, a continuous variable, and plots its values on the y-axis, with the x coordinate determined by the position of each number in the vector. Index plots are especially useful for error checking.
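
A short sketch of error checking with an index plot follows. The data and the planted mistake at position 27 are invented for illustration, and the threshold of 20 is simply read off the plot by eye.

```r
set.seed(3)
y <- rnorm(50, mean = 10, sd = 1)   # made-up measurements around 10
y[27] <- 100                        # simulated typing error (100 instead of 10.0)

plot(y)                             # the mistake stands out far above the rest
suspect <- which(y > 20)            # threshold chosen by eye from the plot
suspect                             # position of the suspicious value: 27
```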

c. Time Series Plot

When a time series is complete, the time series plot simply joins the dots in an ordered set of y values. Issues arise when there are missing values in the time series (e.g., if sales values for two months are missing during the last five years), and particularly when there are groups of missing values (e.g., if sales values for two quarters are missing during the last five years), because we typically know nothing about the behavior of the time series during those periods.
ts.plot and plot.ts are the two functions for plotting time series data in R.
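
The following sketch shows how missing values appear in a time series plot. The monthly sales figures are simulated, and the start date of January 2019 is an arbitrary assumption.

```r
set.seed(5)
# Made-up monthly sales for five years, starting January 2019
sales <- ts(round(100 + cumsum(rnorm(60, mean = 1, sd = 5))),
            start = c(2019, 1), frequency = 12)

sales[c(20, 21)] <- NA          # simulate two missing months

# plot() on a ts object uses plot.ts; the line is broken at the missing values
plot(sales, ylab = "Sales")
```

The gap in the line is a useful reminder that nothing is known about the series during those months.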

d. Pie Chart

You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the function pie takes a vector of numbers and turns them into proportions. It then divides the circle on the basis of those proportions.
To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of character strings, here called data$names.
If the names contain blank spaces, read.table with its default settings will split them into separate fields. Instead, you can save the file called piedata as a comma-delimited file, with a “.csv” extension, and input the data to R using read.csv in place of read.table:

data <- read.csv("c:\\temp\\piedata.csv")
data

The pie chart can be created, using the following command:
pie(data$amounts,labels=as.character(data$names))
Note: The color of the segments can also be changed, using the col argument of pie.
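
If you do not have the piedata file, the same steps can be tried on a small data frame built by hand. The names and amounts below are made up for illustration and are not the contents of the original file.

```r
# Made-up data standing in for the piedata file
data <- data.frame(
  names   = c("Rent and utilities", "Food", "Travel", "Other"),
  amounts = c(50, 25, 15, 10)
)

# Labels containing spaces are fine here; col sets one color per segment
pie(data$amounts,
    labels = as.character(data$names),
    col = c("tomato", "gold", "skyblue", "grey80"))

# The proportions that pie() derives from the amounts
props <- data$amounts / sum(data$amounts)
```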

ii. Plots with Two Variables

Two types of variables are used in graphical data analysis with R:

  • Response variable
  • Explanatory variable

The response variable is represented on the y-axis and the explanatory variable is represented on the x-axis. The nature of the explanatory variable determines the kind of plot produced. When the explanatory variable is a continuous variable, such as length or weight or altitude, the appropriate plot to use is a scatterplot.
When the explanatory variable is categorical, like genotype or color or gender, the appropriate plot is either a box-and-whisker plot or a barplot.
A box-and-whisker plot is a graphical means of representing sets of numeric data using quartiles; it is based on the minimum and maximum values, the median, and the upper and lower quartiles.
A barplot represents data as bars whose heights are the values being compared.
The most frequently used plotting functions for two variables in R are:

  • plot(x, y): Scatterplot of y against x
  • plot(factor, y): Box-and-whisker plot of y at each factor level
  • barplot(y): Heights from a vector of y values (one bar per factor level)
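
The three calls above can be compared on one made-up dataset. In this sketch, `x`, `y`, and `group` are hypothetical variables invented for the example.

```r
set.seed(11)
x <- runif(60, min = 0, max = 10)                   # continuous explanatory variable
y <- 2 + 0.5 * x + rnorm(60)                        # response variable
group <- factor(rep(c("A", "B", "C"), each = 20))   # categorical explanatory variable

op <- par(mfrow = c(1, 3))
plot(x, y)                         # continuous x: scatterplot
plot(group, y)                     # factor first: box-and-whisker plot
barplot(tapply(y, group, mean))    # one bar per factor level (group means)
par(op)
```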

The types of plots available in R are:

  • Scatterplots – Used when the explanatory variable is a continuous variable.
  • Stepped Lines – Used to plot data distinctly and provide a clear view.
  • Boxplots – Boxplots show the location and spread of data and indicate skewness.
  • Barplots – Barplots show the heights of the mean values from the different treatments.

a. Scatterplots

A scatterplot gives a graphical representation of the relationship between two sets of numbers. The plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines to an existing plot by using the functions points and lines.
The points and lines functions can be specified in the following two ways:

  • Cartesian plot(x, y) – A Cartesian coordinate specifies the location of a point in a two-dimensional plane with the help of two perpendicular lines known as axes. The origin of the Cartesian coordinate system is the point where the two axes cross, and its location is (0,0).
  • Formula plot(y ~ x) – The formula-based plot represents the relationship between variables in graphical form. For example, the equation y = mx + c describes a straight line in the Cartesian coordinate system.

The advantage of the formula-based plot is that the plot function and the model fit look and feel the same. Cartesian plots are specified as ‘x then y’, while the model fit uses ‘y then x’.
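
Here is a brief sketch of the two styles and of how the formula style matches a model fit. The data are simulated, and fitting a straight line with lm is used only as an example of a model that takes the same formula.

```r
set.seed(2)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2)   # made-up linear relationship with noise

plot(x, y)                 # Cartesian style: x then y
plot(y ~ x)                # formula style: y then x, draws the same picture

model <- lm(y ~ x)         # the model fit uses the same formula
abline(model, col = "red") # add the fitted line to the current plot
```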
The plot function uses the following arguments:

  • The name of the explanatory variable
  • The name of the response variable

The syntax for the plot function is plot(x, y). The data you want to plot are read into R from a file, as shown in the following commands:

data1 <- read.table("c:\\temp\\scatter1.txt", header=TRUE)
attach(data1)
names(data1)
[1] "x1" "y1"

To produce the scatterplot, type the following command:
plot(x1, y1, col="red")
Unless you supply explicit labels, the variable names are used to label the axes. You can use the command below to change the x-axis label from x1 to the longer ‘Explanatory variable’ and the y-axis label from y1 to ‘Response variable’:
plot(x1, y1, col="red", xlab="Explanatory variable", ylab="Response variable")
The argument pch refers to the plotting character or plotting symbol. Varying pch adds variety to scatterplots.
As the value of pch changes, the plotting character also changes. There are 256 different plotting symbols in R (0 to 255). A graphic showing all of them in sequence, from bottom left to top right, can be built as follows:
plot(0:10, 0:10, xlim=c(0,32), ylim=c(0,40), type="n", xaxt="n", yaxt="n", xlab="", ylab="")
x <- seq(1, 31, 2)
s <- -16
f <- -1
for (y in seq(2, 40, 2.5)) {
  s <- s + 16                               # first pch value in this row
  f <- f + 16                               # last pch value in this row
  y2 <- rep(y, 16)
  points(x, y2, pch=s:f, cex=0.7)           # draw the 16 symbols of this row
  text(x, y-1, as.character(s:f), cex=0.6)  # pch number beneath each symbol
}

In the resulting graphic, the bottom two rows show the basic plotting symbols (pch), with their pch number immediately beneath. The default value, pch=1, is a small open circle in black. Note that values from 26 to 32 are not implemented at present. Values for pch between 33 and 127 represent the ASCII character set, while values between 128 and 255 are the symbols from the Windows character set.
The symbols for pch=19 and pch=20 are solid circles of different sizes. The difference between pch=16 and pch=19 is that the latter uses a border, so it is larger when the line width lwd is large relative to the character expansion cex. The symbol for pch=46 is the dot.
The plotting symbols (pch) numbered 21–25 allow you to specify the background color and the border color separately. Sometimes you might need to add labels to the data; for example, you can add labels to a graph of country-wise savings data so that each point displays the country name.
You can also add text to scatterplots in R. For example, to add the text ‘(b)’ to a plot at the location x = 80 and y = 65, just type text(80, 65, "(b)").
The best way to identify multiple individuals in scatterplots is to use a combination of colors and symbols. A useful tip is to use as.numeric to convert a grouping factor into a color and/or a symbol.
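
The as.numeric tip can be sketched with the iris data frame that ships with R. The dataset is chosen here only for convenience and is not used elsewhere in this tutorial.

```r
# Convert the grouping factor to the numbers 1, 2, 3
grp <- as.numeric(iris$Species)

plot(iris$Petal.Length, iris$Petal.Width,
     col = grp,            # one color per species
     pch = 15 + grp,       # one symbol per species (pch 16, 17, 18)
     xlab = "Petal length", ylab = "Petal width")

legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 16:18)
text(6, 0.5, "(b)")        # add a text label at x = 6, y = 0.5
```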

b. Stepped Lines

Stepped lines can be plotted as graphical displays in R. These plots show each data value distinctly and give a clear view of the differences between successive values.
When plotting square edges between two points, you need to decide whether to go across and then up, or up and then across. Let’s assume that we have two vectors, each running from 0 to 10:

x <- 0:10
y <- 0:10
plot(x, y)

There are three ways to join the dots:

  • With a straight line by using the following command:

lines(x, y, col="red")

  • With a stepped line going first across, then up by using the lower-case ‘s’, as shown in the following command:

lines(x, y, col="blue", type="s")

  • With a stepped line going up first, then across by using the upper-case ‘S’, as shown in the following command:

lines(x, y, col="green", type="S")

c. Box and Whisker Plot

A box-and-whisker plot is a graphical means of representing sets of numeric data using quartiles. It is based on the minimum and maximum values, the median, and the upper and lower quartiles.
Boxplots summarize the information available. The vertical dashed lines are called the ‘whiskers’. Boxplots are also excellent for spotting errors in data, which show up as extreme outliers.
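
As a small sketch of error spotting with a boxplot, the example below plants one impossible value in made-up data. The variable names `yield` and `treat` are hypothetical.

```r
set.seed(9)
# Made-up yields for three treatments of 20 plants each
yield <- c(rnorm(20, mean = 10, sd = 1),
           rnorm(20, mean = 12, sd = 1),
           rnorm(20, mean = 11, sd = 3))
treat <- factor(rep(c("A", "B", "C"), each = 20))
yield[5] <- 35                       # simulated data-entry error in treatment A

b <- boxplot(yield ~ treat, xlab = "Treatment", ylab = "Yield")
b$out                                # values drawn beyond the whiskers; includes 35
```

The object returned by boxplot also holds the five summary numbers for each group in `b$stats`.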

d. Barplot

A barplot is an alternative to a boxplot for showing the heights of the mean values from the different treatments. The function tapply computes the heights of the bars by working out the mean value for each level of the categorical explanatory variable.
Let’s take an example to illustrate barplots. The data for this example are from an experiment on plant competition, with five factor levels in a single categorical variable called clipping: a control (unclipped), two root clipping treatments (r5 and r10), and two shoot clipping treatments (n25 and n50) in which the leaves of neighboring plants were reduced by 25% and 50%. The response variable is yield at maturity (a dry weight) called biomass, as shown in the following commands:

trial <- read.table("c:\\temp\\compexpt.txt", header=TRUE)
attach(trial)
names(trial)
[1] "biomass" "clipping"

First, calculate the heights of the bars by using tapply to compute the five mean values:
means <- tapply(biomass, clipping, mean)
Then the barplot is produced very simply:
barplot(means, xlab="treatment", ylab="mean yield", col="green")
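
If you want to try this without the compexpt.txt file, the sketch below simulates data with the same two column names and also adds error bars of one standard error. All numbers are invented, and drawing the error bars with arrows is one possible approach, not necessarily the one used for the original example.

```r
set.seed(4)
# Simulated stand-in for the experiment: 6 plants per treatment (made-up values)
clipping <- factor(rep(c("control", "n25", "n50", "r10", "r5"), each = 6))
biomass  <- rnorm(30, mean = rep(c(450, 550, 570, 610, 600), each = 6), sd = 40)

means <- tapply(biomass, clipping, mean)                              # bar heights
se    <- tapply(biomass, clipping, function(v) sd(v) / sqrt(length(v)))  # standard errors

# barplot() returns the x positions of the bar centers
mids <- barplot(means, ylim = c(0, max(means + se) * 1.1),
                xlab = "treatment", ylab = "mean yield", col = "green")

# Flat-headed arrows at both ends act as error bars
arrows(mids, means - se, mids, means + se, angle = 90, code = 3, length = 0.1)
```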

iii. Plots with Multiple Variables

Initial data inspection using plots is even more important when there are many variables, any one of which might have mistakes or omissions. The principal plot functions that represent multiple variables are:

  • The pairs function – For a matrix of scatterplots of every variable against every other.
  • The coplot function – For conditioning plots where y is plotted against x for different values of z.

It is better to use more specialized commands when dealing with the rows and columns of data frames.

a. The pairs Function

For two or more continuous explanatory variables, it is valuable to check for subtle dependencies between the explanatory variables. In the resulting matrix of panels, the variable named in a row is plotted on the y-axis and the variable named in a column is plotted on the x-axis.
The pairs function plots every variable in the data frame on the y-axis against every other variable on the x-axis. It needs only the name of the whole data frame as its first argument.
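
A minimal sketch of pairs, using the trees data frame that ships with R (chosen here only because it has three continuous columns):

```r
# trees has three continuous variables: Girth, Height, Volume
pairs(trees)                 # every variable plotted against every other

# A numerical companion to the picture: pairwise correlations
cors <- cor(trees)
round(cors, 2)
```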

b. The coplot Function

The relationship between two variables may be obscured by the effects of other processes in multivariate data. When you draw a two-dimensional plot of y against x, all the effects of the other explanatory variables are projected onto the plane of the paper. In the simplest case, we have one response variable and just two explanatory variables.
The coplot panels are ordered from lower left to upper right, and they correspond to the values of the conditioning variable shown in the upper panel from left to right.
The main difficulty in reading a coplot concerns the ‘shingles’ shown in the upper margin. The overlap between the shingles shows the extent of overlap between one panel and the next in terms of the number of data points they have in common.
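
The following sketch draws a conditioning plot with the built-in quakes data frame, which is an assumption made here for convenience. The choice of four panels and an overlap of 0.5 is arbitrary.

```r
# Latitude against longitude, conditioned on earthquake depth
coplot(lat ~ long | depth, data = quakes, number = 4, overlap = 0.5)

# The depth intervals (shingles) used for the four panels
shingles <- co.intervals(quakes$depth, number = 4, overlap = 0.5)
shingles                     # one row per panel: lower and upper depth limit
```

Comparing adjacent rows of `shingles` shows how much the depth ranges of neighboring panels overlap.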

3. Special Plots in Graphical Data Analysis With R

R has extensive facilities for producing graphs, with both low-level and high-level graphics facilities to suit the requirement.
The low-level graphics functions are the basic building blocks that let you build up graphs step by step, while the high-level functions provide a variety of pre-assembled graphical displays.
Apart from the various kinds of graphical plots discussed, R supports the following special plots:

  • Design plots – Effect sizes in designed experiments can be visualized using design plots. You can draw a design plot with the plot.design function, for example plot.design(Growth.rate~Water*Detergent*Daphnia).
  • Bubble plots – Useful for illustrating the variation in a third variable across different locations in the x–y plane.
  • Plots with many identical values – Sometimes, two or more points with count data fall in exactly the same location in a scatterplot. As a result, the repeated values of y are hidden, one beneath the other.
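
Two of these special plots can be sketched with base R functions. The use of symbols for the bubble plot, and of jitter and sunflowerplot for repeated values, is a choice made for this illustration; all data are made up.

```r
set.seed(8)
# Bubble plot: circle size reflects a third variable z
x <- runif(15, 0, 10)
y <- runif(15, 0, 10)
z <- runif(15, 1, 5)
symbols(x, y, circles = sqrt(z), inches = 0.25, bg = "lightblue")  # area grows with z

# Repeated values: nine points that occupy only three distinct locations
cx <- rep(1:3, times = c(5, 3, 1))
cy <- rep(2, 9)
plot(jitter(cx), jitter(cy))   # small random noise separates the overlapping points
sunflowerplot(cx, cy)          # one petal per repeated observation
```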

4. Adding Other Shapes to a Plot

Using the following functions, we can add extra graphical objects to plots:

  • rect – For plotting rectangles – rect(xleft, ybottom, xright, ytop)

Using the locator function, we can obtain the coordinates of the corners of the rectangle by clicking on the plot. However, the rect function does not accept the output of locator directly as its argument, so the coordinates must be extracted and passed individually.

  • arrows – For plotting arrows and headed bars – The arrows function draws a line from the point (x0, y0) to the point (x1, y1) with the arrowhead, by default, at the “second” end (x1, y1).

arrows(x0, y0, x1, y1)

Adding code=3 produces a double-headed arrow, for example a horizontal one from (1,9) to (5,9):

arrows(1,9,5,9,code=3)

  • polygon – For plotting more complicated filled shapes, including objects with curved sides.

To draw a polygon in R, save the coordinates of six points in an object called locations by clicking on the plot after running the following command:

locations <- locator(6)

Now you can draw a lavender-colored polygon by using the following command:

polygon(locations,col="lavender")
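
Because locator needs mouse clicks, the three functions can also be tried non-interactively. In this sketch the coordinates are typed in by hand, and the list `locations` mimics the x and y components that locator would return.

```r
# Empty plotting region to draw on
plot(0:10, 0:10, type = "n", xlab = "", ylab = "")

rect(6, 6, 9, 9)                      # rectangle: xleft, ybottom, xright, ytop
arrows(1, 9, 5, 9, code = 3)          # horizontal double-headed arrow
arrows(1, 1, 4, 4)                    # default: arrowhead at the second end (4, 4)

# Hand-typed coordinates of six corners, in the same form locator(6) returns
locations <- list(x = c(2, 4, 6, 6, 4, 2),
                  y = c(4, 2, 3, 5, 6, 5))
polygon(locations, col = "lavender")  # filled six-sided shape
```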

5. Saving Graphics to File in R

You are likely to want to save each of your plots as a PDF or PostScript file for publication-quality graphics. This is done by specifying the ‘device’ before plotting, then turning the device off once finished.
The computer screen is the default device, where we can obtain a rough copy of the graph. The data are first read in using the following commands:

data <- read.table("c:\\temp\\pollute.txt", header=TRUE)
attach(data)

There are numerous options for the pdf and postscript functions, but width and height are the ones you are likely to want to change most often. The sizes are in inches. You can specify any nondefault arguments that you want to change using the functions pdf.options and ps.options before you invoke either pdf or postscript.
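
The device workflow described above can be sketched as follows. The file name scatter.pdf, the temporary folder, and the plotted numbers are assumptions made for this example.

```r
# Write to a temporary folder so the example does not clutter your working directory
out <- file.path(tempdir(), "scatter.pdf")

pdf(out, width = 7, height = 4)       # open the PDF device; sizes are in inches
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")
dev.off()                             # close the device so the file is written

file.exists(out)                      # confirm that the file was created
```

The postscript function is used in the same way: open the device, draw the plot, then call dev.off().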

6. Selecting an Appropriate Graph

You have learned about different types of graphs and how to draw them in R. It is also important to select an appropriate type of graph according to the requirements.
Some common graphs and their uses are as follows:

  • Line Graph – It displays change over a period of time. It can track records over both long and short periods, according to requirements. When the changes are small, a line graph is more suitable than a bar graph. Line graphs can also compare the changes among different groups over the same period.
  • Pie Chart – It displays comparison within a group. For example, you can compare students in a college on the basis of their streams, such as arts, science, and commerce, using a pie chart. You cannot use a pie chart to show changes over time.
  • Bar Graph – Similar to a line graph, the bar graph generally compares different groups or tracks changes over a defined period of time. The difference between the two graphs is that the line graph tracks small changes while the bar graph tracks large changes.
  • Area Graph – The area graph tracks the changes over a specific time period for one or more groups related to a similar category.
  • X-Y Plot – The X-Y plot displays the relationship between two variables. In this type of plot, the X-axis measures one variable and the Y-axis measures another. If the values of both variables increase together, a positive relationship exists between them. If the value of one variable decreases as the value of the other increases, a negative relationship exists between them. It is also possible that the two variables have no relationship at all, in which case the plot shows no pattern.

So, this was all in Graphical Data Analysis with R. Hope you like our explanation.

7. Conclusion – Graphical Data Analysis with R

Hence, in this tutorial of Graphical Data Analysis with R, we discussed the meaning of graphical data analysis. Moreover, we looked at plots and at saving graphics to file in R. Also, we learned how to select an appropriate graph. Still, if you have any query related to Graphical Data Analysis with R, feel free to share it with us. We will be happy to help.