Chi-Square Test in R With Example

1. Objective 

In this R tutorial, we discuss the chi-square test in R. The test takes a few parameters that are necessary to understand, so we explain each of them and work through an example.

2. Introduction to Chi-Square Test in R

The chi-square test is a statistical method used to determine whether two categorical variables are significantly associated. Both variables must come from the same population, and each must be categorical, for example Male/Female, Red/Green, or Yes/No.
For example:
We can build a dataset of observations on people’s cake-buying patterns and test whether a person’s gender is associated with the cake flavor they prefer. If an association is found, we can plan an appropriate stock of flavors based on the gender mix of the people visiting.
chisq.test() is the function used to perform the test.
Syntax of a chi-square test:
chisq.test(data)
Following is the description of the chi-square test parameters:

  • data is the data in the form of a table containing the counts of the variables in the observations.
  • We use the chisq.test function from R’s built-in stats package to perform the chi-square test of independence. When called this way, the function takes the contingency table as a matrix. Depending on the original form of the data, this may need an extra step: either combining vectors into a matrix or cross-tabulating the counts among factors in a data frame.
  • We use read.table and as.matrix to read a table as a matrix. When doing this, watch out for extra spaces at the end of lines and for extraneous characters in the table, as these can cause errors.
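The two preparation routes described above (combining vectors into a matrix, or cross-tabulating factors in a data frame) can be sketched as follows. This is a minimal illustration, not the tutorial's original code: the counts are taken from the drug-treatment example used later on, and the names `counts`, `patients`, and `crosstab` are made up for this sketch.

```r
# Counts from the drug-treatment example (35/15 treated, 26/29 not treated).
treated     <- c(improved = 35, not_improved = 15)
not_treated <- c(improved = 26, not_improved = 29)

# Route 1: combine count vectors into a matrix, one row per group.
counts <- rbind(treated, not_treated)

# Route 2: cross-tabulate two factors stored in a data frame
# (one row per patient; this data frame is rebuilt here for illustration).
patients <- data.frame(
  treatment   = rep(c("treated", "not-treated"), times = c(50, 55)),
  improvement = c(rep(c("improved", "not-improved"), times = c(35, 15)),
                  rep(c("improved", "not-improved"), times = c(26, 29)))
)
crosstab <- table(patients$treatment, patients$improvement)

# Either object can be passed to chisq.test() as the contingency table.
res_matrix <- chisq.test(counts, correct = FALSE)
res_table  <- chisq.test(crosstab, correct = FALSE)
print(res_matrix)
```

Both calls describe the same table, so they should report the same statistic; only the row order and labels differ.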

We will run a chi-squared test in R and learn to interpret the results. Finally, we will solve a mini-challenge and discuss the answer.

  • Background Knowledge
  • Case Study – Effectiveness of a drug treatment
  • Purpose and math of Chi-Square statistic
  • Chi-Square Test
  • R Code
  • Mini-Challenge

a. Background knowledge – Predictive modeling

Before running a chi-squared test, it helps to understand how predictive modeling works and where the test fits in the process.
Predictive modeling is a technique where we use statistical modeling or machine learning algorithms to predict a response variable based on one or more predictors. The predictors are features that influence the response in some way. Models work best when the features are meaningful, that is, when they have a significant relationship with the response. A chi-squared test is one way to check for such a relationship when both the feature and the response are categorical.

b. Hypothetical Example: Effectiveness of a Drug Treatment

To test the effectiveness of a drug for a certain medical condition, we will consider a hypothetical case.
Suppose we have 105 patients under study, and 50 of them were treated with the drug. The remaining 55 patients were kept as control samples. The health condition of all patients was checked after a week.
The following table shows whether their condition improved or not. By looking at it, can you tell if the drug had a positive effect on the patients?

              improved  not-improved  total
treated             35            15     50
not-treated         26            29     55
total               61            44    105

Here we can see that 35 out of the 50 treated patients showed improvement. If the drug had no effect, the 50 would have split in the same proportion as the patients who were not given the treatment. In this case, improvement is higher in the treated group: about 70% of treated patients improved, compared with about 47% (26 out of 55) in the control group.
Since both categorical variables have only 2 levels, it is fairly intuitive to say that the drug treatment and the health condition are dependent.
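The comparison of proportions in this example can be checked with a short R sketch. The matrix is typed in by hand from the counts in the example (35 and 15 for treated patients, 26 and 29 for the control group); the object names are illustrative only.

```r
# Observed counts from the hypothetical drug study.
counts <- matrix(c(35, 15,
                   26, 29),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(treatment   = c("treated", "not-treated"),
                                 improvement = c("improved", "not-improved")))

# Row-wise proportions: the share of each group that did or did not improve.
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

`margin = 1` normalises within each row, so each group's proportions sum to 1 and the two groups can be compared directly.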

c. Chi-Squared Statistic

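The chi-squared statistic measures how far the observed counts are from the counts we would expect if the two variables were independent. Under independence, the expected value of each cell is its row total times its column total, divided by the grand total.
For example:
The first cell (treated and improved) takes the value 50 times 61 divided by 105, which equals about 29.05.
All the expected values can be computed this way.
Once that is done, the chi-squared statistic is computed as follows.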
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
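As a rough sketch, the expected counts under independence (row total times column total, divided by the grand total) can be computed in R with `outer()`. The observed matrix below is entered by hand from the example's counts, and the variable names are illustrative.

```r
# Observed counts from the drug-treatment example.
observed <- matrix(c(35, 15,
                     26, 29),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("treated", "not-treated"),
                                   c("improved", "not-improved")))

# Expected count per cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# chisq.test() stores the same values in the $expected component of its result.
builtin_expected <- chisq.test(observed, correct = FALSE)$expected
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on the calculation.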

d. Numeric Computation

$$\chi^2 = \frac{(35-29.048)^2}{29.048} + \frac{(15-20.952)^2}{20.952} + \frac{(26-31.952)^2}{31.952} + \frac{(29-23.048)^2}{23.048} \approx 5.56$$
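This value grows as the difference between the observed and expected values widens.
It also tends to be larger when the variables have more categories, because more cells contribute to the sum. This is why the statistic is judged against a chi-squared distribution whose degrees of freedom depend on the table size: (rows - 1) × (columns - 1).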
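The numeric computation can be reproduced in a few lines of R. This is a hand-rolled sketch of the formula with illustrative variable names; it is not how `chisq.test()` is implemented internally.

```r
# Observed counts from the drug-treatment example.
observed <- matrix(c(35, 15,
                     26, 29),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum over all cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

print(round(chi_sq, 4))
print(dof)
```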

e. Chi-Squared Test

In this test, we look at the p-value. Like all statistical tests, the chi-squared test has a null hypothesis and an alternative hypothesis.
We reject the null hypothesis if the p-value in the result is less than a predetermined significance level, which is usually 0.05.
H0: The two variables are independent.
H1: The two variables are related to each other.
In other words, the null hypothesis of the chi-squared test is that the two variables are independent.
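The decision rule can be sketched with base R's chi-squared distribution functions. The statistic 5.5569 and the single degree of freedom are taken from the drug-treatment example; `alpha` and the other names are illustrative.

```r
chi_sq <- 5.5569   # statistic from the drug-treatment example
dof    <- 1        # (2 - 1) * (2 - 1) for a 2 x 2 table
alpha  <- 0.05     # conventional significance level

# p-value: probability of a statistic at least this large if H0 is true.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Equivalent view: compare the statistic with the critical value at alpha.
critical_value <- qchisq(1 - alpha, df = dof)

reject_h0 <- p_value < alpha
print(p_value)
print(critical_value)
print(reject_h0)
```

Comparing the p-value with `alpha` and comparing the statistic with the critical value are two ways of stating the same rule, so they always lead to the same decision.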

f. R Code

We will run a chi-squared test in R on the treatment (X) and improvement (Y) columns in treatment.csv.
First, read in the treatment.csv data.
# Read the data; adjust the path to wherever treatment.csv is stored
df <- read.csv("treatment.csv")
# Cross-tabulate treatment against improvement
table(df$treatment, df$improvement)
              improved not-improved
  not-treated       26           29
  treated           35           15
Let’s do the chi-squared test using the chisq.test() function. It takes the two vectors as the input. We also set `correct=FALSE` to turn off Yates’ continuity correction.
# Chi-sq test
chisq.test(df$treatment, df$improvement, correct=FALSE)
Pearson’s Chi-squared test
data: df$treatment and df$improvement
X-squared = 5.5569, df = 1, p-value = 0.01841
We have a chi-squared value of about 5.56. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis and conclude that the two variables are in fact dependent. Sweet!
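To see what the `correct` argument changes, the following sketch runs the test with and without Yates’ continuity correction. Since treatment.csv is not reproduced here, the table is rebuilt by hand from the counts shown above; the object names are illustrative.

```r
# Contingency table rebuilt from the counts printed by table().
observed <- matrix(c(26, 29,
                     35, 15),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("not-treated", "treated"),
                                   c("improved", "not-improved")))

# Without the continuity correction (matches the output above).
uncorrected <- chisq.test(observed, correct = FALSE)

# With the correction, which is the default for 2 x 2 tables.
corrected <- chisq.test(observed)

print(uncorrected$statistic)
print(corrected$statistic)
print(c(uncorrected = uncorrected$p.value, corrected = corrected$p.value))
```

The correction shrinks each |O - E| by 0.5 before squaring, so the corrected statistic is smaller and its p-value larger. The correction only applies to 2 x 2 tables.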

g. Mini-Challenge

For this challenge, find out whether the ‘cyl’ and ‘carb’ variables in the ‘mtcars’ dataset are dependent or not.
Let’s have a look at the table of mtcars$carb vs mtcars$cyl.
table(mtcars$carb, mtcars$cyl)
     4  6  8
  1  5  2  0
  2  6  0  4
  3  0  0  3
  4  0  4  6
  6  0  1  0
  8  0  0  1
Since there are more levels, it is hard to tell by eye whether the two variables are related. Let’s use the chi-squared test instead.
# Chi-sq test
chisq.test(mtcars$carb, mtcars$cyl)
Pearson’s Chi-squared test
data: mtcars$carb and mtcars$cyl
X-squared = 24.389, df = 10, p-value = 0.006632
We have a high chi-squared value and a p-value below the 0.05 significance level, so we reject the null hypothesis and conclude that carb and cyl have a significant relationship.
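As a follow-up sketch, the result object returned by `chisq.test()` can be inspected to see how well the chi-squared approximation fits this table. With only 32 cars spread over 18 cells, the expected counts are small, and R warns that the approximation may be incorrect. The warning is suppressed below only to keep the output tidy.

```r
# Cross-tabulate carburetors against cylinders (mtcars ships with R).
tab <- table(mtcars$carb, mtcars$cyl)

# Run the test; R warns here because the expected counts are small.
res <- suppressWarnings(chisq.test(tab))

# Components of the result: statistic, degrees of freedom, p-value.
print(res$statistic)
print(res$parameter)
print(res$p.value)

# Expected counts under independence, and how many fall below 5.
print(round(res$expected, 2))
low_cells <- sum(res$expected < 5)
print(low_cells)
```

When many expected counts fall below 5, the p-value should be read with some caution. One option is the `simulate.p.value = TRUE` argument of `chisq.test()`, which estimates the p-value by Monte Carlo simulation instead.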

3. Conclusion

We have studied the chi-square test and its parameters in detail, with examples. The worked examples above should help you apply the chi-square test to real-life problems.
