Python Statistics – Python p-Value, Correlation, T-test, KS Test

Master Python with 70+ Hands-on Projects and Get Job-ready - Learn Python

1. Python Statistics

In this Python Statistics tutorial, we will learn how to calculate the p-value and Correlation in Python. Moreover, we will discuss T-test and KS Test with example and code in Python Statistics.

So, let’s start the Python Statistics Tutorial.

Python Statistics - p-Value, Correlation, T-test, KS Test

Python Statistics – p-Value, Correlation, T-test, KS Test

2. p-value in Python Statistics

When talking statistics, a p-value for a statistical model is the probability that when the null hypothesis is true, the statistical summary is equal to or greater than the actual observed results. This is also termed ‘probability value’ or ‘asymptotic significance’.

Do you know about Python Decorators

The null hypothesis states that two measured phenomena experience no relationship to each other. We denote this as H or H0. One such null hypothesis can be that the number of hours spent in the office affects the amount of salary paid. For a significance level of 5%, if the p-value falls lower than 5%, the null hypothesis is invalidated. Then it is discovered that the number of hours you spend in your office will not affect the amount of salary you will take home. Note that p-values can range from 0% to 100% and we write them in decimals. A p-value for 5% will be 0.05.

A smaller p-value bears more significance as it can tell you that the hypothesis may not explain the observation fairly. If one or more of these probabilities turn out to be less than or equal to α, the level of significance, we reject the null hypothesis. For a true null hypothesis, p can take on any value between 0 and 1 with equal likeliness. For a true alternative hypothesis, p-values likely fall closer to 0.

Python Statistics

Python p-Value

3. T-test in Python Statistics

Let’s talk about T-tests. Such a test tells us whether a sample of numeric data strays or differs significantly from the population. It also talks about two samples- whether they’re different. In other words, it gives us the probability of difference between populations. The test involves a t-statistic. For small samples, we can use a T-test with two samples.

Python statistics

Python Statistics – Python T-test

Let’s discuss Python Generators

a. One-sample T-test with Python

Let’s try this on a single sample. The test will tell us whether means of the sample and the population are different. Consider the voting populace in India and in Gujarat. Does the average age of Gujarati voters differ from that of the population? Let’s find out.

>>> import numpy as np
>>> import pandas as pd
>>> import scipy.stats as stats
>>> import matplotlib.pyplot as plt
>>> import math
>>> np.random.seed(6)
>>> population_ages1=stats.poisson.rvs(loc=18,mu=35,size=150000)
>>> population_ages2=stats.poisson.rvs(loc=18,mu=10,size=100000)
>>> population_ages=np.concatenate((population_ages1,population_ages2))
>>> gujarat_ages1=stats.poisson.rvs(loc=18,mu=30,size=30)
>>> gujarat_ages2=stats.poisson.rvs(loc=18,mu=10,size=20)
>>> gujarat_ages=np.concatenate((gujarat_ages1,gujarat_ages2))
>>> population_ages.mean()


>>> gujarat_ages.mean()


>>> stats.ttest_1samp(a=gujarat_ages,popmean=population_ages.mean())

Ttest_1sampResult(statistic=-2.5742714883655027, pvalue=0.013118685425061678)
Now this value of -2.574 tells us how aberrant the sample mean is from the null hypothesis.

b. Two-sample T-test With Python

Such a test tells us whether two data samples have different means. Here, we take the null hypothesis that both groups have equal means. We don’t need a known population parameter for this.
Let’s revise Recursion in Python

>>> np.random.seed(12)
>>> maharashtra_ages1=stats.poisson.rvs(loc=18,mu=33,size=30)
>>> maharashtra_ages2=stats.poisson.rvs(loc=18,mu=13,size=20)
>>> maharashtra_ages=np.concatenate((maharashtra_ages1,maharashtra_ages2))
>>> maharashtra_ages.mean()


>>> stats.ttest_ind(a=gujarat_ages,b=maharashtra_ages,equal_var=False)

Ttest_indResult(statistic=-1.4415218453964938, pvalue=0.1526272389714945)
The value of 0.152 tells us there’s a 15.2% chance that the sample data is such far apart for two identical groups. This is greater than the 5% confidence level.

c. Paired T-test With Python

When you want to check how different samples from the same group are, you can go for a paired T-test. Let’s take an example.

>>> np.random.seed(11)
>>> before=stats.norm.rvs(scale=30,loc=250,size=100)
>>> after=before+stats.norm.rvs(scale=5,loc=-1.25,size=100)
>>> weight_df=pd.DataFrame({"weight_before":before,
>>> weight_df.describe()
Python Statistics

Paired Sample T-test

>>> stats.ttest_rel(a=before,b=after)

Ttest_relResult(statistic=2.5720175998568284, pvalue=0.011596444318439857)
So, we see we have just 1% chances to find such huge differences between samples.
Do you know about Python Errors

4. KS Test in Python Statistics

This is the Kolmogorov-Smirnov test. It lets us test the hypothesis that the sample is a part of the standard t-distribution. Let’s take an example.

>>> stats.kstest(x,'t',(10,))

KstestResult(statistic=0.023682909426459897, pvalue=0.6289865281325614)

>>> stats.kstest(x,'norm')

KstestResult(statistic=0.019334747291889, pvalue=0.8488119233062457)
Pay attention to the p-values in both cases.

a. Two samples

What we saw above was the KS test for one sample. Let’s try two.

>>> stats.ks_2samp(gujarat_ages,maharashtra_ages)

Ks_2sampResult(statistic=0.26, pvalue=0.056045859714424606)

5. Correlation in Python Statistics

This is a statistical relationship between two random variables (or bivariate data). This can be causal. It is a measure of how close two variables are to holding a linear relationship to each other. One such example will be the correlation between demand and supply for a product whose supply is limited.
Have a look at Exception Handling in Python
Correlation can denote a predictive relationship that we can exploit. To measure the degree of correlation, we can use constants like ρ or r. Benefits of correlation-

  • Predicting one quantity from another
  • Discovering the existence of a causal relationship
  • Foundation for other modeling techniques

a. Example of Correlation In Python

Let’s take an example.

>>> df=pd.read_csv('furniture.csv',index_col='Serial',parse_dates=True)
>>> df['Gross']=df.Cost+df.Cost*10
>>> df.describe()
Python Statistics

Python Statistics – Correlation With Python

>>> df.corr()
Python Statistics

Example of Python Correlation

This gives us how each column correlates to another. You can also calculate the covariance in the following way-

>>> df.cov()

               Cost          Gross
Cost    3.131608e+08   3.444769e+09
Gross   3.444769e+09   3.789246e+10

b. Plotting Correlation in Python

Let’s use seaborn to plot the correlation between columns of the ‘iris’ dataset.
Let’s revise Python Iterator

>>> import seaborn as sn
>>> df1=sn.load_dataset('iris')
>>> sn.pairplot(df,kind='scatter')

<seaborn.axisgrid.PairGrid object at 0x06294090>

Python Statistics

Plotting Correlation in Python

c. Saving the Results

We can export the result of a correlation as a CSV file.

>>> d=df1.corr()
>>> d.to_csv('iriscorrelation.csv')

This gives us the following CSV file-

Python Statistics

Python Statistics – Saving the Results

So, this was all in Python Statistics. Hope you like our explanation.

6. Conclusion: Python Statistics

Hence, in this Python Statistics tutorial, we discussed the p-value, T-test, correlation, and KS test with Python. To conclude, we’ll say that a p-value is a numerical measure that tells you whether the sample data falls consistently with the null hypothesis. Correlation is an interdependence of variable quantities. Still, if any doubt regarding Python Statistics, ask in the comment tab.
See also – 
Python Directory and Files 
For reference

If you are Happy with DataFlair, do not forget to make us happy with your positive feedback on Google


7 Responses

  1. Alkis Piskas says:

    This tutorial seems good but ‘minnesota_ages1’ and ‘minnesota_ages2’ are inexistent, so the code doesn’t work!

    • DataFlair Team says:

      Hi Alkis,
      Thanks for connecting us through this Python Statistics tutorial.
      Good catch, it should be gujarat_ages1 and gujarat_ages2. We have made the correction.

  2. vimal prakash says:

    >>> sn.pairplot(df,kind=’scatter’)
    seems it should be df1

  3. Ichsan says:

    Nice blog. I wanti to ask question. For example, we have code like this. I have 4 method to increase plant height.

    Now, we want to check method’s significancy that affect height of plant

    pair_t = dataframe.t_test_pairwise(‘C(Method)’)

    I got p>|t| and pvalue-hs. Please tell me the different

  4. Da says:

    Hi, thanks for sharing it’s great learning material! Wonder if you can share the sample data you used as example for t-test?

  5. Nidhi says:

    It’s good learning material . Thanks for sharing.
    I am Studying Data Science. It would be really helpful if you can share the datafiles also.

Leave a Reply

Your email address will not be published. Required fields are marked *