Hypothesis testing is a good first step toward really understanding how to use statistics.
The purpose of the test is to tell whether the difference between two data sets is statistically significant or could plausibly be down to chance.
Consider the following example:
Let’s say I am trying to decide between two computers. I want to use the computer to run advanced analytics, so the only thing I am concerned with is speed.
I pick a sorting algorithm and a large data set and run it on both computers 10 times, timing each run in seconds.
Now I put the results into two lists, a and b:
a = [10,12,9,11,11,12,9,11,9,9]
b = [13,11,9,12,12,11,12,12,10,11]
A quick look at the data makes me think b is slower than a. But is it slower by enough to mean something, or are these results just a matter of chance? In other words, if I ran the test 200 more times, would the two sets of results end up closer together or further apart?
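To put a number on that quick look, here is a minimal sketch that compares the average run time of each list. It uses only Python's standard statistics module; the variable names mean_a and mean_b are my own.

```python
from statistics import mean

# Run times in seconds, 10 runs per computer
a = [10, 12, 9, 11, 11, 12, 9, 11, 9, 9]
b = [13, 11, 9, 12, 12, 11, 12, 12, 10, 11]

mean_a = mean(a)
mean_b = mean(b)

print(f"mean a: {mean_a:.1f} s")                  # mean a: 10.3 s
print(f"mean b: {mean_b:.1f} s")                  # mean b: 11.3 s
print(f"difference: {mean_b - mean_a:.1f} s")     # difference: 1.0 s
```

On average b is about one second slower per run. Whether a one-second gap over 10 runs is meaningful is exactly what the hypothesis test is meant to answer.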
To find out, let’s do a hypothesis test.
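Set our hypotheses:
- H0 (null hypothesis): the mean run times of a and b are equal, so there is no significant difference between the data sets
- H1 (alternative hypothesis): the mean run times of a and b are not equal, so there is a significant difference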
To test our hypothesis, let’s run a t-test.
Import stats from scipy (from scipy import stats) and run stats.ttest_ind(a, b).
Our output is the t-statistic and the p-value.
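The test on a and b might look like the sketch below. It assumes SciPy is installed and relies on the default behavior of stats.ttest_ind, which is a two-sided test that treats the two samples as independent with equal variances. The names t_stat and p_value are my own.

```python
from scipy import stats

# Run times in seconds, 10 runs per computer
a = [10, 12, 9, 11, 11, 12, 9, 11, 9, 9]
b = [13, 11, 9, 12, 12, 11, 12, 12, 10, 11]

# Independent two-sample t-test (two-sided, equal variances assumed by default)
t_stat, p_value = stats.ttest_ind(a, b)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")
```

The t-statistic comes out negative here only because a, the first argument, has the smaller mean. The p-value is what we compare against our significance level.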
Our p-value is 0.08, which is greater than the common significance level of 0.05. Since it is greater, we cannot reject H0. This does not prove that both computers are the same speed; it means these 10 runs do not give us enough evidence to say their speeds differ.
Let’s try a third computer, d:
d = [13,12,9,12,12,13,12,13,10,11]
Now, let’s run a second t-test, this time comparing a and d. This one comes back with a p-value of 0.026, which is under 0.05. This means we can reject the null hypothesis that a and d have the same mean run time. The speed difference between a and d is statistically significant.
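Both comparisons can be run side by side with a small helper. This is only a sketch: is_significant and ALPHA are hypothetical names I made up for illustration, and the helper simply wraps stats.ttest_ind with its default settings and checks the p-value against 0.05.

```python
from scipy import stats

ALPHA = 0.05  # common significance level (an assumption; pick your own)

def is_significant(x, y, alpha=ALPHA):
    """Return (p_value, reject_null) for a two-sided independent t-test."""
    result = stats.ttest_ind(x, y)
    return result.pvalue, bool(result.pvalue < alpha)

# Run times in seconds, 10 runs per computer
a = [10, 12, 9, 11, 11, 12, 9, 11, 9, 9]
b = [13, 11, 9, 12, 12, 11, 12, 12, 10, 11]
d = [13, 12, 9, 12, 12, 13, 12, 13, 10, 11]

for name, other in (("b", b), ("d", d)):
    p_value, reject = is_significant(a, other)
    verdict = "reject H0" if reject else "cannot reject H0"
    print(f"a vs {name}: p = {p_value:.3f} -> {verdict}")
```

Keeping the significance level in one place makes it easy to see how the verdict would change with a stricter cutoff such as 0.01.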