There are many (many, many) different statistical tests available. Fortunately, there are also many resources, online or in textbooks, to help make the right choice. In this document, we will cover some of the most common tests, seen during the course, and highlight the questions that you need to ask about the data and the hypotheses that you are trying to test.
In this document, we will try to go through all the branches of this very large and confusing tree:
The main factors that determine the test to use are:
This last question will typically determine whether you use a parametric test (assumes a well-known distribution, most commonly normal) or a non-parametric test (makes few assumptions about the shape of the distribution, typically works on ranks rather than values).
The other questions can lead to many different branches, but let’s start with the first one: how many variables and/or samples? Another way to look at it is: do you want to determine something about one variable in isolation, or are you interested in the relationship between two or more variables (or one variable in several samples)?
The distinctions between multivariate and multi-sample settings, groups, paired and unpaired samples, etc. can be a little confusing, so let’s start by clarifying them.
Let’s first look at the simplest case: a single variable in one sample. Our dataset will therefore simply be a vector of size \(n\), with \(n\) the number of items in the sample. In that case, we will be looking at univariate tests.
When we start looking at relationships, things are a little bit more complicated, because we can often present the data in different ways.
Let’s say that we have a set of individuals (our sample) for which we measure two different things at the same time. In that case, we have two variables in a single sample, and the dataset will look something like this:
| Individual | X | Y |
|---|---|---|
| 1 | \(x_1\) | \(y_1\) |
| … | … | … |
| n | \(x_n\) | \(y_n\) |
We can see that each \(x_i\) is paired with a \(y_i\), so we can say that we are looking at paired data. Sometimes, however, these “pairings” appear in different ways. Let’s say for instance that, on some individuals, we are measuring one thing at different times. In this case, the data may look something like this:
| Individual | X(t=0) | X(t=1) |
|---|---|---|
| 1 | \(x_{1,0}\) | \(x_{1,1}\) |
| … | … | … |
| n | \(x_{n,0}\) | \(x_{n,1}\) |
In many cases, this will be presented as taking two samples of the same variable on the same individuals. A typical example would be taking a measure before and after treatment in a medical study.
In both cases, we are dealing with paired data, but in the first one we are not necessarily expecting \(X\) and \(Y\) to be similar in any way. In fact, \(X\) may be categorical and \(Y\) numerical, or they may have completely different scales. We will therefore be interested in finding dependence relationships between the two: is the value of \(X\) somehow a function of the value of \(Y\) (and vice-versa)?
In the second case, however, we will typically be trying to measure a change between the two measures, with the null hypothesis that the variable doesn’t change between the measures.
Now another scenario would be to measure one thing in individuals which belong to different groups. We would then have a dataset looking something like that:
| Individual | Group | X |
|---|---|---|
| 1 | A | \(x_1\) |
| 2 | B | \(x_2\) |
| … | … | … |
| n | A | \(x_n\) |
In this case, the two groups are unpaired: there is no one-to-one relationship between individuals in group A and group B, so we have to look at their distributions separately. This is equivalent to saying that we have two independent samples (A and B), each with (potentially) a different sample size, where we measure the same variable.
(Note that this is a special case of the “two variables, one sample” situation, where one variable is a categorical grouping variable.)
The bivariate tests typically have extensions to more samples/groups/variables: the multivariate tests.
When we are looking at a single variable in isolation, there are still several things that we may want to test: the distribution, the variability, and the central value.
These are tests done on two variables, which may come from the same sample or not, and may be paired or unpaired (see Samples, groups, variables).
Note that “independence” is used for different things: “independent vs dependent” variables in the sense of “having a statistical relationship or not”, or “independent vs paired” samples in the sense of “measured on independent individuals or not”.
For instance, let’s say that our samples are two groups (“Treatment” and “Control”) and a binary outcome (“Sick” or “Healthy”). Individuals (patients) are split into the “Treatment” and “Control” group independently, so each pair (Group, Outcome) represents a different individual. We can then look at the contingency table:
| Outcome \ Group | Treatment | Control |
|---|---|---|
| Sick | a | b |
| Healthy | c | d |
The underlying table of “raw” data could look something like this:
| Individual | Group | Outcome |
|---|---|---|
| 1 | Treatment | Healthy |
| … | … | … |
| n | Control | Sick |
Each variable is dichotomous (binary categories), but we have two independent (“unpaired”) groups, so we should use the Chi-square independence test, as seen in the bivariate unpaired section.
If, on the other hand, we have two samples, “Pre-Treatment” and “Post-Treatment”, measured on the same individuals, then our samples are paired. Then if we try to make a contingency table like in the previous case:
| Outcome \ Group | Pre-Treatment | Post-Treatment |
|---|---|---|
| Sick | a | b |
| Healthy | c | d |
We will have a problem, as individuals actually contribute to different cells (example: an unlucky individual that is healthy pre-treatment and sick post-treatment will be counted in c and b).
The “raw” data here would look something like:
| Individual | Pre-treatment | Post-treatment |
|---|---|---|
| 1 | Sick | Healthy |
| … | … | … |
| n | Sick | Sick |
The correct way to structure the contingency table is therefore:
| Pre \ Post | Sick | Healthy |
|---|---|---|
| Sick | a | b |
| Healthy | c | d |
In this table, each individual only appears once, depending on how their treatment affected their status. The McNemar test will be appropriate in this case.
If we are looking at two variables \(X, Y\) from the same sample, we will often be interested in the correlation between the variables.
Alternatively, if we are looking at the same variable under different conditions (but for the same individuals), we will usually be interested in determining if the variable changed. A common technique is to create a new random variable \(D = X_1 - X_2\), and test the null hypothesis that \(E[D] = 0\).
If we have one categorical grouping variable, we can split the data into groups that we will treat as independent samples. Depending on the number of groups and type of the other variable, we will then look at the appropriate bi- or multivariate test.
Here, we are in the case where individuals have been split into different groups, without one-to-one relationships between the groups. In that case, we will often want to look at the similarity or dissimilarity between the distributions, or between the central values of the two groups.
When moving to more than two variables (or groups, or samples), we have to be a little careful: it is tempting to just use the bivariate tests in a pairwise fashion, but this doesn’t account for the fact that the more tests you perform, the higher the likelihood of a false positive (Type I error). It is therefore better to use a test on all the samples at once, then use “post-hoc” pairwise tests to see which pairs of samples contribute to the significant differences in distributions.
Alternatively, when a multisample test is not applicable to our case, we can use “corrections” such as the Bonferroni procedure to account for this Type I error risk.
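As an illustration, here is a minimal sketch of the Bonferroni procedure on made-up data (three hypothetical groups compared with pairwise t-tests; both the data and the choice of pairwise test are only for the example): each pairwise p-value is multiplied by the number of comparisons (and capped at 1) before being compared to the significance threshold.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
# three hypothetical groups, the last one shifted
groups = [rng.normal(0, 1, 30), rng.normal(0, 1, 30), rng.normal(1, 1, 30)]

pairs = list(combinations(range(len(groups)), 2))
raw = np.array([stats.ttest_ind(groups[i], groups[j]).pvalue for i, j in pairs])
# Bonferroni: multiply each p-value by the number of comparisons, cap at 1
adjusted = np.minimum(raw * len(pairs), 1.0)
```

The adjusted p-values are then compared to the usual threshold (e.g. 0.05), which is equivalent to comparing the raw p-values to \(\alpha / m\) with \(m\) the number of comparisons.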
After using a multivariate/multisample test, we often want to determine which sample(s) contributed to the rejection of the null hypothesis (which sample is “significantly different” from the others). To do that, we can use post-hoc analysis.
Post-hoc tests are one-to-one tests which can be performed after the multisample test. We cannot use them directly without first checking, with the multisample test, that there is actually a significant effect to look for. Otherwise, we fall back into the problem of multiplying the chances of a Type I error.
For some tests, the post-hoc is one of the “two samples” tests (e.g., McNemar as a post-hoc for Cochran). For others, there are specific post-hoc tests that can be used (e.g., Nemenyi after Friedman). Those will be detailed with their respective tests below.
Let \(X\) be a random variable that can take \(r\) different discrete values (categories).
\(H_0\): \(X \sim D\) such that, under \(D\), we have a known probability distribution for the different categories. Example: \(P(X=i) = \frac{1}{r} \rightarrow\) equal probability of each category.
Test statistic: \(X^2 = \sum_{i=1}^r \frac{(n_i-e_i)^2}{e_i}\), where \(n_i\) is the observed number of values in category \(i\), and \(e_i\) the expected number under \(H_0\). Under \(H_0\), we have: \(X^2 \sim \chi_{r-1}^2\).
P-value: \(P(\chi^2_{r-1} \geq X^2)\).
Example with code:
Given observed values: ANTERIOR=12, POSTERIOR=17, LEFT=11, RIGHT=20. Total=60 \(\rightarrow\) expected values under \(H_0\) (equal distribution): \(e_i = 15 \forall i\).
Test statistic: \(X^2 = \frac{(12-15)^2}{15} + \frac{(17-15)^2}{15} + \frac{(11-15)^2}{15} + \frac{(20-15)^2}{15} = 3.6\)
P-value: \(P(\chi^2_{3} \geq 3.6) = 0.31 \rightarrow\) we cannot reject \(H_0\), it is possible that the tumours are evenly distributed.
```python
import numpy as np
from scipy import stats

observed = np.array([12, 17, 11, 20])
expected = np.array([15, 15, 15, 15])
# home-made version:
x2 = (((observed - expected)**2) / expected).sum()
pval = stats.chi2.sf(x2, df=3)
print(x2, pval)
# directly using Scipy's test:
res = stats.chisquare(observed, expected)
assert np.isclose(res.statistic, x2)
assert np.isclose(res.pvalue, pval)
```
Let \(X\) be a random variable which can only take two values, which we will label “SUCCESS” and “FAILURE”.
\(H_0\): \(P(SUCCESS) = p ; 0 \leq p \leq 1\).
Test statistic: number of successes \(k\) for a sample of size \(n\).
Under \(H_0\), we expect to see \(np\) successes. The test is therefore on the likelihood of observing a number of successes \(k\) at least as different from \(np\) as the one observed. This likelihood can be computed using the binomial distribution: \(P(\#SUCCESS = k) = \binom{n}{k}p^k(1-p)^{n-k}\).
Therefore, we have:
\(P(\#SUCCESS \leq k) = \sum_{i=0}^k \binom{n}{i}p^i(1-p)^{n-i} = pval_{l}\) (p-value for one-tailed test on the left)
\(P(\#SUCCESS \geq k) = \sum_{i=k}^n \binom{n}{i}p^i(1-p)^{n-i} = pval_{r}\) (p-value for one-tailed test on the right)
For the two-tailed test, \(pval = pval_{l} + pval_{r}\): the probability of a result at least as extreme as \(k\) on either side (note that \(P(\#SUCCESS = k)\) is counted in both terms, so this is a conservative approximation).
Example with code:
We have a quality control process where we expect that 0.5% of the pieces have a defect if the production line works properly. In a sample of 300 pieces, we observe that 4 pieces have a defect. Is our production line working properly?
We can formulate this in two different ways:
Let’s do the computation both ways, using a significance threshold of 0.05:
\(P(defect) = 0.005\)
\(P(\#defect \geq 4) = \sum_{i=4}^{300}\binom{300}{i}0.005^i(0.995)^{300-i} = 0.065 > 0.05\)
or
\(P(good) = 0.995\)
\(P(\#good \leq 296) = \sum_{i=0}^{296}\binom{300}{i}0.995^i(0.005)^{300-i} = 0.065 > 0.05\)
\(\rightarrow\) we cannot reject \(H_0\), the line may be working properly (but we should probably keep an eye on it to be sure…).
Code (for the first version):
```python
import numpy as np
from scipy import stats, special

# handmade
pd = 0.005
n = 300
k = 4
pval = sum(special.comb(n, i) * (pd**i) * ((1 - pd)**(n - i)) for i in range(k, n + 1))
# using scipy's test
res = stats.binomtest(k, n, p=pd, alternative='greater')
assert np.isclose(res.pvalue, pval)  # to accept rounding errors
```

Let \(X\), \(Y\) be two categorical random variables which can take respectively \(r\) and \(p\) values (or the same categorical random variable in two groups, in which case \(r=p\)).
The test starts by taking the contingency table, and computing the row totals and column totals (i.e. the marginal distributions for each variable):
| | \(Y_1\) | … | \(Y_j\) | … | \(Y_p\) | ROW TOTAL |
|---|---|---|---|---|---|---|
| \(X_1\) | | | | | | |
| … | | | | | | |
| \(X_i\) | | | \(n_{ij}\) | | | \(n_{i.}\) |
| … | | | | | | |
| \(X_r\) | | | | | | |
| COL TOTAL | | | \(n_{.j}\) | | | \(n\) |
\(H_0\): \(X\) and \(Y\) are independent.
If \(X\) and \(Y\) are independent, we would expect each row to have a similar distribution to the column totals (i.e. \(\frac{n_{ij}}{n_{i.}} \approx \frac{n_{.j}}{n}\)), and conversely for the columns / row totals.
In such a case, the expected values of each cell is \(e_{ij} = \frac{n_{i.}n_{.j}}{n}\).
(So that \(\frac{e_{ij}}{n_{i.}} = \frac{n_{i.}n_{.j}}{n_{i.}n} = \frac{n_{.j}}{n}\) and \(\frac{e_{ij}}{n_{.j}} = \frac{n_{i.}n_{.j}}{n_{.j}n} = \frac{n_{i.}}{n}\)).
Application conditions: \(e_{ij} < 5\) in at most 20% of the cells, and \(e_{ij} \neq 0\) for all \(i,j\). If those conditions are not met, we can use Fisher’s exact test (not covered in this document).
Test statistic: \(X^2 = \sum_{i=1}^{r}\sum_{j=1}^{p}\frac{(n_{ij}-e_{ij})^2}{e_{ij}}\)
Under \(H_0\), \(X^2 \sim \chi^2_{(r-1)(p-1)}\).
So we have: \(pval = P(\chi^2_{(r-1)(p-1)} \geq X^2_{observed})\).
Example with code:
We sample 500 people in the population and sort the individuals into three categories according to their age (SENIOR, ADULT, CHILD), and into four categories according to their eye color (BLUE, BROWN, GREEN, OTHER). We get the following contingency table, with the row and column totals computed:
| | SENIOR | ADULT | CHILD | ROW TOTAL |
|---|---|---|---|---|
| BLUE | 41 | 61 | 19 | 121 |
| BROWN | 65 | 109 | 54 | 228 |
| GREEN | 17 | 31 | 6 | 54 |
| OTHER | 27 | 50 | 20 | 97 |
| COL TOTAL | 150 | 251 | 99 | 500 |
From the row and column totals, we can compute the expected values under \(H_0\) (using, for instance, `np.outer`):

```python
import numpy as np

table = np.array([[41, 61, 19],
                  [65, 109, 54],
                  [17, 31, 6],
                  [27, 50, 20]])
n = table.sum()
rt = table.sum(axis=1)  # row totals
ct = table.sum(axis=0)  # column totals
e = np.outer(rt, ct) / n  # expected counts under independence
```

| E | SENIOR | ADULT | CHILD |
|---|---|---|---|
| BLUE | 36 | 61 | 24 |
| BROWN | 68 | 114 | 45 |
| GREEN | 16 | 27 | 11 |
| OTHER | 29 | 49 | 19 |
We can then compute the statistic and use `scipy.stats.chi2` to compute the p-value.

```python
from scipy import stats

X2 = (((table - e)**2) / e).sum()
df = (len(rt) - 1) * (len(ct) - 1)
p = stats.chi2(df=df).sf(X2)
```

Which gives us \(X2 = 6.68\), 6 degrees of freedom for the \(\chi^2\) distribution, and a p-value of \(0.35\), so we cannot reject the independence hypothesis.
We can also use the `scipy.stats.chi2_contingency` method to perform the test directly on the table:

```python
res = stats.chi2_contingency(table)
```

Which will give us the same results.
Let \(X\) be a numerical random variable, from which we draw a sample \(x\) of size \(n\). For the one-sample, one-variable version, we will want to compare it to a theoretical distribution \(\mathcal{D}\). For the two-samples (or two-variables) version, we will want to compare it to a sample from another numerical random variable \(Y\).
So \(H_0\) is that \(X \sim \mathcal{D}\) (or that \(X\) and \(Y\) follow the same distribution).
The general idea of the Kolmogorov-Smirnov test is to measure the maximum distance between the cumulative probability distributions of \(x\) and \(y\) (or \(x\) and \(\mathcal{D}\)).
Let’s first order the sample so that \(x_1 \leq \dots \leq x_n\). The empirical cumulative distribution function \(ECDF\) is defined as \(ECDF_x(x_i) = \frac{i}{n}\), and will typically have a “staircase” look:

For \(\mathcal{D}\), we can take the regular cumulative distribution function \(CDF(x) = \int_{-\infty}^{x} pdf_\mathcal{D}(t)\,dt\).
If we simply take the distance \(D = max_{i}(ECDF_x(x_i)-CDF_\mathcal{D}(x_i))\), we will however only look at the distance to the “top” of each step:

To also take into account the distance to the “bottom” of each step, we define two distance measures:
\(D^+ = \max_{i}\left(\frac{i}{n}-CDF_\mathcal{D}(x_i)\right)\)
\(D^- = \max_{i}\left(CDF_\mathcal{D}(x_i)-\frac{i-1}{n}\right)\)
So that \(D^+\) can be interpreted as “how far under the top of each step is the CDF of \(\mathcal{D}\)”, and \(D^-\) as “how far above the bottom of each step is the CDF of \(\mathcal{D}\)”.

For the two-sided test, the test statistic is \(C = max(D^+, D^-)\). For the one-sided test, it is either \(D^+\) or \(D^-\), depending on whether we want to look “above” or “below” the theoretical curve.
To compute the p-value, we can use `scipy.stats.ksone` to get the distribution of the test statistic in the one-sided case, and `scipy.stats.kstwo` in the two-sided case.
We can also directly use the `scipy.stats.kstest` method to perform the test (one or two samples, one- or two-sided).
Example with code
We take a sample of 50 individuals. Is their age distribution significantly different from the distribution in the world population?
Let’s assume that we have `W`, the distribution for the world population, as a `scipy.stats.rv_histogram` object built from some data, and `sample` the array of sampled values.
We can start by sorting the sample and computing the \(D^+\) and \(D^-\) values:
```python
sample.sort()
n = len(sample)
dplus = (np.arange(1.0, n + 1) / n - W.cdf(sample)).max()
dminus = (W.cdf(sample) - np.arange(0.0, n) / n).max()
```

For a two-sided test, our test statistic is therefore the maximum of these two values, and we can use `scipy.stats.kstwo` to compare it to the expected distribution and get the p-value.
```python
d = max(dplus, dminus)
pval = stats.kstwo(n).sf(d)
```

Which will give us `d=0.28` and `pval=0.0005`, indicating that our sample is not representative of the world population at all.

We can achieve the same results using `scipy.stats.kstest` (note that we pass `W.cdf`, the CDF of the theoretical distribution):

```python
res = stats.kstest(sample, W.cdf)
# check that we have the same results as before
assert np.isclose(pval, res.pvalue)
assert np.isclose(d, res.statistic)
```
While a QQ-plot provides a visual way of assessing the normality of a sample, it is sometimes useful to have a more formal test to apply. We could use the Kolmogorov-Smirnov test for that purpose, but Shapiro-Wilk offers a more powerful alternative.
Application condition: the sample size should not be too large (typically \(n < 2000\)), or we are almost guaranteed to get a very low p-value even for tiny deviations from normality.
Let \(X\) be a numerical random variable from which we draw a sample \(x\). The null hypothesis \(H_0\) is that \(X \sim N(\mu, \sigma^2)\), where \(\mu, \sigma\) don’t need to be known in advance.
In a QQ-plot for a normal distribution, we obtain a straight line with slope \(\sigma\) and intercept \(\mu\):

The general idea of the Shapiro-Wilk test is to compute the ratio between the (squared) estimated slope of the QQ-plot and the (squared) expected slope under \(H_0\), i.e. the variance.
The resulting statistic, \(W\), doesn’t follow a nice theoretical distribution, but its distribution has been approximated using Monte Carlo simulation. We can use `scipy.stats.shapiro` to get the statistic and the p-value.
Example using a sample of size 200 from a Student’s t distribution with added uniform noise:
```python
from scipy import stats

# generate noisy Student sample
x = stats.t(df=199).rvs(size=200) + stats.uniform(loc=-5, scale=10).rvs(size=200)
res = stats.shapiro(x)
```

This will give us a \(W\) of 0.98, close to 1… but not close enough for that sample size, as we have a p-value of 0.003 (up to random variation), indicating that it is indeed very unlikely to be a sample from a normal distribution.

Let \(X\) be a binary variable from which we draw two paired samples \(x\) and \(y\) of size \(n\). Paired samples means that there is a known relationship between \(x_i\) and \(y_i\) (e.g. repeated measures on the same individual).
McNemar tests if the two samples have the same marginal probabilities for each outcome. That is, if we look at the contingency table:
| X \ Y | Y=0 | Y=1 |
|---|---|---|
| X=0 | a | b |
| X=1 | c | d |
McNemar focuses on the discordant cases and states that, under \(H_0\), \(p_b = p_c\) (that is, we expect to see as many disagreements where \(x_i=0, y_i=1\) as disagreements where \(x_i=1, y_i=0\)).
The test statistic is \(X2 = \frac{(b-c)^2}{b+c}\) and, under \(H_0\), \(X2 \sim \chi^2_1\) (chi-square with one degree of freedom).
Application conditions: enough discordant cases (\(b+c \geq 25\) is sometimes used as a cutoff).
The p-value will be: \(P(\chi^2_1 \geq X2)\), the probability to have a statistic at least as great as \(X2\) if it follows a \(\chi^2_1\) distribution.
Example with code
We conduct a study where participants with motor difficulties are asked to perform a task with and without electrical stimulation from a new device. We measure for each patient if they can perform the task with and without the device.
The variable is therefore “Success” or “Failure” at the task, with the two groups being “with” and “without” device.
| With \ Without | Failure | Success |
|---|---|---|
| Failure | 67 | 11 |
| Success | 25 | 92 |
Does the device have an effect on the results of the participants? We can compute the test statistic: \(X2 = \frac{(11-25)^2}{11+25} = 5.44\).
We can then use `scipy.stats.chi2` to determine the p-value:

```python
from scipy import stats

X2 = (11 - 25)**2 / (11 + 25)  # test statistic computed above
pval = stats.chi2(df=1).sf(X2)
```

Which will give us a result of 0.02: we can reject the null hypothesis of no difference between the outcome probabilities in each group: the device is likely to have an effect.
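When the number of discordant pairs is small, an exact version of the test is often preferred: under \(H_0\), the discordant pairs split between cells \(b\) and \(c\) following a \(Binomial(b+c, 0.5)\) distribution, so we can reuse the binomial test. A sketch using the same counts as above:

```python
from scipy import stats

b, c = 11, 25  # discordant counts from the contingency table above
# exact McNemar: two-sided binomial test on the discordant pairs
res = stats.binomtest(min(b, c), n=b + c, p=0.5, alternative='two-sided')
```

Here the exact p-value also falls below 0.05, agreeing with the chi-square approximation.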
Let \(X\) be a normally distributed variable. Given a sample \(x\) of size \(n\) with sample variance \(s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}\), we can observe that \(\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}\): the sampling distribution of the variance is related to the chi-squared distribution, with \(n-1\) degrees of freedom.
Therefore, if \(H_0: \sigma^2 = \sigma_0^2\) (the variance of the population is equal to a certain value), we have the test statistic:
\(C^2 = \frac{(n-1)s^2}{\sigma_0^2} \sim \chi^2_{n-1}\)
For a one-sided test on the right, the p-value is given by \(p_{right} = P(\chi^2_{n-1} \geq C^2)\). For a test on the left, it is \(p_{left} = P(\chi^2_{n-1} \leq C^2)\).
For the two-sided test, as the \(\chi^2\) distribution is not symmetrical, the p-value is not quite as easy to compute. Typically, we will use \(pval = 2 \times min(p_{left}, p_{right})\). But it is better in that case to look at the acceptance zone for a given significance level, which is easier to compute. For \(\alpha = 0.05\), the bounds of the acceptance zone are \(x_{left}\) such that \(P(\chi^2_{n-1} \leq x_{left}) = 0.025\), and \(x_{right}\) such that \(P(\chi^2_{n-1} \geq x_{right}) = 0.025\). We cannot reject \(H_0\) if \(x_{left} \leq C^2 \leq x_{right}\).
Example with code
A sample of ten patients is taken from a population whose age distribution is expected to follow a normal with a standard deviation of 10. We get the following ages for the patients:
[72, 59, 64, 77, 73, 45, 64, 53, 53, 59]
Is the variance in this sample higher than the expected value?
We can compute the sample variance and the test statistic:

```python
import numpy as np

x = [72, 59, 64, 77, 73, 45, 64, 53, 53, 59]
n = len(x)
s2 = np.var(x, ddof=1)  # ddof=1 to use the "n-1" correction
c2 = (n - 1) * s2 / (10**2)
```

This gives us a value of `c2=9.2`.
We can get the p-value using `scipy.stats.chi2`. For a one-sided test on the right, we have:

```python
from scipy import stats

pval = stats.chi2(df=n-1).sf(c2)
```

Which gives us `pval=0.416`: we cannot reject \(H_0\).
If we were using a two-sided test, we could either use the p-value:

```python
pval = 2 * min(stats.chi2(df=n-1).sf(c2), stats.chi2(df=n-1).cdf(c2))
```

Which will give us twice the one-sided p-value, or 0.832. To be more accurate, however, we should compute the acceptance zone for a given significance level, e.g. 0.05:
```python
x_left = stats.chi2(df=n-1).ppf(0.025)
x_right = stats.chi2(df=n-1).isf(0.025)
accept = (x_left <= c2) and (c2 <= x_right)
```

This gives us an acceptance zone of \([2.7, 19.0]\). Our value of 9.2 clearly falls within those bounds, so we don’t reject \(H_0\).
Let \(X\) be a numerical random variable from which we draw a sample \(x\) of size \(n\). Student’s one-sample t-test is used to determine if the observed mean of a sample is compatible with the null hypothesis that \(E[X] = \mu_0\).
The test statistic is \(t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}\), which, under \(H_0\) and based on the sampling distribution of the mean, follows a \(t_{n-1}\) distribution (Student’s t distribution with \(n-1\) degrees of freedom).
For the two-sided test, the p-value is \(2 \times P(t_{n-1} \geq |t|)\). The one-sided tests’ p-values are given by \(P(t_{n-1} \leq t)\) for a test on the left, \(P(t_{n-1} \geq t)\) for a test on the right.
Example with code:
A sample of ten patients is taken from a population whose age distribution is expected to follow a normal with a population mean of 55. We get the following ages for the patients:
[72, 59, 64, 77, 73, 45, 64, 53, 53, 59]
Is the mean in this sample likely under our hypothesis? We can compute the test statistic and use `scipy.stats.t` to compute the p-value using Student’s t distribution.

```python
import numpy as np
from scipy import stats

x = [72, 59, 64, 77, 73, 45, 64, 53, 53, 59]
n = len(x)
tstat = (np.mean(x) - 55) / (np.std(x, ddof=1) / np.sqrt(n))
pval = stats.t(df=n-1).sf(np.abs(tstat)) * 2
```

This will give us `tstat=2.15` and `pval=0.06`:
using a significance threshold of 0.05, we cannot reject \(H_0\), the sample mean is not unexpected
under our null hypothesis.
We can also directly use `scipy.stats.ttest_1samp` to perform the test:

```python
res = stats.ttest_1samp(x, popmean=55)
```

Which will give us the same values.
Let \(x, y\) be two paired samples drawn from random variables \(X, Y \sim N\).
The null hypothesis \(H_0\) is that \(\mu_X = \mu_Y\), that is the population means are equal. We therefore have to test if the observed differences between the measurements in the samples are likely or not under \(H_0\).
Student’s paired-sample t-test is a test on the mean of the differences: we look at the new random variable \(A = X-Y\), which under \(H_0\) has \(\mu_A = 0\). Therefore, we take \(a = x-y = \{x_i-y_i\}_i\) as our new sample, and perform Student’s one-sample t-test on \(a\).
Test statistic: \(t = \frac{\bar{a}}{s_a/\sqrt{n}} \sim t_{n-1}\)
P-value: \(2 \times P(t_{n-1} \geq |t|)\) in the two-sided case, \(P(t_{n-1} \geq t)\) or \(P(t_{n-1} \leq t)\) in the one-sided cases.
Example with code
A sample of 20 patients has their systolic blood pressure taken before and after an experiment. The rounded values are:
Before: `[126, 103, 104, 99, 118, 86, 127, 102, 113, 107, 124, 89, 106, 106, 121, 99, 108, 101, 110, 115]`
After: `[127, 105, 102, 104, 117, 87, 130, 101, 113, 107, 127, 95, 108, 105, 124, 99, 110, 105, 110, 117]`
The researchers are wondering if the blood pressure is higher after the experiment than it was before. We can therefore perform a one-sided t-test by computing the itemwise difference between the two samples and the t statistic, then looking up the p-value using `scipy.stats.t`.

```python
import numpy as np
from scipy import stats

before = np.array([126, 103, 104, 99, 118, 86, 127, 102, 113, 107,
                   124, 89, 106, 106, 121, 99, 108, 101, 110, 115])
after = np.array([127, 105, 102, 104, 117, 87, 130, 101, 113, 107,
                  127, 95, 108, 105, 124, 99, 110, 105, 110, 117])
a = before - after
n = len(a)
tstat = np.mean(a) / (np.std(a, ddof=1) / np.sqrt(n))
pval = stats.t(df=n-1).cdf(tstat)  # one-sided test on the left
```

Giving us `tstat=-3.17` and `pval=0.003`: we should reject the hypothesis that the means are equal; the alternative that \(E[after] > E[before]\) is preferred.
We can directly use `scipy.stats.ttest_rel` for the same results.

```python
res = stats.ttest_rel(before, after, alternative='less')
```
Note that the test tells us nothing about the effect size. If we take:

```python
effect = np.mean(a)
```

We will get a result of -1.58, indicating that, on average, the systolic blood pressure before was 1.58 lower than after, which is a relatively small effect in practice, even if it is statistically significant.
With equal sample sizes
Let \(x, y\) be two independent samples of size \(n\) drawn from distributions \(X, Y\) assumed to be normal and of equal variances.
\(H_0: \mu_X = \mu_Y\). Since the samples are unpaired, we cannot use \(X-Y\) as a random variable. Rather, we will look at the difference of the means: \(\bar{X}-\bar{Y}\), which under \(H_0\) should be equal to 0. The variance equality assumption should be verified using, for instance, the Levene test.
The test statistic is \(t = \frac{\bar{x}-\bar{y}}{s\sqrt{\frac{2}{n}}}\), with \(s = \sqrt{\frac{s^2_x+s^2_y}{2}}\). It will follow a \(t_{2(n-1)}\) distribution (Student’s t distribution with \(2(n-1)\) degrees of freedom).
P-value: \(2 \times P(t_{2n-2} \geq |t|)\) in the two-sided case, \(P(t_{2n-2} \geq t)\) or \(P(t_{2n-2} \leq t)\) in the one-sided cases.
Example with code
A sample of 40 patients has their systolic blood pressure taken. 20 of those patients are in a group where they received medication A, while the 20 others received medication B. The rounded values are:
Medication A:
[105, 109, 88, 126, 92, 101, 115, 97, 99, 100, 115, 132, 110, 98, 115, 104, 109, 121, 102, 110]
Medication B:
[128, 122, 120, 110, 118, 118, 119, 116, 119, 117, 113, 124, 124, 128, 120, 117, 117, 112, 124, 114]
The researchers are wondering if the blood pressure is different in the two groups. We can therefore perform a two-sided t-test by computing the difference between the means of the two samples and the t statistic, then looking up the p-value using `scipy.stats.t`.

```python
import numpy as np
from scipy import stats

a = np.array([105, 109, 88, 126, 92, 101, 115, 97, 99, 100,
              115, 132, 110, 98, 115, 104, 109, 121, 102, 110])
b = np.array([128, 122, 120, 110, 118, 118, 119, 116, 119, 117,
              113, 124, 124, 128, 120, 117, 117, 112, 124, 114])
n = len(a)
s = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
tstat = (np.mean(a) - np.mean(b)) / (s * np.sqrt(2 / n))
pval = 2 * stats.t(df=2*n-2).sf(np.abs(tstat))
```

Giving us `tstat=-4.26` and `pval=1.2e-4`: we should reject the hypothesis that the means are equal; the alternative that \(E[medication A] \neq E[medication B]\) is preferred.
We can directly use `scipy.stats.ttest_ind` for the same results.

```python
res = stats.ttest_ind(a, b, alternative='two-sided')
```
Note that the test tells us nothing about the effect size. If we take:

```python
effect = np.mean(a) - np.mean(b)
```

We will get a result of -11.64, indicating that, on average, the systolic blood pressure with medication A was 11.64 lower than with medication B.
Generalisation to unequal sample sizes
If the sample sizes are different (but we still have similar variances), we can use: \(t = \frac{\bar{x}-\bar{y}}{s\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}\), with \(s = \sqrt{\frac{(n_x-1)s^2_x + (n_y-1)s^2_y}{n_x+n_y-2}}\), and \(n_x+n_y-2\) degrees of freedom for the test.
`scipy.stats.ttest_ind` will accept samples of different sizes. If the parameter `equal_var` is set to `False`, it will perform Welch’s t-test instead.
Like many non-parametric tests, the Wilcoxon signed-rank test is based on the ranks of the sample.
Let \(X\) be a random numerical variable of unknown, symmetrical distribution, and \(x\) a sample of size \(n\) drawn from \(X\).
In the one sample case, we have \(H_0: Median[X] = m\), and for the next steps we will use \(\hat{x} = x-m\) as our sample, so that under \(H_0\) we would have \(E[\hat{x}] = 0\). In the paired samples case, we also have \(y\) a sample of \(Y\), with \(H_0: Median[X] = Median[Y]\). We will therefore use \(\hat{x} = x-y\) as the sample to test, also with \(E[\hat{x}] = 0\) under \(H_0\).
The general idea of the test is to first sort \(\hat{x}\) by its absolute values so that \(|\hat{x}_1| \leq \dots \leq |\hat{x}_n|\), with \(i\) the rank of \(\hat{x}_i\) (if there are no ties). We also define \(sign(\hat{x}_i) = 1\) if \(\hat{x}_i > 0\) and \(-1\) if \(\hat{x}_i < 0\) (assuming no zeros).
The test statistic is \(T = \sum_{i=1}^n sign(\hat{x}_i) \times i\). If \(\hat{X}\) is centered on 0, we would expect \(T\) to be close to zero. Alternatively, the statistics \(T^+ = \sum_{i : \hat{x}_i > 0} i\) (sum of the ranks of the positive values) and \(T^-\) (sum of the ranks of the negative values) can be used. Each of these statistics can be computed from any of the others (see e.g. the Wikipedia page on the test).
For \(n=10\) and \(n=100\), these distributions will look like this (based on a simulation, see `choosing-test.py`):


As we can see, these distributions look fairly similar to normal distributions. In fact, the normal approximation can be made for large enough \(n\), although an “exact” computation can be done for lower values (typically \(n>50\) is used as a cutoff).
Typically, in tables of critical values, \(T_{crit} = min(T^+, T^-)\) is used as the test statistic, and we reject \(H_0\) if \(T_{crit} < V\) where \(V\) is the “cutoff” in the table for given values of \(n\) and \(\alpha\).
Example with code
While working on a segmentation algorithm, you develop an algorithm A that obtains certain Dice scores on a test set of 10 images. You want to compare it with state-of-the-art algorithm B on that same set.
The two sets of Dice scores are as follows:
A:
[0.74, 0.92, 0.81, 0.85, 0.61, 0.91, 0.74, 0.84, 0.82, 0.89]
B:
[0.76, 0.61, 0.78, 0.82, 0.71, 0.85, 0.61, 0.91, 0.91, 0.67]
Which means that A has an average Dice score of 0.81, and B has an average Dice score of 0.76. Can you conclude that you beat the state-of-the-art?
First, we compute the differences A-B:
[-0.02 0.31 0.03 0.03 -0.1 0.06 0.13 -0.07 -0.09 0.22]
Then, we sort them by their absolute value:
[-0.02 0.03 0.03 0.06 -0.07 -0.09 -0.1 0.13 0.22 0.31]
And compute the signs and ranks:
[-1 2 3 4 -5 -6 -7 8 9 10]
Which gives us: \(T = -1+2+3+4-5-6-7+8+9+10 = 17\), or \(T^+ = 2+3+4+8+9+10 = 36\), or \(T^- = 1+5+6+7 = 19\).
With that, we can use tables of critical values such as this one from Larry Winner of the University of Florida.
If we use \(\alpha=0.05\), we see in the table that the critical value for \(n=10\) is 10 for a one-sided test (we want to confirm if A is better than B). Since \(T_{crit} = min(36, 19) = 19 > 10\), we cannot reject \(H_0\), and we cannot conclude that our results are significantly better than the state-of-the-art.
To avoid going into the tables, we can directly use the scipy.stats.wilcoxon method to get the same results:
x = np.array(a)-np.array(b)
res = stats.wilcoxon(x)
Which will give us res.statistic=19 and res.pvalue=0.43, a clearly inconclusive value.
Checking for the symmetrical property
We can use scipy.stats.skewtest
to check if the skewness of the distribution is compatible with the
symmetry hypothesis:
res_a = stats.skewtest(a).pvalue
res_b = stats.skewtest(b).pvalue
Giving us in this case p-values of 0.12 and 0.89 \(\rightarrow\) we cannot reject the symmetry hypothesis, so it’s reasonable to use the Wilcoxon signed rank test rather than the sign test.
a.k.a. Wilcoxon rank-sum test, not to be confused with Wilcoxon signed rank test…
Let \(X, Y\) be two random variables, from which we draw samples \(x\) and \(y\) of size \(m\) and \(n\), respectively.
\(H_0\): \(P(X > Y) = P(Y > X)\).
Like the Wilcoxon signed rank test, this test is based on the ranking of the samples.
For the U test, we start by pooling the two samples into a single set \(s = x \cup y\), and we rank the values in \(s\) (so we get, for each value in \(x\) and \(y\), a global rank across all values in \(x\) and \(y\)).
Then, we compute the sum of ranks for \(x\) and \(y\): \(R_x = \sum_{i=1}^m R_{x_i}\) where \(R_{x_i}\) is the global rank of the \(i\)-th value in sample \(x\) (and conversely \(R_y = \sum_{j=1}^n R_{y_j}\) for \(y\)).
We define \(U_x = mn + \frac{m(m+1)}{2} - R_x\) and \(U_y = mn + \frac{n(n+1)}{2} - R_y\).
The test statistic is \(U = min(U_x,
U_y)\). It can be either compared to critical values tables
(e.g. this one from real-statistics.com),
with \(H_0\) rejected if \(U < U_{crit}\), or the p-value can be
computed using e.g. scipy.stats.mannwhitneyu.
Example with code
We are running a “survival analysis” where, after a critical intervention using protocol A or B, patients are followed until their death, and we record in both groups the number of months left alive after the intervention.
For protocol A, we have 10 patients, and the survival data is:
[6, 45, 9, 72, 61, 57, 3, 44, 51, 21], for a median
survival of 44.5 months.
For protocol B, we have 7 patients, with:
[33, 9, 13, 15, 55, 53, 77], for a median survival of 33
months.
Can we conclude that the two groups have different survival times (and therefore that one protocol should be preferred)?
The pooled data is:
[6, 45, 9, 72, 61, 57, 3, 44, 51, 21] + [33, 9, 13, 15, 55, 53, 77]
With the ranks:
[2, 10, 3.5, 16, 15, 14, 1, 9, 11, 7] + [8, 3.5, 5, 6, 13, 12, 17]
So we have:
\(R_x = 88.5\), \(R_y = 64.5\), and:
\(U_x = 10 \times 7 + \frac{10 \times 11}{2} - 88.5 = 36.5\)
\(U_y = 10 \times 7 + \frac{7 \times 8}{2} - 64.5 = 33.5\)
\(U = min(U_x, U_y) = 33.5\).
Looking at the critical value tables, we can see that, for \(n_1 = 10\) and \(n_2 = 7\), the critical value for \(U\) at significance level 0.05 is \(U_{crit} = 12\). \(U \geq U_{crit} \rightarrow H_0\) cannot be rejected. We cannot conclude that the two populations have different survival times.
Computing the \(U\) statistic with Python:
a = [6, 45, 9, 72, 61, 57, 3, 44, 51, 21]
b = [33, 9, 13, 15, 55, 53, 77]
m, n = len(a), len(b)
s = a + b # concatenate
rs = stats.rankdata(s)
Rx = sum(rs[:len(a)])
Ry = sum(rs[len(a):])
Ux = m*n + (m*(m+1)/2) - Rx
Uy = m*n + (n*(n+1)/2) - Ry
U = min(Ux, Uy)
print(U)
And computing the results directly with scipy.stats:
res = stats.mannwhitneyu(a, b)
print(res.statistic, res.pvalue)
Will give us the same U statistic and a p-value of 0.92, confirming that we cannot reject \(H_0\).
This test makes very few assumptions about the data: the samples must be paired, and randomly selected from their respective populations. If the symmetry conditions are met, however, the Wilcoxon signed rank test is more powerful, and if the normality conditions are met, Student’s paired samples t-test is much better.
Let \(X, Y\) be two random variables from unknown distributions, such that the samples \(x, y\) of size \(n\) are paired.
The test simply counts the number of pairs for which \(y_i - x_i > 0\). Or, for one sample \(x\), the number of \(x_i > 0\).
Under \(H_0\), we expect an equal number of positive and negative differences. The statistic, \(W\), therefore follows a binomial distribution under \(H_0\): \(W \sim B(m, 0.5)\), with \(m \leq n\) the number of items where \(x_i \neq y_i\) (i.e. we remove the items where there is no difference between the two samples, or equal to 0 in the one sample case).
A Binomial test can then be performed to determine the p-value.
Example with code
A cognitive test, scored between 0 and 20, is given to 11 patients before their admission to a center of care. The same cognitive test is applied to the same patients six months later to assess their evolution. The following results are recorded:
t0 = [7, 9, 4, 2, 4, 8, 10, 20, 3, 9, 5]
t6 = [4, 6, 3, 2, 5, 7, 10, 18, 2, 5, 4]
Can we say that the cognitive ability of the patients has deteriorated during those six months?
After removing the two patients with no difference in their before-after test, we have:
signs = [-, -, -, +, -, -, -, -, -], so that \(W = 1\) and \(m = 9\).
A binomial left-sided test tells us that \(P(W \leq 1) = \sum_{i=0}^{1} \binom{9}{i}(0.5)^i(0.5)^{9-i} = 0.02\), and using a significance level of 0.05, we can reject the null hypothesis that there is no deterioration. We can reasonably conclude that there is a cognitive deterioration in our patients after six months.
Using scipy.stats.binomtest:
s = [b-a > 0 for a, b in zip(t0, t6) if b-a != 0]
W = sum(s)
m = len(s)
res = stats.binomtest(W, m, alternative='less')
Let \(X, Y\) be two continuous random variables from which we draw a sample \(x, y\) of size \(n\). The Pearson Correlation Coefficient (PCC) for the population is defined as:
\(\rho_{XY} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}\) where \(cov\) is the covariance: \(cov(X, Y) = E[(X-\mu_X)(Y-\mu_Y)]\).
For the sample, we therefore have:
\(r_{xy} = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}\)
It will always be bound by \(-1 \leq r_{xy} \leq 1\), with \(r = \pm 1\) showing perfect collinearity and \(r = 0\) meaning that there is no linear correlation.
Important: \(r\) doesn’t tell us anything about the slope of the line if there is a linear relationship, only how closely the points \((x_i, y_i)\) follow the line. \(r=0\) also doesn’t mean that the variables are independent, as there could be a non-linear dependency that the PCC will not show. Meanwhile, while \(r \approx \pm 1\) indicates a correlation, it doesn’t indicate a causal relationship between the two variables (nor the direction of causality if there is one). The correlation may be due to general trends (“variables that tend to increase/decrease over time”) or to confounding variables.
If \(X\) and \(Y\) follow a normal distribution, under the null hypothesis \(H_0\) that \(\rho_{XY} = 0\), the statistic \(T = r\sqrt{\frac{n-2}{1-r^2}}\) follows a Student’s t distribution: \(T \sim t_{n-2}\).
The p-value can therefore be computed as \(2 \times P(t_{n-2} \geq |T|)\) for the two-sided test, or \(P(t_{n-2} \leq T)\) and \(P(t_{n-2} \geq T)\) for the left- and right- sided tests.
Example with code
In a sample of 60 healthy adults, for a period of ten days, the amount of high-intensity daily activity is measured. During that time, the heart rate at rest is also regularly monitored. Is there a correlation between the average heart rate at rest and the average daily activity?
The scatterplot of the data is:

First, we can compute the correlation coefficient:
r = sum((x-x.mean())*(y-y.mean()))/np.sqrt(sum((x-x.mean())**2)*sum((y-y.mean())**2))
Under \(H_0\), \(T = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}\), so we can compute \(T\) and the p-value with:
T = r*np.sqrt((n-2)/(1-r**2))
pval = 2*stats.t(df=n-2).sf(np.abs(T))
This will give us T=-2.5 and pval=0.01, indicating that we should reject \(H_0\): there is a (negative) correlation between the active time and heart rate.
We can also use scipy.stats.pearsonr
to compute the test statistic and p-value. Scipy uses a slightly
different method for computing the p-value, so it is not strictly
equivalent to the results we’ve shown, but it will be very similar in
general.
Let \(X, Y\) be two continuous or ordinal random variables from which we draw a sample \(x, y\) of size \(n\). The Spearman Correlation Coefficient (SCC) is defined as the Pearson Correlation Coefficient computed on the ranks variables.
We must therefore first compute the ranks \(R_{x}\) and \(R_{y}\), so that \(R_{x_i}\) is the rank of \(x_i\) in the sample \(x\). We then have:
\(r_s = \rho_{R_xR_y} = \frac{\sum_{i=1}^n(R_{x_i}-\bar{R}_x)(R_{y_i}-\bar{R}_y)}{\sqrt{\sum_{i=1}^n(R_{x_i}-\bar{R}_x)^2\sum_{i=1}^n(R_{y_i}-\bar{R}_y)^2}}\)
If there are no ties in the ranks of both variables, then we can simplify to:
\(r_s = 1 - \frac{6\sum_{i=1}^n(R_{x_i}-R_{y_i})^2}{n(n^2-1)}\)
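We can verify this equivalence numerically on an arbitrary tie-free sample (the variables `x` and `y` below are made up for the check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
x = rng.permutation(n).astype(float)  # tie-free by construction
y = x + rng.normal(0, 3, size=n)      # noisy monotonic-ish relation

rx = stats.rankdata(x)
ry = stats.rankdata(y)
d = rx - ry
r_short = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))  # no-ties shortcut
r_full = stats.pearsonr(rx, ry).statistic          # PCC on the ranks
assert np.isclose(r_short, r_full)
```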
Like the PCC, the SCC will always be bound between -1 and 1.
Under \(H_0\) of no monotonic relationship between \(X\) and \(Y\), \(T = r_s \sqrt{\frac{n-2}{1-r_s^2}} \sim t_{n-2}\).
We can use scipy.stats.spearmanr
to compute the p-values and test statistic, or convert to ranks and then
use scipy.stats.pearsonr for the same results:
scc = stats.spearmanr(x, y)
rx = stats.rankdata(x)
ry = stats.rankdata(y)
scc_ = stats.pearsonr(rx, ry)
assert np.isclose(scc.statistic, scc_.statistic)
assert np.isclose(scc.pvalue, scc_.pvalue)
Let \(X, Y\) be two samples of sizes \(n\) and \(m\) from two populations with a normal distribution.
The F-test tests for the equality of the variances in the two populations.
\(H_0\) is that the population variances are equal: \(\sigma_X^2 = \sigma_Y^2\). The sample variances are \(S_X^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\) and \(S_Y^2 = \frac{1}{m-1}\sum_{j=1}^m(Y_j-\bar{Y})^2\).
Under \(H_0\), the statistic \(F = \frac{S_X^2}{S_Y^2}\) follows an F-distribution with \(n-1, m-1\) degrees of freedom.
Example with code
We have two samples \(x\) and \(y\) of sample sizes 30 and 40, coming from distributions known to be normal.
We measure the variance of \(x\) at 10.9, and the variance of \(y\) at 6.0. Can we conclude that the two samples have unequal variances?
Under \(H_0\), \(f = \frac{s^2_x}{s^2_y}\) follows a \(F_{29, 39}\) distribution.
We have \(f = \frac{10.9}{6.0} = 1.8\).
And we can compute, from tables the
thresholds at different \(\alpha\)
values, or the p-value using scipy.stats to see if we can
reject \(H_0\).
pval = stats.f(29, 39).sf(1.8)
We get a p-value of 0.04, so using the typical \(\alpha=0.05\) threshold we would reject the null hypothesis and conclude that we can’t consider the variances to be equal. This means that we couldn’t use a Student independent t-test to check the equality of the means of the two populations.

Levene’s test assesses the hypothesis that the variances of two or more populations are equal. Like the F-test of equality of variances, it uses the F-distribution, but with a different test statistic.
For \(k\) different samples, let:
- \(n = \sum_{i=1}^k n_i\) be the total number of elements in all groups combined,
- \(Z_{ij} = |X_{ij}-\bar{X_i}|\) be the absolute difference between an observation and the mean of its sample,
- \(\bar{Z_i} = \frac{1}{n_i}\sum_{j=1}^{n_i} Z_{ij}\) be the mean of the \(Z_{ij}\) for the \(i\)-th group,
- \(\bar{Z} = \frac{1}{n}\sum_{i=1}^k\sum_{j=1}^{n_i}Z_{ij}\) be the mean of all \(Z_{ij}\).
The test statistic is:
\(W = \frac{n-k}{k-1} \frac{\sum_{i=1}^k n_i(\bar{Z_i}-\bar{Z})^2}{\sum_{i=1}^k \sum_{j=1}^{n_i}(Z_{ij}-\bar{Z_i})^2}\)
Under \(H_0\), \(W\) follows a F-distribution with \(k-1, n-k\) degrees of freedom.
Example with code
Let’s take the same example as the F-test of equality of variances, where we have two samples \(x\) and \(y\) of sample sizes 30 and 40, coming from distributions known to be normal. First, we have to compute the \(Z_{ij}\) values:
zx = np.abs(x-x.mean())
zy = np.abs(y-y.mean())
n = len(zx)+len(zy)
z = np.concatenate([zx, zy])
Then, for two samples, we can simplify the equation for \(W\):
W = (n-2)*(len(zx)*(zx.mean()-z.mean())**2+len(zy)*(zy.mean()-z.mean())**2)/(((zx-zx.mean())**2).sum()+((zy-zy.mean())**2).sum())
And then compute the p-value for the \(F(1, 68)\) distribution:
fdist = stats.f(1, n-2)
pval = fdist.sf(W)
Which gives us a p-value of 0.21. So with Levene’s test, we cannot reject \(H_0\).

Alternatively, we can directly use scipy.stats.levene
to get the same results:
results = stats.levene(x, y, center='mean')
assert np.isclose(results.statistic, W)
assert np.isclose(results.pvalue, pval)
As we can see, the relaxed normality assumption comes at the cost of making the test potentially less powerful, as we can’t reject \(H_0\) here even though the samples actually come from populations with different variances.
Let \(X, Y\) be two independent (unpaired) samples of sizes \(n\) and \(m\).
Welch’s test uses a t statistic defined as:
\(t = \frac{\bar{X}-\bar{Y}}{\sqrt{\frac{s^2_X}{n}+\frac{s^2_Y}{m}}}\)
Under \(H_0: \mu_X = \mu_Y\), the statistic follows a \(t_\nu\) distribution (Student distribution with \(\nu\) degrees of freedom), with:
\(\nu = \frac{(\frac{s_X^2}{n}+\frac{s_Y^2}{m})^2}{\frac{s_X^4}{n^2(n-1)}+\frac{s_Y^4}{m^2(m-1)}}\)
Example with code
Using the same example as for Student’s independent t-test:
A sample of 40 patients has their systolic blood pressure taken. 20 of those patients are in a group where they received medication A, while the 20 others received medication B. The rounded values are:
Medication A:
[105, 109, 88, 126, 92, 101, 115, 97, 99, 100, 115, 132, 110, 98, 115, 104, 109, 121, 102, 110]
Medication B:
[128, 122, 120, 110, 118, 118, 119, 116, 119, 117, 113, 124, 124, 128, 120, 117, 117, 112, 124, 114]
The researchers are wondering if the blood pressure is
different in the two groups. We can therefore perform a Welch
test by computing the t statistic and degrees of freedom, then looking
up the p-value using scipy.stats.t, or we can use scipy.stats.ttest_ind
with equal_var=False to perform the test directly.
n = len(x)
m = len(y)
vx = x.var(ddof=1)
vy = y.var(ddof=1)
t = (x.mean()-y.mean())/np.sqrt(vx/n + vy/m)
df = ((vx/n)+(vy/m))**2/((vx**2/(n**2*(n-1)))+(vy**2/(m**2*(m-1))))
pval = 2*stats.t(df=df).sf(np.abs(t))
Giving us df=26.27, tstat=-4.26 and pval=2.3e-4: we should reject the hypothesis that the means are equal; the alternative that \(E[\text{medication A}] \neq E[\text{medication B}]\) is preferred.
Using scipy.stats.ttest_ind, we can verify that we get
the same results.
res = stats.ttest_ind(x, y, alternative='two-sided', equal_var=False)
assert np.isclose(res.statistic, t)
assert np.isclose(res.pvalue, pval)
The Bonferroni correction procedure is a method that can be used when testing multiple null hypotheses (either because we are in the multivariate case and making one-to-one tests, or because we are testing several different things at the same time).
The issue that this procedure tries to solve is that the probability of making at least one Type I error (known as the family-wise error rate, FWER) increases with the number of tests. If we use a significance level \(\alpha=0.05\) for \(m\) tests, and all the null hypotheses are true (\(H_0\) conditions for all tests), we have for each test a probability of 0.05 to reject \(H_0\). If all the tests are independent, then the probability of rejecting at least one test is: \(P(\text{at least one rejection} \mid \text{all } H_0) = 1 - P(\text{no rejection} \mid \text{all } H_0) = 1 - (1-\alpha)^m\).
The correction procedure consists in using \(\alpha_B = \frac{\alpha}{m}\), which ensures that the FWER remains \(\leq \alpha\).
Alternatively, we can multiply each p-value by \(m\) while keeping the significance level \(\alpha\) to get the same conclusions about the tests.
Note that, as often in statistics, using Bonferroni is a trade-off: by reducing the risk of Type I error, we also reduce the power of the tests. Bonferroni is a very conservative approach, which for large \(m\) comes with a larger chance of making Type II errors and not rejecting \(H_0\) when we should.
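As a minimal sketch, with hypothetical p-values from \(m=4\) tests, the two equivalent ways of applying the correction are:

```python
import numpy as np

alpha = 0.05
pvals = np.array([0.001, 0.02, 0.04, 0.30])  # hypothetical p-values from m=4 tests
m = len(pvals)

# option 1: compare each p-value to alpha / m
reject = pvals < alpha / m
# option 2 (equivalent): multiply each p-value by m (capped at 1), keep alpha
adjusted = np.minimum(pvals * m, 1.0)
assert (reject == (adjusted < alpha)).all()
print(reject)  # only the first test survives the correction
```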
Let \(X_i\) be dichotomous random variables from paired samples of size \(n\), so that \(X_{ij}\) represents the \(j\)-th individual in the \(i\)-th sample (so \(X_{11}, X_{21}, X_{31}...\) are values for the same individual in three different samples). We therefore have a table like:
| Individual \ Sample | 1 | … | m |
|---|---|---|---|
| 1 | \(X_{11}\) | … | \(X_{m1}\) |
| … | … | \(X_{ij}\) | … |
| n | \(X_{1n}\) | … | \(X_{mn}\) |
And \(X_{ij}\) can only take two distinct values (e.g.: YES/NO, SICK/HEALTHY…)
The null hypothesis is that the probabilities of each outcome are the same across the different samples.
The test statistic is:
\(T = m(m-1)\frac{\sum_{i=1}^m (\sum_{j=1}^nX_{ij} - \frac{N}{m})^2}{\sum_{j=1}^n(\sum_{i=1}^mX_{ij})(m-\sum_{i=1}^mX_{ij})}\)
Where \(N = \sum_{i=1}^n \sum_{j=1}^m X_{ij}\).
Under \(H_0\), \(T \sim \chi^2_{m-1}\).
We can easily verify that with a quick simulation. Let’s assume that each \(X_{ij}\) is independently drawn from a Bernoulli distribution. This would satisfy our \(H_0\) conditions, as we would expect each sample to have the same distribution of values. If we run 10 000 simulations where we draw \(n\) individuals and \(m\) samples from a Bernoulli with \(p=0.8\), and compute the test statistic, we can verify that the distribution of the statistic follows the \(\chi^2_{m-1}\).

If, on the other hand, one of the samples is drawn from a Bernoulli with a different probability (here, the first sample uses \(p=0.6\)), then the test statistic is much more likely to be higher. Using a significance level of \(\alpha=0.05\), we find (as expected) around 5% of the runs above the critical value in the first case, and around 88% in the second.

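A minimal sketch of such a simulation under \(H_0\) (all samples drawn from a Bernoulli with \(p=0.8\); the sizes and number of runs below are arbitrary) could be:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, runs = 30, 4, 2000               # arbitrary simulation sizes
crit = stats.chi2(df=m - 1).isf(0.05)  # critical value at alpha=0.05

rejections = 0
for _ in range(runs):
    x = (rng.random((n, m)) < 0.8).astype(int)  # H0: same p=0.8 in every sample
    xc = x.sum(axis=0)                 # per-sample totals
    xr = x.sum(axis=1)                 # per-individual totals
    N = x.sum()
    denom = (xr * (m - xr)).sum()
    if denom == 0:                     # degenerate draw: all rows constant
        continue
    t = m * (m - 1) * ((xc - N / m) ** 2).sum() / denom
    rejections += t > crit
print(rejections / runs)               # close to the nominal 5% under H0
```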
Cochran’s Q test is not implemented in scipy, but it can be found in
statsmodels.
If x contains the contingency table for \(n\) individuals and \(m\) samples:
# compute manually
n, m = x.shape # n individuals (rows), m samples (columns)
N = x.sum()
xc = x.sum(axis=0) # sum per column (per sample)
xr = x.sum(axis=1) # sum per row (per individual)
t = m*(m-1)*((xc-(N/m))**2).sum()/(xr*(m-xr)).sum()
pval = stats.chi2(df=m-1).sf(t)
# compute with statsmodels
from statsmodels.stats.contingency_tables import cochrans_q
res = cochrans_q(x)
assert np.isclose(res.pvalue, pval)
One-to-one comparisons between the samples can be performed with McNemar’s test if \(H_0\) of Cochran’s test can be rejected. This will tell us which pairs of samples are significantly different from each other.
a.k.a. Repeated measures one-way ANOVA
Let \(X_1...X_i...X_m\) be \(m\) measures of a normally distributed numerical variable on \(n\) individuals.
\(X_{ij}\) is the \(j\)-th value in the \(i\)-th measure.
The repeated measures ANOVA assumes equality of variances between the measures (or samples).
\(H_0\) is that all groups have the same mean: \(\mu_i = \mu_j \forall i,j\).
\(H_1\) is that at least two groups have different means.
The ANOVA tests if the total variance of \(X\) can be explained fully by the within group or the between individuals variance, and not by the between groups variance.
The test statistic is \(F = \frac{MS_b}{MS_e}\), where \(MS_b\) is the between groups variance, and \(MS_e\) is the mean “error variability”, which is the part of the variance that cannot be explained by the between groups or the between individuals variances.
The between groups variance is expressed as:
\(MS_b = \frac{\sum_{i=1}^m n(\bar{x}_{i.}-\bar{\bar{x}})^2}{m-1}\)
Where \(\bar{x}_{i.}\) is the mean values for group \(i\). The mean error variability is expressed as:
\(MS_e = \frac{SS_w - SS_i}{(n-1)(m-1)}\)
Where \(SS_w\), the “within groups” sum of squares, is:
\(SS_w = \sum_{i=1}^m\sum_{j=1}^n (x_{ij}-\bar{x}_i)^2\)
And \(SS_i\), the “between individuals” sum of squares, is:
\(SS_i = m\sum_{j=1}^n(\bar{x}_{.j}-\bar{\bar{x}})^2\)
Where \(\bar{x}_{.j}\) is the mean value for individual \(j\).
Under \(H_0\), \(F\) follows a Snedecor-Fisher
F-distribution with \(m-1,(n-1)(m-1)\) degrees of freedom. We can
look-up critical values in statistical
tables, or use scipy.stats.f.
While not implemented in scipy, we can also compute the
repeated measures ANOVA using statsmodels
(it requires a bit of pre-processing to put the input data into a
pandas DataFrame, however – see example below).
Example with code:
To test the safety of a new drug, 80 healthy patients are given the same dose, and a parameter is measured right before administration (“T0”), one day after (“T+1”), one week after (“T+7”) and one month after (“T+30”) administration.
After acquiring the data, we verify that all measures pass the Shapiro-Wilk normality test.
The mean and standard deviations of this parameter at each time are:
T0: mean=26.8, std=9.5
T+1: mean=28.5, std=9.8
T+7: mean=27.2, std=9.6
T+30: mean=30.8, std=9.6
The variances are similar enough that we can proceed with the ANOVA test. The null hypothesis would be that all the means are equal.
We can start by computing the columns, rows and total means
(i.e. means per-sample, per-individual, and global, respectively). If
X is a numpy.ndarray with each row
corresponding to an individual, and each column to a measure, we
have:
cm = X.mean(axis=0) # means per sample
im = X.mean(axis=1) # means per individual
tm = X.mean() # total mean
Then we can compute \(MS_b\):
n, m = X.shape # number of rows = individuals = n, number of columns = groups = m
MSb = ((cm-tm)**2).sum()*n/(m-1)
Similarly, \(SS_w\) and \(SS_i\), from which we can compute \(MS_e\):
SSw = 0
for i in range(m):
SSw += ((X[:, i]-cm[i])**2).sum()
SSi = m*((im-tm)**2).sum()
MSe = (SSw-SSi)/((n-1)*(m-1))
And from there we can compute the \(F\) test statistic and the p-value using the F-distribution:
F = MSb/MSe
pval = stats.f(dfn=m-1, dfd=(n-1)*(m-1)).sf(F)
Which will give us F=51.3 and pval=1.33e-25: we can be very confident that we should reject \(H_0\), and there is a significant difference between at least two of the measures. This would indicate in this example that there is an effect over time of the medication.
Different post hoc tests can be used after a repeated measures ANOVA. A common choice – implemented in scipy – would be Tukey’s HSD (“Honestly significant difference”).
Tukey’s HSD uses a test statistic based on the difference between the means \(\bar{X}_i - \bar{X}_j\). The exact computation depends on whether the sample sizes are equal or not (see e.g. Wikipedia or references in Scipy’s documentation).
We can perform Tukey’s HSD with
scipy.stats.tukey_hsd:
X_in_groups = [X[:, i] for i in range(m)] # split by column
post_hoc = stats.tukey_hsd(*X_in_groups)
We could also perform pairwise Student’s paired samples t-tests, correcting the p-values using Bonferroni’s method.
In both cases, we can show the results using a boxplot where we note the pairwise p-values. As we can see, different post-hoc tests may yield different results:
Let \(X_i\) be ordinal or numerical random variables from paired samples of size \(n\), so that \(X_{ij}\) represents the \(j\)-th individual in the \(i\)-th sample (so \(X_{11}, X_{21}, X_{31}...\) are values for the same individual in three different samples). We therefore have a table like:
| Individual \ Sample | 1 | … | m |
|---|---|---|---|
| 1 | \(X_{11}\) | … | \(X_{m1}\) |
| … | … | \(X_{ij}\) | … |
| n | \(X_{1n}\) | … | \(X_{mn}\) |
And \(X_{ij}\) are values that can be assigned ranks.
First, the samples are ranked per-individual. Let \(R_{ij}\) be the rank, for individual \(j\), of sample \(i\), with \(1 \leq R_{ij} \leq m\).
Then, we can compute the average ranks for each sample: \(T_{i} = \frac{1}{n}\sum_{j=1}^n R_{ij}\).
And finally the test statistic \(Q = \frac{12n}{m(m+1)}\sum_{i=1}^m(T_i - \frac{m+1}{2})^2\).
If \(m\) or \(n\) is large enough (\(m > 4\), \(n > 15\)), \(Q \sim \chi^2_{m-1}\). Otherwise, tables such as this one should be used to find critical values.
Example with code:
Using the example from https://datatab.net/tutorial/friedman-test:
We have a set of 7 people, and we have a measure of the reactivity of the people that you take in the morning, at noon and in the evening. We want to know if there is a difference in the reactivity depending on the time of day.
The data table is:
X = np.array([
[34, 45, 36],
[33, 36, 31],
[41, 35, 44],
[39, 43, 42],
[44, 42, 41],
[37, 42, 45],
[39, 46, 40]
])
The first step is to compute the ranks per individual:
R = stats.rankdata(X, axis=1)
Which will give us the result:
[[1. 3. 2.]
[2. 3. 1.]
[2. 1. 3.]
[1. 3. 2.]
[3. 2. 1.]
[1. 2. 3.]
[1. 3. 2.]]
Then, we compute the average rank per sample. Under \(H_0\), we are expecting these average ranks to be equal.
T = R.mean(axis=0)
Result:
[1.57 2.43 2.]
Is this variation within the expected range under \(H_0\)? We must compute the test statistic:
Q = 12*n/(m*(m+1))*((T-(m+1)/2)**2).sum()
Result: 2.57
According to the tables,
for 3 groups and 7 individuals, the critical value at \(\alpha=0.05\) is 7.143.
As 2.57 < 7.143, we cannot reject \(H_0\). The data doesn’t support the idea
that there is a difference between the measure according to the time of
day.
If we wanted to use the \(\chi^2\) approximation (which we shouldn’t in this case as both \(n\) and \(m\) are too low), we could do:
test_distribution = stats.chi2(df=m-1)
pval = test_distribution.sf(Q)
Which would give us a p-value of 0.27, too high to reject \(H_0\). We could obtain the same results using scipy.stats.friedmanchisquare:
results = stats.friedmanchisquare(X[:, 0], X[:, 1], X[:, 2])
assert np.isclose(Q, results.statistic)
assert np.isclose(pval, results.pvalue)
Several post-hoc tests have been proposed to offer more insight on which samples are different from each other after rejecting \(H_0\) with the Friedman test. One such test is the Nemenyi test (see e.g. Demsar, 2006).
According to the Nemenyi test, two samples are significantly different from one another if the difference of their average rank is larger than a critical value:
\(|T_i - T_j| > q_{\alpha}\sqrt{\frac{m(m+1)}{6N}}\), where \(q_\alpha\) values are based on the \(q\) statistic of the Studentized Range (e.g. table) divided by \(\sqrt{2}\).
We can also get those values from scipy.stats.studentized_range, using infinite degrees of freedom:
distrib_studentized = stats.studentized_range(df=np.inf, k=m)
critical_value = distrib_studentized.isf(alpha)/(2**.5)
Note that there is some debate among statisticians on the validity of this post-hoc test, with some (Benavoli et al., 2015) arguing that using pairwise Wilcoxon signed-rank tests is a better alternative, to avoid the results of a pairwise comparison being influenced by the other groups. Note that Wilcoxon in this case should still be used only after Friedman, so that we still account for the multiple-tests effect on the probability of making Type I errors.
Example:
If instead of the table above we have:
X = np.array([
[34, 45, 36],
[28, 36, 31],
[42, 35, 44],
[39, 43, 42],
[38, 42, 41],
[37, 42, 45],
[39, 46, 40]
])
We get a significant Friedman result (pvalue=0.018). The average ranks are [1.14, 2.57, 2.29], which give us mean rank differences of:
\(|T_1 - T_2| = 1.43\), \(|T_1 - T_3| = 1.14\), \(|T_2 - T_3| = 0.29\)
With \(q_{0.05} = 2.343\) from the tables, we get the critical value of \(2.343 \sqrt{\frac{3 \times (3+1)}{6 \times 7}} = 1.25\), meaning that only the difference between the first and second samples is significant.
Alternatively, we could compute the test statistic: \(q_{ij} = \frac{|T_i-T_j|}{\sqrt{\frac{m(m+1)}{6N}}}\), and compute the p-value using the Studentized range distribution:
pvals = np.zeros((m, m))
distrib_studentized = stats.studentized_range(df=np.inf, k=m)
for i in range(m):
for j in range(i+1, m):
d = np.abs(T[i]-T[j])
qval = d / (m*(m+1)/(6*n))**.5
pvals[i, j] = distrib_studentized.sf(qval*(2**.5))
Which would give us p-values of 0.02, 0.08 and 0.85 for the three pairs, leading us to the same conclusion.
a.k.a. One-way ANOVA test for independent samples
Let \(X_1...X_i...X_m\) be \(m\) groups of sizes \(n_1...n_i...n_m\), with \(X_i\) being normally distributed numerical variables.
\(X_{ij}\) is the \(j\)-th value in the \(i\)-th group.
ANOVA Fisher assumes equality of variances across samples (if not, Welch’s ANOVA can be used as an extension of Welch’s test, and will not be reviewed here).
\(H_0\) is that all groups have the same mean: \(\mu_i = \mu_j \forall i,j\).
\(H_1\) is that at least two groups have different means.
The ANOVA for independent samples tests whether the total variance of \(X\) can be explained fully by the within group variance, and not by the between groups variance.
If \(s_i^2\) is the sample variance of group \(i\), we have:
\(MS_w = \frac{1}{m}\sum_{i=1}^ms_i^2 = \sum_{i=1}^m\sum_{j=1}^{n_i}\frac{(x_{ij}-\bar{x}_i)^2}{m(n_i-1)}\), the “Mean Square within”, which is a measure of the within group variance, and:
\(MS_b = \frac{1}{m-1}\sum_{i=1}^mn_i(\bar{x}_i-\bar{\bar{x}})^2\), the “Mean Square between”, which is a measure of the between groups variance, with \(\bar{\bar{x}}\) the mean of all values in all groups.
If each group has a high variance, \(MS_w\) will be large. If the sample means are very different from each other, \(MS_b\) will be large.
The test statistic is \(F = \frac{MS_b}{MS_w}\).
Under \(H_0\), \(F\) follows a Snedecor-Fisher
F-distribution with \(m-1,N-m\) degrees of freedom (with \(N = \sum_{i=1}^m n_i\)). We can look-up
critical values in statistical
tables, or use scipy.stats.f…
or perform the one-way ANOVA using scipy.stats.f_oneway
directly.
Example with code:
In a random clinical trial, patients are split into four different groups: one control, and three different treatments. Each group has a different number of patients. In each patient, we measure a marker of the disease in a blood sample.
We verify that, in each group, the distribution of this measure is normal, with reasonably similar variances. We get the following data:
CONTROL (n=18)
[41, 28, 28, 25, 37, 18, 42, 27, 33, 30, 40, 19, 30, 29, 38, 25, 30, 26]
TREATMENT A (n=12)
[35, 36, 27, 44, 28, 33, 39, 31, 32, 32, 39, 47]
TREATMENT B (n=21)
[45, 36, 33, 19, 31, 30, 32, 28, 32, 29, 23, 39, 39, 44, 33, 30, 29, 22, 39, 25, 24]
TREATMENT C (n=9)
[41, 45, 32, 47, 37, 26, 35, 46, 43]
Let X be a list of np.ndarray, each
containing a sample/group.
We can compute:
m = len(X) # number of groups
total_mean = np.concatenate(X).mean()
N = sum(len(x) for x in X) # total sample size
variances = [x.var(ddof=1) for x in X] # variances
MSw = sum(variances)/m
MSb = (1/(m-1))*sum([len(x)*(x.mean()-total_mean)**2 for x in X])
F = MSb/MSw
pval = stats.f(dfn=m-1, dfd=N-m).sf(F)
Which gives us F=4.09 and a p-value of 0.01: we can reject \(H_0\) and it is very likely that at least two of the groups have different means.
We could also use stats.f_oneway:

results = stats.f_oneway(*X)
print(results.statistic, results.pvalue)

Because there are slightly different ways to compute the F statistic, we get a small difference in the result (so in this case assert np.isclose(results.statistic, F) would raise an AssertionError), but the difference is small and the conclusion is the same: statistic=4.05, pvalue=0.01.
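The discrepancy comes from how the within-group variance is estimated: stats.f_oneway pools the within-group sums of squares, weighting each group by its size (the standard one-way ANOVA computation), while the \(MS_w\) above is an unweighted average of the group variances. A minimal sketch reproducing scipy's statistic on the same data:

```python
import numpy as np
from scipy import stats

# the four groups from the example above
X = [np.array([41, 28, 28, 25, 37, 18, 42, 27, 33, 30, 40, 19, 30, 29, 38, 25, 30, 26]),
     np.array([35, 36, 27, 44, 28, 33, 39, 31, 32, 32, 39, 47]),
     np.array([45, 36, 33, 19, 31, 30, 32, 28, 32, 29, 23, 39, 39, 44, 33, 30, 29, 22, 39, 25, 24]),
     np.array([41, 45, 32, 47, 37, 26, 35, 46, 43])]

m = len(X)
N = sum(len(x) for x in X)
total_mean = np.concatenate(X).mean()

# pooled within-group variance: sum of squares divided by N - m,
# so larger groups weigh more than in the unweighted average of variances
SSw = sum(((x - x.mean()) ** 2).sum() for x in X)
MSw_pooled = SSw / (N - m)
MSb = sum(len(x) * (x.mean() - total_mean) ** 2 for x in X) / (m - 1)
F_pooled = MSb / MSw_pooled

# this now matches scipy's statistic
assert np.isclose(F_pooled, stats.f_oneway(*X).statistic)
```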
Different post hoc tests can be used after a one-way ANOVA. A common choice – implemented in scipy – would be Tukey’s HSD (“Honestly Significant Difference”).
Tukey’s HSD uses a test statistic based on the difference between the means \(\bar{X}_i - \bar{X}_j\). The exact computation depends on whether the sample sizes are equal or not (see e.g. Wikipedia or references in Scipy’s documentation).
We can perform Tukey’s HSD with
scipy.stats.tukey_hsd:
post_hoc = stats.tukey_hsd(*X)

Which gives us pairwise p-values, showing that the pairs CONTROL-TREATMENT C and TREATMENT B-TREATMENT C are significantly different, while the other comparisons do not yield significant results.
We could also visualize these results using box plots, marking the pairs that are significantly different according to the post-hoc:

import matplotlib.pyplot as plt

post_hoc = stats.tukey_hsd(*X).pvalue
ymax = max(x.max() for x in X)
cur_y = ymax + 2
h = 3
fig, ax = plt.subplots()
ax.boxplot(X)
for i in range(m):
    for j in range(i + 1, m):
        if post_hoc[i, j] < 0.05:
            ax.plot([i+1, i+1, j+1, j+1], [cur_y, cur_y+h, cur_y+h, cur_y], 'k-')
            ax.text((i+j+2)/2, cur_y+h+0.5, f"pval={post_hoc[i, j]:.2f}", ha='center')
            cur_y += h + 1
plt.title(f'One-way ANOVA pval={results.pvalue:.2f}')

Let \(X_1...X_i...X_m\) be \(m\) groups of sizes \(n_1...n_i...n_m\), with \(X_i\) being numerical or ordinal variables with identically shaped distributions and scale.
\(X_{ij}\) is the \(j\)-th value in the \(i\)-th group.
The null hypothesis is that the medians of all groups are equal.
First, we compute \(R_{ij}\), the global rank of \(X_{ij}\) (so that \(R_{ij} = 1\) for the smallest value across all samples together, and \(R_{ij} = N\) for the largest, with \(N = \sum_{i=1}^m n_i\)).
The test statistic is:
\(K = (N-1)\frac{\sum_{i=1}^m n_i(\bar{R}_i-\bar{\bar{R}})^2}{\sum_{i=1}^m \sum_{j=1}^{n_i}(R_{ij}-\bar{\bar{R}})^2}\)
With \(\bar{R}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} R_{ij}\) the average rank for group \(i\).
And \(\bar{\bar{R}} = \frac{N+1}{2}\) is the global average rank.
Under \(H_0\), \(K\) follows a known distribution. For three
groups with small sample sizes, we can use tables
of critical values. Otherwise, we can make the approximation that
\(K \sim \chi^2_{m-1}\), and compute
the p-values using the chi-squared distribution (or directly with
scipy.stats.kruskal).
Example with code:
In a random clinical trial, patients are split into four different groups: one control, and three different treatments. Each group has a different number of patients. In each patient, we measure a marker of the disease in a blood sample.
We have a small number of patients (6, 7, 9, and 5 in each group), so we don’t feel confident verifying normality, and would rather fall back on a nonparametric test. We get the following data:
CONTROL (n=6)
[9.17, 12.20, 5.00, 8.02, 6.47, 5.92]
TREATMENT A (n=7)
[10.49, 7.21, 11.40, 10.48, 10.36, 9.64, 8.64]
TREATMENT B (n=9)
[9.41, 10.67, 7.33, 9.09, 12.14, 12.17, 6.00, 6.66, 5.41]
TREATMENT C (n=5)
[14.80, 12.28, 14.84, 13.29, 13.19]
First, we compute the global ranks for each value, and re-separate the ranks in each group:
# pool all data together
all_data = np.concatenate(X)
# compute global ranks
ranks = stats.rankdata(all_data)
# separate per group
ranks_per_group = []
cur_index = 0
for x in X:
    r = ranks[cur_index:cur_index+len(x)]
    cur_index += len(x)
    ranks_per_group.append(r)

From there, we can get the average rank for each group, the total sample size, and the global average rank:
# average rank per group
Ri = [sum(r)/len(r) for r in ranks_per_group]
# total sample size
N = sum(len(x) for x in X)
# global average rank
R = (N+1)/2

Which gives us:
Average rank per group: [8.7, 14.0, 11.4, 25.0]
Total sample size: 27
Global average rank: 14
We can now compute the test statistic. Since we have more than three groups, we will use the \(\chi^2\) approximation to compute the p-value, with 3 degrees of freedom (as we have four groups).
K_numerator = sum(len(r)*(ri-R)**2 for r, ri in zip(ranks_per_group, Ri))
K_denominator = sum(sum((r-R)**2 for r in rs) for rs in ranks_per_group)
K = (N-1)*K_numerator/K_denominator
test_distribution = stats.chi2(df=len(X)-1)
pval = test_distribution.sf(K)

Which gives us K=13.2 and pval=0.004: we should reject \(H_0\), and it is very likely that at least two of the groups have different median values.
We can get the same values using scipy.stats.kruskal:
results = stats.kruskal(*X)
assert np.isclose(results.statistic, K)
assert np.isclose(results.pvalue, pval)

A common choice for the post hoc analysis is to use Dunn’s test (see e.g. Dinno, 2015).
Dunn’s test between group \(i\) and group \(j\) uses the test statistic:
\(z = \frac{\bar{R}_i - \bar{R}_j}{\sigma}\)
Where \(\sigma\) is the standard deviation of \(\bar{R}_i - \bar{R}_j\). If there are no ties in the rankings, we have:
\(\sigma = \sqrt{\frac{N(N+1)}{12}(\frac{1}{n_i}+\frac{1}{n_j})}\)
If there are ties, a correcting factor needs to be added (see Dinno,
2015, and implementation in the choosing_test.py file).
Under \(H_0\) (both groups have the same median), \(z \sim Z = N(0, 1)\), so the p-values can be computed using the standard normal distribution. However, they should be adjusted, using e.g. the Bonferroni correction: if we make \(k\) pairwise comparisons, each p-value \(p\) should be multiplied by \(k\) to get the adjusted p-value \(p^\star = kp\).
In our example, this would yield a significant difference between
CONTROL and TREATMENT C (adjusted_pval=0.004) and between
TREATMENT B and TREATMENT C (adjusted_pval=0.013).
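The pairwise comparisons above can be sketched directly from the formulas (using the no-ties expression for \(\sigma\), which applies here since all values are distinct):

```python
import itertools
import numpy as np
from scipy import stats

# the four groups from the example above (no ties, so no correction factor needed)
X = [np.array([9.17, 12.20, 5.00, 8.02, 6.47, 5.92]),                      # CONTROL
     np.array([10.49, 7.21, 11.40, 10.48, 10.36, 9.64, 8.64]),             # TREATMENT A
     np.array([9.41, 10.67, 7.33, 9.09, 12.14, 12.17, 6.00, 6.66, 5.41]),  # TREATMENT B
     np.array([14.80, 12.28, 14.84, 13.29, 13.19])]                        # TREATMENT C

N = sum(len(x) for x in X)
ranks = stats.rankdata(np.concatenate(X))
# split the global ranks back into groups and take the mean rank of each
splits = np.split(ranks, np.cumsum([len(x) for x in X])[:-1])
Ri = [r.mean() for r in splits]

k = len(X) * (len(X) - 1) // 2  # number of pairwise comparisons
adjusted_pvals = {}
for i, j in itertools.combinations(range(len(X)), 2):
    # standard deviation of the difference of mean ranks (no-ties formula)
    sigma = np.sqrt(N * (N + 1) / 12 * (1 / len(X[i]) + 1 / len(X[j])))
    z = (Ri[i] - Ri[j]) / sigma
    pval = 2 * stats.norm.sf(abs(z))           # two-sided p-value
    adjusted_pvals[i, j] = min(1.0, k * pval)  # Bonferroni adjustment
    print(i, j, round(adjusted_pvals[i, j], 3))
```

Only the pairs involving TREATMENT C against CONTROL and against TREATMENT B come out below 0.05 after the Bonferroni adjustment, matching the results reported above.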
We can visualize those results with a plot where, since the number of observations is low, we show every individual point in each group:
