The notion of “paired” and “unpaired” variables is often confusing. The goal of this document is to try to give clear definitions and examples of the different cases that you may encounter.
When we are comparing two (or more) things, we have three different kinds of relationships:

1. Measuring the same thing on the same individuals, under different conditions.
2. Measuring different things on the same individuals.
3. Measuring the same thing on different individuals.
Comparing different things on different individuals or comparing different things in different conditions would not be very useful (for instance: if we measure the body temperature of one group and the cholesterol level of another group, there is no “relationship” to study).
In 1 and 2, the measures are paired (but will be analyzed slightly differently!). In 3, they are unpaired. Let’s take a closer look at some examples to understand how to approach each situation.
In the first case (measuring the same thing on the same individuals under different conditions), the condition is a categorical variable (ex: BEFORE/AFTER treatment, MORNING/NOON/EVENING, RESTING/DURING EFFORT, etc.). The thing that we measure can be a categorical or a numerical variable.
Example with a numerical variable
Measuring the cholesterol level of patients before and after treatment.
The table of data, if we put one row per measurement, would look something like this:
Individual | BEFORE/AFTER | Cholesterol |
---|---|---|
1 | BEFORE | 221 |
2 | BEFORE | 187 |
… | … | … |
1 | AFTER | 203 |
2 | AFTER | 189 |
Note that in this table, we have two rows for each individual. Usually, to make things more readable, the data will instead be presented with one row per individual, and the BEFORE/AFTER categorical variable will be used to “split” the Cholesterol numerical variable in two:
Individual | Cholesterol BEFORE | Cholesterol AFTER |
---|---|---|
1 | 221 | 203 |
2 | 187 | 189 |
… | … | … |
Since we are measuring the same thing on the same individual under different conditions, we will usually be interested in the difference between the measures. The relationship that we want to determine is between the CONDITION and the MEASURE: are the measures different under the different conditions?
Our null hypothesis will typically be: “there is no impact of the condition on the measure”. This can lead us to a formulation such as: “H0: Mean[Cholesterol BEFORE - Cholesterol AFTER] = 0”, and to tests such as Student’s paired-samples t-test, Wilcoxon signed-rank, or if there are more than two conditions to a Repeated-measures ANOVA or a Friedman test.
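As an illustration, here is a minimal sketch of both paired tests on hypothetical BEFORE/AFTER cholesterol values (the numbers are made up for the example; `scipy.stats` provides both tests):

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol levels for 8 patients, BEFORE and AFTER treatment
# (illustrative values only). Row i of each array is the same individual.
before = np.array([221, 187, 245, 198, 210, 233, 190, 205])
after  = np.array([203, 189, 230, 195, 200, 221, 188, 199])

# Paired t-test: H0 is Mean[BEFORE - AFTER] = 0
t_stat, p_t = stats.ttest_rel(before, after)

# Non-parametric alternative: Wilcoxon signed-rank test on the paired differences
w_stat, p_w = stats.wilcoxon(before, after)

print(f"paired t-test: t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_w:.4f}")
```

Note that both tests operate on the per-individual differences, which is exactly why the pairing matters: analyzing the two columns as independent samples would throw that information away.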
Example with a categorical variable
Measuring the HOUSING STATUS (Owner, Renter, No fixed home) of people suffering from a drug addiction BEFORE/AFTER being regularly followed at a specialized medical center.
Again, the table of data with one row per measurement would look like:
Individual | HOUSING STATUS | FOLLOWED |
---|---|---|
1 | Renter | BEFORE |
2 | No fixed home | BEFORE |
… | … | … |
1 | Renter | AFTER |
2 | Renter | AFTER |
Which would lead to a contingency table such as:
HOUSING \ FOLLOWED | BEFORE | AFTER |
---|---|---|
No fixed home | 12 | 6 |
Renter | 43 | 47 |
Owner | 6 | 8 |
This table, however, is misleading, as one individual actually appears in two different cells (one in the BEFORE column, one in the AFTER column).
We would therefore typically split the HOUSING STATUS variable to present the data with one row per individual:
Individual | HOUSING BEFORE | HOUSING AFTER |
---|---|---|
1 | Renter | Renter |
2 | No fixed home | Renter |
Leading to a contingency table such as:
HOUSING BEFORE \ HOUSING AFTER | No fixed home | Renter | Owner |
---|---|---|---|
No fixed home | 6 | 6 | 0 |
Renter | 0 | 41 | 2 |
Owner | 0 | 0 | 6 |
Our null hypothesis, as with the numerical variable, would be that the CONDITION (BEFORE/AFTER) doesn’t affect the MEASURE (HOUSING STATUS). If the MEASURE is dichotomous (only two possible outcomes) and we have two CONDITIONS, this leads us to a McNemar test. If the measure is dichotomous and we have more than two CONDITIONS, we can use Cochran’s Q test. If, like here, we have more than two categories in the MEASURE, we can use the Stuart-Maxwell test (not covered in this course).
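As a sketch, McNemar's test can be computed by hand on a hypothetical dichotomous version of the housing example (reduced to two categories for illustration; the counts are made up). In practice you would use a library implementation, and an exact binomial version is preferred when the discordant counts are small:

```python
from scipy.stats import chi2

# Hypothetical 2x2 paired table for a dichotomous outcome measured BEFORE/AFTER.
# Rows = status BEFORE, columns = status AFTER (counts of individuals):
#                        AFTER=No fixed home   AFTER=Housed
# BEFORE=No fixed home          a = 6              b = 6
# BEFORE=Housed                 c = 0              d = 55
a, b, c, d = 6, 6, 0, 55

# McNemar's test only looks at the discordant pairs b and c
# (individuals who changed category). H0: changes in both directions
# are equally likely. The statistic follows a chi-square with df=1.
statistic = (b - c) ** 2 / (b + c)
p_value = chi2.sf(statistic, df=1)

print(f"McNemar chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

The concordant cells `a` and `d` do not enter the statistic at all: individuals who did not change category carry no information about the effect of the condition.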
In the second case (measuring different things on the same individuals), we have a table of data that looks like this (for two “things”):
Individual | X | Y |
---|---|---|
1 | \(X_1\) | \(Y_1\) |
… | … | … |
n | \(X_n\) | \(Y_n\) |
Each thing will be a different variable. Since each variable measures a different thing, they may all be numerical (but with potentially different scales, units, etc.), all categorical, or a mix of both.
If they are all numerical, then we will typically be interested in measuring correlations (with e.g. Pearson or Spearman), and in finding the natural “structure” of the data by looking at it as a point cloud in a feature space, leading us towards dimensionality reduction techniques (PCA, MDS, t-SNE…), clustering methods, etc. If one or more of the variables are “outcome” variables that we want to be able to predict based on the others, then that will lead us towards regression techniques.
Example: measuring body temperature, cholesterol level and systolic blood pressure in the same patients.
Individual | Temp | Cholesterol | BP |
---|---|---|---|
1 | 37.8 | 187 | 133 |
2 | 37.2 | 231 | 153 |
… | … | … | … |
Could lead us to a variance/covariance or a correlation matrix such as:
R | Temp | Cholesterol | BP |
---|---|---|---|
Temp | 1 | 0.03 | -0.07 |
Cholesterol | 0.03 | 1 | 0.22 |
BP | -0.07 | 0.22 | 1 |
If they are all categorical, then we may try to find out if they are independent using a Chi-square independence test. We can further analyze the relationships with techniques such as Correspondence analysis (e.g. MCA, Multiple Correspondence Analysis).
Example: measuring Smoking status, Education level, Country of residence.
Individual | Smoking | Education | Country |
---|---|---|---|
1 | YES | BACHELOR | BE |
2 | NO | HIGH SCHOOL | FR |
… | … | … | … |
With three categorical variables, the resulting contingency table will have three dimensions, which we can represent like this:
Smoking=YES | NO EDUC | HIGH SCHOOL | BACHELOR | MASTER | PHD |
---|---|---|---|---|---|
BE | | | | | |
DE | | | | | |
FR | | | | | |
LU | | | | | |
NL | | | | | |
Smoking=NO | NO EDUC | HIGH SCHOOL | BACHELOR | MASTER | PHD |
---|---|---|---|---|---|
BE | | | | | |
DE | | | | | |
FR | | | | | |
LU | | | | | |
NL | | | | | |
Alternatively, we can use a Burt table, which puts everything in a 2D table:
 | SMOKING=YES | SMOKING=NO | EDUC=NO | EDUC=HS | … | CO=NL |
---|---|---|---|---|---|---|
SMOKING=YES | | | | | | |
SMOKING=NO | | | | | | |
EDUC=NO | | | | | | |
… | | | | | | |
CO=NL | | | | | | |
If we have a mix, then we will typically use one or several of the categorical variables as grouping variables to split the dataset into groups, which brings us to the unpaired case of measuring the same thing on different individuals (the last category).
Example: measuring systolic blood pressure, education level and smoking status.
Individual | Smoking | Education | BP |
---|---|---|---|
1 | YES | BACHELOR | 133 |
2 | NO | HIGH SCHOOL | 153 |
… | … | … | … |
How we proceed depends on the question we are asking. If we want to know whether the education level has an effect on smoking and blood pressure, we can use “education level” as a grouping variable. With the same 5 education categories as above, this gives us five unpaired samples in which we measure the same two variables (smoking and BP).
Smoking is a categorical variable, so we can measure its independence from the education level by constructing a contingency table:
Smoking \ Education | NO | HS | BA | MA | PHD |
---|---|---|---|---|---|
YES | 6 | 31 | 16 | 14 | 4 |
NO | 8 | 30 | 18 | 11 | 3 |
As you may notice, this is equivalent to what we did in the “different things, same individuals” setup. For categorical variables, we have a sort of “dual” representation of the problem, and both views lead to the same test (the Chi-square independence test): we can either think of it as measuring different things (Education, Smoking) on the same individuals, OR as measuring the same thing (Smoking) in different groups (split by Education level).
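A sketch of the Chi-square independence test, applied to the Smoking × Education contingency table above with `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from the text: Smoking (rows) x Education level (columns NO/HS/BA/MA/PHD)
table = np.array([
    [6, 31, 16, 14, 4],   # Smoking = YES
    [8, 30, 18, 11, 3],   # Smoking = NO
])

# H0: Smoking and Education level are independent.
# `expected` holds the cell counts we would see under independence.
chi2_stat, p, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p:.3f}")
```

With a 2×5 table, the degrees of freedom are (2−1)×(5−1) = 4; the same call works whichever of the two “dual” framings we have in mind, since the table itself is identical.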
Blood pressure is a numerical variable, so here we will have something different: for each category of our grouping variable, we will have a distinct sample of blood pressures:
EDUCATION=NO
Individual | BP |
---|---|
1 | 118 |
… | … |
EDUCATION=HS
Individual | BP |
---|---|
7 | 131 |
… | … |
Etc…
The question we are asking could then be: is the average blood pressure the same in all groups? If the distributions are all (approximately) normal and the variances are (approximately) equal, then with more than two categories we can use one-way ANOVA (Fisher's F-test), and with two categories Student's independent-samples t-test. If the variances are unequal, we use Welch's t-test or Welch's ANOVA. If normality is not verified, we have the non-parametric equivalents: Mann-Whitney (two categories) and Kruskal-Wallis (more than two categories).
If we also want to know whether the shapes of the distributions are the same, we can use the (two-sample) Kolmogorov-Smirnov test instead.
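A minimal sketch of the parametric and non-parametric comparisons on hypothetical BP samples for three of the education groups (the values are made up; `scipy.stats` provides both tests):

```python
from scipy import stats

# Hypothetical systolic BP samples for three education groups
# (unpaired: different individuals in each group, sizes may differ).
bp_no = [118, 125, 131, 140, 122]        # EDUCATION = NO
bp_hs = [131, 128, 135, 127, 139, 130]   # EDUCATION = HS
bp_ba = [124, 133, 129, 126, 131]        # EDUCATION = BA

# Parametric: one-way ANOVA (assumes normality and roughly equal variances).
# H0: all group means are equal.
f_stat, p_anova = stats.f_oneway(bp_no, bp_hs, bp_ba)

# Non-parametric alternative when normality is doubtful: Kruskal-Wallis.
h_stat, p_kw = stats.kruskal(bp_no, bp_hs, bp_ba)

print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3f}")
```

For exactly two groups, `stats.ttest_ind` (with `equal_var=False` for Welch's version) and `stats.mannwhitneyu` play the corresponding roles.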