The notion of “paired” and “unpaired” variables is often confusing. The goal of this document is to try to give clear definitions and examples of the different cases that you may encounter.
When we are comparing two (or more) things, we have three different kinds of relationships:

1. Measuring the same thing on the same individuals, under different conditions.
2. Measuring different things on the same individuals.
3. Measuring the same thing on different individuals.
Comparing different things on different individuals or comparing different things in different conditions would not be very useful (for instance: if we measure the body temperature of one group and the cholesterol level of another group, there is no “relationship” to study).
In 1 and 2, the measures are paired (but will be analyzed slightly differently!). In 3, they are unpaired. Let’s take a closer look at some examples to understand how to approach each situation.
In the first case (measuring the same thing on the same individuals under different conditions), the condition is a categorical variable (ex: BEFORE/AFTER treatment, MORNING/NOON/EVENING, RESTING/DURING EFFORT, etc.). The thing that we measure can be a categorical or a numerical variable.
Example with a numerical variable
Measuring the cholesterol level of patients before and after treatment.
The table of data, if we put one row per measurement, would look something like this:
Individual | BEFORE/AFTER | Cholesterol |
---|---|---|
1 | BEFORE | 221 |
2 | BEFORE | 187 |
… | … | … |
1 | AFTER | 203 |
2 | AFTER | 189 |
Note that in this table, we have two rows for each individual. Usually, to make things more readable, the data will instead be presented with one row per individual, and the BEFORE/AFTER categorical variable will be used to “split” the Cholesterol numerical variable in two:
Individual | Cholesterol BEFORE | Cholesterol AFTER |
---|---|---|
1 | 221 | 203 |
2 | 187 | 189 |
… | … | … |
Since we are measuring the same thing on the same individual under different conditions, we will usually be interested in the difference between the measures. The relationship that we want to determine is between the CONDITION and the MEASURE: are the measures different under the different conditions?
Our null hypothesis will typically be: “there is no impact of the condition on the measure”. This can lead us to a formulation such as: “H0: Mean[Cholesterol BEFORE - Cholesterol AFTER] = 0”, and to tests such as Student’s paired-samples t-test, Wilcoxon signed-rank, or if there are more than two conditions to a Repeated-measures ANOVA or a Friedman test.
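As an illustration, here is a minimal sketch of both paired tests on hypothetical BEFORE/AFTER cholesterol values (the numbers are made up for the example; `scipy.stats` provides both tests):

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol levels for 8 patients, BEFORE and AFTER treatment
# (illustrative values only). Row i of each array is the same individual.
before = np.array([221, 187, 245, 198, 210, 233, 190, 205])
after  = np.array([203, 189, 230, 195, 200, 221, 188, 199])

# Paired t-test: H0 is Mean[BEFORE - AFTER] = 0
t_stat, p_t = stats.ttest_rel(before, after)

# Non-parametric alternative: Wilcoxon signed-rank test on the paired differences
w_stat, p_w = stats.wilcoxon(before, after)

print(f"paired t-test: t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_w:.4f}")
```

Note that both tests operate on the per-individual differences, which is exactly why the pairing matters: analyzing the two columns as independent samples would throw that information away.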
Example with a categorical variable
Measuring the HOUSING STATUS (Owner, Renter, No fixed home) of people suffering from a drug addiction BEFORE/AFTER being regularly followed at a specialized medical center.
Again, the table of data with one row per measurement would look like:
Individual | HOUSING STATUS | FOLLOWED |
---|---|---|
1 | Renter | BEFORE |
2 | No fixed home | BEFORE |
… | … | … |
1 | Renter | AFTER |
2 | Renter | AFTER |
Which would lead to a contingency table such as:
HOUSING \ FOLLOWED | BEFORE | AFTER |
---|---|---|
No fixed home | 12 | 6 |
Renter | 43 | 47 |
Owner | 6 | 8 |
This table, however, is misleading, as one individual actually appears in two different cells (one in the BEFORE column, one in the AFTER column).
We would therefore typically split the HOUSING STATUS variable to present the data with one row per individual:
Individual | HOUSING BEFORE | HOUSING AFTER |
---|---|---|
1 | Renter | Renter |
2 | No fixed home | Renter |
Leading to a contingency table such as:
HOUSING BEFORE \ HOUSING AFTER | No fixed home | Renter | Owner |
---|---|---|---|
No fixed home | 6 | 6 | 0 |
Renter | 0 | 41 | 2 |
Owner | 0 | 0 | 6 |
Our null hypothesis, as with the numerical variable, would be that the CONDITION (BEFORE/AFTER) doesn’t affect the MEASURE (HOUSING STATUS). If the MEASURE is dichotomous (only two possible outcomes) and we have two CONDITIONS, this leads us to a McNemar test. If the measure is dichotomous and we have more than two CONDITIONS, we can use Cochran’s Q test. If, like here, we have more than two categories in the MEASURE, we can use the Stuart-Maxwell test (not covered in this course).
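As a sketch, McNemar's test can be computed by hand on a hypothetical dichotomous version of the housing example (reduced to two categories for illustration; the counts are made up). In practice you would use a library implementation, and an exact binomial version is preferred when the discordant counts are small:

```python
from scipy.stats import chi2

# Hypothetical 2x2 paired table for a dichotomous outcome measured BEFORE/AFTER.
# Rows = status BEFORE, columns = status AFTER (counts of individuals):
#                        AFTER=No fixed home   AFTER=Housed
# BEFORE=No fixed home          a = 6              b = 6
# BEFORE=Housed                 c = 0              d = 55
a, b, c, d = 6, 6, 0, 55

# McNemar's test only looks at the discordant pairs b and c
# (individuals who changed category). H0: changes in both directions
# are equally likely. The statistic follows a chi-square with df=1.
statistic = (b - c) ** 2 / (b + c)
p_value = chi2.sf(statistic, df=1)

print(f"McNemar chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

The concordant cells `a` and `d` do not enter the statistic at all: individuals who did not change category carry no information about the effect of the condition.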
In the second case (measuring different things on the same individuals), we have a table of data that looks like this (for two “things”):
Individual | X | Y |
---|---|---|
1 | \(X_1\) | \(Y_1\) |
… | … | … |
n | \(X_n\) | \(Y_n\) |
Each thing will be a different variable. Since each variable measures a different thing, they may all be numerical (but with potentially different scales, units, etc.), all categorical, or a mix of both.
If they are all numerical, then we will typically be interested in measuring correlations (with e.g. Pearson or Spearman), and in finding the natural “structure” of the data by looking at it as a point cloud in a feature space, leading us towards dimensionality reduction techniques (PCA, MDS, t-SNE…), clustering methods, etc. If one or more of the variables are “outcome” variables that we want to be able to predict based on the others, then that will lead us towards regression techniques.
Example: measuring body temperature, cholesterol level and systolic blood pressure in the same patients.
Individual | Temp | Cholesterol | BP |
---|---|---|---|
1 | 37.8 | 187 | 133 |
2 | 37.2 | 231 | 153 |
… | … | … | … |
Could lead us to a variance/covariance or a correlation matrix such as:
R | Temp | Cholesterol | BP |
---|---|---|---|
Temp | 1 | 0.03 | -0.07 |
Cholesterol | 0.03 | 1 | 0.22 |
BP | -0.07 | 0.22 | 1 |
If they are all categorical, then we may try to find out if they are independent using a Chi-square independence test. We can further analyze the relationships with techniques such as Correspondence analysis (e.g. MCA, Multiple Correspondence Analysis).
Example: measuring Smoking status, Education level, Country of residence.
Individual | Smoking | Education | Country |
---|---|---|---|
1 | YES | BACHELOR | BE |
2 | NO | HIGH SCHOOL | FR |
… | … | … | … |
With three categorical variables, the resulting contingency table will have three dimensions, which we can represent like this:
Smoking=YES | NO EDUC | HIGH SCHOOL | BACHELOR | MASTER | PHD |
---|---|---|---|---|---|
BE | | | | | |
DE | | | | | |
FR | | | | | |
LU | | | | | |
NL | | | | | |
Smoking=NO | NO EDUC | HIGH SCHOOL | BACHELOR | MASTER | PHD |
---|---|---|---|---|---|
BE | | | | | |
DE | | | | | |
FR | | | | | |
LU | | | | | |
NL | | | | | |
Alternatively, we can use a Burt table, which puts everything in a 2D table:
 | SMOKING=YES | SMOKING=NO | EDUC=NO | EDUC=HS | … | CO=NL |
---|---|---|---|---|---|---|
SMOKING=YES | | | | | | |
SMOKING=NO | | | | | | |
EDUC=NO | | | | | | |
… | | | | | | |
CO=NL | | | | | | |
If we have a mix, then we will typically use one or several of the categorical variables as grouping variables to split the dataset into groups, which brings us to the unpaired case of measuring the same thing on different individuals (the last category).
Example: measuring systolic blood pressure, education level and smoking status.
Individual | Smoking | Education | BP |
---|---|---|---|
1 | YES | BACHELOR | 133 |
2 | NO | HIGH SCHOOL | 153 |
… | … | … | … |
How we proceed depends on the question we are asking. If we want to know whether the education level has an effect on smoking and blood pressure, we can use “education level” as a grouping variable. With the same 5 education categories as above, this gives us five unpaired samples in which we measure the same two variables (smoking and BP).
Smoking is a categorical variable, so we can measure its independence from the education level by constructing a contingency table:
Smoking \ Education | NO | HS | BA | MA | PHD |
---|---|---|---|---|---|
YES | 6 | 31 | 16 | 14 | 4 |
NO | 8 | 30 | 18 | 11 | 3 |
As you may notice, this is equivalent to what we did in the “different things, same individuals” setup. For categorical variables, we have a sort of “dual” representation of the problem, and both views lead to the same test (the Chi-square independence test): we can either think of it as measuring different things (Education, Smoking) on the same individuals, OR as measuring the same thing (Smoking) in different groups (split by Education level).
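A sketch of the Chi-square independence test, applied to the Smoking × Education contingency table above with `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from the text: Smoking (rows) x Education level (columns NO/HS/BA/MA/PHD)
table = np.array([
    [6, 31, 16, 14, 4],   # Smoking = YES
    [8, 30, 18, 11, 3],   # Smoking = NO
])

# H0: Smoking and Education level are independent.
# `expected` holds the cell counts we would see under independence.
chi2_stat, p, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p:.3f}")
```

With a 2×5 table, the degrees of freedom are (2−1)×(5−1) = 4; the same call works whichever of the two “dual” framings we have in mind, since the table itself is identical.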
Blood pressure is a numerical variable, so here we will have something different: for each category of our grouping variable, we will have a distinct sample of blood pressures:
EDUCATION=NO
Individual | BP |
---|---|
1 | 118 |
… | … |
EDUCATION=HS
Individual | BP |
---|---|
7 | 131 |
… | … |
Etc…
The question we are asking could then be: is the average blood pressure the same in all groups? If the distributions are all (approximately) normal and the variances are (approximately) equal, then with more than two categories we can use one-way ANOVA (Fisher's F-test), and with two categories Student's independent-samples t-test. If the variances are unequal, we use Welch's t-test or Welch's ANOVA. If normality is not verified, we have the non-parametric equivalents: Mann-Whitney (two categories) and Kruskal-Wallis (more than two categories).
If we also want to know whether the shapes of the distributions are the same, we can use the (two-sample) Kolmogorov-Smirnov test instead.
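A minimal sketch of the parametric and non-parametric comparisons on hypothetical BP samples for three of the education groups (the values are made up; `scipy.stats` provides both tests):

```python
from scipy import stats

# Hypothetical systolic BP samples for three education groups
# (unpaired: different individuals in each group, sizes may differ).
bp_no = [118, 125, 131, 140, 122]        # EDUCATION = NO
bp_hs = [131, 128, 135, 127, 139, 130]   # EDUCATION = HS
bp_ba = [124, 133, 129, 126, 131]        # EDUCATION = BA

# Parametric: one-way ANOVA (assumes normality and roughly equal variances).
# H0: all group means are equal.
f_stat, p_anova = stats.f_oneway(bp_no, bp_hs, bp_ba)

# Non-parametric alternative when normality is doubtful: Kruskal-Wallis.
h_stat, p_kw = stats.kruskal(bp_no, bp_hs, bp_ba)

print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3f}")
```

For exactly two groups, `stats.ttest_ind` (with `equal_var=False` for Welch's version) and `stats.mannwhitneyu` play the corresponding roles.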