Paired or Unpaired: a clarification

Adrien Foucart

The notion of “paired” and “unpaired” variables is often confusing. The goal of this document is to try to give clear definitions and examples of the different cases that you may encounter.

When we are comparing two (or more) things, we have three different kinds of relationships:

  1. We are comparing the same thing on the same individuals under different conditions.
  2. We are comparing different things on the same individuals in the same conditions.
  3. We are comparing the same thing on different groups of individuals.

Comparing different things on different individuals or comparing different things in different conditions would not be very useful (for instance: if we measure the body temperature of one group and the cholesterol level of another group, there is no “relationship” to study).

In 1 and 2, the measures are paired (but will be analyzed slightly differently!). In 3, they are unpaired. Let’s take a closer look at some examples to understand how to approach each situation.

1. Same thing, same individuals, different conditions

In this case, the condition is a categorical variable (e.g. BEFORE/AFTER treatment, MORNING/NOON/EVENING, RESTING/DURING EFFORT, etc.). The thing that we measure can be a categorical or a numerical variable.

Example with a numerical variable

Measuring the cholesterol level of patients before and after treatment.

The table of data, if we put one row per measurement, would look something like this:

Individual  BEFORE/AFTER  Cholesterol
1           BEFORE        221
2           BEFORE        187
1           AFTER         203
2           AFTER         189

Note that in this table, we have two rows for each individual. To make things more readable, the table of data is therefore usually presented with one row per individual, with the BEFORE/AFTER categorical variable used to “split” the Cholesterol numerical variable in two:

Individual  Cholesterol BEFORE  Cholesterol AFTER
1           221                 203
2           187                 189

Since we are measuring the same thing on the same individual under different conditions, we will usually be interested in the difference between the measures. The relationship that we want to determine is between the CONDITION and the MEASURE: are the measures different under the different conditions?

Our null hypothesis will typically be: “there is no impact of the condition on the measure”. This can lead us to a formulation such as: “H0: Mean[Cholesterol BEFORE - Cholesterol AFTER] = 0”, and to tests such as Student’s paired-samples t-test or the Wilcoxon signed-rank test, or, if there are more than two conditions, a repeated-measures ANOVA or a Friedman test.
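These paired tests can be sketched with scipy. The cholesterol values below are made up for illustration; the key point is that the two arrays are aligned by individual:

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol levels for 8 patients, before and after treatment.
# Index i in both arrays refers to the same individual: the samples are paired.
before = np.array([221, 187, 245, 210, 198, 232, 205, 190])
after = np.array([203, 189, 230, 205, 201, 215, 198, 185])

# Paired-samples t-test. H0: Mean[before - after] = 0
t_stat, t_p = stats.ttest_rel(before, after)

# Non-parametric alternative: Wilcoxon signed-rank test on the paired differences.
w_stat, w_p = stats.wilcoxon(before, after)

print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.3f}")
```

Note that `ttest_rel` and `wilcoxon` would raise an error if the two arrays had different lengths, which is a useful sanity check: paired samples always have exactly one measurement per individual per condition.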

Example with a categorical variable

Measuring the HOUSING STATUS (Owner, Renter, No fixed home) of people suffering from a drug addiction BEFORE/AFTER being regularly followed at a specialized medical center.

Again, the table of data with one row per measurement would look like:

Individual  HOUSING STATUS  FOLLOWED
1           Renter          BEFORE
2           No fixed home   BEFORE
1           Renter          AFTER
2           Renter          AFTER

Which would lead to a contingency table such as:

HOUSING \ FOLLOWED  BEFORE  AFTER
No fixed home       12      6
Renter              43      47
Owner               6       8

This table, however, is misleading, as one individual actually appears in two different cells (one in the BEFORE column, one in the AFTER column).

We would therefore typically split the HOUSING STATUS variable to present the data with one row per individual:

Individual  HOUSING BEFORE  HOUSING AFTER
1           Renter          Renter
2           No fixed home   Renter

Leading to a contingency table such as:

HOUSING BEFORE \ HOUSING AFTER  No fixed home  Renter  Owner
No fixed home                   6              6       0
Renter                          0              41      2
Owner                           0              0       6

Our null hypothesis, as with the numerical variable, would be that the CONDITION (BEFORE/AFTER) doesn’t affect the MEASURE (HOUSING STATUS). If the MEASURE is dichotomous (only two possible outcomes) and we have two CONDITIONS, then that would lead us to a McNemar test. If the measure is dichotomous and we have more than two CONDITIONS, then we can use Cochran’s Q test. If, like here, we have more than two categories in the MEASURE, then we can use the Stuart-Maxwell test (not covered in this course).
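A minimal sketch of McNemar’s test, assuming we collapse the housing status into a dichotomous variable (“has a home” vs “no fixed home”); the counts below are made up. The statistic only depends on the discordant pairs, i.e. the individuals who changed status:

```python
from scipy import stats

# Hypothetical 2x2 table of paired outcomes (rows: BEFORE, columns: AFTER):
#                    home AFTER   no home AFTER
# home BEFORE            49             0
# no home BEFORE          6             6
b = 0  # discordant: home BEFORE -> no home AFTER
c = 6  # discordant: no home BEFORE -> home AFTER

# McNemar's chi-square statistic ignores the concordant (unchanged) pairs:
chi2 = (b - c) ** 2 / (b + c)
p = stats.chi2.sf(chi2, df=1)  # survival function of chi-square with 1 dof
print(f"McNemar: chi2 = {chi2:.2f}, p = {p:.4f}")
```

With small discordant counts like these, an exact (binomial) version of the test is preferable to the chi-square approximation; `statsmodels` provides one in `statsmodels.stats.contingency_tables.mcnemar`.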

2. Different things, same individuals, same conditions

In this case, we have a table of data that looks like this (for two “things”):

Individual  X        Y
1           \(X_1\)  \(Y_1\)
…           …        …
n           \(X_n\)  \(Y_n\)

Each thing will be a different variable. Since each variable measures a different thing, they may all be numerical (but with potentially different scales, units, etc.), all categorical, or a mix of both.

All numerical

If they are all numerical, then we will typically be interested in measuring correlations (with e.g. Pearson or Spearman), and in finding the natural “structure” of the data by looking at it as a point cloud in a feature space, leading us towards dimensionality reduction techniques (PCA, MDS, t-SNE…), clustering methods, etc. If one or more of the variables are “outcome” variables that we want to be able to predict based on the others, then that will lead us towards regression techniques.

Example: measuring body temperature, cholesterol level and systolic blood pressure in the same patients.

Individual  Temp  Cholesterol  BP
1           37.8  187          133
2           37.2  231          153

Could lead us to a variance/covariance or a correlation matrix such as:

R            Temp   Cholesterol  BP
Temp         1      0.03         -0.07
Cholesterol  0.03   1            0.22
BP           -0.07  0.22         1
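A sketch of how such a correlation matrix can be computed, using numpy and scipy on made-up measurements (six hypothetical patients):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements: one row per patient,
# columns = temperature, cholesterol, systolic BP.
data = np.array([
    [37.8, 187, 133],
    [37.2, 231, 153],
    [36.9, 205, 128],
    [37.5, 198, 141],
    [36.8, 224, 150],
    [37.1, 190, 126],
])

# Pearson correlation matrix between the three variables (columns):
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 2))

# For a single pair, scipy also gives a p-value,
# e.g. cholesterol (column 1) vs blood pressure (column 2):
r, p = stats.pearsonr(data[:, 1], data[:, 2])
rho, p_s = stats.spearmanr(data[:, 1], data[:, 2])
```

The diagonal of the matrix is always 1 (each variable is perfectly correlated with itself), and the matrix is symmetric, as in the table above.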

All categorical

If they are all categorical, then we may try to find out if they are independent using a Chi-square independence test. We can further analyze the relationships with techniques such as Correspondence analysis (e.g. MCA).

Example: measuring Smoking status, Education level, Country of residence.

Individual  Smoking  Education    Country
1           YES      BACHELOR     BE
2           NO       HIGH SCHOOL  FR

With three categorical variables, the resulting contingency table will have three dimensions, which we can represent like this:

Smoking=YES  NO EDUC  HIGH SCHOOL  BACHELOR  MASTER  PHD
BE
DE
FR
LU
NL

Smoking=NO   NO EDUC  HIGH SCHOOL  BACHELOR  MASTER  PHD
BE
DE
FR
LU
NL

Alternatively, we can use a Burt table, which puts everything in a 2D table:

             SMOKING=YES  SMOKING=NO  EDUC=NO  EDUC=HS  …  CO=NL
SMOKING=YES
SMOKING=NO
EDUC=NO
EDUC=HS
…
CO=NL
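Both representations can be built from a one-row-per-individual table. A sketch with pandas and made-up data; the Burt table is obtained by crossing the 0/1 indicator matrix (one column per category) with itself:

```python
import pandas as pd

# Hypothetical raw data: one row per individual, three categorical variables.
df = pd.DataFrame({
    "Smoking": ["YES", "NO", "NO", "YES", "NO", "YES"],
    "Education": ["BACHELOR", "HIGH SCHOOL", "BACHELOR",
                  "MASTER", "HIGH SCHOOL", "BACHELOR"],
    "Country": ["BE", "FR", "BE", "NL", "FR", "BE"],
})

# Three-dimensional contingency table: one 2D sub-table per smoking status.
table3 = pd.crosstab(df["Country"], [df["Smoking"], df["Education"]])
print(table3)

# Burt table: with D the 0/1 indicator matrix (one column per category,
# e.g. "Smoking_YES"), the Burt table is simply D' D.
D = pd.get_dummies(df).astype(int)
burt = D.T @ D
print(burt)
```

The diagonal of the Burt table holds the marginal count of each category, and the table is symmetric by construction.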

Mix

If we have a mix, then we will typically use one or several of the categorical variables as grouping variables to split the dataset into different groups. This brings us to the unpaired case of measuring the same things on different individuals: the last category.

3. Same things, different individuals

Example: measuring systolic blood pressure, education level and smoking status.

Individual  Smoking  Education    BP
1           YES      BACHELOR     133
2           NO       HIGH SCHOOL  153

How we proceed depends on the question we are trying to answer. If we want to know whether the education level has an effect on smoking and blood pressure, we can use “education level” as a grouping variable. If we use the same five education categories as above, this leads us to five unpaired samples in which we measure the same two variables (smoking and BP).

Smoking is a categorical variable, so we can measure its independence from the education level by constructing a contingency table:

Smoking \ Education  NO  HS  BA  MA  PHD
YES                  6   31  16  14  4
NO                   8   30  18  11  3

As you may notice, this is equivalent to what we’ve done in the “different things, same individuals” setup. For categorical variables, we have a sort of “dual” representation of the problem, with both views leading to the same test (the Chi-square independence test): we can either think of it as measuring different things (Education, Smoking) on the same individuals, OR as measuring the same thing (Smoking) in different groups (split by Education level).
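Running the Chi-square independence test on the table above, for instance with scipy:

```python
import numpy as np
from scipy import stats

# Contingency table from the text: Smoking (rows) x Education (columns).
table = np.array([
    [6, 31, 16, 14, 4],   # Smoking = YES
    [8, 30, 18, 11, 3],   # Smoking = NO
])

# H0: smoking status is independent of education level.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

`chi2_contingency` also returns the table of expected counts under independence, which is worth inspecting: the chi-square approximation becomes unreliable when expected counts are small (a common rule of thumb is at least 5 per cell).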

Blood pressure is a numerical variable, so here we will have something different: for each category of our grouping variable, we will have a distinct sample of blood pressures:

EDUCATION=NO

Individual BP
1 118

EDUCATION=HS

Individual BP
7 131

Etc…

The question we are asking could then be: is the average blood pressure the same in all groups? If the distributions are all normal, the variances are (approximately) equal, and we have more than two categories, we can use Fisher’s ANOVA. For two categories, we have Student’s independent-samples t-test. If the variances are unequal, we have Welch’s t-test or Welch’s ANOVA. If normality is not verified, we have the non-parametric equivalents: Mann-Whitney (two categories) and Kruskal-Wallis (more than two categories).
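A sketch of these unpaired tests with scipy, on simulated blood pressure samples (the group sizes and distribution parameters are made up). Note that, unlike the paired case, the groups may have different sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical blood pressure samples for three education groups
# (different individuals in each group: the samples are unpaired).
bp_no = rng.normal(135, 12, size=14)
bp_hs = rng.normal(130, 12, size=61)
bp_ba = rng.normal(128, 12, size=34)

# More than two groups, normality + equal variances assumed: one-way ANOVA.
f_stat, p_anova = stats.f_oneway(bp_no, bp_hs, bp_ba)

# Non-parametric alternative for more than two groups: Kruskal-Wallis.
h_stat, p_kw = stats.kruskal(bp_no, bp_hs, bp_ba)

# Two groups only: independent t-test (equal_var=False gives Welch's test).
t_stat, p_t = stats.ttest_ind(bp_no, bp_hs, equal_var=False)

# Non-parametric alternative for two groups: Mann-Whitney U.
u_stat, p_mw = stats.mannwhitneyu(bp_no, bp_hs)
```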

If we also want to know whether the shapes of the distributions are the same, we can use the (two-sample) Kolmogorov-Smirnov test instead.
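A sketch with scipy’s two-sample Kolmogorov-Smirnov test, on simulated data where the two groups share the same mean but differ in spread, a difference a t-test would miss:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical blood pressure samples from two unpaired groups:
# same mean, but group B has a much wider spread.
bp_a = rng.normal(130, 10, size=50)
bp_b = rng.normal(130, 20, size=50)

# H0: both samples are drawn from the same distribution.
# The statistic is the maximum distance between the empirical CDFs.
ks_stat, p = stats.ks_2samp(bp_a, bp_b)
print(f"KS = {ks_stat:.3f}, p = {p:.4f}")
```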