Multivariate data analysis (STAT-H400)

4. Unsupervised methods

Adrien Foucart

Introduction

What is this module about?

Reminder: Feature space

Data table: each individual \(i\) is a row with values \(x_{i1}, \dots, x_{ij}, \dots, x_{ip}\) on the variables \(X_1, \dots, X_j, \dots, X_p\).

Line vector \(x_{i.}\) represents one individual in a p-dimensional space \(\rightarrow\) “variable space” or “feature space”.

All rows together form a point cloud showing the distribution of individuals in feature space.

2D examples

Reminder: Feature space

What is a distance?

A distance function (or metric) is a function \(d(x, y)\) (\(x\), \(y\) being vectors) such that, for all vectors \(x, y, z\):

Reminder: Feature space

2D examples

Feature space transformation \(\rightarrow\) distances that are (maybe) more meaningful.

Feature space transformations are an essential part of machine learning algorithms. In fact, almost all machine learning algorithms can essentially be reduced to:

  1. Transform feature space in order to make the groups more easily separable (for classification) or more “aligned” (for regression).
  2. Find the best line to cut between the groups (for classification or clustering) or to pass through the point cloud (for regression).

Principal Component Analysis (PCA)

What is Principal Component Analysis?


Why PCA?

PCA step-by-step

First step: center the point cloud so that the origin of our new coordinate system is at the center of gravity of the point cloud \(\rightarrow\) \(x_c = x - \bar{x}\)

PCA step-by-step

Determination of the first component \(u\): the direction in the point cloud that follows the axis of maximum elongation \(\rightarrow\) it minimizes the distance between each point and the line, or, equivalently, maximizes the variance of the point cloud projected on that axis.

\(\min_u \sum_{i=1}^{n} d^2(x_{i.}, u)\)

\(\max_u \sum_{i=1}^{n} (x_{i.}^T u)^2\)

With the added constraint that \(u\) must be a unit vector (since we are building a new basis system).

Good videos to better understand eigenvectors and eigenvalues: 3Blue1Brown (https://www.youtube.com/watch?v=PFDu9oVAE-g).

The rest of that series on linear algebra is also very useful if linear algebra is not very fresh in your mind!

PCA step-by-step

Example: for angles in increments of 1 degree, the sum of the squared dot products between the points and the unit vector of that direction (scaled to fit on the graph, maximum value noted in black):
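This sweep can be reproduced in a few lines of NumPy; the elongated 2D point cloud below is hypothetical, generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical elongated 2D point cloud, zero-centered
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 1.0]])
X = X - X.mean(axis=0)

angles = np.deg2rad(np.arange(180))        # directions in increments of 1 degree
scores = []
for a in angles:
    u = np.array([np.cos(a), np.sin(a)])   # unit vector of that direction
    scores.append(np.sum((X @ u) ** 2))    # sum of squared projections on u
best = np.rad2deg(angles[int(np.argmax(scores))])
print(f"direction of maximum variance: {best:.0f} degrees")
```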

PCA step-by-step

How do we find the exact direction of \(u\)? \(\rightarrow\) through the variance-covariance matrix. We want our new basis to have maximum variance along the axis. We can formulate that as searching for an orthonormal transformation matrix \(P\) such that \(PX\) has a diagonal covariance matrix (i.e. maximum variance along the axis, minimal covariance between the new variables).

\(X_{2 \times N}\): coordinates of the samples in feature space.
\(P_{2 \times 2}\): transformation matrix.

Objective: \(Cov(X) = \begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix}\)
\(\rightarrow\) \(Cov(PX) = \begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix}\)

PCA step-by-step

How do we find the exact direction of \(u\)? \(\rightarrow\) through the variance-covariance matrix.

Assuming that \(X\) is zero-centered…

\(Cov(PX) = E[(PX-E[PX])(PX-E[PX])^T] = E[PX(PX)^T]\) (because \(X\) is zero-centered, so \(E[PX] = PE[X] = 0\))
\(Cov(PX) = E[PXX^{T}P^{T}]\) (because \((AB)^T = B^TA^T\))
\(Cov(PX) = PE[XX^{T}]P^{T}\) (because \(P\) is a constant matrix, so it can be taken out of the expectation)
\(Cov(PX) = PCov(X)P^T\) (\(E[XX^T] = Cov(X)\) if \(X\) is zero-centered)
\(Cov(PX) = PCov(X)P^{-1}\) (orthogonality constraint: \(P^T = P^{-1}\))

PCA step-by-step

\(X_{2 \times N}\): coordinates of the samples in feature space.
\(P_{2 \times 2}\): transformation matrix.

\(Cov(PX) = PCov(X)P^{-1}\)

\(\begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix} = P\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix}P^{-1}\)

\(\Rightarrow Cov(X)\) is diagonalizable by \(P\).

\(Cov(X) = P^{-1}Cov(PX)P \Rightarrow\) the column vectors of \(P^{-1}\) form a basis consisting of eigenvectors of \(Cov(X)\), and the diagonal of \(Cov(PX)\) contains the corresponding eigenvalues.

Let \(P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}\)

\(\rightarrow \begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{21} \\ p_{12} & p_{22} \end{pmatrix} \begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix} \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}\)

\(\rightarrow [p_{11}, p_{12}]\) and \([p_{21}, p_{22}]\) are the two eigenvectors, and \(\hat{s}_1^2, \hat{s}_2^2\) are the two corresponding eigenvalues.

PCA step-by-step

\(Cov(PX) = PCov(X)P^{-1} \Rightarrow Cov(X)\) is diagonalizable by \(P\).

\(Cov(X) = P^{-1}Cov(PX)P \Rightarrow\) the column vectors of \(P^{-1}\) form a basis consisting of eigenvectors of \(Cov(X)\), and the diagonal of \(Cov(PX)\) contains the corresponding eigenvalues.

Therefore, the eigendecomposition of \(Cov(X)\) gives us eigenvectors which form a new basis system with maximum variance along the axes, and the eigenvalues correspond to the per-new-axis variance.

PCA step-by-step

How do we find the exact direction of \(u\)? \(\rightarrow\) through the variance-covariance matrix.

Specifically: \(u\) is the normalized eigenvector of the variance-covariance matrix \(Cov(X)\) with the highest corresponding eigenvalue \(\lambda\).

# Xc is the centered data
S = np.cov(Xc, rowvar=False) 
eigvalues, eigvectors = np.linalg.eig(S)
# descending order of eigenvalues
o = np.argsort(eigvalues)[::-1] 
# first eigenvector
u = eigvectors[:, o[0]] 

PCA step-by-step

Determination of the next components:

\(\rightarrow\) next eigenvectors, by descending order of eigenvalues!

Eigenvalues can be interpreted as the explained variance of each principal component.

\(\rightarrow \frac{\lambda_i}{\sum_{j=1}^p \lambda_j}\) is the explained variance ratio of the \(i\)-th eigenvector.

PCA step-by-step

\(X_{2 \times N}\): coordinates of the samples in feature space.
\(P_{2 \times 2}\): transformation matrix.

\(Cov(PX) = PCov(X)P^{-1}\)

\(Cov(X) = P^{-1}Cov(PX)P\)

\(\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = P^{-1}\begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix}P\)

\(\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = \begin{pmatrix} u_1 & v_1 \\ u_2 & v_2 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} u_1 & v_1 \\ u_2 & v_2 \end{pmatrix}^{-1}\)

PCA step-by-step

Projection on the new basis system: \(X_{t} = PX\)

PCA step-by-step

\(\lambda_1 = 68.91\), \(\lambda_2 = 3.48\) \(\rightarrow\) \(EV_1 = 95\%\), \(EV_2 = 5\%\)

\(\begin{pmatrix} 37.7 & 32.7 \\ 32.7 & 34.7 \end{pmatrix} = \begin{pmatrix} 0.72 & -0.69 \\ 0.69 & 0.72 \end{pmatrix} \begin{pmatrix} 68.91 & 0 \\ 0 & 3.48 \end{pmatrix} \begin{pmatrix} 0.72 & 0.69 \\ -0.69 & 0.72 \end{pmatrix}\)
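This decomposition can be checked numerically; the matrices below are the rounded values from the slide, so the reconstruction is only approximate:

```python
import numpy as np

# rounded values from the example above
V = np.array([[0.72, -0.69],
              [0.69,  0.72]])   # eigenvectors as columns
L = np.diag([68.91, 3.48])      # eigenvalues on the diagonal
S = V @ L @ V.T                 # P^{-1} Cov(PX) P, with P^{-1} = V
print(np.round(S, 1))           # close to the covariance matrix above (up to rounding)
```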

PCA Example

Example: diabetes dataset. Reminder: we have 6 numerical variables corresponding to blood sample results (\(s_1, ..., s_6\)). Let’s do a PCA (after zero-centering the data):

S = np.cov(Xc)                   # here Xc has variables as rows (p x n)
eigvalues, eigvectors = np.linalg.eig(S)
o = np.argsort(eigvalues)[::-1]  # descending order of eigenvalues
evs = eigvectors[:, o]
Xt = (evs.T @ Xc).T              # samples projected on the new basis
Original
After PCA

PCA Example

Explained variance ratios (eigvalues[o]/eigvalues.sum()) \(\rightarrow\)

0.84, 0.09, 0.05, 0.02, 0.00, 0.00

Or, cumulatively:

0.84, 0.93, 0.98, 1.0, 1.0, 1.0

In other words, we need three components to get to 98% of the explained variance.

PCA Example

What are the contributions of the variables to the components?

PCA Example


\(s_4\) and \(s_5\) also have very low standard deviations compared to the other variables \(\rightarrow\) PCA is influenced by the scale. Rescaling by the standard deviation before PCA (X = X / np.std(X, axis=0)) can ensure that their contribution to the total variance is not ignored.

Original
After PCA

PCA Example

Explained variance ratios (eigvalues[o]/eigvalues.sum()) \(\rightarrow\)

0.55, 0.22, 0.13, 0.09, 0.01, 0.00

Or, cumulatively:

0.55, 0.77, 0.90, 0.99, 1.0, 1.0

With three components, we now get 90% of the explained variance instead of 98%.

PCA Example

What are the contributions of the variables to the components?

PCA Example

Note that after scaling, the explained variance of the first two components is much lower \(\rightarrow\) we effectively increased the dimensionality of the dataset by equalizing the influence of the different variables, whereas the original dataset was “flat” along two of the dimensions!

No scaling
Scaling

PCA Example

Link with correlations between variables \(\rightarrow\) highly correlated variables will tend to be close together when projected on the principal components.

PCA Example

Implementation: with covariance matrix… or with scikit-learn. Assuming X is a zero-centered array with samples as rows and variables as columns…

S = np.cov(X.T)
eigvalues, eigvectors = np.linalg.eig(S)
o = np.argsort(eigvalues)[::-1]
evs = eigvectors[:, o]
evrs = eigvalues[o]/eigvalues.sum()
Xt = (evs.T @ X.T).T
from sklearn.decomposition import PCA

pca = PCA(n_components=6)
Xt = pca.fit_transform(X)
# pca.components_ = evs.T
# pca.explained_variance_ratio_ = evrs

Quick quiz

Where would the first and second components be in this point cloud?

Quick quiz

PCA in individual space

Note that there is an equivalence between doing the PCA in feature space and in individual space. If \(S_f = XX^T\) is the covariance matrix in feature space (up to a \(\frac{1}{n}\) factor; \(S_f\) has size \(p \times p\) for \(p\) features), and we note \(S_i = X^TX\) the covariance matrix in individual space (with size \(n \times n\)), then:

\(v\) is an eigenvector of \(S_f\) \(\Leftrightarrow\) \(X^Tv\) is an eigenvector of \(S_i\).

Proof:

\(XX^Tv = \lambda v\) (by definition of eigenvalues/eigenvectors)
\(\Rightarrow X^T(XX^Tv) = X^T(\lambda v)\)
\(\Rightarrow (X^TX)(X^Tv) = \lambda (X^T v)\) \(\Rightarrow\) \(X^T v\) is an eigenvector of \(X^TX\).

When could that be useful?

If \(p\) is large and \(n\) is comparatively smaller, we may have a situation where computing the eigendecomposition of \(S_f\) is computationally very expensive. This can for instance happen when dealing with images, where \(p\) is the number of pixels.
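A small sketch of this trick, with hypothetical sizes (\(p = 50\) features, \(n = 10\) individuals): we diagonalize the small \(n \times n\) matrix and map its eigenvectors back to feature space.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 10                  # many features, few individuals (hypothetical sizes)
X = rng.normal(size=(p, n))    # features as rows, as in the slides

Sf = X @ X.T                   # p x p "feature space" matrix
Si = X.T @ X                   # n x n "individual space" matrix

# eigendecomposition of the small n x n matrix (eigh: Si is symmetric)
w, U = np.linalg.eigh(Si)
v = X @ U[:, -1]               # map back: X u is an eigenvector of X X^T
# check the defining relation Sf v = lambda v
assert np.allclose(Sf @ v, w[-1] * v)
```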

PCA Summary (so far)

Dimensionality reduction

Since the first components capture a large proportion of the variance in the data, we can discard the last components while preserving most of the information.

Dimensionality reduction

PCA in this case becomes a transformation from a \(p\) dimensional space (feature space) into a \(q \leq p\) dimensional space. Typically, we will choose \(q\) such that the cumulative explained variance ratio is \(>80\%\) or \(>90\%\).

Original data

Scaled data

Dimensionality reduction

Very useful for visualization (and sometimes as a preprocessing step for further data analysis)… but may misrepresent some of the points \(\rightarrow\) check the distance between each point and its projection (i.e. distance between the point and the hyperplane formed by the \(q\) PCs).
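One way to run this check, sketched on hypothetical random data: project on the first \(q\) components, back-project into feature space, and measure the per-point residual.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # hypothetical data, samples as rows
Xc = X - X.mean(axis=0)

S = np.cov(Xc, rowvar=False)
eigvalues, eigvectors = np.linalg.eigh(S)  # eigh: S is symmetric
o = np.argsort(eigvalues)[::-1]            # descending order of eigenvalues
q = 2
Vq = eigvectors[:, o[:q]]                  # first q principal directions

Xt = Xc @ Vq                               # coordinates in the q-dim subspace
Xr = Xt @ Vq.T                             # back-projection into feature space
residual = np.linalg.norm(Xc - Xr, axis=1) # distance point <-> its projection
print(residual.max())                      # worst-represented point
```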

Dimensionality reduction

Higher \(q\) \(\rightarrow\) more explained variance captured by principal components \(\rightarrow\) less residual projection error.

Original data

Scaled data

Reading

Quick Quiz

I have a dataset with radiology images, from patients with abdominal cancer and from healthy patients. After extracting 25 image features from the images, I perform PCA on those 25 numerical variables and keep the first 7 components, representing 92% of the variance.

In this new, 7-dimensional feature space, healthy and cancer patients will be…

  1. More easily separated (i.e. forming two “more distinct” point clouds).
  2. Less easily separated (i.e. forming two “more overlapping” point clouds).
  3. Exactly as separated (PCA has no impact on separability).
  4. We don’t know (PCA may have an impact in one direction or the other).

Answers:

  1. No, that cannot happen. Since we are just rotating the point cloud (and then removing dimensions), we cannot separate point clouds that are overlapping. However…

  2. This may happen if we have two classes that are well separated, but where the separation is not along an axis of high variance. Example: think about two thin, parallel, elongated point clouds, one per class.

  3. This is the best-case scenario for separability.

  4. No, it can’t go in both directions.

Categorical variables: Multiple Correspondence Analysis

PCA works on numerical variables. What can we do if we have categorical variables?

(Note: it’s a bit more complex than that… but we’ll not cover MCA more deeply in this course.)

Non-linear dimensionality reduction


MDS

MDS: Multidimensional Scaling.

Goal: find a low-dimensional (typically 2 or 3D) representation of the data which preserves the distances between points in the high-dimensional space.

Iterative process that tries to minimize the residual sum of squares between the distances in projected space and the distances in the original space.

Formally, if \(x_i\) is the position of point \(i\) in the original \(m\)-dimensional space, and \(\hat{x}_i\) is the position of point \(i\) in the projected \(q\)-dimensional space (\(q < m\)), then we want to minimize the stress: \(S = \sqrt{\sum_{i < j}(||x_i-x_j|| - ||\hat{x}_i-\hat{x}_j||)^2}\).

See also: https://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling

MDS

\(S = \sqrt{\sum_{i < j}(||x_i-x_j|| - ||\hat{x}_i-\hat{x}_j||)^2}\)

More generally, we can substitute \(||x_i-x_j||\) by any “dissimilarity metric” \(D_{ij}\).

\(S = \sqrt{\sum_{i < j}(D_{ij} - ||\hat{x}_i-\hat{x}_j||)^2}\)

A “non-metric” version also exists, which focuses on preserving the ranking of the distances rather than the distances themselves (i.e. there should be a monotonic relationship between \(D_{ij}\) and \(||\hat{x}_i-\hat{x}_j||\), but they don’t need to be equal).

Note that MDS is very slow if the number of points gets large.
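A minimal sketch with scikit-learn's `MDS` on a precomputed dissimilarity matrix (the data here is hypothetical; any \(D_{ij}\) could be plugged in):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                         # hypothetical high-dim data
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise dissimilarities

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Xhat = mds.fit_transform(D)    # 2D configuration preserving the distances
print(Xhat.shape, mds.stress_) # residual stress after optimization
```

Note the `random_state`: as mentioned below, the optimization has some randomness, so the result should be checked for stability.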

MDS Example

From Foucart, 2022 (PhD dissertation)
https://research.adfoucart.be/thesis/7-iv.html

Example: agreement between experts in prostate cancer annotations.

\(D_{ij} = 1-\kappa_U(i, j)\), where \(\kappa_U(i, j)\) is the unweighted Cohen’s kappa (agreement measure) between the annotations of expert \(i\) and expert \(j\).


MDS Example

Residuals = “stress”: difference between distance in projected space and dissimilarity.

MDS Example

Some randomness to the process: always check the stability of the result!

t-SNE

Similar idea, but based on joint probabilities. Given two points \(i\) and \(j\), the conditional probability \(p_{j|i}\) (loosely interpretable as “probability that \(j\) is in the neighbourhood of \(i\)”) is defined as:

\(p_{j|i} = \frac{exp(-||x_i-x_j||^2/2\sigma_i^2)}{\sum_{k \neq i}exp(-||x_i-x_k||^2/2\sigma_i^2)}\) if \(i \neq j\), and \(p_{i|i} = 0\).

The joint probability \(p_{ij}\) is defined as: \(p_{ij} = \frac{p_{j|i}+p_{i|j}}{2N}\).

This similarity can also be computed in the projected space (\(q_{ij}\), with a slightly different definition out of scope for this course). t-SNE tries to minimize the Kullback-Leibler divergence between the distributions of probabilities before and after projection:

\(KL(P || Q) = \sum_{i \neq j}p_{ij}log\frac{p_{ij}}{q_{ij}}\).
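In practice this optimization is available in scikit-learn; a minimal sketch (the two-blob dataset is hypothetical):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# hypothetical data: two well-separated blobs in 50 dimensions
X = np.vstack([rng.normal(0, 1, size=(50, 50)),
               rng.normal(8, 1, size=(50, 50))])

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Xt = tsne.fit_transform(X)   # minimizes KL(P || Q) by gradient descent
print(Xt.shape)              # (100, 2)
```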

t-SNE

From Shi et al., Visualizing and understanding graph convolutional network, 2021.

Example

Medical MNIST (https://www.kaggle.com/datasets/andrewmvd/medical-mnist)

Abdomen CT

Breast MRI

Chest CT

Chest XR

Hand XR

Head CT

Example

Original feature space: 64x64 pixels = 4096 dimensions. Using a subset of 6x1000 images.

PCA \(\rightarrow\) 2 dimensions

MDS \(\rightarrow\) 2 dimensions

t-SNE \(\rightarrow\) 2 dimensions

Example

Big advantage of PCA: the dimensions are interpretable.

Reprojection of vectors along the main directions in PC space.

Example

t-SNE: choice of perplexity (recommended between 5 and 50 in van der Maaten and Hinton, 2008):

Perplexity=5

Perplexity=30

Perplexity=50

Autoencoders

Autoencoders

Autoencoders

# excerpt from a PyTorch nn.Module; encoding_dim is the size of the latent space
self.encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1, stride=2), # 32x32
    nn.GELU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2), # 16x16
    nn.GELU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1, stride=2), # 8x8
    nn.GELU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1, stride=2), # 4x4
    nn.GELU(),
    nn.Flatten(),
    nn.Linear(4*4*32, encoding_dim)
)
self.decoder = nn.Sequential(
    nn.Linear(encoding_dim, 4*4*32),
    nn.Unflatten(1, (32, 4, 4)),
    nn.GELU(),
    nn.ConvTranspose2d(32, 32, kernel_size=3, padding=1, stride=2, output_padding=1), # 8x8
    nn.GELU(),
    nn.ConvTranspose2d(32, 32, kernel_size=3, padding=1, stride=2, output_padding=1), # 16x16
    nn.GELU(),
    nn.ConvTranspose2d(32, 16, kernel_size=3, padding=1, stride=2, output_padding=1), # 32x32
    nn.GELU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, padding=1, stride=2, output_padding=1), # 64x64
    nn.Tanh()
)

Summary of dimensionality reduction

Clustering

Basic principles

Given a data table of \(n\) individuals and \(p\) variables, group the individuals (or, less frequently, the variables) into a small number of homogeneous groups.

Note: this is still unsupervised, so we are looking for “natural” groups into which the data can be separated.

Two main types of approaches: non-hierarchical and hierarchical.

Basic principles

Non-hierarchical methods

Partitioning of the point cloud into \(k\) groups.

Basic principles

Warning: the groups may differ from the actual “classes” of a supervised problem!

PCA with supervised classes

PCA with clusters (K-Means)

Basic principles

Hierarchical methods.

Partitioning into groups and subgroups.

Bottom-up methods (agglomerative)

Basic principles

Top-down methods (divisive)

Basic principles

There is no single best criterion or best method for clustering. Different methods, distance metrics, and criteria will lead to different points of view on the data, which may sometimes be useful to combine and compare.

Distance and dissimilarity

A distance metric must respect four properties:

  1. Non-negativity: \(d(x, y) \geq 0\)
  2. Identity: \(d(x, y) = 0 \Leftrightarrow x = y\)
  3. Symmetry: \(d(x, y) = d(y, x)\)
  4. Triangle inequality: \(d(x, y) + d(y, z) \geq d(x, z)\)

A dissimilarity measure only needs to respect (1), (2) and (3).

Example: cosine distance \(CD(x, y) = 1 - \frac{x \cdot y}{||x|| ||y||} = 1-cos(\angle xy)\) is not a true distance metric (if \(\angle xy = \angle yz = \frac{\pi}{4}\) and \(\angle xz = \frac{\pi}{2}\), then \(CD(x, y) = CD(y, z) = 1-\frac{1}{\sqrt{2}}\) and \(CD(x, z) = 1 \Rightarrow CD(x,y) + CD(y, z) = 2 - \frac{2}{\sqrt{2}} \approx 0.59 < 1\)).
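The counterexample can be checked numerically, using unit vectors at 0, 45 and 90 degrees:

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# unit vectors at 0, 45 and 90 degrees, as in the counterexample
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0]) / np.sqrt(2)
z = np.array([0.0, 1.0])

lhs = cosine_distance(x, y) + cosine_distance(y, z)   # 2 - 2/sqrt(2) ~ 0.59
rhs = cosine_distance(x, z)                           # 1.0
print(lhs < rhs)   # True: the triangle inequality is violated
```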

Distance metrics

For numerical data:

Euclidean distance: easy to understand and calculate, very common \(\rightarrow\) the region of equal distance to a point is a hypersphere \(\rightarrow\) good at identifying spherical clusters.

When variables have different variances, however, the Euclidean distance gives too much importance to high-variance dimensions \(\rightarrow\) compute the Euclidean distance on normalized variables.

Distance metrics

We can formulate (squared) distance metrics using a metric tensor \(M\) such that \(d^2(x, y) = (x-y)^T M (x-y)\).

Euclidean distance: \(M = I \Rightarrow d^2(x, y) = \begin{pmatrix} x_1-y_1 & \cdots & x_p-y_p \end{pmatrix} \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix} \begin{pmatrix} x_1-y_1 \\ \vdots \\ x_p-y_p \end{pmatrix}\)

Standardized Euclidean distance: \(M = diag(1/s_i^2) \Rightarrow d^2(x, y) = \begin{pmatrix} x_1-y_1 & \cdots & x_p-y_p \end{pmatrix} \begin{pmatrix} 1/s_1^2 & & \\ & \ddots & \\ & & 1/s_p^2 \end{pmatrix} \begin{pmatrix} x_1-y_1 \\ \vdots \\ x_p-y_p \end{pmatrix}\).

Distance metrics

Mahalanobis distance (takes into account the inter-variable correlations \(\rightarrow\) use of the covariance matrix): \(M = S^{-1}\).

Same as Euclidean distance after PCA and scaling. (Note: PCA, then scaling in PCA space, not scaling in feature space + PCA!)
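This equivalence (Mahalanobis = Euclidean after PCA, then scaling in PC space) can be checked on hypothetical correlated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical correlated 2D data
X = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=500)
S = np.cov(X, rowvar=False)

# Mahalanobis: metric tensor M = S^-1
M = np.linalg.inv(S)
d = X[0] - X[1]
d2 = d @ M @ d                  # squared Mahalanobis distance

# same thing as Euclidean distance after PCA + per-axis scaling
w, V = np.linalg.eigh(S)        # eigendecomposition of the covariance matrix
Xw = (X @ V) / np.sqrt(w)       # project on the PCs, scale by the std dev
d2_check = np.sum((Xw[0] - Xw[1]) ** 2)
print(d2, d2_check)             # equal up to floating point error
```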

Distance metrics

For categorical data:

Start by forming the indicator table (cf. MCA) \(\rightarrow m = \sum_p m_p\) dimensions, with \(m_p\) the number of possible categories for variable \(p\). Then, we can compute a “euclidean distance”:

\(d^2(x, y) = \sum_{j=1}^m (x_j-y_j)^2\), which is equal to the number of discordant values between the two individuals in the indicator table. This is equal to \(2 \times\) the number of variables for which the values differ.

\(\begin{matrix} x = & 1 0 & 0 1 & 0 1 0 \\ y = & 0 1 & 0 1 & 1 0 0 \\ (x - y)^2 = & 1 1 & 0 0 & 1 1 0 \end{matrix} \rightarrow d^2(x, y) = 4\)

Distance metrics

Since we have \(0 \leq d^2(x, y) \leq 2p\) with \(p\) the number of initial variables, we can normalize this distance into a similarity measure:

\(s(x, y) = \frac{2p-d^2(x, y)}{2p} \rightarrow 0 \leq s(x, y) \leq 1\) is the % of variables with the same value.

\(\begin{matrix} x = & 1 0 & 0 1 & 0 1 0 \\ y = & 0 1 & 0 1 & 1 0 0 \\ (x - y)^2 = & 1 1 & 0 0 & 1 1 0 \end{matrix} \rightarrow d^2(x, y) = 4\)

\(\rightarrow s(x, y) = \frac{6-4}{6} = \frac{1}{3}\)
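The worked example above, checked in a few lines of NumPy (the indicator rows encode 3 variables with 2, 2 and 3 categories):

```python
import numpy as np

# indicator rows from the example: 3 variables with 2, 2 and 3 categories
x = np.array([1, 0, 0, 1, 0, 1, 0])
y = np.array([0, 1, 0, 1, 1, 0, 0])

p = 3                       # number of original variables
d2 = np.sum((x - y) ** 2)   # number of discordant indicator cells
s = (2 * p - d2) / (2 * p)  # fraction of variables with the same value
print(d2, s)                # 4 0.333...
```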

Distance metrics

What about mixed data?

Possibility 1: as in MCA, convert numerical data into categorical data by binning the variables \(\rightarrow\) lots of information lost.

Possibility 2: use “mixed” measure. General idea: \(d^2(x, y) = \sum_{j=1}^p \delta_j(x, y)\) where \(\delta_j\) is the contribution of variable \(j\) to the distance.

\(\delta_j\) are for instance constructed so that \(0 \leq \delta_j \leq 1\) and \(\delta_j(x, y) = 0 \Leftrightarrow x = y\). So we have similarly scaled and independent contributions of the different variables to the total distance.

Example: for numerical variables, \(\delta_j(x, y) = \frac{(x_j-y_j)^2 - min_{x^*, y^*}(x_j^*-y_j^*)^2}{max_{x^*, y^*}(x_j^*-y_j^*)^2-min_{x^*, y^*}(x_j^*-y_j^*)^2}\)
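A sketch of such a \(\delta_j\) for one numerical variable; the helper name and the toy values are made up for illustration:

```python
import numpy as np

def delta_numerical(xj, yj, col):
    """Range-normalized squared difference for one numerical variable.

    col contains all observed values of that variable, so the min and max
    squared differences are taken over all pairs of individuals.
    """
    diffs2 = (col[:, None] - col[None, :]) ** 2
    lo, hi = diffs2.min(), diffs2.max()
    return ((xj - yj) ** 2 - lo) / (hi - lo)

col = np.array([1.0, 2.0, 5.0, 9.0])
print(delta_numerical(1.0, 9.0, col))   # largest observed gap -> 1.0
print(delta_numerical(2.0, 2.0, col))   # identical values -> 0.0
```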

Hierarchical clustering

Bottom-up approach: iteratively group closest individuals (or groups).

Start from all separated individuals: \(G = \{g_i\}, g_i = \{x_i\}\) where \(g_i\) is the \(i\)-th group and \(x_i\) the \(i\)-th individual.

  1. Compute the pairwise distances of the \(n\) initial groups: \(D_{n \times n}\) with \(D_{ij} = d(g_i, g_j) = d(x_i, x_j)\). This gives us a symmetric matrix with 0s on the diagonal.

  2. Find the non-diagonal minimum of \(D\) (i.e. the pair of elements that are closest together). Let \(m, n\) be the indices of these elements: merge those two elements so that \(G \leftarrow G \setminus \{g_m, g_n\} \cup \{g_k = g_m \cup g_n\}\) (i.e. remove “old” groups, add new group formed by the union of the two).

Hierarchical clustering

  3. Recompute the distance matrix \(D_{(n-1) \times (n-1)}\) with \(D_{ij} = d(g_i, g_j)\).

  4. Repeat steps 2 and 3 until a single group remains.

How do we define \(d(g_i, g_j)\) when \(g_i\) and/or \(g_j\) contain more than a single element? Different possibilities:

Single linkage = “minimum distance”: \(d(g_i, g_j) = min_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)

Complete linkage = “maximum distance”: \(d(g_i, g_j) = max_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)

Average linkage: \(d(g_i, g_j) = \frac{1}{|g_i||g_j|} \sum_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)

Hierarchical clustering

Hierarchical clustering

Other common metric: the Ward criterion \(\rightarrow\) minimizes the increase in intra-cluster variance caused by the merge:

\(d(g_i, g_j) = \sum_{x \in g_i \cup g_j} ||x-\mu_{g_i \cup g_j}||^2 - \sum_{x \in g_i}||x-\mu_{g_i}||^2 - \sum_{x \in g_j} ||x - \mu_{g_j}||^2\)

Hierarchical clustering

Representation of the clusters’ hierarchy with the dendrogram: it shows the successive pairings. To get \(k\) clusters, we can “cut” the tree at the desired “height”.
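This whole procedure is implemented in SciPy; a minimal sketch on a hypothetical two-blob dataset, where `fcluster` “cuts” the tree (and `scipy.cluster.hierarchy.dendrogram` can draw it):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# hypothetical data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

Z = linkage(X, method='ward')                    # agglomerative, Ward criterion
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(np.unique(labels))                         # [1 2]
```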

Hierarchical clustering

Criteria for where to cut the tree:

May be useful to remove outliers before starting the clustering.

Hierarchical clustering

Summary: the results of the hierarchical clustering will be influenced by…

Note: same principle can also be applied on the features for dimensionality reduction (i.e. pair most similar features and “merge” them into agglomerated features).

Non-hierarchical clustering

Basic principle: partition the point cloud into \(k\) clusters by optimizing some “fitness” function. Usually iterative methods:

  1. Starting from initial clusters (allocated how?)
  2. Estimate fitness function (which?)
  3. Do something that changes the clusters (do what?)
  4. Repeat 2-3 until we converge to a stable solution.

K-Means

Most well-known non-hierarchical clustering method: K-Means.

In its most “basic” version:

  1. Initial cluster centers are chosen randomly. Points are assigned to closest center (according to a chosen distance metric).
  2. Fitness function: minimize the intra-cluster inertia: \(I_w = \sum_{c=1}^{k} \sum_{p \in g_c}d^2(p, \mu_c)\).
  3. Cluster centers are updated (centroid of the points assigned to the cluster).
  4. Repeat 2-3 until clusters are stable.
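The four steps above can be sketched as a minimal NumPy implementation (Lloyd's algorithm; the dataset and the seed are arbitrary):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch ("basic" version, random initialization)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # 1. random init
    for _ in range(n_iter):
        # assign each point to its closest center (Euclidean distance)
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        labels = d.argmin(axis=1)
        # 3. update centers to the centroid of their assigned points
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):                 # 4. converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),   # hypothetical two-blob data
               rng.normal(4, 0.5, size=(30, 2))])
labels, centers = kmeans(X, k=2)
print(np.unique(labels, return_counts=True))
```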

K-Means

K-Means assumptions:

K-Means

Initialization: random initialization is not optimal (and not stable). Possible solutions:

K-Means - Medical MNIST

K-Means - Random initialization

K-Means - Assign to closest

K-Means - Update centroids & reassign

K-Means - Update centroids & reassign

K-Means - Update centroids & reassign

K-Means - Converged

Mixed methods

Combining methods can lead to better (or at least more robust) results. Possible pipeline:

  1. Remove outliers.
  2. Hierarchical clustering, cut the tree based on distance criterion.
  3. Non-hierarchical clustering initialized with cluster centers from 2.
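Steps 2 and 3 of such a pipeline can be sketched with SciPy and scikit-learn, on hypothetical three-blob data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# hypothetical data: three blobs
X = np.vstack([rng.normal(0, 0.6, size=(40, 2)),
               rng.normal(5, 0.6, size=(40, 2)),
               rng.normal([0, 5], 0.6, size=(40, 2))])

# 2. hierarchical clustering, cut the tree into k clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')

# 3. non-hierarchical clustering initialized with those cluster centers
init = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.cluster_centers_)
```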

Full unsupervised pipeline

Objective: describe a dataset and find the natural structure of the data, the interactions (relationships) between variables, and the natural groupings of the data.

Example pipeline:

  1. Use PCA or MCA \(\rightarrow\) visualization in 2 or 3D, remove “unnecessary” dimensions, make the Euclidean distance more appropriate (particularly with scaling before or after PCA).
  2. Use different clustering methods to identify stable groups.
  3. Check visually that clusters “make sense” with the data.
  4. Validate the quality of the clustering

Quality control

Different ways to control the “quality” of the clustering.

Hypothesis test: \(H_0\) is that there is no natural partitioning of the data (and \(H_1\) that there is a natural partitioning into \(k\) groups).

We can choose a test statistic (e.g. ratio between inter- and intra-cluster inertia \(\lambda = \frac{I_B}{I_W}\)). What would this statistic look like under \(H_0\)?

\(\rightarrow\) simulation is a useful tool in this case!

Quality control

Example: take global parameters of the point cloud (mean, variance…), generate \(N\) random point clouds based on those parameters.

Given a clustering method:
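A sketch of this simulation approach, assuming K-Means as the clustering method and the inter/intra inertia ratio \(\lambda = \frac{I_B}{I_W}\) as the test statistic (the dataset and the number of simulated clouds are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_ratio(X, k):
    """Test statistic: inter- over intra-cluster inertia for a K-Means clustering."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    mu = X.mean(axis=0)
    i_w = km.inertia_                      # intra-cluster inertia
    i_b = sum(np.sum(km.labels_ == c) * np.sum((center - mu) ** 2)
              for c, center in enumerate(km.cluster_centers_))
    return i_b / i_w

rng = np.random.default_rng(0)
# hypothetical data with two real groups
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
observed = inertia_ratio(X, k=2)

# null distribution: random clouds with the same mean and covariance as X
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
null = [inertia_ratio(rng.multivariate_normal(mean, cov, size=len(X)), k=2)
        for _ in range(20)]
p_value = np.mean([s >= observed for s in null])
print(observed, p_value)
```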

Quality control

Other possibility: run different clustering methods (or with different parameters), compare the results using a quality index.

Example: Dunn’s index.

\(Dunn = \frac{S_{min}}{D_{max}}\) where \(S_{min}\) is the minimum separation between clusters (minimum distance between two individuals from different clusters) and \(D_{max}\) is the maximum diameter (maximum distance between two individuals of the same cluster).

Dunn’s index should be higher for better clustering solutions (i.e. more compact groups, more separation between groups).
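A minimal implementation of Dunn's index (the clustered dataset is hypothetical):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: min inter-cluster separation / max intra-cluster diameter."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    same = labels[:, None] == labels[None, :]
    s_min = D[~same].min()   # closest pair across different clusters
    d_max = D[same].max()    # farthest pair within a single cluster
    return s_min / d_max

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),   # two compact blobs
               rng.normal(5, 0.3, size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)
dunn = dunn_index(X, labels)
print(dunn)   # high for compact, well-separated clusters
```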

Quality control

…but remember that such a quality index doesn’t tell us anything about how “meaningful” the clusters are!

Interpreting the results requires looking into the data: what do the points in the same clusters have in common?

\(\Rightarrow\) data analysis can’t be focused only on the numbers and the abstraction of the “real world”.

Conclusions

Finding the groups is a lot easier when we have some labeled examples \(\rightarrow\) supervised (or semi-supervised) methods.