4. Unsupervised methods
| Individual | \(X_1\) | … | \(X_j\) | … | \(X_p\) |
|---|---|---|---|---|---|
| … | … | … | … | … | … |
| i | \(x_{i1}\) | … | \(x_{ij}\) | … | \(x_{ip}\) |
| … | … | … | … | … | … |
The row vector \(x_{i.}\) represents one individual in a \(p\)-dimensional space \(\rightarrow\) “variable space” or “feature space”.
All rows together form a point cloud showing the distribution of individuals in feature space.





A distance function (or metric) is a function \(d(x, y)\) (\(x\), \(y\) being vectors) such that, for all vectors \(x, y, z\):
1. \(d(x, y) \geq 0\) (non-negativity)
2. \(d(x, y) = 0 \Leftrightarrow x = y\) (identity of indiscernibles)
3. \(d(x, y) = d(y, x)\) (symmetry)
4. \(d(x, z) \leq d(x, y) + d(y, z)\) (triangle inequality)
Feature space transformation \(\rightarrow\) distances that are (maybe) more meaningful.
Feature space transformations are an essential part of machine learning algorithms. In fact, almost all machine learning algorithms can be essentially reduced to:

Why PCA?
First step: center the point cloud so that the origin of our new coordinate system is at the center of gravity of the point cloud \(\rightarrow\) \(x_c = x - \bar{x}\)

Determination of the first component \(u\): direction in the point cloud that follows the axis of maximum elongation \(\rightarrow\) minimizes the distance between each point and the line, or maximizes the variance of the point cloud projected on that axis.
\(\min_u \sum_{i=1}^{n} d^2(x_{i.}, u)\)
\(\max_u \sum_{i=1}^{n} (x_{i.}^T u)^2\)
With the added constraint that \(u\) must be a unit vector (since we are building a new basis system).

Good videos to better understand eigenvectors and eigenvalues: 3Blue1Brown (https://www.youtube.com/watch?v=PFDu9oVAE-g).
The rest of that series on linear algebra is also very useful if linear algebra is not very fresh on your mind!
Example: sweeping angles in increments of 1 degree, we compute the sum of the squared dot products between the points and the unit vector of each direction (scaled to fit on the graph, maximum value noted in black):
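This brute-force angle sweep can be sketched in numpy (toy 2D data, not from the slides; the cloud is built so its main axis lies near 45°):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 2D cloud elongated along the y = x direction, then centered
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 2.5], [2.5, 3.0]])
Xc = X - X.mean(axis=0)

angles = np.deg2rad(np.arange(180))  # directions by increments of 1 degree
# for each direction: sum of squared projections sum_i (x_i . u)^2
scores = [np.sum((Xc @ np.array([np.cos(a), np.sin(a)])) ** 2) for a in angles]
best = np.rad2deg(angles[np.argmax(scores)])  # axis of maximum elongation
print(best)
```

For this cloud the maximum lands close to 45 degrees, the population direction of maximum variance.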

How do we find the exact direction of \(u\)? \(\rightarrow\) through the variance-covariance matrix. We want our new basis to have maximum variance along the axis. We can formulate that as searching for an orthonormal transformation matrix \(P\) such that \(PX\) has a diagonal covariance matrix (i.e. maximum variance along the axis, minimal covariance between the new variables).

\(X_{2 \times N}\): coordinates of the samples in feature space. \(P_{2 \times 2}\): transformation matrix.
Objective: \(Cov(X) = \begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix}\) \(\rightarrow\) \(Cov(PX) = \begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix}\)
Assuming that \(X\) is zero-centered…
\(Cov(PX) = E[(PX-E[PX])(PX-E[PX])^T] = E[PX(PX)^T]\) (because \(E[PX] = PE[X] = 0\), since \(X\) is zero-centered)
\(Cov(PX) = E[PXX^{T}P^{T}]\) (because \((AB)^T = B^TA^T\))
\(Cov(PX) = PE[XX^{T}]P^{T}\) (because \(X\) is a random variable, but \(P\) has a fixed value)
\(Cov(PX) = PCov(X)P^T\) (\(E[XX^T] = Cov(X)\) if \(X\) is zero-centered)
\(Cov(PX) = PCov(X)P^{-1}\) (orthonormality constraint: \(P^T = P^{-1}\))

\(Cov(PX) = PCov(X)P^{-1}\)
\(\begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix} = P\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix}P^{-1}\)
\(\Rightarrow Cov(X)\) is diagonalizable by \(P\).
\(Cov(X) = P^{-1}Cov(PX)P \Rightarrow\) the column vectors of \(P^{-1}\) form a basis consisting of eigenvectors of \(Cov(X)\), and the diagonal of \(Cov(PX)\) contains the corresponding eigenvalues.
Let \(P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}\)
\(\rightarrow \begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{21} \\ p_{12} & p_{22} \end{pmatrix} \begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix} \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}\)
\(\rightarrow [p_{11}, p_{12}]\) and \([p_{21}, p_{22}]\) are the two eigenvectors, and \(\hat{s}_1^2, \hat{s}_2^2\) are the two corresponding eigenvalues.
Therefore, the eigendecomposition of \(Cov(X)\) gives us eigenvectors which form a new basis system with maximum variance along the axes, and the eigenvalues correspond to the per-new-axis variance.
Specifically: \(u\) is the normalized eigenvector of the variance-covariance matrix \(Cov(X)\) with the highest corresponding eigenvalue \(\lambda\).
import numpy as np
# Xc is the centered data (samples as rows)
S = np.cov(Xc, rowvar=False)
eigvalues, eigvectors = np.linalg.eig(S)
# descending order of eigenvalues
o = np.argsort(eigvalues)[::-1]
# first eigenvector
u = eigvectors[:, o[0]]
Determination of the next components:
\(\rightarrow\) next eigenvectors, by descending order of eigenvalues!
Eigenvalues can be interpreted as the explained variance of each principal component.
\(\rightarrow \frac{\lambda_i}{\sum_{j=1}^p \lambda_j}\) is the explained variance ratio of the \(i\)-th eigenvector.

\(Cov(PX) = PCov(X)P^{-1}\)
\(Cov(X) = P^{-1}Cov(PX)P\)
\(\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = P^{-1}\begin{pmatrix} \hat{s}_1^2 & 0 \\ 0 & \hat{s}_2^2 \end{pmatrix}P\)
\(\begin{pmatrix} s_1^2 & s_{12} \\ s_{12} & s_2^2 \end{pmatrix} = \begin{pmatrix} u_1 & v_1 \\ u_2 & v_2 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} u_1 & v_1 \\ u_2 & v_2 \end{pmatrix}^{-1}\)

Projection on the new basis system: \(X_{t} = PX\)

\(\lambda_1 = 68.91\), \(\lambda_2 = 3.48\) \(\rightarrow\) \(EV_1 = 95\%\), \(EV_2 = 5\%\)
\(\begin{pmatrix} 37.7 & 32.7 \\ 32.7 & 34.7 \end{pmatrix} = \begin{pmatrix} 0.72 & -0.69 \\ 0.69 & 0.72 \end{pmatrix} \begin{pmatrix} 68.91 & 0 \\ 0 & 3.48 \end{pmatrix} \begin{pmatrix} 0.72 & 0.69 \\ -0.69 & 0.72 \end{pmatrix}\)
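The worked example can be checked numerically with numpy (values taken from the slide; `np.linalg.eigh` is the routine suited to symmetric matrices and returns eigenvalues in ascending order):

```python
import numpy as np

S = np.array([[37.7, 32.7], [32.7, 34.7]])  # Cov(X) from the example
lam, V = np.linalg.eigh(S)                  # eigh: ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]              # reorder to descending
print(lam)              # close to [68.9, 3.5], as on the slide
print(lam / lam.sum())  # explained variance ratios, close to [0.95, 0.05]
```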
Example: diabetes dataset. Reminder: we have 6 numerical variables corresponding to blood sample results (\(s_1, ..., s_6\)). Let’s do a PCA (after zero-centering the data):
# Xc: zero-centered data, samples as rows
S = np.cov(Xc, rowvar=False)
eigvalues, eigvectors = np.linalg.eig(S)
o = np.argsort(eigvalues)[::-1]
evs = eigvectors[:, o]
Xt = (evs.T @ Xc.T).T
Explained variance ratios (eigvalues[o]/eigvalues.sum()) \(\rightarrow\) 0.84, 0.09, 0.05, 0.02, 0.00, 0.00
Or, cumulatively: 0.84, 0.93, 0.98, 1.0, 1.0, 1.0
In other words, we need three components to get to 98% of the explained variance.

What are the contributions of the variables to the components?


\(s_4\) and \(s_5\) also have very low standard deviations compared to the other variables \(\rightarrow\) PCA is influenced by the scale. Rescaling by the standard deviation before PCA (X = X / np.std(X, axis=0)) can ensure that their contribution to the total variance is not ignored.
Explained variance ratios (eigvalues[o]/eigvalues.sum()) \(\rightarrow\) 0.55, 0.22, 0.13, 0.09, 0.01, 0.00
Or, cumulatively: 0.55, 0.77, 0.90, 0.99, 1.0, 1.0
With three components, we now get 90% of the explained variance instead of 98%.

What are the contributions of the variables to the components?

Note that after scaling, the explained variance of the first two components is much lower \(\rightarrow\) we effectively increased the dimensionality of the dataset by equalizing the influence of the different variables, whereas the original dataset was “flat” along two of the dimensions!
Link with correlations between variables \(\rightarrow\) highly correlated variables will tend to be close together when projected on the principal components.


Implementation: with covariance matrix… or with scikit-learn.
Assuming X is a zero-centered array with samples as rows and variables as columns…
S = np.cov(X.T)
eigvalues, eigvectors = np.linalg.eig(S)
o = np.argsort(eigvalues)[::-1]
evs = eigvectors[:, o]
evrs = eigvalues[o]/eigvalues.sum()
Xt = (evs.T @ X.T).T

from sklearn.decomposition import PCA
pca = PCA(n_components=6)
Xt = pca.fit_transform(X)
# pca.components_ = evs.T
# pca.explained_variance_ratio_ = evrs
Where would the first and second component be in this point cloud?

Note that there is an equivalence between doing the PCA in feature space and in individual space. If \(S_f = XX^T\) is the covariance matrix in feature space (meaning that \(S_f\) will have the size \(p \times p\) for \(p\) features), and we note \(S_i = X^TX\) the covariance matrix in individual space (with size \(n \times n\)), then:
\(v\) is an eigenvector of \(S_f\) \(\Leftrightarrow\) \(X^Tv\) is an eigenvector of \(S_i\).
Proof:
\(XX^Tv = \lambda v\) (by definition of eigenvalues/eigenvectors)
\(\Rightarrow X^T(XX^Tv) = X^T(\lambda v)\)
\(\Rightarrow (X^TX)(X^Tv) = \lambda (X^T v)\) \(\Rightarrow\) \(X^T v\) is an eigenvector of \(X^TX\).
When could that be useful?
If \(p\) is large and \(n\) is comparatively smaller, we may have a situation where the eigendecomposition of \(S_f\) is computationally very expensive. This can for instance happen when dealing with images, where \(p\) is the number of pixels.
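A quick numerical check of this equivalence on synthetic data (here in the converse direction, which is the useful one in practice: an eigenvector \(w\) of the small matrix \(S_i\) maps to the eigenvector \(Xw\) of the large matrix \(S_f\), with the same eigenvalue, by the same argument as the proof above). \(X\) is \(p \times n\), matching \(S_f = XX^T\):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 100, 10                           # many features, few individuals
X = rng.normal(size=(p, n))
X -= X.mean(axis=1, keepdims=True)       # zero-center each feature

Sf = X @ X.T                             # feature space: p x p (large)
Si = X.T @ X                             # individual space: n x n (small)

lam, W = np.linalg.eigh(Si)              # eigendecompose the small matrix
w = W[:, -1]                             # top eigenvector of Si
v = X @ w                                # maps to an eigenvector of Sf
print(np.allclose(Sf @ v, lam[-1] * v))  # True: same eigenvalue
```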
Since the first components capture a large proportion of the variance in the data, we can discard the last components while preserving most of the information.

PCA in this case becomes a transformation from a \(p\) dimensional space (feature space) into a \(q \leq p\) dimensional space. Typically, we will choose \(q\) such that the cumulative explained variance ratio is \(>80\%\) or \(>90\%\).
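Choosing \(q\) from the cumulative explained variance ratio can be sketched as follows (the helper name `n_components_for` is ours, and the input values come from the scaled diabetes example above):

```python
import numpy as np

def n_components_for(eigvalues, threshold=0.90):
    """Smallest q whose cumulative explained variance ratio reaches threshold."""
    evr = np.sort(np.asarray(eigvalues))[::-1] / np.sum(eigvalues)
    cum = np.cumsum(evr)
    # small epsilon guards against float rounding at the threshold
    return int(np.argmax(cum >= threshold - 1e-12) + 1)

# explained variance ratios from the scaled diabetes example
q = n_components_for([0.55, 0.22, 0.13, 0.09, 0.01, 0.00], 0.90)
print(q)  # 3
```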
Original data

Scaled data

Very useful for visualization (and sometimes as a preprocessing step for further data analysis)… but may misrepresent some of the points \(\rightarrow\) check the distance between each point and its projection (i.e. distance between the point and the hyperplane formed by the \(q\) PCs).

Higher \(q\) \(\rightarrow\) more explained variance captured by principal components \(\rightarrow\) less residual projection error.
Original data

Scaled data


I have a dataset with radiology images, from patients with abdominal cancer and from healthy patients. After extracting 25 image features from the images, I perform PCA on those 25 numerical variables and keep the first 7 components, representing 92% of the variance.
In this new, 7-dimensional feature space, healthy and cancer patients will be…
No, that cannot happen. Since we are just rotating the point cloud (and then removing dimensions) we cannot separate point clouds that are overlapping. However…
This may happen if we have two classes that are well separated, but where the separation is not in an axis of high variance. Example: think about two thin parallel elongated point clouds, one per class.
This is the best case scenario for separability.
No, it can’t go in both directions.
PCA works on numerical variables. What can we do if we have categorical variables?
(Note: it’s a bit more complex than that… but we’ll not cover MCA more deeply in this course.)

MDS: Multidimensional Scaling.
Goal: find a low-dimensional (typically 2 or 3D) representation of the data which preserves the distances between points in the high-dimensional space.
Iterative process that tries to minimize the residual sum of squares between the distances in projected space and the distances in the original space.
Formally, if \(x_i\) is the position of point \(i\) in the original \(m\)-dimensional space, and \(\hat{x}_i\) is the position of point \(i\) in the projected \(q\)-dimensional space (\(q < m\)), then we want to minimize the stress: \(S = \sqrt{\sum_{i < j}(||x_i-x_j|| - ||\hat{x}_i-\hat{x}_j||)^2}\).
See also: https://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling
\(S = \sqrt{\sum_{i < j}(||x_i-x_j|| - ||\hat{x}_i-\hat{x}_j||)^2}\)
More generally, we can substitute \(||x_i-x_j||\) by any “dissimilarity metric” \(D_{ij}\).
\(S = \sqrt{\sum_{i < j}(D_{ij} - ||\hat{x}_i-\hat{x}_j||)^2}\)
A “non-metric” version also exists, which focuses on preserving the ranking of the distances rather than the distances themselves (i.e. there should be a monotonic relationship between \(D_{ij}\) and \(||\hat{x}_i-\hat{x}_j||\), but they don’t need to be equal).
Note that MDS is very slow if the number of points gets large.
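With scikit-learn, a metric MDS on a precomputed dissimilarity matrix might look like this (toy data; `dissimilarity="precomputed"` lets us pass any \(D_{ij}\), Euclidean here for simplicity):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                 # 30 individuals in 5D
# dissimilarity matrix D_ij (here: plain Euclidean distances)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_hat = mds.fit_transform(D)                 # 2D configuration
print(X_hat.shape, mds.stress_)              # residual stress of the solution
```

Passing `metric=False` to `MDS` switches to the non-metric variant mentioned above.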
From Foucart, 2022 (PhD dissertation): https://research.adfoucart.be/thesis/7-iv.html
Example: agreement between experts in prostate cancer annotations.
\(D_{ij} = 1-\kappa_U(i, j)\), where \(\kappa_U(i, j)\) is the unweighted Cohen’s kappa (agreement measure) between the annotations of expert \(i\) and expert \(j\).

Residuals = “stress”: difference between distance in projected space and dissimilarity.

Some randomness to the process: always check the stability of the result!

Similar idea, but based on joint probabilities. Given two points \(i\) and \(j\), the conditional probability \(p_{j|i}\) (loosely interpretable as “probability that \(j\) is in the neighbourhood of \(i\)”) is defined as:
\(p_{j|i} = \frac{\exp(-||x_i-x_j||^2/2\sigma_i^2)}{\sum_{k \neq i}\exp(-||x_i-x_k||^2/2\sigma_i^2)}\) if \(i \neq j\), and \(p_{i|i} = 0\).
The joint probability \(p_{ij}\) is defined as: \(p_{ij} = \frac{p_{j|i}+p_{i|j}}{2N}\).
This similarity can also be computed in the projected space (\(q_{ij}\), with a slightly different definition out of scope for this course). t-SNE tries to minimize the Kullback-Leibler divergence between the distributions of probabilities before and after projection:
\(KL(P || Q) = \sum_{i \neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}\).
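A minimal t-SNE run with scikit-learn's `TSNE`, which implements the method described here (toy data: two well-separated groups in a 10-dimensional feature space):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated groups of 50 points each, in 10 dimensions
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(8, 1, (50, 10))])

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Xt = tsne.fit_transform(X)       # 2D embedding
print(Xt.shape)  # (100, 2)
```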
From Shi et al., Visualizing and understanding graph convolutional network, 2021.
Medical MNIST (https://www.kaggle.com/datasets/andrewmvd/medical-mnist)
Abdomen CT 
Breast MRI 
Chest CT

Chest XR

Hand XR

Head CT

Original feature space: 64x64 pixels = 4096 dimensions. Using a subset of 6x1000 images.
PCA \(\rightarrow\) 2 dimensions

MDS \(\rightarrow\) 2 dimensions

t-SNE \(\rightarrow\) 2 dimensions

Big advantage of PCA: the dimensions are interpretable.
Reprojection of vectors along the main directions in PC space. 
t-SNE: choice of perplexity (recommended between 5 and 50 in van der Maaten and Hinton, 2008):
Perplexity=5

Perplexity=30

Perplexity=50



import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, encoding_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1, stride=2),  # 32x32
            nn.GELU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2),  # 16x16
            nn.GELU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1, stride=2),  # 8x8
            nn.GELU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1, stride=2),  # 4x4
            nn.GELU(),
            nn.Flatten(),
            nn.Linear(4*4*32, encoding_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 4*4*32),
            nn.Unflatten(1, (32, 4, 4)),
            nn.GELU(),
            nn.ConvTranspose2d(32, 32, kernel_size=3, padding=1, stride=2, output_padding=1),  # 8x8
            nn.GELU(),
            nn.ConvTranspose2d(32, 32, kernel_size=3, padding=1, stride=2, output_padding=1),  # 16x16
            nn.GELU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, padding=1, stride=2, output_padding=1),  # 32x32
            nn.GELU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, padding=1, stride=2, output_padding=1),  # 64x64
            nn.Tanh()
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
Given a data table of \(n\) individuals and \(p\) variables, group the individuals (or, less frequently, the variables) into a small number of homogeneous groups.
Note: this is still unsupervised, so we are looking for “natural” groups into which the data can be separated.
Two main types of approaches: non-hierarchical and hierarchical.
Non-hierarchical methods
Partitioning of the point cloud into \(k\) groups.


! The groups may differ from the actual “classes” of a supervised problem!
PCA with supervised classes

PCA with clusters (K-Means)

Hierarchical methods.
Partitioning into groups and subgroups.
Bottom-up methods (agglomerative)

Top-down methods (divisive)




There is no single best criterion or best method for clustering. Different methods, distance metrics, and criteria lead to different points of view on the data, which may sometimes be useful to combine and compare.
A distance metric must respect four properties: (1) non-negativity, (2) \(d(x, y) = 0 \Leftrightarrow x = y\), (3) symmetry, and (4) the triangle inequality \(d(x, z) \leq d(x, y) + d(y, z)\).
A similarity metric only needs to respect (1), (2) and (3).
Example: cosine distance \(CD(x, y) = 1 - \frac{x \cdot y}{||x|| ||y||} = 1-\cos(\angle xy)\) is not a true distance metric: if \(\angle xy = \angle yz = \frac{\pi}{4}\) and \(\angle xz = \frac{\pi}{2}\), then \(CD(x, y) = CD(y, z) = 1-\frac{1}{\sqrt{2}}\) and \(CD(x, z) = 1\), so \(CD(x,y) + CD(y, z) = 2 - \frac{2}{\sqrt{2}} \approx 0.59 < 1 = CD(x, z)\), violating the triangle inequality.
For numerical data:
Euclidean distance: easy to understand and calculate, very common \(\rightarrow\) region of equal distance to a point = hypersphere \(\rightarrow\) good at identifying spherical clusters.
When variables have different variances, however, Euclidean distance gives too much importance to high-variance dimensions \(\rightarrow\) compute the Euclidean distance on normalized variables.
We can formulate distance metrics using a metric tensor such that \(d(x, y) = (x-y)^T M (x-y)\).
Euclidean distance: \(M = I \Rightarrow d(x, y) = \begin{pmatrix} x_0-y_0 & \cdots & x_p-y_p \end{pmatrix} \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} x_0-y_0 \\ \vdots \\ x_p-y_p \end{pmatrix}\)
Standardized euclidean distance: \(M = diag(1/s_i^2) \Rightarrow d(x, y) = \begin{pmatrix} x_0-y_0 & \cdots & x_p-y_p \end{pmatrix} \begin{pmatrix} 1/s_0^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/s_p^2 \end{pmatrix} \begin{pmatrix} x_0-y_0 \\ \vdots \\ x_p-y_p \end{pmatrix}\).
Mahalanobis distance (takes into account the inter-variable correlations \(\rightarrow\) use of the covariance matrix): \(M = S^{-1}\).
Same as Euclidean distance after PCA and scaling. (Note: PCA, then scaling in PCA space, not scaling in feature space + PCA!)
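A quick numerical check of this equivalence on synthetic correlated data: the Mahalanobis (squared) distance equals the Euclidean distance computed after projecting on the PCA basis and scaling each axis by its standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated variables
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

x, y = Xc[0], Xc[1]
d2_maha = (x - y) @ np.linalg.inv(S) @ (x - y)   # Mahalanobis (squared)

lam, V = np.linalg.eigh(S)                       # PCA basis of S
xt = (V.T @ x) / np.sqrt(lam)                    # project, then scale per axis
yt = (V.T @ y) / np.sqrt(lam)
d2_pca = np.sum((xt - yt) ** 2)                  # Euclidean in that space
print(np.isclose(d2_maha, d2_pca))  # True
```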

For categorical data:
Start by forming the indicator table (cf. MCA) \(\rightarrow m = \sum_p m_p\) dimensions, with \(m_p\) the number of possible categories for variable \(p\). Then, we can compute a “euclidean distance”:
\(d^2(x, y) = \sum_{j=1}^m (x_j-y_j)^2\), which is equal to the number of discordant values between the two individuals in the indicator table. This is equal to \(2 \times\) the number of variables for which the values differ.
\(\begin{matrix} x = & 1 0 & 0 1 & 0 1 0 \\ y = & 0 1 & 0 1 & 1 0 0 \\ (x - y)^2 = & 1 1 & 0 0 & 1 1 0 \end{matrix} \rightarrow d^2(x, y) = 4\)
Since we have \(0 \leq d^2(x, y) \leq 2p\) with \(p\) the number of initial variables, we can normalize this distance into a similarity measure:
\(s(x, y) = \frac{2p-d^2(x, y)}{2p} \rightarrow 0 \leq s(x, y) \leq 1\) is the % of variables with the same value.
\(\rightarrow s(x, y) = \frac{6-4}{6} = \frac{1}{3}\)
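The worked example in a few lines of numpy (indicator rows for 3 variables with 2, 2 and 3 categories):

```python
import numpy as np

# indicator rows from the example above
x = np.array([1, 0, 0, 1, 0, 1, 0])
y = np.array([0, 1, 0, 1, 1, 0, 0])
p = 3                                 # number of original variables

d2 = np.sum((x - y) ** 2)             # number of discordant indicator cells
s = (2 * p - d2) / (2 * p)            # share of variables with the same value
print(d2, s)  # 4 0.3333333333333333
```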
What about mixed data?
Possibility 1: as in MCA, convert numerical data into categorical data by binning the variables \(\rightarrow\) lots of information lost.
Possibility 2: use a “mixed” measure. General idea: \(d^2(x, y) = \sum_{j=1}^p \delta_j(x, y)\) where \(\delta_j\) is the contribution of variable \(j\) to the distance.
\(\delta_j\) are for instance constructed so that \(0 \leq \delta_j \leq 1\) and \(\delta_j(x, y) = 0 \Leftrightarrow x = y\). So we have similarly scaled and independent contributions of the different variables to the total distance.
Example: for numerical variables, \(\delta_j(x, y) = \frac{(x_j-y_j)^2 - min_{x^*, y^*}(x_j^*-y_j^*)^2}{max_{x^*, y^*}(x_j^*-y_j^*)^2-min_{x^*, y^*}(x_j^*-y_j^*)^2}\)
Bottom-up approach: iteratively group closest individuals (or groups).
Start from all separated individuals: \(G = \{g_i\}, g_i = \{x_i\}\) where \(g_i\) is the \(i\)-th group and \(x_i\) the \(i\)-th individual.
Compute the pairwise distance of the \(n\) initial groups: \(D_{n \times n}\) with \(D_{ij} = d(g_i, g_j) = d(x_i, x_j)\). This gives us a symmetric matrix with 0s on the diagonal.
Find the non-diagonal minimum of \(D\) (i.e. the pair of groups that are closest together). Let \(a, b\) be the indices of these groups: merge them so that \(G \leftarrow G \setminus \{g_a, g_b\} \cup \{g_k = g_a \cup g_b\}\) (i.e. remove the “old” groups, add the new group formed by their union).
Recompute the distance matrix, which shrinks by one row and column: \(D_{n \times n} \rightarrow D_{(n-1) \times (n-1)}\), with \(D_{ij} = d(g_i, g_j)\).
Repeat the two previous steps until a single group remains.
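The steps above can be sketched in numpy (a naive \(O(n^3)\) single-linkage illustration, not an efficient implementation; the helper name `single_linkage` is ours):

```python
import numpy as np

def single_linkage(X):
    """Agglomerative clustering following the steps above (single linkage)."""
    groups = [[i] for i in range(len(X))]      # start: one group per point
    merges = []                                # history of (group, group, dist)
    while len(groups) > 1:
        best, bi, bj = np.inf, 0, 1
        for i in range(len(groups)):           # find the closest pair of groups
            for j in range(i + 1, len(groups)):
                dij = min(np.linalg.norm(X[a] - X[b])
                          for a in groups[i] for b in groups[j])
                if dij < best:
                    best, bi, bj = dij, i, j
        gi, gj = groups[bi], groups[bj]        # merge the two closest groups
        merges.append((gi, gj, best))
        groups = [g for k, g in enumerate(groups) if k not in (bi, bj)]
        groups.append(gi + gj)
    return merges

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
hist = single_linkage(X)
# the two tight pairs merge first, then the two resulting groups
```

In practice, `scipy.cluster.hierarchy.linkage` does this efficiently and supports the linkage criteria discussed below.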
How do we define \(d(g_i, g_j)\) when \(g_i\) and/or \(g_j\) contain more than a single element? Different possibilities:
Single linkage = “minimum distance”: \(d(g_i, g_j) = min_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)
Complete linkage = “maximum distance”: \(d(g_i, g_j) = max_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)
Average linkage: \(d(g_i, g_j) = \frac{1}{|g_i||g_j|} \sum_{x_a \in g_i, x_b \in g_j} d(x_a, x_b)\)

Other common metric: Ward criterion \(\rightarrow\) minimizes the intra-cluster variance:
\(d(g_i, g_j) = \sum_{x \in g_i \cup g_j} ||x-\mu_{g_i \cup g_j}||^2 - \sum_{x \in g_i}||x-\mu_{g_i}||^2 - \sum_{x \in g_j} ||x - \mu_{g_j}||^2\)

Representation of the clusters’ hierarchy with the dendrogram: it shows the successive pairings. To get \(n\) clusters, we can “cut” the tree at the desired “height”.

Criteria for where to cut the tree:
May be useful to remove outliers before starting the clustering.
Summary: the results of the hierarchical clustering will be influenced by…
Note: same principle can also be applied on the features for dimensionality reduction (i.e. pair most similar features and “merge” them into agglomerated features).
Basic principle: partition the point cloud into \(k\) clusters by optimizing some “fitness” function. Usually iterative methods:
Most well-known non-hierarchical clustering method: K-Means.
In its most “basic” version:
K-Means assumptions:
Initialization: random initialization is not optimal (and not stable). Possible solutions:
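One common remedy, used by default in scikit-learn's `KMeans`, is the `k-means++` seeding combined with several restarts (`n_init`); a sketch on toy data where K-Means' assumptions hold:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two compact, well-separated blobs in 2D
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# k-means++ seeding + 10 restarts tame the initialization instability
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)   # centers near (0, 0) and (5, 5)
```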







Combining methods can lead to better (or at least more robust) results. Possible pipeline:
Objective: describe a dataset and find the natural structure of the data, the interactions (relationships) between variables, and the natural groupings of the data.
Example pipeline:
Different ways to control the “quality” of the clustering.
Hypothesis test: \(H_0\) is that there is no natural partitioning of the data (and \(H_1\) that there is a natural partitioning into \(k\) groups).
We can choose a test statistic (e.g. ratio between inter- and intra-cluster inertia \(\lambda = \frac{I_B}{I_W}\)). What would this statistic look like under \(H_0\)?
\(\rightarrow\) simulation is a useful tool in this case !
Example: take global parameters of the point cloud (mean, variance…), generate \(N\) random point clouds based on those parameters.
Given a clustering method:
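A sketch of this simulation approach (the helper name `inertia_ratio` is ours; K-Means plays the role of the clustering method, and the null clouds reuse the observed mean and covariance):

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_ratio(X, k=2):
    """Test statistic: inter- over intra-cluster inertia, I_B / I_W."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    I_W = km.inertia_                            # within-cluster inertia
    I_T = np.sum((X - X.mean(axis=0)) ** 2)      # total inertia
    return (I_T - I_W) / I_W                     # Huygens: I_T = I_W + I_B

rng = np.random.default_rng(0)
# observed data: two real groups
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
obs = inertia_ratio(X)

# H0: no structure -> simulate clouds with the same mean and covariance
null = [inertia_ratio(rng.multivariate_normal(X.mean(axis=0), np.cov(X.T), len(X)))
        for _ in range(20)]
print(obs, max(null))  # the observed statistic stands out from the null values
```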
Other possibility: run different clustering methods (or with different parameters), compare the results using a quality index.
Example: Dunn’s index.
\(Dunn = \frac{S_{min}}{D_{max}}\) where \(S_{min}\) is the minimum separation between clusters (minimum distance between two individuals from different clusters) and \(D_{max}\) is the maximum diameter (maximum distance between two individuals of the same cluster).
Dunn’s index should be maximized: a better clustering solution has more compact groups and more separation between the groups.
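A minimal implementation of Dunn's index (the helper name is ours; it assumes every cluster has at least two points):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn = S_min / D_max: min inter-cluster distance over max diameter."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    s_min = D[~same].min()            # closest pair across clusters
    np.fill_diagonal(same, False)     # ignore self-distances
    d_max = D[same].max()             # widest pair within a cluster
    return s_min / d_max

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels))  # 9.0 (separation 9, diameter 1)
```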
…but remember that such a quality index doesn’t tell us anything about how “meaningful” the clusters are!
Interpreting the results requires looking into the data: what do the points in the same clusters have in common?
\(\Rightarrow\) data analysis can’t be focused only on the numbers and the abstraction of the “real world”.
To find the groups, it’s a lot easier when we have some labeled examples \(\rightarrow\) supervised (or semi-supervised) methods.