STAT-H400 - Lab 4

Unsupervised methods

Objectives

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while not forbidden, I highly recommend not using GenAI tools until you really understand the code you are trying to write. GenAI tools are unreliable and notoriously poor as "learning" tools. They require a lot of micromanagement to be accurate. They can be a productivity boost for skilled developers, but will stop you from growing your own skills if used too soon.

For this lab, we will use the “Breast Cancer Wisconsin” dataset (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), which is available directly from scikit-learn using sklearn.datasets.load_breast_cancer. It contains 30 cell nuclei features from images of breast cancer biopsies, as well as the diagnosis (benign / malignant).

The breast_cancer.py file (available on the Virtual University or on Gitlab) contains a helper class to easily get the variables and separate a train and test set. Example usage:

from breast_cancer import BreastCancerDataset

data = BreastCancerDataset()

print(f"{data.Xtrain.shape[0]} cases in training set.")
print(f"{data.Xtest.shape[0]} cases in test set.")

# Show mean and standard deviation of the training set for each feature
features = data.columns
for i in range(len(data.columns)):
    print(features[i], data.Xtrain[:, i].mean(), data.Xtrain[:, i].std(ddof=1))

Exercise 1

  1. Compute the zero-centered matrix for the data by subtracting the mean value of each variable.
  2. Compute the variance-covariance matrix. You can use the cov function from Numpy. Look out for the rowvar parameter of that function and make sure that you compute the matrix correctly. What should be its shape?
  3. Perform an eigenvalue decomposition to get the eigenvalues and eigenvectors of the covariance matrix. You can use the linalg.eig function from Numpy (or linalg.eigh, which is designed for symmetric matrices such as a covariance matrix).
  4. Use the PCA class from scikit-learn to perform the PCA on the variables. Check that the components and explained variance correspond to the eigenvalues and eigenvectors.
  5. How many components do you need to keep 90% of the explained variance?
  6. Standardize the data by dividing each variable by its standard deviation. Re-compute the PCA. How many components do you need now?
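The steps above can be sketched as follows. This is a minimal outline (here loading the data directly with sklearn.datasets.load_breast_cancer rather than the helper class, and checking only the leading eigenvalues against scikit-learn, since the smallest ones are more sensitive to numerical error):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data  # shape (569, 30)

# 1. Zero-center each variable
Xc = X - X.mean(axis=0)

# 2. Covariance matrix: rowvar=False tells NumPy that columns are
#    variables, so the result is (30, 30)
S = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition; eigh handles symmetric matrices and returns
#    eigenvalues in ascending order, so we reverse them
eigvals = np.linalg.eigh(S)[0][::-1]

# 4. Compare the leading eigenvalues with scikit-learn's PCA
pca = PCA().fit(X)
assert np.allclose(pca.explained_variance_[:5], eigvals[:5])

# 5. Number of components needed to keep 90% of the variance
cum = np.cumsum(pca.explained_variance_ratio_)
k = np.searchsorted(cum, 0.90) + 1
print(f"{k} component(s) keep 90% of the variance (raw data)")

# 6. Same question after standardizing each variable
Xs = Xc / X.std(axis=0, ddof=1)
pca_s = PCA().fit(Xs)
k_s = np.searchsorted(np.cumsum(pca_s.explained_variance_ratio_), 0.90) + 1
print(f"{k_s} component(s) keep 90% of the variance (standardized data)")
```

You should observe that standardizing changes the answer considerably: without it, the variables with the largest raw variance dominate the first components.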

Exercise 2

The scikit-learn library provides several clustering methods, described in their User Guide.

  1. Use the AgglomerativeClustering class to perform a hierarchical clustering. Plot the results using a dendrogram. How do you interpret those results? Test the impact of changing the linkage method (average, complete, single). Test the impact of using the original data, or the PCA-transformed data.

Note: to ensure that you compute the full tree and that you can create the dendrogram, you need to initialize the AgglomerativeClustering with: AgglomerativeClustering(distance_threshold=0, n_clusters=None).
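AgglomerativeClustering does not return a dendrogram directly, so one common approach (adapted from the scikit-learn documentation example) is to rebuild the linkage matrix that scipy's dendrogram function expects from the fitted model's children_ and distances_ attributes. A sketch, using the dataset loaded directly from scikit-learn:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
n_samples = X.shape[0]

# Fit the full tree, as explained in the note above
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None,
                                linkage="average").fit(X)

# Each row of the linkage matrix is:
# [child_1, child_2, merge_distance, number_of_leaves_under_the_merge]
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]).astype(float)

# Truncate to the top levels of the tree for readability
dendrogram(linkage_matrix, truncate_mode="level", p=4)
plt.show()
```

Re-run this with linkage="complete" or linkage="single", and with the PCA-transformed data, to answer the questions above.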

  2. Use the KMeans class to perform non-hierarchical clustering. Choose the number of clusters based on the hierarchical clustering results.
  3. Is the distribution of the diagnosis variable similar in all clusters?
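A possible sketch for these last two questions, assuming (for illustration only) that the dendrogram suggested two clusters; adapt n_clusters to your own results:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target  # target: 0 = malignant, 1 = benign

# n_clusters=2 is an assumption here: pick the value your dendrogram suggests
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cross-tabulate cluster membership against the diagnosis
for c in range(km.n_clusters):
    counts = np.bincount(y[km.labels_ == c], minlength=2)
    print(f"cluster {c}: {counts[0]} malignant, {counts[1]} benign")
```

If the clusters were unrelated to the diagnosis, each cluster would show roughly the same benign/malignant proportions as the full dataset; comparing the printed counts to the overall distribution answers the question.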