Unsupervised methods
Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while not forbidden, I strongly recommend not using GenAI tools until you really understand the code that you are trying to write. GenAI tools are unreliable and notoriously bad as a “learning” tool. They require a lot of micromanagement to be accurate. They can be a helpful productivity boost for skilled developers, but will stop you from growing your own skills if used too soon.
For this lab, we will use the “Breast Cancer Wisconsin” dataset (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), which is available directly from scikit-learn using sklearn.datasets.load_breast_cancer. It contains 30 cell nuclei features from images of breast cancer biopsies, as well as the diagnosis (benign / malignant).
The breast_cancer.py file (available on the Virtual University or on Gitlab) contains a helper class to easily get the variables and split them into a training set and a test set. Example usage:
from breast_cancer import BreastCancerDataset

data = BreastCancerDataset()
print(f"{data.Xtrain.shape[0]} cases in training set.")
print(f"{data.Xtest.shape[0]} cases in test set.")
# Show mean and standard deviation of the training set for each feature
features = data.columns
for i in range(len(data.columns)):
    print(features[i], data.Xtrain[:, i].mean(), data.Xtrain[:, i].std(ddof=1))
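If breast_cancer.py is not at hand, a similar setup can be sketched directly with scikit-learn. This is not the helper class itself: the split proportion, the random seed and the variable names below are assumptions, since the handout does not state how the helper splits the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the feature matrix, the diagnosis labels and the feature names.
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
feature_names = dataset.feature_names

# Hypothetical split: test_size and random_state are assumptions,
# not the values used by breast_cancer.py.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(f"{X_train.shape[0]} cases in training set.")
print(f"{X_test.shape[0]} cases in test set.")
print(f"{len(feature_names)} features, classes: {list(dataset.target_names)}")
```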
Compute the covariance matrix of the variables using the cov method from Numpy. Look out for the rowvar parameter of that method and make sure that you compute the matrix correctly. What should be its shape?
Compute its eigenvalues and eigenvectors using the linalg.eig method from Numpy.
Use the PCA class from scikit-learn to perform the PCA on the variables. Check that the components and explained variance correspond to the eigenvectors and eigenvalues, respectively.
The scikit-learn library provides several clustering methods, described in their User Guide.
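Coming back to the PCA steps (covariance matrix, eigendecomposition, scikit-learn PCA), here is one possible sketch of how the pieces relate. It assumes the full dataset loaded from scikit-learn rather than data.Xtrain, and the loose tolerance in the comparison is a choice made here, not something prescribed by the lab.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data  # rows are cases, columns are features

# Rows are observations here, so rowvar=False is needed. The default
# rowvar=True would treat every case as a variable instead.
cov = np.cov(X, rowvar=False)

# linalg.eig does not sort its output: order from largest to smallest.
eigvals, eigvecs = np.linalg.eig(cov)
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]  # eigenvectors are stored as columns

pca = PCA().fit(X)

# Loose absolute tolerance: the eigenvalues span many orders of magnitude.
variances_match = np.allclose(pca.explained_variance_, eigvals, atol=1e-6)

# components_ holds one component per row, defined up to a sign flip,
# so compare directions through the absolute dot product.
alignment = abs(pca.components_[0] @ eigvecs[:, 0])

print(cov.shape, variances_match, alignment)
```

Because each component is only defined up to its sign, a direct element-wise comparison of pca.components_ with the eigenvectors can fail even when both describe the same axes.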
Use the AgglomerativeClustering class to perform a hierarchical clustering. Plot the results using a dendrogram. How do you interpret those results? Test the impact of changing the linkage method (average, complete, single). Test the impact of using the original data, or the PCA-transformed data.
Note: to ensure that you compute the full tree and that you can create the dendrogram, you need to initialize the AgglomerativeClustering with: AgglomerativeClustering(distance_threshold=0, n_clusters=None).
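AgglomerativeClustering has no plotting method of its own, so one common approach is to convert the fitted tree into the linkage matrix format expected by scipy.cluster.hierarchy.dendrogram. The sketch below assumes the full dataset loaded from scikit-learn and average linkage; the truncation depth is an arbitrary choice to keep the tree readable.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

# distance_threshold=0 with n_clusters=None builds the full tree and
# makes the merge distances available in model.distances_.
model = AgglomerativeClustering(
    distance_threshold=0, n_clusters=None, linkage="average"
).fit(X)

# Count how many original samples sit under each merge node.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child in merge:
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

# SciPy linkage format: [child_a, child_b, merge_distance, sample_count].
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# no_plot=True only computes the layout. Remove it and call
# matplotlib.pyplot.show() afterwards to draw the figure.
tree = dendrogram(linkage_matrix, truncate_mode="level", p=3, no_plot=True)
print(linkage_matrix.shape, len(tree["icoord"]))
```

Changing the linkage argument to "complete" or "single", or fitting on PCA-transformed data instead of X, only requires editing the two lines that build and fit the model.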
Use the KMeans class to perform non-hierarchical clustering. Choose the number of clusters based on the hierarchical clustering results.
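A minimal KMeans sketch could look as follows. The value n_clusters=2 is only a placeholder assumption, to be replaced by whatever the dendrogram suggests, and the full dataset from scikit-learn stands in for the training set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

# Placeholder value: choose it from the hierarchical clustering results.
n_clusters = 2

# random_state fixes the initialisation so that reruns give the same labels.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Number of cases assigned to each cluster (cluster ids are arbitrary).
sizes = np.bincount(kmeans.labels_, minlength=n_clusters)
print(sizes, kmeans.cluster_centers_.shape)
```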