STAT-H400 - Lab 5

Classification

Objectives

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while not forbidden, I highly recommend not using GenAI tools until you really understand the code you are trying to write. GenAI tools are unreliable and notoriously bad as a "learning" tool: they require a lot of micromanagement to be accurate. They can be a productivity boost for skilled developers, but they will stop you from growing your own skills if used too soon.

For this lab, we will use both the "Breast Cancer Wisconsin" dataset from Lab 4 and the "Alzheimer" dataset from Lab 3. In both cases, we have an outcome variable (benign/malignant, or Alzheimer's/no Alzheimer's), and we want to predict this outcome variable from the others.

Exercise 1

We will start by creating a supervised pipeline with a simple linear classifier: the RidgeClassifier.

  1. On the Alzheimer data, using the numerical variables only, use cross-validation to estimate the accuracy and F1-score of the RidgeClassifier for different values of the alpha parameter. Adapt the following code, which performs 5-fold cross-validation for one value of alpha:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import RidgeClassifier
from alzheimer import AlzheimerDataset

dataset = AlzheimerDataset("alzheimers_disease_data.csv")
train_data, test_data = dataset.random_split(test_ratio=0.2, random_state=0)
X = train_data.numericals.to_numpy()
Y = train_data.outcome.to_numpy()[..., 0]

clf = RidgeClassifier(alpha=1.0)
# cross_validate returns a dict of per-fold arrays; the requested scores
# are under the keys 'test_accuracy' and 'test_f1'
scores = cross_validate(clf, X, Y, cv=5, scoring=['accuracy', 'f1'])
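To estimate the scores for several values of alpha, you can loop over a grid and keep the mean per-fold scores. A minimal sketch (the logarithmic grid of alphas is an arbitrary choice, and synthetic data stands in for the Alzheimer numerical variables):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the Alzheimer numerical variables (X) and outcome (Y)
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)

# Try alphas on a logarithmic grid, keeping the mean cross-validated scores
results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_validate(RidgeClassifier(alpha=alpha), X, Y,
                            cv=5, scoring=['accuracy', 'f1'])
    results[alpha] = (scores['test_accuracy'].mean(), scores['test_f1'].mean())

for alpha, (acc, f1) in results.items():
    print(f"alpha={alpha:g}: accuracy={acc:.3f}, f1={f1:.3f}")
```

In your own code, X and Y come from the AlzheimerDataset training split as in the snippet above.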
  2. Based on your results from Lab 2, run the experiment using only numerical variables which are likely to have a relationship with the outcome variable. Do you improve your cross-validation results?

  3. For the best value of alpha that you found, re-train the RidgeClassifier on the whole training set and compute the accuracy, precision, recall and F1-score on the test set.

  4. (Optional) Using both numerical and nominal data, train a DecisionTreeClassifier and estimate its results using cross-validation. Re-run the experiment selecting only the variables with a relationship with the outcome variable.
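Re-training on the whole training set and scoring on the held-out test set might look like the following sketch. The train/test split is again a synthetic stand-in (in your code it comes from dataset.random_split), and best_alpha=1.0 is a placeholder for the value your cross-validation selected:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/test split produced by dataset.random_split
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

best_alpha = 1.0  # placeholder: use the best value found by cross-validation
clf = RidgeClassifier(alpha=best_alpha).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(Y_test, Y_pred))
print("precision:", precision_score(Y_test, Y_pred))
print("recall   :", recall_score(Y_test, Y_pred))
print("f1       :", f1_score(Y_test, Y_pred))
```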

Exercise 2

  1. On the Breast Cancer data, use the LinearDiscriminantAnalysis class to reduce the dimensionality of the data using LDA. How do the “discriminative directions” compare to the “principal components” computed in Lab 4?

  2. (Optional) Using cross-validation, find the best parameters for a RidgeClassifier and for a LogisticRegression classifier on the Breast Cancer data.

  3. (Optional) Using the best parameters for both of these methods, re-train the classifiers on the whole training set and compute the F1-scores on the test set. Use the appropriate statistical test to determine if the results are significantly different at a significance level of 0.05.
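For the last step, one common choice for comparing two classifiers evaluated on the same test set is McNemar's test, which only looks at the examples on which the two classifiers disagree. A sketch with made-up prediction vectors (in practice, use the test-set labels and the predictions of your two trained classifiers):

```python
import numpy as np
from scipy.stats import binomtest

# Made-up ground truth and predictions from two classifiers on one test set
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
pred_a = np.where(rng.random(200) < 0.80, y_true, 1 - y_true)  # ~80% accurate
pred_b = np.where(rng.random(200) < 0.75, y_true, 1 - y_true)  # ~75% accurate

correct_a = pred_a == y_true
correct_b = pred_b == y_true

# Discordant counts: b = A right / B wrong, c = A wrong / B right
b = int(np.sum(correct_a & ~correct_b))
c = int(np.sum(~correct_a & correct_b))

# Exact McNemar test: under H0, the discordant pairs split 50/50
p_value = binomtest(b, n=b + c, p=0.5).pvalue
print(f"b={b}, c={c}, p-value={p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

If b + c is very small, the test has little power and the exact version above is preferable to the chi-squared approximation.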