Multivariate statistics II
Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while not forbidden, I highly recommend not using GenAI tools until you really understand the code you are trying to write. GenAI tools are unreliable and notoriously bad as a “learning” tool: they require a lot of micromanagement to be accurate. They can be a productivity boost for skilled developers, but will stop you from growing your own skills if used too soon.
For this lab, we will use the alzheimer_disease_data.csv file. This file contains a synthetic dataset, adapted from a dataset created by Rabie El Kharoua and available on Kaggle (https://www.kaggle.com/dsv/8668279), with different types of information about a sample of 2,149 patients: demographics, lifestyle factors, medical history, clinical measurements, cognitive assessments, symptoms… and an outcome variable: the Alzheimer’s diagnosis. Do not use the dataset from Kaggle: the version used here has been significantly modified to make it more realistic. Use the file from the Virtual University.
The overall goal of the lab will be to analyze this data in order to determine if we can make certain assumptions about the variables (normality, uniform distribution…), if there are potential biases in the dataset that we should be aware of, and if we can find some interesting relationships either between the independent variables, or between the outcome variable and some of the independent variables.
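As a concrete example of checking a distributional assumption, a normality test (and a test against a uniform distribution) can be run on any numerical variable of the training set. Below is a minimal sketch, assuming scipy is available; here x is only a random placeholder standing in for the values of one numerical variable, so that the snippet runs on its own:

import numpy as np
from scipy import stats

# x: placeholder for the values of one numerical variable from the training set
x = np.random.default_rng(0).normal(loc=70.0, scale=10.0, size=200)

# Shapiro-Wilk test: null hypothesis = the sample comes from a normal distribution
stat, p_value = stats.shapiro(x)
print("Shapiro-Wilk:", stat, p_value)  # a small p-value argues against normality

# Kolmogorov-Smirnov test against a uniform distribution on [min(x), max(x)]
stat, p_value = stats.kstest(x, "uniform", args=(x.min(), x.max() - x.min()))
print("KS vs. uniform:", stat, p_value)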
Note: even though we are not going to be working on a classifier at the moment, we will still separate the data into a training set and a test set. This is because we will try to do classification at some point, and we don’t want to base choices about the classification method on conclusions drawn from the full dataset!
The alzheimer.py file contains two helper classes for processing the data. Here is a quick example showing how to split the data into a training set and a test set, and how to retrieve the values for a specific variable:
import numpy as np
from alzheimer import AlzheimerDataset  # helper class provided with the lab

# put the path to the CSV file here
dataset = AlzheimerDataset("alzheimers_disease_data.csv")
# set a random state for replicability between your experiments
train_data, test_data = dataset.random_split(test_ratio=0.2, random_state=0)

values = train_data["Diabetes"]
# get the labels for the possible values
labels = train_data.labels("Diabetes")
# print the count per value for the Diabetes variable
print([(labels[v], (values == v).sum()) for v in np.unique(values)])
# note that this is a shorter way of writing:
for v in np.unique(values):
    print(labels[v], (values == v).sum())

# get a pandas.DataFrame with all nominal variables:
cats = train_data.nominals
# get a pandas.DataFrame with all the numerical variables:
nums = train_data.numericals
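To get a first feel for potential biases in the training set (for example, a strongly over-represented group), one option is to look at the proportion of each value of every nominal variable. A minimal sketch, assuming train_data has been built as above and that train_data.nominals is the pandas.DataFrame described above:

# proportion of each value for every nominal variable in the training set
cats = train_data.nominals
for column in cats.columns:
    print(column)
    print(cats[column].value_counts(normalize=True))  # proportions rather than raw counts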
Objective 1: test the potential relationships between the independent variables.
Objective 2: test the potential relationships between the outcome variable and the independent variables.
Note: it is particularly important here to only use training set data! Possible starting points for both objectives are sketched below.
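For the first objective, a common starting point is a chi-square test of independence for a pair of nominal variables, and a correlation matrix for the numerical variables. A minimal sketch, reusing train_data from above; the second column name, "Smoking", is only an assumption, so replace it with a nominal variable that actually exists in the dataset:

import pandas as pd
from scipy import stats

# nominal vs. nominal: chi-square test of independence on the contingency table
# ("Smoking" is a placeholder name; use any nominal variable from train_data.nominals)
table = pd.crosstab(train_data["Diabetes"], train_data["Smoking"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print("chi2 =", chi2, "p-value =", p_value)

# numerical vs. numerical: Pearson correlation matrix of all numerical variables
print(train_data.numericals.corr())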
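For the second objective, the same chi-square test works between the outcome and a nominal variable, and a two-sample test (e.g., Welch’s t-test) can compare a numerical variable between the two diagnostic groups. A minimal sketch; the outcome column name "Diagnosis" and the 0/1 coding are assumptions to be checked against the actual data:

import pandas as pd
from scipy import stats

outcome = train_data["Diagnosis"]  # assumption: adapt to the real outcome column name

# outcome vs. a nominal variable: chi-square test of independence
table = pd.crosstab(outcome, train_data["Diabetes"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print("chi2 =", chi2, "p-value =", p_value)

# outcome vs. a numerical variable: compare the two groups with Welch's t-test
nums = train_data.numericals
col = nums.columns[0]             # placeholder: pick any numerical variable
group0 = nums[col][outcome == 0]  # assumes a 0/1 coding of the diagnosis
group1 = nums[col][outcome == 1]
print(stats.ttest_ind(group0, group1, equal_var=False))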