STAT-H400 - Lab 3

Multivariate statistics II

Objectives

Visualize bivariate data and show the relationships between variables.
Test relationships between different types of variables and draw reasonable conclusions from the analysis.

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while they are not forbidden, I strongly recommend that you do not use GenAI tools until you really understand the code that you are trying to write. GenAI tools are unreliable and notoriously poor as “learning” tools. They require a lot of micromanagement to be accurate. They can boost the productivity of skilled developers, but they will keep you from developing your own skills if you use them too soon.

For this lab, we will use the alzheimer_disease_data.csv file. This file contains a synthetic dataset, adapted from a dataset created by Rabie El Kharoua and available on Kaggle (https://www.kaggle.com/dsv/8668279), with different types of information about a sample of 2,149 patients: demographics, lifestyle factors, medical history, clinical measurements, cognitive assessments, symptoms… and an outcome variable: the Alzheimer's diagnosis. Do not use the dataset from Kaggle: the version used in this lab has been significantly modified to make it more realistic. Use the file from the Virtual University.

The overall goal of the lab will be to analyze this data in order to determine if we can make certain assumptions about the variables (normality, uniform distribution…), if there are potential biases in the dataset that we should be aware of, and if we can find some interesting relationships either between the independent variables, or between the outcome variable and some of the independent variables.

Note: even though we are not going to be working on a classifier at the moment, we will still separate the data into a training set and a test set. This is because we will try to do classification at some point, and we do not want our choice of classification method to be influenced by conclusions drawn from the full dataset!

Exercise 1

The alzheimer.py file contains two helper classes to help you process the data. Here is a quick example showing how to split the data into a training and test set, and how to retrieve the values for a specific variable:

import numpy as np
from alzheimer import AlzheimerDataset  # helper class defined in alzheimer.py

dataset = AlzheimerDataset("alzheimers_disease_data.csv") # put the path to the CSV file here
# set a random state for replicability between your experiments
train_data, test_data = dataset.random_split(test_ratio=0.2, random_state=0)

values = train_data["Diabetes"]
# get the labels for the possible values
labels = train_data.labels("Diabetes")
# print the count per value for the Diabetes variable
print([(labels[v], (values==v).sum()) for v in np.unique(values)])
# the loop below prints the same counts, one per line:
for v in np.unique(values):
    print(labels[v], (values==v).sum())

# get a pandas.DataFrame with all nominal variables:
cats = train_data.nominals
# get a pandas.DataFrame with all the numerical variables:
nums = train_data.numericals

Check the distribution of values for the nominal variables. From the description of the variables, which ones do you think should have equal frequencies of values in order to have an unbiased dataset? Propose appropriate visualizations.
Use the appropriate statistical test to determine if those variables have equal frequencies of values. Can you conclude that the dataset is unbiased or biased? What limitations does that put on conclusions that you may later draw?
Check the distribution of values for the numerical variables with a visualization. Do some of those variables follow an approximately normal distribution? Do they potentially follow a uniform distribution? Formulate hypotheses, then use the appropriate statistical test to decide whether to reject them.
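
As a rough illustration of the kinds of tests involved, the sketch below uses SciPy to run a chi-square goodness-of-fit test against equal expected frequencies, a Shapiro-Wilk test for normality, and a Kolmogorov-Smirnov test against a uniform distribution on the observed range. The data is synthetic and generated on the spot (it is not the lab dataset), and the function names are hypothetical. Which tests are appropriate for the lab data remains your decision to justify.

```python
import numpy as np
from scipy import stats


def equal_frequency_test(values):
    """Chi-square goodness-of-fit test against equal expected frequencies."""
    _, counts = np.unique(values, return_counts=True)
    # Without an f_exp argument, scipy.stats.chisquare assumes that all
    # categories are equally likely under the null hypothesis.
    return stats.chisquare(counts)


def distribution_checks(x):
    """Return (p_normal, p_uniform) for a 1-D numerical sample."""
    x = np.asarray(x, dtype=float)
    # Shapiro-Wilk: H0 = the sample comes from a normal distribution.
    p_normal = stats.shapiro(x).pvalue
    # Kolmogorov-Smirnov: H0 = uniform on [min, max] of the sample.
    # The bounds are estimated from the data, so the p-value is approximate.
    lo, hi = x.min(), x.max()
    p_uniform = stats.kstest(x, "uniform", args=(lo, hi - lo)).pvalue
    return p_normal, p_uniform


rng = np.random.default_rng(0)

# Synthetic nominal variables: one roughly balanced, one clearly skewed.
balanced = rng.integers(0, 2, size=1000)
skewed = rng.choice([0, 1], size=1000, p=[0.85, 0.15])
print("balanced p-value:", equal_frequency_test(balanced).pvalue)
print("skewed p-value:  ", equal_frequency_test(skewed).pvalue)

# Synthetic numerical variables: one normal, one uniform.
normal_sample = rng.normal(loc=50, scale=10, size=500)
uniform_sample = rng.uniform(low=0, high=10, size=500)
print("normal sample  (p_normal, p_uniform):", distribution_checks(normal_sample))
print("uniform sample (p_normal, p_uniform):", distribution_checks(uniform_sample))
```

A small p-value leads you to reject the corresponding null hypothesis; a large one only means that the data does not contradict it.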

Exercise 2

Objectives: test the potential relationships between independent variables.

Based on the results of the tests from the previous exercise, select the appropriate test for the pairwise (1 vs 1) relationships between the different variables:
- For numerical vs numerical, should you use a Pearson or a Spearman correlation coefficient?
- For categorical vs categorical, should you use a Chi-square independence test or a McNemar test?
- How can you test categorical vs numerical variables?
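
To show what such tests look like in code, here is a sketch on synthetic data: a Spearman correlation for two numerical variables, a chi-square independence test on a contingency table for two categorical variables, and a Kruskal-Wallis test for a numerical variable grouped by a categorical one. These are possible options, not the expected answer; the right choice for the lab data depends on what you found in Exercise 1. The column names are invented for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 400

# Synthetic data with invented column names (not the lab dataset).
group = rng.integers(0, 3, size=n)
df = pd.DataFrame({
    "group": group,
    # Probability of flag == 1 increases with the group index.
    "flag": (rng.random(n) < 0.2 + 0.25 * group).astype(int),
    # Mean of score increases with the group index.
    "score": rng.normal(loc=group, scale=1.0, size=n),
})
# A monotonic but non-linear function of score, plus a little noise.
df["noisy_score"] = df["score"] ** 3 + rng.normal(scale=0.1, size=n)

# Numerical vs numerical: Spearman is rank-based and only assumes
# a monotonic relationship between the two variables.
rho, p_num_num = stats.spearmanr(df["score"], df["noisy_score"])

# Categorical vs categorical: chi-square test on the contingency table.
table = pd.crosstab(df["group"], df["flag"])
chi2, p_cat_cat, dof, expected = stats.chi2_contingency(table)

# Categorical vs numerical: compare the numerical values across categories.
samples = [g["score"].to_numpy() for _, g in df.groupby("group")]
h_stat, p_cat_num = stats.kruskal(*samples)

print("Spearman rho:", rho, "p:", p_num_num)
print("Chi-square:", chi2, "dof:", dof, "p:", p_cat_cat)
print("Kruskal-Wallis H:", h_stat, "p:", p_cat_num)
```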
Since you are running many different tests, what impact does that have on the significance level that you should use?
Given those results, do you expect that it is possible to reduce the dimensionality of the problem by merging together some variables that are highly correlated?
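
Regarding the question on the significance level: one common and conservative adjustment is the Bonferroni correction, which compares each p-value to alpha / m when m tests are run. The sketch below applies it to made-up p-values (illustrative numbers only, not results from the lab data); the function name is hypothetical, and other corrections exist.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Return a boolean rejection mask and the corrected threshold."""
    p_values = np.asarray(p_values, dtype=float)
    # Each individual test is compared to alpha divided by the number of tests.
    threshold = alpha / p_values.size
    return p_values < threshold, threshold


# Made-up p-values from five hypothetical tests.
p_values = [0.001, 0.012, 0.03, 0.2, 0.0004]
reject, threshold = bonferroni_reject(p_values)
print("corrected threshold:", threshold)
print("rejected hypotheses:", reject)
```

Note that testing all pairs among k variables means running k(k-1)/2 tests, so the corrected threshold shrinks quickly as k grows.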

Exercise 3

Objectives: test the potential relationships between the outcome variable and the independent variables.

Note: it is particularly important here to use only training set data!

Test the independence hypothesis between the outcome variable and each independent variable.
Given those results, do you expect that it is possible to reduce the dimensionality of the problem by selecting variables that are more likely to have discriminatory power?
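
One possible way to organize this screening is sketched below on synthetic data: test each variable against the outcome, collect the p-values, and sort them. The sketch assumes a binary outcome (true by construction here) and uses a chi-square independence test for categorical variables and a Mann-Whitney U test for numerical ones. Both the choice of tests and the column names are assumptions made for the example; adapt them to your own findings, and remember to correct for multiple testing.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n = 500
outcome = rng.integers(0, 2, size=n)  # synthetic binary outcome

# Invented variables: two related to the outcome, two pure noise.
nominals = pd.DataFrame({
    "cat_related": (rng.random(n) < 0.3 + 0.4 * outcome).astype(int),
    "cat_noise": rng.integers(0, 2, size=n),
})
numericals = pd.DataFrame({
    "num_related": rng.normal(loc=1.0 * outcome, size=n),
    "num_noise": rng.normal(size=n),
})

p_values = {}

# Categorical variable vs outcome: chi-square independence test.
for name in nominals.columns:
    table = pd.crosstab(nominals[name], outcome)
    _, p, _, _ = stats.chi2_contingency(table)
    p_values[name] = p

# Numerical variable vs binary outcome: Mann-Whitney U test
# comparing the values observed in the two outcome groups.
for name in numericals.columns:
    x = numericals[name].to_numpy()
    p_values[name] = stats.mannwhitneyu(x[outcome == 0], x[outcome == 1]).pvalue

# Smallest p-values first: strongest evidence against independence.
ranking = pd.Series(p_values).sort_values()
print(ranking)
```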