STAT-H400 - Lab 2

Multivariate statistics I

Objectives

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while not forbidden, I highly recommend not using GenAI tools until you really understand the code you are trying to write. GenAI tools are unreliable and notoriously bad as a “learning” tool: they require a lot of micromanagement to be accurate. They can be helpful as a productivity boost for skilled developers, but they will stop you from growing your own skills if used too soon.

Exercise 1

Objectives: visualize bivariate data & perform appropriate hypothesis tests.

Using the instrument_experiment.csv data from the previous lab:

  1. Visualize the (possible) relationship between different pairs of variables:
    1. measure_before vs measure_after
    2. measure_before vs temperature
    3. hospital vs measure_before
    4. instrument_id vs measure_before
  2. From these visualizations, what would you conclude about the potential dependence between variables?
  3. For the same pairs of variables as above, test the independence hypothesis using the appropriate statistical tests. Don’t forget to check any assumptions such as normality or equality of variances (a starting sketch is given after this list).
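Here is a minimal starting sketch, not a definitive solution. It assumes the CSV columns are named exactly as in the list above (measure_before, measure_after, temperature, hospital, instrument_id), and the parametric tests shown (Pearson correlation, one-way ANOVA) are only valid when their assumptions hold; swap in non-parametric alternatives such as scipy.stats.spearmanr or scipy.stats.kruskal when they do not.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("instrument_experiment.csv")

# Numeric vs numeric pairs: scatter plots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df["measure_before"], df["measure_after"], alpha=0.5)
axes[0].set(xlabel="measure_before", ylabel="measure_after")
axes[1].scatter(df["measure_before"], df["temperature"], alpha=0.5)
axes[1].set(xlabel="measure_before", ylabel="temperature")
plt.show()

# Categorical vs numeric pairs: box plots per group
df.boxplot(column="measure_before", by="hospital")
df.boxplot(column="measure_before", by="instrument_id")
plt.show()

# Normality check (needed before Pearson / ANOVA)
print(stats.shapiro(df["measure_before"]))

# Numeric-numeric independence: Pearson correlation test
print(stats.pearsonr(df["measure_before"], df["measure_after"]))

# Categorical-numeric: equality of variances, then one-way ANOVA
# (repeat with "instrument_id" for the last pair)
groups = [g["measure_before"].values for _, g in df.groupby("hospital")]
print(stats.levene(*groups))
print(stats.f_oneway(*groups))
```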

Exercise 2

The data we have used so far comes from an experiment that lacks a control group. The conclusions we have been able to draw so far are therefore limited. In the instrument_experiment_control.csv file, we have an improved dataset where, this time, patients were split between a “CONTROL” and a “TREATMENT” group.

Using this new dataset, try to answer the following question:

  1. Were the patients randomly split between the two groups? (i.e. is the group assignment independent of the instrument, hospital and observer?)
  2. Is there a significant difference in the effect between the two groups?
  3. What are the success rate and the effect size of the treatment?
  4. Using bootstrapping, determine a 95% confidence interval for each of these two values (a starting sketch follows this list).
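Below is a possible starting sketch. The column names group and observer, the group labels “CONTROL”/“TREATMENT”, and the working definitions of “effect” (measure_after − measure_before), “success rate” (share of treated patients whose measure improved) and “effect size” (difference in mean effect between groups) are assumptions made for illustration; adapt them to the actual dataset and to your own definitions.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("instrument_experiment_control.csv")

# 1. Randomisation check: chi-square test of independence between the
#    group column and each categorical covariate (column names assumed).
for col in ["instrument_id", "hospital", "observer"]:
    table = pd.crosstab(df["group"], df[col])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    print(col, "p-value:", p)

# 2. Difference in effect between groups (effect defined here, as an
#    assumption, as measure_after - measure_before); Welch's t-test.
df["effect"] = df["measure_after"] - df["measure_before"]
treat = df.loc[df["group"] == "TREATMENT", "effect"].values
ctrl = df.loc[df["group"] == "CONTROL", "effect"].values
print(stats.ttest_ind(treat, ctrl, equal_var=False))

# 3. Point estimates: success rate and mean effect size (definitions assumed).
success_rate = np.mean(treat > 0)
effect_size = np.mean(treat) - np.mean(ctrl)

# 4. Bootstrap 95% confidence intervals (percentile method).
rng = np.random.default_rng(0)
boot_sr, boot_es = [], []
for _ in range(10_000):
    t = rng.choice(treat, size=len(treat), replace=True)
    c = rng.choice(ctrl, size=len(ctrl), replace=True)
    boot_sr.append(np.mean(t > 0))
    boot_es.append(np.mean(t) - np.mean(c))
print("success rate:", success_rate, np.percentile(boot_sr, [2.5, 97.5]))
print("effect size:", effect_size, np.percentile(boot_es, [2.5, 97.5]))
```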

Exercise 3 (optional)

Objectives: experiment with simulations to find the distribution of the Kolmogorov-Smirnov statistic.

This is an additional, optional exercise which can help you get a better sense of how simulations can be used to find the shape of test statistic distributions.

  1. Create a function that randomly draws n samples from a continuous distribution 𝒟 of your choice and computes the two-sided Kolmogorov-Smirnov statistic (i.e. the maximum distance between the empirical and theoretical cumulative distribution functions) between the sample and the theoretical distribution.
  2. By repeatedly calling this function, build a histogram of frequencies for the values of the statistic when H0 holds (since the sample really is drawn from the theoretical distribution).
  3. Compare the simulated distribution of the statistic to the theoretical distribution, which you can find using scipy.stats.kstwo.
  4. Repeat for different distributions 𝒟 and/or different sample sizes n to show that the results are consistent.

Hint: to compute the statistic, you may either try to reimplement it yourself (which is a nice exercise but may be a bit difficult), or use the statistic returned by scipy.stats.kstest.
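A minimal sketch of the simulation, using a standard normal as 𝒟 and the statistic returned by scipy.stats.kstest, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def ks_stat(n, dist=stats.norm()):
    """Draw n samples from `dist` and return the two-sided KS statistic."""
    sample = dist.rvs(size=n)
    return stats.kstest(sample, dist.cdf).statistic

n = 50
sims = np.array([ks_stat(n) for _ in range(5_000)])

# Histogram of simulated statistics vs the theoretical kstwo density
x = np.linspace(0, sims.max(), 200)
plt.hist(sims, bins=50, density=True, alpha=0.5, label="simulated")
plt.plot(x, stats.kstwo(n).pdf(x), label="scipy.stats.kstwo")
plt.legend()
plt.show()
```

Rerunning the sketch with a different frozen distribution (e.g. stats.expon()) or a different n should leave the agreement with kstwo intact, which is exactly the consistency that point 4 asks you to verify.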