Objectives
- Visualize bivariate data and show the relationships between
variables.
- Test relationships between different types of variables in
one-sample studies.
- Use bootstrapping methods to estimate confidence intervals.
Note on the use of Generative AI tools (ChatGPT, Copilot,
etc.): while not forbidden, I strongly recommend not using
GenAI tools until you really understand the code that you
are trying to write. GenAI tools are unreliable and notoriously poor as
“learning” tools. They require a lot of micromanagement to be accurate.
They can be helpful as a productivity boost for skilled developers, but
will stop you from growing your own skills if used too soon.
Exercise 1
Objectives: visualize bivariate data & perform appropriate
hypothesis tests.
Using the instrument_experiment.csv data from the previous lab:
- Visualize the (possible) relationship between different pairs of
variables:
- measure_before vs measure_after
- measure_before vs temperature
- hospital vs measure_before
- instrument_id vs measure_before
- From these visualizations, what would you conclude about the potential
dependence between variables?
- For the same pairs of variables as above, test the independence
hypothesis using the appropriate statistical tests. Don’t forget to check
the assumptions of each test, such as normality or equality of variance.
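As a rough illustration of the last point, the sketch below checks assumptions before picking a test. It runs on synthetic data, not on instrument_experiment.csv; the helper names `numeric_vs_numeric` and `categorical_vs_numeric` and the 0.05 threshold are assumptions made for this example, and this is only one possible workflow among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): two numeric variables and one categorical variable.
x = rng.normal(10.0, 2.0, size=200)
y = x + rng.normal(0.0, 1.0, size=200)
labels = rng.integers(0, 3, size=200)


def numeric_vs_numeric(a, b, alpha=0.05):
    """Pearson correlation if both samples pass Shapiro-Wilk, Spearman otherwise."""
    normal = all(stats.shapiro(v).pvalue > alpha for v in (a, b))
    if normal:
        r, p = stats.pearsonr(a, b)
        return "pearson", r, p
    r, p = stats.spearmanr(a, b)
    return "spearman", r, p


def categorical_vs_numeric(values, groups, alpha=0.05):
    """One-way ANOVA if normality and equal variance hold, Kruskal-Wallis otherwise."""
    samples = [values[groups == g] for g in np.unique(groups)]
    normal = all(stats.shapiro(s).pvalue > alpha for s in samples)
    equal_var = stats.levene(*samples).pvalue > alpha
    if normal and equal_var:
        stat, p = stats.f_oneway(*samples)
        return "anova", stat, p
    stat, p = stats.kruskal(*samples)
    return "kruskal", stat, p


test_nn, stat_nn, p_nn = numeric_vs_numeric(x, y)
test_cn, stat_cn, p_cn = categorical_vs_numeric(x, labels)
print(test_nn, p_nn)
print(test_cn, p_cn)
```

With real data, the arrays would come from the columns of the CSV file, and the plots from the first part of the exercise should be used to sanity-check whatever the tests report.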
Exercise 2
The data we used so far is from an experiment that lacks a
control group. The conclusions we have been able to
draw so far are therefore limited. In the
instrument_experiment_control.csv file, we have an improved
dataset where, this time, patients were split between a “CONTROL” and
“TREATMENT” group.
Using this new dataset, try to answer the following questions:
- Were the patients randomly split between the two groups
(i.e. is the group assignment independent of the instrument, hospital
and observer)?
- Is there a significant difference in the effect between the
two groups?
- What is the success rate and effect size of the
treatment?
- Using bootstrapping, determine a 95% confidence interval for
each of those two values.
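The bootstrapping step can be sketched as follows: a minimal percentile bootstrap, shown on synthetic data. The function name `bootstrap_ci`, the number of resamples, and the use of a difference in means as the statistic are assumptions made for illustration; the exercise leaves the exact definition of the effect size and success rate to you.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins (assumption): one measured effect per patient in each group.
control = rng.normal(0.0, 1.0, size=80)
treatment = rng.normal(0.8, 1.0, size=80)


def mean_difference(a, b):
    """Illustrative statistic: mean of the second group minus mean of the first."""
    return b.mean() - a.mean()


def bootstrap_ci(a, b, stat, n_resamples=5000, level=0.95, rng=rng):
    """Percentile bootstrap confidence interval for stat(a, b)."""
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each group independently, with replacement, at its original size.
        resampled_a = rng.choice(a, size=a.size, replace=True)
        resampled_b = rng.choice(b, size=b.size, replace=True)
        estimates[i] = stat(resampled_a, resampled_b)
    tail = (1.0 - level) / 2.0
    return np.quantile(estimates, [tail, 1.0 - tail])


observed = mean_difference(control, treatment)
low, high = bootstrap_ci(control, treatment, mean_difference)
print(f"observed difference: {observed:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

Any other statistic (for example a proportion of successes) can be passed as `stat`, as long as it takes the two resampled arrays and returns a single number.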
Exercise 3 (optional)
Objectives: experiment with simulations to find the distribution of
the Kolmogorov-Smirnov statistic.
This is an additional, optional exercise which can help you get a
better sense of how we can use simulations to find the shape of test
statistic distributions.
- Write a function that draws a random sample of size n from a continuous
distribution 𝒟 of your choice and
computes the two-sided Kolmogorov-Smirnov statistic
(i.e. the largest absolute difference between the empirical cumulative
distribution function of the sample and the cumulative distribution
function of the theoretical distribution).
- By repeatedly calling this function, build a histogram of frequencies
for the values of the statistic when H0 holds (which is the case here, as the
sample is randomly drawn from the theoretical distribution).
- Compare the simulated distribution of the statistic to the
theoretical distribution, which you can find using
scipy.stats.kstwo.
- Repeat for different distributions 𝒟 and/or different sample sizes n to show that the results are
consistent.
Hint: to compute the statistic, you may either try to
reimplement it yourself (which is a nice exercise but may be a bit
difficult), or use the statistic returned by scipy.stats.kstest.
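One possible shape for this simulation is sketched below. It takes the second route from the hint (the statistic returned by scipy.stats.kstest) and compares a few quantiles instead of plotting a histogram. The helper name `ks_statistic`, the choice of a standard normal for 𝒟, and the values n = 30 and 2000 repetitions are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)


def ks_statistic(n, dist=stats.norm, rng=rng):
    """Draw n values from dist and return the two-sided KS statistic against dist."""
    sample = dist.rvs(size=n, random_state=rng)
    return stats.kstest(sample, dist.cdf).statistic


n = 30
n_repeats = 2000

# Distribution of the statistic under H0: every sample really comes from dist.
simulated = np.array([ks_statistic(n) for _ in range(n_repeats)])

# Compare a few simulated quantiles with those of the theoretical kstwo distribution.
probs = [0.5, 0.9, 0.95]
empirical_q = np.quantile(simulated, probs)
theoretical_q = stats.kstwo.ppf(probs, n)

for p, e, t in zip(probs, empirical_q, theoretical_q):
    print(f"quantile {p:.2f}: simulated {e:.4f}, theoretical {t:.4f}")
```

Passing another continuous distribution (for instance `stats.expon`) as `dist`, or changing `n`, lets you check that the distribution of the statistic depends on the sample size but not on 𝒟.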