STAT-H400 - Lab 1

Univariate statistics

Objectives

Understand the types of data in a dataset.
Visualize data of different types.
Summarize univariate data.
Use hypothesis tests on univariate data.

Note on the use of Generative AI tools (ChatGPT, Copilot, etc.): while they are not forbidden, I strongly recommend not using GenAI tools until you really understand the code that you are trying to write. GenAI tools are unreliable and notoriously bad as “learning” tools. They require a lot of micromanagement to be accurate. They can be a helpful productivity boost for skilled developers, but they will stop you from growing your own skills if used too soon.

Exercise 1

Objectives: learn to quickly get a sense of the information presented in a dataset.

We will start by using the instrument_calibration.csv file (provided on the Virtual University). It contains a synthetic dataset about the calibration of different instruments, made at different hospitals and by different operators.

Read the data in the instrument_calibration.csv file.
For each of the variables, identify the type of data.
For each of the variables, compute informative summaries.
For each of the variables, propose a visualization.
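As a starting point for these steps, here is a minimal sketch using pandas. The real column names are not given in this handout, so the sketch builds a small synthetic table whose column names (hospital, instrument, target, measured, temperature) are purely hypothetical; replace them with the names found in the actual file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for instrument_calibration.csv.
# All column names and values below are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hospital": rng.choice(["A", "B", "C"], size=n),
    "instrument": rng.choice(["I1", "I2"], size=n),
    "target": rng.choice([10.0, 20.0, 30.0], size=n),
    "temperature": rng.normal(21.0, 1.5, size=n),
})
df["measured"] = df["target"] + rng.normal(0.0, 0.5, size=n)
# Knock out a few values to mimic missing data.
df.loc[rng.choice(n, size=10, replace=False), "measured"] = np.nan

# With the real file, the table would instead come from:
# df = pd.read_csv("instrument_calibration.csv")

# Storage type of each column (a first hint at the type of data).
print(df.dtypes)

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count, mean, standard deviation and quartiles of the numeric columns.
numeric_summary = df.describe()
print(numeric_summary)

# Frequency table for each non-numeric column.
categorical_counts = {
    col: df[col].value_counts(dropna=False)
    for col in df.select_dtypes(exclude="number").columns
}
for col, counts in categorical_counts.items():
    print(counts)
```

Note that the storage type reported by pandas is not always the statistical type: a numeric column with only a few distinct values may be better treated as categorical or ordinal.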

What conclusions can you draw from this information? How can you handle the missing values in the data?

Exercise 2

Objectives: perform simple univariate hypothesis tests, perform simple operations on datasets.

Process the data in the instrument_calibration.csv file so that you get, for each instrument, the distribution of the difference between measured and target value.
For each instrument, estimate the accuracy and precision of the measurements.
For each instrument, test the hypothesis that it provides an unbiased measure.
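The steps above could be sketched as follows. This is one possible approach, not the expected solution: it assumes hypothetical column names (instrument, target, measured), uses synthetic data, takes the mean difference as an estimate of the bias (accuracy) and the sample standard deviation as an estimate of the spread (precision), and uses a one-sample t-test against zero, which itself assumes roughly normal differences.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: two hypothetical instruments, one unbiased, one biased.
rng = np.random.default_rng(1)
true_bias = {"I1": 0.0, "I2": 0.8}  # hypothetical values
frames = []
for name, bias in true_bias.items():
    target = rng.choice([10.0, 20.0, 30.0], size=100)
    measured = target + bias + rng.normal(0.0, 0.5, size=100)
    frames.append(
        pd.DataFrame({"instrument": name, "target": target, "measured": measured})
    )
df = pd.concat(frames, ignore_index=True)

# Difference between measured and target value.
df["diff"] = df["measured"] - df["target"]

results = {}
for name, d in df.groupby("instrument")["diff"]:
    d = d.dropna()  # one simple way to handle missing values
    # H0: the mean difference is zero (unbiased instrument).
    t_stat, p_value = stats.ttest_1samp(d, popmean=0.0)
    results[name] = {
        "bias": d.mean(),          # accuracy: closeness of the mean to the target
        "spread": d.std(ddof=1),   # precision: dispersion of the measurements
        "t": t_stat,
        "p": p_value,
    }

summary = pd.DataFrame(results).T
print(summary)
```

The same groupby pattern can be reused with an operator column to explore whether operators influence the measure.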

Which instrument(s) would you trust more? How could you take the biases into account?

Are the operators influencing the measure?

Exercise 3

Objectives: compute confidence intervals on univariate data and test normality.

Compute 95% confidence intervals for the value of the mean difference between measured and target value, using the original data and using the bias-adjusted data.
Draw a QQ-plot of the mean difference distribution. Does it follow a normal distribution?
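For the confidence interval, a minimal sketch based on the Student t distribution is shown below. The data are synthetic, the helper name mean_confidence_interval is hypothetical, and "bias-adjusted" is interpreted here as subtracting the estimated bias from each difference, which is an assumption about what the exercise intends.

```python
import numpy as np
from scipy import stats

# Hypothetical differences (measured - target) for a single instrument.
rng = np.random.default_rng(2)
diff = rng.normal(0.8, 0.5, size=100)


def mean_confidence_interval(x, level=0.95):
    """t-based confidence interval for the mean of x, ignoring NaN values."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    sem = stats.sem(x)  # standard error of the mean (uses ddof=1)
    return stats.t.interval(level, df=len(x) - 1, loc=x.mean(), scale=sem)


# Interval on the original differences.
low, high = mean_confidence_interval(diff)
print(low, high)

# Interval after subtracting the estimated bias (assumed meaning of "bias-adjusted").
adj_low, adj_high = mean_confidence_interval(diff - diff.mean())
print(adj_low, adj_high)
```

When the bias is estimated from the same sample, the adjusted interval is centered on zero by construction and has the same width as the original one; the comparison is more informative when the bias comes from separate calibration data.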

Hint: you can use the qqplot function from the statsmodels.api module.

Draw a QQ-plot of the temperature distribution. Does it follow a normal distribution?

Exercise 4

Objectives: draw appropriate conclusions from experimental data.

We will use the instrument_experiment.csv file. It contains (synthetic) experimental data using the same hospitals, operators and instruments as before (i.e. the biases, precisions, etc. that you computed still apply). This time, there are two measures: one before an operation, and one after. It is expected that, when the operation is successful, the measure decreases by a certain amount.

Try to determine, from the data:

How often is the operation successful?
When it is successful, what is the size of the effect on the measure?

Justify your choices and propose appropriate visualizations, statistical tests and/or confidence intervals.
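One possible first step, sketched here on synthetic data, is to work with the paired decrease (before minus after) for each case. The column names before and after, the success rate and the effect size below are all hypothetical. The sketch only checks whether the measure decreases on average and prepares a histogram; separating successful from unsuccessful operations is left as part of the exercise.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic paired data; every numeric value here is a hypothetical choice.
rng = np.random.default_rng(4)
n = 150
success = rng.random(n) < 0.6  # hypothetical success indicator (unknown in practice)
before = rng.normal(50.0, 5.0, size=n)
after = before - np.where(success, 4.0, 0.0) + rng.normal(0.0, 0.7, size=n)
df = pd.DataFrame({"before": before, "after": after})

# Paired decrease for each case.
df["decrease"] = df["before"] - df["after"]

# Paired t-test. H0: the mean of (before - after) is zero.
t_stat, p_value = stats.ttest_rel(df["before"], df["after"])
print(t_stat, p_value)

# Histogram counts of the decrease; two separate modes would suggest
# a mixture of successful and unsuccessful operations.
counts, bin_edges = np.histogram(df["decrease"], bins=30)
print(counts)
```

If both measures of a pair are taken with the same instrument, an additive bias cancels in the difference; if not, the biases estimated earlier would need to be subtracted first.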