Parameters and statistics

Plain English

A parameter is a function that describes some measurable characteristic of a population (see Population and sampling). It is generally unknown. The term is also used for the value of that function when computed on the population.

Example: the “average weight of patients with disease A”. Knowing that would require measuring the weight of every patient – even undiagnosed ones!

A statistic is a function that describes some characteristic of a sample. Its value for a sample can be computed exactly.

Example: the “average weight for a sample of 100 patients”.

A statistic is typically used as an estimator of a parameter.

Example: we estimate that the average weight of all patients is equal (or close) to the average weight of the 100 sampled patients.

A statistic is a random variable: over repeated samples it doesn't take a single value but follows a sampling distribution, which may or may not have a known shape (such as a normal distribution). A parameter, in contrast, is a fixed (though usually unknown) quantity of the population.
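A small simulation makes the distinction concrete. The population below (normal, mean 80 kg, standard deviation 10 kg) is made up purely for illustration: each new sample of 100 "patients" yields a different value of the sample-mean statistic, while the population parameter never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameter: mean weight 80 kg, sd 10 kg (made up).
mu, sigma = 80.0, 10.0

# Draw 1000 independent samples of 100 "patients" each and compute
# the sample mean (a statistic) for every sample.
sample_means = np.array([
    rng.normal(loc=mu, scale=sigma, size=100).mean()
    for _ in range(1000)
])

# Each sample gives a different estimate of the same fixed parameter:
print(f"parameter mu = {mu}")
print(f"estimates range from {sample_means.min():.1f} to {sample_means.max():.1f}")
print(f"mean of the estimates: {sample_means.mean():.2f}")
```

The spread of the 1000 estimates is exactly the sampling distribution mentioned above.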

Mathematical formalism

Let \(\mathcal{D}\) be the population distribution of a random variable \(X\): \(X \sim \mathcal{D}\), where \(X\) represents a measurable characteristic of the population.

A sample of \(X\) of size \(n\) will be defined as: \(\mathcal{S} = \{X_1, \ldots, X_n\}\), where the \(X_i \sim \mathcal{D}\) are independent and identically distributed (i.i.d.) random variables.

Sampling \(X\) will give us a set of observed values \(x = \{x_i\}\).

A parameter is defined at the population level as a function of \(X\): \(\Theta = g(X)\).

A statistic is defined at the sample level as a function of \(\mathcal{S}\): \(T = f(\mathcal{S})\), which gives us an observed value \(t = f(x)\). In general, \(g\) and \(f\) are different functions.
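A minimal sketch of this notation in code. The gamma distribution below is an arbitrary stand-in for \(\mathcal{D}\); the parameter here is the population mean \(E[X]\), and the statistic is the sample mean computed from observed draws.

```python
import numpy as np
from scipy import stats

# Population distribution D (chosen arbitrarily for illustration):
D = stats.gamma(a=2, scale=3)

# Parameter: Theta = E[X]. We only know its exact value because we chose D.
theta = D.mean()

# Observed sample x = {x_i}: n i.i.d. draws from D.
rng = np.random.default_rng(42)
x = D.rvs(size=500, random_state=rng)

# Observed value of the statistic: the sample mean t = f(x).
t = x.mean()

print(f"parameter Theta = {theta}, observed statistic t = {t:.2f}")
```

With a real population we would only ever see \(t\); \(\Theta\) would remain unknown.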

Example

Population: adults with diabetes.

\(X\): blood sugar concentration in an adult, follows unknown distribution \(\mathcal{D}\).

Parameter: variance in blood sugar concentration in the population \(\rightarrow Var[X] = \sigma^2\), unknown.

Sample: 50 adults with diabetes. \(\{X_1, ..., X_i, ..., X_{50}\}\) describes a sample as a concept (not any particular sample).

Statistic: sample variance, \(S^2 = \frac{1}{49}\sum_{i=1}^{50}(X_i-\bar{X})^2\), with \(\bar{X} = \frac{1}{50}\sum_{i=1}^{50}X_i\) the sample mean.

Observed values in the sample: \(\{x_1, ..., x_i, ..., x_{50}\}\), the blood sugar concentration of 50 individual patients.

Observed value of the statistic: \(s^2 = \frac{1}{49}\sum_{i=1}^{50}(x_i-\bar{x})^2\).

If we take a new sample, \(X\), \(X_i\), \(S^2\) and \(Var[X]\) remain identical: their definitions do not depend on any particular sample. Only the observed values \(x_i\) and, therefore, \(s^2\) change.

\(X\), \(X_i\) and \(S^2\) are random variables: they don't have a single value but a distribution. \(Var[X] = \sigma^2\) is a fixed, unknown constant of the population, while \(x_i\) and \(s^2\) are definite observed values.
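We can illustrate this with a simulation. The true \(\mathcal{D}\) is unknown, so the normal distribution below (mean 130, sd 25 mg/dL) is a made-up stand-in: each simulated sample of 50 patients gives a different observed \(s^2\), while \(\sigma^2\) stays fixed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-in for the unknown D: blood sugar ~ N(130, 25) mg/dL.
sigma2 = 25.0 ** 2  # the parameter Var[X], fixed

# 2000 independent samples of 50 patients each; for each sample, the
# observed sample variance s^2 (ddof=1 gives the 1/49 factor above).
s2 = np.array([
    rng.normal(130, 25, size=50).var(ddof=1)
    for _ in range(2000)
])

print(f"Var[X] = {sigma2:.0f}")
print(f"s^2 varies across samples: min {s2.min():.0f}, max {s2.max():.0f}")
print(f"mean of the s^2 values: {s2.mean():.0f}")  # close to sigma^2
```

The histogram of these 2000 values of \(s^2\) is (an approximation of) the distribution of the random variable \(S^2\).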

Probability density function

A continuous distribution \(\mathcal{D}\) is characterized by a probability density function (pdf), \(f(x)\). The pdf describes the probability that a value drawn from that distribution falls into an interval \([a, b]\), through the relationship: \(P(a \leq x \leq b) = \int_a^b f(x)\,dx\). Graphically, it means that the probability that \(x\) is in the interval \([a, b]\) is equal to the area under the curve between \(a\) and \(b\).

For instance, take \(\mathcal{D} = N(0, 1)\), the standard normal distribution, with \(f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\). With \(a=0.5\) and \(b=1.2\), we can use scipy.stats.norm to get \(P(0.5 \leq x \leq 1.2) \approx 0.19\):

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

a, b = 0.5, 1.2
Z = stats.norm(loc=0, scale=1)
p = Z.cdf(b) - Z.cdf(a)  # CDF(a) = P(x <= a)

# plot pdf and shade the interval [a, b]:
x = np.linspace(Z.ppf(0.01), Z.ppf(0.99), 100)
section = np.linspace(a, b, 100)

fig, ax = plt.subplots()
ax.plot(x, Z.pdf(x), 'r-', label='Z pdf')
ax.fill_between(section, Z.pdf(section), color='r', alpha=0.5, label='P(a <= x <= b)')
ax.text(a + (b - a) / 2, 0.1, f"{p:.2f}", horizontalalignment='center')
ax.legend()

And we can draw a very large sample to check how many values we get in that interval:

x = Z.rvs(size=1_000_000)
in_interval = (x >= a) & (x <= b)
print(f"{100*in_interval.mean():.2f}% values in interval")

Result: 19.36% values in interval
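As a final sanity check on the relationship \(P(a \leq x \leq b) = \int_a^b f(x)\,dx\), we can also integrate the pdf numerically with scipy.integrate.quad and compare the area with the CDF difference used above:

```python
from scipy import stats
from scipy.integrate import quad

a, b = 0.5, 1.2
Z = stats.norm(0, 1)

# Numerically integrate the pdf over [a, b] and compare with CDF(b) - CDF(a).
area, err = quad(Z.pdf, a, b)
print(f"integral: {area:.4f}, CDF difference: {Z.cdf(b) - Z.cdf(a):.4f}")
```

Both approaches agree with each other and with the empirical frequency from the large sample.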