Parameters and statistics

Plain English

A parameter is a function that describes some measurable characteristic of a population (see Population and sampling). It is generally unknown. The term is also used for the value of that function when computed on the population.

Example: the “average weight of patients with disease A”. Knowing that would require measuring the weight of every patient – even undiagnosed ones!

A statistic is a function that describes some characteristic of a sample. Its value for a sample can be computed exactly.

Example: the “average weight for a sample of 100 patients”.

A statistic is typically used as an estimator of a parameter.

Example: we estimate that the average weight of all patients is equal (or close) to the average weight of the 100 sampled patients.

A statistic is a random variable: over repeated samples it doesn't take a single value but follows a sampling distribution, which may or may not have a known shape (such as a normal distribution). A parameter, in contrast, is a fixed (though usually unknown) quantity of the population.
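A small simulation makes the distinction concrete. The population below (normal, mean 80 kg, standard deviation 10 kg) is made up purely for illustration: each new sample of 100 "patients" yields a different value of the sample-mean statistic, while the population parameter never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameter: mean weight 80 kg, sd 10 kg (made up).
mu, sigma = 80.0, 10.0

# Draw 1000 independent samples of 100 "patients" each and compute
# the sample mean (a statistic) for every sample.
sample_means = np.array([
    rng.normal(loc=mu, scale=sigma, size=100).mean()
    for _ in range(1000)
])

# Each sample gives a different estimate of the same fixed parameter:
print(f"parameter mu = {mu}")
print(f"estimates range from {sample_means.min():.1f} to {sample_means.max():.1f}")
print(f"mean of the estimates: {sample_means.mean():.2f}")
```

The spread of the 1000 estimates is exactly the sampling distribution mentioned above.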

Mathematical formalism

Let \(\mathcal{D}\) be the population distribution of a random variable \(X\): \(X \sim \mathcal{D}\), where \(X\) represents a measurable characteristic of the population.

A sample of \(X\) of size \(n\) will be defined as: \(\mathcal{S} = \{X_1, \ldots, X_n\}\), where the \(X_i \sim \mathcal{D}\) are independent and identically distributed (i.i.d.) random variables.

Sampling \(X\) will give us a set of observed values \(x = \{x_i\}\).

A parameter is defined at the population level as a function of \(X\): \(\Theta = g(X)\).

A statistic is defined at the sample level as a function of \(\mathcal{S}\): \(T = f(\mathcal{S})\), which gives us an observed value \(t = f(x)\). In general, \(g\) and \(f\) are different functions.
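A minimal sketch of this notation in code. The gamma distribution below is an arbitrary stand-in for \(\mathcal{D}\); the parameter here is the population mean \(E[X]\), and the statistic is the sample mean computed from observed draws.

```python
import numpy as np
from scipy import stats

# Population distribution D (chosen arbitrarily for illustration):
D = stats.gamma(a=2, scale=3)

# Parameter: Theta = E[X]. We only know its exact value because we chose D.
theta = D.mean()

# Observed sample x = {x_i}: n i.i.d. draws from D.
rng = np.random.default_rng(42)
x = D.rvs(size=500, random_state=rng)

# Observed value of the statistic: the sample mean t = f(x).
t = x.mean()

print(f"parameter Theta = {theta}, observed statistic t = {t:.2f}")
```

With a real population we would only ever see \(t\); \(\Theta\) would remain unknown.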

Example

Population: adults with diabetes.

\(X\): blood sugar concentration in an adult, follows unknown distribution \(\mathcal{D}\).

Parameter: variance in blood sugar concentration in the population \(\rightarrow Var[X] = \sigma^2\), unknown.

Sample: 50 adults with diabetes. \(\{X_1, ..., X_i, ..., X_{50}\}\) describes a sample as a concept (not any particular sample).

Statistic: sample variance, \(S^2 = \frac{1}{49}\sum_{i=1}^{50}(X_i-\bar{X})^2\), with \(\bar{X} = \frac{1}{50}\sum_{i=1}^{50}X_i\) the sample mean.

Observed values in the sample: \(\{x_1, ..., x_i, ..., x_{50}\}\), the blood sugar concentration of 50 individual patients.

Observed value of the statistic: \(s^2 = \frac{1}{49}\sum_{i=1}^{50}(x_i-\bar{x})^2\).

If we take a new sample, \(X\), \(X_i\), \(S^2\) and \(Var[X]\) remain identical: their definitions do not depend on any particular sample. Only the observed values \(x_i\) and, therefore, \(s^2\) change.

\(X\), \(X_i\) and \(S^2\) are random variables: they don't have a single value but a distribution. \(Var[X] = \sigma^2\) is a fixed, unknown constant of the population, while \(x_i\) and \(s^2\) are definite observed values.
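We can illustrate this with a simulation. The true \(\mathcal{D}\) is unknown, so the normal distribution below (mean 130, sd 25 mg/dL) is a made-up stand-in: each simulated sample of 50 patients gives a different observed \(s^2\), while \(\sigma^2\) stays fixed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-in for the unknown D: blood sugar ~ N(130, 25) mg/dL.
sigma2 = 25.0 ** 2  # the parameter Var[X], fixed

# 2000 independent samples of 50 patients each; for each sample, the
# observed sample variance s^2 (ddof=1 gives the 1/49 factor above).
s2 = np.array([
    rng.normal(130, 25, size=50).var(ddof=1)
    for _ in range(2000)
])

print(f"Var[X] = {sigma2:.0f}")
print(f"s^2 varies across samples: min {s2.min():.0f}, max {s2.max():.0f}")
print(f"mean of the s^2 values: {s2.mean():.0f}")  # close to sigma^2
```

The histogram of these 2000 values of \(s^2\) is (an approximation of) the distribution of the random variable \(S^2\).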

Probability density function

A continuous distribution \(\mathcal{D}\) is characterized by a probability density function (pdf), \(f(x)\). The pdf describes the probability that a value drawn from that distribution falls into an interval \([a, b]\), through the relationship: \(P(a \leq x \leq b) = \int_a^b f(x)\,dx\). Graphically, it means that the probability that \(x\) is in the interval \([a, b]\) is equal to the area under the curve between \(a\) and \(b\).

For instance, take \(\mathcal{D} = N(0, 1)\), the standard normal distribution, with \(f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\). With \(a=0.5\) and \(b=1.2\), we can use scipy.stats.norm to get \(P(0.5 \leq x \leq 1.2) \approx 0.19\):

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

a, b = 0.5, 1.2
Z = stats.norm(loc=0, scale=1)
p = Z.cdf(b) - Z.cdf(a)  # CDF(a) = P(x <= a)

# plot pdf and shade the interval [a, b]:
x = np.linspace(Z.ppf(0.01), Z.ppf(0.99), 100)
section = np.linspace(a, b, 100)

fig, ax = plt.subplots()
ax.plot(x, Z.pdf(x), 'r-', label='Z pdf')
ax.fill_between(section, Z.pdf(section), color='r', alpha=0.5, label='P(a <= x <= b)')
ax.text(a + (b - a) / 2, 0.1, f"{p:.2f}", horizontalalignment='center')
ax.legend()

And we can draw a very large sample to check how many values we get in that interval:

x = Z.rvs(size=1_000_000)
in_interval = (x >= a) & (x <= b)
print(f"{100*in_interval.mean():.2f}% values in interval")

Result: 19.36% values in interval
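As a final sanity check on the relationship \(P(a \leq x \leq b) = \int_a^b f(x)\,dx\), we can also integrate the pdf numerically with scipy.integrate.quad and compare the area with the CDF difference used above:

```python
from scipy import stats
from scipy.integrate import quad

a, b = 0.5, 1.2
Z = stats.norm(0, 1)

# Numerically integrate the pdf over [a, b] and compare with CDF(b) - CDF(a).
area, err = quad(Z.pdf, a, b)
print(f"integral: {area:.4f}, CDF difference: {Z.cdf(b) - Z.cdf(a):.4f}")
```

Both approaches agree with each other and with the empirical frequency from the large sample.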