Multivariate data analysis (STAT-H400)

5. Supervised methods

Adrien Foucart

Introduction

What is this module about?

Outcome variables

Individual \(X_1\) … \(X_j\) … \(X_p\) \(Y\)
i \(x_{i1}\) … \(x_{ij}\) … \(x_{ip}\) \(y_i\)

In addition to the independent variables \(X_j\), we now have a dependent (= outcome) variable \(Y\).

Line vector \(x_{i.}\) represents one individual in a p-dimensional space \(\rightarrow\) “variable space” or “feature space”. To that line vector is associated a target \(y_i\).

Outcome variables

Different tasks will have different types for \(Y\).

\(Y\) is… Task
Binary categorical Binary classification or detection
Nominal Multiclass classification
Vector of binary categoricals Multiclass detection/classification
Numerical Simple or Multiple regression
(depending on number of \(X_j\))
Vector of numericals General multivariate regression

The difference between binary classification and detection is that in a binary classification problem, the binary categorical variable is expressed as “Class A or Class B” (example: “Cat” or “Dog”), whereas in a detection problem it is expressed as “Class A or Not Class A” (example: “Cat” or “No Cat”). It’s an important distinction because, typically, in a detection problem, our evaluation will be focused on the “positive” category, and we will use metrics such as the F1-score, Recall and Precision, which ignore the “True Negatives”. In a classification problem, we don’t have a “negative” and a “positive” class, so we have to be careful to use metrics which are symmetrical in how they treat the classes.

Outcome variables

Example: a generative large language model receives as input a list of “tokens” (i.e. vectors in an “embedding” space) and outputs the “most likely next token” from a list of possible tokens.

Is it…

  1. A multiple regression task
  2. A multivariate regression task
  3. A binary classification task
  4. A multiclass classification task

As it chooses from a “list of possible tokens”, it is a multiclass classification problem where the categories are the possible tokens.

Outcome variables

Example: a generative image diffusion model receives as input a list of “tokens” and outputs a matrix of pixel values forming an image that “matches” the tokens (the “prompt”).

Is it…

  1. A multiple regression task
  2. A multivariate regression task
  3. A binary classification task
  4. A multiclass classification task

This time, the output is one numerical value per pixel \(\rightarrow\) this is a multivariate regression task.

Classification vs Regression

Correctly classified if \(Y_{pred} = Y_{true}\)

Summary of results: confusion matrix and associated metrics (e.g. accuracy, F1-score…).

Pred \ True Benign Cancer
Benign a b
Cancer c d

Good regression quality if \(|Y_{pred} - Y_{true}| \approx 0\).

Summary of results: average error measures (e.g. RMSE).

Supervised learning problem

Given a set of examples \(\mathcal{D} = \{(\vec{X_i}, \vec{Y_i})\}\), find a relationship \(\vec{Y} = f(\vec{X}, \vec{\theta})\) with \(\vec{\theta}\) a vector of parameters of the model, such that the error \(\|\vec{Y_i} - f(\vec{X_i}, \vec{\theta})\|\) is minimized over all \((X_i, Y_i)\).

Very often (but not always) the process of solving the problem will be: choose the general form of \(f(\vec{X}, \vec{\theta})\) (the “model”), and then use an optimization technique to find the “best” \(\theta\) (“learn” the best parameters).

Example: simple linear regression \(\rightarrow\) set \(f(x; a, b) = ax + b\); find \(a, b\) so that \(\sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - ax_i - b)^2}\) is minimized on \(\mathcal{D}\).

Supervised learning problem

A “supervised learning” solution will therefore include…

Two main “classes” of methods:

The supervised learning pipeline

Training set and overfitting

A training set can be simply defined as a dataset used for “training” the model, i.e. to find the “best” \(\vec{\theta}\) such that the error made by the model is minimized.

…but the best \(\vec{\theta}\) for the training set is not what we are looking for. We want the best \(\vec{\theta}\) for any data that our model may need to process. In other words, we need the best \(\vec{\theta}\) for the population of all potential test cases.

Training set and overfitting

Extreme example: our model \(f(\vec{x}, \vec{\theta})\) is simply a look-up table of the training set, so that \(f(\vec{x_i}) = \vec{y_i} \; \forall (\vec{x_i}, \vec{y_i}) \in \mathcal{D}_{train}\) (the parameters \(\vec{\theta}\) are the training examples themselves).

This certainly minimizes the error on the training set, but our model is actually undefined for any previously unseen case!

Slightly less extreme: the nearest neighbour algorithm. For a given \(x\), set \(y\) to the value of \(y_i\) such that \(d(x, x_i)\) is minimal over all points in the training set.

\(\rightarrow\) defined everywhere… but clearly very sensitive to any noise in the data.
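The nearest-neighbour rule above can be sketched in a few lines of NumPy (toy data and Euclidean distance assumed):

```python
import numpy as np

def nearest_neighbour(x, X_train, y_train):
    """Predict y for x as the label of the closest training point."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    return y_train[np.argmin(d)]

# Toy training set: two well-separated classes
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
```

Any query point gets a prediction, but a single mislabelled training point flips the prediction in its whole neighbourhood, hence the noise sensitivity.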

Over- and under-fitting

Overfitting is a phenomenon in machine learning where the model fits the training data too closely, to the point that it doesn’t generalize well to new data. It is often a sign that the model is too complex.

Underfitting is when the model is too generic and fails to capture the complexity of the data. It is often a sign that the model is not complex enough.

Training, validation, test

To avoid (or at least measure) overfitting: split the training set into a training and a validation set.

The training set is used to determine the best \(\vec{\theta}\) of the model. The validation set is used to determine the best configuration of the pipeline: choice of model, optimization parameters, stopping criterion, etc. Better scores on the validation set are more likely to result in better scores on the population.

A separate test set is used to provide a final evaluation of the complete solution.

Training, validation, test

It is very important to have a clear strategy of how to split the data before we start training models.

Main rules:

Training, validation, test

Training, validation, test

For the training and validation set, we can make better use of the available data with cross-validation.

Image: Gufosawa for Wikipedia. “Test data” should be “validation data” using our terminology.

Training, validation, test

Typical procedure with cross-validation:

  1. For each set of hyper-parameters that you want to test (e.g. model type, learning rate, etc.), perform k-fold cross-validation with the training set and compute a score \(s_i\) on each of the \(k\) iterations.
  2. Determine which hyper-parameters lead to the best results: e.g. highest average score \(\frac{1}{k}\sum_{i=1}^k s_i\). You can use a statistical test such as Friedman (if you used the same folds!) to check if the hyper-parameters have a significant impact.
  3. Re-train your model on the entire training set using the best hyper-parameters
  4. Evaluate your model on the test set.
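The procedure above can be sketched with scikit-learn; the dataset, model family and candidate hyper-parameters below are placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1-2. k-fold cross-validation for each candidate hyper-parameter,
# using the same folds for every candidate (so paired tests are possible)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for n_neighbors in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=n_neighbors),
                             X, y, cv=cv)
    mean_scores[n_neighbors] = scores.mean()
best_k = max(mean_scores, key=mean_scores.get)

# 3. Re-train on the entire training set with the best hyper-parameter
# (step 4 would then be a single evaluation on the held-out test set)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```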

Training, validation, test

IMPORTANT

You can (ideally) only use your test set once. It is a final test. After that, you have gained some knowledge about how your model behaves on the test set.

Any alteration of the model, training procedure, etc. that you do after that is done with knowledge of the test set. It is therefore no longer a test set: it has become part of the training set.

In practice, you often have to do some back-and-forth with the test set, or re-use the same benchmarks in different experiments. But you should be aware that each usage makes it more likely that you are overfitting the test set.

Quick quizz

To train a malign/benign cancer classifier, you gather data in a large, multi-centric study involving five hospitals and a total of 1,268 patients, distributed like this:

Benign Malign
Hospital 1 162 69
Hospital 2 169 77
Hospital 3 172 74
Hospital 4 187 85
Hospital 5 188 85

How would you split your dataset to maximize the generalizability of your results?

Given the large sample size, keeping one hospital aside as a test set seems reasonable.

Then, the four other hospitals can be used in a 4-fold cross-validation scheme, so that you always have a separate hospital in the validation set from the training set.
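One way to implement such a split, assuming a per-patient hospital label, is scikit-learn’s `GroupKFold`; the feature and label arrays below are placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical per-patient hospital labels (counts from the table above)
rng = np.random.default_rng(0)
hospital = np.repeat([1, 2, 3, 4, 5], [231, 246, 246, 272, 273])
X = rng.normal(size=(len(hospital), 4))       # placeholder features
y = rng.integers(0, 2, size=len(hospital))    # placeholder benign/malign labels

# Keep hospital 5 aside as the test set
test_mask = hospital == 5
X_trainval, y_trainval = X[~test_mask], y[~test_mask]
groups = hospital[~test_mask]

# 4-fold cross-validation where each validation fold is one whole hospital
gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X_trainval, y_trainval, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```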

Evaluation

General evaluation procedure

We evaluate the ability of a system or algorithm \(\mathcal{A}\) to perform a task (classification, regression…) \(\mathcal{T}\) on a population \(\mathcal{P}\), on which we measure variables \(X_i\). On a training and test set \(D_{train}\) and \(D_{test}\), we have pairs of measures \((X_i, Y_i)\) where \(Y_i\) is the desired output of the system.

The system will produce an output: \(\hat{Y}_{i} = \mathcal{A}(X_i)\).

To evaluate the system, we have an assessment metric \(\mathcal{M}\) so that we can compute a set of scores \(\{S_{i} = \mathcal{M}(\hat{Y}_{i}, Y_i)\}\) on all examples in \(D_{test}\).

General evaluation procedure

\(\Rightarrow\) None of these assumptions is fully verified in practice… the discussion of how right or wrong they are puts limitations on the conclusions of a study.

Evaluation metrics

For classification problems, the evaluation metric is typically based on the confusion matrix. For \(m\) categories:

GT \ Pred \(k=1\) \(k=2\) \(k=m\)
\(k=1\) \(C_{11}\) \(C_{12}\) \(C_{1m}\)
\(k=2\) \(C_{21}\) \(C_{22}\) \(C_{2m}\)
\(k=m\) \(C_{m1}\) \(C_{m2}\) \(C_{mm}\)

Accuracy (global): \(ACC = \frac{1}{N}\sum_{i=1}^{m}C_{ii}\) with \(N\) the total number of cases in the test set. \(0 \leq ACC \leq 1\)

Evaluation metrics

Sensitivity (per-class): \(SEN_{i} = \frac{C_{ii}}{\sum_{j=1}^mC_{ij}} \rightarrow\) ratio of “real” \(k=i\) that are correctly predicted (= true positive rate, recall, hit rate, power…). \(0 \leq SEN_i \leq 1\)

Precision (per-class): \(PRE_{i} = \frac{C_{ii}}{\sum_{j=1}^mC_{ji}} \rightarrow\) ratio of “predicted” \(k=i\) that are really correct (= positive predictive value). \(0 \leq PRE_i \leq 1\)

F1-score (per-class): \(F1_i = 2 \frac{PRE_i \times SEN_i}{PRE_i + SEN_i}\). \(0 \leq F1_i \leq 1\)

Evaluation metrics

Macro F1 (“simple average”, global): \(MF1 = \frac{1}{m}\sum_{i=1}^mF1_i\). \(0 \leq MF1 \leq 1\)

Macro F1 (“harmonic of averages”, global): another version, confusingly given the same name: first compute \(PRE = \frac{1}{m}\sum_{i=1}^mPRE_i\) and \(SEN = \frac{1}{m}\sum_{i=1}^mSEN_i \Rightarrow MF1 = 2 \frac{PRE \times SEN}{PRE + SEN}\). \(0 \leq MF1 \leq 1\)

Other options: Matthews Correlation Coefficient (\(-1 \leq MCC \leq 1\)), Cohen’s kappa (\(-1 \leq \kappa \leq 1\)), Geometric Mean of \(SEN_i\) (\(0 \leq GM \leq 1\))…

Evaluation metrics

Segmentation problems in image analysis = per-pixel classification \(\rightarrow\) we generally compute the confusion matrix per image. In a binary segmentation problem, we would have:

GT \ Pred 0 1
0 TN FP
1 FN TP

Dice coefficient = F1-score of the positive class = \(\frac{2TP}{2TP+FP+FN}\).

IoU (Intersection over Union) = Jaccard Index = \(\frac{TP}{TP+FP+FN}\).

Example

GT \ Pred Benign Grade 1 Grade 2
Benign 18 7 1
Grade 1 8 31 7
Grade 2 1 17 24

Per-class metrics

Class SEN PRE F1
Benign 0.69 0.67 0.68
Grade 1 0.67 0.56 0.61
Grade 2 0.57 0.75 0.65

Global metrics
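The per-class table above can be reproduced directly from the confusion matrix; the global accuracy comes out to \(73/114 \approx 0.64\):

```python
import numpy as np

# Confusion matrix of the example (rows = ground truth, columns = prediction)
C = np.array([[18,  7,  1],
              [ 8, 31,  7],
              [ 1, 17, 24]])

SEN = np.diag(C) / C.sum(axis=1)   # per-class sensitivity (row-wise)
PRE = np.diag(C) / C.sum(axis=0)   # per-class precision (column-wise)
F1 = 2 * PRE * SEN / (PRE + SEN)   # per-class F1-score
ACC = np.trace(C) / C.sum()        # global accuracy: 73/114
MF1 = F1.mean()                    # macro F1 ("simple average" version)
```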

Evaluation metrics

Regression problems:

Comparison of multiple algorithms

If we consider the results of each algorithm as a random variable, observed on a sample \(D_{test}\) of the population \(\mathcal{P}\) of all possible test cases…

Then we can test the null hypothesis that the algorithms have the same results in \(\mathcal{P}\) based on the observed results in \(D_{test}\).

\(\Rightarrow\) tests on paired samples. Which tests? It depends on the type of metric!

Categorical results per-case (typical in classification problems)

True Class \(A_1\) \(A_2\) \(A_m\)
1 1 1 2
3 2 3 3

Which we can transform into a “correct/incorrect” table:

\(A_1\) \(A_2\) \(A_m\)
Correct Correct Incorrect
Incorrect Correct Correct

\(\Rightarrow\) Cochran’s Q test (or McNemar if only two algorithms) + post-hocs.

Numerical results per-case (typical in segmentation, regression…)

\(A_1\) \(A_2\) \(A_m\)
0.71 0.85 0.92
0.87 0.91 0.56

\(\Rightarrow\) Friedman test + post-hocs.
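A minimal sketch with SciPy’s `friedmanchisquare`, on made-up per-case scores where the algorithms genuinely differ:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Made-up per-case scores for three algorithms on the same 30 test cases
rng = np.random.default_rng(0)
a1 = rng.uniform(0.6, 0.8, size=30)
a2 = a1 + rng.normal(0.10, 0.02, size=30)   # consistently better than A1
a3 = a1 - rng.normal(0.10, 0.02, size=30)   # consistently worse than A1

stat, p = friedmanchisquare(a1, a2, a3)      # H0: all algorithms perform the same
```

Post-hoc pairwise comparisons (e.g. Nemenyi) are available in third-party packages rather than SciPy itself.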

Example

Average scores:

1 2 3 4
0.71 0.59 0.76 0.69

Friedman p-value \(< 10^{-16}\) \(\rightarrow\) \(H_0\) rejected.

Nemenyi p-values \(< 0.05\)?:

1 2 3 4
1 - YES YES NO
2 YES - YES YES
3 YES YES - YES
4 NO YES YES -

\(\Rightarrow\) Algorithm 3 is significantly better than all the others. Algorithm 2 is significantly worse than all others. Algorithms 1 and 4 are not significantly different from each other.

Classification

Classification problem

Given a set of examples \(\mathcal{D} = \{(\vec{X_i}, \vec{Y_i})\}\), with \(\vec{Y_i}\) a vector of categorical variables, find a relationship \(\vec{Y} = f(\vec{X}, \vec{\theta})\) with \(\vec{\theta}\) a vector of parameters of the model, such that \(\mathcal{L}(\{\vec{Y_i}, f(\vec{X_i}, \vec{\theta})\})\) is minimized, with \(\mathcal{L}\) as loss function (or error function).

If \(\vec{Y_i}\) is a single binary variable \(Y_i \in \{0, 1\}\), we have a binary classification problem.

If \(\vec{Y_i}\) is a single nominal variable \(Y_i \in \{C_1, ..., C_m\}, m>2\), we have a multi-class classification problem.

If \(\vec{Y_i}\) has multiple variables \(\vec{Y_i} = (Y_{i0}, ... Y_{ik})\) (where \(Y_{ij}\) may be binary or not), we have a multi-label classification (in the binary case) or a multi-output classification (\(m>2\) classes) problem.

Classification problem

Other way of looking at the problem: a classifier is a method that divides feature space into decision regions, such that the “prediction” rule is that any point \(x \in R_k\) is assigned to class \(C_k\).


Classification problem

Classification error: when \(x\) belongs to class \(C_i\) but is assigned to class \(C_j \neq C_i\) by the classifier.

Classifiers

Bayesian classifier

A Bayesian classifier frames the problem as a minimization of the probability of classification error:

\(P_e = \sum_{k=1}^m P(f(X) = C_k, Y \neq C_k)\)

This is done by computing (based on the training set) \(P(C_k | X)\) for all \(C_k\), the a posteriori class probabilities for \(X\), and selecting the \(C_k\) such that \(P(C_k | X)\) is maximum. The a posteriori class probabilities can be computed using Bayes’ theorem:

\(P(C_k | X) = \frac{P(C_k) P(X | C_k)}{P(X)}\)

Bayesian classifier

\(P(C_k | X) = \frac{P(C_k) P(X | C_k)}{P(X)}\)

\(P(C_k)\) is the a priori probability of \(C_k\), which can be estimated as the % of cases in the training set that belong to that class.

\(P(X | C_k)\) is the class density (probability of observing \(X\) in \(C_k\)).

\(P(X)\) is the probability of observing \(X\) independently of the class: it is therefore the density of the point cloud in feature space.

Bayesian classifier

\(P(C_k | X) = \frac{P(C_k) P(X | C_k)}{P(X)}\)

Since \(P(X)\) is completely independent from the classification function, maximizing \(P(C_k | X)\) only requires maximizing \(P(C_k) P(X | C_k)\).

How do we practically compute \(P(C_k | X)\)?

Bayesian classifier

Example: Gaussian Naive Bayes

Estimated Gaussian parameters for \(P(X | C_k)\):

\(\mu_0 = (0.59, 0.58), \sigma_0 = (0.05, 0.05)\)
\(\mu_1 = (0.44, 0.35), \sigma_1 = (0.09, 0.09)\)
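As an illustration (not the original data), we can sample synthetic points from two such Gaussians and let scikit-learn’s `GaussianNB` re-estimate the parameters:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Sample synthetic data from the two estimated Gaussians above
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.59, 0.58], scale=[0.05, 0.05], size=(200, 2))
X1 = rng.normal(loc=[0.44, 0.35], scale=[0.09, 0.09], size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# GaussianNB estimates P(C_k) and a per-feature Gaussian P(X | C_k)
clf = GaussianNB().fit(X, y)
pred = clf.predict([[0.59, 0.58], [0.44, 0.35]])  # the two class centres
```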

Discriminant analysis

Similar idea to PCA, but instead of finding the directions of overall maximum variance in the point cloud, we search for the direction which maximizes the separation between the classes.

The directions can then be used for dimensionality reduction (like PCA), but discriminant analysis can also be formulated as a method of classification.

Linear Discriminant analysis

Using the zero-centered point cloud (as with PCA), the first discriminant factor is the direction such that inter-class variance is maximum.

Let \(B\) be the weighted variance-covariance matrix of the class centroids \(g_k\): \(B_{p \times p} = \frac{1}{n}\sum_{k=1}^mn_kg_kg_k^T\), and \(S\) be the variance-covariance matrix of the (zero-centered) point cloud…

The discriminant factors will be the eigenvectors of \(S^{-1}B\).

Note that there will be at most \(m-1\) non-zero eigenvalues. If we have two classes, we have one discriminant factor, which is the axis of projection that maximizes inter-class separability.
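A sketch of this computation on synthetic two-class data; with \(m = 2\) classes, the second eigenvalue of \(S^{-1}B\) should come out (numerically) zero:

```python
import numpy as np

# Synthetic two-class data in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([2, 1], 0.5, size=(100, 2))])
X = X - X.mean(axis=0)                 # zero-centre the point cloud
n = len(X)

S = X.T @ X / n                        # variance-covariance of the point cloud
g = [X[:100].mean(axis=0),             # class centroids (centred coordinates)
     X[100:].mean(axis=0)]
B = sum(100 * np.outer(gk, gk) for gk in g) / n   # weighted between-class matrix

eigval, eigvec = np.linalg.eig(np.linalg.inv(S) @ B)
eigval = np.sort(eigval.real)[::-1]    # sorted discriminant eigenvalues
```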

Linear Discriminant analysis

Decision (hyper-)plane (or here, line): perpendicular to discriminant factor, at the mid-point between the class centroids (in factorial space).

\(\rightarrow\) equivalence to Bayesian form seen previously, with added linearity constraint.

Quadratic Discriminant analysis

Removing the linear constraint gives us Quadratic Discriminant analysis, which if the inputs are independent (i.e. the variance-covariance matrix is diagonal) is equivalent to the Gaussian Naive Bayes classifier.

Logistic regression

Logistic regression frames the classification as a regression problem followed by a thresholding. Let’s look at the binary classification case, where the two available classes are \(\{0, 1\}\), and with a single variable \(x\):

First, we fit a logistic function \(p(x_i) = \frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\) to find the best parameters \(\beta_0, \beta_1\) based on the training data. (\(x_i\) is the value of \(x\) for the \(i\)-th case)

Once fitted, the value of \(p(x_i)\) on any given point can be interpreted as the probability of class 1.

Logistic regression

Generalisation to multiple variables: \(p(x_i) = \frac{1}{1+e^{-(\beta_0+\sum_{j=1}^p \beta_jx_{ij})}}\)

Fit = solve an optimization problem \(\rightarrow\) find \(\{\beta_j\}\) to minimize \(\sum_{i=1}^n(-y_i\log(p(x_i))-(1-y_i)\log(1-p(x_i))) + R(\beta)\), where \(R(\beta)\) is a regularization term.
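A sketch of a logistic-regression fit with scikit-learn, on toy 1-D data (note that scikit-learn’s `C` parameter is the inverse of the regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D problem: class 1 becomes likelier as x grows
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (x[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

# C is the inverse of the regularization strength (L2 penalty by default)
clf = LogisticRegression(C=1.0).fit(x, y)
p = clf.predict_proba([[2.0]])[0, 1]   # fitted P(class 1 | x = 2)
```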

Regularization

Regularization is a common strategy to limit the risk of overfitting. A very common way to do regularization is to add a term to a function to optimize (“loss function”, “cost function”, etc.) that penalizes complexity in the model.

Typical strategies:

Logistic regression

LDA

Logistic regression

Logistic regression is a more direct method to find linear separation, and will tend to produce the best (linear) separation (in the Bayesian sense). LDA will work better if the class distributions are Gaussian, have similar variance-covariance matrices, and we have a good estimate of those variance-covariance matrices (i.e. enough representative data!).

Optimization

Many classification algorithms, such as logistic regression, rely on solving an optimization problem.

These are usually solved using an iterative approach (similar to what we’ve seen with the K-Means clustering):

  1. Set initial conditions \(\beta_{(0)}\)
  2. Evaluate cost function \(L(\beta_{(t)}, p, x, y)\)
  3. Compute update step towards local minimum/maximum \(\beta_{(t+1)} = \beta_{(t)} + \delta(L)\), where \(\delta(L)\) is an update function.
  4. Repeat 2-3 until stopping criterion is met.

Optimization

  1. Compute update step towards local minimum/maximum \(\beta_{(t+1)} = \beta_{(t)} + \delta(L)\).

Very often, this step is based on the gradient of \(L\): \(\nabla_\beta L(\beta; p, x, y)\). Gradient descent methods use an update rule of the form:

\[\beta_{(t+1)} = \beta_{(t)} - \eta \nabla_\beta L(\beta; p, x, y)\]


Image shamelessly stolen from 3Blue1Brown, “Gradient descent, how neural networks learn”: https://www.youtube.com/watch?v=IHZwWFHWa-w
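The update loop above, applied to a simple convex cost \(L(\beta) = (\beta - 3)^2\) (a toy example):

```python
# Gradient descent on the toy convex cost L(beta) = (beta - 3)^2
def grad(beta):
    return 2 * (beta - 3)       # analytical gradient of L

beta, eta = 0.0, 0.1            # initial condition and learning rate
for _ in range(100):            # stopping criterion: fixed number of steps
    beta = beta - eta * grad(beta)   # gradient-descent update rule
```

With \(\eta = 0.1\) the error shrinks by a factor 0.8 per step, converging to the minimum at \(\beta = 3\); too large an \(\eta\) would make the iteration diverge.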

k-Nearest Neighbours

Nearest neighbour algorithm \(\rightarrow\) very likely to overfit. Regularization strategy: use \(k\) nearest neighbours and take the majority vote.

1-NN

5-NN
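A minimal k-NN implementation showing the effect illustrated above: a noisy training point fools 1-NN but not 5-NN (toy data assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# A noisy point (label 1 inside the class-0 cluster) fools 1-NN but not 5-NN
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                    [0.05, 0.05],                       # the noisy point
                    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
query = np.array([0.06, 0.05])
```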

k-Nearest Neighbours

Distance-based method \(\rightarrow\) requires choosing the distance metric! (cf. unsupervised methods).

But…

Multiclass extensions

Slightly confusing terminology:

Multiclass extensions

For multiple target variables, this is equivalent to fitting multiple independent models.

For multiple categories (\(m>2\)), several strategies:

Example: using the Iris dataset \(\rightarrow\) 3 different species of iris flowers (Setosa, Versicolor, Virginica), 4 features (Sepal/Petal length/width), here reduced to 2 with PCA.

Example with QDA

One-vs-All

Example with QDA

One-vs-One

Example with LDA

One-vs-All

May fail when one of the classes is not separable from “the rest”, but is separable from each other class separately!

Example with LDA

One-vs-One

Classification summary

Regression

Regression problem

Given a set of examples \(\mathcal{D} = \{(\vec{X_i}, \vec{Y_i})\}\), with \(\vec{Y_i}\) a vector of numerical variables, find a relationship \(\vec{Y} = f(\vec{X}, \vec{\theta})\) with \(\vec{\theta}\) a vector of parameters of the model, such that \(\mathcal{L}(\{\vec{Y_i}, f(\vec{X_i}, \vec{\theta})\})\) is minimized, with \(\mathcal{L}\) as loss function (or error function).

If \(\vec{Y_i}\) is a single numerical variable \(Y_i\) and \(\vec{X_i}\) is a single numerical variable \(X_i\), we have a simple regression problem.

If \(\vec{Y_i}\) is a single numerical variable \(Y_i\) and \(\vec{X_i}\) has multiple variables, we have a multiple regression problem.

If \(\vec{Y_i}\) has multiple variables \(\vec{Y_i} = (Y_{i0}, ... Y_{ik})\), we have a multivariate regression problem.

Regression problem

Often represented with one axis as the target, but that means we can only look at one independent variable at a time (if we look in 2D) \(\rightarrow\) other visualizations possible.

Linear regression

In a linear regression (simple or multiple) problem, we are trying to fit a linear model through the point cloud:

\(\hat{Y_i} = \theta_0 + \sum_{j=1}^p \theta_j X_{ij}\)

If there is no exact solution, we will in general have: \(Y_i = \theta_0 + \sum_{j=1}^p \theta_j X_{ij} + \epsilon_i\), with \(\epsilon_i\) the residual error of the model for the \(i\)-th case.

In matrix form, we have \(Y = X\theta + \epsilon\)

The Ordinary Least Squares method consists in optimizing \(\theta\) so that \(||X\theta - Y||^2\) is minimized on the dataset.

Linear regression

The Ordinary Least Squares method consists in optimizing \(\theta\) so that \(||X\theta - Y||^2\) is minimized on the dataset.

This minimization problem has an exact solution if \(X^TX\) is invertible:

\[\theta = (X^TX)^{-1}X^TY\]

with \(\theta = \begin{pmatrix}\theta_0 \\ ... \\ \theta_p\end{pmatrix}, X = \begin{pmatrix}1 & X_{11} & ... & X_{1p} \\ 1 & ... & ... & X_{2p}\\ 1 & ... & ... & ... \\ 1 & X_{n1} & ... & X_{np}\end{pmatrix}, Y = \begin{pmatrix}Y_1 \\ ... \\ Y_n\end{pmatrix}\)

import numpy as np

def ordinary_least_squares(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    X_ = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the intercept column
    # theta = (X^T X)^{-1} X^T Y; solving the system is more stable than inverting
    return np.linalg.solve(X_.T @ X_, X_.T @ Y)

Linear regression

Example: diabetes dataset, after standardization + PCA (computed on training set), taking first component vs target.

Linear regression

This minimization problem has an exact solution if \(X^TX\) is invertible:

Conditions:

Otherwise:

Linear regression

Example: Ridge regression \(\rightarrow\) uses ordinary least squares with L2 regularization \(\rightarrow\) minimizes \(||X\theta - Y||^2 + \alpha ||\theta||^2\)

Ridge Regression

Ridge regression \(\rightarrow\) uses ordinary least squares with L2 regularization \(\rightarrow\) minimizes:

\[||X\theta - Y||^2 + \alpha ||\theta||^2\]

Example on diabetes dataset:
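A sketch of this example with scikit-learn (the exact preprocessing of the slides’ figure is assumed; here only standardization is applied):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # standardization, as in the example

# Minimizes ||X theta - y||^2 + alpha ||theta||^2
reg = Ridge(alpha=1.0).fit(X, y)
r2 = reg.score(X, y)                    # coefficient of determination on the fit
```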

Ridge Classification

Similarly to logistic regression, we can use the Ridge regressor as a classifier:

Non-linear regression

Linear regression is very easy to compute and very easy to interpret, but we are obviously limited in the kind of functions that we can model. There are several options to introduce non-linearity.

Option 1: make strong assumptions about the shape of the data, and use optimization method to fit parametric model.

Example: polynomial model \(\rightarrow\) find \(\theta_i\) so that \(y = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_k x^k\)

Linear solver still works by treating \(x^i\) as separate variables.

Non-linear regression

Example: polynomial model \(\rightarrow\) find \(\theta_i\) so that \(y = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_k x^k\)

Linear solver still works by treating \(x^i\) as separate variables.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

poly = PolynomialFeatures(degree=5)
Xt = poly.fit_transform(X[:, np.newaxis])  # columns: 1, x, x^2, ..., x^5
reg = Ridge(alpha=1.)
reg.fit(Xt, Y)

Non-linear regression (or classification)

Option 2: find a transformation into a new feature space where linear regression (or classification) becomes possible.

This will often be a higher-dimensional feature space \(\rightarrow\) instead of the dimensionality reduction that we’ve done so far, we need to do dimensionality expansion!

Kernel trick

Option 2: find a transformation into a new feature space where linear regression (or classification) becomes possible.

Example with classification:

Transformation: \((x_1, x_2) \rightarrow (x_1, x_2, x_1^2-0.3 + x_2^2-0.6)\)

See also: Visually Explained on YouTube <https://www.youtube.com/watch?v=Q7vT0--5VII>

Kernel trick

In general: we can’t really know what the exact transformation is… but we don’t need to, thanks to the kernel trick.

Let \(F(x)\) be our transformation function, \(K(x, x') = F(x)F(x')^T\) is the corresponding Kernel function. For most linear models, the training (and prediction) process can be expressed in terms of \(K(x, x')\)! That means that we don’t actually need to formulate \(F(x)\), as long as we can formulate \(K(x, x')\).

Kernel trick

Example: polynomial kernel with degree 2. If \(x = (x_1, x_2)\) is two-dimensional, we have \(F(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)\), a projection into a 6-dimensional space.

Then, we have \(K(x, x') = F(x)F(x')^T = (1 + x^Tx')^2\)
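This identity can be checked numerically; note that the exact expansion of \((1 + x^Tx')^2\) puts \(\sqrt{2}\) factors on the cross terms of \(F(x)\):

```python
import numpy as np

def F(x):
    """Explicit degree-2 polynomial feature map (note the sqrt(2) factors)."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, xp):
    """Polynomial kernel of degree 2."""
    return (1 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
```

The kernel evaluation costs one dot product in the original 2-D space, regardless of the dimensionality of the expanded space: that is the point of the trick.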

KernelRidge regressor with polynomial kernel:

from sklearn.kernel_ridge import KernelRidge

reg = KernelRidge(alpha=1.,
                  kernel='poly')  # polynomial kernel (degree 3 by default)
reg.fit(X, Y)

Ensemble learning

A common technique to get complex decision functions with less risk of overfitting is ensemble learning.

Instead of training one very complex model on the whole training set…

… train several smaller models on parts of the dataset, hiding some data points and/or some features

For prediction: majority voting (classification) or averaging (regression).
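A random forest is a standard instance of this idea: many shallow trees, each trained on a bootstrap sample with random feature subsets, combined by majority vote (the dataset below is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 50 shallow trees, each fit on a bootstrap sample of the data, considering
# a random subset of features at each split; prediction = majority vote
clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```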

Towards deep learning

The core elements in machine learning

Through the different supervised and unsupervised techniques, we’ve seen the core elements of machine learning solutions:

The core elements in machine learning

Example:

Given a training dataset \(\mathcal{D}_{train} = \{(x_i, y_i)\}\) and a test dataset \(\mathcal{D}_{test} = \{(x_j, y_j)\}\):

The core elements in deep learning

A deep learning model will perform the feature extraction/transformation and the model optimization steps all at the same time. This is generally done using large neural networks.


Example: AlexNet, from Krizhevsky et al., 2012.

The core elements in deep learning

Deep learning models = succession of layers which often perform “almost-linear” transformations of the feature space.

Example: “dense layer” \(\rightarrow f(x, \theta) = ReLU(\sum_i x_i \theta_i)\), where \(ReLU\) is the Rectified Linear Unit \(ReLU(x) = x \text{ if } x>0, 0 \text{ otherwise}\).
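A dense layer as described above, in NumPy (a bias term is added here, as is common in practice):

```python
import numpy as np

def dense_relu(x, W, b):
    """One dense layer: linear map followed by the ReLU non-linearity."""
    return np.maximum(W @ x + b, 0)   # ReLU(z) = z if z > 0, 0 otherwise

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # maps a 3-d input to a 4-d output
x = rng.normal(size=3)
h = dense_relu(x, W, np.zeros(4))
```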

Successive linear transformations + small nonlinearities = completely non-linear relationship between the network inputs and outputs.

The core elements in deep learning

Layers can perform dimensionality reduction or dimensionality expansion, depending on their role in the network… or keep the dimensionality but change the semantics of the dimensions.

Example: convolutional layers in image networks, transforming for instance a \(128 \times 128 \times 3\) dimensional image input (dimensions = mostly independent localization) into a \(16 \times 16 \times 192\) dimensional feature space (dimensions = mostly independent characteristics).

Deep learning is machine learning is data analysis

Example: GPT evolution

Conclusions

Module 1

Module 2

Module 3

Module 4

Module 5