All supervised ML training follows the same loop:
The goal is to find the parameters that give good predictions on unseen data.
The loss function quantifies how bad the model’s predictions are. Examples include mean squared error and cross entropy.
We want to minimize this loss function and hence make good predictions.
We update weights w by moving in the direction that reduces the loss.
\[ w \leftarrow w - \eta \, \nabla_w J(w) \]
Repeated updates → parameters that minimise the loss.
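The update rule above can be sketched in a few lines. This is a minimal illustration, not a production optimiser: the toy loss $J(w) = (w - 3)^2$, the learning rate and the number of steps are assumptions chosen for the example.

```python
def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Repeatedly apply w <- w - eta * grad(w) and return the final w."""
    w = w0
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w


# Hypothetical toy loss J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def grad_J(w):
    return 2.0 * (w - 3.0)


w_final = gradient_descent(grad_J, w0=0.0, eta=0.1, n_steps=100)
print(w_final)  # close to 3, the minimiser of the toy loss
```

With a learning rate that is too large (here, above 1.0) the same loop would diverge instead of converging.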
We can easily visualize the loss function in 2D, but an n-dimensional loss surface is harder to imagine!
In an ideal world as we update our model / take gradient descent steps the loss reduces.
However, if we keep training indefinitely with a sufficiently complex model, we could eventually fit every training example.
Is this a good fit of a decision boundary?
We need to reserve some data to test our model.
As we keep increasing model complexity eventually we perform worse on the test data.
How to split into train, validation and test?
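One possible answer is a random three-way split. The sketch below uses only the standard library; the 60/20/20 proportions and the function name `train_val_test_split` are assumptions for illustration, not a recommendation from the lecture.

```python
import random


def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of the data and cut it into train, validation and test lists."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The validation set is used to tune the model (for example its complexity), while the test set is touched only once, for the final performance estimate.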

In machine learning, the terms bias and variance help us understand the types of errors that can arise during model training and prediction.
Bias measures how well the model approximates the true relationship between features and the label. It is defined as the difference between the expected prediction of the model and the actual value we aim to predict.
High bias indicates an overly simplistic model that does not capture the complexity of the data well, leading to underfitting.
Variance, on the other hand, measures the sensitivity of the model to variations in the training data. A model with high variance pays too much attention to the training data and may not generalize well to unseen data, leading to overfitting.
The Bias-Variance Tradeoff describes the challenge of finding a balance between bias and variance to minimize the overall error of the model.
Model error can be decomposed as:
$$ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$

Bias is the error introduced by approximating a real-world problem, which may be complex, with a simplified model. Variance is the error introduced by excessive sensitivity to small fluctuations in the training data. The irreducible error is noise that cannot be eliminated, regardless of the model chosen.
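For squared-error loss the decomposition can be written out term by term. The symbols here are assumptions introduced for illustration: $f$ is the true relationship, $y = f(x) + \epsilon$ with noise of zero mean and variance $\sigma^2$, and $\hat f$ is the fitted model, which is random because it depends on the training sample drawn.

$$ \mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}} $$

The expectations are taken over training samples and noise, at a fixed input $x$.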
Regularisation (last week) introduces bias into the model but reduces variance:
Regularisation helps find a balance between bias and variance.







When tuning model complexity:
The goal of model training is to find the balance between bias and variance that minimizes the total error.
Even with the same overall accuracy, models can make different kinds of mistakes:
These two error types can have very different real-world consequences.
| | Actually positive | Actually negative |
|---|---|---|
| Model says positive | ✅ Correct: true positive (TP) | ❌ False positive (FP) |
| Model says negative | ❌ False negative (FN) | ✅ Correct: true negative (TN) |
In statistics: FP ≈ “Type I error”, FN ≈ “Type II error”.
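The four cells of the table can be counted directly from labels and predictions. This is a minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the toy data is made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1  # model says positive, actually positive
        elif p == 1 and t == 0:
            fp += 1  # model says positive, actually negative
        elif p == 0 and t == 1:
            fn += 1  # model says negative, actually positive
        else:
            tn += 1  # model says negative, actually negative
    return tp, fp, fn, tn


# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```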
Example: binary classifier for breast cancer screening.
Question: which error should the system try hardest to avoid?
In medical screening, we often prefer “too many alarms” (FP) over missed cases (FN).
We therefore need tools to explore this trade-off as we change the threshold.
Two cancer classifiers might have the same overall accuracy, but:
Which model would be better for cancer diagnosis?
Model A has few false negatives
Most classifiers output a score or probability, not just yes/no.
Suppose we use a classifier to decide whether a tumour is benign or malignant.
Small changes to the threshold can change who gets a scan, surgery, or is sent home.
| | true value is positive | true value is negative |
|---|---|---|
| predicted positive | true positive TP | false positive FP |
| predicted negative | false negative FN | true negative TN |
$\mbox{true positive rate}=\frac{TP}{TP+FN}$
and
$\mbox{false positive rate}=\frac{FP}{FP+TN}$
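Both rates depend on where the threshold is placed. The sketch below assumes the convention that a score at or above the threshold is predicted positive, with labels encoded as 1 and 0; the scores are invented for the example.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up labels and classifier scores for illustration.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

tpr_mid, fpr_mid = rates_at_threshold(y_true, scores, 0.5)
tpr_low, fpr_low = rates_at_threshold(y_true, scores, 0.2)
print(tpr_mid, fpr_mid)  # TPR = 2/3, FPR = 1/3
print(tpr_low, fpr_low)  # TPR = 1,   FPR = 2/3
```

Lowering the threshold catches every positive case in this toy data, at the cost of more false positives.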
We can see how well the prediction works by plotting the true value as a function of $z$ for each data point in the training sample:

The different categories (TP, FP, TN, FN) can be visualised on this plot:

If we are more worried about false negatives than about false positives, we can move the decision boundary to the left:

Of course this means more false positives...
If we are more worried about false positives than about false negatives, we can move the decision boundary to the right:

Of course this means more false negatives...
The curve describing this trade-off is the ROC curve (Receiver Operating Characteristic). It is the collection of (FP rate, TP rate) values for all values of the decision boundary.
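Collecting the (FP rate, TP rate) pairs can be sketched as a sweep over all distinct scores. This is an illustrative implementation under the assumption that scores at or above the threshold are predicted positive; the function name `roc_points` and the toy data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over all scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels (1 = positive, 0 = negative) and scores for illustration.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The curve always starts at (0, 0) and ends at (1, 1); a useful classifier bends towards the top-left corner in between.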

Move the threshold to the left:

Move the threshold to the right:

ROC curve axes:
AUC = area under the ROC curve.
AUC = 0.90 means “in 90% of positive–negative pairs, the model gives a higher score to the positive case”.
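The pairwise interpretation can be checked directly by counting pairs. This is a small sketch, assuming ties count as half a win (a common convention); it is quadratic in the number of samples, so it is meant for illustration rather than large datasets.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of positive-negative pairs in which the positive gets the higher score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 8 of the 9 pairs are ranked correctly
```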
We can plot multiple ROC curves on the same axes:
Each point on the ROC curve corresponds to a particular threshold:
The ROC curve does not tell us “the” correct threshold, but helps us choose one.
ROC curves are insensitive to class imbalance:
For very imbalanced data, Precision–Recall curves can be more informative.
Precision-Recall (PR) curve:
Focuses only on the positive class, useful when positives are rare (e.g. disease, fraud).
Same idea as ROC: sweep the threshold, plot (Recall, Precision).
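The sweep can be sketched as follows, using precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name `pr_points`, the threshold convention (score at or above the threshold is positive) and the imbalanced toy data are assumptions for illustration.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct threshold."""
    n_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        # tp + fp >= 1 because thr is itself one of the scores
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


# Made-up imbalanced data: 2 positives among 9 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pr = pr_points(y_true, scores)
for recall, precision in pr:
    print(round(recall, 3), round(precision, 3))
```

In this toy data a single high-scoring negative already halves the precision at the second threshold, which is the kind of effect a PR curve makes visible when positives are rare.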
ROC curves treat positives and negatives symmetrically.
But when positives are rare (e.g., cancer, fraud):
Precision-Recall curves focus on the positive class directly.
For more than two classes, ROC is usually extended via:
Still the same idea of comparing scores for positives vs negatives.
Support Vector Machines are a powerful and versatile Machine Learning tool used for both classification and regression tasks. They are particularly well-suited for problems where the data is high-dimensional and the number of features exceeds the number of samples.
We focus on a binary classification problem where the labels are \( y = +1 \) for positive class and \( y = -1 \) for negative class.
The goal of an SVM is to find the hyperplane that best separates the two classes by maximizing the margin, which is the minimal distance between the data points and the decision boundary.



For a linear model, the decision function is:
\[ z = w_0 + \vec{x} \cdot \vec{w} \]
The distance \( d \) from a point \( \vec{x} \) to the decision boundary \( z = 0 \) is proportional to \( z \):
\[ d = \frac{z}{\lVert \vec{w} \rVert} \]
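As a quick numerical check of the distance formula, here is a small sketch in plain Python. The weights and points are made-up values chosen so the arithmetic is easy to follow; the sign of the result tells which side of the boundary the point lies on.

```python
import math


def signed_distance(x, w, w0):
    """Signed distance d = z / ||w|| from point x to the boundary z = 0."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return z / norm_w


# Made-up weights: ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0
d_far = signed_distance([3.0, 4.0], w, w0)     # z = 9 + 16 - 5 = 20, so d = 4
d_origin = signed_distance([0.0, 0.0], w, w0)  # z = -5, so d = -1
print(d_far, d_origin)
```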

The SVM optimization problem aims to:
These two goals are in conflict and are balanced using optimization techniques.
\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 \quad \forall i \end{aligned} \]
The soft-margin formulation introduces slack variables \( \xi^{(i)} \) to allow margin violations:
\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 + C \sum_{i} \xi^{(i)} \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 - \xi^{(i)}, \quad \xi^{(i)} \geq 0 \quad \forall i \end{aligned} \]
The loss for the SVM also uses the hinge function, but offset such that we penalise values up to 1:
$$ J(w) = \frac{1}{2}\vec w\cdot \vec w + C \sum_i h_1( y_i \, p(x_i,w)) $$

where $p(x_i,w)$ is the model prediction $\vec x_i\cdot \vec w + w_0$ and $h_1$ is the shifted hinge function:
$$ h_1(x) = \max(0, 1- x).$$
$C$ is a model parameter controlling the trade-off between the width of the margin and the amount of margin violation.
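The loss can be evaluated directly for a given set of weights. This sketch only computes $J(w)$; it does not perform the minimisation. The data points, weights and values of $C$ are made up for illustration.

```python
def hinge_shifted(x):
    """Shifted hinge h_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)


def svm_loss(X, y, w, w0, C=1.0):
    """J(w) = 0.5 * w.w + C * sum_i h_1(y_i * (x_i.w + w0))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    violations = 0.0
    for xi, yi in zip(X, y):
        p = w0 + sum(a * b for a, b in zip(xi, w))
        violations += hinge_shifted(yi * p)
    return margin_term + C * violations


# Made-up data: the second point lies inside the margin (y * p = 0.5 < 1).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [+1, +1, -1]
w, w0 = [1.0, 0.0], 0.0

loss_c1 = svm_loss(X, y, w, w0, C=1.0)    # 0.5 + 1 * 0.5
loss_c10 = svm_loss(X, y, w, w0, C=10.0)  # 0.5 + 10 * 0.5
print(loss_c1, loss_c10)
```

Only the point inside the margin contributes to the sum, and a larger $C$ makes that violation more expensive relative to the margin term.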
Here we use the iris dataset again, with the features rescaled to have zero mean and unit standard deviation.



Here we use the cancer data set we used for previous lectures and exercises.



Adding data to the training set only affects the model if the additional point falls into the margin.
The model is completely defined by the data samples at the boundary or inside the margin (this is where the name comes from: these data samples are the "support" vectors).
Note: Unlike in the logistic regression case, there is no probabilistic interpretation for an SVM.
Overfitting occurs when a model learns the noise in the training data to the detriment of its performance on new data. Regularisation helps to mitigate overfitting by adding a complexity penalty to the loss function.

We modify the loss function to include a penalty term:
\[ J_{\text{pen}}(X, y, \vec{w}) = J(X, y, \vec{w}) + \lambda \cdot \text{Penalty}(\vec{w}) \]

Small values of \( \lambda \) mean weak regularisation, large values of \( \lambda \) mean strong regularisation.
We now look at a one-dimensional example. Suppose we have the relationship:
\[ y = 7 - 8x - \frac{1}{2} x^2 + \frac{1}{2} x^3 + \epsilon \]

We fit the data using polynomials of different orders \( k \):
\[ p_w(x) = \sum_{i=0}^{k} w_i x^i \]
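The loss function to minimise is the squared error summed over the data points (the Mean Squared Error, MSE, up to a constant factor of one over the number of data points):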
\[ J(x, y, \vec{w}) = \sum_{i} \left( p_w(x^{(i)}) - y^{(i)} \right)^2 \]

The second-order polynomial provides a reasonable fit but misses the \( x^3 \) term:

This results in higher bias but potentially lower variance.
The third-order polynomial fits the data well and recovers coefficients close to the true values:

| Term | True Coefficient | Estimated Coefficient |
|---|---|---|
| \( w_0 \) | 7 | Approximate value |
| \( w_1 \) | -8 | Approximate value |
| \( w_2 \) | -0.5 | Approximate value |
| \( w_3 \) | 0.5 | Approximate value |
The tenth-order polynomial overfits the data:

We modify the loss function to include the regularisation term:
\[ J_{\text{pen}}(x, y, \vec{w}, \lambda) = J(x, y, \vec{w}) + \lambda \sum_{i=0}^{k} w_i^2 \]
This is known as Ridge Regression (L2 Regularisation).
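The penalised loss has a closed-form minimiser, $\vec w = (A^\top A + \lambda I)^{-1} A^\top \vec y$, where $A$ is the matrix with entries $A_{ij} = (x^{(i)})^j$. The NumPy sketch below applies it to data generated from the cubic relationship above. The sampling range, noise level, random seed and the two values of $\lambda$ are assumptions, since the lecture does not specify them, and the function name `fit_ridge_poly` is hypothetical.

```python
import numpy as np


def fit_ridge_poly(x, y, k, lam):
    """Minimise sum_i (p_w(x_i) - y_i)^2 + lam * sum_j w_j^2 for a degree-k polynomial."""
    A = np.vander(x, k + 1, increasing=True)  # columns: x^0, x^1, ..., x^k
    return np.linalg.solve(A.T @ A + lam * np.eye(k + 1), A.T @ y)


# Assumed setup: 30 points on [-3, 3] with unit-variance Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 7 - 8 * x - 0.5 * x**2 + 0.5 * x**3 + rng.normal(0.0, 1.0, size=x.size)

w_weak = fit_ridge_poly(x, y, k=10, lam=1e-6)   # almost no regularisation
w_strong = fit_ridge_poly(x, y, k=10, lam=10.0)  # strong regularisation
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The coefficient vector shrinks as $\lambda$ grows. Note that, following the formula above, this sketch also penalises the constant term $w_0$; many implementations leave the intercept unpenalised.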
With regularisation, the tenth-order polynomial coefficients have smaller magnitudes:

The regularisation parameter \( \lambda \) controls the strength of the penalty:
We can use techniques like cross-validation to select the optimal \( \lambda \).
When data is limited, we use cross-validation to make efficient use of it.
K-Fold Cross-Validation:
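In K-fold cross-validation the data is divided into K folds; each fold serves once as the validation set while the remaining folds are used for training, and the K validation scores are averaged. A minimal sketch of the index bookkeeping, assuming contiguous folds without shuffling (the function name `k_fold_indices` is hypothetical):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

In practice the data would usually be shuffled first, and the model would be refitted on each `train_idx` and scored on the matching `val_idx`, for example once per candidate value of \( \lambda \).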
