The assignment will cover topics you have already seen in the notebooks.
This course also covered extra ML techniques to aid your learning, but these will NOT be assessed.
One would think that the more features one has to describe samples in a dataset, the better one would be able to perform a classification task. Unfortunately, as the number of features increases, so does the difficulty of fitting a multi-dimensional model.
This is generally referred to as the curse of dimensionality and we will see a few surprising effects that explain why more features can make life difficult.
We can ask the question: in the hypercube $-1\leq x_i\leq 1$, what fraction of the points lie no further than 1 from the centre?
This is equivalent to asking for the ratio of the volume of the unit "ball" to the volume of the smallest "cube" enclosing it.
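The ratio can be evaluated directly from the standard closed form for the volume of a $d$-dimensional unit ball, $\pi^{d/2}/\Gamma(d/2+1)$, divided by the cube volume $2^d$. The sketch below is only illustrative; the function name `ball_to_cube_ratio` is an assumption, not something from the notebooks.

```python
import math

def ball_to_cube_ratio(d: int) -> float:
    """Volume of the unit d-ball divided by the volume of the cube [-1, 1]^d."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # closed-form ball volume
    cube = 2.0 ** d                                     # side length 2 in every direction
    return ball / cube

for d in (1, 2, 3, 5, 10, 20):
    print(f"d = {d:2d}  ratio = {ball_to_cube_ratio(d):.3e}")
```

In two dimensions this reproduces the familiar $\pi/4$; the ratio then falls off very quickly as $d$ grows.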

In high dimensions most points are in "corners" rather than in the "centre".
Looking at the unit cube, we can calculate the average distance between any two points $x$ and $y$.
$$ d=\sqrt{\sum_i (x_i-y_i)^2} $$

The average distance increases with the dimension.
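A rough way to check this numerically is a Monte Carlo estimate: draw pairs of points uniformly from the unit cube and average their Euclidean distance. This is a minimal sketch using only the standard library; the function name, the number of pairs and the seed are arbitrary choices.

```python
import math
import random

def mean_pair_distance(d, n_pairs=2000, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between two
    points drawn uniformly from the unit cube [0, 1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        total += math.dist(x, y)
    return total / n_pairs

for d in (1, 2, 10, 100):
    print(f"d = {d:3d}  mean distance ~ {mean_pair_distance(d):.3f}")
```

For $d=1$ the exact mean is $1/3$. Each coordinate contributes $1/6$ to the mean squared distance, so the root-mean-square distance is exactly $\sqrt{d/6}$, and the mean distance follows the same growth.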
We can also plot the distribution of distances:

The likelihood of small distances drops as the dimension increases.
One interesting question to ask is how close to the edges the points are. To quantify this, we calculate the thickness $t$ of the outer layer of the unit cube that contains half the points, assuming the points are uniformly distributed.
The volume inside this layer is given by
$$ V_i = (1-2t)^d \qquad V_i=\frac12 \Rightarrow t = \frac{1-2^{-1/d}}{2}$$

In 35 dimensions, half of the points are in an outer layer only about 0.01 thick.
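The thickness formula is easy to evaluate for any dimension. The helper below is an illustrative sketch; its name and the optional `fraction` argument (a small generalisation of the "half the points" case) are assumptions.

```python
def shell_thickness(d, fraction=0.5):
    """Thickness t of the outer layer of the unit cube [0, 1]^d that holds
    `fraction` of the volume, from (1 - 2t)^d = 1 - fraction."""
    inner = 1.0 - fraction
    return (1.0 - inner ** (1.0 / d)) / 2.0

for d in (1, 2, 10, 35, 100):
    print(f"d = {d:3d}  t = {shell_thickness(d):.4f}")
```

In one dimension the layer is a quarter of the interval on each side; by 35 dimensions it has shrunk to roughly one hundredth of the side length.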
If we have many features, odds are that many are correlated. If there are strong relationships between features, we might not need all of them.
With principal component analysis (PCA) we want to extract the most relevant/independent combinations of features.
It is important to realise that PCA only looks at the features without looking at the labels; it is an example of unsupervised learning.
Correlated features:

Uncorrelated features:

The idea for PCA is to project the (standardised) data on a subspace with fewer dimensions.

Standardisation involves transforming the data so that each feature has a mean of 0 and a standard deviation of 1.
PCA is sensitive to the scale of the variables. Features with larger scales can dominate the principal components, skewing the results.
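As a minimal sketch of standardisation, assuming NumPy is available: each column is shifted by its mean and divided by its standard deviation. The function name `standardise`, the handling of constant columns, and the two example features (on deliberately different scales) are illustrative assumptions.

```python
import numpy as np

def standardise(X):
    """Shift and rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant features unscaled
    return (X - mean) / std

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
X = np.column_stack([rng.normal(1.7, 0.1, 500), rng.normal(300.0, 50.0, 500)])
Z = standardise(X)

print("variance before:", X.var(axis=0))  # second feature dominates
print("variance after: ", Z.var(axis=0))  # both features on an equal footing
```

Without this step the second feature would carry almost all of the variance, so the first principal component would simply point along it.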
Illustrative Example:
If we project onto the first component we get variance 1:

If we project onto the second component we also get variance 1:

But projecting onto a different direction gives a different variance, here larger than 1:

And here smaller than one:

Performing PCA gives a new basis in feature space that includes the directions of largest and smallest variance.
There is no guarantee that the most relevant features for a given classification task are going to have the largest variance.
If there is a strong linear relationship between features it will correspond to a component with a small variance, so dropping it will not lead to a large loss of variance but will reduce the dimensionality of the model.
The first step is to normalise and centre the features:
$$ x_i \rightarrow a x_i +b $$
such that
$$ \langle x_i\rangle = 0 \;,\qquad \langle x_i^2\rangle = 1 $$
The covariance matrix of the data is then given (up to an overall factor $1/n_d$, which does not change its eigenvectors) by
$$ \sigma = X^T X $$
where $X$ is the $n_d\times n_f$ data matrix of the $n_d$ training samples with $n_f$ features. The covariance matrix is an $n_f\times n_f$ matrix.
After centring and normalising the data, PCA identifies the directions along which the dataset varies the most. These directions are obtained by diagonalising the covariance matrix
$$ \sigma = X^{T}X . $$
The principal components are exactly the eigenvectors of this covariance matrix:
$$ \sigma\, v_j = \lambda_j\, v_j , $$
where each eigenvector $v_j$ defines a principal direction, and the corresponding eigenvalue $\lambda_j$ gives the variance of the data along that direction (up to the overall normalisation of $\sigma$).
Because the covariance matrix is symmetric, its eigenvectors can be chosen to form an orthonormal set. This means the principal components are pairwise orthogonal, and the features obtained by projecting onto them are uncorrelated.
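The diagonalisation can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the notebook's own code: the function name `pca_eig` and the synthetic two-feature dataset are made up for the example.

```python
import numpy as np

def pca_eig(X):
    """PCA of an (n_samples, n_features) array via the eigenvectors of X^T X.

    Returns the eigenvalues (largest first), the eigenvectors as columns,
    and the standardised data projected onto them."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centre and normalise
    sigma = Z.T @ Z                            # covariance up to a factor 1/n_samples
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs, Z @ eigvecs

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
X = np.column_stack([a, a + 0.3 * rng.normal(size=1000)])  # two correlated features
eigvals, V, proj = pca_eig(X)

print("variance along each direction:", eigvals / len(X))
print("V^T V (should be the identity):\n", np.round(V.T @ V, 6))
print("covariance of projections (should be diagonal):\n",
      np.round(proj.T @ proj / len(X), 6))
```

The last two printouts check the two properties stated above: the eigenvectors are orthonormal, and the projected features have no remaining correlation.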
When we only consider the first $k$ principal axes of a dataset, we lose some of the variance of the dataset.
Assuming the eigenvalues are ordered by decreasing size, we have
$$ \sigma_k \equiv \mathrm{Tr}(X_k^T X_k) = \sum_{j=1}^k \lambda_j $$
where $X_k$ is the data projected onto the first $k$ principal axes. $\sigma_k$ is the variance our reduced dataset retained from the original; it is often referred to as the explained variance.
In practice, one often considers the explained variance ratio, defined as
$$ \frac{\sigma_k}{\sum_{j=1}^{n_f} \lambda_j}, $$
which measures the fraction of the total variance (the sum of all eigenvalues) captured by the first $k$ principal components. This ratio is typically used to decide how many components are needed to retain a desired amount of information.
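The cumulative explained variance ratio follows directly from the sorted eigenvalues of $X^TX$. The sketch below assumes NumPy; the helper names, the 90% target and the synthetic dataset (five features built from two hidden factors plus a little noise) are illustrative assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Cumulative fraction of the total variance captured by the first
    1, 2, ... principal components of the standardised data."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(Z.T @ Z)[::-1]  # eigvalsh returns ascending order
    return np.cumsum(eigvals) / eigvals.sum()

def components_needed(X, target=0.9):
    """Smallest number of components whose cumulative ratio reaches `target`."""
    ratio = explained_variance_ratio(X)
    return min(int(np.searchsorted(ratio, target)) + 1, len(ratio))

rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 2))                 # two hidden factors
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0, -0.5, 1.5]])
X = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

print(np.round(explained_variance_ratio(X), 3))
print("components for 90% of the variance:", components_needed(X, 0.9))
```

Because the five features are almost exact linear combinations of two factors, nearly all of the variance sits in the first two components and the remaining three can be dropped at little cost.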
We consider a dataset of handwritten digits, each compressed to an 8x8 image:

These images live in a 64-dimensional feature space, but this is clearly far larger than the true dimension of the data.
PCA should help us limit our features to those that are likely to be relevant.
Performing PCA we can see how many eigenvectors are needed to reproduce a given fraction of the dataset variance via a cumulative scree plot:

We can keep 50% of the dataset variance with fewer than 10 features.
The eight most relevant eigenvectors are:

The least relevant eigenvectors are:

If we reduce the data to be 2-dimensional or 3-dimensional we can get a visualisation of the data.


The parameter $k$ can be used to control overfitting.
The $k$-nearest neighbors method is an instance-based learning algorithm.
Key Idea: Similar data points are likely to have similar target values.
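To make the key idea concrete, here is a minimal kNN classifier in plain Python: find the $k$ closest training points and take a majority vote. It is a sketch only; the function name, the toy points and their labels are hypothetical, and ties are broken in favour of the label encountered first among the nearest neighbours.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (0.1, 0.1), k=3))  # query inside the red cluster
print(knn_predict(points, labels, (1.0, 0.9), k=3))  # query inside the blue cluster
```

Note that there is no training step: the "model" is the stored data itself, and all the work happens at prediction time.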
Advantages:
Disadvantages:
The Minkowski distance generalises both the Euclidean ($p=2$) and Manhattan ($p=1$) distances:
$$d(x, y) = \left( \sum_{i=1}^n |x_i - y_i|^p \right)^{1/p}$$
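The formula translates almost literally into code. In this small sketch the two points are passed as sequences and the order of the distance as `p`; the function name `minkowski` is an illustrative choice.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p >= 1 between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0
```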
k = 1 (Too Small)

k = Large (Too Big)

Q: If we have a dataset with $N = 90$ points (50 red, 40 blue), and we set $k = N$, what happens?
A: The model will always predict the majority class (Red) for every single input.
The decision boundary disappears.
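This can be checked with a small experiment. The sketch below builds a hypothetical dataset of 50 red and 40 blue points (the cluster positions and the seed are arbitrary assumptions) and compares $k=N$ with a small $k$ at a few query locations.

```python
import math
import random
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical data: 50 red points around (0, 0), 40 blue points around (3, 3)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(40)]
labels = ["red"] * 50 + ["blue"] * 40

queries = [(0, 0), (3, 3), (10, 10), (-5, 4)]
print("k = N:", [knn_predict(points, labels, q, k=len(points)) for q in queries])
print("k = 5:", [knn_predict(points, labels, q, k=5) for q in queries])
```

With $k=N$ every query sees the same 90 neighbours, so the vote is always 50 to 40 regardless of where the query sits, even in the middle of the blue cluster.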
When $k=1$, kNN effectively divides the space into regions based on which point is closest.
These regions are called Voronoi cells.

Why did we learn these topics together?
kNN relies entirely on Distance.
As we saw in the "Curse of Dimensionality" section:
In high dimensions, "nearest" becomes meaningless. All points are roughly equidistant.
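One way to see this numerically is to compare the nearest and farthest neighbour of a query point. The sketch below measures the relative contrast (farthest minus nearest, divided by nearest) for uniformly random points; the function name, the number of points and the seed are illustrative assumptions.

```python
import math
import random

def relative_contrast(d, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point to
    n_points random points, all uniform in the unit cube [0, 1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}  relative contrast = {relative_contrast(d):.2f}")
```

In low dimensions the farthest point is many times further away than the nearest one. As $d$ grows the contrast shrinks towards zero, so the "nearest" neighbours are barely closer than any other point.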
The parameter $k$ can be used to control overfitting.
We can use the iris dataset:





We can use the 8x8 digits picture example after applying PCA to reduce it to 2 dimensions:




This creates a standard machine learning pipeline:
PCA rescues kNN from the curse.
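A pipeline of this kind could be sketched with scikit-learn as follows, assuming it is installed: standardise, project with PCA, then classify with kNN. The helper name `knn_accuracy`, the train/test split, the choice $k=5$ and the numbers of components tried are all illustrative assumptions rather than the settings used in the notebooks.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def knn_accuracy(n_components, k=5, seed=0):
    """Test accuracy of standardise -> PCA -> kNN on the 8x8 digits dataset."""
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = make_pipeline(
        StandardScaler(),                     # zero mean, unit variance per pixel
        PCA(n_components=n_components),       # keep the leading principal components
        KNeighborsClassifier(n_neighbors=k),  # vote among the k nearest neighbours
    )
    model.fit(X_train, y_train)               # scaler and PCA are fitted on training data only
    return model.score(X_test, y_test)

for n in (2, 10, 30):
    print(f"{n:2d} components: accuracy = {knn_accuracy(n):.3f}")
```

Wrapping the steps in a single pipeline means the scaling and the PCA basis are learned from the training set only and then applied unchanged to the test set, which avoids leaking test information into the model.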