Summative assignment

  1. Released 13th Jan 2026 at 2 pm and collected 2 pm 14th Jan 2026.
  2. The exam should take 3 hours, but you have 24 hours to complete it.
  3. Server load is heavy when everyone uses it at the same time; the 24-hour window eases this.
  4. The exam does not take 24 hours to complete; aim for 3 hours.
  5. The entire script should run within 5 minutes.
  6. No collusion and no GenAI allowed.

Summative assignment Potential Topics

The assignment will cover topics you've already seen in the notebooks:

  1. Data Cleaning and Feature Engineering.
  2. Logistic Regression
  3. Non-linear models
  4. Regularisation
  5. Bias–Variance Tradeoff
  6. Feed forward Neural Networks
  7. Dimensionality Reduction
  8. PCA and kNN (this week)

Summative assignment Non-Topics

This course also covered extra ML techniques to aid your learning, but these will NOT be assessed:

  1. CNNs
  2. Reinforcement Learning
  3. Transformers
  4. LSTMs
  5. RNNs
  6. GANs (Generative Adversarial Networks)
  7. Self-Supervised Learning
  8. Support Vector Machines

Using Pandas

Pandas uses DataFrames to represent data. It has many helper functions to read data, such as read_csv.

In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv('grades.csv')
df
Out[2]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed
0 John 86 57 45 32 true
1 Mary 13 36 24 53 false
2 Alice 90 67 87 31 true
3 Bob 78 76 68 89 true
4 Claire 54 32 21 11 false

Common operations on a DataFrame:

Drop a row (this returns a new DataFrame):

In [3]:
df.drop(2)
Out[3]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed
0 John 86 57 45 32 true
1 Mary 13 36 24 53 false
3 Bob 78 76 68 89 true
4 Claire 54 32 21 11 false

Drop several rows:

In [4]:
df.drop([2,3,4])
Out[4]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed
0 John 86 57 45 32 true
1 Mary 13 36 24 53 false

Drop a column:

In [5]:
df.drop('passed', axis=1)
Out[5]:
name Ex 1 Ex 2 Ex 3 Ex 4
0 John 86 57 45 32
1 Mary 13 36 24 53
2 Alice 90 67 87 31
3 Bob 78 76 68 89
4 Claire 54 32 21 11

Drop several columns:

In [6]:
df.drop(['Ex 2', 'Ex 3'], axis=1)
Out[6]:
name Ex 1 Ex 4 passed
0 John 86 32 true
1 Mary 13 53 false
2 Alice 90 31 true
3 Bob 78 89 true
4 Claire 54 11 false
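
The drop calls above can also be written with the index= and columns= keywords, which some readers find easier to read than axis=1. The following is a minimal sketch; the inline table is a hypothetical in-memory copy of grades.csv, built here only so the example runs on its own.

```python
import pandas as pd

# Hypothetical in-memory copy of the grades table (the notes read it from grades.csv).
grades = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Bob", "Claire"],
    "Ex 1": [86, 13, 90, 78, 54],
    "Ex 2": [57, 36, 67, 76, 32],
    "Ex 3": [45, 24, 87, 68, 21],
    "Ex 4": [32, 53, 31, 89, 11],
    "passed": ["true", "false", "true", "true", "false"],
})

# Equivalent to grades.drop([2, 3, 4])
without_rows = grades.drop(index=[2, 3, 4])

# Equivalent to grades.drop(['Ex 2', 'Ex 3'], axis=1)
without_cols = grades.drop(columns=["Ex 2", "Ex 3"])

print(without_rows.shape)             # rows left after dropping three of five
print(without_cols.columns.tolist())  # remaining column names
print(grades.shape)                   # the original table is untouched
```

Both calls return new DataFrames, so grades keeps all five rows and six columns.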

Transform columns:

In [7]:
df['passed'].apply(lambda x: x=='true')
Out[7]:
0     True
1    False
2     True
3     True
4    False
Name: passed, dtype: bool

The DataFrame is not modified: apply returns a new Series.

In [8]:
df
Out[8]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed
0 John 86 57 45 32 true
1 Mary 13 36 24 53 false
2 Alice 90 67 87 31 true
3 Bob 78 76 68 89 true
4 Claire 54 32 21 11 false

To modify it, assign the transformed column back to itself:

In [9]:
df['passed'] = df['passed'].apply(lambda x: x=='true')
df
Out[9]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed
0 John 86 57 45 32 True
1 Mary 13 36 24 53 False
2 Alice 90 67 87 31 True
3 Bob 78 76 68 89 True
4 Claire 54 32 21 11 False

Create new columns:

In [10]:
df['average'] = (df['Ex 1'] + df['Ex 2'] + df['Ex 3'] + df['Ex 4']) / 4
df
Out[10]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed average
0 John 86 57 45 32 True 55.00
1 Mary 13 36 24 53 False 31.50
2 Alice 90 67 87 31 True 68.75
3 Bob 78 76 68 89 True 77.75
4 Claire 54 32 21 11 False 29.50
In [11]:
df['mention'] = df['average'] > 70
df
Out[11]:
name Ex 1 Ex 2 Ex 3 Ex 4 passed average mention
0 John 86 57 45 32 True 55.00 False
1 Mary 13 36 24 53 False 31.50 False
2 Alice 90 67 87 31 True 68.75 False
3 Bob 78 76 68 89 True 77.75 True
4 Claire 54 32 21 11 False 29.50 False
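
The same column transformations can be written without a lambda or a hand-written sum. The following is a sketch of one possible alternative, not the approach used in the cells above; the three-row table is a hypothetical subset of the grades data, and exercise_columns is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical in-memory copy of the first three rows of the grades table.
grades = pd.DataFrame({
    "name": ["John", "Mary", "Alice"],
    "Ex 1": [86, 13, 90],
    "Ex 2": [57, 36, 67],
    "Ex 3": [45, 24, 87],
    "Ex 4": [32, 53, 31],
    "passed": ["true", "false", "true"],
})

exercise_columns = ["Ex 1", "Ex 2", "Ex 3", "Ex 4"]

# Element-wise comparison on the whole column, no lambda needed.
grades["passed"] = grades["passed"] == "true"

# Row-wise mean over the selected columns (axis=1 means "across columns").
grades["average"] = grades[exercise_columns].mean(axis=1)

print(grades[["name", "passed", "average"]])
```

Using mean(axis=1) avoids having to update the divisor by hand if an exercise column is added later.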

DataFrames can be used as input to sklearn functions.

In [12]:
from sklearn.preprocessing import StandardScaler
sScaler = StandardScaler()
In [13]:
firstTerm = df[['Ex 1', 'Ex 2']]
sScaler.fit(firstTerm)
Out[13]:
StandardScaler()
In [14]:
sScaler.transform(firstTerm)
Out[14]:
array([[ 0.76533169,  0.19834601],
       [-1.79747626, -1.02673227],
       [ 0.90575952,  0.78171661],
       [ 0.48447602,  1.30675016],
       [-0.35809097, -1.26008051]])
In [15]:
from sklearn.neighbors import KNeighborsClassifier
In [16]:
kn = KNeighborsClassifier(n_neighbors=2)
kn.fit(df[['Ex 1','Ex 2', 'Ex 3', 'Ex 4']],df['passed'])
Out[16]:
KNeighborsClassifier(n_neighbors=2)
In [17]:
kn.predict([
    [10,10,10,10],
    [50,60,70,80]
])
Out[17]:
array([False,  True])
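
To see why the classifier gives these answers, it can help to look at which training rows are the nearest neighbours of each query point. The sketch below uses the kneighbors method for this; the inline table is a hypothetical in-memory copy of the grades data, and the queries are wrapped in a DataFrame so that the column names match those used in fit.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

features = ["Ex 1", "Ex 2", "Ex 3", "Ex 4"]

# Hypothetical in-memory copy of the grades table, with 'passed' already boolean.
grades = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Bob", "Claire"],
    "Ex 1": [86, 13, 90, 78, 54],
    "Ex 2": [57, 36, 67, 76, 32],
    "Ex 3": [45, 24, 87, 68, 21],
    "Ex 4": [32, 53, 31, 89, 11],
    "passed": [True, False, True, True, False],
})

kn = KNeighborsClassifier(n_neighbors=2)
kn.fit(grades[features], grades["passed"])

# Same two query points as in the cell above.
queries = pd.DataFrame([[10, 10, 10, 10], [50, 60, 70, 80]], columns=features)

# Row positions (and Euclidean distances) of the 2 nearest training samples.
distances, indices = kn.kneighbors(queries)
predictions = kn.predict(queries)

for row, pred in zip(indices, predictions):
    print(grades.loc[row, "name"].tolist(), "->", pred)
```

The first query is closest to Mary and Claire, who both failed, and the second is closest to Bob and John, who both passed, which matches the predictions above.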

Using scikit-learn

scikit-learn and pandas are the common tools for data science in Python.

In [1]:
import sklearn
import numpy as np
import matplotlib.pyplot as plt

Scikit-learn

sklearn has many of the tools needed to set up a data analysis pipeline:

  • preprocessors
  • models
  • model selection

Preprocessors

Preprocessors include:

  • StandardScaler: shifts and scales the data so that each feature has mean 0 and standard deviation 1.
  • Normalizer: rescales each data sample (row) so that its feature vector has unit length
  • MinMaxScaler: shifts and scales the data so it fits in a given interval
  • OneHotEncoder: transforms class labels to a one-hot encoded matrix of 0 or 1 values
  • PolynomialFeatures: Creates polynomial features
  • ...
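
As a rough illustration of the preprocessors listed above, the sketch below applies each one to a tiny array. The arrays X and labels are made-up values chosen only so the results are easy to check by eye.

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, Normalizer, OneHotEncoder,
                                   PolynomialFeatures, StandardScaler)

# Made-up data: 3 samples, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 12.0]])

# Each column ends up with mean 0 and standard deviation 1.
standardised = StandardScaler().fit_transform(X)

# Each row is divided by its own Euclidean length.
unit_rows = Normalizer().fit_transform(X)

# Each column is mapped onto the interval [0, 1].
in_range = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Columns: 1, a, b, a^2, a*b, b^2 for features a and b.
poly = PolynomialFeatures(degree=2).fit_transform(X)

# One column per distinct label; the encoder returns a sparse matrix by default.
labels = np.array([["cat"], ["dog"], ["cat"]])
one_hot = OneHotEncoder().fit_transform(labels).toarray()

print(standardised.mean(axis=0), standardised.std(axis=0))
print(unit_rows)
print(in_range)
print(poly)
print(one_hot)
```

Note that StandardScaler and MinMaxScaler work column by column, while Normalizer works row by row.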

Models

in sklearn.linear_model:

  • LogisticRegression: the logistic regression classifier discussed in Lecture 2.
  • Ridge: the ridge regression discussed in Lecture 4
  • Perceptron: the perceptron model discussed in Lecture 1

in sklearn.neural_network:

  • MLPClassifier: the multi-layer perceptron, the 'classic' neural network discussed in Lectures 5 and 6.

in sklearn.neighbors:

  • KNeighborsClassifier: the $k$-nearest-neighbours classifier discussed in Lecture 7.

in sklearn.svm:

  • SVC: the support vector classifier discussed in Lecture 3.

Interface

The preprocessors and models in sklearn share a common interface:

  • fit(): fits to the data to set the model/preprocessor parameters
  • transform(): transforms the input data and returns the transformed data (preprocessors)
  • fit_transform(): fits and then transforms the same data in one call

Models have common functions:

  • predict(X): make a prediction for new data X
  • score(X,y): gives the score for data X and targets y
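
The model side of this interface can be sketched with any classifier. The example below uses LogisticRegression on made-up, linearly separable data; the data, the seed and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: the label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)                                   # learn the parameters

pred = model.predict([[2.0, 2.0], [-2.0, -2.0]])  # predictions for new data
acc = model.score(X, y)                           # mean accuracy for a classifier

print(pred, acc)
```

For classifiers, score returns the mean accuracy; for regressors such as Ridge it returns the coefficient of determination $R^2$.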

fit example

In [2]:
from sklearn.preprocessing import StandardScaler
stdScaler = StandardScaler()
randomData = np.random.normal(2,3,size=(1000,1) )
stdScaler.fit(randomData)
Out[2]:
StandardScaler()

After the fit, the standard scaler has learned the mean and standard deviation of the dataset:

In [3]:
stdScaler.mean_, stdScaler.scale_
Out[3]:
(array([1.9907856]), array([2.92175702]))

It can now apply the same transformation to unseen data:

In [4]:
stdScaler.transform([
    [2],
    [5],
    [-1]
])
Out[4]:
array([[ 0.00315372],
       [ 1.02993315],
       [-1.02362571]])
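
What transform does here can be checked by hand: it subtracts the learned mean_ and divides by the learned scale_. The sketch below verifies this and also shows that fit_transform gives the same result as fit followed by transform. The random seed is arbitrary, so the fitted numbers will differ from those printed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
data = rng.normal(2, 3, size=(1000, 1))
new_points = np.array([[2.0], [5.0], [-1.0]])

scaler = StandardScaler().fit(data)

# transform applies (x - mean) / standard deviation with the fitted values.
by_transform = scaler.transform(new_points)
by_hand = (new_points - scaler.mean_) / scaler.scale_

# fit_transform(data) is shorthand for fit(data) followed by transform(data).
one_call = StandardScaler().fit_transform(data)
two_calls = scaler.transform(data)

print(np.allclose(by_transform, by_hand))
print(np.allclose(one_call, two_calls))
```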

Tools

in model_selection:

  • learning_curve: can be used to produce learning curves.
  • train_test_split: can be used to split a given dataset into a training and a validation sample.
  • GridSearchCV: can be used to scan over a grid of parameter values using cross-validation.
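
Of these tools, train_test_split is not demonstrated in the cells that follow, so here is a minimal sketch. The arrays are made-up placeholders, and the 25% validation fraction and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each, and one target per sample.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 25% of the samples for validation; random_state fixes the shuffle.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape)
print(sorted(y_val.tolist()))
```

Rows of X and entries of y are shuffled together, so each sample keeps its own target.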

Model selection with GridSearchCV

We start with the same dataset as in one of the exercises:

In [17]:
def fn(x):
    return 7 - 8*x - 0.5*x**2 + 0.5*x**3
  
n_train = 100
np.random.seed(1122)
xs = np.linspace(0, 5)
rxs = 5 * np.random.random(n_train)
X1D = np.array([rxs]).T
ys1D = fn(rxs) + np.random.normal(size = (n_train) )
In [18]:
plt.plot(xs, fn(xs), 'b--')
plt.plot(rxs, ys1D, 'ok')
plt.xlabel('x');
plt.ylabel('y');
In [19]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=8)
X_train = polynomial_features.fit_transform(X1D)

alpha_values = np.logspace(-4, 4, 100)
parameters = {'alpha': alpha_values}
r = Ridge()
Rsearch = GridSearchCV(r, parameters, cv=5)
Rsearch.fit(X_train, ys1D);

Our grid search has trained a ridge regression for each value of $\alpha$ and performed a 5-fold cross-validation, so for each value we have access to an average score and an uncertainty estimate (the standard deviation across the folds).

In [8]:
Rsearch.cv_results_.keys()
Out[8]:
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

We can now plot the score as a function of $\alpha$:

In [9]:
scores = Rsearch.cv_results_['mean_test_score']
scores_std = Rsearch.cv_results_['std_test_score']
plt.fill_between(alpha_values, scores - scores_std,
                 scores + scores_std, alpha=0.1, color="g")
plt.plot(alpha_values, scores)
plt.xscale('log')
plt.xlabel(r'Regularisation parameter $\alpha$')
plt.ylabel('Average score');
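
The best value of $\alpha$ can be read off the same arrays, or taken directly from the best_params_ and best_score_ attributes of the search. The sketch below mirrors the setup above but is self-contained: it regenerates similar data with its own random generator and uses a coarser grid of 20 values to keep it quick, so the numbers will not match the plot exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

def fn(x):
    return 7 - 8*x - 0.5*x**2 + 0.5*x**3

# Regenerated data in the same spirit as the cell above (different random draws).
rng = np.random.default_rng(1122)
x = 5 * rng.random(100)
y = fn(x) + rng.normal(size=100)
X_poly = PolynomialFeatures(degree=8).fit_transform(x.reshape(-1, 1))

alpha_values = np.logspace(-4, 4, 20)   # coarser grid than in the notes
search = GridSearchCV(Ridge(), {"alpha": alpha_values}, cv=5)
search.fit(X_poly, y)

# By hand: position of the highest mean cross-validation score.
mean_scores = search.cv_results_["mean_test_score"]
best_index = int(np.argmax(mean_scores))
print(alpha_values[best_index], mean_scores[best_index])

# The same information, as stored by GridSearchCV.
print(search.best_params_["alpha"], search.best_score_)
```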

We can access the best model using best_estimator_:

In [10]:
xval = np.arange(0,5.1,0.1).reshape(-1, 1)
pxval = polynomial_features.transform(xval)
ypred = Rsearch.best_estimator_.predict(pxval)

plt.plot(rxs, ys1D,'ok')  
plt.plot(xval, ypred , color='r')

plt.xlabel('x')
plt.ylabel('y');

Pipelines

You may have noticed that we had to remember all the steps of the training to make the model's prediction for the preceding plot. This is awkward and error-prone.

We can use Pipeline to combine all the steps of an analysis in one object.

In [11]:
from sklearn.pipeline import Pipeline

analysis_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=8)), 
    ('ridge', Ridge())
])

This pipeline can be used like a normal model; for example, we can use it in a grid search:

In [12]:
degrees = [5,6,7]
parameters = {
    'ridge__alpha': alpha_values, 
    'poly__degree': degrees
}
Psearch = GridSearchCV(analysis_pipeline, parameters, cv=5)
Psearch.fit(X1D, ys1D);

Notice how the parameters of a specific step are addressed as the step name, two underscores, then the parameter name (for example ridge__alpha)!
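
The same double-underscore names work outside a grid search, through set_params and get_params. A minimal sketch, with arbitrary parameter values chosen for illustration:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=8)),
    ('ridge', Ridge())
])

# '<step name>__<parameter name>' reaches into the individual steps.
pipeline.set_params(poly__degree=3, ridge__alpha=0.5)

print(pipeline.named_steps['poly'].degree)     # parameter of the 'poly' step
print(pipeline.get_params()['ridge__alpha'])   # parameter of the 'ridge' step
```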

We can plot the scores for each polynomial order:

In [13]:
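n_alpha = len(alpha_values)
# cv_results_ lists all alpha values for the first degree, then the second, and so on
for j in range(len(degrees)):
    block = slice(j * n_alpha, (j + 1) * n_alpha)
    scores = Psearch.cv_results_['mean_test_score'][block]
    scores_std = Psearch.cv_results_['std_test_score'][block]
    plt.fill_between(alpha_values, scores - scores_std,
                     scores + scores_std, alpha=0.1, label="n={}".format(degrees[j]))
    plt.plot(alpha_values, scores)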
plt.xscale('log')
plt.legend()
plt.xlabel(r'Regularisation parameter $\alpha$')
plt.ylabel('Test score');
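
Slicing cv_results_ by position relies on the order in which the grid is scanned. One possible alternative, sketched below, is to select the scores by looking at the parameter settings stored in cv_results_['params']. The example is self-contained and uses a smaller, made-up grid (three degrees, five values of $\alpha$) and regenerated data, so it is an illustration rather than a reproduction of the search above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerated cubic data with noise (different random draws than in the notes).
rng = np.random.default_rng(1122)
x = 5 * rng.random(100)
y = 7 - 8*x - 0.5*x**2 + 0.5*x**3 + rng.normal(size=100)

pipeline = Pipeline([("poly", PolynomialFeatures()), ("ridge", Ridge())])
alpha_values = np.logspace(-2, 2, 5)   # small grid, for illustration only
degrees = [2, 3, 4]
search = GridSearchCV(
    pipeline,
    {"ridge__alpha": alpha_values, "poly__degree": degrees},
    cv=5)
search.fit(x.reshape(-1, 1), y)

results = search.cv_results_
scores_by_degree = {}
for degree in degrees:
    # Keep only the grid points whose setting used this polynomial degree.
    mask = np.array([p["poly__degree"] == degree for p in results["params"]])
    scores_by_degree[degree] = results["mean_test_score"][mask]
    print(degree, scores_by_degree[degree].round(3))
```

Each entry of scores_by_degree holds one score per value of $\alpha$ and can be plotted against alpha_values as in the cell above.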

And plot the best estimator's prediction:

In [14]:
xval = np.arange(0,5.1,0.1).reshape(-1, 1)
ypred = Psearch.best_estimator_.predict(xval)

plt.plot(rxs, ys1D,'ok')  
plt.plot(xval, ypred , color='r')

plt.xlabel('x')
plt.ylabel('y');

Notice how we did not need to explicitly perform all the steps.
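
To make this concrete, the sketch below fits a small pipeline and checks that its predict gives the same numbers as applying the fitted steps one after the other. The data, the degree and the value of alpha are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up cubic data with noise.
rng = np.random.default_rng(0)
x = 5 * rng.random(100)
y = 7 - 8*x - 0.5*x**2 + 0.5*x**3 + rng.normal(size=100)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("ridge", Ridge(alpha=1.0)),
])
pipeline.fit(x.reshape(-1, 1), y)

x_new = np.arange(0, 5.1, 0.5).reshape(-1, 1)

# One call: the pipeline applies every step in order.
via_pipeline = pipeline.predict(x_new)

# The same thing spelled out step by step with the fitted components.
poly_step = pipeline.named_steps["poly"]
ridge_step = pipeline.named_steps["ridge"]
by_hand = ridge_step.predict(poly_step.transform(x_new))

print(np.allclose(via_pipeline, by_hand))
```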

We can also use this estimator in other tools such as learning_curve:

In [15]:
from sklearn.model_selection import learning_curve

train_sizes = np.linspace(.1, 1.0, 10)

train_sizes, train_scores, test_scores = learning_curve(
    Psearch.best_estimator_, X1D, ys1D, cv=5, train_sizes=train_sizes)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
In [16]:
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.ylim(0,1); plt.grid(); plt.legend(loc="best");