Advanced Neural Networks

CNNs, Transformers, GenAI, and RL

CNNs for Computer Vision

The Convolution operation

A Kernel (Filter) slides over the input.
Performs element-wise multiplication and sum.
Produces a Feature Map.

Hyperparameter: Stride

Stride controls how the filter moves:

Stride = 1: Filter moves 1 pixel at a time.
Stride = 2: Filter jumps 2 pixels (downsamples the image).

Higher stride reduces the spatial dimension of the output.

Hyperparameter: Padding

Convolutions shrink the image size. To maintain dimensions, we use Padding.

Valid Padding: No padding (image shrinks).
Same Padding: Zero-padding around edges so Output Size = Input Size.

Calculating Output Dimensions

Given input $W$, filter size $F$, stride $S$, and padding $P$, the output size is:

$$ Output = \lfloor \frac{W - F + 2P}{S} \rfloor + 1 $$

Question

$$ Output = \lfloor \frac{W - F + 2P}{S} \rfloor + 1 $$

Input: 5x5 image
Filter: 3x3
Stride: 1
Padding: 0

Formula: $(5 - 3 + 0)/1 + 1$

Output: 3x3 Feature Map

Pooling Layers

Pooling reduces dimensions to reduce parameters and computation.

Max Pooling: Takes the maximum value in a window (most common).
Average Pooling: Takes the average value.

Pooling provides translation invariance (small shifts don't change the max).

Classic Architecture: VGG16

Uses only 3x3 convolutions stacked on top of each other.
Very deep (16 layers).
Concept: Small filters stacked deeply > One large filter.

The Problem with Depth

Why not just stack 100 layers?

Degradation Problem: As networks get deeper, accuracy gets worse (not just due to overfitting, but optimization difficulties).

Gradients vanish as they travel back through many layers.

Solution: Residual Networks (ResNet)

Introduced Skip Connections (or Shortcut Connections).

The input $x$ is added to the output of the layer: $F(x) + x$.
Allows gradients to flow directly through the network.

Transfer Learning

Rarely in production should you train from scratch.

Take a network trained on a huge dataset (e.g., ImageNet).
Freeze the early layers (feature extractors).
Replace the final classification layer.
Train only the new layer on your small dataset.

Data Augmentation

What if you don't have enough data?

Fabricate it.

Data Augmentation

What if you don't have enough data?

Fabricate it.

Random Rotations
Flips (Horizontal/Vertical)
Color Jitter (Brightness/Contrast)
Zoom/Crop

Makes the model robust to variations.

Sequence Models

Handling Time, Text, and Audio.

The Limitation of Feedforward Networks

Standard networks (MLPs/CNNs) expect a fixed input size.

But real life is sequential:

Text sentences (variable length)
Audio signals
Stock prices

Recurrent Neural Networks (RNNs)

RNNs process data one step at a time, maintaining an internal Hidden State.

$$ h_t = \tanh(W_x x_t + W_h h_{t-1} + b) $$

The state $h_t$ acts as the "memory" of the network.

Unrolling the RNN

We can visualize an RNN by "unrolling" it across time steps.

Same weights ($W_x, W_h$) are shared across all time steps.
Allows processing of arbitrary sequence lengths.

Problem: Vanishing Gradient in Time

When training via Backpropagation Through Time (BPTT):

Gradients must flow from the last time step to the first.

If the sequence is long, gradients are multiplied many times.
Gradients $\to 0$. The network forgets early inputs.

Solution: LSTM (Long Short-Term Memory)

LSTMs introduce a separate Cell State ($C_t$) that runs straight down the entire chain with only minor linear interactions.

It acts as a conveyor belt for information.

LSTM Gates

LSTMs use gates (sigmoid functions) to control information flow:

Forget Gate: What info to throw away from cell state?
Input Gate: What new info to store in cell state?
Output Gate: What to output based on cell state?

GRU (Gated Recurrent Unit)

A simplified version of LSTM.

Combines Forget and Input gates into an "Update Gate".
Computationally cheaper, often performs just as well.

Applications of Sequence Models

One-to-Many: Image Captioning (Image $\to$ Text)
Many-to-One: Sentiment Analysis (Text $\to$ Rating)
Many-to-Many: Machine Translation (English $\to$ French)

Transformers

Current State-of-the-Art

The Problem with RNNs

Sequential computation: Cannot parallelize training. (Slow!)
Long-term memory: Even LSTMs struggle with very long contexts (e.g., thousands of words).

Question: How can we look at the whole sentence at once?

Attention is All You Need (2017)

Google researchers proposed the Transformer architecture.

Key idea: Throw away recurrence. Use Self-Attention.

"The animal didn't cross the street because it was too wide."

Self-attention helps the model understand "it" refers to "street", not "animal".

Self-Attention Mechanism

Every word attends to every other word. For each token, we generate three vectors:

Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I pass along?

The Attention Formula

A "soft dictionary lookup":

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

$QK^T$: Similarity between Query and Key.
Softmax: Normalizes scores to probabilities.
$V$: Weighted sum of values.

Multi-Head Attention

Why have one attention mechanism when you can have many?

Multiple "heads" run in parallel.
One head might focus on grammar.
Another might focus on semantic relationships.
Another might focus on rhyme or tone.

Positional Encoding

Since Transformers process all words in parallel, they have no concept of "order".

We must add Positional Encodings (vectors representing position 1, 2, 3...) to the input embeddings so the model knows the word order.

The Transformer Block

A standard block consists of:

Multi-Head Attention
Add & Norm (Residual connection + Normalization)
Feed Forward Network
Add & Norm

The Transformer Block

A standard block consists of:

Multi-Head Attention
Add & Norm (Residual connection + Normalization)
Feed Forward Network
Add & Norm

Encoders vs Decoders

Encoder (e.g., BERT): Looks at the whole sentence (bidirectional). Good for understanding, classification, sentiment.

Decoder (e.g., GPT): Looks only at past words (autoregressive). Good for generation (predicting the next word).

Tokenization

Neural networks don't read text; they read numbers.

Tokens: Words or sub-words.
"The cat sat" $\to$ [101, 582, 993]
Modern tokenizers (Byte-Pair Encoding) break down rare words into chunks.

Why Tokenization?

Neural networks operate on vectors, not characters or words.
Directly mapping every word → integer would require:
- A huge vocabulary (millions of words).
- No way to handle misspellings or unseen terms.
Solution: Subword tokenization, balancing:
- Expressiveness
- Vocabulary size
- Robustness to unseen text

Tokenization Approaches

Character-level
- Tiny vocab, but very long sequences.
Word-level
- Simple, but huge vocab & poor handling of rare words.
Subword-level (BPE / Unigram)
- Modern standard.
- Keeps common words intact.
- Splits rare or complex words into chunks.
- Examples:
  - "unbelievable" → ["un", "believable"]
  - "electroencephalography" → many fragments

How Byte-Pair Encoding (BPE) Works

Start with a vocabulary of single characters.
Count all adjacent character pairs.
Merge the most frequent pair (e.g., "th").
Repeat tens of thousands of times.
End result:
- Whole common words
- Useful subwords
- Short fragments for rare words

BPE Example Breakdown

The string "tokenization" might tokenize into:
- ["token", "ization"]
- or ["tok", "en", "ization"]
The exact split depends on the model's learned merge rules.
Common patterns → long tokens
Rare patterns → short fragments

Tokenization Artefacts

BPE is frequency-driven, so weird patterns appear:

"solidgoldmagikarp" → single token
Space often merges with common words (e.g., " the")

Dataset effects:
- High-frequency internet slang → stable tokens ("lol")
- Scientific vocabulary → often fragmented
These quirks influence model behaviour and reasoning.

Why Tokenization Matters for Costs

Models bill per token, not per word.
Different languages produce different token counts:
- English: few tokens/word
- German/Finnish: compounds → many tokens
- Chinese/Japanese: ~1 char = 1 token
Long variable names, emojis, punctuation all inflate token usage.

Tokenization Affects Reasoning

Models reason over token sequences, not words.
If a word becomes many fragments, the concept becomes “longer” for the model:
- "neutrino" → multiple small tokens
- "dog" → typically a single token
Consequences:
- Misspellings break reasoning paths
- Long numbers are hard (many tokens)
- Non-Latin languages degrade in small models

BERT (Bidirectional Encoder Representations)

Masked Language Modeling: Hide 15% of words and ask the model to guess them.
"The [MASK] sat on the mat."
Allows the model to learn context from both left and right directions.

GPT (Generative Pre-trained Transformer)

Causal Language Modeling: Predict the next token based only on previous tokens.
Trained on massive amounts of internet text.
Few-shot learner: Can perform tasks with minimal examples.

Generative AI

Creating new data rather than classifying old data.

Discriminative vs Generative

Discriminative: $P(Y|X)$ - Given an image, is it a cat or dog?
Generative: $P(X)$ - Generate a plausible image of a cat.

Autoencoders

Unsupervised learning method.

Encoder: Compresses input into a small "Latent Space" (Bottleneck).
Decoder: Reconstructs input from the latent vector.

Forces network to learn efficient data representation.

Latent Space

The "Bottleneck" is not just compression; it's a map of concepts.

Nearby points in latent space $\to$ similar output images.
"King" - "Man" + "Woman" = "Queen" (Vector arithmetic).

Diffusion Models (The Modern Era)

Used by DALL-E, Stable Diffusion, MidJourney.

Idea: Slowly destroy an image by adding noise until it is random static. Then, learn to reverse the process.

Diffusion: Forward vs Reverse

Forward Process: Easy. Just add Gaussian noise.
Reverse Process: Hard. Train a U-Net (neural network) to predict the noise added at each step and subtract it.

Text-to-Image Conditioning

How do we control what image is generated?

We inject Text Embeddings (from a Transformer like CLIP) into the diffusion process.
The noise subtraction is guided by the text prompt.

Reinforcement Learning

Learning by trial and error.

The RL Loop

Agent: The learner (Neural Net).
Environment: The world (Game, Robot sim).
Action ($a$): What the agent does.
State ($s$): Current situation.
Reward ($r$): Feedback (+1 for goal, -1 for crash).

Goal: Maximize Return

The agent wants to maximize the Expected Cumulative Reward.

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... $$

$\gamma$ (Gamma) is the Discount Factor: Rewards now are better than rewards later.

Exploration vs Exploitation

The fundamental dilemma:

Exploitation: Do what you know yields reward.
Exploration: Try something random to see if there's a better payoff.

Epsilon-Greedy strategy: 10% of the time, act randomly.

Q-Learning

We want to learn a function $Q(s, a)$:

"How good is it to take action $a$ in state $s$?"

Deep Q-Networks (DQN)

For complex games (Atari, Doom), the state space is too big for a table.

Solution: Use a Neural Network to approximate the Q-function.

$$ \text{Input (Pixels)} \to \text{CNN} \to \text{Q-Values for actions} $$

Policy Gradients

Instead of learning Q-values, learn the Policy $\pi(s)$ directly.

The network outputs a probability distribution over actions.
If an action leads to a win, increase probability of that action.

Used in ChatGPT (RLHF - Reinforcement Learning from Human Feedback).

AlphaGo & AlphaZero

Combined:

Monte Carlo Tree Search: Planning ahead.
Deep RL: Evaluating board positions.

Result: Defeated world champions in Go, Chess, and Shogi.

Conclusion & Future

Current Challenges

Compute Cost: Training huge models costs millions.
Data Hunger: Running out of high-quality internet text.
Explainability: "Black Box" problem remains.
Bias & Safety: Models hallucinate and reflect training biases.

Summary

CNNs: Still the standard for Vision.
RNNs/LSTMs: Legacy approach.
Transformers: Current SOTA with Attention.
Diffusion: SOTA for image generation.
RL: Decision making agents.