Module 4: The Transformer Architecture
The Architecture So Far
Let's trace what happens when we predict the next word. In every model, the final step is the same:
representation @ lm_head → vocabulary scores → softmax → prediction
The lm_head matrix converts a representation into scores for each word in the vocabulary. The question is: what representation are we feeding to lm_head?
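In code, that last step is just a matrix multiply followed by a softmax. Here's a rough NumPy sketch - the sizes and variable names are illustrative, not the course's actual code:

```python
import numpy as np

d_model, vocab_size = 16, 20                      # illustrative sizes for a toy model
representation = np.random.randn(d_model)         # whatever representation we feed in
lm_head = np.random.randn(d_model, vocab_size)    # projection from model space to vocabulary

scores = representation @ lm_head                 # one score (logit) per vocabulary word
probs = np.exp(scores - scores.max())             # softmax, shifted for numerical stability
probs /= probs.sum()
prediction = probs.argmax()                       # index of the most likely next word
```

Everything in this module is about improving what goes into `representation`; the projection itself never changes.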
Module 2: Raw Embeddings
In Module 2, we fed raw embeddings directly to lm_head:
Module 2: `"the cat sat" → embeddings → @ lm_head → scores` What lm_head receives: The embedding for "sat" Problem: No context - "sat" doesn't know about "the" or "cat"
Each word's embedding was converted to vocabulary scores independently. The word "sat" had no information about the words around it.
Module 3: Attention-Mixed Embeddings
In Module 3, attention mixed context into each position before lm_head:
Module 3:
"the cat sat" → embeddings → attention → @ lm_head → scores
What lm_head receives: "sat" mixed with context from "the" and "cat"
Improvement: Now "sat" knows it follows a noun phrase
This helped significantly - the model could see that "cat" is the subject and predict verbs more accurately. But there's still room for improvement.
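As a refresher, here is a minimal single-head causal self-attention sketch in NumPy. The weight names, sizes, and masking details are illustrative stand-ins, not the exact Module 3 code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, vocab_size = 3, 16, 20          # "the cat sat" → 3 positions, toy sizes
x = np.random.randn(seq_len, d_model)             # one embedding per token
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
att = Q @ K.T / np.sqrt(d_model)                  # how relevant each position is to each other
att[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -1e9   # no peeking at future tokens
mixed = softmax(att) @ V                          # "sat" is now blended with "the" and "cat"

lm_head = np.random.randn(d_model, vocab_size)
scores = mixed[-1] @ lm_head                      # scores for the word that follows "sat"
```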
What's Missing?
If you test the Module 3 model extensively, predictions are decent but not great. Two issues limit performance:
Unstable values: As representations flow through attention, their scale drifts - some values grow large while others shrink (the sketch below shows this). This inconsistency makes the model less reliable.
Information loss: Attention transforms the input completely. The original embedding information can get lost in the mixing process, especially once we stack multiple attention layers.
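To make the first issue concrete, here is a toy NumPy illustration - random weights, made-up sizes, not our model - of how magnitudes drift when representations are mixed repeatedly with nothing keeping them in check:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(16)                           # a single 16-dimensional representation
for layer in range(6):
    x = x @ np.random.randn(16, 16)               # repeated mixing, no normalization in between
    print(f"after layer {layer}: typical magnitude {np.abs(x).mean():.1f}")

# The magnitude explodes layer by layer; with differently scaled weights it would
# instead shrink toward zero. Either way, the values drift.
```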
These are practical engineering problems, not fundamental limitations. The solutions are straightforward.
The Missing Components
We'll add two components that stabilize the network:
Layer Normalization rescales the values at each position to a consistent scale (mean = 0, standard deviation = 1) before major operations. This prevents values from drifting too far.
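A minimal sketch of the normalization step (NumPy; real layer norm also applies a learned scale and shift, which this sketch omits):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to mean 0, std 1.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
    # Real layer norm also multiplies by a learned gain and adds a learned bias.
```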
Residual Connections add the original input back after each transformation: output = input + transform(input). This preserves information and creates "highways" for data to flow through.
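In code, a residual connection is a single addition. A sketch, with `transform` standing in for attention or any other sub-layer:

```python
def with_residual(x, transform):
    # The original x passes through unchanged; transform only contributes a correction on top.
    return x + transform(x)

# Example: wrapping attention (or any sub-layer) this way preserves the original embedding.
# out = with_residual(embeddings, attention)
```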
Module 4 (our tiny model):
"the cat sat" → embeddings → attention + LN + residual → @ lm_head → scores
What lm_head receives: Stabilized, context-aware representation
Improvement: More reliable predictions
What About Feed-Forward Networks?
Real transformers (GPT-2, GPT-3, etc.) also include a feed-forward network (FFN) with a non-linear activation.
The FFN enables complex pattern transformations like "if negation appears, flip sentiment." At scale, with large vocabularies and complex text, this matters.
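For reference, a sketch of the usual two-layer FFN shape, assuming the GELU activation used in GPT-style models (the dimensions and names here are illustrative):

```python
import numpy as np

def gelu(x):
    # Smooth non-linearity used by GPT-2 (tanh approximation).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # Expand to a wider hidden size, apply the non-linearity, project back down.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_hidden = 16, 64                        # GPT-2 uses d_hidden = 4 * d_model
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)
```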
For our tiny 20-word model, the FFN doesn't significantly improve predictions. The patterns are simple enough that attention + normalization + residuals handle them well.
We'll cover the FFN at the end of this module - it's part of the real transformer architecture, even if our toy model doesn't need it.
Module 4 Roadmap
Stabilizing the Network
- Layer Normalization - Keeping values in consistent range
- Residual Connections - Preserving information flow
Assembling Our Model
- The Transformer Block - Putting attention + LN + residuals together
Scaling to Real Transformers
- Why Attention Alone Isn't Enough - Limits of linear operations
- Activation Functions - How non-linearity enables conditional logic
- Feed-Forward Networks - What GPT adds for complex patterns
Generation
- Sampling Strategies - Converting probabilities to tokens
- Text Generation - Running your complete tiny GPT
The Complete Picture
- Complete Model Review - Tracing through the full architecture
- Connecting to the Paper - How our model maps to "Attention Is All You Need"
What You'll Build
By the end of Module 4, you'll have a working transformer that generates text. More importantly, you'll understand:
- Why each component exists (not just how it works)
- What breaks if you remove it
- How our tiny model relates to GPT-3's architecture
The architecture you build is the same pattern used by GPT-2, GPT-3, and GPT-4. The only differences are scale (more layers, bigger dimensions) and the FFN component, which matters at that scale.
Let's start by stabilizing our network with layer normalization.