Module 4: The Transformer Architecture
The Architecture So Far
Let's trace what happens when we predict the next word. In every model, the final step is the same:
representation @ lm_head → vocabulary scores → softmax → prediction
The lm_head matrix converts a representation into scores for each word in the vocabulary. The question is: what representation are we feeding to lm_head?
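In code, that last step is just a matrix multiply followed by a softmax. Here's a rough NumPy sketch - the sizes and variable names are illustrative, not the course's actual code:

```python
import numpy as np

d_model, vocab_size = 16, 20                      # illustrative sizes for a toy model
representation = np.random.randn(d_model)         # whatever representation we feed in
lm_head = np.random.randn(d_model, vocab_size)    # projection from model space to vocabulary

scores = representation @ lm_head                 # one score (logit) per vocabulary word
probs = np.exp(scores - scores.max())             # softmax, shifted for numerical stability
probs /= probs.sum()
prediction = probs.argmax()                       # index of the most likely next word
```

Everything in this module is about improving what goes into `representation`; the projection itself never changes.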
Module 2: Raw Embeddings
In Module 2, we fed raw embeddings directly to lm_head:
Module 2: `"the cat sat" → embeddings → @ lm_head → scores` What lm_head receives: The embedding for "sat" Problem: No context - "sat" doesn't know about "the" or "cat"
Each word's embedding was converted to vocabulary scores independently. The word "sat" had no information about the words around it.
Module 3: Attention-Mixed Embeddings
In Module 3, attention mixed context into each position before lm_head:
Module 3:
"the cat sat" → embeddings → attention → @ lm_head → scores
What lm_head receives: "sat" mixed with context from "the" and "cat"
Improvement: Now "sat" knows it follows a noun phrase
This helped significantly - the model could see that "cat" is the subject and predict verbs more accurately. But there's still room for improvement.
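As a refresher, here is a minimal single-head causal self-attention sketch in NumPy. The weight names, sizes, and masking details are illustrative stand-ins, not the exact Module 3 code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, vocab_size = 3, 16, 20          # "the cat sat" → 3 positions, toy sizes
x = np.random.randn(seq_len, d_model)             # one embedding per token
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
att = Q @ K.T / np.sqrt(d_model)                  # how relevant each position is to each other
att[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -1e9   # no peeking at future tokens
mixed = softmax(att) @ V                          # "sat" is now blended with "the" and "cat"

lm_head = np.random.randn(d_model, vocab_size)
scores = mixed[-1] @ lm_head                      # scores for the word that follows "sat"
```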
What's Missing?
If you test the Module 3 model extensively, predictions are decent but not great. Two issues limit performance:
Unstable values: As representations flow through attention, their scale drifts - some values grow large while others shrink (the sketch below shows this). This inconsistency makes the model less reliable.
Information loss: Attention transforms the input completely. The original embedding information can get lost in the mixing process, especially once we stack multiple attention layers.
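To make the first issue concrete, here is a toy NumPy illustration - random weights, made-up sizes, not our model - of how magnitudes drift when representations are mixed repeatedly with nothing keeping them in check:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(16)                           # a single 16-dimensional representation
for layer in range(6):
    x = x @ np.random.randn(16, 16)               # repeated mixing, no normalization in between
    print(f"after layer {layer}: typical magnitude {np.abs(x).mean():.1f}")

# The magnitude explodes layer by layer; with differently scaled weights it would
# instead shrink toward zero. Either way, the values drift.
```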
These are practical engineering problems, not fundamental limitations. The solutions are straightforward.
The Missing Components
We'll add two components that stabilize the network:
Layer Normalization rescales the values at each position to a consistent scale (mean = 0, standard deviation = 1) before major operations. This prevents values from drifting too far.
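A minimal sketch of the normalization step (NumPy; real layer norm also applies a learned scale and shift, which this sketch omits):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to mean 0, std 1.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
    # Real layer norm also multiplies by a learned gain and adds a learned bias.
```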
Residual Connections add the original input back after each transformation: output = input + transform(input). This preserves information and creates "highways" for data to flow through.
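In code, a residual connection is a single addition. A sketch, with `transform` standing in for attention or any other sub-layer:

```python
def with_residual(x, transform):
    # The original x passes through unchanged; transform only contributes a correction on top.
    return x + transform(x)

# Example: wrapping attention (or any sub-layer) this way preserves the original embedding.
# out = with_residual(embeddings, attention)
```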
Module 4 (our tiny model):
"the cat sat" → embeddings → attention + LN + residual → @ lm_head → scores
What lm_head receives: Stabilized, context-aware representation
Improvement: More reliable predictions
What About Feed-Forward Networks?
Real transformers (GPT-2, GPT-3, etc.) also include a feed-forward network (FFN) with a non-linear activation.
The FFN enables complex pattern transformations like "if negation appears, flip sentiment." At scale, with large vocabularies and complex text, this matters.
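For reference, a sketch of the usual two-layer FFN shape, assuming the GELU activation used in GPT-style models (the dimensions and names here are illustrative):

```python
import numpy as np

def gelu(x):
    # Smooth non-linearity used by GPT-2 (tanh approximation).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # Expand to a wider hidden size, apply the non-linearity, project back down.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_hidden = 16, 64                        # GPT-2 uses d_hidden = 4 * d_model
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)
```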
For our tiny 20-word model, the FFN doesn't significantly improve predictions. The patterns are simple enough that attention + normalization + residuals handle them well.
We'll cover the FFN at the end of this module - it's part of the real transformer architecture, even if our toy model doesn't need it.
Module 4 Roadmap
Stabilizing the Network
- Layer Normalization - Keeping values in consistent range
- Residual Connections - Preserving information flow
Assembling Our Model
- The Transformer Block - Putting attention + LN + residuals together
Scaling to Real Transformers
- Why Attention Alone Isn't Enough - Limits of linear operations
- Activation Functions - How non-linearity enables conditional logic
- Feed-Forward Networks - What GPT adds for complex patterns
Generation
- Sampling Strategies - Converting probabilities to tokens
- Text Generation - Running your complete tiny GPT
The Complete Picture
- Complete Model Review - Tracing through the full architecture
- Connecting to the Paper - How our model maps to "Attention Is All You Need"
What You'll Build
By the end of Module 4, you'll have a working transformer that generates text. More importantly, you'll understand:
- Why each component exists (not just how it works)
- What breaks if you remove it
- How our tiny model relates to GPT-3's architecture
The architecture you build is the same pattern used by GPT-2, GPT-3, and GPT-4. The only differences are scale (more layers, bigger dimensions) and the FFN component, which matters at that scale.
Let's start by stabilizing our network with layer normalization.