Module 3: Attention Mechanism

Where We Are in the Transformer

Recall the transformer block structure from Module 1:

Transformer Block Structure

Module 2 skipped the transformer blocks entirely - we went straight from embeddings to the LM head. Now we add the first component inside the block: attention.

Building Toward Tiny GPT: Adding Context

In Module 2, you built a simple next-word predictor using matrix multiplication and softmax. It works, but it has a critical flaw: it treats every word independently.

Consider predicting after "the" in these two sequences from our 20-word vocabulary:

  • "the big cat sat on the" → should predict "big" (continuing "big mat")
  • "the small dog ran to the" → should predict "small" (continuing "small house")

Same input word "the", but the Module 2 predictor produces identical probabilities for both cases. During training, the model learned patterns like "big cat sat on the big mat" and "small dog ran to the small house". To predict correctly, it needs to connect "the" with earlier size words ("big" or "small"). But the predictor processes "the" in isolation, producing the same output regardless of what appeared earlier.

The same problem appears with ambiguous words. Consider "bank":

  • "I deposited money at the bank" - financial institution
  • "The cat sat on the river bank" - edge of a river

Without attention, "bank" gets the same representation in both cases. The model can't adjust based on "money" versus "river".

Processing each word in isolation can't solve this. In Module 2, you multiply a single embedding by a weight matrix and get a result. That operation doesn't see other words in the sequence. It can't compare "the" with earlier words like "big" or "small" to adjust its prediction.
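To make the flaw concrete, here is a minimal sketch of an isolated predictor in the style of Module 2. The vocabulary, dimensions, and random weights are hypothetical stand-ins, not the actual Module 2 code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: small vocab, 8-dim embeddings (illustrative only).
vocab = ["the", "big", "small", "cat", "dog", "mat", "house", "sat", "ran", "on", "to"]
d_model = 8
embeddings = rng.normal(size=(len(vocab), d_model))
W = rng.normal(size=(d_model, len(vocab)))  # LM-head weight matrix

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_isolated(word):
    # The predictor sees only one embedding; earlier words never enter.
    return softmax(embeddings[vocab.index(word)] @ W)

# "the" after "big cat sat on" vs. after "small dog ran to":
p_after_big = predict_isolated("the")
p_after_small = predict_isolated("the")
print(np.allclose(p_after_big, p_after_small))  # True: identical output, context ignored
```

Whatever came before "the", the function receives the same embedding and must return the same distribution.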

Attention fixes this by comparing words to each other. It uses matrix multiplication differently - instead of transforming one word in isolation, it computes relationships between all words in the sequence. This lets the model measure which words are related and should influence each other. By the end of this module, you'll upgrade your Module 2 predictor to consider context, enabling it to make predictions based on the patterns it learned from training data.
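The "computes relationships between all words" idea can be sketched in one line. Here the four words and their embedding values are made-up examples; the point is that a single matrix multiplication compares every word against every other word:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 4-word sequence with 8-dim embeddings (illustrative values).
words = ["the", "big", "cat", "sat"]
X = rng.normal(size=(len(words), 8))  # one embedding row per word

# One matmul produces every pairwise relationship at once:
scores = X @ X.T  # scores[i, j] = dot-product similarity of word i and word j
print(scores.shape)  # (4, 4)
```

Contrast this with the Module 2 predictor, which only ever multiplies a single row by a weight matrix.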

Attention Context Problem

Why This Matters

Attention is the core innovation that made modern LLMs possible. Before attention, models struggled with long-range dependencies and context. They processed words sequentially, forgetting earlier context by the time they reached later words.

Attention changed everything. It lets every word directly compare itself to every other word, no matter how far apart they are in the sentence. The model can connect "bank" at position 5 with "money" at position 2, instantly recognizing the financial context.

This mechanism enables the capabilities you see in ChatGPT:

  • Understanding pronoun references: "The cat sat on the mat because it was tired" - attention connects "it" to "cat"
  • Following long conversations: The model attends to relevant earlier statements
  • Capturing relationships: Subject-verb agreement, cause and effect, semantic connections

More importantly, attention is parallelizable. Unlike sequential processing, you can compute all attention scores simultaneously. This makes training on massive datasets practical.
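The parallelism claim is easy to verify: computing scores one pair at a time in a loop gives exactly the same matrix as one matrix multiplication, which hardware can execute all at once. The sizes below are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d_k = 6, 4  # toy sequence length and head dimension
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))

# Sequential: one score at a time, position by position.
loop_scores = np.zeros((L, L))
for i in range(L):
    for j in range(L):
        loop_scores[i, j] = Q[i] @ K[j]

# Parallel: every score in a single matrix multiplication.
matmul_scores = Q @ K.T

print(np.allclose(loop_scores, matmul_scores))  # True
```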

The Path Ahead

This module builds attention step by step:

Weighted Context - Why averaging embeddings isn't enough, and how weighted sums let the model focus on relevant words.

Query and Key (Single Word) - How one embedding transforms into two vectors: a query ("what am I looking for?") and a key ("how should others find me?").

Q and K Matrices - Stack queries and keys for a whole sentence. See how Q × K^T computes all pairwise attention scores at once.

Attention Dimensions - Understand d_model, d_k, and sequence length (L). See why the L×L score matrix creates context window limits in real LLMs.

Scores to Weights - Convert raw scores to proper attention weights using scaling (divide by √d_k), causal masking (don't look at future), and softmax.

Values: What We Mix - Attention weights tell us "how much" to look at each word. Values tell us "what" we get from looking. Complete the Q, K, V picture.

Multi-Head Attention - Run multiple attention patterns in parallel. Different heads learn different relationships - grammar, semantics, position.

Positional Encoding - Attention doesn't know word order by default. Positional encodings fix this by adding position information to embeddings.

Building a Context-Aware Predictor - Put everything together into a working predictor that uses attention to understand context.

Each article builds on the previous one. By the end, you'll understand how attention works and implement it yourself.
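As a preview of where the module ends up, here is a compact sketch of the full mechanism the articles above build piece by piece: projections to Q, K, and V, scaling by √d_k, causal masking, softmax, and mixing values. All dimensions and weights are assumed toy values, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(3)
L, d_model, d_k = 5, 8, 4  # toy sequence length and dimensions (assumptions)

X = rng.normal(size=(L, d_model))       # input embeddings, one row per word
W_q = rng.normal(size=(d_model, d_k))   # learned projection matrices
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)                       # scale by sqrt(d_k)
mask = np.triu(np.ones((L, L), dtype=bool), k=1)
scores[mask] = -np.inf                                # causal mask: no looking at the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax

output = weights @ V                                  # mix values by attention weight
print(output.shape)  # (5, 4): one context-aware vector per word
```

Each row of `weights` sums to 1 and is zero above the diagonal, so every word's output is a weighted mix of itself and earlier words only.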
