The Transformer Block: The Core Engine
In this section, we’ll open up the “decoder-only transformer” and look inside. The Transformer Block is the fundamental reasoning unit. Each block refines the model’s understanding through a two-step process:
- Gathering context – figuring out what’s relevant from surrounding tokens.
- Thinking about it – transforming that context into a deeper representation.
The Attention Mechanism (The 'Gathering' Step)
So how does a Transformer "weigh the importance" of other words? It uses a clever process called the Attention Mechanism.
The Old Way: Imagine a conveyor belt carrying books past you, one at a time. You can read the current book and take notes, but as the belt moves on, earlier books fade from memory. If a book at the beginning had a detail you need now, you can't easily go back. Early AI models worked like this, reading words in sequence, causing important details from the start of a sentence to get lost.
The Transformer Way (With Attention): Now, imagine you're a researcher standing in the center of a library, able to see the title of every book at once. You can instantly spot the most relevant books for your research, no matter where they are. This is what the Attention Mechanism does—it lets every word look at all other words in the sentence simultaneously.
To do this efficiently, the mechanism uses a system of three questions for each word, each represented as a vector.
- Query (Q): "What am I looking for?" This vector is the word's specific question to understand its own role.
- Key (K): "What do I have?" This vector is like the title on the spine of a book—a short description of what each word offers.
- Value (V): "What information do I actually provide if chosen?" This vector contains the word's meaningful content that gets passed along.
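The three vectors above are all produced from the same word vector by learned projection matrices. A minimal sketch with NumPy (the dimensions and the random weights here are toy stand-ins, not values from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8   # size of the word's current vector (toy value)
d_head = 4    # size of the Q/K/V vectors (toy value)

# Learned projection matrices; random stand-ins for trained weights.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

# One word's current vector (at the first block, its embedding).
embedding = rng.normal(size=(d_model,))

# Three purpose-built versions of the same vector.
query = embedding @ W_q   # "What am I looking for?"
key = embedding @ W_k     # "What do I have?"
value = embedding @ W_v   # "What do I actually provide if chosen?"

print(query.shape, key.shape, value.shape)  # (4,) (4,) (4,)
```

The key point is that one input vector yields three different outputs only because the three weight matrices are different; the matrices are what the model learns during training.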
So, is the Value vector just the word's embedding?
That's the right intuition! The Value, Key, and Query vectors are all created directly from the word's current vector (which, at the start, is its embedding).
Think of it like this: the model takes the original embedding and learns to create three purpose-built versions of it. It's like a single employee (the embedding) being asked to perform three roles in a meeting:
- Ask questions (the Query).
- State their area of expertise (the Key).
- Provide the detailed data when asked (the Value).
So, while the Value vector contains the core information from the embedding, it's a version that has been specifically prepared for the job of being shared with other words.
Let's see this in action. For the sentence: "The rocket finally landed on its destination." When the model processes the word "landed," it acts as a researcher:
- It generates its Query: a vector representing what the word is looking for. This Query can be thought of as the model asking, "What is the subject and location of this action?"
- It scans all the Keys: It looks at the "title" of every other word. The word "rocket" has a Key that says something like, "I am a noun, a physical object." This is a strong match for the Query. The word "finally" has a Key like, "I am an adverb," which is a weaker match.
- It pulls the most relevant Value: The attention scores act as weights. The score for "rocket" will be high, so the model places a heavy weight on the Value of "rocket." The model then combines these weighted Values to form a new, context-rich representation for "landed."
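The three steps above can be sketched numerically. In this toy example the Q/K/V vectors are random stand-ins (a trained model would produce them from the actual embeddings), but the mechanics of scoring, softmax weighting, and mixing the Values are the standard ones:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["The", "rocket", "finally", "landed", "on", "its", "destination"]
d = 4

# Toy Q/K/V vectors for every word (random stand-ins for projected embeddings).
Q = rng.normal(size=(len(words), d))
K = rng.normal(size=(len(words), d))
V = rng.normal(size=(len(words), d))

landed = words.index("landed")

# Steps 1-2: compare the Query of "landed" with every Key (scaled dot product).
scores = Q[landed] @ K.T / np.sqrt(d)

# Step 3: softmax turns raw scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# The new representation of "landed" is the weighted mix of all the Values.
new_landed = weights @ V

print(dict(zip(words, weights.round(3))))
```

With trained weights, the entry for "rocket" would dominate this dictionary; here the weights are random, but they still always sum to 1 and always blend every word's Value into the new vector.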
This process happens for every single word at the same time, creating a deep understanding of how all the words in a sentence relate to each other. This is why it's often called self-attention—the sentence is looking at itself to understand its own context.
Summary: The Three Questions of Attention (Query, Key, Value)
Think of attention like a library research system:
- Query (Q): "What am I looking for?" - Your research question
- Key (K): "What's available?" - The catalog/index of all books
- Value (V): "What information do I get?" - The actual content of the books
For each word:
- It asks a question (Query)
- Checks what all other words offer (Keys)
- Retrieves relevant information (Values)
- Combines these based on relevance scores
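The whole four-step loop runs for every word at once, which is why self-attention is usually written in matrix form. A minimal single-head sketch (toy sizes, random weights; a real block would add masking, multiple heads, and an output projection):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every row of X attends to every row of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # all Queries vs. all Keys
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # weighted mix of Values

rng = np.random.default_rng(2)
n_words, d_model, d_head = 7, 8, 4
X = rng.normal(size=(n_words, d_model))            # one vector per word
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (7, 4): one new context-rich vector per word
```

Because the score matrix compares every Query against every Key in one matrix multiply, no word has to wait for the "conveyor belt": each one sees the whole sentence simultaneously.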