
The Transformer: LLMs' Core Architecture

How Does Prediction Actually Work?

You learned that LLMs predict the next word by assigning probabilities to every word in their vocabulary. Type "The cat sat on the" and the model predicts "mat" with high probability. But how does the model go from text input to these probability scores?

The answer: the transformer architecture. This is the blueprint that defines how ChatGPT, Claude, and GPT-4 process text. The transformer is a specific design - a pipeline of mathematical operations that converts text into predictions.

Every major LLM uses variations of this architecture. The tiny GPT you'll build in this course uses the same transformer design as GPT-4, just smaller. Understanding this architecture means understanding how all modern LLMs work.

The Transformer Pipeline

The transformer processes text through a sequence of transformations. Each transformation refines the representation until the model can make accurate predictions.

[Diagram: the transformer pipeline - tokenization → embeddings → transformer blocks → prediction layer]

This pipeline runs every time you ask the model to predict the next word. The same architecture, the same sequence of steps.
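
To make the pipeline concrete, here is a minimal runnable sketch in PyTorch with toy sizes (the vocabulary size, dimensions, block count, and token IDs below are made up for illustration; real models are far larger). It uses PyTorch's built-in encoder-style block for brevity - a real GPT block also applies a causal mask so each token only attends to earlier ones - but the flow of stages is the same: token IDs in, next-word probabilities out.

```python
# Minimal end-to-end sketch of the pipeline with toy sizes (illustrative only).
import torch
import torch.nn as nn

vocab_size, d_model, n_blocks = 100, 16, 2            # toy sizes, not GPT-2's

embedding = nn.Embedding(vocab_size, d_model)          # token IDs -> vectors
blocks = nn.ModuleList([                               # stack of transformer blocks
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    for _ in range(n_blocks)
])
prediction_layer = nn.Linear(d_model, vocab_size)      # vectors -> vocabulary scores

token_ids = torch.tensor([[5, 42, 7]])                 # pretend "The cat sat" -> IDs
x = embedding(token_ids)                               # shape [1, 3, 16]
for block in blocks:
    x = block(x)                                       # attention + feed-forward
logits = prediction_layer(x[:, -1, :])                 # score every word in the vocab
probs = torch.softmax(logits, dim=-1)                  # probabilities summing to 1
print(probs.shape)                                     # torch.Size([1, 100])
```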

Breaking Down the Components

Each stage in the pipeline serves a specific purpose. Understanding what each component does helps you see why the architecture works.

Tokenization breaks text into pieces the model can process. "The cat" becomes two tokens. Each token gets a numeric ID. This converts text (which computers can't process directly) into numbers (which they can).
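
A toy illustration of this step, using a made-up vocabulary and simple whitespace splitting (real tokenizers like GPT-2's byte-pair encoding split words into subword pieces and use vocabularies of tens of thousands of entries):

```python
# Toy tokenizer: whitespace split plus a made-up vocabulary (illustrative only).
vocab = {"The": 0, "cat": 15, "sat": 7, "on": 3, "the": 4, "mat": 9}

def tokenize(text):
    return [vocab[word] for word in text.split()]

print(tokenize("The cat"))  # [0, 15] - two tokens, each a numeric ID
```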

Embeddings convert token IDs into vectors - lists of numbers that capture meaning. Token ID 15 ("cat") becomes a vector like [0.23, -0.51, 0.82, ...] with 768 dimensions. These numbers encode what the model learned about "cat" during training.
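
Under the hood, an embedding is just a lookup into a big table of learned numbers: one row per token ID. A minimal sketch with made-up sizes and random values (a trained model's table holds values learned from data, not random ones):

```python
# Embedding = row lookup in a learned table (random numbers stand in here).
import numpy as np

vocab_size, d_model = 50, 768                 # toy vocabulary, GPT-2-style width
embedding_table = np.random.randn(vocab_size, d_model)

cat_vector = embedding_table[15]              # token ID 15 ("cat") -> its vector
print(cat_vector.shape)                       # (768,) - 768 numbers encoding "cat"
print(cat_vector[:3])                         # first few of those 768 numbers
```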

Transformer blocks are where the real work happens. The model stacks multiple blocks (GPT-2 uses 12, GPT-3 uses 96). Each block contains two key operations, sketched in code after the list below:

  • Attention finds relationships between words. It answers: "Which words should influence each other?" When processing "cat," attention looks at "The" and figures out this is a specific cat, not cats in general.

  • Feed-forward networks transform each word's representation based on its context. After attention gathers relevant information, the feed-forward network processes that information to refine understanding.
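
Here is a stripped-down sketch of those two operations inside one block, using single-head attention and a small feed-forward network with made-up sizes (real blocks add multiple attention heads, layer normalization, and residual connections):

```python
# One simplified transformer block: single-head attention + feed-forward (toy sizes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_tokens, d_model = 3, 8                       # e.g. "The cat sat", 8-dim vectors
x = np.random.randn(n_tokens, d_model)         # stand-in for the token vectors

# Attention: every token asks "which other tokens matter to me?"
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = softmax(Q @ K.T / np.sqrt(d_model))   # [3, 3] weights between all token pairs
attended = scores @ V                          # each token's vector, mixed with context

# Feed-forward: transform each (now context-aware) vector independently.
W1, W2 = np.random.randn(d_model, 32), np.random.randn(32, d_model)
out = np.maximum(attended @ W1, 0) @ W2        # two layers with a ReLU in between
print(out.shape)                               # (3, 8): same shape, refined content
```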

Prediction layer converts the final representations into probability scores for every word in the vocabulary. The model computes: "Given everything I learned about this context, how likely is each possible next word?"
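
A small sketch of this final step with made-up sizes: a single linear projection turns the last token's vector into one score per vocabulary word, and a softmax turns those scores into probabilities you can rank.

```python
# Prediction layer: final vector -> one score per vocabulary word -> probabilities.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 16                      # toy sizes again
prediction_layer = nn.Linear(d_model, vocab_size)

final_vector = torch.randn(1, d_model)             # stand-in for the last token's vector
logits = prediction_layer(final_vector)            # [1, 100]: a raw score per word
probs = torch.softmax(logits, dim=-1)              # probabilities that sum to 1
top = torch.topk(probs, k=3)                       # the three most likely next tokens
print(top.values, top.indices)
```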

Data Flow Example

Follow a concrete example through the pipeline:

[Diagram: data flow example - "The cat sat on the" passing through tokenization, embeddings, the stack of transformer blocks, and the prediction layer]

The same vector passes through each block, getting refined at each stage. Early blocks learn simple patterns (parts of speech, basic grammar). Later blocks learn complex patterns (semantic relationships, long-range dependencies).

Why This Architecture Works

The transformer architecture has specific properties that make it effective for language.

Parallel processing. Unlike older approaches that processed words one at a time, transformers process all words simultaneously. The model looks at the entire sentence at once, computing relationships between every pair of words in parallel. This is faster and captures more context.
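
To see the "every pair at once" point concretely: with n tokens, a single matrix multiply produces an n × n table of relationship scores, with no loop over word pairs. A tiny sketch with made-up sizes:

```python
# All pairwise token relationships computed in one matrix multiply (toy sizes).
import numpy as np

n_tokens, d_model = 5, 8
Q = np.random.randn(n_tokens, d_model)    # one query vector per token
K = np.random.randn(n_tokens, d_model)    # one key vector per token

scores = Q @ K.T                          # one operation, no loop over pairs
print(scores.shape)                       # (5, 5): every token scored against every token
```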

Context awareness. Attention mechanisms let each word incorporate information from every other word. When processing "bank" in "river bank," attention looks at "river" and understands this means shoreline, not a financial institution. The same word gets different representations depending on context.

Stacking for depth. Multiple transformer blocks build hierarchical understanding. Block 1 might learn "The cat" is a noun phrase. Block 5 might learn the entire sentence describes a location. Block 10 might predict what locations commonly follow "cat sat on the." Each layer builds on previous layers.

Learned transformations. The model doesn't use hand-coded rules. Instead, it contains millions or billions of numbers (called parameters or weights) that get adjusted during training. These numbers encode everything the model knows about language - grammar, facts, reasoning patterns - all learned from examples.
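
To get a feel for where those numbers live, here is a quick count for the embedding table alone, using GPT-2's published sizes (a 50,257-token vocabulary and 768-dimensional vectors); the transformer blocks and prediction layer hold the rest of the model's parameters.

```python
# Rough parameter count for GPT-2's token-embedding table alone.
vocab_size = 50_257    # GPT-2's vocabulary size
d_model = 768          # dimensions per token vector (GPT-2 small)

embedding_params = vocab_size * d_model
print(f"{embedding_params:,}")   # 38,597,376 learned numbers, just for the embeddings
```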

The Architecture Is Universal

The transformer architecture isn't just for language. The same design powers:

Vision models process images by treating image patches like tokens. DALL-E and Stable Diffusion use transformer variants.

Code generation models like GitHub Copilot use transformers trained on source code instead of natural language.

Protein folding predictions use transformers to model amino acid sequences.

Audio models process sound by converting audio to token sequences.

The core idea - attention mechanisms that find relationships, feed-forward networks that transform representations, stacking for depth - applies across domains. Once you understand transformers for text, you understand the architecture powering most modern AI systems.

Next, you'll learn about the two modes this architecture operates in: training and inference. Training is how the model learns patterns from billions of examples. Inference is how it uses those learned patterns to make predictions when you use ChatGPT or Claude.
