
Predicting the Next Word

From House Prices to Words

Recall the housing price predictor: we had 4 inputs (size, bedrooms, location, age) and predicted 3 outputs (price, probability of selling, days on market). Each output needed its own set of weights, so we used a 3×4 matrix: 3 rows (one per output) and 4 columns (one per input).
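As a concrete sketch of that predictor, here is the 3×4 matrix-vector multiply in NumPy. The weight values are made up for illustration; only the shapes matter:

```python
import numpy as np

# Hypothetical housing predictor: 4 inputs -> 3 outputs.
# Weight values are invented for illustration, not learned.
W = np.array([
    [120.0, 5000.0, 2000.0, -300.0],   # row 1: price weights
    [  0.1,    0.02,   0.05,  -0.01],  # row 2: probability-of-selling weights
    [ -0.5,    1.0,   -2.0,    0.3 ],  # row 3: days-on-market weights
])  # shape (3, 4): one row per output, one column per input

x = np.array([1500.0, 3.0, 8.0, 12.0])  # size, bedrooms, location score, age

outputs = W @ x   # shape (3,): price, prob. of selling, days on market
print(outputs.shape)  # (3,)
```

Each row of `W` dotted with the input produces one of the three outputs.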

Language prediction works the same way. Instead of predicting 3 housing values, we predict 20 scores—one for each word in our vocabulary. Given the word "cat", we need to score how likely each vocabulary word is to come next: "sat" gets a high score, "ran" gets a high score, "blue" gets a low score.

[Figure: Predictor Matrix Comparison]

The Vocabulary Matrix

Our vocabulary has 20 words: the, cat, dog, sat, ran, on, mat, house, a, big, small, quickly, slowly, and, is, red, blue, to, PAD, END.

We need 20 outputs—one score per word. The transformation matrix has 20 rows (one per vocabulary word) and 64 columns (matching the embedding dimension):

[Figure: Vocabulary Matrix]

When we multiply this matrix by a 64-dimensional word embedding, we get 20 scores:

[Figure: Matrix Multiply Dimensions]
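The same multiply, sketched in NumPy. Real models learn the matrix entries; random values stand in here, and the "cat" embedding is a random stand-in as well:

```python
import numpy as np

# The 20-word toy vocabulary from the text.
VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "house", "a", "big",
         "small", "quickly", "slowly", "and", "is", "red", "blue", "to",
         "PAD", "END"]

rng = np.random.default_rng(0)

# Vocabulary matrix: 20 rows (one per word) x 64 columns (embedding dim).
W_vocab = rng.normal(size=(len(VOCAB), 64))

# Stand-in 64-dimensional embedding for "cat".
cat_embedding = rng.normal(size=(64,))

scores = W_vocab @ cat_embedding   # (20, 64) @ (64,) -> (20,)
print(scores.shape)                # one score per vocabulary word
```

The output has exactly as many entries as the vocabulary has words, because each row of the matrix produces one score.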

What Changes, What Stays the Same

The matrix dimensions never change: always 20 rows (one per vocabulary word), always 64 columns (matching embedding size). The input is always a 64-dimensional vector. The output is always 20 scores.

What changes between predictions is the content of the input vector—which 64 numbers we feed in—not the size:

[Figure: Same Matrix Different Inputs]
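A minimal sketch of this idea, with random stand-in weights and embeddings: one fixed matrix, two different 64-dimensional inputs, two different sets of 20 scores:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 64))      # fixed: reused for every prediction

emb_cat = rng.normal(size=(64,))   # input vector for one prediction
emb_dog = rng.normal(size=(64,))   # same size, different content

scores_cat = W @ emb_cat           # (20,)
scores_dog = W @ emb_dog           # (20,): same shape, different values
```

Nothing about the matrix or the shapes changes between the two calls; only the numbers inside the input vector do.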

The Architecture Pattern

This pattern appears throughout language models:

  • Input: Always a fixed-size vector (64 dimensions in our example)
  • Transform: Matrix multiplication with learned weights
  • Output: Scores for all possible next words

The input size never changes. What can change is the content of that input vector. In the simple predictor we've built, the input is just one word's embedding. Later, we'll see how to create better input vectors that contain information from multiple words, but they'll still be 64 dimensions, just with different numbers inside.

The key insight: prediction always works the same way. Feed a 64-dimensional vector into the matrix, get 20 scores out, pick the highest. The challenge is creating the best possible input vector to feed in.
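That whole pipeline fits in a few lines. This sketch uses a random embedding table and random weights (both would be learned in a real model), so the predicted word is arbitrary, but the structure is exactly as described: feed a 64-dimensional vector in, get 20 scores out, pick the highest:

```python
import numpy as np

VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "house", "a", "big",
         "small", "quickly", "slowly", "and", "is", "red", "blue", "to",
         "PAD", "END"]

rng = np.random.default_rng(0)
emb_table = rng.normal(size=(len(VOCAB), 64))  # one 64-dim embedding per word
W = rng.normal(size=(len(VOCAB), 64))          # the vocabulary matrix

def predict_next(word: str) -> str:
    x = emb_table[VOCAB.index(word)]      # look up the 64-dim input vector
    scores = W @ x                        # 20 scores, one per vocabulary word
    return VOCAB[int(np.argmax(scores))]  # pick the highest

print(predict_next("cat"))  # some vocabulary word (arbitrary: weights are random)
```

With trained weights, the highest-scoring row would correspond to a plausible next word rather than a random one.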

Why This Matters

Understanding this architecture clarifies what language models actually do:

  1. They're scorers, not searchers: The model doesn't search through possible sentences. It computes scores for all vocabulary words simultaneously through matrix multiplication.

  2. Fixed architecture, flexible content: The matrix size is determined by vocabulary size (20 words) and embedding size (64 dims). These are fixed. What changes is the input vector content.

  3. The bottleneck: Everything about context, meaning, and relationships must be packed into those 64 input numbers. If the input is just one word's embedding, you lose all context. The next module shows how to create richer input representations that capture context—but they're still 64 dimensions, just with more information encoded in them.
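To make the bottleneck concrete, here is one crude (and purely illustrative) way to pack two words into a single input: average their embeddings. This is an assumption for demonstration, not the technique the next module introduces, but it shows that richer inputs still fit the same 64-slot interface:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_the = rng.normal(size=(64,))  # stand-in embedding for "the"
emb_cat = rng.normal(size=(64,))  # stand-in embedding for "cat"
W = rng.normal(size=(20, 64))     # the same vocabulary matrix as before

# Crude context vector: average the two embeddings.
# Still exactly 64 numbers, so it fits the unchanged matrix.
ctx = (emb_the + emb_cat) / 2
scores = W @ ctx   # (20,): same pipeline, richer input
```

However the context vector is built, the matrix multiply at the end is identical.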

The transformation from embedding to scores is straightforward matrix multiplication. The challenge is getting the right input to multiply.
