Module 2: Predicting the Next Word
Building Toward Tiny GPT
Remember the 20-word language model you saw in Module 1? By the end of this course, you'll build it from scratch. This module takes the first step: learning to predict the next word using simple mathematical operations.
Your Module 2 predictor won't understand context yet. Given "The cat sat on the mat", it can't tell that "mat" relates to "sat" (sitting on something). It treats each word independently. But this simple version teaches the fundamental operations that every language model uses: matrix multiplication and softmax.
In Modules 3 and 4, you'll add attention and transformer blocks to create the full tiny GPT. For now, focus on the basics: transforming embeddings into predictions.
From Numbers to Predictions
Module 1 showed how "cat" becomes a vector of numbers like [0.8, 0.2, 0.5, ...]. But those numbers don't predict anything yet. When you type "The cat sat on the" into ChatGPT, how do those embedding vectors become a prediction that the next word should be "mat" or "floor"?
The answer is mathematical transformations. The model takes embedding vectors and pushes them through a series of mathematical operations—matrix multiplication, addition, normalization—to produce probabilities for every possible next word. These operations have adjustable parameters that the model learned from billions of examples.
This module teaches you those transformations. You'll learn the core mathematical building blocks that make up neural networks and language models. By the end, you'll understand how embeddings flow through calculations to become predictions, and you'll build a simple next-word predictor yourself.
What You'll Learn
The journey is straightforward: Embeddings → Transformations → Probabilities. We start with embeddings and end with predictions for the next word.
Functions with Learnable Parameters - Neural networks are just functions with adjustable parameters. You'll see how a simple function like output = weights @ input + bias can be adjusted to perform different transformations. These parameters are what the model "learns" from examples.
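Here's a minimal sketch of that idea, assuming NumPy; the shapes, values, and the function name `transform` are made up purely for illustration:

```python
import numpy as np

# A function with learnable parameters: the code never changes,
# but different values in `weights` and `bias` give different behavior.
rng = np.random.default_rng(seed=0)

weights = rng.normal(size=(4, 3))   # learnable: maps 3 inputs to 4 outputs
bias = rng.normal(size=4)           # learnable: one offset per output

def transform(x):
    return weights @ x + bias       # matrix multiply, then shift

x = np.array([0.8, 0.2, 0.5])       # a tiny made-up "embedding"
print(transform(x))                 # 4 numbers; "learning" means nudging weights and bias
```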
Matrix Multiplication - This is the core operation. You'll understand how multiplying a vector by a matrix transforms it into a new representation. When you multiply a 768-dimensional embedding like [0.8, 0.2, 0.5, ...] by a 768×50,000 matrix, you get 50,000 numbers: one score for each word in the vocabulary.
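As a rough sketch of that step (shrinking the real 768×50,000 shapes down to toy sizes, with random stand-in values in NumPy):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

embedding_dim = 8     # stands in for 768
vocab_size = 10       # stands in for 50,000

embedding = rng.normal(size=embedding_dim)                    # one word's vector
output_matrix = rng.normal(size=(embedding_dim, vocab_size))  # learned weights

scores = embedding @ output_matrix   # shape (vocab_size,): one raw score per word
print(scores.shape)                  # (10,)
```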
From Scores to Probabilities - Matrix multiplication gives you raw scores like [2.3, 5.1, -1.2, ...]. The softmax function converts these scores into probabilities that sum to 1.0. This gives you the actual prediction: "mat" has 35% probability, "floor" has 25%, and so on.
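A minimal softmax sketch in NumPy, applied to three made-up scores like the ones above:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; it doesn't change the result.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([2.3, 5.1, -1.2])
probs = softmax(scores)
print(probs)        # approx. [0.057, 0.941, 0.002]: larger scores get larger probabilities
print(probs.sum())  # 1.0
```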
Building a Next-Word Predictor - All these pieces combine. You'll implement a simple predictor that takes word embeddings, transforms them through matrix multiplication, applies softmax, and outputs probabilities for the next word.
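To preview where that lands, here is a sketch of the whole pipeline. The five-word vocabulary, random weights, and the helper name `predict_next` are all illustrative assumptions; with untrained parameters the probabilities are meaningless until learning adjusts them:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

vocab = ["the", "cat", "sat", "mat", "floor"]   # toy vocabulary
embedding_dim = 8

# Learnable parameters (random here; training would adjust them).
embeddings = rng.normal(size=(len(vocab), embedding_dim))     # one row per word
output_matrix = rng.normal(size=(embedding_dim, len(vocab)))  # scores for each word

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def predict_next(word):
    x = embeddings[vocab.index(word)]   # look up the word's embedding
    scores = x @ output_matrix          # one raw score per vocabulary word
    return softmax(scores)              # probabilities summing to 1.0

for word, p in zip(vocab, predict_next("the")):
    print(f"{word}: {p:.2f}")
```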
Why This Matters
Why learn the math instead of just using the API?
Understanding these operations is what separates using LLMs from understanding them. When you know that predictions come from matrix multiplication followed by softmax, you understand why:
- The model can only predict words in its vocabulary (the output matrix has one column per word)
- Temperature works by scaling scores before softmax (we'll cover this in detail; a quick preview follows this list)
- The model's "knowledge" lives in the weight matrices (billions of learned numbers)
- Training means adjusting these parameters to make better predictions
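As a quick preview of the temperature point, here's a small sketch with made-up scores (the detailed treatment comes later in the module):

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([2.3, 5.1, -1.2])        # made-up raw scores

for temperature in [0.5, 1.0, 2.0]:
    probs = softmax(scores / temperature)  # temperature scales scores before softmax
    print(temperature, np.round(probs, 3))
# Lower temperature sharpens the distribution; higher temperature flattens it.
```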
More importantly, these building blocks aren't just for LLMs. Every modern neural network, whether for images, speech, or video, uses these same core operations. Learn them once, and you'll understand how all deep learning systems work.
The Path Ahead
Functions with Learnable Parameters - See how functions with adjustable parameters can learn patterns. You'll understand what "learning" means mathematically.
Matrix Multiplication Core - Learn the fundamental operation that transforms vectors. We start with a housing price example to build intuition before applying it to embeddings.
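To give a flavor of that kind of intuition-building example (the features, weights, and prices here are made up, not the numbers from the article itself):

```python
import numpy as np

# Features for one house: [square meters, bedrooms, distance to city center in km]
house = np.array([120.0, 3.0, 5.0])

# Made-up "learned" weights: how much each feature contributes to the price,
# plus a base price.
weights = np.array([2000.0, 15000.0, -3000.0])
bias = 50000.0

price = weights @ house + bias   # a dot product: the simplest matrix multiplication
print(price)                     # 120*2000 + 3*15000 + 5*(-3000) + 50000 = 320000.0
```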
From Embeddings to Scores - Connect matrix multiplication to language prediction. See how embedding vectors multiply with a weight matrix to produce scores for each possible next word.
Softmax: Scores to Probabilities - Convert raw scores into probabilities that sum to 1.0. This gives you actual predictions you can sample from.
Building a Next-Word Predictor - Implement a complete predictor from scratch using everything you've learned. This brings all the concepts together into working code.
Each article is short and focused. By the end, you'll understand the complete transformation from word embeddings to next-word predictions. This foundation is essential—Module 3 builds on these operations to add context-awareness through attention.