Embeddings - Words as Vectors
From Tokens to Meaning
We've seen how text becomes tokens. The token "cat" might have ID 5234 in our vocabulary. But a single number—an ID—doesn't capture meaning. The model can't tell that "cat" and "dog" are related (both animals) while "cat" and "table" are not.
This is where embeddings come in. An embedding converts each token into a list of numbers called a vector. Instead of representing "cat" as the single number 5234, we represent it as a list like [0.2, -0.5, 0.8, 0.1, ...] with hundreds of dimensions.
These numbers aren't random. They're learned from data such that similar words have similar vectors. "cat" and "dog" end up with similar lists of numbers. "cat" and "table" have different lists.
What is an Embedding?
An embedding is a list of numbers (a vector) that represents a token. Think of it like coordinates in space. The word "cat" might map to the point (0.2, -0.5, 0.8) in 3-dimensional space. In reality, LLMs use hundreds or even thousands of dimensions; GPT-2, for example, uses 768 numbers per token.
Each number in the embedding vector represents some learned feature or aspect of the word's meaning. We don't know exactly what each dimension means (the model learns this automatically), but similar words end up close together in this high-dimensional space.
Example embeddings (simplified to 4 dimensions for readability):
- "cat" → [0.7, 0.2, -0.1, 0.5]
- "dog" → [0.6, 0.3, -0.2, 0.4]
- "table" → [-0.3, 0.8, 0.5, -0.2]
- "chair" → [-0.2, 0.7, 0.6, -0.3]
Notice: "cat" and "dog" have similar values. "table" and "chair" are similar to each other. But "cat" and "table" are quite different.
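You can check this pattern directly in Python. Here's a minimal sketch using the toy vectors above (invented for illustration; real embeddings are learned, not hand-written):

```python
# Toy 4-dimensional embeddings from the example above.
# Real models learn hundreds of dimensions automatically.
embeddings = {
    "cat":   [0.7, 0.2, -0.1, 0.5],
    "dog":   [0.6, 0.3, -0.2, 0.4],
    "table": [-0.3, 0.8, 0.5, -0.2],
    "chair": [-0.2, 0.7, 0.6, -0.3],
}

# Compare words dimension by dimension.
for a, b in [("cat", "dog"), ("cat", "table")]:
    diffs = [round(abs(x - y), 2) for x, y in zip(embeddings[a], embeddings[b])]
    print(f"{a} vs {b}: per-dimension differences = {diffs}")
```

"cat" vs "dog" differs by only 0.1 in every dimension, while "cat" vs "table" differs by as much as 1.0. That gap is exactly what the distance measures later in this article will quantify.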
Why Vectors Capture Meaning
By using hundreds of numbers instead of one, we can encode rich information. Different dimensions might capture different aspects:
- Some dimensions might encode "is it an animal?"
- Others might encode "is it alive?"
- Others might encode "is it typically found indoors?"
- And so on for hundreds of dimensions
The model learns these representations automatically by processing billions of words. Words that appear in similar contexts end up with similar embeddings.
When a phrase like "The ___ chased the mouse" appears in training data, both "cat" and "dog" could fill the blank. Over millions of such examples, the model learns that "cat" and "dog" should have similar representations.
The Famous Word Math Example
Here's something remarkable: you can do math with word embeddings. The classic example:
king - man + woman ≈ queen
This actually works! If you:
- Take the embedding for "king"
- Subtract the embedding for "man"
- Add the embedding for "woman"
- Find the closest word to the result
You get "queen" (or something very close to it).
Why does this work? The embedding for "king" encodes "royal + male". The embedding for "man" encodes "male". Subtracting "man" removes the "male" aspect, leaving "royal". Adding "woman" (which encodes "female") gives you "royal + female" = "queen".
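You can reproduce this arithmetic in a few lines of NumPy. The vectors below are toy values crafted so the pattern is easy to see; with real trained embeddings the result is approximate rather than exact:

```python
import numpy as np

# Toy embeddings invented for illustration (not real model weights).
words = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "man":   np.array([0.1, 0.9, 0.2, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "queen": np.array([0.9, 0.0, 0.8, 0.4]),
}

# king - man + woman
result = words["king"] - words["man"] + words["woman"]

# Find the vocabulary word closest to the result (Euclidean distance).
closest = min(words, key=lambda w: float(np.linalg.norm(words[w] - result)))
print(closest)  # -> "queen" with these toy values
```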
Measuring Similarity
How do we measure if two embeddings are similar? We calculate their distance. Closer vectors = more similar words.
The simplest measure is Euclidean distance: the straight-line distance between two points in space. If "cat" is at (0.7, 0.2, -0.1, 0.5) and "dog" is at (0.6, 0.3, -0.2, 0.4), the distance is:
distance = √[(0.7-0.6)² + (0.2-0.3)² + (-0.1-(-0.2))² + (0.5-0.4)²] = √[0.01 + 0.01 + 0.01 + 0.01] = √0.04 = 0.2
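Here's the same calculation as a small Python function, run on the toy vectors from earlier:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat   = [0.7, 0.2, -0.1, 0.5]
dog   = [0.6, 0.3, -0.2, 0.4]
table = [-0.3, 0.8, 0.5, -0.2]

print(euclidean_distance(cat, dog))    # ≈ 0.2  (similar words)
print(euclidean_distance(cat, table))  # ≈ 1.49 (unrelated words)
```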
Small distance means high similarity. Another common measure is cosine similarity, which looks at the angle between vectors rather than absolute distance. We'll explore this in the next article.
How Models Use Embeddings
When an LLM processes text, the first step is always:
- Convert text to tokens (tokenization)
- Look up each token's embedding in a table
- Feed these embedding vectors to the model
The embedding table is just a big array with one row per token. For GPT-2's vocabulary of 50,257 tokens and 768 dimensions, that's a table of 50,257 rows, each holding 768 numbers.
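Here's a minimal sketch of that lookup. The table below is filled with random numbers just to show the mechanics; in a real model, every value is learned:

```python
import numpy as np

vocab_size, embedding_dim = 50_257, 768  # GPT-2 small's sizes

# A real embedding table holds trained values; random ones stand in here.
embedding_table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

token_ids = [5234, 318, 257]  # hypothetical tokenizer output ("cat" was 5234 above)
token_embeddings = embedding_table[token_ids]  # row lookup, one row per token

print(token_embeddings.shape)  # (3, 768): three tokens, 768 numbers each
```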
The model learns these embedding values during its training process. Initially, they're random numbers. After training on billions of words, similar words naturally end up with similar embeddings.
Embedding Dimensions
Real LLMs use large embeddings:
- GPT-2 small: 768 dimensions
- GPT-2 large: 1,280 dimensions
- GPT-3: 12,288 dimensions (for the largest model)
More dimensions let the model capture more nuanced meaning, but they also cost more computation and memory, so there's a trade-off between embedding size and efficiency.
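A quick back-of-the-envelope calculation shows what those dimensions cost for the embedding table alone, assuming GPT-2's vocabulary of 50,257 tokens (GPT-3 uses the same BPE vocabulary size):

```python
vocab_size = 50_257  # GPT-2's vocabulary size

for name, dim in [("GPT-2 small", 768), ("GPT-2 large", 1_280), ("GPT-3", 12_288)]:
    params = vocab_size * dim
    mb = params * 4 / 1e6  # assuming 4-byte float32 values
    print(f"{name}: {params:,} embedding parameters (~{mb:,.0f} MB)")
```

GPT-3's table alone is over 600 million parameters, roughly 2.5 GB in float32. That's the efficiency side of the trade-off.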
How Embeddings Are Learned
You don't manually create embeddings. The model learns them automatically. During training, the model:
- Starts with random embedding vectors
- Tries to predict text using these embeddings
- Adjusts the embeddings when predictions are wrong
- Repeats billions of times
After training, words that help make similar predictions end up with similar embeddings. The model discovers that "cat" and "dog" should be similar because they both fit in contexts like "The ___ ran across the room."
We won't dive into training details in this course. We'll focus on inference—how to use pre-trained embeddings to build a working LLM. Understanding inference is the foundation for everything else.
What's Next
Embeddings convert tokens into rich numerical representations. The next article explores how to measure similarity between these vectors more precisely using cosine similarity—a measure that focuses on direction rather than distance.
Later in the course, you'll see how the model uses these embeddings to understand context through the attention mechanism. But first, we need to understand similarity measurements.