Embeddings - Words as Vectors
From Tokens to Meaning
We've seen how text becomes tokens. The token "cat" might have ID 5234 in our vocabulary. But a single number—an ID—doesn't capture meaning. The model can't tell that "cat" and "dog" are related (both animals) while "cat" and "table" are not.
This is where embeddings come in. An embedding converts each token into a list of numbers called a vector. Instead of representing "cat" as the single number 5234, we represent it as a list like [0.2, -0.5, 0.8, 0.1, ...] with hundreds of dimensions.
These numbers aren't random. They're learned from data such that similar words have similar vectors. "cat" and "dog" end up with similar lists of numbers. "cat" and "table" have different lists.
What is an Embedding?
An embedding is a list of numbers (a vector) that represents a token. Think of it like coordinates in space. The word "cat" might map to the point (0.2, -0.5, 0.8) in 3-dimensional space. In reality, LLMs use hundreds or even thousands of dimensions; GPT-2, for example, uses 768 numbers per token.
Each number in the embedding vector represents some learned feature or aspect of the word's meaning. We don't know exactly what each dimension means (the model learns this automatically), but similar words end up close together in this high-dimensional space.
Example embeddings (simplified to 4 dimensions for readability):
- "cat" → [0.7, 0.2, -0.1, 0.5]
- "dog" → [0.6, 0.3, -0.2, 0.4]
- "table" → [-0.3, 0.8, 0.5, -0.2]
- "chair" → [-0.2, 0.7, 0.6, -0.3]
Notice: "cat" and "dog" have similar values. "table" and "chair" are similar to each other. But "cat" and "table" are quite different.
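You can check this pattern directly in Python. Here's a minimal sketch using the toy vectors above (invented for illustration; real embeddings are learned, not hand-written):

```python
# Toy 4-dimensional embeddings from the example above.
# Real models learn hundreds of dimensions automatically.
embeddings = {
    "cat":   [0.7, 0.2, -0.1, 0.5],
    "dog":   [0.6, 0.3, -0.2, 0.4],
    "table": [-0.3, 0.8, 0.5, -0.2],
    "chair": [-0.2, 0.7, 0.6, -0.3],
}

# Compare words dimension by dimension.
for a, b in [("cat", "dog"), ("cat", "table")]:
    diffs = [round(abs(x - y), 2) for x, y in zip(embeddings[a], embeddings[b])]
    print(f"{a} vs {b}: per-dimension differences = {diffs}")
```

"cat" vs "dog" differs by only 0.1 in every dimension, while "cat" vs "table" differs by as much as 1.0. That gap is exactly what the distance measures later in this article will quantify.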
Why Vectors Capture Meaning
By using hundreds of numbers instead of one, we can encode rich information. Different dimensions might capture different aspects:
- Some dimensions might encode "is it an animal?"
- Others might encode "is it alive?"
- Others might encode "is it typically found indoors?"
- And so on for hundreds of dimensions
The model learns these representations automatically by processing billions of words. Words that appear in similar contexts end up with similar embeddings.
When a phrase like "The ___ chased the mouse" appears in training data, both "cat" and "dog" could fill the blank. Over millions of such examples, the model learns that "cat" and "dog" should have similar representations.
The Famous Word Math Example
Here's something remarkable: you can do math with word embeddings. The classic example:
king - man + woman ≈ queen
This actually works! If you:
- Take the embedding for "king"
- Subtract the embedding for "man"
- Add the embedding for "woman"
- Find the closest word to the result
You get "queen" (or something very close to it).
Why does this work? The embedding for "king" encodes "royal + male". The embedding for "man" encodes "male". Subtracting "man" removes the "male" aspect, leaving "royal". Adding "woman" (which encodes "female") gives you "royal + female" = "queen".
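You can reproduce this arithmetic in a few lines of NumPy. The vectors below are toy values crafted so the pattern is easy to see; with real trained embeddings the result is approximate rather than exact:

```python
import numpy as np

# Toy embeddings invented for illustration (not real model weights).
words = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "man":   np.array([0.1, 0.9, 0.2, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "queen": np.array([0.9, 0.0, 0.8, 0.4]),
}

# king - man + woman
result = words["king"] - words["man"] + words["woman"]

# Find the vocabulary word closest to the result (Euclidean distance).
closest = min(words, key=lambda w: float(np.linalg.norm(words[w] - result)))
print(closest)  # -> "queen" with these toy values
```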
Measuring Similarity
How do we measure if two embeddings are similar? We calculate their distance. Closer vectors = more similar words.
The simplest measure is Euclidean distance: the straight-line distance between two points in space. If "cat" is at (0.7, 0.2, -0.1, 0.5) and "dog" is at (0.6, 0.3, -0.2, 0.4), the distance is:
distance = √[(0.7-0.6)² + (0.2-0.3)² + (-0.1-(-0.2))² + (0.5-0.4)²] = √[0.01 + 0.01 + 0.01 + 0.01] = √0.04 = 0.2
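Here's the same calculation as a small Python function, run on the toy vectors from earlier:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat   = [0.7, 0.2, -0.1, 0.5]
dog   = [0.6, 0.3, -0.2, 0.4]
table = [-0.3, 0.8, 0.5, -0.2]

print(euclidean_distance(cat, dog))    # ≈ 0.2  (similar words)
print(euclidean_distance(cat, table))  # ≈ 1.49 (unrelated words)
```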
Small distance means high similarity. Another common measure is cosine similarity, which looks at the angle between vectors rather than absolute distance. We'll explore this in the next article.
How Models Use Embeddings
When an LLM processes text, the first step is always:
- Convert text to tokens (tokenization)
- Look up each token's embedding in a table
- Feed these embedding vectors to the model
The embedding table is just a big array with one row per token. For GPT-2's vocabulary of 50,257 tokens and 768 dimensions, that's a table of 50,257 rows, each holding 768 numbers.
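Here's a minimal sketch of that lookup. The table below is filled with random numbers just to show the mechanics; in a real model, every value is learned:

```python
import numpy as np

vocab_size, embedding_dim = 50_257, 768  # GPT-2 small's sizes

# A real embedding table holds trained values; random ones stand in here.
embedding_table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

token_ids = [5234, 318, 257]  # hypothetical tokenizer output ("cat" was 5234 above)
token_embeddings = embedding_table[token_ids]  # row lookup, one row per token

print(token_embeddings.shape)  # (3, 768): three tokens, 768 numbers each
```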
The model learns these embedding values during its training process. Initially, they're random numbers. After training on billions of words, similar words naturally end up with similar embeddings.
Embedding Dimensions
Real LLMs use large embeddings:
- GPT-2 small: 768 dimensions
- GPT-2 large: 1,280 dimensions
- GPT-3: 12,288 dimensions (for the largest model)
More dimensions let the model capture more nuanced meaning, but they also cost more computation and memory, so there's a trade-off between embedding size and efficiency.
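A quick back-of-the-envelope calculation shows what those dimensions cost for the embedding table alone, assuming GPT-2's vocabulary of 50,257 tokens (GPT-3 uses the same BPE vocabulary size):

```python
vocab_size = 50_257  # GPT-2's vocabulary size

for name, dim in [("GPT-2 small", 768), ("GPT-2 large", 1_280), ("GPT-3", 12_288)]:
    params = vocab_size * dim
    mb = params * 4 / 1e6  # assuming 4-byte float32 values
    print(f"{name}: {params:,} embedding parameters (~{mb:,.0f} MB)")
```

GPT-3's table alone is over 600 million parameters, roughly 2.5 GB in float32. That's the efficiency side of the trade-off.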
How Embeddings Are Learned
You don't manually create embeddings. The model learns them automatically. During training, the model:
- Starts with random embedding vectors
- Tries to predict text using these embeddings
- Adjusts the embeddings when predictions are wrong
- Repeats billions of times
After training, words that help make similar predictions end up with similar embeddings. The model discovers that "cat" and "dog" should be similar because they both fit in contexts like "The ___ ran across the room."
We won't dive into training details in this course. We'll focus on inference—how to use pre-trained embeddings to build a working LLM. Understanding inference is the foundation for everything else.
What's Next
Embeddings convert tokens into rich numerical representations. The next article explores how to measure similarity between these vectors more precisely using cosine similarity—a measure that focuses on direction rather than distance.
Later in the course, you'll see how the model uses these embeddings to understand context through the attention mechanism. But first, we need to understand similarity measurements.