Module 1: From Text to Numbers
You saw the tiny GPT we're building in the course overview. To make that work, the model needs to convert text into numbers. This module teaches that transformation—the essential first step before any prediction or attention can happen.
By the end of Module 1, you'll understand how "The cat sat on the mat" becomes vectors of numbers that the model can process. This isn't just encoding—the numbers capture meaning and relationships. Words with similar meanings get similar numbers. This mathematical representation is what makes language models possible.
The Journey from Text to Numbers
Large language models can't work with text directly. When you type "The cat sat on the mat" into ChatGPT, the model doesn't see letters or words. It sees numbers—lots of them. Every piece of text goes through a transformation pipeline that converts human-readable text into mathematical representations the model can process.
This module teaches you that transformation pipeline. You'll learn how LLMs break text into pieces, convert those pieces into vectors of numbers, and use those numbers to understand meaning and relationships. By the end of this module, you'll understand the first critical step in how LLMs work: representing language as mathematics.
What You'll Learn
The big picture is straightforward: Text → Tokens → Embeddings → Similarity. We start with raw text and end with numbers that capture meaning.
Breaking Text into Tokens - LLMs don't process full sentences at once. They split text into smaller pieces called tokens. You'll learn why "running" becomes ["run", "ning"] and how this handles any word, even ones the model has never seen before.
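To make this concrete, here is a minimal sketch of greedy longest-match subword tokenization. The vocabulary below is made up for illustration; real tokenizers (such as BPE) learn their vocabulary from large amounts of text.

```python
# Tiny hand-picked vocabulary -- purely illustrative.
VOCAB = {"run", "ning", "the", "cat", "un", "seen"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character for unknown pieces,
            # so any word can be tokenized.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("running"))  # ['run', 'ning']
print(tokenize("unseen"))   # ['un', 'seen']
```

The character-level fallback is what lets this handle words the vocabulary has never seen: in the worst case, a word just becomes a sequence of single characters.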
Converting Tokens to Numbers - Each token becomes a vector—a list of numbers like [0.2, -0.5, 0.8, ...]. These aren't random. Words with similar meanings get similar numbers. You'll see how this mathematical representation captures relationships between words.
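In code, this step is just a table lookup. The vectors below are invented four-dimensional examples; real models use hundreds or thousands of dimensions, and the values are learned during training rather than written by hand.

```python
# Hypothetical 4-dimensional embedding table (values are made up).
EMBEDDINGS = {
    "cat":     [0.2, -0.5, 0.8, 0.1],
    "dog":     [0.3, -0.4, 0.7, 0.2],
    "quantum": [-0.9, 0.6, -0.1, 0.8],
}

def embed(tokens):
    """Map each token to its vector via a simple dictionary lookup."""
    return [EMBEDDINGS[t] for t in tokens]

print(embed(["cat", "dog"]))
# [[0.2, -0.5, 0.8, 0.1], [0.3, -0.4, 0.7, 0.2]]
```

Notice that "cat" and "dog" were given similar numbers while "quantum" was not; that is the property a trained embedding table ends up with.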
Finding Similarity - Once words are vectors, we can measure how related they are using simple math. "Cat" and "dog" have similar vectors. "Cat" and "quantum" don't. This similarity measurement becomes the foundation for everything LLMs do.
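The standard measurement here is cosine similarity: the cosine of the angle between two vectors, where 1.0 means they point the same way and negative values mean they point in opposite directions. Using the same illustrative vectors as above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat     = [0.2, -0.5, 0.8, 0.1]
dog     = [0.3, -0.4, 0.7, 0.2]
quantum = [-0.9, 0.6, -0.1, 0.8]

print(cosine_similarity(cat, dog))      # close to 1: related words
print(cosine_similarity(cat, quantum))  # negative: unrelated words
```

With trained embeddings, this single formula is enough to rank how related any two words are.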
Why This Matters
Why go through all this instead of just using words directly?
The answer is that mathematics gives us generalization. When the model sees "The cat sat on the mat," it doesn't just memorize that exact sentence. The vector representations let it understand that "cat" is similar to "dog," that "sat" is similar to "stood," that "mat" is similar to "rug." It can then handle "The dog stood on the rug" without ever seeing that exact combination.
This mathematical representation is what makes LLMs powerful. They don't need rules for every possible sentence. They learn patterns from examples, and those patterns work on new text they've never encountered.
The Path Ahead
This module has four articles that build on each other:
Tokenization Basics - How text splits into processable pieces. You'll learn why models break text into tokens and see the basic mechanics of this process.
Subword Tokenization - Why modern LLMs use word fragments, not full words. You'll understand the engineering tradeoffs that led to this design choice.
Embeddings: Words as Vectors - How tokens become lists of numbers. You'll see concrete examples of word vectors and understand what these numbers represent.
Vector Similarity - Measuring relationships between word vectors. You'll learn the math that determines which words are related to each other.
Each article is short and focused. By the end, you'll understand the complete transformation from "Hello world" to the numbers an LLM actually processes. This foundation is essential—every subsequent module builds on these concepts.
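Putting the pieces together, the whole pipeline can be sketched in a few lines. Everything here is illustrative: the whitespace split stands in for real subword tokenization, and the vectors are invented rather than learned.

```python
# End-to-end sketch: text -> tokens -> vectors (values are made up).
EMBEDDINGS = {
    "hello": [0.1, 0.7, -0.2],
    "world": [0.4, -0.3, 0.6],
}

def tokenize(text):
    # Whitespace split stands in for real subword tokenization.
    return text.lower().split()

def embed(tokens):
    return [EMBEDDINGS[t] for t in tokens]

tokens = tokenize("Hello world")
vectors = embed(tokens)
print(tokens)   # ['hello', 'world']
print(vectors)  # [[0.1, 0.7, -0.2], [0.4, -0.3, 0.6]]
```

These two steps, splitting and looking up, are the entire input side of the model; everything later in the course operates on the resulting vectors.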
What You Won't Learn Here
This module covers the input representation only. We don't yet explain how the model generates predictions, how it learns from examples, or how the transformer architecture works. Those topics come in later modules. For now, focus on understanding how text becomes numbers that capture meaning.