Tokenization Basics
From Text to Numbers
Computers work with numbers, not letters. When you type "Hello, world!" into an LLM, the model doesn't see those characters directly. It needs to convert the text into numbers it can process. This conversion happens in two steps: first, break the text into pieces called tokens, then convert each token into a number.
Tokenization is the process of splitting text into these smaller pieces. Think of it like breaking a sentence into words, except the rules are more flexible—sometimes a token is a word, sometimes it's part of a word, and sometimes it's punctuation.
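To make the two steps concrete, here is a minimal sketch in Python. The tiny vocabulary and the naive splitter are made up purely for illustration; real tokenizers are far more sophisticated.

```python
# Step 1: split text into tokens. Step 2: map each token to a number (its ID).
# The vocabulary below is a made-up example, not a real model's vocabulary.
vocab = {"Hello": 0, ",": 1, "world": 2, "!": 3}

def tokenize(text):
    # A deliberately naive splitter: pad punctuation with spaces, then split.
    for mark in ",.!?":
        text = text.replace(mark, f" {mark} ")
    return text.split()

def encode(tokens):
    # Look up each token's ID in the vocabulary.
    return [vocab[token] for token in tokens]

tokens = tokenize("Hello, world!")   # ['Hello', ',', 'world', '!']
ids = encode(tokens)                 # [0, 1, 2, 3]
print(tokens)
print(ids)
```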
Why Not Process Characters Directly?
Why not just convert each letter to a number? The letter 'A' could be 1, 'B' could be 2, and so on. This approach, called character-level tokenization, is simple but inefficient.
Consider the word "cat". With character-level tokenization, you get three separate tokens: 'c', 'a', 't'. The model must learn that these three characters, when appearing together in this order, represent the concept of a cat. It's like forcing the model to learn spelling rules before it can learn language patterns.
With word-level tokenization, "cat" becomes a single token. The model can immediately associate this token with everything it learned about cats, without worrying about individual letters. This is more efficient—similar to how you recognize whole words when reading, not individual letters.
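A quick sketch of the difference for the word "cat" (toy code, just to show the token counts):

```python
word = "cat"

# Character-level: every character becomes its own token.
char_tokens = list(word)      # ['c', 'a', 't']  -> 3 tokens
# Word-level: the whole word is a single token.
word_tokens = [word]          # ['cat']          -> 1 token

print(len(char_tokens), "character-level tokens vs", len(word_tokens), "word-level token")
```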
Character-Level Tokenization
Despite its limitations, character-level tokenization is the simplest approach. Every character becomes a token: letters, numbers, punctuation, and spaces.
The vocabulary is tiny—just the characters you want to support. For English, you might need:
- 26 lowercase letters (a-z)
- 26 uppercase letters (A-Z)
- 10 digits (0-9)
- Common punctuation (.,!?;:'"-)
- Space and newline characters
That's roughly 100 tokens total. The advantage is universal coverage—any text can be represented using these characters. The disadvantage is inefficiency—the model must process many more tokens to represent the same text.
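A character-level tokenizer takes only a few lines. The sketch below builds a vocabulary from a character set like the one listed above (the exact set is an illustrative choice) and encodes text one character at a time:

```python
import string

# Character-level vocabulary: letters, digits, common punctuation, whitespace.
chars = (
    string.ascii_lowercase
    + string.ascii_uppercase
    + string.digits
    + ".,!?;:'\"-"
    + " \n"
)
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    # One token per character.
    return [char_to_id[ch] for ch in text]

def decode(ids):
    return "".join(id_to_char[i] for i in ids)

print(len(char_to_id))          # 73 entries with this particular character set
ids = encode("The cat sat.")
print(ids)                      # 12 token IDs for a 12-character string
print(decode(ids))              # "The cat sat."
```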
Word-Level Tokenization
Word-level tokenization splits text by spaces and punctuation. Each word becomes a token. This is closer to how humans process language—we read word by word, not character by character.
The process is straightforward: find spaces and punctuation, then split the text at those boundaries. The word "Hello" is one token. The comma "," is typically a separate token. The word "world" is another token.
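As a sketch, that splitting rule can be written as a single regular expression (real word-level tokenizers handle contractions, hyphens, and other edge cases more carefully):

```python
import re

def word_tokenize(text):
    # Grab runs of word characters, or single non-space punctuation marks;
    # the spaces themselves are discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```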
However, word-level tokenization has a significant problem: vocabulary size. English has hundreds of thousands of words. If you want to support technical terms, names, and domain-specific vocabulary, you might need millions of tokens. This makes the model large and slow.
The Vocabulary Problem
Both character-level and word-level tokenization have trade-offs:
Character-level:
- Small vocabulary (~100 tokens)
- Can represent any text
- Very inefficient—many tokens per sentence
- Model must learn spelling patterns
Word-level:
- Large vocabulary (100,000+ tokens)
- Efficient—fewer tokens per sentence
- Cannot handle unknown words
- Struggles with misspellings and new terms
For example, if "unbreakable" isn't in your vocabulary, word-level tokenization fails. You could map it to a generic "unknown" token, but then the model loses any information about which word was actually there.
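The sketch below shows that fallback in code, using a made-up vocabulary and the placeholder token often written `<unk>`:

```python
# Word-level tokenizer with an <unk> fallback; the vocabulary is hypothetical.
vocab = {"<unk>": 0, "the": 1, "vase": 2, "is": 3, "unbeatable": 4}

def encode(words):
    # Every out-of-vocabulary word collapses to the same <unk> ID,
    # so the model can no longer tell which word was actually there.
    return [vocab.get(word, vocab["<unk>"]) for word in words]

print(encode(["the", "vase", "is", "unbeatable"]))   # [1, 2, 3, 4]
print(encode(["the", "vase", "is", "unbreakable"]))  # [1, 2, 3, 0] -- meaning lost
```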
Real LLMs use a middle ground called subword tokenization, which we'll explore in the next article. Subword methods split rare words into smaller pieces that the model recognizes, combining the benefits of both approaches.
Special Tokens
Beyond regular words and characters, vocabularies include special tokens that serve specific control purposes. These tokens don't represent text content—they signal boundaries, padding, or instructions to the model.
The tiny GPT model you'll build in this course uses two special tokens in its 20-word vocabulary:
END marks where generation should stop. When the model predicts this token, generation terminates. Without an end marker, the model would continue generating text indefinitely. This token answers: "When should I stop?"
PAD fills sequences to equal length for batch processing. Neural networks process multiple examples simultaneously (batching), but all examples must have identical length. If one sequence has 3 tokens and another has 7 tokens, padding extends the shorter sequence to 7 tokens by appending PAD tokens. The model learns to ignore these during processing.
Production LLMs use additional special tokens for different purposes: separating conversations in chat models, marking different speakers in dialogue, denoting special formatting, or encoding instructions. Chat-tuned GPT models, for example, reserve special tokens that mark where each message begins and ends.
Special tokens occupy positions in the vocabulary just like regular words. In the tiny GPT vocabulary, PAD is token ID 18 and END is token ID 19. The model learns to treat these differently from content tokens through training.
During text generation, when the model predicts END, the generation loop terminates. When training on sequences of different lengths, padding tokens let you group them into a single batch; the padded positions carry no content, and the model learns to ignore them.
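Here is a rough sketch of both mechanisms, using the token IDs described above (PAD = 18, END = 19). The `predict_next` argument stands in for the real model, which isn't shown here:

```python
PAD_ID, END_ID = 18, 19

def pad_batch(sequences, pad_id=PAD_ID):
    # Append PAD tokens so every sequence in the batch has the same length.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

print(pad_batch([[5, 7, 2], [4, 1, 9, 3, 6, 0, 8]]))
# [[5, 7, 2, 18, 18, 18, 18], [4, 1, 9, 3, 6, 0, 8]]

def generate(prompt_ids, predict_next, max_new_tokens=50):
    # Keep generating until the model predicts END (or we hit a length cap).
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = predict_next(ids)
        if next_id == END_ID:
            break
        ids.append(next_id)
    return ids
```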
Tokens vs Words
A key point: tokens are not always words. Depending on the tokenization method:
One word might be multiple tokens:
- "unbreakable" → ["un", "break", "able"]
Multiple words might be one token:
- "New York" → ["NewYork"] (if treated as single entity)
Punctuation is usually separate tokens:
- "Hello!" → ["Hello", "!"]
Numbers can be split:
- "12345" → ["12", "345"] or ["1", "2", "3", "4", "5"]
When you use an LLM API, you're often charged per token, not per word. Understanding tokenization helps you estimate costs. A 100-word prompt might be 130 tokens or 80 tokens depending on the text and the tokenizer's vocabulary.
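To see real splits, you can use a real tokenizer. The sketch below assumes the `tiktoken` package (OpenAI's BPE tokenizer library) is installed and uses its `cl100k_base` encoding; other tokenizers will split the same strings differently:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello!", "unbreakable", "New York", "12345"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```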
Counting Tokens
Different text has different token counts depending on content:
Simple English text: Roughly 1 token per word, sometimes slightly more
- "The cat sat on the mat" → ~6-7 tokens
Complex or technical text: More tokens
- "Uncharacteristically" might be 3-4 subword tokens
Code: Often more tokens
- "function calculateTotal()" → 5-6 tokens
Other languages: Can vary significantly
- Languages without spaces (Chinese, Japanese) tokenize differently
- Some languages need more tokens per word
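You can check counts like these yourself with the same `tiktoken` setup as in the earlier sketch (the exact numbers depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "The cat sat on the mat",
    "Uncharacteristically",
    "function calculateTotal()",
]
for text in samples:
    # Print the token count next to each sample string.
    print(f"{len(enc.encode(text)):>3} tokens: {text!r}")
```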
Why This Matters
Tokenization affects everything about how LLMs work:
Performance - Fewer tokens mean faster processing. Every token requires computation through all model layers. Efficient tokenization reduces the number of tokens, making the model faster.
Context length - Models have token limits, not word limits. GPT-4's original context window, for example, was about 8,000 tokens. If your tokenization is inefficient, you fit less actual content in that window.
Cost - API pricing is per token. Understanding tokenization helps you estimate costs before making requests.
Model behavior - The model only sees tokens, not raw text. If "ChatGPT" is one token but "ClaudeAI" is three tokens, the model treats them differently.
Rare words - Words not in the vocabulary become multiple tokens or unknown markers. This affects how well the model handles specialized domains.
What's Next
Character-level and word-level tokenization are too extreme. Modern LLMs use subword tokenization, which splits text into pieces that balance vocabulary size and efficiency. The most common method is called Byte-Pair Encoding (BPE), which we'll explore in the next article.
Subword tokenization is why "unbreakable" becomes ["un", "break", "able"]—the model recognizes common word parts and combines them. This approach handles rare words gracefully while keeping vocabulary size manageable.
Try It Yourself: Vocabulary Limits
Our tiny GPT model demonstrates the vocabulary limitation problem. With only 20 words, the model rejects any input containing unknown words. This shows why real LLMs need larger vocabularies.
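A minimal sketch of that rejection logic, using a hypothetical 20-word vocabulary (the actual word list in the course code may differ), looks like this:

```python
# Hypothetical 20-entry vocabulary, with PAD and END at IDs 18 and 19.
vocab = {word: i for i, word in enumerate([
    "the", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug", "and",
    "a", "big", "small", "fast", "slow", "jumped", "slept", "here",
    "<pad>", "<end>",
])}

def encode_or_reject(text):
    # Reject any input that contains a word outside the vocabulary.
    words = text.lower().split()
    unknown = [w for w in words if w not in vocab]
    if unknown:
        raise ValueError(f"Unknown words for this 20-word vocabulary: {unknown}")
    return [vocab[w] for w in words]

print(encode_or_reject("the cat sat on the mat"))   # prints the token IDs

try:
    encode_or_reject("the cat purred")
except ValueError as err:
    print(err)   # "purred" is rejected because it isn't in the vocabulary
```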