Vectors and Arrays: Lists of Numbers
You've learned what transformers are at a high level. But to understand what happens inside those transformer blocks, you need to know a bit of math - specifically, how transformers manipulate numbers.
This section covers the basics in three short articles: vectors, operations on vectors, and matrices. That's all the linear algebra you need. After this, you'll learn how transformers convert text into these numerical representations.
Why Lists of Numbers?
Transformers represent everything as lists of numbers. Words become lists of numbers. Sentences become lists of lists of numbers. Even the model's internal "thoughts" are lists of numbers.
This seems strange at first. Why not use words, or symbols, or data structures? Because computers are really good at one thing: arithmetic. Adding, multiplying, comparing numbers - these operations are blazingly fast. Transformers leverage this by converting everything to numbers, then using billions of arithmetic operations to find patterns.
The mathematical term for "a list of numbers" is a vector. Understanding vectors is the foundation for understanding transformers.
What is a Vector?
A vector is just a list of numbers. Nothing fancy. If you've worked with arrays in programming, you already understand vectors.
```
A vector with 3 numbers:                    [2.5, -1.0, 3.7]
Another vector with 5 numbers:              [0.2, 0.8, -0.5, 1.2, 0.0]
A vector representing a word (simplified):  [0.23, -0.51, 0.82, 0.15, -0.33]
```
The numbers in a vector can be positive, negative, or zero. They can be integers or decimals. The only requirement is that a vector contains numbers in a specific order.
The count of numbers in a vector is called its dimension or length. The vector [2.5, -1.0, 3.7] has dimension 3. The vector [0.2, 0.8, -0.5, 1.2, 0.0] has dimension 5.
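In plain Python terms, the dimension is simply the length of the list:

```python
v = [2.5, -1.0, 3.7]  # a vector as a plain Python list

print(len(v))  # 3 - the vector's dimension
```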
Vectors in NumPy
Python's NumPy library provides fast operations on vectors. Transformer implementations rely on NumPy or similar libraries (such as PyTorch) for their vector operations.
Creating a vector is straightforward:
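```python
import numpy as np

# Create vectors (NumPy arrays) from Python lists
v1 = np.array([2.5, -1.0, 3.7])
v2 = np.array([0.2, 0.8, -0.5, 1.2, 0.0])

print(v1)        # [ 2.5 -1.   3.7]
print(type(v1))  # <class 'numpy.ndarray'>
```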
The output shows that NumPy created array objects. These look like Python lists but have special capabilities for fast arithmetic that we'll explore in the next article.
Notice the dimensions. Real transformer models use much longer vectors - typically 768, 1024, or more numbers per word. Our examples use shorter vectors for readability.
Accessing Elements
Access individual numbers in a vector using indices, just like Python lists:
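```python
import numpy as np

v = np.array([0.2, 0.8, -0.5, 1.2, 0.0])

print(v[0])    # 0.2  - first element
print(v[-1])   # 0.0  - last element
print(v[1:4])  # [ 0.8 -0.5  1.2] - a slice (sub-vector)
```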
As with Python lists, index 0 is the first element, negative indices count from the end, and slicing extracts sub-vectors.
This becomes important when transformers process sequences. The vector at position 0 might represent "the", position 1 represents "cat", position 2 represents "sat". Accessing the right elements matters.
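A minimal sketch of that idea, using made-up three-dimensional vectors:

```python
import numpy as np

# A toy three-word sequence; the values are invented for illustration
sentence = [
    np.array([0.23, -0.51, 0.82]),   # position 0: "the"
    np.array([0.15, -0.33, 0.47]),   # position 1: "cat"
    np.array([-0.08, 0.61, 0.29]),   # position 2: "sat"
]

print(sentence[1])  # the vector at position 1, representing "cat"
```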
Vector Shape
NumPy vectors have a shape attribute that tells you their dimensions:
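```python
import numpy as np

v1 = np.array([2.5, -1.0, 3.7])
v2 = np.array([0.2, 0.8, -0.5, 1.2, 0.0])

print(v1.shape)  # (3,)
print(v2.shape)  # (5,)
```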
The shape is shown as a tuple (n,) where n is the number of elements. This tuple format becomes important when working with matrices (2D arrays), which we'll cover in a later article.
Understanding shapes helps when debugging transformer code. If you expect a 768-dimensional vector but receive a 512-dimensional one, the mismatch will surface as errors in later operations.
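One simple way to catch such mismatches early is an explicit check. The helper below, check_dim, is a hypothetical sketch, not part of NumPy:

```python
import numpy as np

def check_dim(v, expected):
    """Raise a clear error if a vector's dimension is not what we expect."""
    if v.shape != (expected,):
        raise ValueError(f"expected shape ({expected},), got {v.shape}")

check_dim(np.array([0.2, 0.8, -0.5]), 3)  # passes silently
# check_dim(np.array([0.2, 0.8, -0.5]), 768) would raise:
# ValueError: expected shape (768,), got (3,)
```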
How Transformers Use Vectors
Transformers represent words as high-dimensional vectors. Instead of treating "cat" as a string of characters, the model uses a 768-dimensional vector like this (the numbers shown are invented; real values are learned during training):
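```
"cat" → [0.23, -0.51, 0.82, 0.15, -0.33, ...]   # 768 numbers in total (illustrative values)
```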
These vectors capture meaning through their numerical patterns. Related words have similar vectors. For example, the vectors for "cat" and "dog" might have high similarity because both are animals. The vectors for "cat" and "car" would differ significantly.
The model learns these representations during training by processing billions of text examples. After training, words with similar meanings end up with similar vector values - not because anyone programmed those relationships, but because the patterns emerged from the data.
GPUs can process millions of these vectors per second through arithmetic operations. This speed - combined with the ability to measure similarity between vectors mathematically - enables transformers to find patterns in text that would be difficult to detect through traditional string processing.
Next, you'll learn what you can do with vectors - adding them, scaling them, and computing similarity using the dot product.