Vector Operations: Addition, Scaling, and Dot Product
Three Operations, Billions of Times
Transformers perform three basic operations on vectors billions of times: addition, scaling (multiplying by a number), and dot product. That's it. These three simple operations combine to create the complex behavior you see in ChatGPT.
Understanding these operations is essential because they appear everywhere in transformers: residual connections add vectors, attention scores get scaled, and similarity between word vectors comes from dot products. Every major component uses these building blocks.
The operations are simple arithmetic. The power comes from applying them to high-dimensional vectors (hundreds or thousands of numbers) and repeating billions of times with learned values.
Vector Addition: Combine Information
Add two vectors by adding their corresponding elements. The vectors must have the same dimension.
Element 0 of the result is element 0 of the first vector plus element 0 of the second vector. Element 1 is element 1 plus element 1. And so on.
Try it in NumPy:
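A minimal sketch (the vector values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Two vectors with the same dimension
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([10.0, 20.0, 30.0])

# Element-wise addition: result[i] = v1[i] + v2[i]
result = v1 + v2
print(result)  # [11. 22. 33.]
```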
NumPy performs element-wise addition automatically. No loops needed. This vectorized operation is extremely fast - millions of additions per second.
Transformers use addition constantly. Residual connections add the input back to the output. Positional encoding adds position information to word vectors. Attention combines information from multiple words through weighted addition.
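As a rough illustration of the residual-connection pattern (the names word_vector and layer_output are placeholders for this sketch, not real library objects):

```python
import numpy as np

# A toy "word vector" and a toy "layer output" of the same dimension
word_vector = np.array([0.5, -1.0, 2.0, 0.0])
layer_output = np.array([0.1, 0.3, -0.2, 0.4])

# Residual connection: add the original input back to the layer's output
residual_result = layer_output + word_vector
print(residual_result)  # [ 0.6 -0.7  1.8  0.4]
```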
Scalar Multiplication: Scale Magnitude
Multiply a vector by a single number (called a scalar) by multiplying every element by that number.
Every element gets multiplied by the scalar. For a positive scalar, the vector's direction stays the same and only its magnitude (length) changes; a negative scalar also flips the direction.
NumPy makes this simple:
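A minimal sketch with arbitrary example values:

```python
import numpy as np

v = np.array([2.0, -4.0, 6.0])

# Multiply every element by the scalar 0.5
scaled = 0.5 * v
print(scaled)  # [ 1. -2.  3.]

# The ratios between elements are unchanged:
# element 2 is still three times element 0
```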
Scaling appears throughout transformers. Attention mechanisms compute weights (numbers between 0 and 1) and multiply word vectors by these weights. Temperature scaling divides logits by a temperature value. Layer normalization scales values to have consistent magnitude.
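A small sketch of both uses. The logits, temperature, and weight values below are made up for illustration; real models learn or choose them differently:

```python
import numpy as np

# Temperature scaling: dividing logits by a temperature
# is scalar multiplication by 1/temperature
logits = np.array([2.0, 1.0, 0.5])
temperature = 2.0
print(logits / temperature)  # [1.   0.5  0.25]

# Attention-style weighting: scale a word vector by a weight between 0 and 1
word_vector = np.array([4.0, -2.0, 8.0])
weight = 0.25
print(weight * word_vector)  # [ 1.  -0.5  2. ]
```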
Notice that scaling preserves the relative relationships between elements. If element 0 was twice as large as element 1 before scaling, it remains twice as large after scaling.
Dot Product: Measure Similarity
The dot product multiplies corresponding elements and sums the results. It produces a single number that measures how similar two vectors are.
Multiply element 0 of the first vector by element 0 of the second. Do the same for all elements. Add up all the products. The result is a single number.
A large positive dot product means the vectors point in similar directions. A dot product near zero means they are roughly perpendicular, and a negative dot product means they point in opposing directions. (Magnitude matters too: longer vectors produce larger dot products.)
The dot product is fundamental to transformers. Attention scores come from dot products between query and key vectors. Higher dot product means the words should attend to each other more. The entire attention mechanism is built on computing and using dot products.
NumPy provides np.dot() for dot products. The shorthand @ operator also works: v1 @ v2 is equivalent to np.dot(v1, v2).
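A quick check with arbitrary example vectors:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

# Multiply corresponding elements, then sum:
# 1*4 + 2*5 + 3*6 = 32
print(np.dot(v1, v2))  # 32.0
print(v1 @ v2)         # 32.0, same result with the @ operator
```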
Combining Operations: Weighted Averaging
These three operations work together to solve practical problems. Consider combining information from multiple sources based on their relevance. The dot product measures similarity, division normalizes scores, and scalar multiplication with addition creates the final result.
Here's an example showing all three operations working together:
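This is a simplified sketch of weighted averaging, not the full attention mechanism (which normalizes with softmax rather than plain division); the query and source vectors are made-up values:

```python
import numpy as np

# A "query" vector and two "source" vectors to combine
query = np.array([1.0, 0.0, 1.0])
source_a = np.array([0.9, 0.1, 0.8])
source_b = np.array([0.0, 1.0, 0.1])

# 1. Dot products measure how relevant each source is to the query
score_a = query @ source_a   # 1.7
score_b = query @ source_b   # 0.1

# 2. Division normalizes the scores into weights that sum to 1
total = score_a + score_b
weight_a = score_a / total   # ~0.944
weight_b = score_b / total   # ~0.056

# 3. Scalar multiplication and addition combine the sources proportionally
combined = weight_a * source_a + weight_b * source_b
print(combined)  # mostly source_a, with a small contribution from source_b
```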
This pattern appears throughout transformers. Dot products measure relationships between vectors. Normalization ensures weights are comparable. Scalar multiplication and addition combine information proportionally. Later modules build on these basics to create sophisticated mechanisms, but the core operations remain these three.
Next, you'll learn about matrices - 2D arrays that let you transform vectors and process multiple vectors simultaneously. Matrices build on these three operations to enable the parallel computation that makes transformers efficient.