Vectors and matrices, just enough to read AI articles
In the previous article, The small set of math that makes AI articles readable, vectors were glossed over as “just a row of numbers.” If you want to read Transformer or Attention formulas a bit more seriously though, it helps to push one more step into vectors and matrices.
Same stance as before: the goal isn’t to solve anything, just to be able to read.
No determinants, no inverses, no eigenvalues here. The dot product and the matrix product — these two are enough to read most of the formulas around LLMs.
A vector is a container of features written as numbers
A quick review first.
A vector is a row of numbers. In AI it holds features of a word, image, or audio clip.
- For words: “which meanings this word sits close to”, packed into hundreds to thousands of numbers
- For images: pixel values, or features like edges and colors laid out as numbers
- For audio: strength at each time step or frequency band, laid out as numbers
If we squash things down to 3 dimensions and write casually, a word vector looks like this (real ones are much higher-dimensional).
| Word | Row of numbers |
|---|---|
| King | [0.8, 0.1, 0.9] |
| Queen | [0.8, 0.9, 0.9] |
| Apple | [0.1, 0.1, 0.0] |
Think of each column as something like “royalty-ness”, “femininity”, “is it a human”. That’s good enough for intuition.
In practice, learned vectors don’t cleanly split into dimensions you can name. But the tendency “words with similar meanings sit near each other” does emerge.
Viewed as arrows, it’s just “direction and length”
A vector can also be drawn as an arrow.
In 2D, only two things matter.
- Length = overall strength of the feature
- Direction = which combination of features
Same idea in higher dimensions. Vectors for words with close meanings point in “roughly the same direction”; unrelated words point in “totally different directions” — that’s the mental picture.
The length is called the norm. It’s often written as $\|a\|$, with vertical bars on either side.
A note for readers from Japanese high school math
In Japanese high school math, vectors are written with an arrow on top, like $\vec{a}$. But in university math, and in AI papers and textbooks, the arrow is rarely used. Instead, vectors are written in bold like $\boldsymbol{a}$, or the convention “lowercase = vector, uppercase = matrix” is used from context.
From here on, this article writes $a$ or $b$ without an arrow. When you see a plain letter, assume it refers to a vector or a matrix depending on the context.
The norm notation also changes as a pair.
- Japanese high school: $|\vec{a}|$ (one vertical bar). The arrow already signals “vector,” so one bar is enough
- University / AI: $\|a\|$ (two vertical bars). Without the arrow, two bars distinguish the norm from absolute value
Since this article drops the arrow, the norm is written with two bars, $\|a\|$.
Addition and subtraction “compose meanings”
Addition and subtraction of vectors are straightforward.

$[1, 2, 3] + [4, 5, 6] = [5, 7, 9]$

You’re just adding position by position.
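As a quick sketch (assuming NumPy, which the article itself doesn’t require), position-by-position addition and subtraction is exactly what array `+` and `-` do:

```python
import numpy as np

# Toy 3-dim vectors (made-up numbers, same spirit as the word table above)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

s = a + b  # position by position: [5.0, 7.0, 9.0]
d = b - a  # position by position: [3.0, 3.0, 3.0]
```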
Notation for general components
From here on, we’ll sometimes describe vectors symbolically instead of with specific numbers. Let’s set up that notation first.
Vectors $a$ and $b$ written with their components visible look like this.

$a = [a_1, a_2, \dots, a_n], \quad b = [b_1, b_2, \dots, b_n]$
Reading guide:
- $a_1$ is the 1st number of vector $a$, $a_2$ is the 2nd, and so on
- The small number in the bottom-right indicates “which position” — it’s not multiplication or a power
- $\dots$ just means “skip the middle”
- The final $n$ means “the number of elements in the vector”
This “subscript for position” idea carries over to matrices later. There, the subscript becomes two numbers, like $w_{ij}$, indicating “row and column”, but the idea is the same.
With this notation, addition is:

$a + b = [a_1 + b_1, a_2 + b_2, \dots, a_n + b_n]$

Same as the concrete example — adding position by position.
Moving meaning with word vectors
A famous story: word vectors sometimes satisfy relationships like this.
king - man + woman ≈ queen
Subtract “man-ness” from “king” and add “woman-ness”, and you land near “queen”. It doesn’t come out that cleanly every time in real models, but the intuition “you can move meaning by adding and subtracting vectors” is worth holding on to.
You see the same kind of thing in the image world with style transfer, where “adding a certain vector changes the style” — same root idea, adding and subtracting vectors that represent features.
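The famous relation can be replayed with hand-made toy vectors (a sketch, assuming NumPy; the dimensions are the made-up “royalty-ness”, “femininity”, “is it a human” axes from the table, and real learned embeddings only approximate this):

```python
import numpy as np

# Toy vectors along made-up axes: [royalty, femininity, human]
king  = np.array([0.8, 0.1, 0.9])
queen = np.array([0.8, 0.9, 0.9])
man   = np.array([0.1, 0.1, 0.9])
woman = np.array([0.1, 0.9, 0.9])

# Subtract "man-ness", add "woman-ness" — in this toy setup it lands on queen
result = king - man + woman
```

In a real model the equality is only approximate, which is why the article writes ≈ rather than =.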
Vector multiplication comes in several kinds
After addition and subtraction, multiplication is the next natural step. But vector multiplication isn’t a single operation — there are several kinds. The two you see most in AI articles are:
- Dot product — takes two vectors, returns one number
- Element-wise (Hadamard) product — multiplies position by position, returns a vector of the same length
There are others like the cross product, but they don’t come up much in LLM or image-generation explanations. We’ll start with the dot product — the one that appears most in Attention and similarity calculations.
The dot product is written as $a \cdot b$. Using the notation we just set up, what it does is: “multiply the first pair, multiply the second pair, … and add them all up.”

$a \cdot b = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$

Plugging in King $[0.8, 0.1, 0.9]$ and Queen $[0.8, 0.9, 0.9]$ from the table earlier:

$0.8 \times 0.8 + 0.1 \times 0.9 + 0.9 \times 0.9 = 0.64 + 0.09 + 0.81 = 1.54$
“Multiply position by position, then add them up.” That’s the whole operation.
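The same computation as a sketch in NumPy (toy vectors from the table; `np.dot` is the built-in dot product):

```python
import numpy as np

king  = np.array([0.8, 0.1, 0.9])
queen = np.array([0.8, 0.9, 0.9])

# Multiply position by position, then add them up — the whole operation
manual  = (king * queen).sum()
builtin = np.dot(king, queen)   # same thing: 0.64 + 0.09 + 0.81 = 1.54
```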
Now let’s put it next to the “weighted sum” from the previous article.

$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$

The middle part, $w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, has exactly the same shape as the dot product. Rewriting the previous article’s computation in vector language: it’s “the dot product of the weight vector $w$ and the input vector $x$”.
The meaning: the dot product measures “how close in direction” two vectors are.
- Close direction → dot product is a large positive
- Perpendicular (unrelated) → dot product is near 0
- Opposite → dot product is negative
Why does “multiply components and sum” produce a direction comparison? Thinking component by component makes it click.
At the component level, the story is simple. When two vectors have components pointing the same way, multiplying those components gives a positive number. When components point opposite ways, the product is negative. Summing everything up: if the directions are close, positives accumulate and the result is large; if perpendicular, positives and negatives cancel to 0; if opposite, negatives accumulate and the result is negative.
So what each neuron in a neural network is doing can be rephrased as “measuring how closely the input aligns with the weight vector.”
The “length and angle” form
The dot product can also be written in a geometric form.

$a \cdot b = \|a\| \, \|b\| \cos\theta$

- $\|a\|$ and $\|b\|$ are the lengths of each vector
- $\theta$ is the angle between them
This article doesn’t go into trigonometry, so this form won’t show up much later. But knowing it helps the “direction closeness” story feel more grounded.
- Same direction (angle 0) → $\cos 0 = 1$ → dot product is the product of the lengths, maximum
- Perpendicular (angle 90°) → $\cos 90° = 0$ → dot product is 0
- Opposite (angle 180°) → $\cos 180° = -1$ → dot product is negative, minimum
Divide both sides by the product of lengths:

$\dfrac{a \cdot b}{\|a\| \, \|b\|} = \cos\theta$

“The dot product divided by the two lengths” is exactly the $\cos$ of the angle between the vectors.
That’s where the name cosine similarity comes from.
Cosine similarity and embedding search
Cosine similarity in search and retrieval is exactly this $\cos\theta$.
Dividing the dot product by the two lengths removes the size effect and boils “direction closeness” down to a single number.
Vector databases and embedding search are, at their core, doing this “direction closeness” calculation at scale.
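A minimal sketch of that calculation (assuming NumPy; `cosine_similarity` here is a hypothetical helper, not a library function):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the two lengths (norms)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.8, 0.1, 0.9])
queen = np.array([0.8, 0.9, 0.9])
apple = np.array([0.1, 0.1, 0.0])

# king points in roughly the same direction as queen, less so as apple
close = cosine_similarity(king, queen)
far   = cosine_similarity(king, apple)
```

Embedding search libraries do essentially this comparison, just vectorized over millions of rows.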
Row or column — the name changes with how you write it
So far we’ve been writing vectors horizontally, like $[a_1, a_2, \dots, a_n]$. Books and AI papers often write them vertically.
Quick note on bracket style before we go further. There are two common ways to wrap a matrix or column vector: parentheses ( ) and brackets [ ]. Japanese high school math used parentheses, while programming, Western textbooks, and AI papers tend to use brackets. Either means the same thing, but this article uses brackets from here on.
The vertical form is called a column vector; the horizontal form is called a row vector.

$\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$ (column vector), $\quad [a_1, a_2, \dots, a_n]$ (row vector)

Same content, different orientation.
AI literature leans toward column vectors.
Hand calculations are easier when both are vertical
Personal tip: when calculating a dot product by hand, writing both as column vectors side by side is easier to track.
With both written vertically side by side, corresponding components like $a_1$ and $b_1$ end up on the same row. All that’s left is “multiply and add” row by row. With the horizontal form, corresponding components sit far apart — a little harder to track visually.
A small spoiler: a column vector is just a “matrix with 1 column”, and a row vector is a “matrix with 1 row”. The matrix in the next section is simply this shape extended in both directions.
A matrix is “a device for transforming vectors in bulk”
Now into matrices. A matrix is the column/row vector from before, extended in both directions.
This is a $3 \times 2$ matrix (3 rows, 2 columns).

$\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix}$
Which one is horizontal again?
In Japanese, the shape of the kanji gives a hint — 行 contains horizontal strokes, 列 contains vertical strokes — so 行 is rows and 列 is columns. That trick doesn’t carry over to English.
Personal mnemonic: the r in row has a stroke extending sideways → row is horizontal. The l in column is a vertical bar → column is vertical.
I mix up flex-direction: row and column in HTML/CSS often enough that this rescues me.
No idea if native speakers actually use this, so treat it as a personal life hack.
A matrix transforms vectors in bulk
What do we use a matrix for? Mostly: to transform a vector into another vector, in bulk.
$y = Wx$

Both $x$ and $y$ are vectors; $W$ is the matrix between them. That single line looks dry, but expanded out, all it’s doing is bundling several dot products together.
For a $3 \times 2$ matrix $W$ times a 2-dim column vector $x$:

$Wx = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 \\ w_{21} x_1 + w_{22} x_2 \\ w_{31} x_1 + w_{32} x_2 \end{bmatrix}$

Read each row of $W$ as a single vector — its dot product with $x$ becomes one component of $y$.

- Row 1 · $x$ → 1st component of $y$: $y_1 = w_{11} x_1 + w_{12} x_2$
- Row 2 · $x$ → 2nd component of $y$: $y_2 = w_{21} x_1 + w_{22} x_2$
- Row 3 · $x$ → 3rd component of $y$: $y_3 = w_{31} x_1 + w_{32} x_2$
With specific numbers:

$\begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \times 2 + 0 \times 1 \\ 2 \times 2 + 1 \times 1 \\ 0 \times 2 + 3 \times 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \\ 3 \end{bmatrix}$

A 3-row, 2-column matrix times a 2-dim vector produces a 3-dim vector.
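In NumPy this is the `@` operator (a sketch; the concrete numbers are made up for illustration):

```python
import numpy as np

W = np.array([[1, 0],
              [2, 1],
              [0, 3]])   # 3 rows, 2 columns
x = np.array([2, 1])     # 2-dim vector

y = W @ x                # 3-dim result: [2, 5, 3]

# Row by row, the same thing: each row of W dotted with x
y_manual = np.array([np.dot(row, x) for row in W])
```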
If sizes don’t match, you can’t multiply
Natural question: what happens if the sizes don’t line up? Simple answer: the multiplication just isn’t defined. For $AB$ to work, the number of columns of the left operand has to equal the number of rows of the right.
- $3 \times 2$ matrix × 2-dim column vector → 3-dim column vector (works)
- $3 \times 2$ matrix × 3-dim column vector → doesn’t work
- $2 \times 3$ matrix × $3 \times 4$ matrix → $2 \times 4$ matrix (works)
- $2 \times 3$ matrix × $2 \times 3$ matrix → doesn’t work
When shapes don’t match, you either transpose to line them up, or rethink how the matrix/vector was built.
Shape mismatch errors in PyTorch or NumPy are almost always this: “left’s columns ≠ right’s rows.”
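A quick sketch of the rule in NumPy (assumed example shapes; NumPy raises a `ValueError` when the inner sizes disagree):

```python
import numpy as np

W  = np.ones((3, 2))       # 3×2 matrix
x2 = np.ones(2)            # 2-dim vector
x3 = np.ones(3)            # 3-dim vector

y = W @ x2                 # works: left's columns (2) == right's rows (2)

raised = False
try:
    W @ x3                 # left's columns (2) != right's rows (3)
except ValueError:
    raised = True          # NumPy refuses — the product isn't defined
```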
So a single layer of a neural network is roughly this shape.

$y = Wx + b$

The weighted sum from the previous article, written for a whole layer of neurons at once instead of one neuron at a time — that’s really all it is.
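A sketch of one layer (assuming NumPy; the weights and bias are made-up numbers, in practice they’re learned):

```python
import numpy as np

# Each row of W is one neuron's weight vector; b is one bias per neuron
W = np.array([[0.5, -0.2],
              [0.1,  0.8],
              [0.3,  0.3]])        # 3 neurons, 2 inputs each
b = np.array([0.1, 0.0, -0.1])
x = np.array([1.0, 2.0])           # one input vector

y = W @ x + b                      # three weighted sums at once
```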
Matrix × matrix = “composition of transformations”
Neural networks stack many layers. Written as a formula:

$y = W_3(W_2(W_1 x))$

The parentheses are deep and ugly, but it’s just transforming in sequence.

- $W_1$ transforms $x$ once
- $W_2$ transforms the result again
- $W_3$ transforms one more time

Ignoring activations, it’s just several matrices multiplied in a row: $W_3 W_2 W_1 x$.
That’s where the matrix product of two matrices shows up.
```mermaid
flowchart LR
    A[Input vector x] -->|Apply W1| B[Intermediate vector]
    B -->|Apply W2| C[Next intermediate vector]
    C -->|Apply W3| D[Output vector y]
```
You can compute a matrix product by hand — “row 1 · column 1 dot product, row 1 · column 2 dot product, …” — but you don’t need to learn that. “Multiplying two matrices gives you a single combined transformation” is a good-enough read.
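That “combined transformation” reading can be checked directly (a sketch, assuming NumPy; random matrices stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # 3-dim in, 4-dim out
W2 = rng.standard_normal((5, 4))   # 4-dim in, 5-dim out
x  = rng.standard_normal(3)

step_by_step = W2 @ (W1 @ x)       # transform twice, in sequence
combined     = (W2 @ W1) @ x       # one pre-multiplied 5×3 transformation
```

Both paths land on the same vector — the matrix product really is composition of transformations.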
Swapping the order changes the answer
Another thing to watch: matrix multiplication doesn’t commute — swapping operand order changes the result. For plain numbers, $ab = ba$ holds. For matrices, $AB$ and $BA$ aren’t generally equal.
- Sometimes only one side is even defined because of size constraints
- Even when both are defined, the values match only in special cases
Geometrically: “rotate 90° then scale” and “scale then rotate 90°” end up different because the order of the transformations matters.
In AI formulas too, $W_2 W_1 x$ depends on the order of the layers, so the order of the matrices shouldn’t be swapped casually.
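The rotate-vs-scale example can be played out with 2×2 matrices (a sketch, assuming NumPy):

```python
import numpy as np

R = np.array([[0, -1],
              [1,  0]])   # rotate 90° counterclockwise
S = np.array([[2, 0],
              [0, 1]])    # scale x by 2, leave y alone

p = np.array([1, 0])

rotate_after_scale = R @ (S @ p)   # scale first, then rotate → [0, 2]
scale_after_rotate = S @ (R @ p)   # rotate first, then scale → [0, 1]
```

Same two transformations, different order, different landing point.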
Transpose is just “flip rows and columns”
Another symbol you’ll see a lot in AI articles is transpose. Written as $A^T$ or $A^\top$, with a T on the shoulder.
The operation is literally “flip rows and columns.”
Why does it come up? When multiplying vectors and matrices, the row/column sizes have to line up. When they don’t, transpose fixes the orientation — that’s most of what it’s used for.
The $a^T b$ form of the dot product
The most common use of transpose in AI articles is writing the dot product of column vectors $a$ and $b$ like this:

$a \cdot b = a^T b$

Transposing one column vector (vertical) into a row vector (horizontal) turns the product into “$1 \times n$ times $n \times 1$”, which the matrix-product rule handles naturally. Strictly, this is a matrix product whose result is a $1 \times 1$ matrix, but that’s a scalar (one number) and equals $a \cdot b$. When you see $a^T b$ in an AI article, read it as “the dot product, written in matrix-product form.”
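In NumPy the two readings line up like this (a sketch with made-up numbers; `.T` is NumPy’s transpose):

```python
import numpy as np

# Column vectors as (n, 1) matrices
a = np.array([[1.0], [2.0], [3.0]])
b = np.array([[4.0], [5.0], [6.0]])

# a^T b: (1×3) times (3×1) → a 1×1 matrix holding the dot product
atb = a.T @ b                        # [[32.]]

# The plain dot product of the flattened vectors gives the same number
dot = np.dot(a.ravel(), b.ravel())   # 32.0
```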
With all this, the Attention formula becomes readable
With these tools, the most common LLM formula starts to read.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right) V$

$Q$, $K$, and $V$ look intimidating at first. They’re three matrices the Transformer builds internally, each with a specific role.
| Symbol | Role |
|---|---|
| $Q$ (Query) | What the word currently being attended to is looking for in other words |
| $K$ (Key) | A label each word advertises — “I’m this kind of place” |
| $V$ (Value) | The actual content each word wants to pass to the next layer |
Library analogy: $Q$ is the search query, $K$ is each book’s spine/title label, $V$ is the book’s contents. All three come from the same input (word embeddings) with different weight matrices applied — three views of the same thing.
With that setup, the rest of the formula reads as:
- $QK^T$ computes the similarity (dot product) between all queries and all keys in one shot
- Dividing by $\sqrt{d_k}$ keeps the values from getting too large
- $\mathrm{softmax}$ turns similarities into probability-like weights
- Multiplying by $V$ at the end is a weighted average of the contents with those weights
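The steps above can be sketched in a few lines (assuming NumPy; `attention` is a hypothetical helper for illustration, not a real library call, and real implementations add masking, batching, and multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # all query–key dot products, scaled
    weights = np.exp(scores)            # softmax over each row...
    weights /= weights.sum(axis=-1, keepdims=True)  # ...rows now sum to 1
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 tokens, 8-dim each
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out = attention(Q, K, V)          # one blended 8-dim vector per token
```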
What’s a “weighted average”?
“Weighted average” might not be a common phrase for everyone, so a quick recap. A plain (arithmetic) mean adds all values and divides by the count — every value counts equally. A weighted average attaches a “weight” to each value, so heavier weights pull harder on the result.
Example: three test scores, 70, 80, 90, and the final exam (90) counts double.
- Arithmetic mean: $(70 + 80 + 90) \div 3 = 80$
- Weighted average (final doubled): $(70 + 80 + 90 \times 2) \div (1 + 1 + 2) = 330 \div 4 = 82.5$
Same scores, different result depending on the weights. If all weights are equal, the weighted average collapses back to the arithmetic mean.
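The test-score example, as a sketch (assuming NumPy):

```python
import numpy as np

scores = np.array([70, 80, 90])

# Arithmetic mean: every score weighs the same
plain = scores.mean()                                 # 80.0

# Weighted average: the final (90) counts double
weights = np.array([1, 1, 2])
weighted = (scores * weights).sum() / weights.sum()   # 82.5
```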
In Attention, the weights from softmax (which sum to 1) are applied to each word’s $V$ and added up. Words with higher attention contribute more, words with lower attention contribute less — that’s the “blend” coming out.
With the softmax from the previous article, plus the dot product and matrix product from this one, you can read Attention as “decide attention by query–key dot product, then pick up values with those weights.”
You don’t need the fine-grained behavior; just being able to scan the shape of the formula and see what’s being bundled together makes papers and release notes much less painful.
Skippable topics
At entry level, the following can stay on the skip list.
| Term | What it is | Why you can skip |
|---|---|---|
| Determinant | Indicator of how much a matrix scales space | Barely shows up in LLM explanations |
| Inverse matrix | The matrix that undoes another | You might see pseudo-inverses in AI articles, but they’re rare |
| Eigenvalues / eigenvectors | Values describing a matrix’s “habits” | Important for PCA and physics; rarely needed for LLM articles |
| Orthogonal matrix | A matrix that rotates (or reflects) without stretching | When it appears, read it as “a rotation-like transformation” |
| Zero matrix | A matrix of all zeros | Know the definition; rarely needs deeper treatment at entry level |
| Identity matrix | Diagonal is 1, rest is 0. Multiplying by it leaves things unchanged | Shows up in papers (residual connections etc.); read as “the one that doesn’t change anything” |
Look them up when you need to. No need to grab them all up front.
What you can read now
With just this article’s scope, you can already outline most equations you’ll see in AI articles.
| Common shape | Reading |
|---|---|
| $y = Wx + b$ | Transform vector $x$ with matrix $W$ and add a bias |
| $a \cdot b$ or $a^T b$ | Measure how close in direction two vectors are |
| $QK^T$ | Bundle similarity calculations between many queries and keys |
| $W_1, W_2, \dots$ | Weight matrices per layer |
| $\|a\|$ | Length of a vector |
When each formula reads less like “something I have to compute” and more like “a note about what’s being bundled together,” you’re already halfway there.
Related reads
- The small set of math that makes AI articles readable Weighted sums, sigmoid, softmax, training flow. This article continues from there.
- Building an OCR typo detector with an encoder model + local LLM A practical example of building probabilities from per-position vectors.
- Trying RPG-parameter extraction from character images with a local vision LLM Multimodal example of treating images and text in the same vector space.
- MoonshotAI (Kimi) proposes AttnRes — replacing residual connections with attention in Transformers, 1.25× compute efficiency A look at how and softmax are actually used in research.
Glossary (feel free to skip)
| Term | Meaning |
|---|---|
| Scalar | A single plain number. Not a 1-dim vector — a raw value |
| Vector | A row of numbers. In AI, used as a container for features |
| Matrix | A table of numbers. A device that transforms a vector into another vector, in bulk |
| Dimension | How many numbers in a vector. In LLM contexts you’ll see things like “4096-dim embedding” |
| Norm | Length of a vector. Written as $\|a\|$ |
| Dot product | How close in direction two vectors are. Written as $a \cdot b$ or $a^T b$ |
| Cosine similarity | Dot product divided by the product of lengths. For comparing direction only |
| Matrix product | Multiplication of two matrices. Corresponds to composition of transformations |
| Transpose | Flip rows and columns of a matrix. Written as $A^T$ or $A^\top$ |
| Embedding | A vector representation of a word or image. Similar meanings end up close in vector space |
Next up, I’m planning to head into probability and statistics — wrapping the likelihood and cross-entropy that sit behind softmax in the same “just-readable” style.