
Vectors and matrices, just enough to read AI articles

Ikesan

In the previous article, The small set of math that makes AI articles readable, vectors were glossed over as “just a row of numbers.” If you want to read Transformer or Attention formulas a bit more seriously though, it helps to push one more step into vectors and matrices.

Same stance as before: the goal isn’t to solve anything, just to be able to read.

No determinants, no inverses, no eigenvalues here. The dot product and the matrix product — these two are enough to read most of the formulas around LLMs.

A vector is a container of features written as numbers

A quick review first.

A vector is a row of numbers. In AI it holds features of a word, image, or audio clip.

  • For words: “which meanings this word sits close to”, packed into hundreds to thousands of numbers
  • For images: pixel values, or features like edges and colors laid out as numbers
  • For audio: strength at each time step or frequency band, laid out as numbers

If we squash things down to 3 dimensions and write casually, a word vector looks like this (real ones are much higher-dimensional).

  • King: [0.8, 0.1, 0.9]
  • Queen: [0.8, 0.9, 0.9]
  • Apple: [0.1, 0.1, 0.0]

Think of each position as something like “royalty-ness”, “femininity”, “is it a human”. That’s good enough for intuition.

In practice, learned vectors don’t cleanly split into dimensions you can name. But the tendency “words with similar meanings sit near each other” does emerge.

Viewed as arrows, it’s just “direction and length”

A vector can also be drawn as an arrow.

[Figure: two 2D vectors, drawn as arrows from the origin to (2, 3) and to (3, 1), with x and y axes]
2D vectors can be drawn as arrows from the origin. Vectors inside AI are hundreds or thousands of dimensions, so you can't draw them, but the intuition carries over.

In 2D, only two things matter.

  • Length = overall strength of the feature
  • Direction = which combination of features

Same idea in higher dimensions. Vectors for words with close meanings point in “roughly the same direction”; unrelated words point in “totally different directions” — that’s the mental picture.

The length is called the norm. It’s often written as \lVert x \rVert, with vertical bars on either side.

A note for readers from Japanese high school math

In Japanese high school math, vectors are written with an arrow on top, like \vec{a}. But in university math, and in AI papers and textbooks, the arrow is rarely used. Instead, vectors are written in bold like \mathbf{x}, or the convention “lowercase = vector, uppercase = matrix” is read from context.

From here on, this article writes x or W without an arrow. When you see a plain letter, assume it refers to a vector or a matrix depending on the context.

The norm notation also changes as a pair.

  • Japanese high school: |\vec{a}| (one vertical bar). The arrow already signals “vector,” so one bar is enough
  • University / AI: \lVert a \rVert (two vertical bars). Without the arrow, two bars distinguish the norm from absolute value

Since this article drops the arrow, the norm is written with two bars, \lVert x \rVert.

Addition and subtraction “compose meanings”

Addition and subtraction of vectors are straightforward.

(2, 3) + (3, 1) = (5, 4)

You’re just adding position by position.

Notation for general components

From here on, we’ll sometimes describe vectors symbolically instead of with specific numbers. Let’s set up that notation first.

Vectors a and b written with their components visible look like this.

a = (a_1,\ a_2,\ \ldots,\ a_n)
b = (b_1,\ b_2,\ \ldots,\ b_n)

Reading guide:

  • a_1 is the 1st number of vector a, a_2 is the 2nd, and so on
  • The small number in the bottom-right indicates “which position” — it’s not multiplication or a power
  • \ldots just means “skip the middle”
  • The final n means “the number of elements in the vector”

This “subscript for position” idea carries over to matrices later. There, the subscript becomes two numbers, like a_{11}, indicating “row and column”, but the idea is the same.

With this notation, addition is:

a + b = (a_1 + b_1,\ a_2 + b_2,\ \ldots,\ a_n + b_n)

Same as the concrete example — adding position by position.
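If code reads easier for you than notation, the same rule fits in one line of plain Python (a toy sketch; real AI code would use NumPy-style arrays instead of lists):

```python
# Position-by-position addition, matching (2, 3) + (3, 1) = (5, 4)
a = [2, 3]
b = [3, 1]

# zip pairs up corresponding components: (a1, b1), (a2, b2), ...
a_plus_b = [x + y for x, y in zip(a, b)]
print(a_plus_b)  # [5, 4]
```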

Moving meaning with word vectors

A famous story: word vectors sometimes satisfy relationships like this.

king - man + woman ≈ queen

Subtract “man-ness” from “king” and add “woman-ness”, and you land near “queen”. It doesn’t come out that cleanly every time in real models, but the intuition “you can move meaning by adding and subtracting vectors” is worth holding on to.

You see the same kind of thing in the image world with style transfer, where “adding a certain vector changes the style” — same root idea, adding and subtracting vectors that represent features.
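The “moving meaning” trick can be mimicked with toy numbers. The vectors below are invented for illustration; a real model’s vectors are thousands of dimensions and learned, not hand-written:

```python
# Made-up 3-dim "word vectors" in the spirit of the table above
king  = [0.8, 0.1, 0.9]
queen = [0.8, 0.9, 0.9]
man   = [0.0, 0.1, 0.9]   # hypothetical
woman = [0.0, 0.9, 0.9]   # hypothetical

# king - man + woman, computed position by position
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # lands exactly on queen with these toy numbers; real models only get close
```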

Vector multiplication comes in several kinds

After addition and subtraction, multiplication is the next natural step. But vector multiplication isn’t a single operation — there are several kinds. The two you see most in AI articles are:

  • Dot product — takes two vectors, returns one number
  • Element-wise (Hadamard) product — multiplies position by position, returns a vector of the same length

There are others like the cross product, but they don’t come up much in LLM or image-generation explanations. We’ll start with the dot product — the one that appears most in Attention and similarity calculations.

The dot product is written as:

a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

Using the notation we just set up, what it does is: “multiply the first pair, multiply the second pair, … and add them all up.”

Plugging in our vectors (2, 3) and (3, 1):

(2, 3) \cdot (3, 1) = 2 \times 3 + 3 \times 1 = 6 + 3 = 9

“Multiply position by position, then add them up.” That’s the whole operation.
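“Multiply position by position, then add them up” is equally short in code (plain Python, for illustration only):

```python
a = [2, 3]
b = [3, 1]

# Dot product: multiply corresponding components, then sum
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 9
```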

Now let’s put it next to the “weighted sum” from the previous article.

y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b

The middle part, w_1 x_1 + w_2 x_2 + \cdots, has exactly the same shape as the dot product. Rewriting the previous article’s computation in vector language: it’s “the dot product of the weight vector w and the input vector x”.

The meaning: the dot product measures “how close in direction” two vectors are.

  • Close direction → dot product is a large positive
  • Perpendicular (unrelated) → dot product is near 0
  • Opposite → dot product is negative

Why does “multiply components and sum” produce a direction comparison? A picture makes it click.

[Figure: three cases — same direction (a · b = 8), perpendicular (a · b = 0), opposite (a · b = −8) — showing how the sign of the dot product changes]
Blue is fixed at a = (2, 2); red varies through (3, 1) → (−2, 2) → (−3, −1). Closer in direction gives a larger positive; near-perpendicular gives 0; opposite gives negative.

At the component level, the story is simple. When two vectors have components pointing the same way, multiplying those components gives a positive number. When components point opposite ways, the product is negative. Summing everything up: if the directions are close, positives accumulate and the result is large; if perpendicular, positives and negatives cancel to 0; if opposite, negatives accumulate and the result is negative.

So what each neuron in a neural network is doing can be rephrased as “measuring how closely the input aligns with the weight vector.”
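That rephrasing fits in a few lines. This is only a sketch of one neuron’s weighted sum; the weights, input, and bias are made-up numbers:

```python
# One neuron = dot product of the weight vector and the input vector, plus a bias
w = [0.5, -1.0, 2.0]   # hypothetical learned weights
x = [1.0, 2.0, 0.5]    # hypothetical input features
b = 0.1                # hypothetical bias

y = sum(wi * xi for wi, xi in zip(w, x)) + b
print(y)  # one number: how strongly the input aligns with w, shifted by b
```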

The “length and angle” form

The dot product can also be written in a geometric form.

a \cdot b = \lVert a \rVert \lVert b \rVert \cos\theta

  • \lVert a \rVert and \lVert b \rVert are the lengths of each vector
  • \theta is the angle between them

This article doesn’t go into trigonometry, so this form won’t show up much later. But knowing it helps the “direction closeness” story feel more grounded.

  • Same direction (angle 0°) → \cos\theta = 1 → dot product is the product of the lengths, maximum
  • Perpendicular (angle 90°) → \cos\theta = 0 → dot product is 0
  • Opposite (angle 180°) → \cos\theta = -1 → dot product is negative, minimum

Divide both sides by the product of lengths:

\cos\theta = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}

“The dot product divided by the two lengths” is exactly \cos of the angle between the vectors. That’s where the name cosine similarity comes from.

Cosine similarity in search and retrieval is exactly this \cos\theta. Dividing the dot product by the two lengths removes the size effect and boils “direction closeness” down to a single number.

Vector databases and embedding search are, at their core, doing this “direction closeness” calculation at scale.
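Cosine similarity is short enough to write out in full. A minimal plain-Python sketch (real systems use optimized vector libraries, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two norms (lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([2, 2], [3, 3]))    # same direction: about 1
print(cosine_similarity([2, 2], [-2, 2]))   # perpendicular: 0
print(cosine_similarity([2, 2], [-3, -3]))  # opposite: about -1
```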

Row or column — the name changes with how you write it

So far we’ve been writing vectors horizontally, like (2, 3). Books and AI papers often write them vertically.

Quick note on bracket style before we go further. There are two common ways to wrap a matrix or column vector: parentheses ( ) and brackets [ ]. Japanese high school math uses parentheses, while programming, Western textbooks, and AI papers tend to use brackets. They mean the same thing; this article uses brackets from here on.

a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}

The vertical form is called a column vector; the horizontal form is called a row vector. Same content, different orientation.

AI literature leans toward column vectors.

Hand calculations are easier when both are vertical

Personal tip: when calculating a dot product by hand, writing both as column vectors side by side is easier to track.

a \cdot b = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 1 \end{bmatrix} = 2 \times 3 + 3 \times 1 = 9

With both written vertically side by side, corresponding components like a_1 \leftrightarrow b_1 and a_2 \leftrightarrow b_2 end up on the same row. All that’s left is “multiply and add” row by row. With the horizontal (2, 3) \cdot (3, 1) form, corresponding components sit far apart — a little harder to track visually.

A small spoiler: a column vector is just a “matrix with 1 column”, and a row vector is a “matrix with 1 row”. The matrix in the next section is simply this shape extended in both directions.

A matrix is “a device for transforming vectors in bulk”

Now into matrices. A matrix is the column/row vector from before, extended in both directions.

W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix}

This is a 3×2 matrix (3 rows, 2 columns).

Which one is horizontal again?

In Japanese, the shape of the kanji gives a hint — 行 contains horizontal strokes, 列 contains vertical strokes — so 行 is rows and 列 is columns. That trick doesn’t carry over to English.

Personal mnemonic: the r in row has a stroke extending sideways → row is horizontal. The l in column is a vertical bar → column is vertical. I mix up flex-direction: row and column in HTML/CSS often enough that this rescues me. No idea if native speakers actually use this, so treat it as a personal life hack.

A matrix transforms vectors in bulk

What do we use a matrix for? Mostly: to transform a vector into another vector, in bulk.

y = W x

Both y and x are vectors; W is the matrix between them. That single line looks dry, but expanded out, all it’s doing is bundling several dot products together.

For a 3×2 matrix W times a 2-dim column vector x:

W x = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 \\ w_{21} x_1 + w_{22} x_2 \\ w_{31} x_1 + w_{32} x_2 \end{bmatrix}

Read each row of W as a single vector — its dot product with x becomes one component of y.

  • Row 1 (w_{11}, w_{12}) · x → 1st component of y = w_{11} x_1 + w_{12} x_2
  • Row 2 (w_{21}, w_{22}) · x → 2nd component of y = w_{21} x_1 + w_{22} x_2
  • Row 3 (w_{31}, w_{32}) · x → 3rd component of y = w_{31} x_1 + w_{32} x_2

With specific numbers:

\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \cdot 2 + 2 \cdot 3 \\ 3 \cdot 2 + 4 \cdot 3 \\ 5 \cdot 2 + 6 \cdot 3 \end{bmatrix} = \begin{bmatrix} 8 \\ 18 \\ 28 \end{bmatrix}

A 3-row, 2-column matrix times a 2-dim vector produces a 3-dim vector.
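“Each row of W, dot product with x” translates directly into code. A plain-Python sketch of the example above (real code would call a library routine instead):

```python
W = [[1, 2],
     [3, 4],
     [5, 6]]   # 3 rows, 2 columns
x = [2, 3]     # 2-dim vector

# Each output component is the dot product of one row of W with x
y = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(y)  # [8, 18, 28]
```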

If sizes don’t match, you can’t multiply

Natural question: what happens if the sizes don’t line up? Simple answer: the multiplication just isn’t defined. For the product AB to work, the number of columns of the left operand has to equal the number of rows of the right.

  • 3×2 matrix times a 2-dim column vector → 3-dim column vector (works)
  • 3×2 matrix times a 3-dim column vector → doesn’t work
  • 3×2 matrix times a 2×4 matrix → 3×4 matrix (works)
  • 3×2 matrix times a 3×4 matrix → doesn’t work

When shapes don’t match, you either transpose to line them up, or rethink how the matrix or vector was built. Shape-mismatch errors in PyTorch or NumPy are almost always this: “left’s columns ≠ right’s rows.”
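The “left’s columns = right’s rows” rule can be made explicit with a hand-rolled matrix product (a teaching sketch; NumPy and PyTorch do this check for you):

```python
def matmul(A, B):
    """Matrix product with an explicit size check: columns of A must equal rows of B."""
    if len(A[0]) != len(B):
        raise ValueError("shape mismatch: left's columns != right's rows")
    # zip(*B) iterates over the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]        # 3x2
B = [[1, 0, 0, 1], [0, 1, 1, 0]]    # 2x4
print(matmul(A, B))                 # works: a 3x4 matrix
# matmul(B, A) would raise: 2x4 times 3x2 doesn't line up
```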

So a single layer of a neural network is roughly this shape.

y = W x + b

The weighted sum from the previous article, written for a whole layer at once instead of one input at a time — that’s really all it is.

Matrix × matrix = “composition of transformations”

Neural networks stack many layers. Written as a formula:

y = W_3 (W_2 (W_1 x + b_1) + b_2) + b_3

The parentheses are deep and ugly, but it’s just transforming in sequence.

  • W_1 transforms once
  • W_2 transforms again
  • W_3 transforms one more time

Ignoring activations, it’s just several matrices multiplied in a row. That’s where the matrix product of two matrices shows up.

input vector x → (apply W_1) → intermediate vector → (apply W_2) → next intermediate vector → (apply W_3) → output vector y

You can compute a matrix product by hand — “row 1 · column 1 dot product, row 1 · column 2 dot product, …” — but you don’t need to learn that. “Multiplying two matrices gives you a single combined transformation” is a good-enough read.
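That “combined transformation” reading can be checked numerically. A plain-Python sketch with two small 2×2 matrices (made-up example, ignoring biases and activations):

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W1 = [[0, -1], [1, 0]]   # rotate 90 degrees
W2 = [[2, 0], [0, 2]]    # scale everything by 2
x = [1, 0]

one_at_a_time = matvec(W2, matvec(W1, x))   # apply W1, then W2
combined      = matvec(matmul(W2, W1), x)   # multiply the matrices first, apply once
print(one_at_a_time, combined)  # the same vector both ways
```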

Swapping the order changes the answer

Another thing to watch: matrix multiplication doesn’t commute — swapping operand order changes the result. For plain numbers, 3 × 5 = 5 × 3 holds. For matrices, AB and BA aren’t generally equal.

  • Sometimes only one side is even defined because of size constraints
  • Even when both are defined, the values match only in special cases

Geometrically: “rotate 90° then stretch along one axis” and “stretch then rotate 90°” land in different places, because the order of the transformations matters.

In AI formulas too, W_2 (W_1 x) depends on the order of layers, so the order of the W matrices shouldn’t be swapped casually.
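The order sensitivity is easy to see with two small matrices: a rotation and a stretch along one axis (made-up numbers), multiplied both ways:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

R = [[0, -1], [1, 0]]   # rotate 90 degrees
S = [[1, 0], [0, 2]]    # stretch the y direction by 2

print(matmul(R, S))  # stretch first, then rotate
print(matmul(S, R))  # rotate first, then stretch: a different matrix
```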

Transpose is just “flip rows and columns”

Another symbol you’ll see a lot in AI articles is transpose. Written as A^T or A^\top, with a T on the shoulder.

The operation is literally “flip rows and columns.”

A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix},\quad A^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}

Why does it come up? When multiplying vectors and matrices, the row/column sizes have to line up. When they don’t, transpose fixes the orientation — that’s most of what it’s used for.

The a^T b form of the dot product

The most common use of transpose in AI articles is writing the dot product of column vectors a and b like this:

a^T b = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \end{bmatrix} = 2 \times 3 + 3 \times 1 = 9

Transposing one column vector (vertical) into a row vector (horizontal) turns the product into “1×2 times 2×1 = 1×1”, which the matrix-product rule handles naturally. Strictly, this is a matrix product, but the result is a scalar (one number) and equals a \cdot b. When you see a^T b in an AI article, read it as “the dot product, written in matrix-product form.”
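Both notations compute the same number, which a couple of lines can confirm (plain Python, toy vectors):

```python
a = [2, 3]
b = [3, 1]

# The dot product directly
dot = sum(x * y for x, y in zip(a, b))

# a^T b: the same thing as a 1x2 row times a 2x1 column
a_T = [a]                  # 1 row, 2 columns
b_col = [[v] for v in b]   # 2 rows, 1 column
as_matrix_product = sum(a_T[0][j] * b_col[j][0] for j in range(len(b)))

print(dot, as_matrix_product)  # 9 9
```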

With all this, the Attention formula becomes readable

With these tools, the most common LLM formula starts to read.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

Q, K, and V look intimidating at first. They’re three matrices the Transformer builds internally, each with a specific role.

  • Q (Query): what the word currently being attended to is looking for in other words
  • K (Key): a label each word advertises — “I’m this kind of place”
  • V (Value): the actual content each word wants to pass to the next layer

Library analogy: Q is the search query, K is each book’s spine/title label, V is the book’s contents. All three come from the same input (word embeddings) with different weight matrices applied — three views of the same thing.

With that setup, the rest of the formula reads as:

  • Q K^T computes the similarity (dot product) between all queries and all keys in one shot
  • Dividing by \sqrt{d_k} keeps the values from getting too large
  • softmax turns similarities into probability-like weights
  • Multiplying by V at the end is a weighted average of the contents with those weights

What’s a “weighted average”?

“Weighted average” might not be a common phrase for everyone, so a quick recap. A plain (arithmetic) mean adds all values and divides by the count — every value counts equally. A weighted average attaches a “weight” to each value, so heavier weights pull harder on the result.

Example: three test scores, 70, 80, 90, and the final exam (90) counts double.

  • Arithmetic mean: (70 + 80 + 90) / 3 = 80
  • Weighted average (final doubled): (70 \cdot 1 + 80 \cdot 1 + 90 \cdot 2) / (1 + 1 + 2) = 82.5

Same scores, different result depending on the weights. If all weights are equal, the weighted average collapses back to the arithmetic mean.
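The same test-score example in code, to see both averages side by side:

```python
scores  = [70, 80, 90]
weights = [1, 1, 2]   # the final exam (90) counts double

plain_avg    = sum(scores) / len(scores)
weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(plain_avg, weighted_avg)  # 80.0 82.5
```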

In Attention, the weights from softmax (which sum to 1) are applied to each word’s V and added up. Words with higher attention contribute more, words with lower attention contribute less — that’s the “blend” coming out.

With the softmax from the previous article, plus the dot product and matrix product from this one, you can read Attention as “decide attention by query–key dot product, then pick up values with those weights.”

You don’t need the fine-grained behavior; just being able to scan the shape of the formula and see what’s being bundled together makes papers and release notes much less painful.

Skippable topics

At entry level, the following can stay on the skip list.

  • Determinant: indicator of how much a matrix scales space. Barely shows up in LLM explanations
  • Inverse matrix: the matrix that undoes another. You might see pseudo-inverses in AI articles, but they’re rare
  • Eigenvalues / eigenvectors: values describing a matrix’s “habits”. Important for PCA and physics; rarely needed for LLM articles
  • Orthogonal matrix: a matrix that only rotates (or reflects). When it appears, read it as “a rotation-like transformation”
  • Zero matrix: a matrix of all zeros. Know the definition; rarely needs deeper treatment at entry level
  • Identity matrix I: diagonal is 1, rest is 0; multiplying by it leaves things unchanged. Shows up in papers (residual connections etc.); read it as “the one that doesn’t change anything”

Look them up when you need to. No need to grab them all up front.

What you can read now

With just this article’s scope, you can already outline most equations you’ll see in AI articles.

  • y = W x + b: transform vector x with matrix W and add a bias
  • a \cdot b or a^T b: measure how close in direction two vectors are
  • Q K^T: bundle similarity calculations between many queries and keys
  • W_1, W_2, W_3, \ldots: weight matrices, one per layer
  • \lVert x \rVert: length of a vector

When each formula reads less like “something I have to compute” and more like “a note about what’s being bundled together,” you’re already halfway there.


Glossary (feel free to skip)

  • Scalar: a single plain number. Not a 1-dim vector — a raw value
  • Vector: a row of numbers. In AI, used as a container for features
  • Matrix: a table of numbers. A device that transforms a vector into another vector, in bulk
  • Dimension: how many numbers in a vector. In LLM contexts you’ll see things like “4096-dim embedding”
  • Norm: length of a vector. Written as \lVert x \rVert
  • Dot product: how close in direction two vectors are. Written as a \cdot b or a^T b
  • Cosine similarity: dot product divided by the product of lengths. For comparing direction only
  • Matrix product: multiplication of two matrices. Corresponds to composition of transformations
  • Transpose: flip rows and columns of a matrix. Written as A^T or A^\top
  • Embedding: a vector representation of a word or image. Similar meanings end up close in vector space

Next up, I’m planning to head into probability and statistics — wrapping the likelihood and cross-entropy that sit behind softmax in the same “just-readable” style.