
Vectors and matrices, just enough to read AI articles

Ikesan

In the previous article, The small set of math that makes AI articles readable, vectors were glossed over as “just a row of numbers.” If you want to read Transformer or Attention formulas a bit more seriously though, it helps to push one more step into vectors and matrices.

Same stance as before: the goal isn’t to solve anything, just to be able to read.

No determinants, no inverses, no eigenvalues here. The dot product and the matrix product — these two are enough to read most of the formulas around LLMs.

A vector is a container of features written as numbers

A quick review first.

A vector is a row of numbers. In AI it holds features of a word, image, or audio clip.

  • For words: “which meanings this word sits close to”, packed into hundreds to thousands of numbers
  • For images: pixel values, or features like edges and colors laid out as numbers
  • For audio: strength at each time step or frequency band, laid out as numbers

If we squash things down to 3 dimensions and write casually, a word vector looks like this (real ones are much higher-dimensional).

  • King: [0.8, 0.1, 0.9]
  • Queen: [0.8, 0.9, 0.9]
  • Apple: [0.1, 0.1, 0.0]

Think of each position as something like “royalty-ness”, “femininity”, “is it a human”. That’s good enough for intuition.

In practice, learned vectors don’t cleanly split into dimensions you can name. But the tendency “words with similar meanings sit near each other” does emerge.

Viewed as arrows, it’s just “direction and length”

A vector can also be drawn as an arrow.

[Figure: two 2D vectors, drawn as arrows from the origin to (2, 3) and to (3, 1), with x and y axes]
2D vectors can be drawn as arrows from the origin. Vectors inside AI are hundreds or thousands of dimensions, so you can't draw them, but the intuition carries over.

In 2D, only two things matter.

  • Length = overall strength of the feature
  • Direction = which combination of features

Same idea in higher dimensions. Vectors for words with close meanings point in “roughly the same direction”; unrelated words point in “totally different directions” — that’s the mental picture.

The length is called the norm. It’s often written as \lVert x \rVert, with vertical bars on either side.

A note for readers from Japanese high school math

In Japanese high school math, vectors are written with an arrow on top, like \vec{a}. But in university math, and in AI papers and textbooks, the arrow is rarely used. Instead, vectors are written in bold like \mathbf{x}, or the convention “lowercase = vector, uppercase = matrix” is read from context.

From here on, this article writes x or W without an arrow. When you see a plain letter, assume it refers to a vector or a matrix depending on the context.

The norm notation also changes as a pair.

  • Japanese high school: |\vec{a}| (one vertical bar). The arrow already signals “vector,” so one bar is enough
  • University / AI: \lVert a \rVert (two vertical bars). Without the arrow, two bars distinguish the norm from absolute value

Since this article drops the arrow, the norm is written with two bars, \lVert x \rVert.

Addition and subtraction “compose meanings”

Addition and subtraction of vectors are straightforward.

(2, 3) + (3, 1) = (5, 4)

You’re just adding position by position.

Notation for general components

From here on, we’ll sometimes describe vectors symbolically instead of with specific numbers. Let’s set up that notation first.

Vectors a and b written with their components visible look like this.

a = (a_1,\ a_2,\ \ldots,\ a_n)
b = (b_1,\ b_2,\ \ldots,\ b_n)

Reading guide:

  • a_1 is the 1st number of vector a, a_2 is the 2nd, and so on
  • The small number in the bottom-right indicates “which position” — it’s not multiplication or a power
  • \ldots just means “skip the middle”
  • The final n means “the number of elements in the vector”

This “subscript for position” idea carries over to matrices later. There, the subscript becomes two numbers, like a_{11}, indicating “row and column”, but the idea is the same.

With this notation, addition is:

a + b = (a_1 + b_1,\ a_2 + b_2,\ \ldots,\ a_n + b_n)

Same as the concrete example — adding position by position.
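If code reads easier for you than notation, the same rule fits in one line of plain Python (a toy sketch; real AI code would use NumPy-style arrays instead of lists):

```python
# Position-by-position addition, matching (2, 3) + (3, 1) = (5, 4)
a = [2, 3]
b = [3, 1]

# zip pairs up corresponding components: (a1, b1), (a2, b2), ...
a_plus_b = [x + y for x, y in zip(a, b)]
print(a_plus_b)  # [5, 4]
```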

Moving meaning with word vectors

A famous story: word vectors sometimes satisfy relationships like this.

king - man + woman ≈ queen

Subtract “man-ness” from “king” and add “woman-ness”, and you land near “queen”. It doesn’t come out that cleanly every time in real models, but the intuition “you can move meaning by adding and subtracting vectors” is worth holding on to.

You see the same kind of thing in the image world with style transfer, where “adding a certain vector changes the style” — same root idea, adding and subtracting vectors that represent features.
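The “moving meaning” trick can be mimicked with toy numbers. The vectors below are invented for illustration; a real model’s vectors are thousands of dimensions and learned, not hand-written:

```python
# Made-up 3-dim "word vectors" in the spirit of the table above
king  = [0.8, 0.1, 0.9]
queen = [0.8, 0.9, 0.9]
man   = [0.0, 0.1, 0.9]   # hypothetical
woman = [0.0, 0.9, 0.9]   # hypothetical

# king - man + woman, computed position by position
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # lands exactly on queen with these toy numbers; real models only get close
```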

Vector multiplication comes in several kinds

After addition and subtraction, multiplication is the next natural step. But vector multiplication isn’t a single operation — there are several kinds. The two you see most in AI articles are:

  • Dot product — takes two vectors, returns one number
  • Element-wise (Hadamard) product — multiplies position by position, returns a vector of the same length

There are others like the cross product, but they don’t come up much in LLM or image-generation explanations. We’ll start with the dot product — the one that appears most in Attention and similarity calculations.

The dot product is written as:

a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n

Using the notation we just set up, what it does is: “multiply the first pair, multiply the second pair, … and add them all up.”

Plugging in our vectors (2, 3) and (3, 1):

(2, 3) \cdot (3, 1) = 2 \times 3 + 3 \times 1 = 6 + 3 = 9

“Multiply position by position, then add them up.” That’s the whole operation.
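“Multiply position by position, then add them up” is equally short in code (plain Python, for illustration only):

```python
a = [2, 3]
b = [3, 1]

# Dot product: multiply corresponding components, then sum
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 9
```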

Now let’s put it next to the “weighted sum” from the previous article.

y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b

The middle part, w_1 x_1 + w_2 x_2 + \cdots, has exactly the same shape as the dot product. Rewriting the previous article’s computation in vector language: it’s “the dot product of the weight vector w and the input vector x”.

The meaning: the dot product measures “how close in direction” two vectors are.

  • Close direction → dot product is a large positive
  • Perpendicular (unrelated) → dot product is near 0
  • Opposite → dot product is negative

Why does “multiply components and sum” produce a direction comparison? A picture makes it click.

[Figure: three cases — same direction (a · b = 8), perpendicular (a · b = 0), opposite (a · b = −8) — showing how the sign of the dot product changes]
Blue is fixed at a = (2, 2); red varies through (3, 1) → (−2, 2) → (−3, −1). Closer in direction gives a larger positive; near-perpendicular gives 0; opposite gives negative.

At the component level, the story is simple. When two vectors have components pointing the same way, multiplying those components gives a positive number. When components point opposite ways, the product is negative. Summing everything up: if the directions are close, positives accumulate and the result is large; if perpendicular, positives and negatives cancel to 0; if opposite, negatives accumulate and the result is negative.

So what each neuron in a neural network is doing can be rephrased as “measuring how closely the input aligns with the weight vector.”
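That rephrasing fits in a few lines. This is only a sketch of one neuron’s weighted sum; the weights, input, and bias are made-up numbers:

```python
# One neuron = dot product of the weight vector and the input vector, plus a bias
w = [0.5, -1.0, 2.0]   # hypothetical learned weights
x = [1.0, 2.0, 0.5]    # hypothetical input features
b = 0.1                # hypothetical bias

y = sum(wi * xi for wi, xi in zip(w, x)) + b
print(y)  # one number: how strongly the input aligns with w, shifted by b
```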

The “length and angle” form

The dot product can also be written in a geometric form.

a \cdot b = \lVert a \rVert \lVert b \rVert \cos\theta

  • \lVert a \rVert and \lVert b \rVert are the lengths of each vector
  • \theta is the angle between them

This article doesn’t go into trigonometry, so this form won’t show up much later. But knowing it helps the “direction closeness” story feel more grounded.

  • Same direction (angle 0°) → \cos\theta = 1 → dot product is the product of the lengths, maximum
  • Perpendicular (angle 90°) → \cos\theta = 0 → dot product is 0
  • Opposite (angle 180°) → \cos\theta = -1 → dot product is negative, minimum

Divide both sides by the product of lengths:

\cos\theta = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}

“The dot product divided by the two lengths” is exactly \cos of the angle between the vectors. That’s where the name cosine similarity comes from.

Cosine similarity in search and retrieval is exactly this \cos\theta. Dividing the dot product by the two lengths removes the size effect and boils “direction closeness” down to a single number.

Vector databases and embedding search are, at their core, doing this “direction closeness” calculation at scale.
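Cosine similarity is short enough to write out in full. A minimal plain-Python sketch (real systems use optimized vector libraries, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two norms (lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([2, 2], [3, 3]))    # same direction: about 1
print(cosine_similarity([2, 2], [-2, 2]))   # perpendicular: 0
print(cosine_similarity([2, 2], [-3, -3]))  # opposite: about -1
```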

Row or column — the name changes with how you write it

So far we’ve been writing vectors horizontally, like (2, 3). Books and AI papers often write them vertically.

Quick note on bracket style before we go further. There are two common ways to wrap a matrix or column vector: parentheses ( ) and brackets [ ]. Japanese high school math uses parentheses, while programming, Western textbooks, and AI papers tend to use brackets. They mean the same thing; this article uses brackets from here on.

a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}

The vertical form is called a column vector; the horizontal form is called a row vector. Same content, different orientation.

AI literature leans toward column vectors.

Hand calculations are easier when both are vertical

Personal tip: when calculating a dot product by hand, writing both as column vectors side by side is easier to track.

a \cdot b = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 1 \end{bmatrix} = 2 \times 3 + 3 \times 1 = 9

With both written vertically side by side, corresponding components like a_1 \leftrightarrow b_1 and a_2 \leftrightarrow b_2 end up on the same row. All that’s left is “multiply and add” row by row. With the horizontal (2, 3) \cdot (3, 1) form, corresponding components sit far apart — a little harder to track visually.

A small spoiler: a column vector is just a “matrix with 1 column”, and a row vector is a “matrix with 1 row”. The matrix in the next section is simply this shape extended in both directions.

A matrix is “a device for transforming vectors in bulk”

Now into matrices. A matrix is the column/row vector from before, extended in both directions.

W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix}

This is a 3×2 matrix (3 rows, 2 columns).

Which one is horizontal again?

In Japanese, the shape of the kanji gives a hint — 行 contains horizontal strokes, 列 contains vertical strokes — so 行 is rows and 列 is columns. That trick doesn’t carry over to English.

Personal mnemonic: the r in row has a stroke extending sideways → row is horizontal. The l in column is a vertical bar → column is vertical. I mix up flex-direction: row and column in HTML/CSS often enough that this rescues me. No idea if native speakers actually use this, so treat it as a personal life hack.

A matrix transforms vectors in bulk

What do we use a matrix for? Mostly: to transform a vector into another vector, in bulk.

y = W x

Both y and x are vectors; W is the matrix between them. That single line looks dry, but expanded out, all it’s doing is bundling several dot products together.

For a 3×2 matrix W times a 2-dim column vector x:

W x = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 \\ w_{21} x_1 + w_{22} x_2 \\ w_{31} x_1 + w_{32} x_2 \end{bmatrix}

Read each row of W as a single vector — its dot product with x becomes one component of y.

  • Row 1 (w_{11}, w_{12}) · x → 1st component of y = w_{11} x_1 + w_{12} x_2
  • Row 2 (w_{21}, w_{22}) · x → 2nd component of y = w_{21} x_1 + w_{22} x_2
  • Row 3 (w_{31}, w_{32}) · x → 3rd component of y = w_{31} x_1 + w_{32} x_2

With specific numbers:

\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \cdot 2 + 2 \cdot 3 \\ 3 \cdot 2 + 4 \cdot 3 \\ 5 \cdot 2 + 6 \cdot 3 \end{bmatrix} = \begin{bmatrix} 8 \\ 18 \\ 28 \end{bmatrix}

A 3-row, 2-column matrix times a 2-dim vector produces a 3-dim vector.
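“Each row of W, dot product with x” translates directly into code. A plain-Python sketch of the example above (real code would call a library routine instead):

```python
W = [[1, 2],
     [3, 4],
     [5, 6]]   # 3 rows, 2 columns
x = [2, 3]     # 2-dim vector

# Each output component is the dot product of one row of W with x
y = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(y)  # [8, 18, 28]
```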

If sizes don’t match, you can’t multiply

Natural question: what happens if the sizes don’t line up? Simple answer: the multiplication just isn’t defined. For the product AB to work, the number of columns of the left operand has to equal the number of rows of the right.

  • 3×2 matrix times a 2-dim column vector → 3-dim column vector (works)
  • 3×2 matrix times a 3-dim column vector → doesn’t work
  • 3×2 matrix times a 2×4 matrix → 3×4 matrix (works)
  • 3×2 matrix times a 3×4 matrix → doesn’t work

When shapes don’t match, you either transpose to line them up, or rethink how the matrix or vector was built. Shape-mismatch errors in PyTorch or NumPy are almost always this: “left’s columns ≠ right’s rows.”
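The “left’s columns = right’s rows” rule can be made explicit with a hand-rolled matrix product (a teaching sketch; NumPy and PyTorch do this check for you):

```python
def matmul(A, B):
    """Matrix product with an explicit size check: columns of A must equal rows of B."""
    if len(A[0]) != len(B):
        raise ValueError("shape mismatch: left's columns != right's rows")
    # zip(*B) iterates over the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]        # 3x2
B = [[1, 0, 0, 1], [0, 1, 1, 0]]    # 2x4
print(matmul(A, B))                 # works: a 3x4 matrix
# matmul(B, A) would raise: 2x4 times 3x2 doesn't line up
```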

So a single layer of a neural network is roughly this shape.

y = W x + b

The weighted sum from the previous article, written for a whole layer at once instead of one input at a time — that’s really all it is.

Matrix × matrix = “composition of transformations”

Neural networks stack many layers. Written as a formula:

y = W_3 (W_2 (W_1 x + b_1) + b_2) + b_3

The parentheses are deep and ugly, but it’s just transforming in sequence.

  • W_1 transforms once
  • W_2 transforms again
  • W_3 transforms one more time

Ignoring activations, it’s just several matrices multiplied in a row. That’s where the matrix product of two matrices shows up.

input vector x → (apply W_1) → intermediate vector → (apply W_2) → next intermediate vector → (apply W_3) → output vector y

You can compute a matrix product by hand — “row 1 · column 1 dot product, row 1 · column 2 dot product, …” — but you don’t need to learn that. “Multiplying two matrices gives you a single combined transformation” is a good-enough read.
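That “combined transformation” reading can be checked numerically. A plain-Python sketch with two small 2×2 matrices (made-up example, ignoring biases and activations):

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W1 = [[0, -1], [1, 0]]   # rotate 90 degrees
W2 = [[2, 0], [0, 2]]    # scale everything by 2
x = [1, 0]

one_at_a_time = matvec(W2, matvec(W1, x))   # apply W1, then W2
combined      = matvec(matmul(W2, W1), x)   # multiply the matrices first, apply once
print(one_at_a_time, combined)  # the same vector both ways
```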

Swapping the order changes the answer

Another thing to watch: matrix multiplication doesn’t commute — swapping operand order changes the result. For plain numbers, 3 × 5 = 5 × 3 holds. For matrices, AB and BA aren’t generally equal.

  • Sometimes only one side is even defined because of size constraints
  • Even when both are defined, the values match only in special cases

Geometrically: “rotate 90° then stretch along one axis” and “stretch then rotate 90°” land in different places, because the order of the transformations matters.

In AI formulas too, W_2 (W_1 x) depends on the order of layers, so the order of the W matrices shouldn’t be swapped casually.
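The order sensitivity is easy to see with two small matrices: a rotation and a stretch along one axis (made-up numbers), multiplied both ways:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

R = [[0, -1], [1, 0]]   # rotate 90 degrees
S = [[1, 0], [0, 2]]    # stretch the y direction by 2

print(matmul(R, S))  # stretch first, then rotate
print(matmul(S, R))  # rotate first, then stretch: a different matrix
```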

Transpose is just “flip rows and columns”

Another symbol you’ll see a lot in AI articles is transpose. Written as A^T or A^\top, with a T on the shoulder.

The operation is literally “flip rows and columns.”

A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix},\quad A^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}

Why does it come up? When multiplying vectors and matrices, the row/column sizes have to line up. When they don’t, transpose fixes the orientation — that’s most of what it’s used for.

The a^T b form of the dot product

The most common use of transpose in AI articles is writing the dot product of column vectors a and b like this:

a^T b = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \end{bmatrix} = 2 \times 3 + 3 \times 1 = 9

Transposing one column vector (vertical) into a row vector (horizontal) turns the product into “1×2 times 2×1 = 1×1”, which the matrix-product rule handles naturally. Strictly, this is a matrix product, but the result is a scalar (one number) and equals a \cdot b. When you see a^T b in an AI article, read it as “the dot product, written in matrix-product form.”
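Both notations compute the same number, which a couple of lines can confirm (plain Python, toy vectors):

```python
a = [2, 3]
b = [3, 1]

# The dot product directly
dot = sum(x * y for x, y in zip(a, b))

# a^T b: the same thing as a 1x2 row times a 2x1 column
a_T = [a]                  # 1 row, 2 columns
b_col = [[v] for v in b]   # 2 rows, 1 column
as_matrix_product = sum(a_T[0][j] * b_col[j][0] for j in range(len(b)))

print(dot, as_matrix_product)  # 9 9
```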

With all this, the Attention formula becomes readable

With these tools, the most common LLM formula starts to read.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

Q, K, and V look intimidating at first. They’re three matrices the Transformer builds internally, each with a specific role.

  • Q (Query): what the word currently being attended to is looking for in other words
  • K (Key): a label each word advertises — “I’m this kind of place”
  • V (Value): the actual content each word wants to pass to the next layer

Library analogy: Q is the search query, K is each book’s spine/title label, V is the book’s contents. All three come from the same input (word embeddings) with different weight matrices applied — three views of the same thing.

With that setup, the rest of the formula reads as:

  • Q K^T computes the similarity (dot product) between all queries and all keys in one shot
  • Dividing by \sqrt{d_k} keeps the values from getting too large
  • softmax turns similarities into probability-like weights
  • Multiplying by V at the end is a weighted average of the contents with those weights

What’s a “weighted average”?

“Weighted average” might not be a common phrase for everyone, so a quick recap. A plain (arithmetic) mean adds all values and divides by the count — every value counts equally. A weighted average attaches a “weight” to each value, so heavier weights pull harder on the result.

Example: three test scores, 70, 80, 90, and the final exam (90) counts double.

  • Arithmetic mean: (70 + 80 + 90) / 3 = 80
  • Weighted average (final doubled): (70 \cdot 1 + 80 \cdot 1 + 90 \cdot 2) / (1 + 1 + 2) = 82.5

Same scores, different result depending on the weights. If all weights are equal, the weighted average collapses back to the arithmetic mean.
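The same test-score example in code, to see both averages side by side:

```python
scores  = [70, 80, 90]
weights = [1, 1, 2]   # the final exam (90) counts double

plain_avg    = sum(scores) / len(scores)
weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(plain_avg, weighted_avg)  # 80.0 82.5
```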

In Attention, the weights from softmax (which sum to 1) are applied to each word’s V and added up. Words with higher attention contribute more, words with lower attention contribute less — that’s the “blend” coming out.

With the softmax from the previous article, plus the dot product and matrix product from this one, you can read Attention as “decide attention by query–key dot product, then pick up values with those weights.”

You don’t need the fine-grained behavior; just being able to scan the shape of the formula and see what’s being bundled together makes papers and release notes much less painful.

Skippable topics

At entry level, the following can stay on the skip list.

  • Determinant: indicator of how much a matrix scales space. Barely shows up in LLM explanations
  • Inverse matrix: the matrix that undoes another. You might see pseudo-inverses in AI articles, but they’re rare
  • Eigenvalues / eigenvectors: values describing a matrix’s “habits”. Important for PCA and physics; rarely needed for LLM articles
  • Orthogonal matrix: a matrix that only rotates (or reflects). When it appears, read it as “a rotation-like transformation”
  • Zero matrix: a matrix of all zeros. Know the definition; rarely needs deeper treatment at entry level
  • Identity matrix I: diagonal is 1, rest is 0; multiplying by it leaves things unchanged. Shows up in papers (residual connections etc.); read it as “the one that doesn’t change anything”

Look them up when you need to. No need to grab them all up front.

What you can read now

With just this article’s scope, you can already outline most equations you’ll see in AI articles.

  • y = W x + b: transform vector x with matrix W and add a bias
  • a \cdot b or a^T b: measure how close in direction two vectors are
  • Q K^T: bundle similarity calculations between many queries and keys
  • W_1, W_2, W_3, \ldots: weight matrices, one per layer
  • \lVert x \rVert: length of a vector

When each formula reads less like “something I have to compute” and more like “a note about what’s being bundled together,” you’re already halfway there.


Glossary (feel free to skip)

  • Scalar: a single plain number. Not a 1-dim vector — a raw value
  • Vector: a row of numbers. In AI, used as a container for features
  • Matrix: a table of numbers. A device that transforms a vector into another vector, in bulk
  • Dimension: how many numbers in a vector. In LLM contexts you’ll see things like “4096-dim embedding”
  • Norm: length of a vector. Written as \lVert x \rVert
  • Dot product: how close in direction two vectors are. Written as a \cdot b or a^T b
  • Cosine similarity: dot product divided by the product of lengths. For comparing direction only
  • Matrix product: multiplication of two matrices. Corresponds to composition of transformations
  • Transpose: flip rows and columns of a matrix. Written as A^T or A^\top
  • Embedding: a vector representation of a word or image. Similar meanings end up close in vector space

Next up, I’m planning to head into probability and statistics — wrapping the likelihood and cross-entropy that sit behind softmax in the same “just-readable” style.