Math for reading AI articles: the full 5-article series
Contents
When reading AI and LLM articles, formulas and symbols tend to appear without warning and stall comprehension. Over the past few days I’ve written a five-article series that organizes those symbols for reading — not for solving. This article is the index of that series: the big picture and the reading order.
Each article stands on its own, but the symbols and concepts build on earlier articles, so reading in order from 1 to 5 is the smoothest path.
All five articles
| # | Article | What it covers |
|---|---|---|
| 1 | The small set of math that makes AI articles readable | Weighted sums, sigmoid, softmax, outline of the training loop |
| 2 | Vectors and matrices, just enough to read AI articles | Dot products, matrix products, transpose, Attention’s QKᵀ |
| 3 | Probability and statistics, just enough to read AI articles | Conditional probability, variance / standard deviation, cross-entropy, perplexity, temperature |
| 4 | Derivatives, just enough to read AI articles | d/dx, e, chain rule, partial derivatives, gradient ∇, Jacobians |
| 5 | Gradient descent and backprop, just enough to read AI articles | Gradient descent, SGD/Adam, backpropagation, vanishing gradients, residual connections, learning rate schedules |
Shared stance
All five articles take the same approach.
- No rigorous derivations or proofs
- Keeps everything at a level readable with high-school math plus a little extra
- The goal is “reading the meaning of numbers in papers and training logs”, not computing them
- Readers wanting more are expected to move on to textbooks (PRML, deep-learning books, analysis textbooks) after this series
The target is the minimum floor where AI-article symbols become mostly readable.
Which one to read first
Default: 1 → 5 in order. Each article assumes concepts from the previous one, so reading top to bottom is the most efficient path.
That said, some entry points might fit particular backgrounds better.
Programmers working on AI implementations
Articles 3 (probability / statistics) and 5 (gradient descent / backprop) are the core.
These articles make the loss function (cross-entropy) and the parameter-update flow (loss.backward() → optimizer.step()) readable, connecting them back to code you already touch daily.
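The update flow behind loss.backward() → optimizer.step() can be sketched in plain Python (deliberately without PyTorch, so every step is visible). The toy loss (θ − 3)², the learning rate, and the step count are all illustrative values, not anything from the series itself:

```python
# A minimal sketch of one training loop's update flow, assuming a
# one-parameter toy loss L(theta) = (theta - 3)^2 with minimum at theta = 3.

def loss_fn(theta):
    return (theta - 3.0) ** 2          # the loss we want to minimize

def grad_fn(theta):
    return 2.0 * (theta - 3.0)         # dL/dtheta: what loss.backward() computes

theta, lr = 0.0, 0.1                   # initial parameter, learning rate
for _ in range(100):
    g = grad_fn(theta)                 # "loss.backward()": gradient of the loss
    theta = theta - lr * g             # "optimizer.step()": theta <- theta - lr * grad

print(round(theta, 4))                 # -> 3.0 (converges to the minimum)
```

Real optimizers (SGD with momentum, Adam) refine the update rule, but the loop shape — compute loss, compute gradient, step the parameters — stays the same.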
If the math foundation feels shaky, back up to articles 1 and 2; if derivatives feel rusty, article 4.
Readers who’ve forgotten most of high-school math
Reading from article 1 in order is the gentlest route. Each article starts from “remembering high-school math”, so that footing is enough to follow along. The series re-covers power functions, exponentials, derivatives, and fraction manipulations from scratch.
Readers who know ML terminology but get stuck on formulas
Articles 3 (probability / statistics) and 5 (gradient descent / backprop) first. Once you have concrete correspondences like “softmax = a device that turns scores into a probability distribution” or “cross-entropy = the training loss itself”, revisit articles 2 (matrices) and 4 (derivatives) to firm up the foundation, and formulas and terminology click together retroactively.
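Those two correspondences fit in a few lines of plain Python. The score values and the choice of target class are made up for illustration:

```python
import math

# softmax: turns raw scores ("logits") into a probability distribution.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]    # non-negative, sums to 1

scores = [2.0, 1.0, 0.1]                # illustrative model outputs
probs = softmax(scores)
print(sum(probs))                       # ~1.0: a valid distribution

# cross-entropy with a one-hot target: -log of the probability the
# model assigned to the correct class (here, class 0). Lower is better.
loss = -math.log(probs[0])
print(loss)
```

Higher probability on the correct class means a smaller −log, which is exactly why cross-entropy works as a training loss.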
Readers who want to read LLM papers
2 (Attention’s QKᵀ) → 3 (cross-entropy / perplexity / temperature) → 4 (gradient / Jacobians) → 5 (Adam / learning rate schedules) gets you to the point where a paper’s training-details section and the formulas in figures are readable. Article 1 can be a quick skim of the opening as a prerequisite refresher.
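The Attention formula from that path, Attention(Q, K, V) = softmax(QKᵀ/√d_k) V, can be sketched directly in numpy. The matrix sizes and random values here are illustrative, not from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # QK^T / sqrt(d_k): similarity scores
    weights = softmax(scores, axis=-1)        # each row is a probability distribution
    return weights @ V                        # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # (seq_len, d_k) each
out = attention(Q, K, V)
print(out.shape)                              # (3, 4): one output vector per query
```

Reading it this way, article 2 covers the QKᵀ matrix product and article 3 covers why the softmax rows behave as probability distributions.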
Readers who just want to do integrals
Unfortunately this series focuses on math for reading AI articles, so integrals barely get coverage (article 4 introduces the notation, article 5 has a brief stretch integrating acceleration over time). That said, the author once had a weirdly specific hobby of doing Japanese university-entrance-exam integrals before bed, so a separate article on that might show up someday.
Formulas the series makes readable
A map between typical formulas and which articles’ tools are needed to read them.
| Formula / symbol | Articles needed |
|---|---|
| y = Wx + b | 1 (weighted sums) + 2 (matrices) |
| softmax(x_i) = exp(x_i) / Σ_j exp(x_j) | 1 + 3 (distributions) |
| Attention(Q, K, V) = softmax(QKᵀ/√d_k) V | 2 + 3 |
| H(p, q) = −Σ p(x) log q(x) (cross-entropy) | 3 |
| PPL = exp(H(p, q)) (perplexity) | 3 |
| ∇L (gradient vector), Jacobians | 4 |
| θ ← θ − η∇L (gradient descent) | 4 + 5 |
| Layer-wise gradient flow via the chain rule (backprop) | 4 + 5 |
| AdamW(betas=(0.9, 0.95)), warmup_steps, grad_clip | 5 |
With these tools in place, training-details sections of papers, hyperparameters on model cards, and training logs on W&B / TensorBoard become readable.
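Two of the table entries that show up most often in training logs, perplexity and temperature, can be sanity-checked in a few lines. The scores and token probabilities below are illustrative values:

```python
import math

# Temperature: divide scores by T before softmax.
# T < 1 sharpens the distribution, T > 1 flattens it.
def softmax(scores, temperature=1.0):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
print(softmax(scores, temperature=0.5))  # sharper: the top class gets more mass
print(softmax(scores, temperature=2.0))  # flatter: mass spreads across classes

# Perplexity: exp of the average cross-entropy over tokens. Equivalently,
# the geometric mean of 1/p over the correct-token probabilities.
token_probs = [0.5, 0.25, 0.125]         # model's probability for each correct token
H = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(H))                       # -> 4.0 (up to float error)
```

A perplexity of 4.0 reads as "on average, the model is as uncertain as choosing uniformly among 4 tokens" — the framing article 3 uses.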
What the series doesn’t cover
Listing out what’s out of scope explicitly.
- Rigorous proofs and derivations
- Deeper optimization theory (Newton’s method, L-BFGS, Natural Gradient, etc.)
- Measure-theoretic probability
- Differential and information geometry
- Serious treatment of integration (each article only touches it briefly)
If you need more than “being able to read AI articles”, use this series as the entry point and move on to other textbooks and papers.
Related hands-on articles
Trying an implementation after the series connects the formulas to hands-on experience. This blog has a few records of actually training models.
Fine-tuning / LoRA training
- Fine-tuning LUKE/BERT on a Japanese corpus (OCR correction): a classical fine-tuning example
- Making an LLM LoRA on Mac mini M4
- LoRA creation with SeaArt
- 13 failures on Mac M1 Max, then success on RunPod
- LoRA training setup on RTX 3060 (6GB)
Deeper Transformer / Attention
- MoonshotAI (Kimi) proposes AttnRes — replacing residual connections with attention in Transformers: an application of the residual-connection idea
- A unified view of Attention Sinks and Residual Sinks: on Transformer training stability
Optimizer implementation
- MegaTrain trains a 100B-parameter LLM in full precision on a single GPU: a concrete example of implementing Adam’s first and second moments
The starting point was fundamental questions from people around me: “AI news articles and test-result explanations are hard to follow”, “what’s even the difference between AI and an LLM?”. Answering them required first handing over base knowledge for reading; this series is the organized version of that. While writing, I tried to pre-empt the spots where a question was likely to come up, so it sits closer to an entry-level guide than to private study notes.
Once all five are read, the frequency of getting stuck on math symbols in AI articles should drop noticeably. From there, each paper and implementation article can be approached directly.