
Gradient descent and backprop, just enough to read AI articles

Ikesan

In the previous derivatives article, we got up to the point where the gradient \nabla L of the loss L with respect to the parameters points “in the direction L increases most steeply”. Which means moving in the opposite direction -\nabla L makes L decrease most steeply. AI training, stripped down, is just the act of repeating that move over and over.

This article walks through how that repetition is actually structured, keeping the same “read, don’t solve” stance as the previous four articles in the series.

If you can read gradient descent, SGD, Adam, backpropagation, vanishing gradients, residual connections, and learning rate schedules for what they’re doing at a high level, training logs, model cards, and the training-details section of papers become mostly readable.

The chain rule, gradients, and Jacobians from article 4 are assumed knowledge. If any notation slips your mind, jump back to article 4.

Learning is just “moving parameters to lower the loss”

A quick refresh of the big picture first.

Training a neural network is an optimization problem: minimize the loss function L(\theta) over the parameters \theta. The tools from previous articles:

  • The loss L(\theta) (like the cross-entropy from article 3) returns a scalar
  • The parameters \theta form a vector (millions to trillions of dimensions)
  • The gradient \nabla L(\theta) is a vector of the same dimensions as \theta, pointing “in the direction L increases most steeply”

From these three, “nudge \theta slightly opposite to \nabla L and L goes down” follows directly. The rest of training is stacking up that “slight nudge” tens of millions to billions of times.

The update rule θ ← θ − η∇L

The most basic “way of moving” is this formula.

\theta \leftarrow \theta - \eta \, \nabla L(\theta)

Breaking down the symbols:

  • \theta (theta): notation bundling up all the model’s parameters. Picture billions to trillions of weights packed into a single vector.
  • L(\theta): the loss at those parameters (a scalar).
  • \nabla L(\theta): the gradient vector of the loss (same shape as \theta).
  • \eta (eta): the learning rate. A small positive number that sets how large a step to take. The actual step size per update is (\eta × the size of the gradient), so a larger \eta moves a lot per step, a smaller \eta crawls. Typical values are 10^{-4} to 10^{-3}.
  • \leftarrow is assignment (replace \theta with the right-hand side).

The whole thing reads as “move \theta in the opposite direction of the gradient, by \eta times the gradient’s size.” This is called gradient descent.

Why subtract?

Some people trip over the sign. Why -\eta \nabla L and not +\eta \nabla L?

\nabla L points “in the direction of steepest increase of L”. Adding it would move \theta toward larger loss. We want smaller loss, so we move in the opposite direction; that’s why we subtract \eta \nabla L from \theta.

A minimal example

Actual LLMs have hundreds of millions of parameters, but to follow the mechanism, a single-variable version is enough. Treat L(w) = w^2, a bowl-shaped function, as the loss of a model with exactly one parameter w. The minimum is at w = 0 (loss 0), growing as w moves away: the simplest possible landscape.

Start at w = 4, set the learning rate to \eta = 0.3, and run the update w \leftarrow w - \eta \cdot L'(w) for a few steps. Differentiating L(w) = w^2 with respect to w gives L'(w) = 2w, so the update here becomes

w \leftarrow w - 0.3 \cdot 2w = 0.4w

i.e., w shrinks to 0.4 times its current value every step.

| Step | w | L = w² |
|---|---|---|
| 0 | 4.00 | 16.00 |
| 1 | 1.60 | 2.56 |
| 2 | 0.64 | 0.41 |
| 3 | 0.26 | 0.07 |
| 4 | 0.10 | 0.01 |

w gradually approaches 0, and the loss falls with it. This is the minimal unit of gradient descent.

[Figure: 1D gradient descent, steps descending from w=4 toward w=0 on the parabola L(w)=w²]
Walking down L(w)=w² from w=4 with η=0.3. The first step jumps a lot, then subsequent steps shrink as they approach the minimum.
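The table above can be reproduced in a few lines of Python (a toy sketch, not framework code):

```python
# Gradient descent on L(w) = w^2: L'(w) = 2w, so each step is
# w <- w - eta * 2w, i.e. w shrinks by a factor of (1 - 2*eta).
def gradient_descent(w, eta, steps):
    history = [(w, w * w)]          # (parameter, loss) per step
    for _ in range(steps):
        grad = 2 * w                # the derivative L'(w) = 2w
        w = w - eta * grad          # the update rule
        history.append((w, w * w))
    return history

for step, (w, loss) in enumerate(gradient_descent(4.0, 0.3, 4)):
    print(f"step {step}: w = {w:.2f}, L = {loss:.2f}")
```

Running it prints the same sequence as the table: w falls 4.0 → 1.6 → 0.64 → … while the loss collapses toward 0.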

Walking across contour lines in the opposite direction

Recall the contour-line picture of the “bowl” from article 4: at any point, the gradient points perpendicular to the contour, toward higher values. Gradient descent traverses that in reverse — stepping across contours toward the lower side, one contour at a time. If the loss landscape is bowl-shaped, we’re walking toward the bottom.

Learning rate η behavior

The learning rate \eta sets how far to step, and things break in both directions, whether it’s too big or too small.

  • Too small → barely moves per step, takes forever to reach the minimum. Direction is still correct.
  • Too large → overshoots the minimum and lands on the other side. Oscillates or diverges.
  • Just right → steadily descends each step. Fast convergence.

Concretely, starting at w = 4 on L(w) = w^2 with various \eta, after 4 steps:

| η | step 0 | step 1 | step 2 | step 3 | step 4 |
|---|---|---|---|---|---|
| 0.05 (small) | 4.00 | 3.60 | 3.24 | 2.92 | 2.62 |
| 0.3 (medium) | 4.00 | 1.60 | 0.64 | 0.26 | 0.10 |
| 0.9 (large) | 4.00 | −3.20 | 2.56 | −2.05 | 1.64 |
| 1.1 (diverging) | 4.00 | −4.80 | 5.76 | −6.91 | 8.29 |

At \eta = 0.9 it bounces back and forth but shrinks; at \eta = 1.1 it explodes.

In LLM training, the learning rate is typically in the 10^{-3} to 10^{-5} range. For the optimizer called Adam (which shows up later), around 10^{-4} (lr=1e-4) is the standard (Adam’s internals are covered in a later section).

“Shouldn’t I just crank it up?”

Tempting, but pushing η\eta higher eventually enters the diverging zone and breaks training. “As large as possible without diverging” is the basic tuning principle, and the learning rate schedules introduced later (warmup, cosine decay) come from the same “push to the edge of divergence, cleverly” philosophy.

Full-batch, SGD, mini-batch

L is usually defined as “the average over all training data”, but computing \nabla L over all the data every step is expensive. So there’s a choice about how much data to use per gradient computation.

  • Full-batch gradient descent: one gradient over the whole dataset. The gradient is “true”, but each step is slow.
  • Stochastic gradient descent (SGD): one sample at a time. Fast, but noisy.
  • Mini-batch gradient descent: tens to thousands of samples at a time. Balanced compromise.

Real LLM training is almost entirely mini-batch. Notation like batch_size=4M tokens means a single mini-batch contains 4 million tokens.

What “Stochastic” means

The S in SGD is for Stochastic — “involving randomness”. Because which samples to use each step is picked randomly, the gradient wobbles each time. That randomness is where the name comes from.

Randomness is useful

Real loss landscapes aren’t as simple as a single bowl; they’re bumpy, with lots of small valleys scattered around. This maps directly onto the “minimum vs. local minimum” distinction from high school calculus: a local minimum is lower than its immediate surroundings, but not necessarily the lowest point overall. In AI terminology, that is called a local minimum, distinct from the global minimum.

[Figure: a bumpy loss landscape with multiple local minima and one global minimum]
Real loss landscapes are bumpy, with many local minima alongside the global minimum. Gradient descent can only follow the nearest valley, so it gets stuck in shallow local minima.

Pure full-batch gradient descent gets stuck in a shallow local minimum once it falls in. The gradient noise from SGD / mini-batch gives it a chance to “accidentally” bounce back out, a useful side effect. It’s considered one of the reasons modern deep learning works well in practice.

The idea of “random jitter letting you escape barriers” has a long history in physics and classical optimization — thermal excitation of atoms into higher-energy states, or simulated annealing (cooling metal slowly so it settles into a low-energy configuration) are on the same lineage. SGD’s randomness is a direct descendant of this family.

Backpropagation is “the chain rule at scale”

Computing \nabla L requires \partial L / \partial w for each parameter w. Neural networks stack layers deeply, so the path from input to loss is a deeply nested composite function.

Recall the forward diagram from article 4:

flowchart LR
    X[Input x] --> L1[Layer 1 W1, b1]
    L1 --> L2[Layer 2 W2, b2]
    L2 --> L3[Layer 3 W3, b3]
    L3 --> L[Loss L]

After the forward pass computes the loss, gradients flow back in reverse:

flowchart RL
    L[Loss L] -->|∂L/∂y3| L3[Layer 3 W3, b3]
    L3 -->|∂L/∂y2| L2[Layer 2 W2, b2]
    L2 -->|∂L/∂y1| L1[Layer 1 W1, b1]
    L1 -->|∂L/∂x| X[Input x]

This is backpropagation (backprop for short). All it does is apply the chain rule as many times as there are layers.

Why compute in reverse?

Expanding the gradient of L with respect to the weight W_1 in layer 1 via the chain rule:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial W_1}

Working from the loss end, \partial L / \partial y_3 is the starting point. Multiplying local derivatives layer by layer as we walk back, we finally arrive at \partial L / \partial W_1.

The values computed during the forward pass (each layer’s output) can be cached and reused in the backward pass. That’s why backprop is computationally efficient.

Computation graphs and local gradients

The foundation for implementing backprop is the computation graph. Each operation in a formula becomes a node, and the data flow becomes arrows between them.

Example: L = (wx − y)²

Let’s draw the computation graph for the loss of a single-sample linear regression. w is the weight, x is the input, y is the target.

flowchart LR
    W[w] --> MUL[×]
    X[x] --> MUL
    MUL -->|u = wx| SUB[−]
    Y[y] --> SUB
    SUB -->|v = u-y| SQ[²]
    SQ -->|L = v²| OUT[L]

Each node is an operation; each arrow is a value. Each node can compute its local gradient — “how much its output changes when its input is nudged.”

| Node | Operation | Local gradients |
|---|---|---|
| Multiply | u = wx | ∂u/∂w = x,  ∂u/∂x = w |
| Subtract | v = u − y | ∂v/∂u = 1,  ∂v/∂y = −1 |
| Square | L = v² | ∂L/∂v = 2v |

Multiplying these in reverse gives w’s gradient on the loss:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial w} = 2v \cdot 1 \cdot x = 2(wx - y)x

Backprop, at its core, is just “walk the computation graph’s arrows in reverse, multiplying local gradients as you go”.
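Since the graph is tiny, the whole forward-plus-backward walk fits in a short sketch (plain Python with made-up numbers, mirroring the three nodes above):

```python
# Forward pass caches each node's output; backward pass walks the
# arrows in reverse, multiplying local gradients (the chain rule).
def forward_backward(w, x, y):
    u = w * x            # multiply node
    v = u - y            # subtract node
    L = v * v            # square node
    dL_dv = 2 * v        # local gradient of the square node
    dL_du = dL_dv * 1    # subtract node: dv/du = 1
    dL_dw = dL_du * x    # multiply node: du/dw = x
    return L, dL_dw

L, grad = forward_backward(w=2.0, x=3.0, y=5.0)
print(L, grad)   # L = (2*3-5)^2 = 1, dL/dw = 2*(2*3-5)*3 = 6
```

The returned gradient matches the closed form 2(wx − y)x derived above.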

Gradients flow as Jacobians

When the inputs and outputs are vectors, the local gradient becomes “a matrix of partial derivatives per element” — a Jacobian matrix (briefly introduced by name and shape in article 4).

For a function \vec{y} = f(\vec{x}) with input \vec{x} (length n) and output \vec{y} (length m), the Jacobian is an m \times n matrix whose (i, j) entry is \partial y_i / \partial x_j:

J = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n} \end{bmatrix}

A complete table of “how much y_i moves when x_j is nudged a bit” for all i, j.

In backprop, what was a scalar chain rule (multiplication of local gradients) becomes matrix multiplication for vector-to-vector cases. It’s the matrix-valued version of the chain rule. Chaining local Jacobians together as matrix products from back to front yields the gradient with respect to all parameters, as a vector.

In practice, computing the full m \times n Jacobian explicitly is wasteful, so implementations use the Vector-Jacobian Product (VJP), computing only “the product of the Jacobian with a vector”. This is a key piece of what autograd does under the hood. In LLMs, vectors within a layer can be several thousand dimensions and weight matrices can have millions of elements, but the principle remains the chaining of Jacobians.

autograd is “automation of backprop”

Implementing the computation graph and its local-gradient products by hand gets painful fast once layers stack up. Modern deep learning frameworks (PyTorch, JAX, TensorFlow, etc.) automate this via autograd (automatic differentiation).

A typical training step

In PyTorch:

# 1. Forward pass
output = model(x)
loss = criterion(output, y)

# 2. Gradient computation (backprop)
loss.backward()

# 3. Parameter update
optimizer.step()
optimizer.zero_grad()

When loss.backward() is called, PyTorch walks the computation graph built during the forward pass in reverse and computes gradients for each parameter automatically. The result gets written into each parameter tensor’s .grad attribute.

optimizer.step() reads those .grad values and performs the update \theta \leftarrow \theta - \eta \nabla L (in whatever form the optimizer implements). optimizer.zero_grad() clears .grad before the next step, so gradients from different steps don’t accumulate.

What autograd saves you from

autograd is sometimes described as “magic”, but what it’s doing is exactly what we just covered: build the computation graph on the forward pass → traverse it in reverse, multiplying local gradients (Jacobians) to accumulate each parameter’s gradient. A mechanical procedure, nothing more. “Magic” is a stretch; a tool that automates what would otherwise be a soul-crushing amount of manual differentiation is closer to the truth.

The payoff from not having to write it yourself:

  • No need to derive \partial L / \partial W by hand. Write the forward pass, and backprop comes along for free.
  • Works at scale — hundreds of millions to trillions of parameters, all covered by the three lines of code above.
  • New layers or loss functions work automatically, as long as they’re built from differentiable operations.

The fact that modern deep learning could scale this far is in no small part thanks to autograd.

“So I don’t need to write derivatives myself?”

Research papers still discuss derivatives in formulas, but implementations leave everything to the machine. As a user, you just need to understand the formula — the implementation only needs the forward pass. Keeping this gap in mind makes papers easier to read: when derivatives get discussed, you can mentally translate to “that’s autograd running in reverse under the hood”.

Optimizer lineage: SGD → Momentum → Adam / AdamW

So far we’ve written “once gradients are out, move with \theta \leftarrow \theta - \eta \nabla L”. The component that takes gradients and concretely updates parameters is called an optimizer. Implementation-wise, in PyTorch it’s a swappable object like torch.optim.SGD(...) or torch.optim.AdamW(...), picked separately from the model, loss, learning rate, etc. The optimizer.step() call we saw earlier is the optimizer’s method that applies the gradient-based update.

Plain gradient descent (SGD) has a lot of room for improvement, and over the years various optimizers have been proposed. Here’s the lineage up to what LLM training uses today: Adam / AdamW.

SGD (plain gradient descent)

\theta \leftarrow \theta - \eta \, g

where g = \nabla L. Each step, just look at the current gradient and move \theta. Simple, but it tends to oscillate, and efficiency suffers in curved valleys.

SGD with Momentum

Carry over the “velocity” v from the previous step: add inertia.

v \leftarrow \beta v + g, \quad \theta \leftarrow \theta - \eta \, v

\beta is the factor deciding how much of the previous velocity to carry over (called the momentum coefficient). Typical value: 0.9. Folding in momentum from past gradients cancels out small oscillations and moves faster along the major valley direction. Picture a ball rolling downhill, picking up speed.
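As a concrete sketch, here is the momentum update applied to the same toy loss L(w) = w² (the η value here is illustrative):

```python
# SGD with momentum: v <- beta*v + g, w <- w - eta*v.
# The velocity v accumulates past gradients, like a rolling ball.
def momentum_step(w, v, eta=0.1, beta=0.9):
    g = 2 * w             # gradient of L(w) = w^2
    v = beta * v + g      # blend old velocity with the new gradient
    w = w - eta * v       # move along the velocity, not the raw gradient
    return w, v

w, v = 4.0, 0.0
for _ in range(5):
    w, v = momentum_step(w, v)
    print(round(w, 3), round(v, 3))
```

Watching the printed values, w overshoots past 0 and swings back: the accumulated velocity carries it through the minimum, which is exactly the ball-rolling picture.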

Adam (Adaptive Moment Estimation)

On top of momentum’s inertia (the first moment), also track the history of the gradient’s magnitude (the second moment).

m \leftarrow \beta_1 m + (1-\beta_1) g
v \leftarrow \beta_2 v + (1-\beta_2) g^2

m is a smoothed estimate of the gradient’s direction; v is an estimate of the gradient’s size. The update uses both:

\theta \leftarrow \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon}

Where gradients are small, the denominator is also small, so the move is relatively large; where gradients are thrashing, the denominator is large, so the move is conservative. A mechanism that auto-adjusts the per-parameter step size. Adam’s name (Adaptive Moment Estimation) refers to this “adaptive to gradient history” property.

Standard values from the original paper: \beta_1 = 0.9, \beta_2 = 0.999. LLMs sometimes use \beta_2 = 0.95.
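The two moment updates and the scaled step translate to code directly (a simplified sketch following the formulas above; the published Adam also divides m and v by 1-\beta_1^t and 1-\beta_2^t for bias correction, omitted here):

```python
import math

# One Adam step: first moment m (direction), second moment v (size),
# then a per-parameter scaled move m / (sqrt(v) + eps).
def adam_step(theta, g, m, v, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    theta = theta - eta * m / (math.sqrt(v) + eps)
    return theta, m, v

theta, m, v = 4.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * theta, m, v)   # gradient of w^2 is 2w
print(theta, m, v)
```

Note how the step barely depends on the raw gradient magnitude: m and √v scale together, so the effective move stays near η.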

AdamW

A patched version of Adam that correctly integrates weight decay. Weight decay is regularization that keeps parameters from growing too large. Adam’s original formulation mixed it into the adaptive gradient update, which distorts the regularization; AdamW decouples it and applies it to the weights directly.

LLM training is almost entirely AdamW. When you see paper appendices with settings like AdamW(betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1), the betas are the \beta_1, \beta_2 coefficients above, and weight_decay is the regularization strength.

Interested in Adam’s first/second moment internals? The MegaTrain article on training a 100B-parameter LLM on a single GPU goes deeper.

Vanishing and exploding gradients

When layers get very deep (tens to hundreds), the process of multiplying gradients via the chain rule introduces new problems.

The Transformer that shows up here is the architecture underlying modern LLMs: a deep neural network built by stacking blocks of Attention and feed-forward layers tens of times or more. Attention itself is briefly touched on in article 2 on vectors and matrices as the QK^T shape; the rough image of “a device that weighs how much a word should attend to each other word via dot products, then takes a weighted sum” is enough. This article doesn’t dive deep into Transformer or Attention internals; “a type of network that stacks many layers” is the level of detail needed here.

Vanishing gradient

When each layer’s local gradient is less than 1 (e.g., 0.1 or 0.5), multiplying through N layers gives:

0.1 \times 0.1 \times \dots \times 0.1 = 0.1^N \to 0

The gradient collapses toward 0 before reaching the earlier layers; parameters there basically don’t update. Those layers stop moving, and training stalls.

sigmoid and tanh have small maximum gradients (0.25 for sigmoid, 1 for tanh), which made vanishing common in deep stacks. The workaround that spread was ReLU (Rectified Linear Unit, \max(0, x)). In math this is known as the ramp function: 0 for x < 0, x for x \ge 0. Since ReLU’s gradient stays at 1 in the positive region (no shrinking), gradients can flow through deep networks more reliably.

Exploding gradient

Conversely, when local gradients exceed 1, the product grows exponentially:

2 \times 2 \times \dots \times 2 = 2^N \to \infty

Huge gradients cause \theta to jump wildly; the loss turns into NaN and training breaks.

The remedy is gradient clipping: cap the magnitude of the gradient vector (its L2 norm, the same vector length \|x\| covered in article 2) and scale it down when it exceeds the cap. Settings like grad_clip=1.0 refer to this.
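What clipping does can be sketched in a few lines (pure Python; real frameworks ship this, e.g. PyTorch’s clip_grad_norm_):

```python
import math

# Gradient clipping by L2 norm: if ||g|| exceeds max_norm,
# rescale g so its direction is kept but its length becomes max_norm.
def clip_grad(grad, max_norm=1.0):
    norm = math.sqrt(sum(x * x for x in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [x * scale for x in grad]
    return grad

print(clip_grad([3.0, 4.0]))   # norm 5 -> rescaled down to norm 1
print(clip_grad([0.1, 0.2]))   # norm < 1 -> passed through unchanged
```

The direction of the step is preserved; only runaway magnitudes are reined in.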

Residual connections and LayerNorm are “devices that let gradients flow”

Vanishing gradients get addressed more fundamentally by designing the layers themselves to be gradient-friendly. Transformers have residual connections and LayerNorm baked in as standard.

Residual connections

Define a layer’s output y as “the layer transformation f(x) plus the input x added back in”:

y = f(x) + x

With this structure, during backprop, even if the gradient collapses going through f(x), the shortcut path via +x lets the gradient through unaltered. Gradients reach earlier layers much more reliably, even in deep networks.

Introduced by ResNet (2015) and spread across image recognition and Transformers generally. In Transformers, every Attention block and feed-forward block has a residual connection attached.

LayerNorm

Normalization that “rescales a layer’s output to mean 0, variance 1”. If the value scales within layers drift around, the gradient scales follow, so normalizing keeps things stable.

For a vector z:

\text{LN}(z) = \gamma \cdot \frac{z - \mu}{\sigma} + \beta

where \mu, \sigma are the mean and standard deviation within z, and \gamma, \beta are learnable parameters. Stable value scales lead to stable gradient scales, which lead to stable training overall.
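The formula translates to code almost one-to-one (a plain-Python sketch for a single vector, with γ and β fixed at their usual init values 1 and 0, and a tiny eps added inside the square root for numerical safety, a standard implementation detail):

```python
import math

# LayerNorm: shift z to mean 0, scale to variance 1, then apply
# the learnable gamma (scale) and beta (shift).
def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    mu = sum(z) / len(z)
    var = sum((x - mu) ** 2 for x in z) / len(z)
    sigma = math.sqrt(var + eps)
    return [gamma * (x - mu) / sigma + beta for x in z]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(x, 3) for x in out])
```

Whatever scale the input drifts to, the output always has mean 0 and variance 1 (before γ, β), which is the stabilizing effect described above.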

When Transformer articles mention LN, “post-norm”, or “pre-norm”, this is what they’re talking about.

A two-piece device

With residual connections alone, the +x terms can let value scales grow as layers stack. LayerNorm re-aligns those scales, so the two together form a device that “lets gradients through while keeping value scales in check”.

That’s why Transformers can stack 100+ layers. Variations on this theme include MoonshotAI’s AttnRes.

Learning rate schedule: warmup and cosine decay

Learning rate \eta is often left fixed, but varying it during training usually works better. In LLMs, the standard is warmup plus cosine decay.

Warmup: start small

At training onset, parameters are randomly initialized, sitting at some weird spot on the loss landscape. Jumping straight in with a high learning rate makes gradients thrash and diverge easily. The fix is warmup — ramp the learning rate from 0 up to the target over the first few percent of steps.

Example: warmup_steps=500 means the learning rate rises linearly from 0 to the target over the first 500 steps.

Cosine decay: get finer as you go

After warmup, decay the learning rate along a cosine curve. By the end of training, it drops to roughly 1% of the target.

\eta(t) = \eta_{\max} \cdot \frac{1 + \cos(\pi t / T)}{2}

where t is the number of steps since warmup ended and T is the total number of post-warmup steps. Bold at the start, fine-grained at the end: this profile speeds up convergence.

[Figure: learning rate rising from 0 during warmup, then decaying along a cosine curve]
Typical learning rate schedule. Rises from 0 during warmup, then decays along a cosine curve to near 0. Standard for LLM training.
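The schedule itself is a small function of the step count (a sketch with illustrative warmup_steps / total_steps values):

```python
import math

# Warmup + cosine decay: linear ramp from 0 over warmup_steps,
# then a cosine curve from max_lr down toward 0.
def lr_at(step, max_lr=1e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return max_lr * step / warmup_steps        # linear warmup
    t = step - warmup_steps                        # steps since warmup ended
    T = total_steps - warmup_steps                 # post-warmup horizon
    return max_lr * (1 + math.cos(math.pi * t / T)) / 2

print(lr_at(0), lr_at(500), lr_at(5_000), lr_at(10_000))
```

lr_at(0) is 0, lr_at(500) hits the target, and lr_at(10_000) has decayed to essentially zero, tracing exactly the warmup-then-decay shape described above.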

“Why not just start at the target?”

Freshly initialized networks produce outputs unrelated to the training data, so their gradients have arbitrary direction and large magnitude. Running at the target learning rate fully reflects those garbage gradients into the parameters, causing divergence. Warmup gives the gradients time to settle before turning up the heat.

The big picture of an LLM training loop

Putting all the pieces together, the actual training loop takes shape.

The flow of one step

flowchart LR
    A[Fetch batch] --> B[Forward]
    B --> C[Compute loss]
    C --> D[Backprop]
    D --> E[Grad clip]
    E --> F[optimizer.step]
    F --> G[scheduler.step]
    G --> A
  1. Fetch batch: get the next mini-batch
  2. Forward: run input through the model to get output
  3. Compute loss: cross-entropy or similar (article 3)
  4. Backprop: loss.backward() computes gradients for all parameters
  5. Grad clip: cap the gradient’s L2 norm
  6. optimizer.step(): update with Adam / etc. (\theta \leftarrow \theta - \eta \nabla L)
  7. scheduler.step(): adjust the next step’s learning rate via warmup/cosine

LLM pre-training is this loop repeated hundreds of millions to billions of times.
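All seven stages can be compressed into a runnable toy (pure Python on the 1-parameter loss L(w) = w², with made-up hyperparameters; real training does the same thing per mini-batch with an autograd framework):

```python
import math

# One full "training run": schedule -> gradient -> clip -> Adam-style
# update, repeated for total_steps. Bias correction omitted for brevity.
def train(w=4.0, total_steps=200, warmup=20, max_lr=0.1):
    m, v = 0.0, 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(total_steps):
        if step < warmup:                          # scheduler: warmup
            lr = max_lr * step / warmup
        else:                                      # scheduler: cosine decay
            t, T = step - warmup, total_steps - warmup
            lr = max_lr * (1 + math.cos(math.pi * t / T)) / 2
        g = 2 * w                                  # "backprop": L'(w) = 2w
        g = max(min(g, 1.0), -1.0)                 # grad clip (1-D norm cap)
        m = beta1 * m + (1 - beta1) * g            # optimizer: first moment
        v = beta2 * v + (1 - beta2) * g * g        # optimizer: second moment
        w = w - lr * m / (math.sqrt(v) + eps)      # optimizer.step()
    return w

print(train())   # w ends much closer to the minimum 0 than the start 4.0
```

Every line maps onto one box in the flowchart; swap in a real model, autograd, and a data loader, and this is, structurally, LLM pre-training.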

Epochs and steps

  • 1 epoch: one pass through the training data
  • 1 step: one update from one mini-batch
  • 1 run: the full training across the configured number of epochs

LLMs often stop before completing a single epoch (the data is vast enough that training has progressed sufficiently before a full pass).

The words “epoch” and “step” aren’t unique to LLMs; they appear across all generative-AI training tools: image-generation LoRA training, Stable Diffusion fine-tuning, voice-synthesis augmentation, and more. What’s running under the hood is always the same gradient descent loop, so the units that appear on screen are shared too. If you’ve run image-gen LoRA training before, re-reading something like the SeaArt LoRA practice article will click: the loss-curve reading there is literally the same thing as in this article.

When does training end?

After reading about update rules and backprop, “okay, when does it actually stop?” is a fair question.

Gradient descent’s mechanical rule is “keep moving in the direction the loss decreases”; in theory, it runs until the loss reaches 0. In practice, the model’s representational capacity is limited and training data has noise, so the loss plateaus at some non-zero value. When to stop is decided by a separate rule.

  • Step / epoch count: stop after a pre-decided number of steps or epochs. LLM pre-training almost always uses this.
  • Early stopping: halt when validation loss stops improving. Standard for fine-tuning; avoids overfitting.
  • Human judgment: watch the training log, stop manually when the loss curve flattens.

When a paper’s training details say total_steps=300000, that’s the cue for when training ends.

Distributed training is a separate topic

Real LLM training runs across many GPUs and nodes, but parallelization and communication are at a different layer. What happens within a single step is exactly what’s described here. Implementation-side discussions of the training loop are in other articles (e.g., Asynchronous RL training architectures).

So where does “the correct answer” come from?

After walking through losses, gradients, and optimizers mechanically, stepping back, “who actually supplies the ground truth?” becomes a natural question. If loss is “how wrong the model is compared to the correct answer”, no correct answer means no loss.

The answer depends on the kind of learning. Machine learning splits broadly into four categories based on how correct answers are handled:

| Learning type | Source of the correct answer | Typical example |
|---|---|---|
| Supervised learning | Human-labeled data | Image classification, spam detection, scripted dialog |
| Unsupervised learning | No answers; find structure in the data itself | Clustering, dimensionality reduction, anomaly detection |
| Self-supervised learning | Answers generated mechanically from the data itself | LLM pre-training, BERT masked language modeling |
| Reinforcement learning | Reward signals (scores) from the environment as a loose “correct” | Game AI, robot control, RLHF |

How each one works

  • Supervised learning uses explicit answers, like images tagged “cat” and “dog” by humans. The clearest form.
  • Unsupervised learning doesn’t use answers — its focus is clustering similar data, compressing high-dimensional data to low dimensions, etc. Loss is defined as “how well does the data’s own properties survive” (reconstruction error, distance preservation, etc.).
  • Self-supervised learning mechanically generates answers from the data itself. LLMs use “the next token” pulled from the text as the answer.
  • Reinforcement learning evaluates the quality of actions after the fact. Rather than a single correct answer, the model moves toward maximizing cumulative reward.

Deep learning is usable across all four

For clarity: deep learning is the name for learning methods that use deep neural networks, and it’s orthogonal to the four categories above. Any of those learning types — supervised / unsupervised / self-supervised / reinforcement — becomes “deep learning” when a deep network is used internally. In all cases, the gradient descent and backprop story in this article applies directly.

What’s special about LLMs is the setup: “there’s a vast amount of self-supervised training data” (= web text). From a deep-learning standpoint, LLMs are “one of the four categories, scaled up”.

Why LLM pre-training is special

Specifically, LLM pre-training’s special feature is that the correct answers are already embedded in the training data. Take vast amounts of text from the web and books, and quiz the model with “what’s the next token given this context?”. The answer comes out automatically by shifting the text by one position — no human labeling required. This is called self-supervised learning, and it’s one of the big reasons LLMs can train on trillions of tokens.
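The “shift by one” trick is tiny in code (toy token IDs standing in for a tokenizer’s output):

```python
# Self-supervised next-token targets: the input is the sequence,
# the "correct answer" is the same sequence shifted by one position.
tokens = [101, 7, 42, 9, 200]     # made-up token IDs
inputs = tokens[:-1]              # what the model sees
targets = tokens[1:]              # what it must predict at each position
print(list(zip(inputs, targets)))
```

Every (input, target) pair comes straight from the raw text; no human ever labels anything, which is why this scales to trillions of tokens.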

When fine-tuning or RLHF involves humans, it’s the post-pipeline of nudging the rough linguistic knowledge from pre-training into a form humans want.

Hallucinations and self-supervised learning

By the way, “self-supervised” is, literally, a structure where the subject teaches itself — meaning the data’s contents are inherited directly as the “truth”. Mistakes, fiction, outdated information, and biases mixed into web or book text all get learned. During pre-training, there’s no external party saying “that’s wrong”.

The roots of LLM hallucinations (confident plausible-sounding lies) lie in this “the teacher is the data itself” situation. Fine-tuning and RLHF can be read as “overlaying human evaluation on the ‘data-as-truth’ state, to at least somewhat distinguish fact from fiction”.

If you’re curious what actual fine-tuning work looks like, this blog has a few hands-on records.

The targets and methods vary, but since the underlying loop is always gradient descent, this article’s content applies directly.

Basic questions you might have while reading

Since this article focuses on the “mechanics” side of training, some foundational questions tend to slip by. Brief follow-ups.

What’s a neural network, again?

A neural network is, roughly, “a function that takes numbers, runs them through matrix multiplications and simple transforms (activation functions) several times, and spits out the target numbers”.

flowchart LR
    X["Input x (a row of numbers)"] --> L1["Layer 1: Wx+b → activation"]
    L1 --> L2[Layer 2]
    L2 --> L3[Layer 3]
    L3 --> Y["Output (probability distribution, numbers)"]

Each layer has a weight matrix W and a bias b; the layer computes Wx + b and passes the result through a nonlinearity (like ReLU). Stacking many layers produces a function capable of expressing complex relationships.

The parameters written as \theta all through this article are the collective W and b across all layers: the thing training adjusts.
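One such layer, written out in plain Python with tiny made-up numbers (real layers are large matrices on GPUs, but the shape of the computation is identical):

```python
# y = ReLU(W x + b): a matrix-vector product, a bias, a nonlinearity.
def relu(t):
    return max(0.0, t)

def layer(W, b, x):
    # one output per row of W: dot product with x, plus bias, then ReLU
    return [relu(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W = [[1.0, -2.0], [0.5, 0.5]]   # the weights: part of theta
b = [0.0, -1.0]                 # the biases: also part of theta
print(layer(W, b, [3.0, 1.0]))  # -> [1.0, 1.0]
```

Training adjusts exactly these W and b entries; a full network is just this function composed with itself, layer after layer.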

So is deep learning just stacking this up?

Pretty much. Deep learning refers to neural networks with a deep (= large number of) stack, distinguished from shallow nets with 2 or 3 layers. A modern LLM like a Transformer is a typical example, stacking tens to hundreds of layers.

More layers means more expressive power — it can learn mappings that simpler nets couldn’t (image → object name, text → meaning, audio → text, etc.). But just going deeper runs into vanishing/exploding gradients, which is why residual connections and LayerNorm come paired with depth — the “tricks that let gradients flow through deep nets” covered earlier.

Is this actually a brain imitation, like people say?

The name “neural network” and the layered-connected-units look really do trace back to biological neuron-inspired designs. McCulloch-Pitts (1940s) and Rosenblatt’s perceptron started here, influenced by “the brain processes information through neural firing”.

Today’s deep learning, however, has inherited the name and visual but drifted far from biological brains:

  • Brain neurons encode information in firing timing (spikes); neural networks pass continuous-valued vectors
  • Learning in the brain is localized, asynchronous chemistry; deep learning differentiates the whole system and updates every parameter in lockstep with gradients
  • There’s no evidence the brain does backprop

Rather than "imitating the brain's methods", modern deep learning is better described as a separate thing that produces functionally similar output: it began with a brain metaphor and parted ways long ago.

Why does just layering matrix multiplications work?

A single layer can only do linear transformations (straight-line mappings). But interleaving nonlinear functions (like ReLU) while stacking layers gives you the power to approximate virtually any continuous function — this is known as the universal approximation theorem. “Tell cats from dogs”, “return probabilities for the next token”, “generate images” — if an input-to-output relationship is expressible as a function, a neural network can approximate it.
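A toy instance of that expressive power: |x| bends at 0, so no single linear layer can represent it, yet a hidden layer of just two ReLU units does so exactly (the weights here are hand-picked, not learned):

```python
def relu(z):
    return max(z, 0.0)

def tiny_net(x):
    # Hidden layer: weights [1, -1]; output layer: weights [1, 1]
    # relu(x) + relu(-x) == |x| for every real x
    return relu(x) + relu(-x)

print(tiny_net(-3.0), tiny_net(2.5))  # 3.0 2.5
```

The universal approximation theorem generalizes this trick: with enough units, piecewise-linear bends can hug any continuous function.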

Training is the act of tuning that approximation to fit the data, so a trained network is “a huge function that’s memorized the input-output mapping the data revealed”.

Is training sequential? All at once? How do parameter updates happen?

Short answer: every mini-batch, all parameters move at once, by a small amount. Repeated forever.

  • A single mini-batch’s forward-backward pass computes ∇L for all parameters θ (hundreds of millions to trillions of them) in one go
  • optimizer.step() applies θ ← θ − η∇L to every parameter simultaneously
  • Because η is small, each step moves every parameter only a tiny amount
  • Repeat for millions to billions of steps

So it’s both “all at once” (every parameter moves in sync) and “sequential” (small amounts, many times). Picture every screw in a large building being turned a sliver, every screw at the same time, every step.
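The loop itself is tiny. A sketch with a one-parameter toy loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the loss, η, and step count are all invented):

```python
def grad(theta):
    # dL/dtheta for the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate
for step in range(100):                # "small amounts, many times"
    theta = theta - eta * grad(theta)  # theta <- theta - eta * gradient

print(round(theta, 4))  # → 3.0, the minimizer
```

A real training step is the same line applied to billions of parameters at once, with ∇L coming from backprop instead of a hand-written formula.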

What you can read now

With what’s here, settings seen in LLM papers, model cards, and training logs become “numbers that mean something”.

| Common setting | Reading |
| --- | --- |
| `lr = 1e-4` | The learning-rate target. A standard value for Adam-family optimizers |
| `AdamW(betas=(0.9, 0.95))` | First-moment momentum 0.9, second-moment momentum 0.95 |
| `weight_decay = 0.1` | Regularization that keeps parameters from growing too large |
| `warmup_steps = 500` | Ramp the learning rate from 0 to its target over the first 500 steps |
| cosine schedule | After warmup, decay the learning rate along a cosine curve |
| `grad_clip = 1.0` | Scale gradients down whenever their L2 norm exceeds 1 |
| `batch_size = 4M tokens` | One mini-batch is 4 million tokens |
| loss curve | The per-step trajectory of the loss |
| gradient norm | The gradient's L2 norm; a signal for detecting explosion |
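The warmup and cosine rows combine into one small function (the step counts and min_lr are invented; the shape is the standard linear-warmup plus cosine-decay schedule, not any particular library's API):

```python
import math

def lr_at(step, max_lr=1e-4, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    # Linear warmup from 0 to max_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(250))     # mid-warmup: half of max_lr
print(lr_at(500))     # warmup done: max_lr
print(lr_at(10_000))  # end of training: back near min_lr
```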

The training-details sections of papers and the W&B / TensorBoard training-log UIs should mostly make sense now.

Things you can skip

At the entry level, these are fine to ignore:

| Term | Summary | Why it's safe to skip |
| --- | --- | --- |
| Second-order optimization (Newton's method, L-BFGS) | Methods that also use the gradient's derivative (the Hessian) | Too expensive at LLM scale |
| Natural gradient | Corrects the gradient using the geometry of parameter space | Heavy to implement; not standard practice |
| Regularization (L1/L2, Dropout) | Extra terms / procedures that curb overfitting | Important, but a separate topic |
| Gradient accumulation | Accumulate gradients over several steps before updating | A trick for effectively enlarging the mini-batch |
| Mixed precision (fp16/bf16) | Lower numeric precision for speed | An implementation / hardware topic |
| Distributed training (DDP, FSDP, ZeRO) | Split parameters and work across many GPUs | Engineering layered on top of the training loop |
| Policy gradients (RL) | RL-specific gradient computation | Shows up in RLHF, but a separate topic |

The bare minimum to pin down: the update rule θ ← θ − η∇L, the chain-rule expansion of backprop, Adam's first/second moments, and the residual-connection + LayerNorm mechanism that lets gradients flow. Four things.
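Of those four, Adam's moments are the easiest to lose track of. The standard update, written out for a single parameter (the hyperparameters are the usual defaults, not taken from any particular source):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # m: first moment, a running mean of gradients (the direction)
    m = b1 * m + (1 - b1) * g
    # v: second moment, a running mean of squared gradients (the scale)
    v = b2 * v + (1 - b2) * g * g
    # Bias correction: m and v start at 0 and would be underestimates early on
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(1.0, g=4.0, m=0.0, v=0.0, t=1, lr=0.01)
print(theta)  # moved ≈ lr against the gradient: ≈ 0.99
```

Dividing the direction by the scale is why Adam takes comparable step sizes regardless of how large the raw gradients are.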


Glossary (feel free to skip)

| Term | Meaning |
| --- | --- |
| Gradient descent | The generic name for updates that lower the loss via θ ← θ − η∇L |
| Learning rate η | The step size of each update |
| SGD | Stochastic gradient descent; the mini-batch approach |
| Backpropagation (backprop) | The procedure that applies the chain rule at scale to compute gradients |
| Computation graph | The forward pass's operations drawn as connected nodes; the skeleton of backprop |
| autograd | Framework automatic differentiation (PyTorch et al.) |
| Optimizer | The component that consumes gradients and updates θ (SGD, Adam, AdamW) |
| Momentum | An improvement that adds inertia from previous gradients |
| Adam | Optimizer using first (direction) and second (scale) moments |
| AdamW | Adam with weight decay applied correctly (decoupled from the gradient update) |
| Vanishing gradient | Gradients collapsing toward 0 through the chain rule's products |
| Exploding gradient | Gradients blowing up toward ∞ through the chain rule's products |
| Gradient clipping | A technique that caps gradient magnitude |
| Residual connection | The structure y = f(x) + x; gradients flow through the shortcut |
| LayerNorm | Normalization that brings each layer's values to mean 0, variance 1 |
| Warmup | The period where the learning rate ramps from 0 to its target |
| Cosine decay | Post-warmup schedule that decays the learning rate along a cosine curve |

This closes the series, at least for now. Across articles 1-5, the “math-symbol-ish” parts of AI articles should be broadly readable.