
Gradient descent and backprop, just enough to read AI articles

Ikesan

In the previous derivatives article, we got up to the point where the gradient \nabla L of the loss L with respect to the parameters points “in the direction L increases most steeply”. Which means moving in the opposite direction -\nabla L makes L decrease most steeply. AI training, stripped down, is just the act of repeating that move over and over.

This article walks through how that repetition is actually structured, keeping the same “read, don’t solve” stance as the previous four articles in the series.

If you can read gradient descent, SGD, Adam, backpropagation, vanishing gradients, residual connections, and learning rate schedules for what they’re doing at a high level, training logs, model cards, and the training-details section of papers become mostly readable.

The chain rule, gradients, and Jacobians from article 4 are assumed knowledge. If any notation slips your mind, jump back to article 4.

Learning is just “moving parameters to lower the loss”

A quick refresh of the big picture first.

Training a neural network is an optimization problem: minimize the loss function L(\theta) over the parameters \theta. The tools from previous articles:

  • The loss L(\theta) (like the cross-entropy from article 3) returns a scalar
  • The parameters \theta form a vector (millions to trillions of dimensions)
  • The gradient \nabla L(\theta) is a vector of the same dimensions as \theta, pointing “in the direction L increases most steeply”

From these three, “nudge \theta slightly opposite to \nabla L and L goes down” follows directly. The rest of training is stacking up that “slight nudge” tens of millions to billions of times.

The update rule θ ← θ − η∇L

The most basic “way of moving” is this formula.

\theta \leftarrow \theta - \eta \, \nabla L(\theta)

Breaking down the symbols:

  • \theta (theta): notation bundling up all the model’s parameters. Picture billions to trillions of weights packed into a single vector.
  • L(\theta): the loss at those parameters (a scalar).
  • \nabla L(\theta): the gradient vector of the loss (same shape as \theta).
  • \eta (eta): the learning rate. A small positive number that sets how large a step to take. The actual step size per update is (\eta × the size of the gradient), so a larger \eta moves a lot per step, a smaller \eta crawls. Typical values are 10^{-4} to 10^{-3}.
  • \leftarrow is assignment (replace \theta with the right-hand side).

The whole thing reads as “move \theta in the opposite direction of the gradient, by \eta times the gradient’s size.” This is called gradient descent.

Why subtract?

Some people trip over the sign. Why -\eta \nabla L and not +\eta \nabla L?

\nabla L points “in the direction of steepest increase of L”. Adding it would move \theta toward larger loss. We want smaller loss, so we move in the opposite direction; that’s why we subtract \eta \nabla L from \theta.

A minimal example

Actual LLMs have hundreds of millions of parameters, but to follow the mechanism, a single-variable version is enough. Treat L(w) = w^2, a bowl-shaped function, as the loss of a model with exactly one parameter w. The minimum is at w = 0 (loss 0), growing as w moves away: the simplest possible landscape.

Start at w = 4, set the learning rate to \eta = 0.3, and run the update w \leftarrow w - \eta \cdot L'(w) for a few steps. Differentiating L(w) = w^2 with respect to w gives L'(w) = 2w, so the update here becomes

w \leftarrow w - 0.3 \cdot 2w = 0.4w

i.e., w shrinks to 0.4 times its current value every step.

| Step | w | L = w² |
|---|---|---|
| 0 | 4.00 | 16.00 |
| 1 | 1.60 | 2.56 |
| 2 | 0.64 | 0.41 |
| 3 | 0.26 | 0.07 |
| 4 | 0.10 | 0.01 |

w gradually approaches 0, and the loss falls with it. This is the minimal unit of gradient descent.

[Figure: 1D gradient descent, steps descending from w=4 toward w=0 on the parabola L(w)=w²]
Walking down L(w)=w² from w=4 with η=0.3. The first step jumps a lot, then subsequent steps shrink as they approach the minimum.
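The table above can be reproduced in a few lines of Python (a toy sketch, not framework code):

```python
# Gradient descent on L(w) = w^2: L'(w) = 2w, so each step is
# w <- w - eta * 2w, i.e. w shrinks by a factor of (1 - 2*eta).
def gradient_descent(w, eta, steps):
    history = [(w, w * w)]          # (parameter, loss) per step
    for _ in range(steps):
        grad = 2 * w                # the derivative L'(w) = 2w
        w = w - eta * grad          # the update rule
        history.append((w, w * w))
    return history

for step, (w, loss) in enumerate(gradient_descent(4.0, 0.3, 4)):
    print(f"step {step}: w = {w:.2f}, L = {loss:.2f}")
```

Running it prints the same sequence as the table: w falls 4.0 → 1.6 → 0.64 → … while the loss collapses toward 0.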

Walking across contour lines in the opposite direction

Recall the contour-line picture of the “bowl” from article 4: at any point, the gradient points perpendicular to the contour, toward higher values. Gradient descent traverses that in reverse — stepping across contours toward the lower side, one contour at a time. If the loss landscape is bowl-shaped, we’re walking toward the bottom.

Learning rate η behavior

The learning rate \eta sets how far to step, and things break in both directions, whether it’s too big or too small.

  • Too small → barely moves per step, takes forever to reach the minimum. Direction is still correct.
  • Too large → overshoots the minimum and lands on the other side. Oscillates or diverges.
  • Just right → steadily descends each step. Fast convergence.

Concretely, starting at w = 4 on L(w) = w^2 with various \eta, after 4 steps:

| η | step 0 | step 1 | step 2 | step 3 | step 4 |
|---|---|---|---|---|---|
| 0.05 (small) | 4.00 | 3.60 | 3.24 | 2.92 | 2.62 |
| 0.3 (medium) | 4.00 | 1.60 | 0.64 | 0.26 | 0.10 |
| 0.9 (large) | 4.00 | −3.20 | 2.56 | −2.05 | 1.64 |
| 1.1 (diverging) | 4.00 | −4.80 | 5.76 | −6.91 | 8.29 |

At \eta = 0.9 it bounces back and forth but shrinks; at \eta = 1.1 it explodes.

In LLM training, the learning rate is typically in the 10^{-3} to 10^{-5} range. For the optimizer called Adam (which shows up later), around 10^{-4} (lr=1e-4) is the standard (Adam’s internals are covered in a later section).

“Shouldn’t I just crank it up?”

Tempting, but pushing η\eta higher eventually enters the diverging zone and breaks training. “As large as possible without diverging” is the basic tuning principle, and the learning rate schedules introduced later (warmup, cosine decay) come from the same “push to the edge of divergence, cleverly” philosophy.

Full-batch, SGD, mini-batch

L is usually defined as “the average over all training data”, but computing \nabla L over all the data every step is expensive. So there’s a choice about how much data to use per gradient computation.

  • Full-batch gradient descent: one gradient over the whole dataset. The gradient is “true”, but each step is slow.
  • Stochastic gradient descent (SGD): one sample at a time. Fast, but noisy.
  • Mini-batch gradient descent: tens to thousands of samples at a time. Balanced compromise.

Real LLM training is almost entirely mini-batch. Notation like batch_size=4M tokens means a single mini-batch contains 4 million tokens.

What “Stochastic” means

The S in SGD is for Stochastic — “involving randomness”. Because which samples to use each step is picked randomly, the gradient wobbles each time. That randomness is where the name comes from.

Randomness is useful

Real loss landscapes aren’t as simple as a single bowl; they’re bumpy, with lots of small valleys scattered around. This maps directly onto the “minimum vs. local minimum” distinction from high school calculus: a local minimum is lower than its immediate surroundings, but not necessarily the lowest point overall. In AI terminology, that is called a local minimum, distinct from the global minimum.

[Figure: a bumpy loss landscape with multiple local minima and one global minimum]
Real loss landscapes are bumpy, with many local minima alongside the global minimum. Gradient descent can only follow the nearest valley, so it gets stuck in shallow local minima.

Pure full-batch gradient descent gets stuck in a shallow local minimum once it falls in. The gradient noise from SGD / mini-batch gives it a chance to “accidentally” bounce back out, a useful side effect. It’s considered one of the reasons modern deep learning works well in practice.

The idea of “random jitter letting you escape barriers” has a long history in physics and classical optimization — thermal excitation of atoms into higher-energy states, or simulated annealing (cooling metal slowly so it settles into a low-energy configuration) are on the same lineage. SGD’s randomness is a direct descendant of this family.

Backpropagation is “the chain rule at scale”

Computing \nabla L requires \partial L / \partial w for each parameter w. Neural networks stack layers deeply, so the path from input to loss is a deeply nested composite function.

Recall the forward diagram from article 4:

flowchart LR
    X[Input x] --> L1[Layer 1 W1, b1]
    L1 --> L2[Layer 2 W2, b2]
    L2 --> L3[Layer 3 W3, b3]
    L3 --> L[Loss L]

After the forward pass computes the loss, gradients flow back in reverse:

flowchart RL
    L[Loss L] -->|∂L/∂y3| L3[Layer 3 W3, b3]
    L3 -->|∂L/∂y2| L2[Layer 2 W2, b2]
    L2 -->|∂L/∂y1| L1[Layer 1 W1, b1]
    L1 -->|∂L/∂x| X[Input x]

This is backpropagation (backprop for short). All it does is apply the chain rule as many times as there are layers.

Why compute in reverse?

Expanding the gradient of L with respect to the weight W_1 in layer 1 via the chain rule:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial y_2} \cdot \frac{\partial y_2}{\partial y_1} \cdot \frac{\partial y_1}{\partial W_1}

Working from the loss end, \partial L / \partial y_3 is the starting point. Multiplying local derivatives layer by layer as we walk back, we finally arrive at \partial L / \partial W_1.

The values computed during the forward pass (each layer’s output) can be cached and reused in the backward pass. That’s why backprop is computationally efficient.

Computation graphs and local gradients

The foundation for implementing backprop is the computation graph. Each operation in a formula becomes a node, and the data flow becomes arrows between them.

Example: L = (wx − y)²

Let’s draw the computation graph for the loss of a single-sample linear regression. w is the weight, x is the input, y is the target.

flowchart LR
    W[w] --> MUL[×]
    X[x] --> MUL
    MUL -->|u = wx| SUB[−]
    Y[y] --> SUB
    SUB -->|v = u-y| SQ[²]
    SQ -->|L = v²| OUT[L]

Each node is an operation; each arrow is a value. Each node can compute its local gradient — “how much its output changes when its input is nudged.”

| Node | Operation | Local gradients |
|---|---|---|
| Multiply | u = wx | ∂u/∂w = x,  ∂u/∂x = w |
| Subtract | v = u − y | ∂v/∂u = 1,  ∂v/∂y = −1 |
| Square | L = v² | ∂L/∂v = 2v |

Multiplying these in reverse gives w’s gradient on the loss:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial w} = 2v \cdot 1 \cdot x = 2(wx - y)x

Backprop, at its core, is just “walk the computation graph’s arrows in reverse, multiplying local gradients as you go”.
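Since the graph is tiny, the whole forward-plus-backward walk fits in a short sketch (plain Python with made-up numbers, mirroring the three nodes above):

```python
# Forward pass caches each node's output; backward pass walks the
# arrows in reverse, multiplying local gradients (the chain rule).
def forward_backward(w, x, y):
    u = w * x            # multiply node
    v = u - y            # subtract node
    L = v * v            # square node
    dL_dv = 2 * v        # local gradient of the square node
    dL_du = dL_dv * 1    # subtract node: dv/du = 1
    dL_dw = dL_du * x    # multiply node: du/dw = x
    return L, dL_dw

L, grad = forward_backward(w=2.0, x=3.0, y=5.0)
print(L, grad)   # L = (2*3-5)^2 = 1, dL/dw = 2*(2*3-5)*3 = 6
```

The returned gradient matches the closed form 2(wx − y)x derived above.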

Gradients flow as Jacobians

When the inputs and outputs are vectors, the local gradient becomes “a matrix of partial derivatives per element” — a Jacobian matrix (briefly introduced by name and shape in article 4).

For a function \vec{y} = f(\vec{x}) with input \vec{x} (length n) and output \vec{y} (length m), the Jacobian is an m \times n matrix whose (i, j) entry is \partial y_i / \partial x_j:

J = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n} \end{bmatrix}

A complete table of “how much y_i moves when x_j is nudged a bit” for all i, j.

In backprop, what was a scalar chain rule (multiplication of local gradients) becomes matrix multiplication for vector-to-vector cases. It’s the matrix-valued version of the chain rule. Chaining local Jacobians together as matrix products from back to front yields the gradient with respect to all parameters, as a vector.

In practice, computing the full m \times n Jacobian explicitly is wasteful, so implementations use the Vector-Jacobian Product (VJP), computing only “the product of the Jacobian with a vector”. This is a key piece of what autograd does under the hood. In LLMs, vectors within a layer can be several thousand dimensions and weight matrices can have millions of elements, but the principle remains the chaining of Jacobians.

autograd is “automation of backprop”

Implementing the computation graph and its local-gradient products by hand gets painful fast once layers stack up. Modern deep learning frameworks (PyTorch, JAX, TensorFlow, etc.) automate this via autograd (automatic differentiation).

A typical training step

In PyTorch:

# 1. Forward pass
output = model(x)
loss = criterion(output, y)

# 2. Gradient computation (backprop)
loss.backward()

# 3. Parameter update
optimizer.step()
optimizer.zero_grad()

When loss.backward() is called, PyTorch walks the computation graph built during the forward pass in reverse and computes gradients for each parameter automatically. The result gets written into each parameter tensor’s .grad attribute.

optimizer.step() reads those .grad values and performs the update \theta \leftarrow \theta - \eta \nabla L (in whatever form the optimizer implements). optimizer.zero_grad() clears .grad before the next step, so gradients from different steps don’t accumulate.

What autograd saves you from

autograd is sometimes described as “magic”, but what it’s doing is exactly what we just covered: build the computation graph on the forward pass → traverse it in reverse, multiplying local gradients (Jacobians) to accumulate each parameter’s gradient. A mechanical procedure, nothing more. “Magic” is a stretch; a tool that automates what would otherwise be a soul-crushing amount of manual differentiation is closer to the truth.

The payoff from not having to write it yourself:

  • No need to derive \partial L / \partial W by hand. Write the forward pass, and backprop comes along for free.
  • Works at scale — hundreds of millions to trillions of parameters, all covered by the three lines of code above.
  • New layers or loss functions work automatically, as long as they’re built from differentiable operations.

The fact that modern deep learning could scale this far is in no small part thanks to autograd.

“So I don’t need to write derivatives myself?”

Research papers still discuss derivatives in formulas, but implementations leave everything to the machine. As a user, you just need to understand the formula — the implementation only needs the forward pass. Keeping this gap in mind makes papers easier to read: when derivatives get discussed, you can mentally translate to “that’s autograd running in reverse under the hood”.

Optimizer lineage: SGD → Momentum → Adam / AdamW

So far we’ve written “once gradients are out, move with \theta \leftarrow \theta - \eta \nabla L”. The component that takes gradients and concretely updates parameters is called an optimizer. Implementation-wise, in PyTorch it’s a swappable object like torch.optim.SGD(...) or torch.optim.AdamW(...), picked separately from the model, loss, learning rate, etc. The optimizer.step() call we saw earlier is the optimizer’s method that applies the gradient-based update.

Plain gradient descent (SGD) has a lot of room for improvement, and over the years various optimizers have been proposed. Here’s the lineage up to what LLM training uses today: Adam / AdamW.

SGD (plain gradient descent)

\theta \leftarrow \theta - \eta \, g

where g = \nabla L. Each step, just look at the current gradient and move \theta. Simple, but it tends to oscillate, and efficiency suffers in curved valleys.

SGD with Momentum

Carry over the “velocity” v from the previous step: add inertia.

v \leftarrow \beta v + g, \quad \theta \leftarrow \theta - \eta \, v

\beta is the factor deciding how much of the previous velocity to carry over (called the momentum coefficient). Typical value: 0.9. Folding in momentum from past gradients cancels out small oscillations and moves faster along the major valley direction. Picture a ball rolling downhill, picking up speed.
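As a concrete sketch, here is the momentum update applied to the same toy loss L(w) = w² (the η value here is illustrative):

```python
# SGD with momentum: v <- beta*v + g, w <- w - eta*v.
# The velocity v accumulates past gradients, like a rolling ball.
def momentum_step(w, v, eta=0.1, beta=0.9):
    g = 2 * w             # gradient of L(w) = w^2
    v = beta * v + g      # blend old velocity with the new gradient
    w = w - eta * v       # move along the velocity, not the raw gradient
    return w, v

w, v = 4.0, 0.0
for _ in range(5):
    w, v = momentum_step(w, v)
    print(round(w, 3), round(v, 3))
```

Watching the printed values, w overshoots past 0 and swings back: the accumulated velocity carries it through the minimum, which is exactly the ball-rolling picture.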

Adam (Adaptive Moment Estimation)

On top of momentum’s inertia (the first moment), also track the history of the gradient’s magnitude (the second moment).

m \leftarrow \beta_1 m + (1-\beta_1) g
v \leftarrow \beta_2 v + (1-\beta_2) g^2

m is a smoothed estimate of the gradient’s direction; v is an estimate of the gradient’s size. The update uses both:

\theta \leftarrow \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon}

Where gradients are small, the denominator is also small, so the move is relatively large; where gradients are thrashing, the denominator is large, so the move is conservative. A mechanism that auto-adjusts the per-parameter step size. Adam’s name (Adaptive Moment Estimation) refers to this “adaptive to gradient history” property.

Standard values from the original paper: \beta_1 = 0.9, \beta_2 = 0.999. LLMs sometimes use \beta_2 = 0.95.
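The two moment updates and the scaled step translate to code directly (a simplified sketch following the formulas above; the published Adam also divides m and v by 1-\beta_1^t and 1-\beta_2^t for bias correction, omitted here):

```python
import math

# One Adam step: first moment m (direction), second moment v (size),
# then a per-parameter scaled move m / (sqrt(v) + eps).
def adam_step(theta, g, m, v, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    theta = theta - eta * m / (math.sqrt(v) + eps)
    return theta, m, v

theta, m, v = 4.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * theta, m, v)   # gradient of w^2 is 2w
print(theta, m, v)
```

Note how the step barely depends on the raw gradient magnitude: m and √v scale together, so the effective move stays near η.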

AdamW

A patched version of Adam that correctly integrates weight decay. Weight decay is regularization that keeps parameters from growing too large. Adam’s original formulation mixed it into the adaptive gradient update, which distorts the regularization; AdamW decouples it and applies it to the weights directly.

LLM training is almost entirely AdamW. When you see paper appendices with settings like AdamW(betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1), the betas are the \beta_1, \beta_2 coefficients above, and weight_decay is the regularization strength.

Interested in Adam’s first/second moment internals? The MegaTrain article on training a 100B-parameter LLM on a single GPU goes deeper.

Vanishing and exploding gradients

When layers get very deep (tens to hundreds), the process of multiplying gradients via the chain rule introduces new problems.

The Transformer that shows up here is the architecture underlying modern LLMs: a deep neural network built by stacking blocks of Attention and feed-forward layers tens of times or more. Attention itself is briefly touched on in article 2 on vectors and matrices as the QK^T shape; the rough image of “a device that weighs how much a word should attend to each other word via dot products, then takes a weighted sum” is enough. This article doesn’t dive deep into Transformer or Attention internals; “a type of network that stacks many layers” is the level of detail needed here.

Vanishing gradient

When each layer’s local gradient is less than 1 (e.g., 0.1 or 0.5), multiplying through N layers gives:

0.1 \times 0.1 \times \dots \times 0.1 = 0.1^N \to 0

The gradient collapses toward 0 before reaching the earlier layers; parameters there basically don’t update. Those layers stop moving, and training stalls.

sigmoid and tanh have small maximum gradients (0.25 for sigmoid, 1 for tanh), which made vanishing common in deep stacks. The workaround that spread was ReLU (Rectified Linear Unit, \max(0, x)). In math this is known as the ramp function: 0 for x < 0, x for x \ge 0. Since ReLU’s gradient stays at 1 in the positive region (no shrinking), gradients can flow through deep networks more reliably.

Exploding gradient

Conversely, when local gradients exceed 1, the product grows exponentially:

2 \times 2 \times \dots \times 2 = 2^N \to \infty

Huge gradients cause \theta to jump wildly; the loss turns into NaN and training breaks.

The remedy is gradient clipping: cap the magnitude of the gradient vector (its L2 norm, the same vector length \|x\| covered in article 2) and scale it down when it exceeds the cap. Settings like grad_clip=1.0 refer to this.
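What clipping does can be sketched in a few lines (pure Python; real frameworks ship this, e.g. PyTorch’s clip_grad_norm_):

```python
import math

# Gradient clipping by L2 norm: if ||g|| exceeds max_norm,
# rescale g so its direction is kept but its length becomes max_norm.
def clip_grad(grad, max_norm=1.0):
    norm = math.sqrt(sum(x * x for x in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [x * scale for x in grad]
    return grad

print(clip_grad([3.0, 4.0]))   # norm 5 -> rescaled down to norm 1
print(clip_grad([0.1, 0.2]))   # norm < 1 -> passed through unchanged
```

The direction of the step is preserved; only runaway magnitudes are reined in.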

Residual connections and LayerNorm are “devices that let gradients flow”

Vanishing gradients get addressed more fundamentally by designing the layers themselves to be gradient-friendly. Transformers have residual connections and LayerNorm baked in as standard.

Residual connections

Define a layer’s output y as “the layer transformation f(x) plus the input x added back in”:

y = f(x) + x

With this structure, during backprop, even if the gradient collapses going through f(x), the shortcut path via +x lets the gradient through unaltered. Gradients reach earlier layers much more reliably, even in deep networks.

Introduced by ResNet (2015) and spread across image recognition and Transformers generally. In Transformers, every Attention block and feed-forward block has a residual connection attached.

LayerNorm

Normalization that “rescales a layer’s output to mean 0, variance 1”. If the value scales within layers drift around, the gradient scales follow, so normalizing keeps things stable.

For a vector z:

\text{LN}(z) = \gamma \cdot \frac{z - \mu}{\sigma} + \beta

where \mu, \sigma are the mean and standard deviation within z, and \gamma, \beta are learnable parameters. Stable value scales lead to stable gradient scales, which lead to stable training overall.
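The formula translates to code almost one-to-one (a plain-Python sketch for a single vector, with γ and β fixed at their usual init values 1 and 0, and a tiny eps added inside the square root for numerical safety, a standard implementation detail):

```python
import math

# LayerNorm: shift z to mean 0, scale to variance 1, then apply
# the learnable gamma (scale) and beta (shift).
def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    mu = sum(z) / len(z)
    var = sum((x - mu) ** 2 for x in z) / len(z)
    sigma = math.sqrt(var + eps)
    return [gamma * (x - mu) / sigma + beta for x in z]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(x, 3) for x in out])
```

Whatever scale the input drifts to, the output always has mean 0 and variance 1 (before γ, β), which is the stabilizing effect described above.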

When Transformer articles mention LN, “post-norm”, or “pre-norm”, this is what they’re talking about.

A two-piece device

With residual connections alone, the +x terms can let value scales grow as layers stack. LayerNorm re-aligns those scales, so the two together form a device that “lets gradients through while keeping value scales in check”.

That’s why Transformers can stack 100+ layers. Variations on this theme include MoonshotAI’s AttnRes.

Learning rate schedule: warmup and cosine decay

Learning rate \eta is often left fixed, but varying it during training usually works better. In LLMs, the standard is warmup plus cosine decay.

Warmup: start small

At training onset, parameters are randomly initialized, sitting at some weird spot on the loss landscape. Jumping straight in with a high learning rate makes gradients thrash and diverge easily. The fix is warmup — ramp the learning rate from 0 up to the target over the first few percent of steps.

Example: warmup_steps=500 means the learning rate rises linearly from 0 to the target over the first 500 steps.

Cosine decay: get finer as you go

After warmup, decay the learning rate along a cosine curve. By the end of training, it drops to roughly 1% of the target.

\eta(t) = \eta_{\max} \cdot \frac{1 + \cos(\pi t / T)}{2}

where t is the number of steps since warmup ended and T is the total number of post-warmup steps. Bold at the start, fine-grained at the end: this profile speeds up convergence.

[Figure: learning rate rising from 0 during warmup, then decaying along a cosine curve]
Typical learning rate schedule. Rises from 0 during warmup, then decays along a cosine curve to near 0. Standard for LLM training.
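The schedule itself is a small function of the step count (a sketch with illustrative warmup_steps / total_steps values):

```python
import math

# Warmup + cosine decay: linear ramp from 0 over warmup_steps,
# then a cosine curve from max_lr down toward 0.
def lr_at(step, max_lr=1e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return max_lr * step / warmup_steps        # linear warmup
    t = step - warmup_steps                        # steps since warmup ended
    T = total_steps - warmup_steps                 # post-warmup horizon
    return max_lr * (1 + math.cos(math.pi * t / T)) / 2

print(lr_at(0), lr_at(500), lr_at(5_000), lr_at(10_000))
```

lr_at(0) is 0, lr_at(500) hits the target, and lr_at(10_000) has decayed to essentially zero, tracing exactly the warmup-then-decay shape described above.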

“Why not just start at the target?”

Freshly initialized networks produce outputs unrelated to the training data, so their gradients have arbitrary direction and large magnitude. Running at the target learning rate fully reflects those garbage gradients into the parameters, causing divergence. Warmup gives the gradients time to settle before turning up the heat.

The big picture of an LLM training loop

Putting all the pieces together, the actual training loop takes shape.

The flow of one step

flowchart LR
    A[Fetch batch] --> B[Forward]
    B --> C[Compute loss]
    C --> D[Backprop]
    D --> E[Grad clip]
    E --> F[optimizer.step]
    F --> G[scheduler.step]
    G --> A
  1. Fetch batch: get the next mini-batch
  2. Forward: run input through the model to get output
  3. Compute loss: cross-entropy or similar (article 3)
  4. Backprop: loss.backward() computes gradients for all parameters
  5. Grad clip: cap the gradient’s L2 norm
  6. optimizer.step(): update with Adam / etc. (\theta \leftarrow \theta - \eta \nabla L)
  7. scheduler.step(): adjust the next step’s learning rate via warmup/cosine

LLM pre-training is this loop repeated hundreds of millions to billions of times.
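All seven stages can be compressed into a runnable toy (pure Python on the 1-parameter loss L(w) = w², with made-up hyperparameters; real training does the same thing per mini-batch with an autograd framework):

```python
import math

# One full "training run": schedule -> gradient -> clip -> Adam-style
# update, repeated for total_steps. Bias correction omitted for brevity.
def train(w=4.0, total_steps=200, warmup=20, max_lr=0.1):
    m, v = 0.0, 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(total_steps):
        if step < warmup:                          # scheduler: warmup
            lr = max_lr * step / warmup
        else:                                      # scheduler: cosine decay
            t, T = step - warmup, total_steps - warmup
            lr = max_lr * (1 + math.cos(math.pi * t / T)) / 2
        g = 2 * w                                  # "backprop": L'(w) = 2w
        g = max(min(g, 1.0), -1.0)                 # grad clip (1-D norm cap)
        m = beta1 * m + (1 - beta1) * g            # optimizer: first moment
        v = beta2 * v + (1 - beta2) * g * g        # optimizer: second moment
        w = w - lr * m / (math.sqrt(v) + eps)      # optimizer.step()
    return w

print(train())   # w ends much closer to the minimum 0 than the start 4.0
```

Every line maps onto one box in the flowchart; swap in a real model, autograd, and a data loader, and this is, structurally, LLM pre-training.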

Epochs and steps

  • 1 epoch: one pass through the training data
  • 1 step: one update from one mini-batch
  • 1 run: the full training across the configured number of epochs

LLMs often stop before completing a single epoch (the data is vast enough that training has progressed sufficiently before a full pass).

The words “epoch” and “step” aren’t unique to LLMs; they appear across all generative-AI training tools: image-generation LoRA training, Stable Diffusion fine-tuning, voice-synthesis augmentation, and more. What’s running under the hood is always the same gradient descent loop, so the units that appear on screen are shared too. If you’ve run image-gen LoRA training before, re-reading something like the SeaArt LoRA practice article will click: the loss-curve reading there is literally the same thing as in this article.

When does training end?

After reading about update rules and backprop, “okay, when does it actually stop?” is a fair question.

Gradient descent’s mechanical rule is “keep moving in the direction the loss decreases”; in theory, it runs until the loss reaches 0. In practice, the model’s representational capacity is limited and training data has noise, so the loss plateaus at some non-zero value. When to stop is decided by a separate rule.

  • Step / epoch count: stop after a pre-decided number of steps or epochs. LLM pre-training almost always uses this.
  • Early stopping: halt when validation loss stops improving. Standard for fine-tuning; avoids overfitting.
  • Human judgment: watch the training log, stop manually when the loss curve flattens.

When a paper’s training details say total_steps=300000, that’s the cue for when training ends.

Distributed training is a separate topic

Real LLM training runs across many GPUs and nodes, but parallelization and communication are at a different layer. What happens within a single step is exactly what’s described here. Implementation-side discussions of the training loop are in other articles (e.g., Asynchronous RL training architectures).

So where does “the correct answer” come from?

After walking through losses, gradients, and optimizers mechanically, stepping back, “who actually supplies the ground truth?” becomes a natural question. If loss is “how wrong the model is compared to the correct answer”, no correct answer means no loss.

The answer depends on the kind of learning. Machine learning splits broadly into four categories based on how correct answers are handled:

| Learning type | Source of the correct answer | Typical example |
|---|---|---|
| Supervised learning | Human-labeled data | Image classification, spam detection, scripted dialog |
| Unsupervised learning | No answers; find structure in the data itself | Clustering, dimensionality reduction, anomaly detection |
| Self-supervised learning | Answers generated mechanically from the data itself | LLM pre-training, BERT masked language modeling |
| Reinforcement learning | Reward signals (scores) from the environment as a loose “correct” | Game AI, robot control, RLHF |

How each one works

  • Supervised learning uses explicit answers, like images tagged “cat” and “dog” by humans. The clearest form.
  • Unsupervised learning doesn’t use answers — its focus is clustering similar data, compressing high-dimensional data to low dimensions, etc. Loss is defined as “how well does the data’s own properties survive” (reconstruction error, distance preservation, etc.).
  • Self-supervised learning mechanically generates answers from the data itself. LLMs use “the next token” pulled from the text as the answer.
  • Reinforcement learning evaluates the quality of actions after the fact. Rather than a single correct answer, the model moves toward maximizing cumulative reward.

Deep learning is usable across all four

For clarity: deep learning is the name for learning methods that use deep neural networks, and it’s orthogonal to the four categories above. Any of those learning types — supervised / unsupervised / self-supervised / reinforcement — becomes “deep learning” when a deep network is used internally. In all cases, the gradient descent and backprop story in this article applies directly.

What’s special about LLMs is the setup: “there’s a vast amount of self-supervised training data” (= web text). From a deep-learning standpoint, LLMs are “one of the four categories, scaled up”.

Why LLM pre-training is special

Specifically, LLM pre-training’s special feature is that the correct answers are already embedded in the training data. Take vast amounts of text from the web and books, and quiz the model with “what’s the next token given this context?”. The answer comes out automatically by shifting the text by one position — no human labeling required. This is called self-supervised learning, and it’s one of the big reasons LLMs can train on trillions of tokens.
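The “shift by one” trick is tiny in code (toy token IDs standing in for a tokenizer’s output):

```python
# Self-supervised next-token targets: the input is the sequence,
# the "correct answer" is the same sequence shifted by one position.
tokens = [101, 7, 42, 9, 200]     # made-up token IDs
inputs = tokens[:-1]              # what the model sees
targets = tokens[1:]              # what it must predict at each position
print(list(zip(inputs, targets)))
```

Every (input, target) pair comes straight from the raw text; no human ever labels anything, which is why this scales to trillions of tokens.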

When fine-tuning or RLHF involves humans, it’s the post-pipeline of nudging the rough linguistic knowledge from pre-training into a form humans want.

Hallucinations and self-supervised learning

By the way, “self-supervised” is, literally, a structure where the subject teaches itself — meaning the data’s contents are inherited directly as the “truth”. Mistakes, fiction, outdated information, and biases mixed into web or book text all get learned. During pre-training, there’s no external party saying “that’s wrong”.

The roots of LLM hallucinations (confident plausible-sounding lies) lie in this “the teacher is the data itself” situation. Fine-tuning and RLHF can be read as “overlaying human evaluation on the ‘data-as-truth’ state, to at least somewhat distinguish fact from fiction”.

If you’re curious what actual fine-tuning work looks like, this blog has a few hands-on records.

The targets and methods vary, but since the underlying loop is always gradient descent, this article’s content applies directly.

Basic questions you might have while reading

Since this article focuses on the “mechanics” side of training, some foundational questions tend to slip by. Brief follow-ups.

What’s a neural network, again?

A neural network is, roughly, “a function that takes numbers, runs them through matrix multiplications and simple transforms (activation functions) several times, and spits out the target numbers”.

flowchart LR
    X["Input x (a row of numbers)"] --> L1["Layer 1: Wx+b → activation"]
    L1 --> L2[Layer 2]
    L2 --> L3[Layer 3]
    L3 --> Y["Output (probability distribution, numbers)"]

Each layer has a weight matrix W and a bias b; the layer computes Wx + b and passes the result through a nonlinearity (like ReLU). Stacking many layers produces a function capable of expressing complex relationships.

The parameters written as \theta all through this article are the collective W and b across all layers: the thing training adjusts.
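One such layer, written out in plain Python with tiny made-up numbers (real layers are large matrices on GPUs, but the shape of the computation is identical):

```python
# y = ReLU(W x + b): a matrix-vector product, a bias, a nonlinearity.
def relu(t):
    return max(0.0, t)

def layer(W, b, x):
    # one output per row of W: dot product with x, plus bias, then ReLU
    return [relu(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W = [[1.0, -2.0], [0.5, 0.5]]   # the weights: part of theta
b = [0.0, -1.0]                 # the biases: also part of theta
print(layer(W, b, [3.0, 1.0]))  # -> [1.0, 1.0]
```

Training adjusts exactly these W and b entries; a full network is just this function composed with itself, layer after layer.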

So is deep learning just stacking this up?

Pretty much. Deep learning refers to neural networks with a deep (= large number of) stack, distinguished from shallow nets with 2 or 3 layers. A modern LLM like a Transformer is a typical example, stacking tens to hundreds of layers.

More layers means more expressive power — it can learn mappings that simpler nets couldn’t (image → object name, text → meaning, audio → text, etc.). But just going deeper runs into vanishing/exploding gradients, which is why residual connections and LayerNorm come paired with depth — the “tricks that let gradients flow through deep nets” covered earlier.

Is this actually a brain imitation, like people say?

The name “neural network” and the layered-connected-units look really do trace back to biological neuron-inspired designs. McCulloch-Pitts (1940s) and Rosenblatt’s perceptron started here, influenced by “the brain processes information through neural firing”.

Today’s deep learning, however, has inherited the name and visual but drifted far from biological brains:

  • Brain neurons encode information in firing timing (spikes); neural networks pass continuous-valued vectors
  • Learning in the brain is localized, asynchronous chemistry; deep learning differentiates the whole system and updates every parameter in lockstep with gradients
  • There’s no evidence the brain does backprop

Rather than "imitating the brain's methods", modern deep learning is better described as a separate thing that produces functionally similar output: it began with a brain metaphor and parted ways long ago.

Why does just layering matrix multiplications work?

A single layer can only do linear transformations (straight-line mappings). But interleaving nonlinear functions (like ReLU) while stacking layers gives you the power to approximate virtually any continuous function — this is known as the universal approximation theorem. “Tell cats from dogs”, “return probabilities for the next token”, “generate images” — if an input-to-output relationship is expressible as a function, a neural network can approximate it.
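A toy instance of that expressive power: |x| bends at 0, so no single linear layer can represent it, yet a hidden layer of just two ReLU units does so exactly (the weights here are hand-picked, not learned):

```python
def relu(z):
    return max(z, 0.0)

def tiny_net(x):
    # Hidden layer: weights [1, -1]; output layer: weights [1, 1]
    # relu(x) + relu(-x) == |x| for every real x
    return relu(x) + relu(-x)

print(tiny_net(-3.0), tiny_net(2.5))  # 3.0 2.5
```

The universal approximation theorem generalizes this trick: with enough units, piecewise-linear bends can hug any continuous function.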

Training is the act of tuning that approximation to fit the data, so a trained network is “a huge function that’s memorized the input-output mapping the data revealed”.

Is training sequential? All at once? How do parameter updates happen?

Short answer: every mini-batch, all parameters move at once, by a small amount. Repeated forever.

  • A single mini-batch’s forward-backward pass computes ∇L for all parameters θ (hundreds of millions to trillions of them) in one go
  • optimizer.step() applies θ ← θ − η∇L to every parameter simultaneously
  • Because η is small, each step moves every parameter only a tiny amount
  • Repeat for millions to billions of steps

So it’s both “all at once” (every parameter moves in sync) and “sequential” (small amounts, many times). Picture every screw in a large building being turned a sliver, every screw at the same time, every step.
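The loop itself is tiny. A sketch with a one-parameter toy loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the loss, η, and step count are all invented):

```python
def grad(theta):
    # dL/dtheta for the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate
for step in range(100):                # "small amounts, many times"
    theta = theta - eta * grad(theta)  # theta <- theta - eta * gradient

print(round(theta, 4))  # → 3.0, the minimizer
```

A real training step is the same line applied to billions of parameters at once, with ∇L coming from backprop instead of a hand-written formula.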

What you can read now

With what’s here, settings seen in LLM papers, model cards, and training logs become “numbers that mean something”.

| Common setting | Reading |
| --- | --- |
| `lr = 1e-4` | The learning-rate target. A standard value for Adam-family optimizers |
| `AdamW(betas=(0.9, 0.95))` | First-moment momentum 0.9, second-moment momentum 0.95 |
| `weight_decay = 0.1` | Regularization that keeps parameters from growing too large |
| `warmup_steps = 500` | Ramp the learning rate from 0 to its target over the first 500 steps |
| cosine schedule | After warmup, decay the learning rate along a cosine curve |
| `grad_clip = 1.0` | Scale gradients down whenever their L2 norm exceeds 1 |
| `batch_size = 4M tokens` | One mini-batch is 4 million tokens |
| loss curve | The per-step trajectory of the loss |
| gradient norm | The gradient's L2 norm; a signal for detecting explosion |
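The warmup and cosine rows combine into one small function (the step counts and min_lr are invented; the shape is the standard linear-warmup plus cosine-decay schedule, not any particular library's API):

```python
import math

def lr_at(step, max_lr=1e-4, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    # Linear warmup from 0 to max_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(250))     # mid-warmup: half of max_lr
print(lr_at(500))     # warmup done: max_lr
print(lr_at(10_000))  # end of training: back near min_lr
```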

The training-details sections of papers and the W&B / TensorBoard training-log UIs should mostly make sense now.

Things you can skip

At the entry level, these are fine to ignore:

| Term | Summary | Why it's safe to skip |
| --- | --- | --- |
| Second-order optimization (Newton's method, L-BFGS) | Methods that also use the gradient's derivative (the Hessian) | Too expensive at LLM scale |
| Natural gradient | Corrects the gradient using the geometry of parameter space | Heavy to implement; not standard practice |
| Regularization (L1/L2, Dropout) | Extra terms / procedures that curb overfitting | Important, but a separate topic |
| Gradient accumulation | Accumulate gradients over several steps before updating | A trick for effectively enlarging the mini-batch |
| Mixed precision (fp16/bf16) | Lower numeric precision for speed | An implementation / hardware topic |
| Distributed training (DDP, FSDP, ZeRO) | Split parameters and work across many GPUs | Engineering layered on top of the training loop |
| Policy gradients (RL) | RL-specific gradient computation | Shows up in RLHF, but a separate topic |

The bare minimum to pin down: the update rule θ ← θ − η∇L, the chain-rule expansion of backprop, Adam's first/second moments, and the residual-connection + LayerNorm mechanism that lets gradients flow. Four things.
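Of those four, Adam's moments are the easiest to lose track of. The standard update, written out for a single parameter (the hyperparameters are the usual defaults, not taken from any particular source):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # m: first moment, a running mean of gradients (the direction)
    m = b1 * m + (1 - b1) * g
    # v: second moment, a running mean of squared gradients (the scale)
    v = b2 * v + (1 - b2) * g * g
    # Bias correction: m and v start at 0 and would be underestimates early on
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(1.0, g=4.0, m=0.0, v=0.0, t=1, lr=0.01)
print(theta)  # moved ≈ lr against the gradient: ≈ 0.99
```

Dividing the direction by the scale is why Adam takes comparable step sizes regardless of how large the raw gradients are.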


Glossary (feel free to skip)

| Term | Meaning |
| --- | --- |
| Gradient descent | The generic name for updates that lower the loss via θ ← θ − η∇L |
| Learning rate η | The step size of each update |
| SGD | Stochastic gradient descent; the mini-batch approach |
| Backpropagation (backprop) | The procedure that applies the chain rule at scale to compute gradients |
| Computation graph | The forward pass's operations drawn as connected nodes; the skeleton of backprop |
| autograd | Framework automatic differentiation (PyTorch et al.) |
| Optimizer | The component that consumes gradients and updates θ (SGD, Adam, AdamW) |
| Momentum | An improvement that adds inertia from previous gradients |
| Adam | Optimizer using first (direction) and second (scale) moments |
| AdamW | Adam with weight decay applied correctly (decoupled from the gradient update) |
| Vanishing gradient | Gradients collapsing toward 0 through the chain rule's products |
| Exploding gradient | Gradients blowing up toward ∞ through the chain rule's products |
| Gradient clipping | A technique that caps gradient magnitude |
| Residual connection | The structure y = f(x) + x; gradients flow through the shortcut |
| LayerNorm | Normalization that brings each layer's values to mean 0, variance 1 |
| Warmup | The period where the learning rate ramps from 0 to its target |
| Cosine decay | Post-warmup schedule that decays the learning rate along a cosine curve |

This closes the series, at least for now. Across articles 1-5, the “math-symbol-ish” parts of AI articles should be broadly readable.