Derivatives, just enough to read AI articles
After the previous probability and statistics article, LLM training logs and model cards become mostly readable. But if you try to follow how a model actually moves toward lower loss, calculus symbols start showing up. ∂L/∂w, ∇, d/dx — plenty of people close the tab right there.
Same stance as the previous three articles in the series: the goal isn’t to solve anything, just to be able to read.
No rigorous calculus rules here. d/dx, f′(x), ∂, ∇ — if you can read “what these four are doing”, most of the calculus in AI articles becomes followable.
Gradient descent and backpropagation lean directly on the chain rule and gradient, so they’re saved for the next article.
Derivatives extract “slope”
The core of differentiation is this: “At some point on a function, if you nudge the input slightly, how much does the output move?”
Think of it as the rate of change from middle/high school, but with the nudge shrunk to zero.
Graphically, the slope of the tangent line at a point on a curve is the derivative at that point.
- Steeply upward → large positive derivative
- Flat point (like a vertex) → derivative is 0
- Steeply downward → large negative derivative
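The “nudge the input, measure the output” definition can be tried directly in code. A minimal sketch (the helper name `numerical_derivative` is illustrative, not from the article): approximate the slope of x² at a point with a small finite nudge.

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward difference: nudge x by a small h and measure how much f moves."""
    return (f(x + h) - f(x)) / h

# Slope of f(x) = x² at x = 3; the exact derivative 2x gives 6.
slope = numerical_derivative(lambda t: t ** 2, 3.0)
print(slope)  # ≈ 6.000001 — the leftover comes from h not being truly zero
```

Shrinking `h` further pushes the approximation toward the exact slope, which is the “nudge shrunk to zero” idea in executable form.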
Japanese high school math splits this into two stages worth keeping in mind.
First, the slope at a specific point x = a is called the derivative value or differential coefficient — written f′(a).
Bundled back into a function that returns this slope for any x, you get the derivative function, written f′(x) or df/dx.
“Differentiating” a function means deriving this derivative function from the original f(x).
In AI, you compute ∂L/∂w — the derivative of the loss L with respect to a weight w — to see “how much the loss shifts if we nudge w a bit.” That the loss can be pushed down “in the direction that decreases it” is only possible because this slope can be measured.
Reading d/dx and f’(x)
Two notations for derivatives show up most often.
- Leibniz notation: df/dx or dy/dx
- Lagrange notation: f′(x) or y′
Both mean the same thing: “differentiate with respect to x.” The choice is about readability, not meaning.
- df/dx looks like a fraction, but “df” and “dx” aren’t quantities being divided independently. The shape captures the intuition of “how much f changes when x is nudged by an infinitesimal amount.”
- f′(x) is short, handy when the same function gets differentiated repeatedly mid-derivation.
- “With respect to x” means “measured against x as the thing being nudged.”
AI papers and explanations usually want to be explicit about which variable the derivative is taken with respect to, so Leibniz notation (∂L/∂w, etc.) is more common.
Why splitting the Leibniz fraction is allowed
Push into calculus a bit further — exam prep or university — and manipulations like rewriting dy/dx = f(x) as dy = f(x) dx show up (substitution in integrals, separable ODEs). “Isn’t dy/dx supposed to be a single symbol? How can we split it like a fraction?” is a fair thing to wonder. Short answer:
When Leibniz invented the notation in the 17th century, dy and dx really were treated as “infinitesimally small quantities” — dy/dx was an actual fraction. In modern rigorous calculus, dy/dx is treated as a single symbol, but the splitting manipulation is justified behind the scenes as a shorthand for the chain rule or the substitution theorem. The rigorous justification lives in differential forms or real analysis at the university level, but until then, treating it as “a handy tool that gives correct answers” is enough.
Getting a feel with concrete examples
Functions of the form xⁿ — like x², x³, x¹⁰ — are called power functions.
n is called the exponent; it says “how many times you multiply x by itself.” n doesn’t have to be a positive integer — negative (x⁻¹ = 1/x), fractional (x^(1/2) = √x) — all of it counts as a power function.
Differentiation feels most natural starting from power functions.
- x² differentiates to 2x
- x³ differentiates to 3x²
- x⁴ differentiates to 4x³
Line these up and the rule pops out: (xⁿ)′ = n·xⁿ⁻¹
“Bring the exponent down to the front, reduce the exponent by one.” It’s the main formula when computing by hand, and the form you see most often in AI articles.
| Original | Derivative |
|---|---|
| x² | 2x |
| x³ | 3x² |
| xⁿ | n·xⁿ⁻¹ |
| c (constant) | 0 |
The formula works as-is for negative or fractional n too. Just drop the exponent in front and decrement.
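The “works for negative and fractional exponents too” claim is easy to spot-check numerically. A sketch (the helper name `deriv` is illustrative) comparing a finite-difference slope against n·xⁿ⁻¹ for n = 1/2 and n = −1:

```python
def deriv(f, x, h=1e-6):
    # Central difference: nudge both ways for better accuracy than one-sided.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
for n in (0.5, -1.0):
    approx = deriv(lambda t: t ** n, x)
    exact = n * x ** (n - 1)   # bring n down in front, decrement the exponent
    print(n, round(approx, 6), round(exact, 6))
```

The two columns agree to several decimal places, fractional and negative exponents included.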
Constants differentiate to zero. The “slope of something that doesn’t change” is 0, which reads cleanly.
Commonly-used derivatives
For reading AI articles, this table covers almost everything.
The exponential function aˣ shows up here for the first time. It’s structurally the opposite of a power function xⁿ: the base is a constant, and the exponent is the variable.
Similar names, but where the variable x sits determines the family.
- Power function xⁿ: base is the variable, exponent is fixed
- Exponential function aˣ: base is fixed, exponent is the variable
eˣ is the exponential with a = e. The logarithm log x that shows up later is the inverse of the exponential — it returns “what power of e equals x.”
| Original | Derivative |
|---|---|
| xⁿ (power function) | n·xⁿ⁻¹ |
| eˣ (exponential, base e) | eˣ (unchanged) |
| log x (natural log) | 1/x |
| c (constant) | 0 |
No need to memorize everything. Just “eˣ stays unchanged under differentiation” and “log x differentiates to 1/x” already help a lot when reading.
Both of those have the base e / natural log in common — the next section gets into why.
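The two headline facts in the table can be verified numerically before any theory. A sketch (helper name `deriv` is illustrative) checking that the slope of eˣ is eˣ itself, and the slope of log x is 1/x:

```python
import math

def deriv(f, x, h=1e-6):
    # Central-difference approximation of the slope at x.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
print(deriv(math.exp, x), math.exp(x))  # the two match: (e^x)' = e^x
print(deriv(math.log, x), 1 / x)        # (log x)' = 1/x
```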
What e really is
The previous articles parked “e is the mysterious constant ≈ 2.71828…, it’ll come up in calculus” for later. Time to settle that.
Origin: the compound interest limit
e first showed up historically in compound interest. “If you borrow 1 at 100% annual interest and compound n times a year, what’s the maximum the debt reaches by year’s end?”
- 1 compounding: (1 + 1)¹ = 2
- 2 compoundings: (1 + 1/2)² = 2.25
- 4 compoundings: (1 + 1/4)⁴ ≈ 2.441
- 12 compoundings: (1 + 1/12)¹² ≈ 2.613
- Finer still…
Larger n gives larger values, but the sequence doesn’t blow up to infinity — it converges to a specific constant: e = lim (1 + 1/n)ⁿ as n → ∞ ≈ 2.71828…
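The convergence is quick to see in a few lines of code, computing (1 + 1/n)ⁿ for ever finer compounding:

```python
import math

# Compound (1 + 1/n)^n for ever finer n: the values creep up toward e.
for n in (1, 2, 4, 12, 365, 1_000_000):
    print(n, (1 + 1 / n) ** n)

print("e =", math.e)  # 2.718281828459045
```

Even at n = 1,000,000 the value is only a hair under e; the sequence keeps rising but never passes it.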
That’s e’s origin story. Jacob Bernoulli stumbled onto this limit in the 17th century while thinking about compound interest.
As a consequence, e behaves specially under differentiation
e also has a special face in the world of derivatives.
Differentiating a general exponential aˣ leaves an extra coefficient attached: (aˣ)′ = aˣ · log a
- 2ˣ → 2ˣ times about 0.693
- eˣ → eˣ, unchanged
- 10ˣ → 10ˣ times about 2.303
Only eˣ gets a coefficient of 1. Differentiate eˣ as many times as you like, it never changes shape. That a compound-interest constant also behaves cleanly under differentiation isn’t a coincidence — behind both sits a “rate of change proportional to the current value” structure. Digging into the reason is a detour; “the compound-interest limit also shows up naturally in differentiation” is plenty.
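Those coefficients can be measured directly. A sketch (helper name `deriv` is illustrative): divide the numerical slope of aˣ by aˣ itself, and what remains is the coefficient, which should equal log a.

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# The coefficient left behind by differentiating a^x is log a.
x = 1.0
for a in (2.0, math.e, 10.0):
    coeff = deriv(lambda t: a ** t, x) / a ** x
    print(a, round(coeff, 4), round(math.log(a), 4))  # 0.6931 / 1.0 / 2.3026
```

Only a = e produces a coefficient of exactly 1.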
The same reason makes the logarithm with base e (the natural log) differentiate cleanly: (log x)′ = 1/x
Other bases (10, 2) tack an extra coefficient onto the right side. Only base e produces the bare 1/x.
Why AI uses natural log and e everywhere
Revisiting the earlier articles: softmax’s eˣ, cross-entropy’s log, log-likelihood’s log p — AI formulas are soaked in e and log. The main reason is “derivatives come out clean.”
Training updates parameters in the direction of the derivative, and the cleaner the derivative looks, the simpler the math, code, and implementation become. Backpropagation (up next) multiplies derivatives through the chain rule, so having no stray coefficients dangling off each step makes a massive difference in practice.
When log appears in AI writing, you can read it as base e. Many AI writers don’t bother distinguishing log and ln. This convention is more engineering-flavored than textbook-style.
The chain rule is “outer × inner”
The most load-bearing differentiation rule in AI is the chain rule.
Japanese high school math calls it “differentiation of composite functions”; “chain rule” (連鎖律) is the wording more common at university level and in AI/ML contexts. Same thing either way.
For a composite function y = f(g(x)) — a function inside a function — if you let u = g(x), the chain rule reads: dy/dx = dy/du · du/dx
In words, three steps:
- Differentiate the outer function as usual
- Differentiate the inner function as usual
- Multiply the two
As a concrete example, differentiate y = (3x + 1)².
- Outer: u² → differentiates to 2u
- Inner: 3x + 1 → differentiates to 3
- Multiply: 2(3x + 1) · 3 = 6(3x + 1)
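The three steps can be sanity-checked numerically. A sketch (helper name `deriv` is illustrative) on the composite (3x + 1)²: differentiating outer and inner separately and multiplying gives the same number as differentiating the whole composite at once.

```python
def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Composite y = f(g(x)) with outer f(u) = u² and inner g(x) = 3x + 1.
f = lambda u: u ** 2
g = lambda x: 3 * x + 1

x = 2.0
chain = deriv(f, g(x)) * deriv(g, x)     # outer' at g(x), times inner' at x
direct = deriv(lambda t: f(g(t)), x)     # differentiate the whole composite
print(chain, direct)  # both ≈ 6·(3·2 + 1) = 42
```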
Think gear ratios. A small nudge of x turns the intermediate gear u. u turning turns the output y. Multiply the du/dx ratio with the dy/du ratio, and you get the full dy/dx ratio.
Why multiplication (not addition)?
“Why multiply, not add?” is a natural reaction. A slightly more math-y explanation:
A small change Δx in x produces approximately g′(x)·Δx change in u. The corresponding Δu in u produces approximately f′(u)·Δu change in y. Chaining these: Δy ≈ f′(u) · g′(x) · Δx.
Divide by Δx, take the limit as Δx → 0, and you get dy/dx = f′(g(x)) · g′(x). Rigorously there’s a case analysis for Δu = 0 scenarios, but intuitively “multiply the local rates” is correct. Leibniz notation makes it look like fractions cancelling, which is exactly because of this multiplicative structure in the background.
The chain rule’s role in AI
Neural networks stack layer on layer — the path from input to final loss is a deeply composite function.
```mermaid
flowchart LR
    X[Input x] --> L1[Layer 1: W1, b1]
    L1 --> L2[Layer 2: W2, b2]
    L2 --> L3[Layer 3: W3, b3]
    L3 --> L[Loss L]
```
To differentiate loss with respect to, say, a weight in layer 1, applying the chain rule repeatedly breaks the derivative into per-layer derivatives that get multiplied together. That’s exactly the skeleton of backpropagation (next article). “Just hammer the chain rule and you get gradients for all weights in a deep network” is a big part of why modern neural networks work.
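The “multiply per-layer derivatives” idea can be sketched with a toy scalar chain (all names here are illustrative, not from any library): each layer is a 1-D function, and the product of local slopes matches the end-to-end slope.

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# A toy 3-"layer" chain: each layer is just a 1-D function here.
layer1 = lambda x: 2 * x + 1
layer2 = lambda h: math.tanh(h)
layer3 = lambda h: h ** 2          # stand-in for a loss

x = 0.3
h1 = layer1(x)
h2 = layer2(h1)

# Chain rule: dL/dx = layer3'(h2) · layer2'(h1) · layer1'(x)
local_product = deriv(layer3, h2) * deriv(layer2, h1) * deriv(layer1, x)
end_to_end = deriv(lambda t: layer3(layer2(layer1(t))), x)
print(local_product, end_to_end)  # the two should match
```

Backprop does exactly this multiplication, just with matrices instead of scalars.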
Partial derivatives ∂f/∂x
So far everything has been single-variable functions. In AI training, the loss is a multivariable function of the neural network’s weights (parameters), which number in the millions to trillions. So differentiation needs to extend to functions with multiple variables.
Take a 2-variable function like f(x, y) = x² + y².
“If I move x alone, how does f change?” and “If I move y alone, how does f change?” are the questions we want to measure separately.
That’s the partial derivative.
- Fix y, differentiate with respect to x → ∂f/∂x
- Fix x, differentiate with respect to y → ∂f/∂y
∂ is read as “partial”, “round dee”, “del”, etc. Using ∂ instead of d advertises that “this is a derivative taken with the other variables fixed.” It’s a different-symbol-different-meaning declaration to distinguish from the single-variable d.
A concrete example
For f(x, y) = x² + y²:
- ∂f/∂x = 2x (y² is a constant from x’s perspective, so it differentiates to 0)
- ∂f/∂y = 2y (x² similarly drops out)
Regular differentiation plus “treat the other variables as constants” — that’s the whole rule change.
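“Nudge one coordinate, freeze the rest” translates directly to code. A sketch (the helper name `partial` is illustrative) computing both partials of f(x, y) = x² + y² numerically:

```python
def partial(f, point, i, h=1e-6):
    """Nudge only coordinate i of `point`, keeping the others fixed."""
    p_plus = list(point); p_plus[i] += h
    p_minus = list(point); p_minus[i] -= h
    return (f(*p_plus) - f(*p_minus)) / (2 * h)

# f(x, y) = x² + y²: ∂f/∂x = 2x, ∂f/∂y = 2y.
f = lambda x, y: x ** 2 + y ** 2
print(partial(f, (3.0, 4.0), 0))  # ≈ 6.0  (= 2·3)
print(partial(f, (3.0, 4.0), 1))  # ≈ 8.0  (= 2·4)
```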
Geometric picture
f(x, y) = x² + y² graphs as a bowl-shaped surface in 3D. Moving (x, y) around changes the height f.
∂f/∂x can be pictured as the slope of the cross-section you get by slicing the bowl with a plane of constant y. Fixing y leaves a single 2D curve; you’re measuring the slope of that curve at the point. ∂f/∂y is the same idea with the cutting plane rotated 90°.
Even with 3+ variables, “the slope of a 1-axis cross-section” stays the right mental image.
Gradient ∇f is a vector of partial derivatives
Stack all partial derivatives of a multivariable function into a vector and you get the gradient. The symbol is ∇ (nabla).
For f(x, y) = x² + y²: ∇f = (∂f/∂x, ∂f/∂y) = (2x, 2y)
It’s just a vector. The “arrow with a direction and length” from the second article on vectors comes back.
Two meanings: direction and length
A gradient vector carries two pieces of information.
- Direction: the direction in which the function increases most steeply
- Length: how fast it increases in that direction
“Most-steeply-increasing” flipped backward is “most-steeply-decreasing,” so if you compute the gradient ∇L of the loss (with w bundling up all model parameters) and nudge w in the opposite direction, loss goes down most steeply. That’s the core motion of gradient descent, the main subject of the next article.
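Both claims — the gradient stacks the partials, and stepping against it lowers the function — can be checked on the bowl f(x, y) = x² + y². A sketch (helper name `gradient` is illustrative):

```python
def gradient(f, point, h=1e-6):
    """Numerical gradient: one central-difference partial per coordinate."""
    grads = []
    for i in range(len(point)):
        p_plus = list(point); p_plus[i] += h
        p_minus = list(point); p_minus[i] -= h
        grads.append((f(p_plus) - f(p_minus)) / (2 * h))
    return grads

# Bowl: f(x, y) = x² + y², whose gradient is (2x, 2y).
f = lambda p: p[0] ** 2 + p[1] ** 2

point = [3.0, 4.0]
g = gradient(f, point)
print(g)  # ≈ [6.0, 8.0]

# Step a little in the *opposite* direction of the gradient: f goes down.
lr = 0.1
stepped = [p - lr * gi for p, gi in zip(point, g)]
print(f(point), f(stepped))  # 25.0 then ≈ 16.0 — lower
```

That single “step against the gradient” line is the whole motion the next article builds on.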
The gradient via contour lines
Take the “bowl” from the partial-derivative section and look at it from directly above. What you see is a set of concentric circles — contour lines, sets of points at the same height. Closer to the center is lower (the bottom of the bowl), farther out is higher.
Pick a point on the bowl’s slope and draw the gradient vector there: the gradient is always perpendicular to the contour line through that point, and it points toward higher values.
Why does “stacked partial derivatives” give the steepest direction?
A natural question: “Why does stacking partial derivatives together land on the steepest direction specifically?” The intuition in one pass:
The rate of change of f in the direction of any unit vector u is the dot product ∇f · u. The dot product (from the vectors article) is largest when the two vectors align, zero when they’re orthogonal, negative when opposite. So u aligned with the gradient gives the maximum rate of change.
The formal proof comes from the Cauchy–Schwarz inequality, but “the gradient is the direction maximizing the dot product” is enough for reading AI articles.
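The “dot product is maximized when aligned” claim can be swept numerically. A sketch, assuming the bowl’s gradient (6, 8) at the point (3, 4): try unit vectors at every whole degree and see which direction gives the largest rate of change.

```python
import math

# ∇f at (3, 4) for f = x² + y² is (6, 8) — a vector of length 10.
grad = (6.0, 8.0)

def rate_of_change(grad, angle):
    # Unit vector u at the given angle; directional derivative = ∇f · u.
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]

# Sweep directions: the maximum rate appears where u lines up with ∇f.
best = max(range(360), key=lambda d: rate_of_change(grad, math.radians(d)))
print(best, rate_of_change(grad, math.radians(best)))  # ≈ 53°, rate ≈ 10
```

53° is the direction of (6, 8) itself, and the maximum rate equals the gradient’s length.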
Same idea in massive dimensions
In real AI models, ∇L has as many components as the number of parameters — billions to trillions. 3+ dimensions can’t be visualized directly, but “the vector pointing in the direction that decreases loss fastest” is exactly the same concept as the 2D case.
Differentiation with respect to vectors and matrices
The gradient above is one specific case: “a scalar loss L differentiated with respect to a vector w.” AI articles have several other combinations on top of that. This isn’t high school material, so heavy derivations stay out of scope; but knowing the patterns keeps you from stalling on notation in papers.
| What / with respect to | Notation | Result shape | AI use |
|---|---|---|---|
| Scalar / vector | ∂L/∂w, ∇L | Vector (same shape as w) | Gradient descent’s main character — the gradient from the previous section |
| Scalar / matrix | ∂L/∂W, ∇_W L | Matrix (same shape as W) | Updating each layer’s weight matrix |
| Vector / vector | ∂y/∂x | Matrix (Jacobian) | Flowing gradients layer-to-layer in backprop |
| Vector / scalar | dr/dt | Vector | Position differentiated by time is velocity; again for acceleration. More physics than AI |
The underlying idea is the same in all cases: “each element of the result is a partial derivative with respect to one of the inputs.” Even when the notation looks imposing, the content is a table of partial derivatives.
For anyone who had high school physics before calculus, the last row (vector / scalar) will be the most familiar shape. A position vector r(t) differentiated with respect to time t is the velocity v(t); differentiated again, the acceleration a(t).
Incidentally, the memorized high school physics formulas (v = v₀ + at, x = v₀t + ½at² for constant-acceleration motion) can be derived mechanically from this relationship by integrating over time. From a calculus-first perspective, these are derivable formulas — there’s no need to memorize them.
Just remember the layout rules
- ∂L/∂W is a matrix with the same shape as W, whose (i, j)-th element is ∂L/∂W_ij
- The Jacobian ∂y/∂x is a matrix with output components down the rows and input components across the columns; the (i, j)-th element is ∂y_i/∂x_j
When ∇, ∂L/∂W, or a Jacobian shows up in a paper, reading it as “a multidimensional arrangement of element-wise partial derivatives” is enough.
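The layout rule is concrete enough to code. A sketch (the helper name `jacobian` is illustrative) building the Jacobian of a small vector-valued function numerically, one nudged input per column:

```python
def jacobian(f, x, h=1e-6):
    """Numerical Jacobian: row i = output component, column j = nudged input."""
    fx = f(x)
    J = []
    for i in range(len(fx)):
        row = []
        for j in range(len(x)):
            x_plus = list(x); x_plus[j] += h
            x_minus = list(x); x_minus[j] -= h
            row.append((f(x_plus)[i] - f(x_minus)[i]) / (2 * h))
        J.append(row)
    return J

# Vector-valued f(x, y) = (x·y, x + y): the exact Jacobian is [[y, x], [1, 1]].
f = lambda p: [p[0] * p[1], p[0] + p[1]]
for row in jacobian(f, [2.0, 3.0]):
    print(row)  # ≈ [3.0, 2.0], then ≈ [1.0, 1.0]
```

Every entry really is just one partial derivative, laid out output-by-row, input-by-column.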
The actual mechanics (how the chain rule turns into matrix multiplications in backprop) land in the next article alongside concrete examples.
Integration, briefly
A short detour into integration at the end.
Just as differentiation extracts slope, integration does the reverse — and the symbol is ∫.
Integration has two faces
Integration has two apparently-different viewpoints that are actually connected behind the scenes.
- The reverse of differentiation: a function F(x) satisfying F′(x) = f(x) is called an antiderivative of f(x). Finding one is called an indefinite integral, written ∫ f(x) dx = F(x) + C. The constant C disappears under differentiation, so there’s always a constant’s worth of ambiguity.
- Area under a curve: the area bounded by the graph of y = f(x) and the x-axis over an interval [a, b] is the definite integral, written ∫ₐᵇ f(x) dx.
These two connect through the fundamental theorem of calculus: ∫ₐᵇ f(x) dx = F(b) − F(a)
“The area equals the difference of the antiderivative at the endpoints.” Reverse-differentiating lets you compute areas. The full proof waits for real analysis at the university level, but the conclusion alone is enough for reading.
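The theorem is easy to witness numerically. A sketch (helper name `riemann` is illustrative): sum up thin rectangles under f(x) = x² on [0, 2], and compare against F(b) − F(a) for the antiderivative F(x) = x³/3.

```python
# Riemann-sum area under f(x) = x² on [0, 2] vs antiderivative F(x) = x³/3.
def riemann(f, a, b, n=100_000):
    dx = (b - a) / n
    # Midpoint rule: sample each thin strip at its center.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: x ** 2
F = lambda x: x ** 3 / 3

area = riemann(f, 0.0, 2.0)
print(area, F(2.0) - F(0.0))  # both ≈ 8/3 ≈ 2.6667
```

Brute-force area and “difference of the antiderivative at the endpoints” agree to many decimal places.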
The constant-acceleration motion again, via integration
Circling back to the earlier example. Given a constant acceleration a, integrating once and then twice over time: v(t) = v₀ + at, and x(t) = x₀ + v₀t + ½at².
And the memorized high-school formulas fall out cleanly. The constants v₀ and x₀ dropped in at each integration step are the initial velocity and the initial position.
Where integration shows up in AI
Integration in AI articles is sparse.
- Continuous probability distributions (expected value E[X] = ∫ x·p(x) dx)
- Stochastic differential equations in diffusion models
- Continuous entropy in information theory
In LLM-adjacent writing it barely shows up, so seeing ∫ and reading it as “reverse of differentiation” or “area” is enough.
What you can read at this point
With this article’s toolkit, a lot of training-side AI formulas become followable.
| Common form | Reading |
|---|---|
| dL/dw | How the loss shifts when the weight w is nudged |
| ∂L/∂wᵢ | Rate of change with respect to wᵢ, other weights fixed |
| ∇L | Gradient vector over all parameters |
| ∂L/∂W / ∇_W L | A matrix the same shape as W, containing each element’s partial derivative |
| ∂y/∂x | The Jacobian — a table of output × input partial derivatives |
| (eˣ)′ = eˣ | eˣ is unchanged by differentiation |
| dy/dx = dy/du · du/dx (chain rule) | Derivative of a composite function, the skeleton of backprop |
You don’t need to actually solve the formulas. As long as “what is the rate of change of what, with respect to what” is readable, the numbers on the gradient side of a paper or codebase stop feeling like just numbers.
Things you can skip for now
At the entry level, these can be ignored.
| Term | Summary | Why it’s safe to skip |
|---|---|---|
| ε-δ definition | Rigorous formalization of limits | The “nudge by something small” intuition is enough |
| Taylor expansion | Polynomial approximation of a function | Look it up when needed |
| Integration techniques | Substitution, integration by parts, etc. | Integration itself rarely shows up in LLM articles |
| Higher-order derivatives | Derivative applied multiple times | Show up in second-order optimization methods; knowing the name is enough |
| Hessian matrix | A matrix of second-order partial derivatives | Used inside optimizers; the name is enough |
| Implicit differentiation | Differentiation of functions defined by equations | Basically never shows up in AI articles |
For neural network training, the three things worth pinning down are: derivatives of power functions, derivatives of eˣ and log x, and the chain rule.
Bonus: when you actually need to compute derivatives
This article focuses on reading rather than computing, so the formulas for by-hand differentiation have been mostly skipped. Exams oriented toward deep learning (like the Japanese E-qualification) do make you compute derivatives though, so here’s a compact reference.
Product and quotient rules
Rules for differentiating the product or quotient of two functions:
- Product rule: (f(x)·g(x))′ = f′(x)·g(x) + f(x)·g′(x)
- Quotient rule: (f(x)/g(x))′ = (f′(x)·g(x) − f(x)·g′(x)) / g(x)²
As a concrete example, differentiate x²·eˣ: (x²·eˣ)′ = 2x·eˣ + x²·eˣ
Left half is “differentiate the first, keep the second as-is”, right half is “keep the first, differentiate the second”, then add.
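The product rule can be confirmed numerically on the example x²·eˣ. A sketch (helper name `deriv` is illustrative) comparing a finite-difference slope against the f′g + fg′ expansion:

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Product rule on f(x) = x² · e^x: (fg)' = f'g + fg'.
x = 1.0
lhs = deriv(lambda t: t ** 2 * math.exp(t), x)
rhs = 2 * x * math.exp(x) + x ** 2 * math.exp(x)   # f'·g + f·g'
print(lhs, rhs)  # both ≈ 3e ≈ 8.1548
```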
More chain rule examples
Standard patterns for applying the chain rule by hand.
| Original | Derivative |
|---|---|
| (3x + 1)⁵ | 5(3x + 1)⁴ · 3 = 15(3x + 1)⁴ |
| e^(2x) | 2e^(2x) |
| log(x² + 1) | 2x / (x² + 1) |
All handled by “differentiate outer × differentiate inner.”
General-base exponentials/logs, and inverse trig
AI sticks to base e, but exams mix in general bases and inverse trig functions.
| Original | Derivative |
|---|---|
| aˣ | aˣ · log a |
| log_a x | 1 / (x · log a) |
| arcsin x | 1 / √(1 − x²) |
| arctan x | 1 / (1 + x²) |
For base e, log e = 1, so these reduce to eˣ and 1/x. This is another place where e’s status as “the base where derivatives don’t get deformed” shows up.
Why (a^x)’ = a^x log a?
Tracing where this formula comes from. Any exponential aˣ can be rewritten using e: a = e^(log a)
(log here is the natural log, log_e.) Raising both sides to the x-th power: aˣ = e^(x·log a)
Differentiating e^(x·log a) via the chain rule splits into “outer × inner”:
- Outer: e^u → e^u (e’s special property at work)
- Inner: u = x·log a → log a (log a is a constant)
Multiplying: (aˣ)′ = e^(x·log a) · log a = aˣ · log a
When a = e, log e = 1, so the coefficient cleanly vanishes: (eˣ)′ = eˣ
“Why does only eˣ get a coefficient of 1?” The answer: log e = 1 makes the extra coefficient exactly 1.
And (log_a x)’
The log side follows the same pattern. Using the change-of-base formula: log_a x = log x / log a
1/log a is a constant, so it factors out during differentiation: (log_a x)′ = (1/log a) · (log x)′ = 1 / (x·log a)
For a = e, log a = 1 again, collapsing to 1/x.
Both exponentials and logs with a general base carry an extra coefficient, which vanishes cleanly only when a = e. At the formula level, that’s why e and the natural log are the permanent residents of AI math.
Related reading
- The small set of math that makes AI articles readable Weighted sums, sigmoid, softmax, the training loop. Part 1 of this series.
- Vectors and matrices, just enough to read AI articles Dot products, matrix products, transpose. The foundation for gradient vectors. Part 2 of this series.
- Probability and statistics, just enough to read AI articles Softmax, cross-entropy, perplexity — the loss functions we take derivatives of. Part 3 of this series.
- MoonshotAI (Kimi) proposes AttnRes — replacing residual connections with attention in Transformers, 1.25× compute efficiency Gradient vanishing and residual connections in practice. Sets the scene for the next article.
Glossary (feel free to skip)
| Term | Meaning |
|---|---|
| Derivative | The slope (instantaneous rate of change) of a function at a point |
| Derivative function | A function returning the point-by-point slope of the original function |
| d/dx | “Differentiate with respect to x” operator (Leibniz notation) |
| f′(x) | Derivative function (Lagrange notation) |
| e | ≈ 2.71828… Arises as the limit of (1 + 1/n)ⁿ as n → ∞, and is the unique base satisfying (eˣ)′ = eˣ |
| Natural logarithm | Logarithm with base e. Often written just as log in AI contexts |
| Chain rule | Rule for differentiating composite functions: dy/dx = dy/du · du/dx |
| Partial derivative | Rate of change when only one variable in a multivariable function is varied |
| Gradient | Vector of partial derivatives. Points in the direction of steepest increase |
| Jacobian | Matrix of partial derivatives of an output vector’s components with respect to an input vector’s components |
| Integral | Area under a curve. The reverse operation of differentiation |
Next up: gradient descent and backpropagation. The chain rule and gradient from this article take center stage.