Derivatives, just enough to read AI articles
After the previous probability and statistics article, LLM training logs and model cards become mostly readable. But if you try to follow how a model actually moves toward lower loss, calculus symbols start showing up. ∂L/∂w, ∇, d/dx — plenty of people close the tab right there.
Same stance as the previous three articles in the series: the goal isn’t to solve anything, just to be able to read.
No rigorous calculus rules here. d/dx, f′(x), ∂, ∇ — if you can read “what these four are doing”, most of the calculus in AI articles becomes followable.
Gradient descent and backpropagation lean directly on the chain rule and gradient, so they’re saved for the next article.
Derivatives extract “slope”
The core of differentiation is this: “At some point on a function, if you nudge the input slightly, how much does the output move?”
Think of it as the rate of change from middle/high school, but with the nudge shrunk to zero.
Graphically, the slope of the tangent line at a point on a curve is the derivative at that point.
- Steeply upward → large positive derivative
- Flat point (like a vertex) → derivative is 0
- Steeply downward → large negative derivative
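The “nudge the input, measure the output” definition can be tried directly in code. A minimal sketch (the helper name `numerical_derivative` is illustrative, not from the article): approximate the slope of x² at a point with a small finite nudge.

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward difference: nudge x by a small h and measure how much f moves."""
    return (f(x + h) - f(x)) / h

# Slope of f(x) = x² at x = 3; the exact derivative 2x gives 6.
slope = numerical_derivative(lambda t: t ** 2, 3.0)
print(slope)  # ≈ 6.000001 — the leftover comes from h not being truly zero
```

Shrinking `h` further pushes the approximation toward the exact slope, which is the “nudge shrunk to zero” idea in executable form.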
Japanese high school math splits this into two stages worth keeping in mind.
First, the slope at a specific point x = a is called the derivative value or differential coefficient — written f′(a).
Bundled back into a function that returns this slope for any x, you get the derivative function, written f′(x) or df/dx.
“Differentiating” a function means deriving this derivative function from the original f(x).
In AI, you compute ∂L/∂w — the derivative of the loss L with respect to a weight w — to see “how much the loss shifts if we nudge w a bit.” That the loss can be pushed down “in the direction that decreases it” is only possible because this slope can be measured.
Reading d/dx and f’(x)
Two notations for derivatives show up most often.
- Leibniz notation: df/dx or dy/dx
- Lagrange notation: f′(x) or y′
Both mean the same thing: “differentiate with respect to x.” The choice is about readability, not meaning.
- df/dx looks like a fraction, but “df” and “dx” aren’t quantities being divided independently. The shape captures the intuition of “how much f changes when x is nudged by an infinitesimal amount.”
- f′(x) is short, handy when the same function gets differentiated repeatedly mid-derivation.
- “With respect to x” means “measured against x as the thing being nudged.”
AI papers and explanations usually want to be explicit about which variable the derivative is taken with respect to, so Leibniz notation (∂L/∂w, etc.) is more common.
Why splitting the Leibniz fraction is allowed
Push into calculus a bit further — exam prep or university — and manipulations like rewriting dy/dx = f(x) as dy = f(x) dx show up (substitution in integrals, separable ODEs). “Isn’t dy/dx supposed to be a single symbol? How can we split it like a fraction?” is a fair thing to wonder. Short answer:
When Leibniz invented the notation in the 17th century, dy and dx really were treated as “infinitesimally small quantities” — dy/dx was an actual fraction. In modern rigorous calculus, dy/dx is treated as a single symbol, but the splitting manipulation is justified behind the scenes as a shorthand for the chain rule or the substitution theorem. The rigorous justification lives in differential forms or real analysis at the university level, but until then, treating it as “a handy tool that gives correct answers” is enough.
Getting a feel with concrete examples
Functions of the form xⁿ — like x², x³, x¹⁰ — are called power functions.
n is called the exponent; it says “how many times you multiply x by itself.” n doesn’t have to be a positive integer — negative (x⁻¹ = 1/x), fractional (x^(1/2) = √x) — all of it counts as a power function.
Differentiation feels most natural starting from power functions.
- x² differentiates to 2x
- x³ differentiates to 3x²
- x⁴ differentiates to 4x³
Line these up and the rule pops out: (xⁿ)′ = n·xⁿ⁻¹
“Bring the exponent down to the front, reduce the exponent by one.” It’s the main formula when computing by hand, and the form you see most often in AI articles.
| Original | Derivative |
|---|---|
| x² | 2x |
| x³ | 3x² |
| xⁿ | n·xⁿ⁻¹ |
| c (constant) | 0 |
The formula works as-is for negative or fractional n too. Just drop the exponent in front and decrement.
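The “works for negative and fractional exponents too” claim is easy to spot-check numerically. A sketch (the helper name `deriv` is illustrative) comparing a finite-difference slope against n·xⁿ⁻¹ for n = 1/2 and n = −1:

```python
def deriv(f, x, h=1e-6):
    # Central difference: nudge both ways for better accuracy than one-sided.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
for n in (0.5, -1.0):
    approx = deriv(lambda t: t ** n, x)
    exact = n * x ** (n - 1)   # bring n down in front, decrement the exponent
    print(n, round(approx, 6), round(exact, 6))
```

The two columns agree to several decimal places, fractional and negative exponents included.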
Constants differentiate to zero. The “slope of something that doesn’t change” is 0, which reads cleanly.
Commonly-used derivatives
For reading AI articles, this table covers almost everything.
The exponential function aˣ shows up here for the first time. It’s structurally the opposite of a power function xⁿ: the base is a constant, and the exponent is the variable.
Similar names, but where the variable x sits determines the family.
- Power function xⁿ: base is the variable, exponent is fixed
- Exponential function aˣ: base is fixed, exponent is the variable
eˣ is the exponential with a = e. The logarithm log x that shows up later is the inverse of the exponential — it returns “what power of e equals x.”
| Original | Derivative |
|---|---|
| xⁿ (power function) | n·xⁿ⁻¹ |
| eˣ (exponential, base e) | eˣ (unchanged) |
| log x (natural log) | 1/x |
| c (constant) | 0 |
No need to memorize everything. Just “eˣ stays unchanged under differentiation” and “log x differentiates to 1/x” already help a lot when reading.
Both of those have the base e / natural log in common — the next section gets into why.
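The two headline facts in the table can be verified numerically before any theory. A sketch (helper name `deriv` is illustrative) checking that the slope of eˣ is eˣ itself, and the slope of log x is 1/x:

```python
import math

def deriv(f, x, h=1e-6):
    # Central-difference approximation of the slope at x.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
print(deriv(math.exp, x), math.exp(x))  # the two match: (e^x)' = e^x
print(deriv(math.log, x), 1 / x)        # (log x)' = 1/x
```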
What e really is
The previous articles parked “e is the mysterious constant ≈ 2.71828…, it’ll come up in calculus” for later. Time to settle that.
Origin: the compound interest limit
e first showed up historically in compound interest. “If you borrow 1 at 100% annual interest and compound n times a year, what’s the maximum the debt reaches by year’s end?”
- 1 compounding: (1 + 1)¹ = 2
- 2 compoundings: (1 + 1/2)² = 2.25
- 4 compoundings: (1 + 1/4)⁴ ≈ 2.441
- 12 compoundings: (1 + 1/12)¹² ≈ 2.613
- Finer still…
Larger n gives larger values, but the sequence doesn’t blow up to infinity — it converges to a specific constant: e = lim (1 + 1/n)ⁿ as n → ∞ ≈ 2.71828…
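The convergence is quick to see in a few lines of code, computing (1 + 1/n)ⁿ for ever finer compounding:

```python
import math

# Compound (1 + 1/n)^n for ever finer n: the values creep up toward e.
for n in (1, 2, 4, 12, 365, 1_000_000):
    print(n, (1 + 1 / n) ** n)

print("e =", math.e)  # 2.718281828459045
```

Even at n = 1,000,000 the value is only a hair under e; the sequence keeps rising but never passes it.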
That’s e’s origin story. Jacob Bernoulli stumbled onto this limit in the 17th century while thinking about compound interest.
As a consequence, e behaves specially under differentiation
e also has a special face in the world of derivatives.
Differentiating a general exponential aˣ leaves an extra coefficient attached: (aˣ)′ = aˣ · log a
- 2ˣ → 2ˣ times about 0.693
- eˣ → eˣ, unchanged
- 10ˣ → 10ˣ times about 2.303
Only eˣ gets a coefficient of 1. Differentiate eˣ as many times as you like, it never changes shape. That a compound-interest constant also behaves cleanly under differentiation isn’t a coincidence — behind both sits a “rate of change proportional to the current value” structure. Digging into the reason is a detour; “the compound-interest limit also shows up naturally in differentiation” is plenty.
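Those coefficients can be measured directly. A sketch (helper name `deriv` is illustrative): divide the numerical slope of aˣ by aˣ itself, and what remains is the coefficient, which should equal log a.

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# The coefficient left behind by differentiating a^x is log a.
x = 1.0
for a in (2.0, math.e, 10.0):
    coeff = deriv(lambda t: a ** t, x) / a ** x
    print(a, round(coeff, 4), round(math.log(a), 4))  # 0.6931 / 1.0 / 2.3026
```

Only a = e produces a coefficient of exactly 1.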
The same reason makes the logarithm with base e (the natural log) differentiate cleanly: (log x)′ = 1/x
Other bases (10, 2) tack an extra coefficient onto the right side. Only base e produces the bare 1/x.
Why AI uses natural log and e everywhere
Revisiting the earlier articles: softmax’s eˣ, cross-entropy’s log, log-likelihood’s log p — AI formulas are soaked in e and log. The main reason is “derivatives come out clean.”
Training updates parameters in the direction of the derivative, and the cleaner the derivative looks, the simpler the math, code, and implementation become. Backpropagation (up next) multiplies derivatives through the chain rule, so having no stray coefficients dangling off each step makes a massive difference in practice.
When log appears in AI writing, you can read it as base e. Many AI writers don’t bother distinguishing log and ln. This convention is more engineering-flavored than textbook-style.
The chain rule is “outer × inner”
The most load-bearing differentiation rule in AI is the chain rule.
Japanese high school math calls it “differentiation of composite functions”; “chain rule” (連鎖律) is the wording more common at university level and in AI/ML contexts. Same thing either way.
For a composite function y = f(g(x)) — a function inside a function — if you let u = g(x), the chain rule reads: dy/dx = dy/du · du/dx
In words, three steps:
- Differentiate the outer function as usual
- Differentiate the inner function as usual
- Multiply the two
As a concrete example, differentiate y = (3x + 1)².
- Outer: u² → differentiates to 2u
- Inner: 3x + 1 → differentiates to 3
- Multiply: 2(3x + 1) · 3 = 6(3x + 1)
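The three steps can be sanity-checked numerically. A sketch (helper name `deriv` is illustrative) on the composite (3x + 1)²: differentiating outer and inner separately and multiplying gives the same number as differentiating the whole composite at once.

```python
def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Composite y = f(g(x)) with outer f(u) = u² and inner g(x) = 3x + 1.
f = lambda u: u ** 2
g = lambda x: 3 * x + 1

x = 2.0
chain = deriv(f, g(x)) * deriv(g, x)     # outer' at g(x), times inner' at x
direct = deriv(lambda t: f(g(t)), x)     # differentiate the whole composite
print(chain, direct)  # both ≈ 6·(3·2 + 1) = 42
```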
Think gear ratios. A small nudge of x turns the intermediate gear u. u turning turns the output y. Multiply the du/dx ratio with the dy/du ratio, and you get the full dy/dx ratio.
Why multiplication (not addition)?
“Why multiply, not add?” is a natural reaction. A slightly more math-y explanation:
A small change Δx in x produces approximately g′(x)·Δx change in u. The corresponding Δu in u produces approximately f′(u)·Δu change in y. Chaining these: Δy ≈ f′(u) · g′(x) · Δx.
Divide by Δx, take the limit as Δx → 0, and you get dy/dx = f′(g(x)) · g′(x). Rigorously there’s a case analysis for Δu = 0 scenarios, but intuitively “multiply the local rates” is correct. Leibniz notation makes it look like fractions cancelling, which is exactly because of this multiplicative structure in the background.
The chain rule’s role in AI
Neural networks stack layer on layer — the path from input to final loss is a deeply composite function.
```mermaid
flowchart LR
    X[Input x] --> L1[Layer 1: W1, b1]
    L1 --> L2[Layer 2: W2, b2]
    L2 --> L3[Layer 3: W3, b3]
    L3 --> L[Loss L]
```
To differentiate loss with respect to, say, a weight in layer 1, applying the chain rule repeatedly breaks the derivative into per-layer derivatives that get multiplied together. That’s exactly the skeleton of backpropagation (next article). “Just hammer the chain rule and you get gradients for all weights in a deep network” is a big part of why modern neural networks work.
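The “multiply per-layer derivatives” idea can be sketched with a toy scalar chain (all names here are illustrative, not from any library): each layer is a 1-D function, and the product of local slopes matches the end-to-end slope.

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# A toy 3-"layer" chain: each layer is just a 1-D function here.
layer1 = lambda x: 2 * x + 1
layer2 = lambda h: math.tanh(h)
layer3 = lambda h: h ** 2          # stand-in for a loss

x = 0.3
h1 = layer1(x)
h2 = layer2(h1)

# Chain rule: dL/dx = layer3'(h2) · layer2'(h1) · layer1'(x)
local_product = deriv(layer3, h2) * deriv(layer2, h1) * deriv(layer1, x)
end_to_end = deriv(lambda t: layer3(layer2(layer1(t))), x)
print(local_product, end_to_end)  # the two should match
```

Backprop does exactly this multiplication, just with matrices instead of scalars.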
Partial derivatives ∂f/∂x
So far everything has been single-variable functions. In AI training, the loss is a multivariable function of the neural network’s weights (parameters), which number in the millions to trillions. So differentiation needs to extend to functions with multiple variables.
Take a 2-variable function like f(x, y) = x² + y².
“If I move x alone, how does f change?” and “If I move y alone, how does f change?” are the questions we want to measure separately.
That’s the partial derivative.
- Fix y, differentiate with respect to x → ∂f/∂x
- Fix x, differentiate with respect to y → ∂f/∂y
∂ is read as “partial”, “round dee”, “del”, etc. Using ∂ instead of d advertises that “this is a derivative taken with the other variables fixed.” It’s a different-symbol-different-meaning declaration to distinguish from the single-variable d.
A concrete example
For f(x, y) = x² + y²:
- ∂f/∂x = 2x (y² is a constant from x’s perspective, so it differentiates to 0)
- ∂f/∂y = 2y (x² similarly drops out)
Regular differentiation plus “treat the other variables as constants” — that’s the whole rule change.
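“Nudge one coordinate, freeze the rest” translates directly to code. A sketch (the helper name `partial` is illustrative) computing both partials of f(x, y) = x² + y² numerically:

```python
def partial(f, point, i, h=1e-6):
    """Nudge only coordinate i of `point`, keeping the others fixed."""
    p_plus = list(point); p_plus[i] += h
    p_minus = list(point); p_minus[i] -= h
    return (f(*p_plus) - f(*p_minus)) / (2 * h)

# f(x, y) = x² + y²: ∂f/∂x = 2x, ∂f/∂y = 2y.
f = lambda x, y: x ** 2 + y ** 2
print(partial(f, (3.0, 4.0), 0))  # ≈ 6.0  (= 2·3)
print(partial(f, (3.0, 4.0), 1))  # ≈ 8.0  (= 2·4)
```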
Geometric picture
f(x, y) = x² + y² graphs as a bowl-shaped surface in 3D. Moving (x, y) around changes the height f.
∂f/∂x can be pictured as the slope of the cross-section you get by slicing the bowl with a plane of constant y. Fixing y leaves a single 2D curve; you’re measuring the slope of that curve at the point. ∂f/∂y is the same idea with the cutting plane rotated 90°.
Even with 3+ variables, “the slope of a 1-axis cross-section” stays the right mental image.
Gradient ∇f is a vector of partial derivatives
Stack all partial derivatives of a multivariable function into a vector and you get the gradient. The symbol is ∇ (nabla).
For f(x, y) = x² + y²: ∇f = (∂f/∂x, ∂f/∂y) = (2x, 2y)
It’s just a vector. The “arrow with a direction and length” from the second article on vectors comes back.
Two meanings: direction and length
A gradient vector carries two pieces of information.
- Direction: the direction in which the function increases most steeply
- Length: how fast it increases in that direction
“Most-steeply-increasing” flipped backward is “most-steeply-decreasing,” so if you compute the gradient ∇L of the loss (with w bundling up all model parameters) and nudge w in the opposite direction, loss goes down most steeply. That’s the core motion of gradient descent, the main subject of the next article.
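Both claims — the gradient stacks the partials, and stepping against it lowers the function — can be checked on the bowl f(x, y) = x² + y². A sketch (helper name `gradient` is illustrative):

```python
def gradient(f, point, h=1e-6):
    """Numerical gradient: one central-difference partial per coordinate."""
    grads = []
    for i in range(len(point)):
        p_plus = list(point); p_plus[i] += h
        p_minus = list(point); p_minus[i] -= h
        grads.append((f(p_plus) - f(p_minus)) / (2 * h))
    return grads

# Bowl: f(x, y) = x² + y², whose gradient is (2x, 2y).
f = lambda p: p[0] ** 2 + p[1] ** 2

point = [3.0, 4.0]
g = gradient(f, point)
print(g)  # ≈ [6.0, 8.0]

# Step a little in the *opposite* direction of the gradient: f goes down.
lr = 0.1
stepped = [p - lr * gi for p, gi in zip(point, g)]
print(f(point), f(stepped))  # 25.0 then ≈ 16.0 — lower
```

That single “step against the gradient” line is the whole motion the next article builds on.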
The gradient via contour lines
Take the “bowl” from the partial-derivative section and look at it from directly above. What you see is a set of concentric circles — contour lines, sets of points at the same height. Closer to the center is lower (the bottom of the bowl), farther out is higher.
Pick a point on the bowl’s slope and draw the gradient vector there: the gradient is always perpendicular to the contour line through that point, and it points toward higher values.
Why does “stacked partial derivatives” give the steepest direction?
A natural question: “Why does stacking partial derivatives together land on the steepest direction specifically?” The intuition in one pass:
The rate of change of f in the direction of any unit vector u is the dot product ∇f · u. The dot product (from the vectors article) is largest when the two vectors align, zero when they’re orthogonal, negative when opposite. So u aligned with the gradient gives the maximum rate of change.
The formal proof comes from the Cauchy–Schwarz inequality, but “the gradient is the direction maximizing the dot product” is enough for reading AI articles.
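The “dot product is maximized when aligned” claim can be swept numerically. A sketch, assuming the bowl’s gradient (6, 8) at the point (3, 4): try unit vectors at every whole degree and see which direction gives the largest rate of change.

```python
import math

# ∇f at (3, 4) for f = x² + y² is (6, 8) — a vector of length 10.
grad = (6.0, 8.0)

def rate_of_change(grad, angle):
    # Unit vector u at the given angle; directional derivative = ∇f · u.
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]

# Sweep directions: the maximum rate appears where u lines up with ∇f.
best = max(range(360), key=lambda d: rate_of_change(grad, math.radians(d)))
print(best, rate_of_change(grad, math.radians(best)))  # ≈ 53°, rate ≈ 10
```

53° is the direction of (6, 8) itself, and the maximum rate equals the gradient’s length.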
Same idea in massive dimensions
In real AI models, ∇L has as many components as the number of parameters — billions to trillions. 3+ dimensions can’t be visualized directly, but “the vector pointing in the direction that decreases loss fastest” is exactly the same concept as the 2D case.
Differentiation with respect to vectors and matrices
The gradient above is one specific case: “a scalar loss L differentiated with respect to a vector w.” AI articles have several other combinations on top of that. This isn’t high school material, so heavy derivations stay out of scope; but knowing the patterns keeps you from stalling on notation in papers.
| What / with respect to | Notation | Result shape | AI use |
|---|---|---|---|
| Scalar / vector | ∂L/∂w, ∇L | Vector (same shape as w) | Gradient descent’s main character — the gradient from the previous section |
| Scalar / matrix | ∂L/∂W, ∇_W L | Matrix (same shape as W) | Updating each layer’s weight matrix |
| Vector / vector | ∂y/∂x | Matrix (Jacobian) | Flowing gradients layer-to-layer in backprop |
| Vector / scalar | dr/dt | Vector | Position differentiated by time is velocity; again for acceleration. More physics than AI |
The underlying idea is the same in all cases: “each element of the result is a partial derivative with respect to one of the inputs.” Even when the notation looks imposing, the content is a table of partial derivatives.
For anyone who had high school physics before calculus, the last row (vector / scalar) will be the most familiar shape. A position vector r(t) differentiated with respect to time t is the velocity v(t); differentiated again, the acceleration a(t).
Incidentally, the memorized high school physics formulas (v = v₀ + at, x = v₀t + ½at² for constant-acceleration motion) can be derived mechanically from this relationship by integrating over time. From a calculus-first perspective, these are derivable formulas — there’s no need to memorize them.
Just remember the layout rules
- ∂L/∂W is a matrix with the same shape as W, whose (i, j)-th element is ∂L/∂W_ij
- The Jacobian ∂y/∂x is a matrix with output components down the rows and input components across the columns; the (i, j)-th element is ∂y_i/∂x_j
When ∇, ∂L/∂W, or a Jacobian shows up in a paper, reading it as “a multidimensional arrangement of element-wise partial derivatives” is enough.
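The layout rule is concrete enough to code. A sketch (the helper name `jacobian` is illustrative) building the Jacobian of a small vector-valued function numerically, one nudged input per column:

```python
def jacobian(f, x, h=1e-6):
    """Numerical Jacobian: row i = output component, column j = nudged input."""
    fx = f(x)
    J = []
    for i in range(len(fx)):
        row = []
        for j in range(len(x)):
            x_plus = list(x); x_plus[j] += h
            x_minus = list(x); x_minus[j] -= h
            row.append((f(x_plus)[i] - f(x_minus)[i]) / (2 * h))
        J.append(row)
    return J

# Vector-valued f(x, y) = (x·y, x + y): the exact Jacobian is [[y, x], [1, 1]].
f = lambda p: [p[0] * p[1], p[0] + p[1]]
for row in jacobian(f, [2.0, 3.0]):
    print(row)  # ≈ [3.0, 2.0], then ≈ [1.0, 1.0]
```

Every entry really is just one partial derivative, laid out output-by-row, input-by-column.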
The actual mechanics (how the chain rule turns into matrix multiplications in backprop) land in the next article alongside concrete examples.
Integration, briefly
A short detour into integration at the end.
Just as differentiation extracts slope, integration does the reverse — and the symbol is ∫.
Integration has two faces
Integration has two apparently-different viewpoints that are actually connected behind the scenes.
- The reverse of differentiation: a function F(x) satisfying F′(x) = f(x) is called an antiderivative of f(x). Finding one is called an indefinite integral, written ∫ f(x) dx = F(x) + C. The constant C disappears under differentiation, so there’s always a constant’s worth of ambiguity.
- Area under a curve: the area bounded by the graph of y = f(x) and the x-axis over an interval [a, b] is the definite integral, written ∫ₐᵇ f(x) dx.
These two connect through the fundamental theorem of calculus: ∫ₐᵇ f(x) dx = F(b) − F(a)
“The area equals the difference of the antiderivative at the endpoints.” Reverse-differentiating lets you compute areas. The full proof waits for real analysis at the university level, but the conclusion alone is enough for reading.
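The theorem is easy to witness numerically. A sketch (helper name `riemann` is illustrative): sum up thin rectangles under f(x) = x² on [0, 2], and compare against F(b) − F(a) for the antiderivative F(x) = x³/3.

```python
# Riemann-sum area under f(x) = x² on [0, 2] vs antiderivative F(x) = x³/3.
def riemann(f, a, b, n=100_000):
    dx = (b - a) / n
    # Midpoint rule: sample each thin strip at its center.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: x ** 2
F = lambda x: x ** 3 / 3

area = riemann(f, 0.0, 2.0)
print(area, F(2.0) - F(0.0))  # both ≈ 8/3 ≈ 2.6667
```

Brute-force area and “difference of the antiderivative at the endpoints” agree to many decimal places.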
The constant-acceleration motion again, via integration
Circling back to the earlier example. Given a constant acceleration a, integrating once and then twice over time: v(t) = v₀ + at, and x(t) = x₀ + v₀t + ½at².
And the memorized high-school formulas fall out cleanly. The constants v₀ and x₀ dropped in at each integration step are the initial velocity and the initial position.
Where integration shows up in AI
Integration in AI articles is sparse.
- Continuous probability distributions (expected value E[X] = ∫ x·p(x) dx)
- Stochastic differential equations in diffusion models
- Continuous entropy in information theory
In LLM-adjacent writing it barely shows up, so seeing ∫ and reading it as “reverse of differentiation” or “area” is enough.
What you can read at this point
With this article’s toolkit, a lot of training-side AI formulas become followable.
| Common form | Reading |
|---|---|
| dL/dw | How the loss shifts when the weight w is nudged |
| ∂L/∂wᵢ | Rate of change with respect to wᵢ, other weights fixed |
| ∇L | Gradient vector over all parameters |
| ∂L/∂W / ∇_W L | A matrix the same shape as W, containing each element’s partial derivative |
| ∂y/∂x | The Jacobian — a table of output × input partial derivatives |
| (eˣ)′ = eˣ | eˣ is unchanged by differentiation |
| dy/dx = dy/du · du/dx (chain rule) | Derivative of a composite function, the skeleton of backprop |
You don’t need to actually solve the formulas. As long as “what is the rate of change of what, with respect to what” is readable, the numbers on the gradient side of a paper or codebase stop feeling like just numbers.
Things you can skip for now
At the entry level, these can be ignored.
| Term | Summary | Why it’s safe to skip |
|---|---|---|
| ε-δ definition | Rigorous formalization of limits | The “nudge by something small” intuition is enough |
| Taylor expansion | Polynomial approximation of a function | Look it up when needed |
| Integration techniques | Substitution, integration by parts, etc. | Integration itself rarely shows up in LLM articles |
| Higher-order derivatives | Derivative applied multiple times | Show up in second-order optimization methods; knowing the name is enough |
| Hessian matrix | A matrix of second-order partial derivatives | Used inside optimizers; the name is enough |
| Implicit differentiation | Differentiation of functions defined by equations | Basically never shows up in AI articles |
For neural network training, the three things worth pinning down are: derivatives of power functions, derivatives of eˣ and log x, and the chain rule.
Bonus: when you actually need to compute derivatives
This article focuses on reading rather than computing, so the formulas for by-hand differentiation have been mostly skipped. Exams oriented toward deep learning (like the Japanese E-qualification) do make you compute derivatives though, so here’s a compact reference.
Product and quotient rules
Rules for differentiating the product or quotient of two functions:
- Product rule: (f(x)·g(x))′ = f′(x)·g(x) + f(x)·g′(x)
- Quotient rule: (f(x)/g(x))′ = (f′(x)·g(x) − f(x)·g′(x)) / g(x)²
As a concrete example, differentiate x²·eˣ: (x²·eˣ)′ = 2x·eˣ + x²·eˣ
Left half is “differentiate the first, keep the second as-is”, right half is “keep the first, differentiate the second”, then add.
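The product rule can be confirmed numerically on the example x²·eˣ. A sketch (helper name `deriv` is illustrative) comparing a finite-difference slope against the f′g + fg′ expansion:

```python
import math

def deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Product rule on f(x) = x² · e^x: (fg)' = f'g + fg'.
x = 1.0
lhs = deriv(lambda t: t ** 2 * math.exp(t), x)
rhs = 2 * x * math.exp(x) + x ** 2 * math.exp(x)   # f'·g + f·g'
print(lhs, rhs)  # both ≈ 3e ≈ 8.1548
```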
More chain rule examples
Standard patterns for applying the chain rule by hand.
| Original | Derivative |
|---|---|
| (3x + 1)⁵ | 5(3x + 1)⁴ · 3 = 15(3x + 1)⁴ |
| e^(2x) | 2e^(2x) |
| log(x² + 1) | 2x / (x² + 1) |
All handled by “differentiate outer × differentiate inner.”
General-base exponentials/logs, and inverse trig
AI sticks to base e, but exams mix in general bases and inverse trig functions.
| Original | Derivative |
|---|---|
| aˣ | aˣ · log a |
| log_a x | 1 / (x · log a) |
| arcsin x | 1 / √(1 − x²) |
| arctan x | 1 / (1 + x²) |
For base e, log e = 1, so these reduce to eˣ and 1/x. This is another place where e’s status as “the base where derivatives don’t get deformed” shows up.
Why (a^x)’ = a^x log a?
Tracing where this formula comes from. Any exponential aˣ can be rewritten using e: a = e^(log a)
(log here is the natural log, log_e.) Raising both sides to the x-th power: aˣ = e^(x·log a)
Differentiating e^(x·log a) via the chain rule splits into “outer × inner”:
- Outer: e^u → e^u (e’s special property at work)
- Inner: u = x·log a → log a (log a is a constant)
Multiplying: (aˣ)′ = e^(x·log a) · log a = aˣ · log a
When a = e, log e = 1, so the coefficient cleanly vanishes: (eˣ)′ = eˣ
“Why does only eˣ get a coefficient of 1?” The answer: log e = 1 makes the extra coefficient exactly 1.
And (log_a x)’
The log side follows the same pattern. Using the change-of-base formula: log_a x = log x / log a
1/log a is a constant, so it factors out during differentiation: (log_a x)′ = (1/log a) · (log x)′ = 1 / (x·log a)
For a = e, log a = 1 again, collapsing to 1/x.
Both exponentials and logs with a general base carry an extra coefficient, which vanishes cleanly only when a = e. At the formula level, that’s why e and the natural log are the permanent residents of AI math.
Related reading
- The small set of math that makes AI articles readable Weighted sums, sigmoid, softmax, the training loop. Part 1 of this series.
- Vectors and matrices, just enough to read AI articles Dot products, matrix products, transpose. The foundation for gradient vectors. Part 2 of this series.
- Probability and statistics, just enough to read AI articles Softmax, cross-entropy, perplexity — the loss functions we take derivatives of. Part 3 of this series.
- MoonshotAI (Kimi) proposes AttnRes — replacing residual connections with attention in Transformers, 1.25× compute efficiency Gradient vanishing and residual connections in practice. Sets the scene for the next article.
Glossary (feel free to skip)
| Term | Meaning |
|---|---|
| Derivative | The slope (instantaneous rate of change) of a function at a point |
| Derivative function | A function returning the point-by-point slope of the original function |
| d/dx | “Differentiate with respect to x” operator (Leibniz notation) |
| f′(x) | Derivative function (Lagrange notation) |
| e | ≈ 2.71828… Arises as the limit of (1 + 1/n)ⁿ as n → ∞, and is the unique base satisfying (eˣ)′ = eˣ |
| Natural logarithm | Logarithm with base e. Often written just as log in AI contexts |
| Chain rule | Rule for differentiating composite functions: dy/dx = dy/du · du/dx |
| Partial derivative | Rate of change when only one variable in a multivariable function is varied |
| Gradient | Vector of partial derivatives. Points in the direction of steepest increase |
| Jacobian | Matrix of partial derivatives of an output vector’s components with respect to an input vector’s components |
| Integral | Area under a curve. The reverse operation of differentiation |
Next up: gradient descent and backpropagation. The chain rule and gradient from this article take center stage.