
Derivatives, just enough to read AI articles

Ikesan

After the previous probability and statistics article, LLM training logs and model cards become mostly readable. But if you try to follow how a model actually moves toward lower loss, calculus symbols start showing up. \frac{dL}{dw}, \partial, \nabla — plenty of people close the tab right there.

Same stance as the previous three articles in the series: the goal isn’t to solve anything, just to be able to read.

No rigorous calculus rules here. \frac{d}{dx}, f'(x), \frac{\partial f}{\partial x}, \nabla f — if you can read “what these four are doing”, most of the calculus in AI articles becomes followable.

Gradient descent and backpropagation lean directly on the chain rule and gradient, so they’re saved for the next article.

Derivatives extract “slope”

The core of differentiation is this: “At some point on a function, if you nudge the input slightly, how much does the output move?”

Think of it as the rate of change from middle/high school, but with the nudge shrunk to zero.

Graphically, the slope of the tangent line at a point on a curve is the derivative at that point.

[Figure: three tangent lines on a parabola — negative slope on the left, zero at the vertex, positive on the right, labeled slope < 0, slope = 0, slope > 0]
The slope of the tangent line at a point is the derivative at that point. The value changes from point to point.
  • Steeply upward → large positive derivative
  • Flat point (like a vertex) → derivative is 0
  • Steeply downward → large negative derivative

Japanese high school math splits this into two stages worth keeping in mind. First, the slope at a specific point x = a is called the derivative value or differential coefficient — written f'(a). Bundled back into a function that returns this slope for any x, you get the derivative function, written f'(x) or \frac{df}{dx}. “Differentiating” a function means deriving this derivative function from the original f(x).

In AI, you compute \frac{dL}{dw} — the derivative of the loss L with respect to a weight w — to see “how much the loss shifts if we nudge w a bit.” That the loss can be pushed down “in the direction that decreases it” is only possible because this slope can be measured.
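
That “nudge and measure” can be sketched numerically with a finite difference. The toy loss L(w) = (w − 3)² below is a made-up example, not anything from a real model:

```python
# Finite-difference sketch of "nudge the input, watch the output".
# Hypothetical toy loss: L(w) = (w - 3)^2, smallest at w = 3.
def loss(w):
    return (w - 3) ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: approximately the slope of the tangent line at x.
    return (f(x + h) - f(x - h)) / (2 * h)

slope_at_1 = numerical_derivative(loss, 1.0)  # analytic dL/dw = 2(w - 3) = -4
slope_at_3 = numerical_derivative(loss, 3.0)  # flat at the minimum: 0
```

A negative slope at w = 1 says increasing w decreases the loss — exactly the signal training will exploit.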

Reading d/dx and f’(x)

Two notations for derivatives show up most often.

  1. Leibniz notation: \frac{d}{dx} f(x) or \frac{df}{dx}
  2. Lagrange notation: f'(x) or y'

Both mean the same thing: “differentiate f with respect to x.” The choice is about readability, not meaning.

  • \frac{df}{dx} looks like a fraction, but “df” and “dx” aren’t quantities being divided independently. The shape captures the intuition of “how much f changes when x is nudged by an infinitesimal amount.”
  • f'(x) is short, handy when the same function gets differentiated repeatedly mid-derivation.
  • “With respect to x” means “measured against the variable x as the thing being nudged.”

AI papers and explanations usually want to be explicit about which variable the derivative is taken with respect to, so Leibniz notation (\frac{\partial L}{\partial w}, etc.) is more common.

Why splitting the Leibniz fraction is allowed

Push into calculus a bit further — exam prep or university — and manipulations that go from \frac{dy}{dx} = f(x) to dy = f(x)\,dx show up (substitution in integrals, separable ODEs). “Isn’t it supposed to be a single symbol? How can we split it like a fraction?” is a fair thing to wonder. Short answer:

When Leibniz invented the notation in the 17th century, dx and dy really were treated as “infinitesimally small quantities” — an actual fraction. In modern rigorous calculus, \frac{dy}{dx} is treated as a single symbol, but the splitting manipulation is justified behind the scenes as a shorthand for the chain rule or the substitution theorem. The rigorous justification lives in differential forms or real analysis at the university level, but until then, treating it as “a handy tool that gives correct answers” is enough.

Getting a feel with concrete examples

Functions of the form y = x^n — like y = x, y = x^2, y = x^3 — are called power functions. n is called the exponent; it says “how many times you multiply x by itself.” n doesn’t have to be a positive integer — negative (x^{-1} = 1/x), fractional (x^{1/2} = \sqrt{x}), all of it counts as a power function.

Differentiation feels most natural starting from power functions.

  • y = x^2 differentiates to y' = 2x
  • y = x^3 differentiates to y' = 3x^2
  • y = x^{10} differentiates to y' = 10x^9

Line these up and the rule pops out:

(x^n)' = n x^{n-1}

“Bring the exponent down to the front, reduce the exponent by one.” It’s the main formula when computing by hand, and the form you see most often in AI articles.

| Original | Derivative |
| --- | --- |
| y = x | y' = 1 |
| y = x^2 | y' = 2x |
| y = x^3 | y' = 3x^2 |
| y = x^{-1} = 1/x | y' = -x^{-2} = -\dfrac{1}{x^2} |
| y = x^{1/2} = \sqrt{x} | y' = \dfrac{1}{2}x^{-1/2} = \dfrac{1}{2\sqrt{x}} |
| y = 5 (constant) | y' = 0 |

The formula (x^n)' = n x^{n-1} works as-is for negative or fractional n too. Just drop the exponent in front and decrement.
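
The rule can be spot-checked with a central difference; the evaluation point and exponents below are arbitrary picks:

```python
def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
# Compare the numerical slope against n * x^(n-1) for integer,
# negative, and fractional exponents alike.
checks = {n: (numerical_derivative(lambda t: t ** n, x), n * x ** (n - 1))
          for n in (1, 2, 3, 10, -1, 0.5)}
```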

Constants differentiate to zero. The “slope of something that doesn’t change” is 0, which reads cleanly.

Commonly-used derivatives

For reading AI articles, this table covers almost everything.

The exponential function a^x shows up here for the first time. It’s structurally the opposite of a power function x^n: the base a is a constant, and the exponent x is the variable. Similar names, but where x sits determines the family.

  • Power function x^n: base is variable, exponent is fixed
  • Exponential function a^x: base is fixed, exponent is variable

e^x is the exponential with a = e. The logarithm \log x that shows up later is the inverse of the exponential — it returns “what power of e equals x.”

| Original | Derivative |
| --- | --- |
| x^n (power function) | n x^{n-1} |
| e^x (exponential, base e) | e^x (unchanged) |
| \log x (natural log) | 1/x |
| \sin x | \cos x |
| \cos x | -\sin x |
| constant c | 0 |

No need to memorize everything. Just “e^x stays e^x under differentiation” and “\log x differentiates to 1/x” already help a lot when reading.

Both of those have the base e / natural log in common — the next section gets into why.
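
Both facts are easy to verify numerically with a central difference (the evaluation point 1.5 is arbitrary):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
d_exp = numerical_derivative(math.exp, x)  # should match e^x itself
d_log = numerical_derivative(math.log, x)  # should match 1/x
```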

What e really is

The previous articles parked “e is the mysterious constant ≈ 2.71828…, it’ll come up in calculus” for later. Time to settle that.

Origin: the compound interest limit

e first showed up historically in compound interest. “If you borrow 1 at 100% annual interest and compound n times a year, what’s the maximum the debt reaches by year’s end?”

  • 1 compounding: (1 + 1)^1 = 2
  • 2 compoundings: (1 + 1/2)^2 = 2.25
  • 4 compoundings: (1 + 1/4)^4 \approx 2.441
  • 12 compoundings: (1 + 1/12)^{12} \approx 2.613
  • Finer still…

Larger n gives larger values, but the sequence doesn’t blow up to infinity — it converges to a specific constant.

e = \lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^n \approx 2.71828\ldots

That’s e’s origin story. Jacob Bernoulli stumbled onto this limit in the 17th century while thinking about compound interest.
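
The compounding table above is a one-liner to reproduce:

```python
# (1 + 1/n)^n for increasingly fine compounding; approaches e ≈ 2.71828
values = {n: (1 + 1 / n) ** n for n in (1, 2, 4, 12, 365, 1_000_000)}
```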

As a consequence, e behaves specially under differentiation

This e also has a special face in the world of derivatives.

(e^x)' = e^x

Differentiating a general exponential a^x leaves an extra coefficient attached.

  • (2^x)' → 2^x times about 0.693
  • (e^x)' → e^x unchanged
  • (10^x)' → 10^x times about 2.303

Only e gets a coefficient of 1. Differentiate as many times as you like, e^x never changes shape. That a compound-interest constant also behaves cleanly under differentiation isn’t a coincidence — behind both sits a “rate of change proportional to the current value” structure. Digging into the reason is a detour; “the compound-interest limit also shows up naturally in differentiation” is plenty.
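
The stray coefficients can be measured directly — dividing the numerical derivative of a^x by a^x isolates them (the point x = 0.7 is arbitrary):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
# (a^x)' / a^x gives the constant coefficient for each base
coef_2  = numerical_derivative(lambda t: 2.0 ** t, x) / 2.0 ** x
coef_e  = numerical_derivative(math.exp, x) / math.exp(x)
coef_10 = numerical_derivative(lambda t: 10.0 ** t, x) / 10.0 ** x
```

coef_2 comes out near 0.693, coef_10 near 2.303, and only coef_e is 1 — the list above, recomputed.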

The same reason makes the logarithm with base e (natural log) differentiate cleanly.

(\log x)' = \frac{1}{x}

Other bases (10, 2) tack an extra coefficient onto the right side. Only base e produces the bare 1/x.

Why AI uses natural log and e everywhere

Revisiting the earlier articles: softmax’s e^{x_i}, cross-entropy’s \log q, log-likelihood’s \log P — AI formulas are soaked in e and \log. The main reason is “derivatives come out clean.”

Training updates parameters based on the derivative, and the cleaner the derivative looks, the simpler the math, code, and implementation become. Backpropagation (up next) multiplies derivatives through the chain rule, so having no stray coefficients dangling off each step makes a massive difference in practice.

When \log appears in AI writing, you can read it as base e. Many AI writers don’t bother distinguishing \ln and \log. This convention is more engineering-flavored than textbook-style.

The chain rule is “outer × inner”

The most load-bearing differentiation rule in AI is the chain rule. Japanese high school math calls it “differentiation of composite functions”; “chain rule” (連鎖律) is the wording more common at university level and in AI/ML contexts. Same thing either way.

For a composite function y = f(g(x)) — a function inside a function — if you let u = g(x), the chain rule reads:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

In words, three steps:

  1. Differentiate the outer function f as usual
  2. Differentiate the inner function g as usual
  3. Multiply the two

As a concrete example, differentiate y = (x^2 + 1)^3.

  • Outer: y = u^3 → differentiates to 3u^2
  • Inner: u = x^2 + 1 → differentiates to 2x
  • Multiply: 3u^2 \cdot 2x = 3(x^2+1)^2 \cdot 2x = 6x(x^2+1)^2
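
The same answer falls out of a finite difference, which is a handy sanity check when applying the chain rule by hand:

```python
def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
# y = (x^2 + 1)^3, differentiated numerically vs. the chain-rule answer
approx = numerical_derivative(lambda t: (t ** 2 + 1) ** 3, x)
exact = 6 * x * (x ** 2 + 1) ** 2  # chain rule: 6 * 2 * 25 = 300
```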

Think gear ratios. A small nudge of x turns the intermediate u; u turning turns the output y. Multiply the x \to u ratio with the u \to y ratio, and you get the full x \to y ratio.

Why multiplication (not addition)?

“Why multiply, not add?” is a natural reaction. A slightly more math-y explanation:

A small change \Delta x in x produces approximately \frac{du}{dx} \cdot \Delta x change in u. The corresponding change \Delta u in u produces approximately \frac{dy}{du} \cdot \Delta u change in y. Chaining these:

\Delta y \approx \frac{dy}{du} \cdot \frac{du}{dx} \cdot \Delta x

Divide by \Delta x, take the limit as \Delta x \to 0, and you get \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}. Rigorously there’s a case analysis for \Delta u = 0 scenarios, but intuitively “multiply the local rates” is correct. Leibniz notation makes it look like fractions cancelling, which is exactly because of this multiplicative structure in the background.

The chain rule’s role in AI

Neural networks stack layer on layer — the path from input to final loss is a deeply composite function.

```mermaid
flowchart LR
    X[Input x] --> L1[Layer 1: W1, b1]
    L1 --> L2[Layer 2: W2, b2]
    L2 --> L3[Layer 3: W3, b3]
    L3 --> L[Loss L]
```

To differentiate loss L with respect to, say, a weight W_1 in layer 1, applying the chain rule repeatedly breaks the derivative into per-layer derivatives that get multiplied together. That’s exactly the skeleton of backpropagation (next article). “Just hammer the chain rule and you get gradients for all weights in a deep network” is a big part of why modern neural networks work.

Partial derivatives ∂f/∂x

So far everything has been single-variable functions. In AI training, the loss is a multivariable function of the neural network’s weights (parameters), which number in the millions to trillions. So differentiation needs to extend to functions with multiple variables.

Take a 2-variable function like f(x, y) = x^2 + y^2. “If I move x alone, how does f change?” and “If I move y alone, how does f change?” are the questions we want to measure separately. That’s the partial derivative.

  • Fix y, differentiate with respect to x → \frac{\partial f}{\partial x}
  • Fix x, differentiate with respect to y → \frac{\partial f}{\partial y}

\partial is read as “partial”, “round dee”, “del”, etc. Using \partial instead of d advertises that “this is a derivative taken with the other variables fixed.” It’s a different-symbol-different-meaning declaration to distinguish from the single-variable \frac{df}{dx}.

A concrete example

For f(x, y) = x^2 + y^2:

  • \frac{\partial f}{\partial x} = 2x (y^2 is a constant from x’s perspective, so it differentiates to 0)
  • \frac{\partial f}{\partial y} = 2y (similarly, x^2 drops out)

Regular differentiation plus “treat the other variables as constants” — that’s the whole rule change.
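
Numerically, a partial derivative is just a finite difference where only one argument gets the nudge:

```python
def partial_x(f, x, y, h=1e-6):
    # nudge x only; y stays fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # nudge y only; x stays fixed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

def f(x, y):
    return x ** 2 + y ** 2

px = partial_x(f, 3.0, 4.0)  # analytic: 2x = 6
py = partial_y(f, 3.0, 4.0)  # analytic: 2y = 8
```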

Geometric picture

f(x, y) = x^2 + y^2 graphs as a bowl-shaped surface in 3D. Moving (x, y) around changes the height f.

\frac{\partial f}{\partial x} can be pictured as the slope of the cross-section you get by slicing the bowl with a plane of constant y. Fixing y leaves a single 2D curve; you’re measuring the slope of that curve at the point. \frac{\partial f}{\partial y} is the same idea with the cutting plane rotated 90°.

Even with 3+ variables, “the slope of a 1-axis cross-section” stays the right mental image.

Gradient ∇f is a vector of partial derivatives

Stack all partial derivatives of a multivariable function into a vector and you get the gradient. The symbol is \nabla (nabla).

For f(x, y) = x^2 + y^2:

\nabla f = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} = \begin{bmatrix} 2x \\ 2y \end{bmatrix}

It’s just a vector. The “arrow with a direction and length” from the second article on vectors comes back.

Two meanings: direction and length

A gradient vector carries two pieces of information.

  1. Direction: the direction in which the function increases most steeply
  2. Length: how fast it increases in that direction

“Most-steeply-increasing” flipped backward is “most-steeply-decreasing,” so if you compute the gradient \nabla L of the loss L(\theta) (with \theta bundling up all model parameters) and nudge \theta in the opposite direction, the loss goes down most steeply. That’s the core motion of gradient descent, the main subject of the next article.
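
A minimal sketch of that motion, using f(x, y) = x² + y² as a stand-in for a loss (the step size 0.1 is an arbitrary choice):

```python
def grad_f(x, y):
    # gradient of f(x, y) = x^2 + y^2
    return 2 * x, 2 * y

x, y = 3.0, 4.0
lr = 0.1  # hypothetical step size ("learning rate")
gx, gy = grad_f(x, y)
x, y = x - lr * gx, y - lr * gy  # step AGAINST the gradient

f_before = 3.0 ** 2 + 4.0 ** 2  # 25.0 at the starting point
f_after = x ** 2 + y ** 2       # smaller after one step
```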

The gradient via contour lines

Take the “bowl” from the partial-derivative section and look at it from directly above:

[Figure: concentric circles as contour lines of the bowl function, with the gradient vector ∇f extending outward from a point on one contour; the center is labeled min]
Concentric circles are contour lines; the center is the bowl's minimum. The gradient at any point is perpendicular to the contour there and points outward (toward higher values). Gradient descent moves in the opposite direction — straight toward the bottom of the valley.

The concentric circles are contour lines — sets of points at the same height. Closer to the center is lower (bottom of the bowl), farther out is higher.

The red dot is one point on the bowl’s slope; the red arrow is the gradient vector \nabla f at that point. As shown, the gradient is always perpendicular to the contour and points toward higher values.

Why does “stacked partial derivatives” give the steepest direction?

A natural question: “Why does stacking partial derivatives together land on the steepest direction specifically?” The intuition in one pass:

The rate of change in the direction of any unit vector \vec{v} is the dot product \nabla f \cdot \vec{v}. The dot product (from the vectors article) is largest when the two vectors align, zero when they’re orthogonal, negative when opposite. So \vec{v} aligned with the gradient gives the maximum rate of change.

The formal proof comes from Cauchy–Schwarz, but “the direction maximizing the dot product” is enough for reading AI articles.
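
A brute-force check of that claim in 2D, sweeping unit directions (cos θ, sin θ) around the circle:

```python
import math

# f(x, y) = x^2 + y^2 at the point (1, 2): gradient is (2, 4)
gx, gy = 2.0, 4.0
grad_norm = math.hypot(gx, gy)  # length of the gradient

def directional_rate(theta):
    # rate of change along the unit vector (cos theta, sin theta):
    # the dot product of the gradient with that direction
    return gx * math.cos(theta) + gy * math.sin(theta)

# sweep 3600 directions; the best rate should match the gradient's length
best = max(directional_rate(k * 2 * math.pi / 3600) for k in range(3600))
```

The sweep’s maximum matches the gradient’s length, attained in the gradient’s own direction.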

Same idea in massive dimensions

In real AI models, \nabla L has as many components as the number of parameters — billions to trillions. 3+ dimensions can’t be visualized directly, but “the vector pointing in the direction that decreases loss fastest” is exactly the same concept as the 2D case.

Differentiation with respect to vectors and matrices

The gradient above is one specific case: “a scalar function f differentiated with respect to a vector \vec{x}.” AI articles have several other combinations on top of that. This isn’t high school material, so heavy derivations stay out of scope; but knowing the patterns keeps you from stalling on notation in papers.

| What / with respect to | Notation | Result shape | AI use |
| --- | --- | --- | --- |
| Scalar / vector | \frac{\partial L}{\partial \vec{w}}, \nabla_{\vec{w}} L | Vector (same shape as \vec{w}) | Gradient descent’s main character — the gradient from the previous section |
| Scalar / matrix | \frac{\partial L}{\partial W}, \nabla_W L | Matrix (same shape as W) | Updating each layer’s weight matrix |
| Vector / vector | \frac{\partial \vec{y}}{\partial \vec{x}} | Matrix (Jacobian) | Flowing gradients layer-to-layer in backprop |
| Vector / scalar | \dfrac{d\vec{r}}{dt} | Vector | Position \vec{r} differentiated by time t is velocity; again for acceleration. More physics than AI |

The underlying idea is the same in all cases: “each element of the result is a partial derivative with respect to one of the inputs.” Even when the notation looks imposing, the content is a table of partial derivatives.

For anyone who had high school physics before calculus, the last row (vector / scalar) will be the most familiar shape. A position vector \vec{r}(t) differentiated with respect to time t is the velocity \vec{v} = d\vec{r}/dt; differentiated again, the acceleration \vec{a} = d\vec{v}/dt.

Incidentally, the memorized high school physics formulas (v = v_0 + at, x = x_0 + v_0 t + \frac{1}{2}at^2 for constant-acceleration motion) can be derived mechanically from this \vec{a} = d\vec{v}/dt relationship by integrating over time. From a calculus-first perspective, these are derivable formulas — there’s no need to memorize them.

Just remember the layout rules

  • \frac{\partial L}{\partial W} is a matrix with the same shape as W, whose (i, j)-th element is \frac{\partial L}{\partial W_{ij}}
  • The Jacobian \frac{\partial \vec{y}}{\partial \vec{x}} is a matrix with output components down the rows and input components across the columns; the (i, j)-th element is \frac{\partial y_i}{\partial x_j}

When \nabla_W L, \frac{\partial L}{\partial W}, or a Jacobian J shows up in a paper, reading it as “a multidimensional arrangement of element-wise partial derivatives” is enough.
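
Reading it that way suggests a direct construction — fill a table of finite differences, one column per input. The 2-in/2-out function below is a made-up example:

```python
def f(x):
    # toy vector-valued function: y0 = x0 + x1, y1 = x0 * x1
    return [x[0] + x[1], x[0] * x[1]]

def jacobian(f, x, h=1e-6):
    # J[i][j] = dy_i / dx_j, built column by column (nudge one input at a time)
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

J = jacobian(f, [2.0, 3.0])  # analytic Jacobian here: [[1, 1], [3, 2]]
```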

The actual mechanics (how the chain rule turns into matrix multiplications in backprop) land in the next article alongside concrete examples.

Integration, briefly

A short detour into integration at the end. Just as differentiation extracts slope, integration does the reverse — and the symbol is \int.

Integration has two faces

Integration has two apparently-different viewpoints that are actually connected behind the scenes.

  1. The reverse of differentiation: A function F(x) satisfying F'(x) = f(x) is called an antiderivative of f. Finding one is called an indefinite integral, written \int f(x)\, dx = F(x) + C. The constant C disappears under differentiation, so there’s always a constant’s worth of ambiguity.
  2. Area under a curve: The area bounded by the graph of f(x) and the x-axis over an interval [a, b] is the definite integral, written \int_a^b f(x)\, dx.

These two connect through the fundamental theorem of calculus:

\int_a^b f(x)\, dx = F(b) - F(a)

“The area equals the difference of the antiderivative at the endpoints.” Reverse-differentiating lets you compute areas. The full proof waits for real analysis at the university level, but the conclusion alone is enough for reading.
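
The theorem can be watched in action with a midpoint Riemann sum for f(x) = 2x, whose antiderivative is F(x) = x²:

```python
def riemann_area(f, a, b, n=100_000):
    # midpoint rule: sum of thin rectangles under the curve
    width = (b - a) / n
    return sum(f(a + (k + 0.5) * width) for k in range(n)) * width

area = riemann_area(lambda x: 2 * x, 1.0, 3.0)
exact = 3.0 ** 2 - 1.0 ** 2  # F(b) - F(a) = 8
```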

The constant-acceleration motion again, via integration

Circling back to the earlier example. Given a constant acceleration \vec{a}, integrating once and twice over time:

\vec{v}(t) = \int \vec{a}\, dt = \vec{a}\, t + \vec{v}_0

\vec{r}(t) = \int \vec{v}\, dt = \frac{1}{2}\vec{a}\, t^2 + \vec{v}_0 t + \vec{r}_0

And the memorized high-school formulas fall out cleanly. The constants \vec{v}_0 and \vec{r}_0 dropped in at each integration step are the initial velocity and initial position.

Where integration shows up in AI

Integration shows up only sparsely in AI articles.

  • Continuous probability distributions (expected value E[X] = \int x \cdot p(x)\, dx)
  • Stochastic differential equations in diffusion models
  • Continuous entropy in information theory

In LLM-adjacent writing it barely shows up, so seeing \int and reading it as “reverse of differentiation” or “area” is enough.

What you can read at this point

With this article’s toolkit, a lot of training-side AI formulas become followable.

| Common form | Reading |
| --- | --- |
| \frac{dL}{dw} | How the loss L shifts when the weight w is nudged |
| \frac{\partial L}{\partial w_i} | Rate of change with respect to w_i, other weights fixed |
| \nabla_\theta L | Gradient vector over all parameters |
| \frac{\partial L}{\partial W} / \nabla_W L | A matrix the same shape as W, containing each element’s partial derivative |
| \frac{\partial \vec{y}}{\partial \vec{x}} | The Jacobian — a table of output × input partial derivatives |
| (e^x)' | e^x unchanged |
| (\log x)' | 1/x |
| (f \circ g)' / chain rule | Derivative of a composite function, the skeleton of backprop |

You don’t need to actually solve the formulas. As long as “what is the rate of change of what, with respect to what” is readable, the numbers on the gradient side of a paper or codebase stop feeling like just numbers.

Things you can skip for now

At the entry level, these can be ignored.

| Term | Summary | Why it’s safe to skip |
| --- | --- | --- |
| ε-δ definition | Rigorous formalization of limits | The “nudge by something small” intuition is enough |
| Taylor expansion | Polynomial approximation of a function | Look it up when needed |
| Integration techniques | Substitution, integration by parts, etc. | Integration itself rarely shows up in LLM articles |
| Higher-order derivatives | Derivative applied multiple times | Used inside optimizers like Adam, but knowing the name is enough |
| Hessian matrix | A matrix of second-order partial derivatives | Used inside optimizers; the name is enough |
| Implicit differentiation | Differentiation of functions defined by equations | Basically never shows up in AI articles |

For neural network training, the three things worth pinning down are: derivatives of power functions, derivatives of e^x and \log x, and the chain rule.

Bonus: when you actually need to compute derivatives

This article focuses on reading rather than computing, so the formulas for by-hand differentiation have been mostly skipped. Exams oriented toward deep learning (like the Japanese E-qualification) do make you compute derivatives though, so here’s a compact reference.

Product and quotient rules

Rules for differentiating the product or quotient of two functions.

(fg)' = f'g + fg'

\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}

As a concrete example, differentiate y = x^2 e^x:

y' = 2x \cdot e^x + x^2 \cdot e^x = (2x + x^2)\, e^x

Left half is “differentiate the first, keep the second as-is”, right half is “keep the first, differentiate the second”, then add.
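
The product-rule example checks out against a finite difference (the evaluation point 1.2 is arbitrary):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.2
# y = x^2 e^x, numerically vs. the product-rule answer (2x + x^2) e^x
approx = numerical_derivative(lambda t: t ** 2 * math.exp(t), x)
exact = (2 * x + x ** 2) * math.exp(x)
```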

More chain rule examples

Standard patterns for applying the chain rule by hand.

| Original | Derivative |
| --- | --- |
| y = \sin(2x) | y' = 2\cos(2x) |
| y = e^{x^2} | y' = 2x \cdot e^{x^2} |
| y = \log(x^2 + 1) | y' = \dfrac{2x}{x^2 + 1} |

All handled by “differentiate outer × differentiate inner.”

General-base exponentials/logs, and inverse trig

AI sticks to base e, but exams mix in general bases and inverse trig functions.

| Original | Derivative |
| --- | --- |
| a^x | a^x \log a |
| \log_a x | \dfrac{1}{x \log a} |
| \tan x | \dfrac{1}{\cos^2 x} |
| \arctan x | \dfrac{1}{1 + x^2} |

For base e, \log e = 1, so these reduce to (e^x)' = e^x and (\log x)' = 1/x. This is another place where e’s status as “the base where derivatives don’t get deformed” shows up.
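
The general-base formulas can be bundled into one numerical check (the evaluation point 1.3 is arbitrary):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
exp_err, log_err = {}, {}
for a in (2.0, math.e, 10.0):
    # (a^x)' should be a^x log a; (log_a x)' should be 1 / (x log a)
    exp_err[a] = abs(numerical_derivative(lambda t: a ** t, x)
                     - a ** x * math.log(a))
    log_err[a] = abs(numerical_derivative(lambda t: math.log(t, a), x)
                     - 1 / (x * math.log(a)))
```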

Why (a^x)’ = a^x log a?

Tracing where this formula comes from. Any exponential a^x can be rewritten using e:

a = e^{\log a}

(\log here is the natural log, \log_e.) Raising both sides to the x-th power:

a^x = \left(e^{\log a}\right)^x = e^{x \log a}

Differentiating e^{x \log a} via the chain rule splits into “outer × inner”:

  • Outer: (e^u)' = e^u (e’s special property at work)
  • Inner: (x \log a)' = \log a (\log a is a constant)

Multiplying:

(a^x)' = e^{x \log a} \cdot \log a = a^x \cdot \log a

When a = e, \log e = 1, and the coefficient cleanly vanishes:

(e^x)' = e^x \cdot 1 = e^x

“Why does only e get a coefficient of 1?” The answer: \log e = 1 cancels the extra \log a exactly.

And (log_a x)’

The log side follows the same pattern. Using the change-of-base formula:

\log_a x = \frac{\log x}{\log a}

\log a is a constant, so it factors out during differentiation.

(\log_a x)' = \frac{(\log x)'}{\log a} = \frac{1}{x \log a}

For a = e, \log e = 1 again, collapsing to (\log x)' = 1/x.

Exponentials and logs with a general base a both carry an extra \log a coefficient, which vanishes cleanly only when a = e. At the formula level, that’s why e and natural log are the permanent residents of AI math.


Glossary (feel free to skip)

| Term | Meaning |
| --- | --- |
| Derivative | The slope (instantaneous rate of change) of a function at a point |
| Derivative function | A function returning the point-by-point slope of the original function |
| \frac{d}{dx} | “Differentiate with respect to x” operator (Leibniz notation) |
| f'(x) | Derivative function (Lagrange notation) |
| e | ≈ 2.71828… Arises as the limit (1+1/n)^n, and is the unique base satisfying (e^x)' = e^x |
| Natural logarithm | Logarithm with base e. Often written just as \log in AI contexts |
| Chain rule | Rule for differentiating composite functions: \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} |
| Partial derivative \partial | Rate of change when only one variable in a multivariable function is varied |
| Gradient \nabla f | Vector of partial derivatives. Points in the direction of steepest increase |
| Jacobian | Matrix of partial derivatives of an output vector’s components with respect to an input vector’s components |
| Integral \int | Area under a curve. The reverse operation of differentiation |

Next up: gradient descent and backpropagation. The chain rule and gradient from this article take center stage.