
Probability and statistics, just enough to read AI articles


Keep reading AI articles and, right after the general math symbols, probability and statistics notation starts showing up. Stuff like P(x|y), log, or H(p, q) suddenly lining up is enough to make you close the tab.

This article covers those probability and statistics symbols — with the goal of being able to read them, not solve them. It follows the previous math and vectors and matrices articles as the third in the series, but it’s written to stand on its own too.

No rigorous derivation of Bayes or maximum likelihood here. Conditional probability, cross-entropy, perplexity, and temperature are enough to make LLM training logs and model cards readable.

Probability is “how likely, on a 0-to-1 scale”

Start with notation.

The most common form in AI articles is P(x). It represents “the probability that event x happens”.

  • P(x) = 0: never happens
  • P(x) = 1: happens for sure
  • P(x) = 0.3: happens 30% of the time

That’s the whole reading. Nothing tricky. Think of the weather report’s “70% chance of rain” as P(rain) = 0.7, and you’re set.

When there are multiple possible outcomes, their probabilities always add up to 1.

P(sunny) + P(cloudy) + P(rainy) = 1

It’s just saying “one of them has to happen.” Probability distributions and softmax, which come later, all exist to uphold this “sums to 1” rule.

Conditional probability P(A | B) is “the probability of A given B happened”

Next up is the version with a vertical bar: P(A | B). It reads as “the probability that A happens, given that B has happened”. Right of the bar is the condition; left of the bar is what you want to know.

  • P(rain | cloudy): probability of rain given that it’s cloudy
  • P(positive | infected): probability the test returns positive given the person is infected
  • P(infected | positive): probability the person is actually infected given the test came back positive

The trick is that the last two look similar but mean totally different things — swap the sides of the bar and the meaning flips. This “flipping sides” story leads to Bayes’ theorem, but this article doesn’t chase it.

LLMs predict the next word via conditional probability

What LLMs do, stripped down, is compute this conditional probability.

P(next token | tokens so far)

For an input like “The weather today is”, the LLM computes a probability for every token in its vocabulary (tens of thousands of them).

| Candidate | Probability |
| --- | --- |
| sunny | 0.42 |
| cloudy | 0.18 |
| rainy | 0.15 |
| nice | 0.08 |
| … (thousands more) | remainder |

A list of probabilities that sums to 1 falls out, and picking one (or taking the top one) is one step of generation. This “list” is what the next section calls a probability distribution, and softmax is the device that turns scores into this list.
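
That one generation step can be sketched in a few lines of Python. The candidates and probabilities below are the illustrative numbers from the table above, not real model output:

```python
import random

# Illustrative next-token distribution (from the table above);
# "other" stands in for the thousands of remaining tokens.
candidates = ["sunny", "cloudy", "rainy", "nice", "other"]
probs = [0.42, 0.18, 0.15, 0.08, 0.17]  # sums to 1

# One generation step: sample a token according to its probability.
token = random.choices(candidates, weights=probs, k=1)[0]
print(token)  # e.g. "sunny" (most likely, but not guaranteed)
```

Run it a few times and “sunny” comes up most often, but not always; that randomness is exactly what sampling from a distribution means.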

A probability distribution is “probabilities lined up per candidate”

Line up probabilities per candidate like in the table above, and you have a probability distribution. “Distribution” sounds fancy, but it’s just a row of probabilities.

  • Finite candidates → discrete distribution (LLM token prediction lives here)
  • Continuous values → continuous distribution (like the normal distribution)

AI articles mostly deal with discrete distributions, showing up as “a probability vector over the whole vocabulary.” Each element of the vector is “the probability that token is picked”, and they sum to 1.

So the LLM’s output layer is returning a tens-of-thousands-of-dimensions probability distribution.

Softmax is “a device that turns scores into a probability distribution”

Inside the model, each candidate gets a raw “score” (any value, positive or negative). At this stage it’s not a probability yet — the numbers don’t sum to 1, and some are negative.

softmax converts this into a probability distribution.

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

Plenty of symbols, but it’s two steps:

  1. Push each score through e^x to make it positive (e^x is always positive, no matter the input)
  2. Divide by the sum so everything adds to 1

What comes out is a set of probabilities that sum to 1. Details on softmax are in the previous math article, so for this article, “a device that turns scores into a distribution” is enough.

The LLM’s final output and image classifiers’ class decisions are both looking at the probability distribution after softmax.
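
The two steps are short enough to write out. A minimal Python sketch (the max-subtraction line is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(scores):
    """Turn raw scores (any real numbers) into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in e^x and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # step 1: make everything positive
    total = sum(exps)
    return [e / total for e in exps]          # step 2: divide by the sum

probs = softmax([2.0, 1.0, -1.0])
print(probs)       # three positive numbers, highest score first
print(sum(probs))  # adds up to 1 (up to float rounding)
```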

Expected value E[X] is another name for “weighted average”

Next up: the expected value E[X]. “Expected” makes it sound like a prediction, but it’s just a weighted average.

E[X] = Σ_x x · P(x)

All it does is sum up “value x × its probability P(x)”.

For example, the expected value of a die roll is

E[X] = 1·(1/6) + 2·(1/6) + ⋯ + 6·(1/6) = 3.5

Not that every roll gives you 3.5 — it means “if you roll many times, you’ll settle around this value.”

Compare this with the weighted sum from the previous article, y = w₁x₁ + w₂x₂ + ⋯, and the shape is identical. The weights are probabilities now, but the operation is the same. Reinforcement learning’s “expected reward”, attention’s “weighted retrieval” — they all sit on top of this weighted average.
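
As a quick check, the die-roll expected value in Python:

```python
# Expected value E[X] = sum of (value × probability).
values = [1, 2, 3, 4, 5, 6]
prob = 1 / 6  # fair die: each face equally likely

expected = sum(x * prob for x in values)
print(expected)  # 3.5 (up to tiny float rounding)
```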

Variance and standard deviation are “how spread out”

Variance and standard deviation are probably the most off-putting pair at the entrance of statistics. The actual formulas are these.

  • Variance: Var(X) = E[(X − E[X])²]
  • Standard deviation: σ = √Var(X)

The symbols are nested, which makes it hard to read. Taken step by step, it gets much friendlier.

What does “spread out” even mean?

Concrete first. Two classes of test scores:

  • Class A: 50, 50, 50, 50, 50 (average 50)
  • Class B: 30, 40, 50, 60, 70 (average 50)

Same average of 50, but Class A is dead on the average while Class B is all over the place. Quantifying this “same average, different feel” is what a spread measure tries to do.

Step 1: How far each value is from the average

First, take each value and subtract the average. For Class B:

| Value | Average | Deviation |
| --- | --- | --- |
| 30 | 50 | −20 |
| 40 | 50 | −10 |
| 50 | 50 | 0 |
| 60 | 50 | +10 |
| 70 | 50 | +20 |

This is the X − E[X] piece. E[X] is the expected value (average) from earlier, and X − E[X] captures “how far, and in which direction, a value is from the average.” The innermost piece of the formula is plain subtraction.

Step 2: Square the deviations so positive and negative don’t cancel

Add these deviations up and they cancel out to zero: (−20) + (−10) + 0 + 10 + 20 = 0. This can’t tell “not spread out at all” apart from “coincidentally balanced above and below.”

The fix is to square each deviation before summing. (−20)² = 400, 10² = 100 — they all become positive, so nothing cancels.

| Value | Deviation | Squared deviation |
| --- | --- | --- |
| 30 | −20 | 400 |
| 40 | −10 | 100 |
| 50 | 0 | 0 |
| 60 | +10 | 100 |
| 70 | +20 | 400 |

That’s the (X − E[X])² piece — step 1 with squaring on top.

“Couldn’t we just use absolute values?” Sure, but absolute values bend the graph and make calculus messy, so squaring wins on mathematical convenience.

Step 3: The average of squared deviations is the variance

Sum them and divide by the count — take the average.

(400 + 100 + 0 + 100 + 400) / 5 = 200

200 is the variance. The outermost E[·] in the formula is “take the average”, wrapping around the (X − E[X])² inside.

Var(X) = E[(X − E[X])²]

Read from the inside out: “compute deviation → square it → take the average”, and steps 1 through 3 are literally nested inside the formula.

For reference, running the same calculation on Class A (all 50s) gives deviations of 0, squared deviations of 0, average of 0, and a variance of 0. “Totally uniform” maps cleanly to “indicator = 0.”

Step 4: Take the square root to get back to the original unit

Variance is useful but has one awkward quality. Since we squared the deviations, the unit is also squared (score² or m², for example).

Taking the square root brings it back to the original unit — that’s the standard deviation.

σ = √Var(X) = √200 ≈ 14.1

So Class B reads as “scores scattered around the average of 50, by roughly 14 points.” Variance 200 makes you ask “200 what?”; standard deviation 14.1 gives a direct “plus/minus 14 around the mean” feel.
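
Steps 1 through 4 for Class B, written out in Python:

```python
import math

scores = [30, 40, 50, 60, 70]  # Class B

mean = sum(scores) / len(scores)            # step 0: the average (50)
deviations = [x - mean for x in scores]     # step 1: X - E[X]
squared = [d ** 2 for d in deviations]      # step 2: square so signs don't cancel
variance = sum(squared) / len(squared)      # step 3: average of squared deviations
std_dev = math.sqrt(variance)               # step 4: back to the original unit

print(variance)  # 200.0
print(std_dev)   # about 14.14
```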

Where variance and standard deviation show up in AI

In AI articles, three common contexts:

  • Data preprocessing (standardization, normalization) to “rescale to mean 0, variance 1”
  • Batch norm / layer norm does something similar internally
  • As a health indicator for model outputs or training stability

No need to compute anything — “a measure of spread” is plenty to read AI articles with.

Covariance: do two values move together?

Variance measured “how far one value strays from its average.” Extend this to two values, and “do they move together, or opposite to each other?” is what covariance captures.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

When X goes above its average and Y also goes above its average, the product is positive × positive, so positive value accumulates. When X goes up and Y goes down, it’s positive × negative, so negative value accumulates. If they’re unrelated, the positives and negatives cancel out and you land near 0.

Set X = Y and you get Cov(X, X) = E[(X − E[X])²], which is just the variance. Variance is “covariance with yourself”; covariance is “variance for two variables.”

Non-AI example: FFT and correlation coefficient

Variance and covariance show up outside AI too. In my earlier karaoke scoring article, I used the correlation coefficient to measure how close an input singing voice was to a reference after running an FFT to decompose the audio into frequency components.

ρ_AB = Cov(A, B) / (s(A) · s(B))

s(A) and s(B) are the standard deviations of each. Raw covariance is hard to compare across datasets because its magnitude depends on the original scale, so dividing by the product of standard deviations rescales it to the −1 to 1 range. This is the standard recipe for “not whether the values are the same, but whether they move in the same direction.”

The same idea shows up in AI too — embedding similarities, feature preprocessing, etc. Variance, covariance, and standard deviation are, ultimately, tools for normalizing raw scales, and they appear identically in AI and non-AI contexts.
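
A minimal Python sketch of covariance and the correlation coefficient; the input lists are made-up illustrative data, not FFT output:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Average of (X - E[X]) * (Y - E[Y]) over paired values."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance rescaled by the standard deviations, landing in [-1, 1]."""
    sx = math.sqrt(covariance(xs, xs))  # std dev = sqrt of "covariance with yourself"
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

a = [1, 2, 3, 4, 5]
print(correlation(a, [2, 4, 6, 8, 10]))  # approximately 1.0: move together perfectly
print(correlation(a, [10, 8, 6, 4, 2]))  # approximately -1.0: move exactly opposite
```

Note how `correlation` reuses `covariance(xs, xs)` for the standard deviations; that’s the “variance is covariance with yourself” identity from above, in code.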

Logarithm log is “the tool that lets you add probabilities instead of multiplying them”

Probability formulas often have log show up out of nowhere. A lot of people hated log in high school, and “multiplication becomes addition” doesn’t help if “why?” and “where do I ever use this?” are still in the air. Let’s clear those two up before getting to AI.

What is log\log again?

log answers “to go from one number to another, what exponent do I need?” log₁₀ 100 = 2 means “to get from 10 to 100, raise it to the 2nd power.” The base (the small subscript) might be 10, or e, but the “what exponent” idea is common to all. While we’re here: base-10 log is called the common logarithm, and base-e log is the natural logarithm. The mysterious e makes a brief appearance in the calculus article later, so for now, just know “bases like e exist” and move on.

| Expression | Reading |
| --- | --- |
| log₁₀ 10 = 1 | To go from 10 to 10, exponent 1 (itself) |
| log₁₀ 100 = 2 | To go from 10 to 100, exponent 2 |
| log₁₀ 1000 = 3 | To go from 10 to 1000, exponent 3 |
| log₁₀ 10000 = 4 | To go from 10 to 10000, exponent 4 |

Line them up and the original values grow exponentially (10×, 100×, …), while the log values step up linearly: 1, 2, 3, 4. Think of it as “close to counting digits.”

Why multiplication turns into addition

With the “what exponent” lens, log(a·b) = log a + log b isn’t mysterious.

Look at 100 × 1000 = 100000 through log₁₀.

  • log₁₀ 100 = 2 (10 to 100 needs exponent 2)
  • log₁₀ 1000 = 3 (10 to 1000 needs exponent 3)
  • log₁₀ 100000 = 5 (10 to 100000 needs exponent 5 — namely 2 + 3)

Words like “100000 is 10 to the 5th” don’t click for everyone, so follow it as a formula. The key is the pre-existing rule for exponents: “multiply → exponents add.”

10² × 10³ = 10^{2+3} = 10⁵

Multiply “10 twice over” by “10 three times over” and you end up with “10 five times over.” Plain enough.

Rewrite this under log₁₀:

log₁₀(10² × 10³) = log₁₀ 10² + log₁₀ 10³ = 2 + 3 = 5

Multiplication on the left-hand side becomes addition on the right. log(a·b) = log a + log b is just the existing “exponents add” rule viewed from the log side. That’s what “multiplication turns into addition” is.

Bonus: another rule — you can pull exponents out front

In the equations above, log₁₀ 10² = 2 slipped through without justification. That used a different rule.

log aⁿ = n log a

“An exponent inside a log can come out in front as a coefficient.” With this, log₁₀ 10² lands at 2 via this path:

log₁₀ 10² = 2 log₁₀ 10 = 2 × 1 = 2

It’s the sibling rule to log(a·b) = log a + log b, and it’s handy for quickly computing individual log values.

Where does log show up?

“I did logs in school but I’ve never used them in real life,” a lot of people think. Actually a surprising number of everyday indicators are logarithmic under the hood. Humans perceive “times-over” changes more evenly than absolute changes, so laying them out on a log scale lines them up at equal intervals, which is easier to work with.

| Indicator | What it measures on a log scale | Rule of thumb |
| --- | --- | --- |
| Earthquake magnitude | Energy of the quake | +1 magnitude ≈ 32× the energy |
| Sound in decibels (dB) | Intensity of sound | +10 dB = 10× the energy |
| pH | Hydrogen ion concentration | −1 pH = 10× the ion concentration |
| Stellar magnitude | Brightness of stars | 5-magnitude difference = 100× brightness |
| Musical octave | Frequency | +1 octave = 2× frequency |

They all take a “doubling/decupling” quantity and map it onto equal intervals. log is less a math trick and more a tool for squeezing a wide-range quantity into a human-sized number.

Why AI uses log

Two main reasons AI uses log:

  1. To avoid underflow when multiplying probabilities
  2. Because derivatives (slopes) during training come out cleaner

For #1, an LLM multiplies thousands to tens of thousands of token probabilities, and the result quickly slips into 0.00000… underflow territory. Take the log and multiplication becomes addition, so you’re just summing small negative numbers — safe territory.
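
A small Python experiment makes #1 concrete. The probability 0.01 and the 200 repetitions are arbitrary illustrative numbers:

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
p = 0.01
product = 1.0
for _ in range(200):
    product *= p
print(product)  # 0.0; the true value 1e-400 is below the float range

# ... but summing their logs stays perfectly representable.
log_sum = sum(math.log(p) for _ in range(200))
print(log_sum)  # about -921.03 (= 200 * log(0.01))
```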

#2 is a calculus topic, so it’s pushed to next time. Short version: rewriting things in log form makes the training update rule cleaner.

When you see log P(x) in a formula, read it as “probabilities would collapse under multiplication, so we switched to an additive form.” The base depends on context — sometimes e, sometimes 2 — but in AI it’s almost always the natural log (base e).

Likelihood is “the probability of the observed data, as seen by the model”

Now into training.

Likelihood is a word that rarely comes up elsewhere, but its definition is simple. “The probability that, under this model (or parameters), the actually observed data is produced” — that’s the likelihood.

“Training” is ultimately the process of adjusting the model’s parameters so that this likelihood gets larger. A model that can produce the observed data with higher probability is better at explaining the data. That’s the underlying idea.

Likelihood itself is a long product, so it’s usually converted to a log and written as log-likelihood.

log P(data | model) = Σ_i log P(x_i | model)

“Sum the logs of each data point’s probability” — that’s the shape. Moving parameters in the direction that makes this larger is the idea known as maximum likelihood estimation. The name is worth remembering; reading it as “make the observed data’s probability larger” is enough.

Cross-entropy is “the training loss itself”

The most common probability formula in AI articles is cross-entropy. It’s used as the loss for classification and LLM training.

What is “loss” anyway?

One term to pin down first. Loss is the model’s prediction-vs-truth gap, condensed into a single number. Training is the process of nudging parameters to reduce this loss. What matters is “how you quantify the gap,” and for classification and LLMs, cross-entropy is the standard recipe.

The formula

H(p, q) = −Σ_x p(x) log q(x)

Plenty of symbols, but two roles:

  • p(x): the ground-truth distribution (which one is correct)
  • q(x): the model’s predicted distribution (softmax output)

For LLMs and classification, the truth is often “just one class is 1, the rest are 0” (one-hot). In that case, the formula collapses dramatically.

H(p, q) = −log q(correct class)

It’s literally “the log of the probability the model assigned to the correct class, with a minus sign in front.”

  • High confidence on the right class (q ≈ 1) → log q ≈ 0 → loss close to 0
  • Low confidence on the right class (q ≈ 0) → log q is large and negative → loss is huge

LLM training computes this cross-entropy per token and updates parameters in the direction that reduces the average. The right-falling loss curves in pre-training plots are basically looking at the average of this cross-entropy.
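
The one-hot case is short enough to write out. A Python sketch with made-up probabilities, where index 0 plays the role of the correct token:

```python
import math

def cross_entropy_loss(predicted_probs, correct_index):
    """One-hot cross-entropy: minus the log of the probability on the correct class."""
    return -math.log(predicted_probs[correct_index])

# Model's softmax output over 4 candidates; suppose index 0 is the correct token.
confident = [0.9, 0.05, 0.03, 0.02]
unsure    = [0.1, 0.5, 0.3, 0.1]

print(cross_entropy_loss(confident, 0))  # about 0.105: close to 0
print(cross_entropy_loss(unsure, 0))     # about 2.303: much larger
```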

Why the minus sign?

The minus at the front is the kind of thing people quietly wonder about. “Couldn’t we just use positive values for the gap?” The reason is simple: q is a probability in the [0, 1] range, so log q is always ≤ 0.

  • q = 1 (fully correct) → log q = 0
  • q = 0.5 → log q ≈ −0.69
  • q → 0 (completely wrong) → log q → −∞

If the negative-log intuition doesn’t land, rewriting q = 1/2 in exponent form helps.

log(1/2) = log(2⁻¹) = −log 2

Using the “bonus” rule from before (“exponents come out in front”), the −1 drops down from the exponent and the negative sign stays. Any number between 0 and 1 can be written as a⁻ⁿ with a positive exponent, so its log ends up negative by construction.

Leaving the loss negative would make it confusing: “do we want it close to 0, or deep in the negatives?” Flipping the sign up front makes everything non-negative and turns it into “0 is ideal, bigger is worse” — a natural direction. Entropy in the next section also carries a minus in front of the formula; same reason — log p is negative, and the minus turns it back into a positive number.

Why “cross” entropy?

For those curious about the name. There’s a quantity called entropy, H(p) = −Σ p(x) log p(x), which measures the uncertainty of a distribution. Cross-entropy brings a second distribution q into the picture — hence “cross.”

And about that name “entropy”

“Entropy” might ring a bell from Puella Magi Madoka Magica where Kyubey kept going on about it, or from science class as “the stuff that makes things get more disordered when left alone.” That version is thermodynamic entropy — the degree of disorder of a physical system (how many microscopic states it can occupy).

The information-theoretic H(p) = −Σ p(x) log p(x) has the same mathematical shape and the same conceptual spirit, which is why the name is reused. Here it means “how unpredictable the distribution is”: high when you can’t guess which outcome will show up, low when one outcome is basically locked in. If you read “disorder = hard to predict next”, the thermodynamic and information-theoretic entropies are mostly the same idea, staged in two different worlds.

This article won’t dive deeper into information-theoretic entropy, but “cross-entropy measures how close two distributions are” is enough reading — close → small, far → large.

KL divergence is “the gap between two distributions”

Another one that shows up in training and distribution comparison: KL divergence. It’s written like D_KL(p ∥ q), with a double vertical bar.

In Japanese materials or exams (like the G-test), you’ll also see it called KL information or Kullback-Leibler information. They all refer to the same thing — if you see the “information” version in a different article, mentally translate. Also, “KL” / “Kullback-Leibler” isn’t some abbreviation or code name — it comes from Solomon Kullback and Richard Leibler, the two mathematicians who proposed the concept in 1951.

D_KL(p ∥ q) = Σ_x p(x) log(p(x) / q(x))

It measures “how different p and q are” — 0 if they match, bigger the farther apart they are. It’s not called a distance, though. Swap p and q and the value changes, which violates the conditions for a mathematical distance.

Its relation to cross-entropy is:

H(p, q) = H(p) + D_KL(p ∥ q)

When the ground-truth distribution p is fixed, H(p) is constant, so minimizing cross-entropy is the same as minimizing KL divergence. “Reduce cross-entropy = bring the predicted distribution closer to the truth” is a valid reading.
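
A minimal Python sketch of the formula, with made-up two-outcome distributions chosen to show the asymmetry:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    # Skip zero-probability terms of p; by convention 0 * log(0/q) = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # about 0.368
print(kl_divergence(q, p))  # about 0.511; swapping p and q changes the value
```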

In AI articles, KL divergence shows up in RLHF, DPO and other RL-style methods, in distillation where you’re matching distributions, in VAE losses, and more.

Perplexity is “cross-entropy lifted to an exponent”

The LLM evaluation metric you see most is perplexity (PPL). You’ll run into things like “PPL 3.5” in model cards and papers.

Note: not to be confused with the AI search product Perplexity AI — this is a metric. Same name, no relation.

PPL = e^{H(p, q)}

e raised to the power of the cross-entropy. Since cross-entropy lives in the log domain, exponentiating lifts it back to the original scale.

The meaning: “on average, how many candidates is the model effectively choosing between when predicting the next token?”

  • PPL = 1: always picks the right one (perfect)
  • PPL = 10: wobbling between about 10 choices on average
  • PPL = hundreds to thousands: mostly flailing

Smaller = stronger model. Essentially the same metric as cross-entropy — drop one, the other drops too. So when papers or release notes say “loss went down” and “perplexity improved”, they’re saying the same thing from two angles.
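
The conversion is one line. A Python sketch with an arbitrary illustrative loss value:

```python
import math

# Average per-token cross-entropy (in nats) from a hypothetical eval run.
avg_cross_entropy = 1.25

ppl = math.exp(avg_cross_entropy)
print(ppl)  # about 3.49: "choosing between roughly 3.5 candidates on average"

# Sanity checks at the extremes:
print(math.exp(0.0))           # 1.0: always right
print(math.exp(math.log(10)))  # about 10: wobbling between roughly 10 choices
```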

Temperature is “the knob that controls how sharp the distribution is”

temperature — the one you see in generation settings — is softmax with one extra parameter.

softmax(x_i / T)

Divide the scores by T before running them through softmax.

  • T = 1: standard softmax
  • T < 1: sharpens the distribution (top tokens become even more dominant)
  • T > 1: flattens the distribution (low-probability tokens get a better shot)
  • T → 0: only the top token gets chosen (equivalent to argmax)

```mermaid
flowchart LR
    A[Scores] --> B[Divide by T]
    B --> C[softmax to distribution]
    C --> D[Sampling]
```

Raise the temperature for more randomness; lower it for more stable output. The temperature setting in OpenAI, Anthropic, and local LLM inference configs all works the same way.
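
A Python sketch of temperature applied before softmax. The scores are made up, and the max-subtraction is a numerical-stability trick rather than part of the definition:

```python
import math

def softmax_with_temperature(scores, T=1.0):
    """Divide scores by T, then apply standard softmax."""
    scaled = [s / T for s in scores]
    m = max(scaled)  # subtract the max for numerical safety
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]
standard = softmax_with_temperature(scores, T=1.0)  # standard softmax
sharp = softmax_with_temperature(scores, T=0.5)     # sharper: top score dominates more
flat = softmax_with_temperature(scores, T=2.0)      # flatter: gaps shrink

print(standard)
print(sharp)
print(flat)
```

Compare the first element of each list: the top token’s probability grows as T shrinks and falls as T grows, exactly the sharpen/flatten behavior described above.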

Why “temperature”?

The name comes from the Boltzmann distribution in physics. In statistical mechanics, the probability that a particle occupies state ii is written:

P(i) ∝ e^{−E_i / kT}

E_i is the state’s energy, T is temperature, k is Boltzmann’s constant. Look at the shape, and it lines up exactly with the temperature-softmax e^{x_i / T} from before. Treat the score x_i as “negative energy” −E_i, and the temperature T in the denominator is the same T.

The real-world behavior of temperature matches the distribution shift:

  • High temperature → particles bounce around, most states accessible → distribution flattens
  • Low temperature → particles lock into the lowest-energy state → distribution sharpens

AI sampling’s “crank up the temperature for random outputs, drop it for stability” borrows this physical intuition directly. That’s where the name comes from.

How it relates to top-k and top-p (nucleus) sampling

Top-k and top-p often show up alongside temperature.

  • top-k: keep only the k highest-probability tokens, zero the rest
  • top-p: accumulate probability from the top until the total crosses p, cut it off there

Temperature changes “the shape of the distribution”; top-k / top-p changes “which candidates are in the pool.” Combining them — drop the tail so low-probability tokens don’t misfire while keeping some randomness — is the default recipe.
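
A minimal Python sketch of top-p filtering, reusing the illustrative weather tokens from earlier; a real sampler would then draw from the surviving distribution:

```python
def top_p_filter(candidates, probs, p=0.9):
    """Keep the most-probable tokens until their cumulative probability crosses p."""
    ranked = sorted(zip(candidates, probs), key=lambda t: t[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the survivors so they form a distribution again.
    total = sum(prob for _, prob in kept)
    return [(token, prob / total) for token, prob in kept]

dist = top_p_filter(["sunny", "cloudy", "rainy", "nice"], [0.5, 0.3, 0.15, 0.05], p=0.9)
print(dist)  # the 0.05 tail ("nice") is cut; the rest are rescaled to sum to 1
```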

Enough to read LLM training logs and sampling settings

With the tools from this article, most of the formulas and numbers in LLM-related articles start being followable.

| Common form | Reading |
| --- | --- |
| P(x) | Probability that x happens |
| P(A \| B) | Probability of A given B happened |
| P(next token \| past) | The LLM’s next-token prediction itself |
| softmax | Device that turns scores into a probability distribution |
| E[X] | Weighted average: value × probability, summed |
| log P(x) | A form that lets you add instead of multiply |
| H(p, q) | Cross-entropy — the training loss |
| D_KL(p ∥ q) | Gap between two distributions |
| PPL | Exponentiated cross-entropy — smaller is stronger |
| T (temperature) | The knob that controls sampling sharpness |

You don’t need to actually solve the formulas — just being able to see “what this is doing at a high level” turns the numbers in papers and model cards into something more than decoration.

Stuff you can skip for now

At the entry level, these can be ignored:

| Term | Summary | Why it’s safe to skip |
| --- | --- | --- |
| Set notation (∪, ∩, ∅, ∈) | Math notation for treating events as sets | Shows up in rigorous probability definitions, but AI articles overwhelmingly use function forms like P(x) or P(x\|y) |
| Bayes’ theorem | Formula for flipping conditional probabilities | Knowing the name is enough; read up when you actually need it |
| Maximum likelihood (MLE) | Find parameters that maximize likelihood | Same story as “reducing cross-entropy = increasing log-likelihood” |
| Normal distribution | The bell curve that shows up for continuous values | LLM articles mostly deal with discrete distributions; “bell-shaped distribution” is enough to skim it |
| Covariance matrix | Multivariate variances packaged as a matrix | More relevant to generative models or classical stats than LLMs |
| Jaccard, correlation coefficient | Various similarity / correlation indices | Look up when you actually encounter them |
| Beta, Dirichlet distributions | Distributions for distributions | Lots of names, narrow use cases |

For reading AI articles, the high-frequency items in probability and statistics are really just conditional probability and cross-entropy, so starting there is the most efficient path.


Glossary (feel free to skip)

| Term | Meaning |
| --- | --- |
| Probability P(x) | A number from 0 to 1 representing how likely event x is |
| Conditional probability P(A \| B) | Probability of A given B has happened |
| Probability distribution | Probabilities lined up per candidate; they sum to 1 |
| Discrete distribution | A distribution over a finite set of candidates; LLM token prediction is this |
| softmax | Function that turns scores into a probability distribution |
| Expected value E[X] | Value × probability, summed — a weighted average |
| Variance, standard deviation | Measures of spread |
| Covariance | Whether two values move together or oppositely; covariance with oneself is variance |
| Correlation coefficient | Covariance divided by the product of standard deviations, rescaled to [−1, 1] |
| Likelihood | Probability of the observed data under the model |
| Log-likelihood | Log of the likelihood; additive, so easier to work with during training |
| Cross-entropy | Gap between the truth and predicted distributions; used as training loss |
| Entropy | Uncertainty of a distribution |
| KL divergence | Gap between two distributions; 0 if identical, larger as they diverge. Also called KL information or Kullback-Leibler information |
| Perplexity | Exponentiated cross-entropy — the go-to LLM evaluation metric. Also called PPL |
| Temperature | Parameter that adjusts the sharpness of the distribution during sampling |

Next up: calculus, specifically gradient descent and backpropagation, in the same “read, don’t solve” style.