
Probability and statistics, just enough to read AI articles


Keep reading AI articles and, right after the general math symbols, probability and statistics notation starts showing up. Stuff like P(x|y), log, or H(p, q) suddenly lining up is enough to make you close the tab.

This article covers those probability and statistics symbols — with the goal of being able to read them, not solve them. It follows the previous math and vectors and matrices articles as the third in the series, but it’s written to stand on its own too.

No rigorous derivation of Bayes or maximum likelihood here. Conditional probability, cross-entropy, perplexity, and temperature are enough to make LLM training logs and model cards readable.

Probability is “how likely, on a 0-to-1 scale”

Start with notation.

The most common form in AI articles is P(x). It represents “the probability that event x happens”.

  • P(x) = 0: never happens
  • P(x) = 1: happens for sure
  • P(x) = 0.3: happens 30% of the time

That’s the whole reading. Nothing tricky. Think of the weather report’s “70% chance of rain” as P(rain) = 0.7, and you’re set.

When there are multiple possible outcomes, their probabilities always add up to 1.

P(sunny) + P(cloudy) + P(rainy) = 1

It’s just saying “one of them has to happen.” Probability distributions and softmax, which come later, all exist to uphold this “sums to 1” rule.

Conditional probability P(A | B) is “the probability of A given B happened”

Next up is the version with a vertical bar: P(A | B). It reads as “the probability that A happens, given that B has happened”. Right of the bar is the condition; left of the bar is what you want to know.

  • P(rain | cloudy): probability of rain given that it’s cloudy
  • P(positive | infected): probability the test returns positive given the person is infected
  • P(infected | positive): probability the person is actually infected given the test came back positive

The trick is that the last two look similar but mean totally different things — swap the sides of the bar and the meaning flips. This “flipping sides” story leads to Bayes’ theorem, but this article doesn’t chase it.

LLMs predict the next word via conditional probability

What LLMs do, stripped down, is compute this conditional probability.

P(next token | tokens so far)

For an input like “The weather today is”, the LLM computes a probability for every token in its vocabulary (tens of thousands of them).

| Candidate | Probability |
| --- | --- |
| sunny | 0.42 |
| cloudy | 0.18 |
| rainy | 0.15 |
| nice | 0.08 |
| … (thousands more) | remainder |

A list of probabilities that sums to 1 falls out, and picking one (or taking the top one) is one step of generation. This “list” is what the next section calls a probability distribution, and softmax is the device that turns scores into this list.
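
That one generation step can be sketched in a few lines of Python. The candidates and probabilities below are the illustrative numbers from the table above, not real model output:

```python
import random

# Illustrative next-token distribution (from the table above);
# "other" stands in for the thousands of remaining tokens.
candidates = ["sunny", "cloudy", "rainy", "nice", "other"]
probs = [0.42, 0.18, 0.15, 0.08, 0.17]  # sums to 1

# One generation step: sample a token according to its probability.
token = random.choices(candidates, weights=probs, k=1)[0]
print(token)  # e.g. "sunny" (most likely, but not guaranteed)
```

Run it a few times and “sunny” comes up most often, but not always; that randomness is exactly what sampling from a distribution means.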

A probability distribution is “probabilities lined up per candidate”

Line up probabilities per candidate like in the table above, and you have a probability distribution. “Distribution” sounds fancy, but it’s just a row of probabilities.

  • Finite candidates → discrete distribution (LLM token prediction lives here)
  • Continuous values → continuous distribution (like the normal distribution)

AI articles mostly deal with discrete distributions, showing up as “a probability vector over the whole vocabulary.” Each element of the vector is “the probability that token is picked”, and they sum to 1.

So the LLM’s output layer is returning a tens-of-thousands-of-dimensions probability distribution.

Softmax is “a device that turns scores into a probability distribution”

Inside the model, each candidate gets a raw “score” (any value, positive or negative). At this stage it’s not a probability yet — the numbers don’t sum to 1, and some are negative.

softmax converts this into a probability distribution.

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

Plenty of symbols, but it’s two steps:

  1. Push each score through e^x to make it positive (e^x is always positive, no matter the input)
  2. Divide by the sum so everything adds to 1

What comes out is a set of probabilities that sum to 1. Details on softmax are in the previous math article, so for this article, “a device that turns scores into a distribution” is enough.

The LLM’s final output and image classifiers’ class decisions are both looking at the probability distribution after softmax.
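
The two steps are short enough to write out. A minimal Python sketch (the max-subtraction line is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(scores):
    """Turn raw scores (any real numbers) into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in e^x and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # step 1: make everything positive
    total = sum(exps)
    return [e / total for e in exps]          # step 2: divide by the sum

probs = softmax([2.0, 1.0, -1.0])
print(probs)       # three positive numbers, highest score first
print(sum(probs))  # adds up to 1 (up to float rounding)
```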

Expected value E[X] is another name for “weighted average”

Next up: the expected value E[X]. “Expected” makes it sound like a prediction, but it’s just a weighted average.

E[X] = Σ_x x · P(x)

All it does is sum up “value x × its probability P(x)”.

For example, the expected value of a die roll is

E[X] = 1·(1/6) + 2·(1/6) + ⋯ + 6·(1/6) = 3.5

Not that every roll gives you 3.5 — it means “if you roll many times, you’ll settle around this value.”

Compare this with the weighted sum from the previous article, y = w₁x₁ + w₂x₂ + ⋯, and the shape is identical. The weights are probabilities now, but the operation is the same. Reinforcement learning’s “expected reward”, attention’s “weighted retrieval” — they all sit on top of this weighted average.
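
As a quick check, the die-roll expected value in Python:

```python
# Expected value E[X] = sum of (value × probability).
values = [1, 2, 3, 4, 5, 6]
prob = 1 / 6  # fair die: each face equally likely

expected = sum(x * prob for x in values)
print(expected)  # 3.5 (up to tiny float rounding)
```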

Variance and standard deviation are “how spread out”

Variance and standard deviation are probably the most off-putting pair at the entrance of statistics. The actual formulas are these.

  • Variance: Var(X) = E[(X − E[X])²]
  • Standard deviation: σ = √Var(X)

The symbols are nested, which makes it hard to read. Taken step by step, it gets much friendlier.

What does “spread out” even mean?

Concrete first. Two classes of test scores:

  • Class A: 50, 50, 50, 50, 50 (average 50)
  • Class B: 30, 40, 50, 60, 70 (average 50)

Same average of 50, but Class A is dead on the average while Class B is all over the place. Quantifying this “same average, different feel” is what a spread measure tries to do.

Step 1: How far each value is from the average

First, take each value and subtract the average. For Class B:

| Value | Average | Deviation |
| --- | --- | --- |
| 30 | 50 | −20 |
| 40 | 50 | −10 |
| 50 | 50 | 0 |
| 60 | 50 | +10 |
| 70 | 50 | +20 |

This is the X − E[X] piece. E[X] is the expected value (average) from earlier, and X − E[X] captures “how far, and in which direction, a value is from the average.” The innermost piece of the formula is plain subtraction.

Step 2: Square the deviations so positive and negative don’t cancel

Add these deviations up and they cancel out to zero: (−20) + (−10) + 0 + 10 + 20 = 0. This can’t tell “not spread out at all” apart from “coincidentally balanced above and below.”

The fix is to square each deviation before summing. (−20)² = 400, 10² = 100 — they all become positive, so nothing cancels.

| Value | Deviation | Squared deviation |
| --- | --- | --- |
| 30 | −20 | 400 |
| 40 | −10 | 100 |
| 50 | 0 | 0 |
| 60 | +10 | 100 |
| 70 | +20 | 400 |

That’s the (X − E[X])² piece — step 1 with squaring on top.

“Couldn’t we just use absolute values?” Sure, but absolute values bend the graph and make calculus messy, so squaring wins on mathematical convenience.

Step 3: The average of squared deviations is the variance

Sum them and divide by the count — take the average.

(400 + 100 + 0 + 100 + 400) / 5 = 200

200 is the variance. The outermost E[·] in the formula is “take the average”, wrapping around the (X − E[X])² inside.

Var(X) = E[(X − E[X])²]

Read from the inside out: “compute deviation → square it → take the average”, and steps 1 through 3 are literally nested inside the formula.

For reference, running the same calculation on Class A (all 50s) gives deviations of 0, squared deviations of 0, average of 0, and a variance of 0. “Totally uniform” maps cleanly to “indicator = 0.”

Step 4: Take the square root to get back to the original unit

Variance is useful but has one awkward quality. Since we squared the deviations, the unit is also squared (score² or m², for example).

Taking the square root brings it back to the original unit — that’s the standard deviation.

σ = √Var(X) = √200 ≈ 14.1

So Class B reads as “scores scattered around the average of 50, by roughly 14 points.” Variance 200 makes you ask “200 what?”; standard deviation 14.1 gives a direct “plus/minus 14 around the mean” feel.
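
Steps 1 through 4 for Class B, written out in Python:

```python
import math

scores = [30, 40, 50, 60, 70]  # Class B

mean = sum(scores) / len(scores)            # step 0: the average (50)
deviations = [x - mean for x in scores]     # step 1: X - E[X]
squared = [d ** 2 for d in deviations]      # step 2: square so signs don't cancel
variance = sum(squared) / len(squared)      # step 3: average of squared deviations
std_dev = math.sqrt(variance)               # step 4: back to the original unit

print(variance)  # 200.0
print(std_dev)   # about 14.14
```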

Where variance and standard deviation show up in AI

In AI articles, three common contexts:

  • Data preprocessing (standardization, normalization) to “rescale to mean 0, variance 1”
  • Batch norm / layer norm does something similar internally
  • As a health indicator for model outputs or training stability

No need to compute anything — “a measure of spread” is plenty to read AI articles with.

Covariance: do two values move together?

Variance measured “how far one value strays from its average.” Extend this to two values, and “do they move together, or opposite to each other?” is what covariance captures.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

When X goes above its average and Y also goes above its average, the product is positive × positive, so positive value accumulates. When X goes up and Y goes down, it’s positive × negative, so negative value accumulates. If they’re unrelated, the positives and negatives cancel out and you land near 0.

Set X = Y and you get Cov(X, X) = E[(X − E[X])²], which is just the variance. Variance is “covariance with yourself”; covariance is “variance for two variables.”

Non-AI example: FFT and correlation coefficient

Variance and covariance show up outside AI too. In my earlier karaoke scoring article, I used the correlation coefficient to measure how close an input singing voice was to a reference after running an FFT to decompose the audio into frequency components.

ρ_AB = Cov(A, B) / (s(A) · s(B))

s(A) and s(B) are the standard deviations of each. Raw covariance is hard to compare across datasets because its magnitude depends on the original scale, so dividing by the product of standard deviations rescales it to the −1 to 1 range. This is the standard recipe for “not whether the values are the same, but whether they move in the same direction.”

The same idea shows up in AI too — embedding similarities, feature preprocessing, etc. Variance, covariance, and standard deviation are, ultimately, tools for normalizing raw scales, and they appear identically in AI and non-AI contexts.
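
A minimal Python sketch of covariance and the correlation coefficient; the input lists are made-up illustrative data, not FFT output:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Average of (X - E[X]) * (Y - E[Y]) over paired values."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance rescaled by the standard deviations, landing in [-1, 1]."""
    sx = math.sqrt(covariance(xs, xs))  # std dev = sqrt of "covariance with yourself"
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

a = [1, 2, 3, 4, 5]
print(correlation(a, [2, 4, 6, 8, 10]))  # approximately 1.0: move together perfectly
print(correlation(a, [10, 8, 6, 4, 2]))  # approximately -1.0: move exactly opposite
```

Note how `correlation` reuses `covariance(xs, xs)` for the standard deviations; that’s the “variance is covariance with yourself” identity from above, in code.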

Logarithm log is “the tool that lets you add probabilities instead of multiplying them”

Probability formulas often have log show up out of nowhere. A lot of people hated log in high school, and “multiplication becomes addition” doesn’t help if “why?” and “where do I ever use this?” are still in the air. Let’s clear those two up before getting to AI.

What is log\log again?

log answers “to go from one number to another, what exponent do I need?” log₁₀ 100 = 2 means “to get from 10 to 100, raise it to the 2nd power.” The base (the small subscript) might be 10, or e, but the “what exponent” idea is common to all. While we’re here: base-10 log is called the common logarithm, and base-e log is the natural logarithm. The mysterious e makes a brief appearance in the calculus article later, so for now, just know “bases like e exist” and move on.

| Expression | Reading |
| --- | --- |
| log₁₀ 10 = 1 | To go from 10 to 10, exponent 1 (itself) |
| log₁₀ 100 = 2 | To go from 10 to 100, exponent 2 |
| log₁₀ 1000 = 3 | To go from 10 to 1000, exponent 3 |
| log₁₀ 10000 = 4 | To go from 10 to 10000, exponent 4 |

Line them up and the original values grow exponentially (10×, 100×, …), while the log values step up linearly: 1, 2, 3, 4. Think of it as “close to counting digits.”

Why multiplication turns into addition

With the “what exponent” lens, log(a·b) = log a + log b isn’t mysterious.

Look at 100 × 1000 = 100000 through log₁₀.

  • log₁₀ 100 = 2 (10 to 100 needs exponent 2)
  • log₁₀ 1000 = 3 (10 to 1000 needs exponent 3)
  • log₁₀ 100000 = 5 (10 to 100000 needs exponent 5 — namely 2 + 3)

Words like “100000 is 10 to the 5th” don’t click for everyone, so follow it as a formula. The key is the pre-existing rule for exponents: “multiply → exponents add.”

10² × 10³ = 10^{2+3} = 10⁵

Multiply “10 twice over” by “10 three times over” and you end up with “10 five times over.” Plain enough.

Rewrite this under log₁₀:

log₁₀(10² × 10³) = log₁₀ 10² + log₁₀ 10³ = 2 + 3 = 5

Multiplication on the left-hand side becomes addition on the right. log(a·b) = log a + log b is just the existing “exponents add” rule viewed from the log side. That’s what “multiplication turns into addition” is.

Bonus: another rule — you can pull exponents out front

In the equations above, log₁₀ 10² = 2 slipped through without justification. That used a different rule.

log aⁿ = n log a

“An exponent inside a log can come out in front as a coefficient.” With this, log₁₀ 10² lands at 2 via this path:

log₁₀ 10² = 2 log₁₀ 10 = 2 × 1 = 2

It’s the sibling rule to log(a·b) = log a + log b, and it’s handy for quickly computing individual log values.

Where does log show up?

“I did logs in school but I’ve never used them in real life,” a lot of people think. Actually a surprising number of everyday indicators are logarithmic under the hood. Humans perceive “times-over” changes more evenly than absolute changes, so laying them out on a log scale lines them up at equal intervals, which is easier to work with.

| Indicator | What it measures on a log scale | Rule of thumb |
| --- | --- | --- |
| Earthquake magnitude | Energy of the quake | +1 magnitude ≈ 32× the energy |
| Sound in decibels (dB) | Intensity of sound | +10 dB = 10× the energy |
| pH | Hydrogen ion concentration | −1 pH = 10× the ion concentration |
| Stellar magnitude | Brightness of stars | 5-magnitude difference = 100× brightness |
| Musical octave | Frequency | +1 octave = 2× frequency |

They all take a “doubling/decupling” quantity and map it onto equal intervals. log is less a math trick and more a tool for squeezing a wide-range quantity into a human-sized number.

Why AI uses log

Two main reasons AI uses log:

  1. To avoid underflow when multiplying probabilities
  2. Because derivatives (slopes) during training come out cleaner

For #1, an LLM multiplies thousands to tens of thousands of token probabilities, and the result quickly slips into 0.00000… underflow territory. Take the log and multiplication becomes addition, so you’re just summing small negative numbers — safe territory.
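
A small Python experiment makes #1 concrete. The probability 0.01 and the 200 repetitions are arbitrary illustrative numbers:

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
p = 0.01
product = 1.0
for _ in range(200):
    product *= p
print(product)  # 0.0; the true value 1e-400 is below the float range

# ... but summing their logs stays perfectly representable.
log_sum = sum(math.log(p) for _ in range(200))
print(log_sum)  # about -921.03 (= 200 * log(0.01))
```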

#2 is a calculus topic, so it’s pushed to next time. Short version: rewriting things in log form makes the training update rule cleaner.

When you see log P(x) in a formula, read it as “probabilities would collapse under multiplication, so we switched to an additive form.” The base depends on context — sometimes e, sometimes 2 — but in AI it’s almost always the natural log (base e).

Likelihood is “the probability of the observed data, as seen by the model”

Now into training.

Likelihood is a word that rarely comes up elsewhere, but its definition is simple. “The probability that, under this model (or parameters), the actually observed data is produced” — that’s the likelihood.

“Training” is ultimately the process of adjusting the model’s parameters so that this likelihood gets larger. A model that can produce the observed data with higher probability is better at explaining the data. That’s the underlying idea.

Likelihood itself is a long product, so it’s usually converted to a log and written as log-likelihood.

log P(data | model) = Σ_i log P(x_i | model)

“Sum the logs of each data point’s probability” — that’s the shape. Moving parameters in the direction that makes this larger is the idea known as maximum likelihood estimation. The name is worth remembering; reading it as “make the observed data’s probability larger” is enough.

Cross-entropy is “the training loss itself”

The most common probability formula in AI articles is cross-entropy. It’s used as the loss for classification and LLM training.

What is “loss” anyway?

One term to pin down first. Loss is the model’s prediction-vs-truth gap, condensed into a single number. Training is the process of nudging parameters to reduce this loss. What matters is “how you quantify the gap,” and for classification and LLMs, cross-entropy is the standard recipe.

The formula

H(p, q) = −Σ_x p(x) log q(x)

Plenty of symbols, but two roles:

  • p(x): the ground-truth distribution (which one is correct)
  • q(x): the model’s predicted distribution (softmax output)

For LLMs and classification, the truth is often “just one class is 1, the rest are 0” (one-hot). In that case, the formula collapses dramatically.

H(p, q) = −log q(correct class)

It’s literally “the log of the probability the model assigned to the correct class, with a minus sign in front.”

  • High confidence on the right class (q ≈ 1) → log q ≈ 0 → loss close to 0
  • Low confidence on the right class (q ≈ 0) → log q is large and negative → loss is huge

LLM training computes this cross-entropy per token and updates parameters in the direction that reduces the average. The right-falling loss curves in pre-training plots are basically looking at the average of this cross-entropy.
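
The one-hot case is short enough to write out. A Python sketch with made-up probabilities, where index 0 plays the role of the correct token:

```python
import math

def cross_entropy_loss(predicted_probs, correct_index):
    """One-hot cross-entropy: minus the log of the probability on the correct class."""
    return -math.log(predicted_probs[correct_index])

# Model's softmax output over 4 candidates; suppose index 0 is the correct token.
confident = [0.9, 0.05, 0.03, 0.02]
unsure    = [0.1, 0.5, 0.3, 0.1]

print(cross_entropy_loss(confident, 0))  # about 0.105: close to 0
print(cross_entropy_loss(unsure, 0))     # about 2.303: much larger
```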

Why the minus sign?

The minus at the front is the kind of thing people quietly wonder about. “Couldn’t we just use positive values for the gap?” The reason is simple: q is a probability in the [0, 1] range, so log q is always ≤ 0.

  • q = 1 (fully correct) → log q = 0
  • q = 0.5 → log q ≈ −0.69
  • q → 0 (completely wrong) → log q → −∞

If the negative-log intuition doesn’t land, rewriting q = 1/2 in exponent form helps.

log(1/2) = log(2⁻¹) = −log 2

Using the “bonus” rule from before (“exponents come out in front”), the −1 drops down from the exponent and the negative sign stays. Any number between 0 and 1 can be written as a⁻ⁿ with a positive exponent, so its log ends up negative by construction.

Leaving the loss negative would make it confusing: “do we want it close to 0, or deep in the negatives?” Flipping the sign up front makes everything non-negative and turns it into “0 is ideal, bigger is worse” — a natural direction. Entropy in the next section also carries a minus in front of the formula; same reason — log p is negative, and the minus turns it back into a positive number.

Why “cross” entropy?

For those curious about the name. There’s a quantity called entropy, H(p) = −Σ p(x) log p(x), which measures the uncertainty of a distribution. Cross-entropy brings a second distribution q into the picture — hence “cross.”

And about that name “entropy”

“Entropy” might ring a bell from Puella Magi Madoka Magica where Kyubey kept going on about it, or from science class as “the stuff that makes things get more disordered when left alone.” That version is thermodynamic entropy — the degree of disorder of a physical system (how many microscopic states it can occupy).

The information-theoretic H(p) = −Σ p(x) log p(x) has the same mathematical shape and the same conceptual spirit, which is why the name is reused. Here it means “how unpredictable the distribution is”: high when you can’t guess which outcome will show up, low when one outcome is basically locked in. If you read “disorder = hard to predict next”, the thermodynamic and information-theoretic entropies are mostly the same idea, staged in two different worlds.

This article won’t dive deeper into information-theoretic entropy, but “cross-entropy measures how close two distributions are” is enough reading — close → small, far → large.

KL divergence is “the gap between two distributions”

Another one that shows up in training and distribution comparison: KL divergence. It’s written like D_KL(p ∥ q), with a double vertical bar.

In Japanese materials or exams (like the G-test), you’ll also see it called KL information or Kullback-Leibler information. They all refer to the same thing — if you see the “information” version in a different article, mentally translate. Also, “KL” / “Kullback-Leibler” isn’t some abbreviation or code name — it comes from Solomon Kullback and Richard Leibler, the two mathematicians who proposed the concept in 1951.

D_KL(p ∥ q) = Σ_x p(x) log(p(x) / q(x))

It measures “how different p and q are” — 0 if they match, bigger the farther apart they are. It’s not called a distance, though. Swap p and q and the value changes, which violates the conditions for a mathematical distance.

Its relation to cross-entropy is:

H(p, q) = H(p) + D_KL(p ∥ q)

When the ground-truth distribution p is fixed, H(p) is constant, so minimizing cross-entropy is the same as minimizing KL divergence. “Reduce cross-entropy = bring the predicted distribution closer to the truth” is a valid reading.
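
A minimal Python sketch of the formula, with made-up two-outcome distributions chosen to show the asymmetry:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    # Skip zero-probability terms of p; by convention 0 * log(0/q) = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # about 0.368
print(kl_divergence(q, p))  # about 0.511; swapping p and q changes the value
```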

In AI articles, KL divergence shows up in RLHF, DPO and other RL-style methods, in distillation where you’re matching distributions, in VAE losses, and more.

Perplexity is “cross-entropy lifted to an exponent”

The LLM evaluation metric you see most is perplexity (PPL). You’ll run into things like “PPL 3.5” in model cards and papers.

Note: not to be confused with the AI search product Perplexity AI — this is a metric. Same name, no relation.

PPL = e^{H(p, q)}

e raised to the power of the cross-entropy. Since cross-entropy lives in the log domain, exponentiating lifts it back to the original scale.

The meaning: “on average, how many candidates is the model effectively choosing between when predicting the next token?”

  • PPL = 1: always picks the right one (perfect)
  • PPL = 10: wobbling between about 10 choices on average
  • PPL = hundreds to thousands: mostly flailing

Smaller = stronger model. Essentially the same metric as cross-entropy — drop one, the other drops too. So when papers or release notes say “loss went down” and “perplexity improved”, they’re saying the same thing from two angles.
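
The conversion is one line. A Python sketch with an arbitrary illustrative loss value:

```python
import math

# Average per-token cross-entropy (in nats) from a hypothetical eval run.
avg_cross_entropy = 1.25

ppl = math.exp(avg_cross_entropy)
print(ppl)  # about 3.49: "choosing between roughly 3.5 candidates on average"

# Sanity checks at the extremes:
print(math.exp(0.0))           # 1.0: always right
print(math.exp(math.log(10)))  # about 10: wobbling between roughly 10 choices
```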

Temperature is “the knob that controls how sharp the distribution is”

temperature — the one you see in generation settings — is softmax with one extra parameter.

softmax(x_i / T)

Divide the scores by T before running them through softmax.

  • T = 1: standard softmax
  • T < 1: sharpens the distribution (top tokens become even more dominant)
  • T > 1: flattens the distribution (low-probability tokens get a better shot)
  • T → 0: only the top token gets chosen (equivalent to argmax)

```mermaid
flowchart LR
    A[Scores] --> B[Divide by T]
    B --> C[softmax to distribution]
    C --> D[Sampling]
```

Raise the temperature for more randomness; lower it for more stable output. The temperature setting in OpenAI, Anthropic, and local LLM inference configs all works the same way.
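
A Python sketch of temperature applied before softmax. The scores are made up, and the max-subtraction is a numerical-stability trick rather than part of the definition:

```python
import math

def softmax_with_temperature(scores, T=1.0):
    """Divide scores by T, then apply standard softmax."""
    scaled = [s / T for s in scores]
    m = max(scaled)  # subtract the max for numerical safety
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]
standard = softmax_with_temperature(scores, T=1.0)  # standard softmax
sharp = softmax_with_temperature(scores, T=0.5)     # sharper: top score dominates more
flat = softmax_with_temperature(scores, T=2.0)      # flatter: gaps shrink

print(standard)
print(sharp)
print(flat)
```

Compare the first element of each list: the top token’s probability grows as T shrinks and falls as T grows, exactly the sharpen/flatten behavior described above.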

Why “temperature”?

The name comes from the Boltzmann distribution in physics. In statistical mechanics, the probability that a particle occupies state ii is written:

P(i) ∝ e^{−E_i / kT}

E_i is the state’s energy, T is temperature, k is Boltzmann’s constant. Look at the shape, and it lines up exactly with the temperature-softmax e^{x_i / T} from before. Treat the score x_i as “negative energy” −E_i, and the temperature T in the denominator is the same T.

The real-world behavior of temperature matches the distribution shift:

  • High temperature → particles bounce around, most states accessible → distribution flattens
  • Low temperature → particles lock into the lowest-energy state → distribution sharpens

AI sampling’s “crank up the temperature for random outputs, drop it for stability” borrows this physical intuition directly. That’s where the name comes from.

How it relates to top-k and top-p (nucleus) sampling

Top-k and top-p often show up alongside temperature.

  • top-k: keep only the k highest-probability tokens, zero the rest
  • top-p: accumulate probability from the top until the total crosses p, cut it off there

Temperature changes “the shape of the distribution”; top-k / top-p changes “which candidates are in the pool.” Combining them — drop the tail so low-probability tokens don’t misfire while keeping some randomness — is the default recipe.
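
A minimal Python sketch of top-p filtering, reusing the illustrative weather tokens from earlier; a real sampler would then draw from the surviving distribution:

```python
def top_p_filter(candidates, probs, p=0.9):
    """Keep the most-probable tokens until their cumulative probability crosses p."""
    ranked = sorted(zip(candidates, probs), key=lambda t: t[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the survivors so they form a distribution again.
    total = sum(prob for _, prob in kept)
    return [(token, prob / total) for token, prob in kept]

dist = top_p_filter(["sunny", "cloudy", "rainy", "nice"], [0.5, 0.3, 0.15, 0.05], p=0.9)
print(dist)  # the 0.05 tail ("nice") is cut; the rest are rescaled to sum to 1
```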

Enough to read LLM training logs and sampling settings

With the tools from this article, most of the formulas and numbers in LLM-related articles start being followable.

| Common form | Reading |
| --- | --- |
| P(x) | Probability that x happens |
| P(A \| B) | Probability of A given B happened |
| P(next token \| past) | The LLM’s next-token prediction itself |
| softmax | Device that turns scores into a probability distribution |
| E[X] | Weighted average: value × probability, summed |
| log P(x) | A form that lets you add instead of multiply |
| H(p, q) | Cross-entropy — the training loss |
| D_KL(p ∥ q) | Gap between two distributions |
| PPL | Exponentiated cross-entropy — smaller is stronger |
| T (temperature) | The knob that controls sampling sharpness |

You don’t need to actually solve the formulas — just being able to see “what this is doing at a high level” turns the numbers in papers and model cards into something more than decoration.

Stuff you can skip for now

At the entry level, these can be ignored:

| Term | Summary | Why it’s safe to skip |
| --- | --- | --- |
| Set notation (∪, ∩, ∅, ∈) | Math notation for treating events as sets | Shows up in rigorous probability definitions, but AI articles overwhelmingly use function forms like P(x) or P(x\|y) |
| Bayes’ theorem | Formula for flipping conditional probabilities | Knowing the name is enough; read up when you actually need it |
| Maximum likelihood (MLE) | Find parameters that maximize likelihood | Same story as “reducing cross-entropy = increasing log-likelihood” |
| Normal distribution | The bell curve that shows up for continuous values | LLM articles mostly deal with discrete distributions; “bell-shaped distribution” is enough to skim it |
| Covariance matrix | Multivariate variances packaged as a matrix | More relevant to generative models or classical stats than LLMs |
| Jaccard, correlation coefficient | Various similarity / correlation indices | Look up when you actually encounter them |
| Beta, Dirichlet distributions | Distributions for distributions | Lots of names, narrow use cases |

For reading AI articles, the high-frequency items in probability and statistics are really just conditional probability and cross-entropy, so starting there is the most efficient path.


Glossary (feel free to skip)

| Term | Meaning |
| --- | --- |
| Probability P(x) | A number from 0 to 1 representing how likely event x is |
| Conditional probability P(A \| B) | Probability of A given B has happened |
| Probability distribution | Probabilities lined up per candidate; they sum to 1 |
| Discrete distribution | A distribution over a finite set of candidates; LLM token prediction is this |
| softmax | Function that turns scores into a probability distribution |
| Expected value E[X] | Value × probability, summed — a weighted average |
| Variance, standard deviation | Measures of spread |
| Covariance | Whether two values move together or oppositely; covariance with oneself is variance |
| Correlation coefficient | Covariance divided by the product of standard deviations, rescaled to [−1, 1] |
| Likelihood | Probability of the observed data under the model |
| Log-likelihood | Log of the likelihood; additive, so easier to work with during training |
| Cross-entropy | Gap between the truth and predicted distributions; used as training loss |
| Entropy | Uncertainty of a distribution |
| KL divergence | Gap between two distributions; 0 if identical, larger as they diverge. Also called KL information or Kullback-Leibler information |
| Perplexity | Exponentiated cross-entropy — the go-to LLM evaluation metric. Also called PPL |
| Temperature | Parameter that adjusts the sharpness of the distribution during sampling |

Next up: calculus, specifically gradient descent and backpropagation, in the same “read, don’t solve” style.