The small set of math that makes AI articles readable
When reading AI articles, equations show up out of nowhere, and that’s where a lot of people close the tab.
Honestly though, you don’t need to be able to solve them all.
The goal of this article isn’t to derive AI rigorously with math.
It’s to get to “oh, this equation is just shorthand for that kind of processing.”
No calculus will be solved here.
For the training section, if the phrase “it’s nudging things slightly in the direction that reduces the mistake” makes sense to you, that’s enough.
Math isn’t a magic spell — it’s shorthand for processing
If you describe what AI is doing really loosely, the flow is this.
```mermaid
flowchart LR
    A[Input<br/>words images audio] --> B[Turn into numbers]
    B --> C[Weight and sum]
    C --> D[Bend the shape]
    D --> E[Make probabilities]
    E --> F[Pick output]
```
“Turn into numbers”, “weight and sum”, “bend the shape”, “turn into probabilities”, “pick the output.”
Embarrassingly simple, but fundamentally this is the combination.
The LLM, encoder, image generation, and 3D model articles on this blog all boil down to roughly this flow if you look at the underlying computation loosely.
First, handle things as “a row of numbers”
AI isn’t touching characters or images directly.
It first turns them into rows of numbers.
For example,
- For words: “what meaning is this word close to”
- For images: “is there an edge or color change at this part”
- For audio: “how strong is each frequency band”
— that kind of information is held as a long row of numbers.
These rows of numbers are usually written as “vectors” in articles and papers.
It looks intimidating, but at first it’s enough to think “just numbers lined up horizontally.”
This stage is the same in experiments like I tried whether a local Vision LLM can pull RPG parameters out of a character image, where we fed an image and asked for JSON, and in stories like Running TRELLIS.2 on M1 Max 64GB — a hands-on verification log, where images are turned into 3D locally.
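To make “a row of numbers” concrete, here’s a toy sketch in Python. The numbers are made up purely for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
# Toy feature vectors -- made-up numbers, not real model embeddings.
word_cat = [0.8, 0.1, -0.3]   # e.g. "animal-ness", "size", "formality"
word_dog = [0.7, 0.3, -0.2]

# "Close in meaning" becomes "close as numbers":
# compare the two rows element by element (Euclidean distance).
distance = sum((a - b) ** 2 for a, b in zip(word_cat, word_dog)) ** 0.5
print(round(distance, 3))  # -> 0.245
```

Once everything is a row of numbers like this, “similar inputs” just means “nearby rows,” and the rest of the pipeline can work purely with arithmetic.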
AI first does “weighted addition with priorities”
The first equation to look at can be this:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

Read the symbols like this.

| Symbol | Rough meaning |
|---|---|
| $x_i$ | Each input element |
| $w_i$ | How heavily that element is weighted |
| $b$ | A small shift to adjust the overall position |
| $y$ | The final output value |
What’s happening is simple — each input is multiplied by an importance and then summed.
For a sentence, for example,
- Some words pull strongly
- Some words barely pull
- Combinations change how they pull
— that’s the kind of effect.
Same for images: information about edges, colors, and positions is weighted and mixed in.
AI doesn’t know “it’s a cat” from the start; it’s more accurate to picture it scoring features and then totaling them.
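The “multiply each input by its importance and add” step is a few lines of Python. The weights and inputs here are made-up numbers, only meant to show that the same input with different weights gives a different result:

```python
# Weighted sum: multiply each input by its importance, add a small shift.
def weighted_sum(xs, ws, b):
    return sum(w * x for w, x in zip(ws, xs)) + b

features = [0.8, 0.3]  # two input features (made-up values)

# Same input, different weights -> different result.
print(round(weighted_sum(features, [1.2, -0.5], 0.1), 2))  # -> 0.91
print(round(weighted_sum(features, [0.2, 2.0], 0.1), 2))   # -> 0.86
```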
Addition alone can’t draw boundaries
Just adding leaves AI only able to react in straight lines.
So a “shape-bending step” is inserted in between.
The easiest example for beginners is sigmoid.
Written out, it’s this:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

But the shape of the graph matters more than the formula itself.
What this basically does is,
- push very small values close to 0
- push very large values close to 1
- respond sharply only around the middle
— that’s the curve.
This lets you build boundaries like “is it dog-ish” / “cat-ish” / “is this feature strong or weak” smoothly instead of with a sudden jump.
That said, real recent models often use different functions.
Sigmoid is used here just to convey the feeling of “addition alone isn’t enough, so the shape is bent in between.”
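If you want to poke at the curve yourself, sigmoid is a one-liner:

```python
import math

# Sigmoid: squashes any value into the 0-1 range.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Very negative -> near 0, very positive -> near 1,
# and the output only moves noticeably around the middle.
for x in [-10, -1, 0, 1, 10]:
    print(x, round(sigmoid(x), 4))  # e.g. sigmoid(0) -> 0.5
```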
“The ChatGPT feel” comes from “lining up probabilities”
This is the most important part of LLM output.
This is called Softmax. Written out, it’s this:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

It takes raw scores for each candidate and converts them into probability-like ratios.
Read it like this.
- $z_i$ is the raw score of each candidate
- $p_i$ is how likely that candidate is to be picked
Say there are three candidates for the next word.
| Candidate | Raw score |
|---|---|
| “is” | high |
| “was” | moderately high |
| “refrigerator” | low |
Softmax is what properly turns “high / moderately high / low” into actual ratios.
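Softmax itself is only a few lines. The raw scores below are made up, just to mirror the table:

```python
import math

# Softmax: turn a row of raw scores into ratios that sum to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for three next-word candidates.
scores = {"is": 3.0, "was": 2.0, "refrigerator": -1.0}
probs = softmax(list(scores.values()))
for word, p in zip(scores, probs):
    print(word, round(p, 3))
```

The ordering of the scores is preserved, but now the three numbers behave like probabilities: they sum to 1, and “high / moderately high / low” becomes concrete ratios.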
That’s why ChatGPT isn’t pulling answers off a shelf.
It’s computing “given this context, which is the most natural next token” every single time.
The same idea shows up in experiments like detecting and correcting OCR typos with an encoder model + local LLM, where BERT is used to check OCR output.
Each candidate character’s score is converted into a probability, and “how suspicious is this character” is read off from that.
Articles like I looked up the source of “ChatGPT lies 27% of the time” that I wrote earlier are connected at the root to this same “line up plausibilities and output the next one” mechanism.
What “training data isn’t baked in directly” actually means
From the discussion so far, what AI holds isn’t the text itself but rather,
- what kind of weights are placed on what features
- which candidate gets a high score under which circumstances
— that kind of numerical form.
Basically, it’s not that the model stores training text like a warehouse and pulls it out when needed.
The mental picture is closer to connections and tendencies soaked into the weights.
That said, this isn’t fully black-and-white.
Specific fragments can get memorized strongly, and long proper names or templated phrases can come out verbatim.
Even so, for the basic structure, seeing it as “the pattern is reflected in the weights” over “whole-text storage” makes AI behavior easier to follow.
Training is “when it misses, nudge a little” on repeat
Training, very roughly, can be read from these two lines:

$$L = -\log p$$

$$w_{\text{new}} = w - \eta \frac{\partial L}{\partial w}$$

$L$ in the first is “by how much it missed.”
$p$ here can be thought of as the probability assigned to the correct answer.
If the correct answer was only given a low probability, the loss is large.
The second is the weight update.
Read it as “new weight = current weight − adjustment amount.”

| Symbol | Rough meaning |
|---|---|
| $w$ | Current weight |
| $w_{\text{new}}$ | Updated weight |
| $\eta$ | How far to move per step |
| $\partial L / \partial w$ | Which way the mistake grows; the minus sign moves against it |
In other words,
- first, measure how much it missed
- next, move a little in the direction that reduces the miss
- repeat that a huge number of times
— that’s it.
Differentiation really does show up here.
But as an entry point, “looking at the gap from the correct answer and slowly correcting” is enough, I think.
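The “measure the miss, nudge, repeat” loop can be sketched with a single weight. This is a toy setup with made-up numbers, not a real training recipe, but the shape of the loop is the same:

```python
# Tiny gradient-descent sketch: one weight, one data point.
x, target = 2.0, 6.0   # the "right" weight would be 3, since 3 * 2 = 6
w = 0.0                # start from a bad guess
lr = 0.1               # learning rate: how far to move per step

for step in range(50):
    pred = w * x
    loss = (pred - target) ** 2     # "by how much it missed"
    grad = 2 * (pred - target) * x  # which way the mistake grows
    w = w - lr * grad               # nudge a little against that direction

print(round(w, 3))  # -> 3.0, found purely by repeated small corrections
```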
Once this much makes sense, other articles get easier to read
Once you can read this level of math, it’s much easier to follow what AI articles are saying.
| Type | Rough role |
|---|---|
| Encoder | Pulls features from input and turns them into manageable numbers |
| LLM | Chains probabilities of the next word or token |
| VLM | Handles images and text together |
| Image generation | Gradually nudges noise toward “something image-like” |
| 3D model | Takes features of images or video as numbers, then turns them back into shape |
For example, working articles like I investigated whether Z-Image (Zaoxiang) runs on RunPod — hoping for stable character shapes still treat text and image features as numbers underneath.
Similarly, multimodal experiments like I tried whether a local Vision LLM can pull RPG parameters out of a character image are an extension of the same idea: “handle images and text in the same number space.”
Glossary for the curious
This section is fine to skip.
Just the minimum meaning of the words used above.
AI in general
| Term | Meaning |
|---|---|
| vector | Numbers lined up horizontally. In AI, word and image features are often held as such a row |
| weight | A coefficient that decides which information is looked at strongly. Same input, different weights, different result |
| activation function | A step that adds bends you can’t make with addition alone. Sigmoid is one example |
| Softmax | A step that turns a row of scores into a ratio or probability-like form. Not just “what to output next” in LLMs, but also “how much to look at where” in other situations |
| loss function | A number for how much it missed. The larger it is, the more “correction is still needed” |
| learning rate | How far to move per single correction. Too large and it gets unstable, too small and it’s slow |
| encoder | The side of the model that converts input into manageable features. Think of it as a role that maps meaning and shape into numbers |
Words you see in image generation
| Term | Meaning |
|---|---|
| Text encoder | The part that turns a prompt into a row of numbers the image model can read. In ComfyUI you see it as CLIP Text Encode type nodes |
| VAE | The part that compresses an image into a manageable form and, at the end, reconstructs the image. VAE Encode and VAE Decode are these |
| latent | Not the image itself — an intermediate representation used inside VAE. If you see latent in ComfyUI, think “numbers before they become an image” |
| CFG | How strongly the prompt is applied. Too high and it becomes unnatural, too low and the instruction doesn’t land |
| step | The number of iterations turning noise back into an image. More means more careful, but slower |
| sampler | The procedure for how noise is reduced. If you’re touching the KSampler in ComfyUI, that’s enough |
Can you actually make this yourself?
Making ChatGPT itself or a full image generator in Excel is out of the question.
But a mini version of the basic computation shown in this article is doable in a spreadsheet.
For example, weighting and summing two inputs looks like this:

$$y = 1.2 \times 0.8 + (-0.5) \times 0.3 + 0.1$$

This one you can just compute in Excel or Google Sheets.

=1.2*0.8 + (-0.5)*0.3 + 0.1
If you want the output to be in the 0–1 range, you can pipe it through sigmoid too.
=1/(1+EXP(-A1))
Here A1 is the cell holding the value you just computed.
Even this much is enough to get a feel for,
- positive weights push the output up
- negative weights push the opposite way
- instead of outputting the raw value, sometimes the shape is bent at the end
A half-baked “will it rain tomorrow” AI
To make it a bit more AI-ish, you can think of a tiny version that outputs “is it rainy tomorrow.”
For explanation, the training data is shown in a table below.
But when making a prediction, you don’t look at this table directly.
What gets adjusted during training is “how heavily to weight each factor.”
Say the training data is like this.
| Rain yesterday | High humidity | Pressure dropping | Rain tomorrow |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
After training, say the weights come out like this.
| Element | Value |
|---|---|
| Weight of “rain yesterday” | 0.6 |
| Weight of “high humidity” | 1.0 |
| Weight of “pressure dropping” | 1.3 |
| Bias | -1.4 |
Now feed in a new input not in the training data.
| Rain yesterday | High humidity | Pressure dropping |
|---|---|---|
| 1 | 0 | 1 |
The computation is this:

$$0.6 \times 1 + 1.0 \times 0 + 1.3 \times 1 - 1.4 = 0.5$$

Pipe it through sigmoid:

$$\frac{1}{1 + e^{-0.5}} \approx 0.62$$

So it reads as “rainy-ish, about 62%.”
In Excel or Sheets, you can write this, for example.
=0.6*1 + 1.0*0 + 1.3*1 - 1.4
=1/(1+EXP(-0.5))
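The same computation in Python, using the trained weights from the table above:

```python
import math

# The half-baked rain predictor, with the weights from the table above.
weights = {"rain_yesterday": 0.6, "high_humidity": 1.0, "pressure_dropping": 1.3}
bias = -1.4

def predict(rain_yesterday, high_humidity, pressure_dropping):
    # Weighted sum of the three factors, then squash through sigmoid.
    z = (weights["rain_yesterday"] * rain_yesterday
         + weights["high_humidity"] * high_humidity
         + weights["pressure_dropping"] * pressure_dropping
         + bias)
    return 1 / (1 + math.exp(-z))

# New input not in the training table:
# rain yesterday, humidity not high, pressure dropping.
print(round(predict(1, 0, 1), 2))  # -> 0.62
```

Note that the training table never appears in `predict`; only the weights do, which is exactly the “patterns soaked into the weights” picture from earlier.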
Obviously, real weather prediction isn’t this simple.
But,
- there’s a training table
- weights remain after training
- at prediction time, the weights are used on new input
— to see that flow, this is enough, I think.
The important point is that it isn’t looking up rows in the training data.
Tendencies like “high humidity leans rainy” and “falling pressure leans further rainy” are reflected in the weights — that’s closer to the picture.
In short, even though the giant models are out of reach, “the basic computation behind AI” is traceable by hand.
The equations suddenly stop looking like a string of symbols and start looking like ordinary arithmetic.
A half-baked “horse-race prediction AI” or something similar has basically the same structure.
If you’re curious, play around with building one — you’ll quickly get a feel for turning inputs into numbers and giving them weights.
That said, since real horse racing has a lot of factors to look at, if you’re going to try something yourself, boat racing feels like it’s easier to keep tidy.
Related articles where I actually ran things
Within this blog, the ones where the math underneath is easiest to see are around these.
- Detecting and correcting OCR typos with an encoder model + local LLM
  A concrete example of an encoder reading per-position probabilities
- I tried whether a local Vision LLM can pull RPG parameters out of a character image
  An example of a multimodal experiment: give an image, get JSON
- I investigated whether Z-Image (Zaoxiang) runs on RunPod — hoping for stable character shapes
  An entry point where an image generation model is actually run
- Running TRELLIS.2 on M1 Max 64GB — a hands-on verification log
  An experiment generating 3D from image features locally
If you want a side-read
- MoonshotAI (Kimi) proposes AttnRes, replacing Transformer’s residual connections with Attention, for 1.25× compute efficiency
  An example of Softmax showing up in “how much to look at where” weighting
- Z-Image — an Alibaba image generator said to surpass FLUX
  An earlier article that lays out Z-Image’s position and structure