The small set of math that makes AI articles readable
When reading AI articles, equations show up out of nowhere, and that’s where a lot of people close the tab.
Honestly though, you don’t need to be able to solve them all.
The goal of this article isn’t to derive AI rigorously with math.
It’s to get to “oh, this equation is just shorthand for that kind of processing.”
No calculus will be solved here.
For the training section, if the phrase “it’s nudging things slightly in the direction that reduces the mistake” makes sense to you, that’s enough.
Math isn’t a magic spell — it’s shorthand for processing
If you describe what AI is doing really loosely, the flow is this.
```mermaid
flowchart LR
    A[Input<br/>words images audio] --> B[Turn into numbers]
    B --> C[Weight and sum]
    C --> D[Bend the shape]
    D --> E[Make probabilities]
    E --> F[Pick output]
```
“Turn into numbers”, “weight and sum”, “bend the shape”, “turn into probabilities”, “pick the output.”
Embarrassingly simple, but fundamentally this is the combination.
The LLM, encoder, image generation, and 3D model articles on this blog all boil down to roughly this flow if you look at the underlying computation loosely.
First, handle things as “a row of numbers”
AI isn’t touching characters or images directly.
It first turns them into rows of numbers.
For example,
- For words: “what meaning is this word close to”
- For images: “is there an edge or color change at this part”
- For audio: “how strong is each frequency band”
— that kind of information is held as a long row of numbers.
These rows of numbers are usually written as “vectors” in articles and papers.
It looks intimidating, but at first it’s enough to think “just numbers lined up horizontally.”
This stage is the same in experiments like I tried whether a local Vision LLM can pull RPG parameters out of a character image, where we fed an image and asked for JSON, and in stories like Running TRELLIS.2 on M1 Max 64GB — a hands-on verification log, where images are turned into 3D locally.
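To make “a row of numbers” concrete, here’s a toy sketch in Python. The numbers are made up purely for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
# Toy feature vectors -- made-up numbers, not real model embeddings.
word_cat = [0.8, 0.1, -0.3]   # e.g. "animal-ness", "size", "formality"
word_dog = [0.7, 0.3, -0.2]

# "Close in meaning" becomes "close as numbers":
# compare the two rows element by element (Euclidean distance).
distance = sum((a - b) ** 2 for a, b in zip(word_cat, word_dog)) ** 0.5
print(round(distance, 3))  # -> 0.245
```

Once everything is a row of numbers like this, “similar inputs” just means “nearby rows,” and the rest of the pipeline can work purely with arithmetic.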
AI first does “weighted addition with priorities”
The first equation to look at can be this:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

Read the symbols like this.

| Symbol | Rough meaning |
|---|---|
| $x_i$ | Each input element |
| $w_i$ | How heavily that element is weighted |
| $b$ | A small shift to adjust the overall position |
| $y$ | The final output value |
What’s happening is simple — each input is multiplied by an importance and then summed.
For a sentence, for example,
- Some words pull strongly
- Some words barely pull
- Combinations change how they pull
— that’s the kind of effect.
Same for images: information about edges, colors, and positions is weighted and mixed in.
AI doesn’t know “it’s a cat” from the start; it’s more accurate to picture it scoring features and then totaling them.
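The “multiply each input by its importance and add” step is a few lines of Python. The weights and inputs here are made-up numbers, only meant to show that the same input with different weights gives a different result:

```python
# Weighted sum: multiply each input by its importance, add a small shift.
def weighted_sum(xs, ws, b):
    return sum(w * x for w, x in zip(ws, xs)) + b

features = [0.8, 0.3]  # two input features (made-up values)

# Same input, different weights -> different result.
print(round(weighted_sum(features, [1.2, -0.5], 0.1), 2))  # -> 0.91
print(round(weighted_sum(features, [0.2, 2.0], 0.1), 2))   # -> 0.86
```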
Addition alone can’t draw boundaries
Just adding leaves AI only able to react in straight lines.
So a “shape-bending step” is inserted in between.
The easiest example for beginners is sigmoid.
Written out, it’s this:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

But the shape of the graph matters more than the formula itself.
What this basically does is,
- push very small values close to 0
- push very large values close to 1
- respond sharply only around the middle
— that’s the curve.
This lets you build boundaries like “is it dog-ish” / “cat-ish” / “is this feature strong or weak” smoothly instead of with a sudden jump.
That said, real recent models often use different functions.
Sigmoid is used here just to convey the feeling of “addition alone isn’t enough, so the shape is bent in between.”
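If you want to poke at the curve yourself, sigmoid is a one-liner:

```python
import math

# Sigmoid: squashes any value into the 0-1 range.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Very negative -> near 0, very positive -> near 1,
# and the output only moves noticeably around the middle.
for x in [-10, -1, 0, 1, 10]:
    print(x, round(sigmoid(x), 4))  # e.g. sigmoid(0) -> 0.5
```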
“The ChatGPT feel” comes from “lining up probabilities”
This is the most important part of LLM output.
This is called Softmax. Written out, it’s this:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

It takes raw scores for each candidate and converts them into probability-like ratios.
Read it like this.
- $z_i$ is the raw score of each candidate
- $p_i$ is how likely that candidate is to be picked
Say there are three candidates for the next word.
| Candidate | Raw score |
|---|---|
| “is” | high |
| “was” | moderately high |
| “refrigerator” | low |
Softmax is what properly turns “high / moderately high / low” into actual ratios.
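Softmax itself is only a few lines. The raw scores below are made up, just to mirror the table:

```python
import math

# Softmax: turn a row of raw scores into ratios that sum to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for three next-word candidates.
scores = {"is": 3.0, "was": 2.0, "refrigerator": -1.0}
probs = softmax(list(scores.values()))
for word, p in zip(scores, probs):
    print(word, round(p, 3))
```

The ordering of the scores is preserved, but now the three numbers behave like probabilities: they sum to 1, and “high / moderately high / low” becomes concrete ratios.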
That’s why ChatGPT isn’t pulling answers off a shelf.
It’s computing “given this context, which is the most natural next token” every single time.
The same idea shows up in experiments like detecting and correcting OCR typos with an encoder model + local LLM, where BERT is used to check OCR output.
Each candidate character’s score is converted into a probability, and “how suspicious is this character” is read off from that.
Articles like I looked up the source of “ChatGPT lies 27% of the time” that I wrote earlier are connected at the root to this same “line up plausibilities and output the next one” mechanism.
What “training data isn’t baked in directly” actually means
From the discussion so far, what AI holds isn’t the text itself but rather,
- what kind of weights are placed on what features
- which candidate gets a high score under which circumstances
— that kind of numerical form.
Basically, it’s not that the model stores training text like a warehouse and pulls it out when needed.
The mental picture is closer to connections and tendencies soaked into the weights.
That said, this isn’t fully black-and-white.
Specific fragments can get memorized strongly, and long proper names or templated phrases can come out verbatim.
Even so, for the basic structure, seeing it as “the pattern is reflected in the weights” over “whole-text storage” makes AI behavior easier to follow.
Training is “when it misses, nudge a little” on repeat
Training, very roughly, can be read from these two lines:

$$L = -\log p$$

$$w_{\text{new}} = w - \eta \frac{\partial L}{\partial w}$$

$L$ in the first is “by how much it missed.”
$p$ here can be thought of as the probability assigned to the correct answer.
If the correct answer was only given a low probability, the loss is large.
The second is the weight update.
Read it as “new weight = current weight − adjustment amount.”

| Symbol | Rough meaning |
|---|---|
| $w$ | Current weight |
| $w_{\text{new}}$ | Updated weight |
| $\eta$ | How far to move per step |
| $\partial L / \partial w$ | Which way the mistake grows; the minus sign moves against it |
In other words,
- first, measure how much it missed
- next, move a little in the direction that reduces the miss
- repeat that a huge number of times
— that’s it.
Differentiation really does show up here.
But as an entry point, “looking at the gap from the correct answer and slowly correcting” is enough, I think.
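The “measure the miss, nudge, repeat” loop can be sketched with a single weight. This is a toy setup with made-up numbers, not a real training recipe, but the shape of the loop is the same:

```python
# Tiny gradient-descent sketch: one weight, one data point.
x, target = 2.0, 6.0   # the "right" weight would be 3, since 3 * 2 = 6
w = 0.0                # start from a bad guess
lr = 0.1               # learning rate: how far to move per step

for step in range(50):
    pred = w * x
    loss = (pred - target) ** 2     # "by how much it missed"
    grad = 2 * (pred - target) * x  # which way the mistake grows
    w = w - lr * grad               # nudge a little against that direction

print(round(w, 3))  # -> 3.0, found purely by repeated small corrections
```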
Once this much makes sense, other articles get easier to read
Once you can read this level of math, it’s much easier to follow what AI articles are saying.
| Type | Rough role |
|---|---|
| Encoder | Pulls features from input and turns them into manageable numbers |
| LLM | Chains probabilities of the next word or token |
| VLM | Handles images and text together |
| Image generation | Gradually nudges noise toward “something image-like” |
| 3D model | Takes features of images or video as numbers, then turns them back into shape |
For example, working articles like I investigated whether Z-Image (Zaoxiang) runs on RunPod — hoping for stable character shapes still treat text and image features as numbers underneath.
Similarly, multimodal experiments like I tried whether a local Vision LLM can pull RPG parameters out of a character image are an extension of the same idea: “handle images and text in the same number space.”
Glossary for the curious
This section is fine to skip.
Just the minimum meaning of the words used above.
AI in general
| Term | Meaning |
|---|---|
| vector | Numbers lined up horizontally. In AI, word and image features are often held as such a row |
| weight | A coefficient that decides which information is looked at strongly. Same input, different weights, different result |
| activation function | A step that adds bends you can’t make with addition alone. Sigmoid is one example |
| Softmax | A step that turns a row of scores into a ratio or probability-like form. Not just “what to output next” in LLMs, but also “how much to look at where” in other situations |
| loss function | A number for how much it missed. The larger it is, the more “correction is still needed” |
| learning rate | How far to move per single correction. Too large and it gets unstable, too small and it’s slow |
| encoder | The side of the model that converts input into manageable features. Think of it as a role that maps meaning and shape into numbers |
Words you see in image generation
| Term | Meaning |
|---|---|
| Text encoder | The part that turns a prompt into a row of numbers the image model can read. In ComfyUI you see it as CLIP Text Encode type nodes |
| VAE | The part that compresses an image into a manageable form and, at the end, reconstructs the image. VAE Encode and VAE Decode are these |
| latent | Not the image itself — an intermediate representation used inside VAE. If you see latent in ComfyUI, think “numbers before they become an image” |
| CFG | How strongly the prompt is applied. Too high and it becomes unnatural, too low and the instruction doesn’t land |
| step | The number of iterations turning noise back into an image. More means more careful, but slower |
| sampler | The procedure for how noise is reduced. If you’re touching the KSampler in ComfyUI, that’s enough |
Can you actually make this yourself?
Making ChatGPT itself or a full image generator in Excel is out of the question.
But a mini version of the basic computation shown in this article is doable in a spreadsheet.
For example, weighting and summing two inputs looks like this:

$$y = 1.2 \times 0.8 + (-0.5) \times 0.3 + 0.1$$

This one you can just compute in Excel or Google Sheets.

=1.2*0.8 + (-0.5)*0.3 + 0.1
If you want the output to be in the 0–1 range, you can pipe it through sigmoid too.
=1/(1+EXP(-A1))
Here A1 is the cell holding the value you just computed.
Even this much is enough to get a feel for,
- positive weights push the output up
- negative weights push the opposite way
- instead of outputting the raw value, sometimes the shape is bent at the end
A half-baked “will it rain tomorrow” AI
To make it a bit more AI-ish, you can think of a tiny version that outputs “is it rainy tomorrow.”
For explanation, the training data is shown in a table below.
But when making a prediction, you don’t look at this table directly.
What gets adjusted during training is “how heavily to weight each factor.”
Say the training data is like this.
| Rain yesterday | High humidity | Pressure dropping | Rain tomorrow |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
After training, say the weights come out like this.
| Element | Value |
|---|---|
| Weight of “rain yesterday” | 0.6 |
| Weight of “high humidity” | 1.0 |
| Weight of “pressure dropping” | 1.3 |
| Bias | -1.4 |
Now feed in a new input not in the training data.
| Rain yesterday | High humidity | Pressure dropping |
|---|---|---|
| 1 | 0 | 1 |
The computation is this:

$$0.6 \times 1 + 1.0 \times 0 + 1.3 \times 1 - 1.4 = 0.5$$

Pipe it through sigmoid:

$$\frac{1}{1 + e^{-0.5}} \approx 0.62$$

So it reads as “rainy-ish, about 62%.”
In Excel or Sheets, you can write this, for example.
=0.6*1 + 1.0*0 + 1.3*1 - 1.4
=1/(1+EXP(-0.5))
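The same computation in Python, using the trained weights from the table above:

```python
import math

# The half-baked rain predictor, with the weights from the table above.
weights = {"rain_yesterday": 0.6, "high_humidity": 1.0, "pressure_dropping": 1.3}
bias = -1.4

def predict(rain_yesterday, high_humidity, pressure_dropping):
    # Weighted sum of the three factors, then squash through sigmoid.
    z = (weights["rain_yesterday"] * rain_yesterday
         + weights["high_humidity"] * high_humidity
         + weights["pressure_dropping"] * pressure_dropping
         + bias)
    return 1 / (1 + math.exp(-z))

# New input not in the training table:
# rain yesterday, humidity not high, pressure dropping.
print(round(predict(1, 0, 1), 2))  # -> 0.62
```

Note that the training table never appears in `predict`; only the weights do, which is exactly the “patterns soaked into the weights” picture from earlier.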
Obviously, real weather prediction isn’t this simple.
But,
- there’s a training table
- weights remain after training
- at prediction time, the weights are used on new input
— to see that flow, this is enough, I think.
The important point is that it isn’t looking up rows in the training data.
Tendencies like “high humidity leans rainy” and “falling pressure leans further rainy” are reflected in the weights — that’s closer to the picture.
In short, even though the giant models are out of reach, “the basic computation behind AI” is traceable by hand.
The equations suddenly stop looking like a string of symbols and start looking like ordinary arithmetic.
A half-baked “horse-race prediction AI” or something similar has basically the same structure.
If you’re curious, play around with building one — you’ll quickly get a feel for turning inputs into numbers and giving them weights.
That said, since real horse racing has a lot of factors to look at, if you’re going to try something yourself, boat racing feels like it’s easier to keep tidy.
Related articles where I actually ran things
Within this blog, the ones where the math underneath is easiest to see are around these.
- Detecting and correcting OCR typos with an encoder model + local LLM
  A concrete example of an encoder reading per-position probabilities
- I tried whether a local Vision LLM can pull RPG parameters out of a character image
  An example of a multimodal experiment: give an image, get JSON
- I investigated whether Z-Image (Zaoxiang) runs on RunPod — hoping for stable character shapes
  An entry point where an image generation model is actually run
- Running TRELLIS.2 on M1 Max 64GB — a hands-on verification log
  An experiment generating 3D from image features locally
If you want a side-read
- MoonshotAI (Kimi) proposes AttnRes, replacing Transformer’s residual connections with Attention, for 1.25× compute efficiency
  An example of Softmax showing up in “how much to look at where” weighting
- Z-Image — an Alibaba image generator said to surpass FLUX
  An earlier article that lays out Z-Image’s position and structure