Fujitsu's PHOTON 475x is multi-query memory throughput, not single-shot speed

The ITmedia piece on Fujitsu’s PHOTON leads with a strong headline: “up to 475x over Transformers.”
But it doesn’t mean a single chat response gets 475x faster.
What the paper measures is TPM, in units of K tokens/s/GiB: how many tokens you can emit per unit of GPU memory. It is a metric that favors multi-query serving.

Fujitsu’s own announcement, in a footnote, defines multi-query performance as “throughput per GPU resource.”
It’s a number that gets large in server setups that pack many requests onto one GPU, in long contexts, and in inference that generates and merges multiple candidates.

When I looked at how Japanese LLMs get built, using PLaMo, LLM-jp, and Sakana Fugu as examples, the differences came down to whether a project builds weights, releases them, or bundles external models.
PHOTON is none of those; it changes the architecture to cut the Transformer’s inference-time memory I/O.

Don’t keep the long token sequence as-is

A normal Transformer references the KV cache of past tokens every time it emits the next single token.
The longer the context, the more KV-cache reads and writes dominate over the arithmetic itself.
Even with GPU compute to spare, you stall on memory bandwidth.
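
As a rough illustration of why the cache comes to dominate, here is a back-of-the-envelope sketch of KV-cache size for a standard Transformer. The model shape used below (24 layers, 16 KV heads, head dimension 128, fp16) is a hypothetical configuration chosen for illustration, not one taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held for one sequence in a standard Transformer.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model shape (an assumption, not from the paper).
shape = dict(n_layers=24, n_kv_heads=16, head_dim=128)

for seq_len in (512, 2048, 8192):
    mib = kv_cache_bytes(seq_len, **shape) / 2**20
    print(f"{seq_len:>5} tokens -> {mib:7.1f} MiB per sequence")
```

The cache grows linearly with context length, and each decode step has to read all of it to emit one token. The same formula also shows why reducing `n_kv_heads` shrinks the cache proportionally.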

PHOTON replaces this with hierarchical latent streams.
Bottom-up, it compresses the token sequence chunk by chunk into coarser context representations.
Top-down, it uses those coarse representations to reconstruct local token representations.

```mermaid
flowchart TD
  A[Input token sequence] --> B[Chunking]
  B --> C[Coarse context state]
  C --> D[Even coarser context state]
  D --> E[Top-down reconstruction]
  E --> F[Generate within local chunk]
```

It doesn’t re-read the entire fine-grained token sequence globally each time.
The global state advances in coarse units, and fine-grained generation stays within a chunk’s short attention window.
In the paper’s words, instead of scanning a horizontally growing token sequence every step, it accesses multi-resolution context vertically.
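
To make the "vertical" access pattern concrete, here is a toy count of how many states a single decode step would touch. The chunk size of 4, the two levels, and the rule that a step reads the whole coarsest stream plus one chunk-sized window at each finer level are all illustrative assumptions. The paper's actual configuration and learned encoders are not reproduced here.

```python
def level_lengths(n_tokens, chunk_size, n_levels):
    """Number of states at each level when every level groups
    `chunk_size` states of the level below into one coarser state."""
    lengths = [n_tokens]
    for _ in range(n_levels):
        lengths.append(-(-lengths[-1] // chunk_size))  # ceiling division
    return lengths


def states_touched_per_step(n_tokens, chunk_size, n_levels):
    """Toy model: a step reads the whole coarsest stream, plus one
    chunk-sized local window at every finer level."""
    coarsest = level_lengths(n_tokens, chunk_size, n_levels)[-1]
    return coarsest + chunk_size * n_levels


# Context length matches the paper's 2048; chunk size and depth are assumptions.
n_tokens, chunk_size, n_levels = 2048, 4, 2
print("states per level:", level_lengths(n_tokens, chunk_size, n_levels))
print("flat scan:       ", n_tokens)
print("hierarchical:    ", states_touched_per_step(n_tokens, chunk_size, n_levels))
```

Under these toy settings the flat scan touches every token in the context, while the hierarchical read touches far fewer states. The exact numbers depend entirely on the assumed chunk size and depth.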

The paper also adds “recursive generation,” which updates only the coarsest latent stream as generation proceeds and skips bottom-up re-encoding.
Rather than rebuilding every level each time a new chunk is added, it advances the upper coarse state by one step and runs only the fine reconstruction below it.
This is where the decode-time KV-cache round trips drop, and it’s why throughput per unit memory goes up.

This structure is different from speculative decoding like Gemma 4’s MTP drafter.
Gemma 4’s MTP drafter reads ahead of the main model’s next token with a helper model, and the main model verifies in parallel.
PHOTON changes the very shape of how the main model holds context.

The KV-cache burden can also be attacked from a different angle.
DeepSeek V4-Pro keeps the attention structure but squeezes the KV heads down to one (effectively multi-query), cutting the cache itself to a tenth of V3.2’s.
That approach reduces the head count, while PHOTON reorganizes how context is held into a hierarchy. Both target decode-time memory I/O.

475x shows up in the 1.2B decode-heavy condition

The paper’s table compares Vanilla Transformer, Block Transformer, and PHOTON at 600M and 1.2B.
The evaluation environment is an NVIDIA DGX H200. Training used Pile-uncopyrighted, 134B tokens, with a context window of 2048.
Efficiency is evaluated under two regimes: prefill-heavy (long input, short output) and decode-heavy (short input, long output).

For the 1.2B model in the decode-heavy condition, Vanilla Transformer is 2.56 K tokens/s/GiB and PHOTON is 1216.67 K tokens/s/GiB.
Divide those and you get about 475x.
On the same row, throughput goes from 1.00 to 43.80 K tokens/s, and memory drops from 0.390 GiB to 0.036 GiB.
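
The arithmetic can be checked directly from the quoted row. The sketch below recomputes TPM from the throughput and memory columns and splits the gain into its two factors; the dictionary keys and the `tpm` helper are names made up for this example.

```python
# Figures from the 1.2B decode-heavy row, as quoted above.
vanilla = {"throughput_k_tok_s": 1.00, "memory_gib": 0.390}
photon = {"throughput_k_tok_s": 43.80, "memory_gib": 0.036}


def tpm(row):
    """Throughput per unit memory, in K tokens/s/GiB."""
    return row["throughput_k_tok_s"] / row["memory_gib"]


speedup = photon["throughput_k_tok_s"] / vanilla["throughput_k_tok_s"]
memory_saving = vanilla["memory_gib"] / photon["memory_gib"]
tpm_gain = tpm(photon) / tpm(vanilla)

print(f"throughput gain: {speedup:.1f}x")
print(f"memory saving:   {memory_saving:.1f}x")
print(f"TPM gain:        {tpm_gain:.1f}x")
```

Recomputed this way, the gain is the product of a roughly 44x throughput increase and a roughly 11x memory reduction. It comes out at about 474.5x, slightly below the 475x obtained from the rounded TPM values (1216.67 / 2.56); the gap is rounding in the table.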

In the same 1.2B prefill-heavy condition it’s 1.21 to 543.86 K tokens/s/GiB, about 449x.
At 600M it’s about 390x prefill-heavy and about 417x decode-heavy.
“Up to 475x” is the single best condition carved out, but the order of magnitude is the same in the other conditions too.

There are two baselines: the plain Vanilla Transformer, and Block Transformer, which already folds context block by block.
The 475x is against Vanilla; lined up against the already-efficient Block Transformer (540.20 K tokens/s/GiB at 1.2B decode-heavy), PHOTON’s 1216.67 is only about 2.25x.
The order-of-magnitude jump appears only against Vanilla; against an existing memory-conscious method, the gain is a low single-digit multiple.
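
Putting the three quoted TPM values side by side makes the baseline dependence explicit. This is only a recomputation of the ratios from the numbers above.

```python
# TPM (K tokens/s/GiB) for 1.2B decode-heavy, as quoted above.
tpm = {
    "Vanilla Transformer": 2.56,
    "Block Transformer": 540.20,
    "PHOTON": 1216.67,
}

for baseline in ("Vanilla Transformer", "Block Transformer"):
    ratio = tpm["PHOTON"] / tpm[baseline]
    print(f"PHOTON vs {baseline}: {ratio:.2f}x")

block_vs_vanilla = tpm["Block Transformer"] / tpm["Vanilla Transformer"]
print(f"Block Transformer vs Vanilla Transformer: {block_vs_vanilla:.0f}x")
```

By the same numbers, Block Transformer is already about 211x ahead of Vanilla, so most of the headline gap is covered before PHOTON's own contribution comes in.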

The paper’s abstract itself says “up to 10³×,” i.e. up to 1000x.
But the maximum you can confirm in the tables is the ~475x of 1.2B decode-heavy; no row reaches 1000x.
The press release’s 475x is more conservative than the abstract, and it’s the figure that matches the measured tables.

But the quality isn’t the same.
At 1.2B, Wikitext perplexity worsens from 19.68 to 23.79, and the zero-shot average also falls below the Transformer.
PHOTON’s pitch isn’t being 475x faster at the same output quality. It gives up a little quality to greatly increase generation per unit memory, then spends the freed-up budget on generating multiple candidates.

Recover the quality drop with multiple candidates

Fujitsu’s announcement explains the PHOTON architecture and the multi-query integration technique as a set.
For the same problem, you make several slightly different questions or candidates, then combine them at the end by majority vote or candidate selection.
If a single one-shot answer has a quality gap against the Transformer, you put out multiple answers cheaply and close the gap.
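
As a minimal sketch of the "combine at the end" step, here is plain majority voting over final answers. The details of Fujitsu's integration technique are not given here, so this is a generic stand-in, and the nine candidate answers are made-up values.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from nine sampled candidates.
candidates = ["42", "42", "41", "42", "40", "42", "41", "42", "42"]
print(majority_vote(candidates))
```

Voting of this kind only works when answers can be compared for equality, such as a number or a multiple-choice label. Free-form text would need a candidate-selection step instead.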

Fujitsu writes that integrating nine queries reached the same level as a conventional Transformer.
Reading “475x” while ignoring PHOTON’s standalone quality drop misuses the number.
The right reading is that even running nine queries still leaves headroom per unit of GPU memory.
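
A naive way to size that headroom is to divide the TPM gain by the number of queries. The sketch assumes that each of the nine queries costs the same as a single one, that integration is free, and that the nine-query parity result applies under the same condition as the 1.2B decode-heavy row. None of these is confirmed by the announcement.

```python
tpm_gain_vs_vanilla = 1216.67 / 2.56  # 1.2B decode-heavy TPM ratio quoted earlier
queries_for_parity = 9                # from Fujitsu's announcement

# Naive assumption: every query costs the same and integration is free.
effective_gain = tpm_gain_vs_vanilla / queries_for_parity
print(f"effective gain per integrated answer: about {effective_gain:.0f}x")
```

Under those assumptions the per-answer advantage is on the order of 50x rather than 475x, which is still large but a different number from the headline.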

This differs in granularity from bundling external models like Sakana Fugu.
Fugu calls several existing models and a coordinator recomposes the results.
PHOTON’s story is generating multiple candidates cheaply within the same model architecture and recovering quality through integration.
Both share the idea of “don’t make it answer just once,” but where the cost lands is quite different.

Models evaluated only up to 1.2B

The models evaluated in this paper are 600M and 1.2B.
Whether the same ratio holds at the 7B, 30B, 70B, and several-hundred-B MoE scales that matter for commercial LLMs is still unknown.
The context length is also 2048, shorter than the “long text, many users” production conditions Fujitsu is aiming at.

The paper itself lists as limitations that the maximum context is capped at 2048, that the largest model is 1.2B, and that ablations over things like the number of hierarchy levels and chunk size are insufficient.
This part is hard to pick up from the press-release numbers alone.

One more thing: this isn’t a story of converting existing Transformer weights straight into PHOTON.
PHOTON is trained as a separate architecture that includes a hierarchical encoder, hierarchical decoder, reconstruction loss, and next-context loss.
It’s not the kind of speedup where you plug a plugin into an existing Llama-family model and get 475x.

On this point too, the adoption cost differs from Gemma 4 MTP or vLLM inference optimization.
MTP is about adding a helper model to an existing model, and you can try it by tuning vLLM’s recommended settings.
PHOTON is on the model-building side; it’s not something an inference-server operator can use by adding one option tomorrow.