
Fortress Token Optimizer trims 11% off LLM prompts but risks stripping system prompt constraints

Ikesan

Fortress Token Optimizer is a small prompt-compression API that sits between you and the LLM, stripping filler words and redundant phrasing before the request goes out.
Diallo West posted “How I Built an API That Cuts LLM Token Costs by 11-22%” on DEV, showing modest but plausible reduction numbers.

The average across five prompt types was 11%.
Prompts stuffed with “Could you please help me” and “I was wondering if” compress well. Already-dense technical prompts barely move.
This is not a model swap or a fancy architecture change. It is trimming the string right before the API call.

What it actually removes

Fortress applies phrase compression, deduplication, meta-instruction removal, and sentence shortening to the prompt text before it reaches the model.
In the DEV article’s example, an 18-token request became 14 tokens, a 22% reduction.
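The mechanics are easy to sketch. The snippet below is not Fortress’s implementation (the article does not show its code); it is a minimal illustration of the phrase-compression and deduplication steps, using a made-up filler-phrase list.

```python
import re

# Hypothetical filler patterns; Fortress's actual rule set is not published in the article.
FILLER_PHRASES = [
    r"could you please help me\s*",
    r"i was wondering if\s*",
    r"if possible,?\s*",
]

def compress_prompt(prompt: str) -> str:
    """Strip filler phrases, then drop sentences that repeat the previous one verbatim."""
    text = prompt
    for pattern in FILLER_PHRASES:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    deduped = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
    return " ".join(deduped)

print(compress_prompt("Could you please help me write a regex for ISO dates? If possible, keep it short."))
# -> "write a regex for ISO dates? keep it short."
```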

The benchmark is not large-scale, but the shape of the results matches intuition.

Prompt type                          Reduction
Cover-letter-style conversational    23%
Debugging request                    8%
Learning resource request            10%
Business analysis                    4%
Project planning                     12%

Fortress is strongest on the words humans add to avoid sounding rude.
When the prompt is already packed with specs, constraints, and evaluation criteria, there is little left to cut.

In a post on wiring character settings into AI API calls, the system prompt held tone rules, banned words, and JSON output conditions.
That kind of prompt is not polite filler; it is the constraint set itself.
Running blanket compression on it can drop a character’s speech ending or a prohibition rule.
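If you still want compression in a setup like that, one conservative pattern is to compress only the user turns and pass the system prompt through untouched. This policy is my assumption, not something the Fortress article documents; the sketch reuses compress_prompt from above with the standard chat message format.

```python
def compress_messages(messages: list[dict]) -> list[dict]:
    """Compress only user-authored turns; the system prompt carries the constraint set
    (tone rules, banned words, output format) and is forwarded verbatim."""
    return [
        {**m, "content": compress_prompt(m["content"])} if m["role"] == "user" else m
        for m in messages
    ]

messages = [
    {"role": "system", "content": "Always answer in JSON. Never reveal internal notes."},
    {"role": "user", "content": "Could you please help me summarize this ticket?"},
]
print(compress_messages(messages))
```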

Prompt compression vs. context-gateway compression

I previously looked at Compresr Context Gateway, which sits between agentic tools like Claude Code or Cursor and the LLM API, compressing conversation history and tool outputs.
Fortress targets something earlier and simpler: the user-authored natural-language request. It ships npm and Python SDKs, a VS Code extension, and a web demo.

The two strip different things.
Compresr handles long logs, tool output, and stale conversation turns.
Fortress handles the sentence you type right now.
The sweet spot for Fortress is closer to chat UIs and internal tools than to RAG pipelines or agent history.

On npm, fortress-optimizer is at 1.0.1 and advertises cutting LLM API costs by 10-20%.
PyPI has the same 1.0.1.
The Fortress website claims 10-20% average reduction, 68ms optimization time, and integrations with npm, Copilot, VS Code, Slack, and Claude Desktop.
These numbers are self-reported. Treat them accordingly.

Where 11% matters and where it does not

For a single developer typing prompts by hand, the monthly savings are small.
The DEV article’s own estimate puts it at a few dollars per month at 500 prompts/day.
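The arithmetic is easy to reproduce. Every number below except the 11% is an assumption (prompt length, token price, and volume vary by workload), but with plausible values the result lands in the same few-dollars range the article describes.

```python
prompts_per_day = 500
avg_prompt_tokens = 400          # assumed: fairly long prompts
reduction = 0.11                 # measured average from the benchmark
price_per_million_input = 5.00   # assumed USD per 1M input tokens

tokens_saved = prompts_per_day * 30 * avg_prompt_tokens * reduction
print(f"{tokens_saved:,.0f} input tokens/month saved, about ${tokens_saved / 1e6 * price_per_million_input:.2f}")
# -> 660,000 input tokens/month saved, about $3.30
```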

But the broader billing trend favors per-token pricing.
OpenAI Codex moved from message-based pricing to token-based billing because message units could not absorb the variance in task size.
Independent measurements of Claude Opus 4.7’s tokenizer showed the same string consuming 1.2-1.45x more tokens, inflating the bill without any change on the user side.

In that context, 11% is not a dramatic saving. It is a small lever you can pull yourself on the billing side.
Batch jobs, internal chatbots, support-ticket classifiers, and review assistants that send similar inputs at volume are where the number compounds.
On the other side, for any task where a single generation’s quality or instruction compliance directly affects revenue, an 11% input reduction matters less than a missing constraint.

Context window consumption slows down too

If each prompt is 11% shorter, the context window in a multi-turn conversation fills 11% slower.
Separate from cost, this means more turns before the model starts losing track of earlier context.
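The gain in turn count is actually a bit more than 11%, because the saving compounds as the window fills. A quick sanity check with assumed numbers (window size and turn length are illustrative, and the model’s own responses, which also consume the window, are ignored):

```python
context_window = 200_000    # assumed window size in tokens
tokens_per_turn = 1_200     # assumed average user turn before compression
reduction = 0.11

turns_before = context_window // tokens_per_turn
turns_after = context_window // int(tokens_per_turn * (1 - reduction))
print(turns_before, turns_after)   # 166 vs 187: roughly 12-13% more turns fit
```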

LLMs pay less attention to older content as the context grows.
Compressing each turn delays that degradation.
For chatbots and agents running long sessions, not forgetting an instruction at turn 30 can matter more than saving a few dollars a month.

Log the before-and-after diff

If you deploy this, reduction percentage alone is not enough.
Keep the original prompt, the compressed prompt, the model output, and the reduction rate side by side. When something breaks, you need to see which words disappeared.

The dangerous cases are negation, conditionals, numbers, proper nouns, and code blocks.
“Words that can be removed without changing meaning” and “short words that control meaning” look similar on the surface.
Drop “never,” “unless,” “only,” “JSON,” or “exactly” and the token count goes down but the output breaks.
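A minimal version of that logging, including a check for dropped control words, could look like the sketch below. The watch-list and record fields are my choice, not part of Fortress, and whitespace-split words stand in for real tokens.

```python
import difflib
import json
import re

CONTROL_WORDS = {"never", "unless", "only", "json", "exactly", "not", "must"}

def log_compression(original: str, compressed: str, model_output: str) -> dict:
    """Store both prompt versions, what was removed, and any control words that vanished."""
    orig_words = original.split()
    comp_words = compressed.split()
    removed = [d[2:] for d in difflib.ndiff(orig_words, comp_words) if d.startswith("- ")]
    dropped_controls = sorted(
        {re.sub(r"\W", "", w).lower() for w in removed} & CONTROL_WORDS
    )
    record = {
        "original": original,
        "compressed": compressed,
        "model_output": model_output,
        "reduction": round(1 - len(comp_words) / max(len(orig_words), 1), 3),
        "removed_words": removed,
        "dropped_control_words": dropped_controls,  # non-empty means review this request
    }
    print(json.dumps(record, indent=2))
    return record
```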

The Fortress article says code blocks and meaningful modifiers are preserved.
Even so, starting in conservative mode with A/B logging is the natural first step for any production workload.
Prompt compression is input preprocessing. When it fails, the model’s output logs alone will not tell you what went wrong.

Saying please to an AI, and yelling at one

The phrases Fortress removes, “Could you please help me” and “I was wondering if,” are pure token waste for an API call.
But people do not type “please” because they calculated the token cost.

Whether you should thank an AI comes up periodically.
Skipping the courtesy does reduce tokens, exactly as the Fortress benchmark shows.
In 2025, Sam Altman joked that ChatGPT users’ “please” was costing OpenAI a meaningful amount of electricity.
But just as dropping “Thank you” and “Best regards” from a work email makes it feel hostile, typing bare commands into a chat UI feels uncomfortable for the human side.

These filler phrases signal non-hostility. Even when the other side is an API, they are hard to leave out.
A compression API that strips them mechanically before sending is a neat fit: the human’s habit stays intact, but the tokens do not.

On the opposite end, plenty of people yell at AI.
When the output is not what they wanted, they type “I just told you this” or “how many times are you going to make the same mistake.”
Social media is full of people raging at ChatGPT and Claude, even though they know anger will not improve the output.

There is a story about Claude Code internally detecting user frustration.
The system prompt reportedly included frustration detection, adjusting tone when the user seemed angry.
Politeness can be stripped before sending by a machine. Anger is handled on the model’s side after receiving it.

References