You can use the free LLM API 3,000 times a month with Sakura AI Engine
“Sakura’s AI Engine,” provided by Sakura Internet, is an LLM inference API platform that runs entirely in domestic (Japanese) data centers. It is OpenAI API-compatible and free for up to 3,000 requests per month. In March 2026, Moonshot AI’s 1-trillion-parameter model “Kimi-K2.5” was added as a public preview.
What is Sakura’s AI Engine?
The service became generally available in September 2025, and you can perform LLM inference and RAG (Retrieval-Augmented Generation) simply by calling the API.
| Feature | Description |
|---|---|
| Compatible with OpenAI API | Can be used from existing OpenAI SDKs and tools by simply replacing the endpoint |
| Fully domestic processing | All data processing happens on servers in Japan; customer data is not used for model training |
| Closed-network support | Also works over VPN, LGWAN, and private networks, making it easy for local governments and financial institutions to adopt |
| Free tier | Text generation is free up to 3,000 requests per month, audio transcription is free up to 50 requests per month, and Embeddings is free up to 10,000 requests per month |
For companies whose requirements prohibit sending data overseas, it is a realistic alternative to the OpenAI API and Claude API.
Available models
Models available as of March 2026.
Chat Completions (text generation)
| Model | Developer | Input | Output | Notes |
|---|---|---|---|---|
| gpt-oss-120b | OpenAI (open-weight model) | 0.15 yen/10,000 tokens | 0.75 yen/10,000 tokens | Free tier target |
| Qwen3-Coder-480B-A35B-Instruct-FP8 | Alibaba Cloud | 0.3 yen/10,000 tokens | 2.5 yen/10,000 tokens | Coding specialized |
| Qwen3-Coder-30B-A3B-Instruct | Alibaba Cloud | 0.15 yen/10,000 tokens | 0.75 yen/10,000 tokens | Light version |
| llm-jp-3.1-8x13b-instruct4 | LLM-jp | 0.15 yen/10,000 tokens | 0.75 yen/10,000 tokens | Domestic MoE model |
| PLaMo 2.0-31B | Preferred Networks | Individual inquiry | Individual inquiry | Domestic model |
| cotomi v3 | NEC | Individual inquiry | Individual inquiry | Domestic model |
Public preview (multimodal)
| Model | Developer | Input | Output |
|---|---|---|---|
| preview/Kimi-K2.5 | Moonshot AI | 0.6 yen/10,000 tokens | 3.0 yen/10,000 tokens |
| preview/Qwen3-VL-30B-A3B-Instruct | Alibaba Cloud | — | — |
| preview/Phi-4-multimodal-instruct | Microsoft | — | — |
Others
| Service | Model | Price | Free Tier |
|---|---|---|---|
| Audio transcription | whisper-large-v3-turbo | 0.5 yen/60 seconds | 50 requests per month |
| Embeddings | multilingual-e5-large | 2 yen/10,000 tokens | 10,000 requests per month |
| Voice synthesis | VOICEVOX (Zundamon, Tohoku Zunko, etc.) | 3 yen/10,000 mora | 50 requests per month |
| RAG | — | 3 yen/100 chunks | — |
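Since the service is OpenAI API-compatible, the `multilingual-e5-large` embeddings in the table above should be reachable through a standard `/v1/embeddings` request. The sketch below assumes that endpoint shape (it is not spelled out in this article) and uses only the standard library; the cosine-similarity helper shows the typical way to compare the resulting vectors.

```python
import json
import math
import urllib.request

BASE_URL = "https://ai-engine.sakura.ad.jp/v1"  # endpoint from the quickstart example


def embed(texts, api_token, model="multilingual-e5-large"):
    """Call the (assumed) OpenAI-compatible /v1/embeddings endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]


def cosine_similarity(a, b):
    """Compare two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With 10,000 free requests per month, embeddings are the most generous part of the free tier, which makes them a natural starting point for a small search index.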
Pricing plan
There are two plans.
Base model free plan
A plan limited to the free tier. If you exceed the limit, requests are rate-limited (delayed or rejected). Credit card registration is required, but you are not charged as long as you stay within the free tier.
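Because exceeding the free tier gets requests delayed or rejected rather than billed, client code on this plan should expect rate-limit errors. A minimal retry-with-backoff wrapper, sketched generically below (in practice you would catch the SDK's `openai.RateLimitError` instead of bare `Exception`):

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    Re-raises the last error once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow this to openai.RateLimitError in real code
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Usage would look like `with_backoff(lambda: client.chat.completions.create(...))`, keeping the rest of the calling code unchanged.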
Pay-as-you-go plan
A plan where usage beyond the free tier is billed pay-as-you-go at the unit prices in the table above. One user report puts gpt-oss-120b at about 138 yen for 110 requests with 1.6 million input tokens and 140,000 output tokens, which is quite cheap for personal development.
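A quick way to sanity-check your own expected bill is to apply the per-10,000-token list prices directly. The helper below uses the gpt-oss-120b prices from the table above; note that it ignores the free tier and any other models you call, so actual invoices (like the 138-yen report above) can differ.

```python
def estimate_cost_yen(input_tokens, output_tokens,
                      in_price=0.15, out_price=0.75):
    """Estimate raw token cost in yen at gpt-oss-120b list prices.

    Prices are yen per 10,000 tokens; free-tier deductions are not modeled.
    """
    return (input_tokens / 10_000) * in_price + (output_tokens / 10_000) * out_price


# Example: the token counts from the user report above
# (list-price cost only, before accounting for other usage)
cost = estimate_cost_yen(1_600_000, 140_000)
print(f"{cost:.1f} yen")
```

Swap in the table's unit prices for Qwen3-Coder or the Kimi-K2.5 preview to estimate those models instead.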
Kimi-K2.5 added in public preview
On March 17, 2026, “Kimi-K2.5” developed by Moonshot AI (China) was added to Sakura’s AI Engine.
Kimi-K2.5 specifications
| Item | Value |
|---|---|
| Total number of parameters | 1 trillion (1T) |
| Number of active parameters | Approximately 32 billion (32B) |
| Architecture | Mixture-of-Experts (MoE) |
| Number of experts | 384 (8 selected per token, 1 shared) |
| Number of layers | 61 (including 1 dense layer) |
| Attention Hidden Dimension | 7,168 |
| MoE Hidden Dimension (per expert) | 2,048 |
| Attention head count | 64 |
| Attention mechanism | MLA (Multi-head Latent Attention) |
| Vision encoder | MoonViT (400 million parameters, image/video input supported) |
| Learning data | Approximately 15 trillion tokens (mixed data of text + images) |
| Vocabulary size | 160,000 |
| Activation function | SwiGLU |
| Knowledge cutoff | April 2024 (events up to October 2024 partially covered) |
The MoE architecture activates only a fraction of the parameters on each inference step, so the model has the knowledge capacity of 1 trillion parameters while running at roughly the compute cost of a 32B model.
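The arithmetic behind that claim follows directly from the spec table: with ~32B parameters active out of 1T total, each token touches only about 3% of the weights.

```python
# Figures from the Kimi-K2.5 spec table above
total_params = 1_000_000_000_000   # 1 trillion total parameters
active_params = 32_000_000_000     # ~32 billion activated per token
experts_total = 384                # routed experts
experts_per_token = 8 + 1          # 8 selected per token + 1 shared expert

ratio = active_params / total_params
print(f"Active weight fraction per token: {ratio:.1%}")
print(f"Experts used per token: {experts_per_token} of {experts_total}")
```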
What you can do
- Document understanding (text extraction/summarization from images)
- Code generation (HTML/JavaScript, Java Swing, etc.; GLM-5 is reportedly better at this)
- Image caption generation
- Multimodal Q&A (answers to questions with images)
Since this is a public preview, stability and response quality cannot be guaranteed, and the service may end or specifications may change without prior notice.
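For the multimodal capabilities above, a natural guess is that the preview models accept OpenAI-style `image_url` content parts in chat messages; this article does not confirm the exact payload shape, so treat the following request body as a hypothetical sketch.

```python
# Hypothetical multimodal request payload for preview/Kimi-K2.5,
# assuming the OpenAI-style content-parts format is accepted (unconfirmed).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the text in this image."},
            {
                "type": "image_url",
                # base64 data URL of the image to analyze (placeholder)
                "image_url": {"url": "data:image/png;base64,..."},
            },
        ],
    }
]
```

If accepted, this would be passed as the `messages` argument to `client.chat.completions.create(model="preview/Kimi-K2.5", ...)`; given the public-preview caveat above, verify the format against the official docs first.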
Steps to start using
```mermaid
graph TD
    A[Create a Sakura Internet<br/>member ID] --> B[Create a Sakura Cloud<br/>project]
    B --> C[Register a credit card]
    C --> D[Enable AI Engine from<br/>the control panel]
    D --> E[Issue an API token]
    E --> F[Send API requests]
```
Because the service is OpenAI API-compatible, you can use the existing OpenAI SDK in any language it supports; here is a Python example.
```python
from openai import OpenAI

# Point the standard OpenAI client at Sakura's endpoint
client = OpenAI(
    base_url="https://ai-engine.sakura.ad.jp/v1",
    api_key="YOUR_API_TOKEN",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "さくらのAI Engineについて教えて"}],
)
print(response.choices[0].message.content)
```
There are also reports of using it to build AI agents via Xcode’s Coding Intelligence and MCP servers.
Who is this service for?
For companies and local governments that cannot send data to overseas clouds, it is a realistic option because all processing stays in Japan. Individual developers can try prototypes and personal projects for free within the 3,000 requests per month. Migrating from the OpenAI API only requires swapping the endpoint, so migration costs are low.
Although its performance may lag behind GPT-4o and Claude Sonnet, no other service combines domestic processing, low cost, and OpenAI compatibility. In particular, the free tier’s 3,000 requests per month is enough in practice for a small chatbot or in-house tool.
Limitations of Kimi-K2.5
Search plugin cannot be used
The Kimi official platform (kimi.moonshot.cn) offers a web search plugin that lets the model answer with real-time information. By contrast, the Kimi-K2.5 served through Sakura’s AI Engine exposes only the base model API and supports neither search plugins nor tool calls.
In other words, via Sakura it cannot answer questions about events after the knowledge cutoff (April 2024). For applications that need up-to-date information, the Kimi official platform is a better fit. As a workaround you can feed documents yourself via the RAG feature, but that is not a substitute for web search.
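The feed-your-own-documents workaround boils down to stuffing retrieved chunks into the prompt yourself. A minimal sketch of that pattern, independent of Sakura's managed RAG feature (the helper name and message layout are illustrative, not from the service docs):

```python
def build_rag_messages(question, chunks):
    """Build chat messages that prepend retrieved document chunks as context.

    This only grounds answers in documents you supply; it is not web search.
    """
    context = "\n\n".join(f"[doc {i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return [
        {
            "role": "system",
            "content": "Answer using only the provided documents.\n\n" + context,
        },
        {"role": "user", "content": question},
    ]
```

The returned list would go straight into `client.chat.completions.create(model="preview/Kimi-K2.5", messages=...)`; anything not present in the supplied chunks remains limited by the April 2024 cutoff.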
Content filtering (censorship)
One concern with LLMs originating from China is content filtering. Kimi-K2.5 is developed by Moonshot AI (Beijing, China), so the original model has filtering built in to comply with Chinese regulations. On topics sensitive to the Chinese government, such as the Tiananmen Square massacre, Taiwan’s political status, and the Tibet issue, it tends to refuse to answer or give bland responses.
What happens to this filtering when the model is used via Sakura’s AI Engine? Sakura has not officially stated whether it adds its own filtering layer, but the filtering baked into the model itself presumably carries over as is. Since Sakura does not modify model weights to relax censorship, you can assume the same filtering as the original Kimi.
This is not unique to Kimi-K2.5 but is common to open models from China; the Qwen series shows a similar tendency. It is unsuitable for applications dealing with politically sensitive topics, but poses no practical problem for technical questions or business use.