
Sakura AI Engine lets you use a free LLM API 3,000 times a month

Ikesan

Sakura AI Engine, provided by Sakura Internet, is an LLM inference API platform hosted entirely in Japan. It is compatible with the OpenAI API and can be used free of charge for up to 3,000 requests per month. In March 2026, Moonshot AI’s Kimi-K2.5 model also became available in public preview.

What Sakura AI Engine is

The service became generally available in September 2025. It provides LLM inference and RAG through API calls alone.

| Feature | Details |
| --- | --- |
| OpenAI API compatible | Existing OpenAI SDKs and tools can use it by changing the endpoint |
| Hosted in Japan | All data processing stays on Japanese servers, and customer data is not used for training |
| Works with closed networks | Also supports VPN, LGWAN, and private networks |
| Free tier | Text generation up to 3,000 requests/month, transcription up to 50 requests/month, embeddings up to 10,000 requests/month |
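For capacity planning, the free-tier ceilings above are easy to check programmatically. This is a simple illustrative helper with the limits copied from the table, not a feature of any official SDK:

```python
# Free-tier monthly request limits, taken from the table above.
FREE_TIER_LIMITS = {
    "text_generation": 3_000,
    "transcription": 50,
    "embeddings": 10_000,
}

def fits_free_tier(service: str, monthly_requests: int) -> bool:
    """True if the expected monthly request count stays within the free tier."""
    return monthly_requests <= FREE_TIER_LIMITS[service]

print(fits_free_tier("text_generation", 100))  # → True
print(fits_free_tier("transcription", 200))    # → False
```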

For companies that cannot send data overseas, it is a realistic alternative to the OpenAI and Claude APIs.

Available models

Chat Completions

| Model | Provider | Input | Output | Notes |
| --- | --- | --- | --- | --- |
| gpt-oss-120b | OpenAI | 0.15 yen / 10,000 tokens | 0.75 yen / 10,000 tokens | Free-tier target |
| Qwen3-Coder-480B-A35B-Instruct-FP8 | Alibaba Cloud | 0.3 yen / 10,000 tokens | 2.5 yen / 10,000 tokens | Coding-specialized |
| Qwen3-Coder-30B-A3B-Instruct | Alibaba Cloud | 0.15 yen / 10,000 tokens | 0.75 yen / 10,000 tokens | Lightweight version |
| llm-jp-3.1-8x13b-instruct4 | LLM-jp | 0.15 yen / 10,000 tokens | 0.75 yen / 10,000 tokens | Domestically developed MoE model |
| PLaMo 2.0-31B | Preferred Networks | Contact sales | Contact sales | Domestically developed |
| cotomi v3 | NEC | Contact sales | Contact sales | Domestically developed |

Public preview

| Model | Provider | Input | Output |
| --- | --- | --- | --- |
| preview/Kimi-K2.5 | Moonshot AI | 0.6 yen / 10,000 tokens | 3.0 yen / 10,000 tokens |
| preview/Qwen3-VL-30B-A3B-Instruct | Alibaba Cloud | - | - |
| preview/Phi-4-multimodal-instruct | Microsoft | - | - |

Other services

| Service | Model | Price | Free tier |
| --- | --- | --- | --- |
| Audio transcription | whisper-large-v3-turbo | 0.5 yen / 60 seconds | 50 requests/month |
| Embeddings | multilingual-e5-large | 2 yen / 10,000 tokens | 10,000 requests/month |
| Voice synthesis | VOICEVOX | 3 yen / 10,000 mora | 50 requests/month |
| RAG | - | 3 yen / 100 chunks | - |
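To get a feel for the embeddings pricing, here is a rough cost estimate using the per-unit rate from the table above (2 yen per 10,000 tokens). The token counts are illustrative, and billing details such as rounding are not modeled:

```python
# Rate from the "Other services" table: 2 yen per 10,000 embedded tokens.
EMBED_RATE_YEN_PER_10K_TOKENS = 2.0

def embedding_cost_yen(total_tokens: int) -> float:
    """Metered cost in yen for embedding `total_tokens` tokens."""
    return total_tokens / 10_000 * EMBED_RATE_YEN_PER_10K_TOKENS

# Example: embedding 50 documents of ~300 tokens each (15,000 tokens total).
print(embedding_cost_yen(15_000))  # → 3.0
```

Within the free tier, up to 10,000 such requests per month cost nothing; the estimate only applies to usage beyond that.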

Pricing plans

There are two plans:

Base-model free plan

A plan that only uses the free tier. If you go over the limit, requests are delayed or rejected. A credit card is required, but you will not be charged if you stay within the free tier.

Pay-as-you-go plan

You pay for usage beyond the free limit. One user reported that gpt-oss-120b cost about 138 yen for 110 requests totaling 1.6 million input tokens and 140,000 output tokens, which is very cheap for personal development.
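For budgeting your own workload, a back-of-envelope estimator using the gpt-oss-120b rates from the model table above (0.15 yen / 10,000 input tokens, 0.75 yen / 10,000 output tokens) looks like this. Billing details such as rounding, minimums, or the free-tier offset are not modeled:

```python
# Pay-as-you-go rates for gpt-oss-120b, from the model table above.
IN_RATE_YEN_PER_10K = 0.15   # input tokens
OUT_RATE_YEN_PER_10K = 0.75  # output tokens

def gpt_oss_cost_yen(input_tokens: int, output_tokens: int) -> float:
    """Estimated metered cost in yen for one billing period."""
    return (input_tokens / 10_000 * IN_RATE_YEN_PER_10K
            + output_tokens / 10_000 * OUT_RATE_YEN_PER_10K)

# Example: 100,000 input tokens and 20,000 output tokens.
print(round(gpt_oss_cost_yen(100_000, 20_000), 2))  # → 3.0
```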

Kimi-K2.5 in public preview

On March 17, 2026, Moonshot AI’s Kimi-K2.5 was added to Sakura AI Engine.

Specs

| Item | Value |
| --- | --- |
| Total parameters | 1T |
| Active parameters | about 32B |
| Architecture | Mixture-of-Experts |
| Experts | 384 total, 8 selected per token plus 1 shared |
| Layers | 61 |
| Attention hidden dimension | 7,168 |
| MoE hidden dimension per expert | 2,048 |
| Attention heads | 64 |
| Attention mechanism | MLA |
| Vision encoder | MoonViT |
| Training data | about 15T tokens |
| Vocabulary | 160,000 |
| Activation | SwiGLU |
| Knowledge cutoff | Based on April 2024, with partial coverage into October |

Because the MoE design activates only a fraction of the full parameter set for each token, the model carries 1T parameters' worth of knowledge at roughly the compute cost of a 32B-class dense model.
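As a rough sanity check on those numbers (an illustrative calculation, not an official parameter breakdown): selecting 8 routed experts plus 1 shared expert out of 384 activates about 9/384 of the expert parameters per token, which lands in the same ballpark as the reported ~32B once attention and embedding parameters are added.

```python
# Back-of-envelope check of the spec table. Assumes, for illustration,
# that most of the 1T parameters live in the experts; attention and
# embedding parameters (not modeled here) account for the remainder
# of the ~32B active parameters.
TOTAL_PARAMS = 1.0e12
EXPERTS_TOTAL = 384
EXPERTS_ACTIVE = 8 + 1  # 8 routed + 1 shared per token

active_expert_params = TOTAL_PARAMS * EXPERTS_ACTIVE / EXPERTS_TOTAL
print(f"{active_expert_params / 1e9:.1f}B")  # → 23.4B, same order as ~32B
```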

What you can do

  • Document understanding
  • Code generation
  • Image captioning
  • Multimodal Q&A

Because it is only a public preview, stability and quality are not guaranteed.

How to start using it

```mermaid
graph TD
    A[Sakura Internet account] --> B[Sakura Cloud project]
    B --> C[Credit card registration]
    C --> D[Enable AI Engine from the control panel]
    D --> E[Issue API token]
    E --> F[Send API requests]
```

The API is OpenAI-compatible, so existing OpenAI SDKs work after changing the base URL.

```python
from openai import OpenAI

# Point the official OpenAI SDK at Sakura AI Engine's endpoint.
client = OpenAI(
    base_url="https://ai-engine.sakura.ad.jp/v1",
    api_key="YOUR_API_TOKEN",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Tell me about Sakura AI Engine"}],
)
print(response.choices[0].message.content)
```

It can also be used with Coding Intelligence and MCP servers.

Who it is for

It is a realistic option for companies and local governments that cannot send data to overseas clouds. Individual developers can also try it for free for prototypes and side projects.

The main drawback is that model performance is not always on par with GPT-4o or Claude Sonnet; on the other hand, no other service currently combines domestic hosting, low cost, and OpenAI compatibility.

Kimi-K2.5 limitations

The public Sakura API does not include Kimi’s web-search plugin, so it cannot answer beyond its knowledge cutoff unless you add your own RAG data.
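A minimal sketch of that workaround, assuming you already have passages retrieved from your own document store (the retrieval step, the question, and the passage text here are all hypothetical): prepend the retrieved context to the chat messages before sending the request.

```python
# Minimal sketch of supplying your own retrieved context to work around
# the knowledge cutoff. `retrieved_passages` stands in for whatever your
# own RAG store returns; retrieval itself is out of scope here.
def build_messages(question: str, retrieved_passages: list[str]) -> list[dict]:
    context = "\n\n".join(retrieved_passages)
    return [
        {"role": "system",
         "content": "Answer using only the reference material below.\n\n" + context},
        {"role": "user", "content": question},
    ]

messages = build_messages(
    "What changed in the 2026 release?",
    ["(hypothetical passage fetched from your own document store)"],
)
# These messages would then be passed to the OpenAI-compatible endpoint:
# client.chat.completions.create(model="preview/Kimi-K2.5", messages=messages)
```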