Kana Chat v2 Architecture Changes
In the previous article, I wrote about the basic design of “Kana Chat,” an AI agent built by wrapping official CLIs. About a month later, after actually using it daily and iterating on it, the architecture has changed considerably. Here’s a record of what changed and why.
How the Overall Picture Changed
The v1 system architecture looked like this:
flowchart TB
A["iPhone browser"]
B["FastAPI"]
C1["Claude Code"]
C2["Codex"]
C3["Gemini CLI"]
A --> B
B --> C1
B --> C2
B --> C3
v2 has more layers.

The icon changed too. v1 used an illustration generated with an Illustrious LoRA, but v2 switched to pixel art. Not just the header icon — the expression changes based on job status.
Left to right: header, thinking (?), working (⚙), completed (✓), failed (✗), waiting (Zzz), paused (⏸). Used in job cards and push notifications too.
flowchart TB
A["iPhone browser<br/>PWA + Push notifications<br/>Image & audio input"]
TS["Tailscale Serve<br/>HTTPS"]
B["FastAPI<br/>WebSocket streaming"]
R["Dual-model router<br/>gpt-5.3-codex-spark + gpt-5.4-mini"]
P["Planner mode<br/>Design → execution separation"]
HB["Heartbeat<br/>Memory extraction + task suggestions"]
W1["Claude Code worker"]
W2["Codex reviewer"]
WD["Job Watchdog<br/>Stall detection"]
A --> TS
TS --> B
B --> R
R --> |"small_chat"| B
R --> |"plan"| P
R --> |"job"| W1
W1 --> W2
WD --> W1
B --> HB
Nine main changes. Walking through them one by one.
1. Dual-Model Router
This was the part v1 glossed over as “intent classification with Haiku.” In practice, Haiku alone wasn’t accurate enough — it would route casual conversation to jobs, or try handling file operations through chat.
v2 classifies with a primary model (gpt-5.3-codex-spark), then falls back to a secondary model (gpt-5.4-mini) if confidence is below 0.84.
flowchart LR
MSG["User message"]
P["Primary<br/>gpt-5.3-codex-spark"]
F["Fallback<br/>gpt-5.4-mini"]
OUT["Route decision"]
MSG --> P
P --> |"confidence >= 0.84"| OUT
P --> |"confidence < 0.84"| F
F --> OUT
Routes were reorganized into 4 types:
| Route | Use case |
|---|---|
| small_chat | Casual conversation, short text generation, light explanations |
| background_search | Questions requiring web research |
| plan | Requirements gathering, design discussions |
| job | File edits, code changes, repository work |
Going from v1’s 2-way split (chat vs status query) to 4 routes eliminated the problem of “routing research questions to jobs and polluting Claude Code’s context.” background_search runs async web searches and injects results as runtime notes in the next turn.
Router output is locked to a JSON schema:
{
"route": "small_chat",
"confidence": 0.92,
"ask_user": false,
"short_reason": "casual conversation",
"search_query": ""
}
Reasoning effort is set to low. Deep reasoning isn’t needed for routing; latency matters more. If the router doesn’t answer within its 25-second timeout, the message falls through to small_chat.
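The primary-then-fallback flow is simple enough to sketch. This is a minimal illustration, not the actual implementation: `classify_primary` and `classify_fallback` are hypothetical stand-ins for the two model calls, each returning the router JSON shown above.

```python
ROUTES = {"small_chat", "background_search", "plan", "job"}
CONFIDENCE_THRESHOLD = 0.84

def route_message(message: str, classify_primary, classify_fallback) -> dict:
    """Two-stage routing: trust the primary only above the confidence bar."""
    try:
        decision = classify_primary(message)
        if decision["route"] in ROUTES and decision["confidence"] >= CONFIDENCE_THRESHOLD:
            return decision
        # Low confidence (or an unknown route): ask the fallback model.
        return classify_fallback(message)
    except Exception:
        # Timeout or malformed output: fail safe to casual chat.
        return {"route": "small_chat", "confidence": 0.0, "ask_user": False,
                "short_reason": "router fallback", "search_query": ""}
```

Failing toward small_chat is the cheap direction: a misrouted chat message wastes a turn, while a misrouted job pollutes a worker’s context.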
2. Planner Mode
A new feature not in v1. Sending “I want to build this” straight to a job starts implementation from vague requirements and generates rework. Messages classified as the plan route instead launch a dedicated CLI in its own tmux window, used for design and requirements only.
The planner CLI launches in a read-only sandbox. It can read code and produce design proposals but can’t modify files. Once the design is done, the /planner/job endpoint converts it to a job — only then does implementation begin.
flowchart LR
U["User: I want to build this"]
PL["Planner<br/>read-only sandbox"]
PLAN["plan.md<br/>Design proposal"]
JOB["Job execution<br/>Implementation starts"]
U --> PL
PL --> PLAN
PLAN --> |"user approval"| JOB
Separating design from implementation means you can stop at the planner stage when “actually that’s not what I wanted.” That’s far cheaper than interrupting a job that’s already running.

3. Heartbeat Memory System
v1’s Heartbeat just extracted “things the user seems to want to do” from conversation and suggested them as task candidates. v2 adds a memory function that accumulates a structured user profile, daily activities, and task signals.
heartbeat-memory.json structure:
{
"updated_at": "2026-03-23T10:00:00Z",
"profile": [
"Writes a blog every day",
"Knowledgeable about security articles"
],
"today": {
"2026-03-23": [
"Ate curry",
"Went for a walk"
]
},
"task_signals": [
{
"text": "Want to look into automatic OGP image generation",
"created_at": "2026-03-23T10:00:00Z",
"date": "2026-03-23",
"source": "heartbeat"
}
]
}
In the Heartbeat loop (default: every hour), Haiku extracts 3 categories of information from the last 60 user messages:
| Category | Content | Cap |
|---|---|---|
| profile | Hobbies, fixed preferences | 40 entries |
| today | Things done today, places visited, food eaten | Date-keyed, 14 days |
| task_signals | Actionable intent signals | 80 entries |
This pays off in job context building. v1 passed raw conversation history; v2 includes structured information from Heartbeat memory — “this person has these preferences, did these things today, wants to do these things” — in job prompts.
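The merge step of the Heartbeat loop is mostly about enforcing the caps from the table. A rough sketch under assumptions: the function name, deduplication of profile entries, and dropping the `created_at` timestamp for brevity are all mine, not the actual code.

```python
PROFILE_CAP = 40
TODAY_DAYS_CAP = 14
TASK_SIGNAL_CAP = 80

def merge_memory(memory: dict, extracted: dict, today: str) -> dict:
    """Fold one Heartbeat extraction into heartbeat-memory.json, applying caps."""
    profile = memory.get("profile", [])
    for entry in extracted.get("profile", []):
        if entry not in profile:  # avoid duplicate profile facts
            profile.append(entry)
    memory["profile"] = profile[-PROFILE_CAP:]

    # Date-keyed activities; keep only the most recent 14 days.
    days = memory.get("today", {})
    days.setdefault(today, []).extend(extracted.get("today", []))
    memory["today"] = dict(sorted(days.items())[-TODAY_DAYS_CAP:])

    signals = memory.get("task_signals", [])
    for text in extracted.get("task_signals", []):
        signals.append({"text": text, "date": today, "source": "heartbeat"})
    memory["task_signals"] = signals[-TASK_SIGNAL_CAP:]
    return memory
```

Keeping the newest entries (rather than the oldest) when a cap is hit means stale preferences age out on their own.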
When 2+ today activities accumulate, it automatically suggests creating a diary draft. I usually toss diary entries in from my phone myself, but having it quietly write one on days I forget is genuinely useful; reading it back later brings an “oh right, that happened.” Casual notes are enough to become diary entries, and that matters: writing from an iPhone has to be low-friction.
4. Gemini Backend and Hot-Switching
v1 barely mentioned Gemini CLI as one of the workers. v2 made Gemini usable as the main chat backend, with /switch gemini and /switch codex for instant switching.
The issue was sandbox constraints. Gemini’s sandbox can only read files inside the project directory. Conversation state (recent.md, summary.md, memory.md) lives in state/ directly under the project root, which Gemini CLI can’t see.
The solution was mirroring state files to a chat/state/ directory. Whenever state updates, it copies to chat/state/ too. On session resume, Gemini CLI gets a system message saying “read state/recent.md to get up to speed.”
flowchart LR
S["state/<br/>recent.md<br/>summary.md<br/>memory.md"]
M["Mirror<br/>chat/state/"]
G["Gemini CLI<br/>sandbox"]
S --> |"copy on update"| M
M --> |"read"| G
Each backend has a different TUI output format (Codex uses • and Gemini uses ✦ as line markers, for example), so chat.py carries per-backend profiles and switches its parsing logic accordingly.
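The mirroring itself is a plain copy on every state update. A minimal sketch, assuming the directory layout above; the function name and the `project_root` parameter are mine:

```python
import shutil
from pathlib import Path

STATE_FILES = ["recent.md", "summary.md", "memory.md"]

def mirror_state(project_root: Path) -> None:
    """Copy conversation state into the directory the Gemini sandbox can read."""
    src_dir = project_root / "state"
    dst_dir = project_root / "chat" / "state"
    dst_dir.mkdir(parents=True, exist_ok=True)
    for name in STATE_FILES:
        src = src_dir / name
        if src.exists():
            shutil.copy2(src, dst_dir / name)  # copy2 preserves mtimes
```

Mirroring on write (rather than symlinking) matters because the sandbox may refuse to follow links that point outside its allowed root.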
5. Blog Article Validation
This was completely missing in v1. When jobs generate diary articles, they sometimes violate template rules — draft: true not set in frontmatter, missing “食事” (meals) section, wrong image placement.
v2 runs a validation gate after job completion.
Checks performed:
| Check | Content |
|---|---|
| Frontmatter | Presence of draft: true and category: diary |
| Required sections | Presence of ## 食事 and ## 運動 (exercise has an omit flag) |
| Section order | Exercise must come before meals |
| Image placement | When images_last flag is set, all images must be at document end |
| Gallery | When multiple meal images exist, must be wrapped in <div class="gallery"> |
Validation results are written to results/result-validation.md. Errors are included in the job summary so they show up in notifications.
Finding the generated article’s file path is also fairly tricky. A three-stage fallback handles it: find backtick-wrapped paths in the job’s summary.md → estimate the path from the title → take the newest .md in the category directory.
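The frontmatter and section checks from the table reduce to string and regex tests. A simplified sketch, not the real validator: the function name is mine, the omit flag for the exercise section is left out, and only a subset of the checks is shown.

```python
import re

def validate_diary_article(text: str) -> list[str]:
    """Return rule violations for a generated diary article (subset of checks)."""
    errors = []
    fm = re.match(r"^---\n(.*?)\n---", text, re.S)
    front = fm.group(1) if fm else ""
    if "draft: true" not in front:
        errors.append("frontmatter: draft: true missing")
    if "category: diary" not in front:
        errors.append("frontmatter: category: diary missing")

    has_meal = "## 食事" in text
    has_exercise = "## 運動" in text
    if not has_meal:
        errors.append("required section ## 食事 missing")
    if not has_exercise:
        errors.append("required section ## 運動 missing")
    # Section order: exercise must come before meals.
    if has_meal and has_exercise and text.index("## 運動") > text.index("## 食事"):
        errors.append("section order: ## 運動 must come before ## 食事")
    return errors
```

Returning a list of human-readable strings (rather than raising) makes it easy to dump everything into results/result-validation.md and the job summary at once.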

6. PWA and Push Notifications
v1 said “notifications come when done,” but the implementation was just browser notifications via WebSocket. Meaning no notifications unless the browser is open. Calling it fire-and-forget while requiring an open browser defeats the point.
A native app would solve it, but push notifications in iOS apps require Apple Developer Program membership ($99/year), which is honestly annoying. Android could just use Firebase and call it done, but my main device is an iPhone, so that doesn’t help. This is a tool only I use, not something to submit to the App Store. On top of that, a free developer account requires re-deploying to the physical device every 7 days, which I didn’t want to deal with.
PWA turned out to be the easiest path. Web Push API is supported in Safari on iOS 16.4+, so with a Service Worker you can send push notifications just like a native app. Add to home screen and it feels like an app, with no App Store submission or certificate management needed.
v2 added a Service Worker for PWA functionality and supports VAPID-based Web Push. Job completions, failures, and cancellations push to iPhone in the background. Icons change per status.
{
"title": "Kana-chan: ✅ Job Complete",
"body": "✅ Done: Blog article created",
"tag": "job-completed-job-20260323-192000",
"kind": "job_complete",
"icon": "/static/icons/push/completed.png"
}
VAPID keys are passed via environment variable or auto-generated on first launch, stored in state/webpush-vapid.json. Subscriptions are managed in SQLite, with endpoints returning 404/410 automatically invalidated.
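Building the payload above is just string assembly keyed on job status. A hedged sketch: the function name, the title/icon mapping, and the tag format are illustrative guesses modeled on the example payload, and actual delivery would go through a Web Push library (e.g. pywebpush) with the stored VAPID keys.

```python
import json

# Assumed mapping from event kind to title and pixel-art icon name.
STATUS_META = {
    "job_complete": ("✅ Job Complete", "completed"),
    "job_failed": ("❌ Job Failed", "failed"),
}

def build_push_payload(kind: str, job_id: str, body: str) -> str:
    """Serialize one push notification as the JSON sent to the Service Worker."""
    title, icon = STATUS_META[kind]
    return json.dumps({
        "title": f"Kana-chan: {title}",
        "body": body,
        "tag": f"{kind}-{job_id}",  # stable tag so re-sends replace, not stack
        "kind": kind,
        "icon": f"/static/icons/push/{icon}.png",
    })
```

A stable `tag` per job keeps retries from piling up as duplicate notifications on the lock screen.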
7. Tag-Driven Actions and Runtime Notes
Added a mechanism to trigger actions from chat backend responses. When the model’s response contains specific tags, the server parses and executes them.
| Tag | Behavior |
|---|---|
| [[EXEC:SEARCH:query]] | Run async web search, inject results in next turn |
| [[EXEC:JOB:description]] | Send job confirmation dialog to frontend |
Tags are stripped before displaying to user. [[EXEC:JOB:...]] never auto-executes a job — a confirmation UI always appears on the frontend first.
Runtime notes bring async processing results back into chat. Background search results, job completion notifications, etc. are injected on the next user message (up to 6 items, 900 chars each). From the chat backend CLI’s perspective, it looks like additional information arriving as system messages.
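Tag parsing and stripping can be done with one regex pass. A minimal sketch (function name mine, and the real parser presumably handles more edge cases):

```python
import re

TAG_RE = re.compile(r"\[\[EXEC:(SEARCH|JOB):(.+?)\]\]")

def extract_exec_tags(reply: str) -> tuple[str, list[tuple[str, str]]]:
    """Split a model reply into user-visible text and (action, argument) pairs."""
    actions = [(m.group(1), m.group(2)) for m in TAG_RE.finditer(reply)]
    visible = TAG_RE.sub("", reply).strip()  # never show raw tags to the user
    return visible, actions
```

The server then dispatches on the action: SEARCH kicks off the async search, JOB only renders the confirmation dialog.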
8. Image Input
v1 chat was text-only. To share a screenshot, you had to give a file path and say “look at this.” There was no way to pass images from iPhone at all.
v2 lets you send images directly from the chat UI — select from iPhone’s camera roll or take a photo on the spot. FastAPI receives the image, Base64-encodes it, and passes it to the chat backend’s vision API. Both Claude and Gemini support vision, so responses account for the image content.
flowchart LR
U["iPhone<br/>image selection / camera"]
API["FastAPI<br/>Base64 encode"]
LLM["Chat backend<br/>Vision API"]
U --> API
API --> LLM
The most common uses are “fix this error screen” and “change this part of this screen.” Sending one screenshot is faster than explaining UI layout in text. Image attachment to jobs is also supported, so you can show a reference image and say “implement something like this.”
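The server-side encode step is a one-liner around base64. Sketch only: the payload shape is an assumption, since each backend’s vision API has its own schema.

```python
import base64

def encode_image_for_backend(data: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw upload bytes in a base64 payload for a vision-capable backend.

    The {mime_type, data} shape here is illustrative, not any backend's
    actual schema.
    """
    return {"mime_type": mime, "data": base64.b64encode(data).decode("ascii")}
```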
9. Speech Transcription and Tailscale HTTPS
Added speech transcription to chat. Records from iPhone microphone, transcribes in real-time using the browser’s Web Speech API (SpeechRecognition). No need to send audio to the server — transcription completes in the browser. When flick-typing a long message is tedious, you can just speak the instruction. Transcription results are sent as text messages so the chat side can’t tell them apart from typed input.
flowchart LR
MIC["iPhone microphone"]
SR["Browser<br/>SpeechRecognition API<br/>real-time transcription"]
MSG["Send as<br/>text message"]
API["FastAPI"]
MIC --> SR
SR --> MSG
MSG --> API
However, SpeechRecognition API only works in a secure context (HTTPS). v1 connected directly to FastAPI over HTTP via Tailscale VPN, but HTTP causes iOS Safari to deny microphone access.
The fix was adding Tailscale Serve in between. Tailscale Serve auto-provisions TLS certificates for locally running HTTP servers and exposes them over HTTPS. No manual certificate management — HTTPS access via MagicDNS name (<hostname>.<tailnet>.ts.net) just works.
tailscale serve https:443 / http://localhost:8000
This enabled SpeechRecognition API and got speech transcription working. As a bonus, WebSocket connections upgraded to WSS, making all communications TLS-encrypted. Image uploads and everything else that was flowing in plaintext in v1 is now encrypted.
Other Minor Changes
| Change | Detail |
|---|---|
| iPhone textarea input bar handling | visualViewport API to detect viewport resize and dynamically adjust textarea position by keyboard height. env(safe-area-inset-bottom) alone can’t absorb input bar height — need to calculate from the difference between visualViewport.height and window.innerHeight |
| Streaming chat | WebSocket sends thinking → stream → done events incrementally. Cancellation supported. v1 was polling |
| Job Watchdog | Extracted stall detection to external process. Claude Code worker: 300s stall threshold, reviewer: 180s |
| Worker lanes | Concurrent job count is configurable (default 2 lanes). v1 was sequential |
| Job timeout | 3600 seconds (1 hour) forced kill. 120-second grace period immediately after start |
| Git safety settings | AUTO_INIT_GIT_REPO and AUTO_CREATE_GH_REPO default to false. Jobs won’t spontaneously create repositories |