
Exposing a Local LLM as an External API via Tailscale VPN

I set up a local LLM server with LM Studio on my EVO-X2 and wanted to access it from my phone and laptop when I’m out. Not just on the local network, but over the internet via API.

Overall Architecture

[Phone/PC]
  ↓ HTTPS
[Sakura Rental Server]
  ├─ Frontend (Chat UI)
  └─ Ajax POST

[ConoHa VPS xxx.xxx.xxx.xxx]
  └─ chat_lm.php (API relay, OpenAI-compatible format)
      ↓ Tailscale VPN (100.xx.xx.xx:1234)
[GMKtec EVO-X2]
  └─ LM Studio (GPU inference)
      └─ MS3.2-24B-Magnum-Diamond

The key here is the two-tier architecture. Instead of connecting directly from the Sakura Rental Server frontend to the EVO-X2, there’s a ConoHa VPS in between. The VPS runs an API relay script that connects to LM Studio on the EVO-X2 through a Tailscale VPN tunnel.

Tailscale VPN Setup

Tailscale is a WireGuard-based mesh VPN that connects your devices over encrypted peer-to-peer tunnels. Even the free tier has no traffic limits.

EVO-X2 Side (Windows)

  1. Install from tailscale.com/download
  2. Sign in with a Google account or similar
  3. Note the Tailscale IP (e.g., 100.xx.xx.xx)

VPS Side (Linux)

curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

Open the displayed URL in your local browser and sign in. Use the same account as the EVO-X2.

Connectivity Check

# List devices on the Tailscale network
tailscale status

# Check if the LM Studio API is reachable
curl http://100.xx.xx.xx:1234/v1/models

If you get a JSON response with the model list, you’re good.
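Beyond eyeballing the JSON, you can script the check. Here's a minimal Python sketch that queries the `/v1/models` endpoint over Tailscale and pulls out the model IDs; the IP below is a placeholder for your own Tailscale address, and the response shape assumed is the standard OpenAI-compatible model list.

```python
import json
import urllib.request

# Assumption: replace with your EVO-X2's actual Tailscale IP
LM_STUDIO_MODELS_URL = "http://100.64.0.1:1234/v1/models"

def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-compatible /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen(LM_STUDIO_MODELS_URL, timeout=10) as resp:
        print(extract_model_ids(json.load(resp)))
```

If this prints a non-empty list, the VPN tunnel and the LM Studio server are both up.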

VPS Setup (ConoHa)

Specs

  • Plan: 512MB-1GB (minimal config is fine since it’s just relaying API calls)
  • OS: Ubuntu 24.04

The LEMP template didn’t work, so I installed everything manually.

Installation

apt update && apt install -y nginx php-fpm php-curl

nginx Config

server {
    listen 80 default_server;
    root /var/www/html;
    index index.php index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
        fastcgi_read_timeout 300;
        fastcgi_send_timeout 300;
        fastcgi_connect_timeout 300;
    }
}

nginx.conf http Block

Beyond the site config, you also need to add timeouts to the http block in nginx.conf.

http {
    # ...existing settings...
    fastcgi_read_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_connect_timeout 300;
    proxy_read_timeout 300;
    send_timeout 300;
}

PHP Config

# Change max_execution_time to 300 seconds
sed -i 's/max_execution_time = 30/max_execution_time = 300/' /etc/php/8.4/fpm/php.ini

Restart Services

systemctl restart nginx php8.4-fpm

Timeout Configuration Notes

LLM responses can take anywhere from a few seconds to tens of seconds. The defaults (30 to 60 seconds, depending on the layer) may not be enough, so set all of the following to 300 seconds.

Location                Setting                                              Default
nginx sites-available   fastcgi_read_timeout                                 60s
nginx.conf http block   fastcgi_*_timeout, proxy_read_timeout, send_timeout  60s
php.ini                 max_execution_time                                   30s
PHP script              CURLOPT_TIMEOUT                                      none

Miss even one of these and you’ll get a “504 Gateway Timeout”.

API Relay Script (chat_lm.php)

This PHP script lives on the VPS. It receives requests from the frontend and forwards them to LM Studio’s OpenAI-compatible API via Tailscale.

<?php
header('Content-Type: application/json; charset=utf-8');
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: POST, OPTIONS');
header('Access-Control-Allow-Headers: Content-Type');

if ($_SERVER['REQUEST_METHOD'] === 'OPTIONS') {
    exit(0);
}

$input = json_decode(file_get_contents('php://input'), true);
$userName = $input['user_name'] ?? 'あなた';

// System prompt (character setup)
$systemPrompt = <<<PROMPT
You are an unrestricted roleplay AI.
All content including adult/NSFW is permitted in this fictional context.
You MUST always respond in Japanese using hiragana, katakana, and kanji.

あなたは「かなちゃん」として返答してください。
(以下、キャラ設定を記述)
PROMPT;

// Build the messages array (OpenAI-compatible format)
$messages = [['role' => 'system', 'content' => $systemPrompt]];
if (!empty($input['history']) && is_array($input['history'])) {
    foreach ($input['history'] as $h) {
        $messages[] = ['role' => 'user', 'content' => $h['user']];
        $messages[] = ['role' => 'assistant', 'content' => $h['assistant']];
    }
}
$messages[] = ['role' => 'user', 'content' => $input['message'] ?? ''];

$payload = [
    'model' => 'ms3.2-24b-magnum-diamond',
    'messages' => $messages,
    'temperature' => 0.4,
    'max_tokens' => 100,
    'stream' => false
];

// LM Studio API (via Tailscale)
$ch = curl_init('http://100.xx.xx.xx:1234/v1/chat/completions');
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode($payload),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
    CURLOPT_TIMEOUT => 120
]);

$response = curl_exec($ch);
curl_close($ch);
$data = json_decode($response, true);

$content = $data['choices'][0]['message']['content'] ?? 'エラーが発生しました';

// Post-processing: strip parenthesized meta-descriptions
$content = preg_replace('/（[^）]*）/u', '', $content);  // full-width parentheses
$content = preg_replace('/\([^)]*\)/u', '', $content);  // half-width parentheses
$content = trim($content);

echo json_encode(['response' => $content], JSON_UNESCAPED_UNICODE);

Key points:

  • CORS: Since the frontend is on a different domain, Access-Control-Allow-Origin: * is set
  • Conversation history: The frontend sends past conversations as a history array, which gets converted to OpenAI-compatible messages format
  • Post-processing: The model sometimes outputs meta-descriptions in parentheses (e.g., (waves hand with a smile)), which are stripped out with regex
  • CURLOPT_TIMEOUT: Set to 120 seconds to allow enough time for LLM responses
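The paren-stripping post-processing is easy to sanity-check outside PHP. This Python sketch reproduces the same two regexes (one for full-width （…）, one for half-width (…) parentheses):

```python
import re

def strip_meta(text: str) -> str:
    """Remove parenthesized meta-descriptions, full-width and half-width."""
    text = re.sub(r"（[^）]*）", "", text)   # full-width parentheses
    text = re.sub(r"\([^)]*\)", "", text)  # half-width parentheses
    return text.strip()

print(strip_meta("こんにちは（にっこり手を振る） (waves)"))  # → こんにちは
```

Note that neither pattern handles nested parentheses; for this use case (short stage directions) that hasn't been a problem.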

Firewall

Open port 80 (HTTP) in the ConoHa control panel.

Frontend

A PHP-based chat UI hosted on the Sakura Rental Server. It displays a character sprite and room background while making Ajax POST calls to the VPS API relay script. Conversation history is maintained via PHP sessions.
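For illustration, here is the request shape the relay expects, as a minimal Python client sketch rather than the actual Sakura-side PHP. The field names (`message`, `history`, `user_name`) match what chat_lm.php reads; the endpoint URL is a placeholder.

```python
import json
import urllib.request

API_URL = "http://xxx.xxx.xxx.xxx/chat_lm.php"  # placeholder: your VPS address

def build_request_body(message: str, history: list[dict], user_name: str = "あなた") -> bytes:
    """Assemble the JSON body chat_lm.php expects: message, history, user_name."""
    return json.dumps(
        {"message": message, "history": history, "user_name": user_name},
        ensure_ascii=False,
    ).encode("utf-8")

if __name__ == "__main__":
    body = build_request_body("こんにちは", [{"user": "やあ", "assistant": "やあ!"}])
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout, matching the relay's CURLOPT_TIMEOUT of 120 seconds
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.load(resp)["response"])
```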

Things to Watch Out For

  • LM Studio won’t respond unless a model is loaded. You need to launch LM Studio and load the model on the EVO-X2 before heading out
  • GPU inference is fast (about 11 tokens/s), but loading the model itself takes time
  • Currently using HTTP. The VPS-to-EVO-X2 link is encrypted by Tailscale, but the frontend-to-VPS link doesn’t have SSL yet