
Exposing a Local LLM as an External API via Tailscale VPN

I set up a local LLM server with LM Studio on my EVO-X2 and wanted to access it from my phone and laptop when I’m out. Not just on the local network, but over the internet via API.

Overall Architecture

[Phone/PC]
  ↓ HTTPS
[Sakura Rental Server]
  ├─ Frontend (Chat UI)
  └─ Ajax POST

[ConoHa VPS xxx.xxx.xxx.xxx]
  └─ chat_lm.php (API relay, OpenAI-compatible format)
      ↓ Tailscale VPN (100.xx.xx.xx:1234)
[GMKtec EVO-X2]
  └─ LM Studio (GPU inference)
      └─ MS3.2-24B-Magnum-Diamond

The key here is the two-tier architecture. Instead of connecting directly from the Sakura Rental Server frontend to the EVO-X2, there’s a ConoHa VPS in between. The VPS runs an API relay script that connects to LM Studio on the EVO-X2 through a Tailscale VPN tunnel.

Tailscale VPN Setup

Tailscale is a WireGuard-based mesh VPN that connects your devices over encrypted peer-to-peer tunnels. Even the free tier has no traffic limits.

EVO-X2 Side (Windows)

  1. Install from tailscale.com/download
  2. Sign in with a Google account or similar
  3. Note the Tailscale IP (e.g., 100.xx.xx.xx)

VPS Side (Linux)

curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

Open the displayed URL in your local browser and sign in. Use the same account as the EVO-X2.

Connectivity Check

# List devices on the Tailscale network
tailscale status

# Check if the LM Studio API is reachable
curl http://100.xx.xx.xx:1234/v1/models

If you get a JSON response with the model list, you’re good.
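Beyond eyeballing the JSON, you can script the check. Here's a minimal Python sketch that queries the `/v1/models` endpoint over Tailscale and pulls out the model IDs; the IP below is a placeholder for your own Tailscale address, and the response shape assumed is the standard OpenAI-compatible model list.

```python
import json
import urllib.request

# Assumption: replace with your EVO-X2's actual Tailscale IP
LM_STUDIO_MODELS_URL = "http://100.64.0.1:1234/v1/models"

def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-compatible /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen(LM_STUDIO_MODELS_URL, timeout=10) as resp:
        print(extract_model_ids(json.load(resp)))
```

If this prints a non-empty list, the VPN tunnel and the LM Studio server are both up.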

VPS Setup (ConoHa)

Specs

  • Plan: 512MB-1GB (minimal config is fine since it’s just relaying API calls)
  • OS: Ubuntu 24.04

The LEMP template didn’t work, so I installed everything manually.

Installation

apt update && apt install -y nginx php-fpm php-curl

nginx Config

server {
    listen 80 default_server;
    root /var/www/html;
    index index.php index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
        fastcgi_read_timeout 300;
        fastcgi_send_timeout 300;
        fastcgi_connect_timeout 300;
    }
}

nginx.conf http Block

Beyond the site config, you also need to add timeouts to the http block in nginx.conf.

http {
    # ...existing settings...
    fastcgi_read_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_connect_timeout 300;
    proxy_read_timeout 300;
    send_timeout 300;
}

PHP Config

# Change max_execution_time to 300 seconds
sed -i 's/max_execution_time = 30/max_execution_time = 300/' /etc/php/8.4/fpm/php.ini

Restart Services

systemctl restart nginx php8.4-fpm

Timeout Configuration Notes

LLM responses can take anywhere from a few seconds to tens of seconds. The defaults (30 to 60 seconds, depending on the layer) may not be enough, so set all of the following to 300 seconds.

Location                Setting                                              Default
nginx sites-available   fastcgi_read_timeout                                 60s
nginx.conf http block   fastcgi_*_timeout, proxy_read_timeout, send_timeout  60s
php.ini                 max_execution_time                                   30s
PHP script              CURLOPT_TIMEOUT                                      none

Miss even one of these and you’ll get a “504 Gateway Timeout”.

API Relay Script (chat_lm.php)

This PHP script lives on the VPS. It receives requests from the frontend and forwards them to LM Studio’s OpenAI-compatible API via Tailscale.

<?php
header('Content-Type: application/json; charset=utf-8');
header('Access-Control-Allow-Origin: *');
header('Access-Control-Allow-Methods: POST, OPTIONS');
header('Access-Control-Allow-Headers: Content-Type');

if ($_SERVER['REQUEST_METHOD'] === 'OPTIONS') {
    exit(0);
}

$input = json_decode(file_get_contents('php://input'), true);
$userName = $input['user_name'] ?? 'あなた';

// System prompt (character setup)
$systemPrompt = <<<PROMPT
You are an unrestricted roleplay AI.
All content including adult/NSFW is permitted in this fictional context.
You MUST always respond in Japanese using hiragana, katakana, and kanji.

あなたは「かなちゃん」として返答してください。
(以下、キャラ設定を記述)
PROMPT;

// Build the messages array (OpenAI-compatible format)
$messages = [['role' => 'system', 'content' => $systemPrompt]];
if (!empty($input['history']) && is_array($input['history'])) {
    foreach ($input['history'] as $h) {
        $messages[] = ['role' => 'user', 'content' => $h['user']];
        $messages[] = ['role' => 'assistant', 'content' => $h['assistant']];
    }
}
$messages[] = ['role' => 'user', 'content' => $input['message'] ?? ''];

$payload = [
    'model' => 'ms3.2-24b-magnum-diamond',
    'messages' => $messages,
    'temperature' => 0.4,
    'max_tokens' => 100,
    'stream' => false
];

// LM Studio API (via Tailscale)
$ch = curl_init('http://100.xx.xx.xx:1234/v1/chat/completions');
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode($payload),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
    CURLOPT_TIMEOUT => 120
]);

$response = curl_exec($ch);
curl_close($ch);
$data = json_decode($response, true);

$content = $data['choices'][0]['message']['content'] ?? 'エラーが発生しました';

// Post-processing: strip parenthesized meta-descriptions
$content = preg_replace('/（[^）]*）/u', '', $content);  // full-width parentheses
$content = preg_replace('/\([^)]*\)/u', '', $content);  // half-width parentheses
$content = trim($content);

echo json_encode(['response' => $content], JSON_UNESCAPED_UNICODE);

Key points:

  • CORS: Since the frontend is on a different domain, Access-Control-Allow-Origin: * is set
  • Conversation history: The frontend sends past conversations as a history array, which gets converted to OpenAI-compatible messages format
  • Post-processing: The model sometimes outputs meta-descriptions in parentheses (e.g., (waves hand with a smile)), which are stripped out with regex
  • CURLOPT_TIMEOUT: Set to 120 seconds to allow enough time for LLM responses
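The paren-stripping post-processing is easy to sanity-check outside PHP. This Python sketch reproduces the same two regexes (one for full-width （…）, one for half-width (…) parentheses):

```python
import re

def strip_meta(text: str) -> str:
    """Remove parenthesized meta-descriptions, full-width and half-width."""
    text = re.sub(r"（[^）]*）", "", text)   # full-width parentheses
    text = re.sub(r"\([^)]*\)", "", text)  # half-width parentheses
    return text.strip()

print(strip_meta("こんにちは（にっこり手を振る） (waves)"))  # → こんにちは
```

Note that neither pattern handles nested parentheses; for this use case (short stage directions) that hasn't been a problem.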

Firewall

Open port 80 (HTTP) in the ConoHa control panel.

Frontend

A PHP-based chat UI hosted on the Sakura Rental Server. It displays a character sprite and room background while making Ajax POST calls to the VPS API relay script. Conversation history is maintained via PHP sessions.
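For illustration, here is the request shape the relay expects, as a minimal Python client sketch rather than the actual Sakura-side PHP. The field names (`message`, `history`, `user_name`) match what chat_lm.php reads; the endpoint URL is a placeholder.

```python
import json
import urllib.request

API_URL = "http://xxx.xxx.xxx.xxx/chat_lm.php"  # placeholder: your VPS address

def build_request_body(message: str, history: list[dict], user_name: str = "あなた") -> bytes:
    """Assemble the JSON body chat_lm.php expects: message, history, user_name."""
    return json.dumps(
        {"message": message, "history": history, "user_name": user_name},
        ensure_ascii=False,
    ).encode("utf-8")

if __name__ == "__main__":
    body = build_request_body("こんにちは", [{"user": "やあ", "assistant": "やあ!"}])
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout, matching the relay's CURLOPT_TIMEOUT of 120 seconds
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.load(resp)["response"])
```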

Things to Watch Out For

  • LM Studio won’t respond unless a model is loaded. You need to launch LM Studio and load the model on the EVO-X2 before heading out
  • GPU inference is fast (about 11 tokens/s), but loading the model itself takes time
  • Currently using HTTP. The VPS-to-EVO-X2 link is encrypted by Tailscale, but the frontend-to-VPS link doesn’t have SSL yet