Can Local Vision LLMs Extract RPG Stats from Character Art?

I got the urge to build a pipeline that converts photos into pixel art and composites them into RPG battle scenes. The end goal is a camera-to-print workflow: snap a photo, pixelate it, generate stats, composite a battle scene, and print. But first, I need to nail down the intermediate steps.

Converting to pixel art is already covered by Qwen Image Edit. The open question is the “look at an image and generate RPG stats” step, which is a job for a Vision LLM (VLM). I ran tests with several local models to find the most practical option.

Goal

Feed character images to local Vision LLMs and get back RPG parameters in JSON format like this:

{
  "name": "Character Name",
  "class": "Class",
  "level": 23,
  "hp": 180,
  "mp": 45,
  "attack": 72,
  "defense": 58,
  "speed": 41,
  "special": ["Skill 1", "Skill 2"],
  "description": "Short description"
}

The model should infer stats from the image’s visual characteristics (outfit, expression, pose, vibe). There’s no right answer, so I’m comparing output quality, creativity, JSON stability, and speed.
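
"JSON stability" can be checked mechanically before judging creativity. The following is a minimal sketch, assuming the target format above is the full schema; the names `EXPECTED_FIELDS` and `find_schema_problems` are hypothetical and not part of the original test script.

```python
# Expected fields and their Python types, taken from the target JSON above.
EXPECTED_FIELDS = {
    "name": str, "class": str, "level": int, "hp": int, "mp": int,
    "attack": int, "defense": int, "speed": int,
    "special": list, "description": str,
}

def find_schema_problems(stats):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected_type in EXPECTED_FIELDS.items():
        if key not in stats:
            problems.append(f"missing field: {key}")
            continue
        value = stats[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    special = stats.get("special")
    if isinstance(special, list) and not all(isinstance(s, str) for s in special):
        problems.append("special must contain only strings")
    return problems
```

An output that parses as JSON but comes back with a non-empty problem list would count as a format failure in this sketch.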

Test Environment

M1 Max 64GB
Ollama 0.17.7 (upgraded to 0.20.6 for the Gemma 4 test)
macOS

Test Images

Four images of the same character in different art styles and proportions.

Image 1: Full Illustration	Image 2: Pixel Art (Taller)	Image 3: Pixel Art (Shorter)	Image 4: Pixel Art (2-Head)

School uniform + demon wings, moon background	Transparent background	Transparent background	Beige background

The interesting question is how VLMs interpret different art styles, and whether they can generate stats from pixel art at all.

Shared Prompt

All models received the same prompt. Images were sent as base64 via Ollama’s API with stream: false.

Look at this character image and generate Dragon Quest-style RPG stats in JSON format.
Infer plausible parameters from the image's appearance.
Output JSON only, no other text.
{"name": "Character Name", "class": "Class", "level": number, "hp": number,
 "mp": number, "attack": number, "defense": number, "speed": number,
 "special": ["Skill 1", "Skill 2", "Skill 3"],
 "description": "Short description"}

(The actual prompt was in Japanese for consistency.)

API calls were made with a simple Python script:

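import base64
import json
import urllib.request

# English rendering of the shared prompt above; the actual runs used Japanese.
PROMPT = (
    "Look at this character image and generate Dragon Quest-style RPG stats "
    "in JSON format.\n"
    "Infer plausible parameters from the image's appearance.\n"
    "Output JSON only, no other text.\n"
    '{"name": "Character Name", "class": "Class", "level": number, "hp": number,\n'
    ' "mp": number, "attack": number, "defense": number, "speed": number,\n'
    ' "special": ["Skill 1", "Skill 2", "Skill 3"],\n'
    ' "description": "Short description"}'
)

def generate_rpg_params(model, image_path):
    # Ollama expects images as base64-encoded strings.
    with open(image_path, 'rb') as f:
        img = base64.b64encode(f.read()).decode()
    payload = json.dumps({
        'model': model,
        'prompt': PROMPT,
        'images': [img],
        'stream': False  # return one complete response instead of chunks
    }).encode()
    req = urllib.request.Request(
        'http://localhost:11434/api/generate',
        data=payload,
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        # Full API response; the generated text is in its 'response' field.
        return json.loads(resp.read())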
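
The function above returns the whole API response, so the stats still have to be pulled out of the generated text. Here is a rough sketch of that step, assuming the model may wrap its JSON in Markdown code fences or surround it with extra prose. `extract_stats` is a hypothetical helper, not part of the original script.

```python
import json
import re

def extract_stats(response_text):
    """Pull the first JSON object out of a model reply, or return None."""
    # Remove Markdown fence markers (three backticks, optionally
    # followed by "json") that some models wrap around their output.
    cleaned = re.sub(r"`{3}(?:json)?", "", response_text)
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

It would be called as `extract_stats(generate_rpg_params(model, path)["response"])`. A `None` result is a convenient signal for "this model broke the format on this image".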

Results

Gemma 3 4B

Model: gemma3:4b (3.3GB)

Image 1: Full Illustration (8.2s)

{
  "name": "Luna",
  "class": "Fairy Warrior",
  "level": 5,
  "hp": 120, "mp": 150, "attack": 30, "defense": 25, "speed": 60,
  "special": ["Starlight Magic", "Moon Shadow Dance", "Fairy's Whisper"],
  "description": "A fairy dancing through the night sky. Outmaneuvers enemies with swift moves and magic."
}

Image 2: Pixel Art, Taller (5.5s)

{
  "name": "Yui",
  "class": "School Girl",
  "level": 5,
  "hp": 80, "mp": 120, "attack": 15, "defense": 10, "speed": 20,
  "special": ["Bubble", "Prank", "Heal"],
  "description": "A cheerful girl on her way home from school. Kind-hearted and loyal to her friends."
}

Image 3: Pixel Art, Shorter (5.2s)

{
  "name": "Alice",
  "class": "PreCure",
  "level": 5,
  "hp": 120, "mp": 80, "attack": 45, "defense": 30, "speed": 60,
  "special": ["Whistle", "Fresh", "Star"],
  "description": "A spirited girl who'll face any challenge to protect her friends."
}

Image 4: Pixel Art, 2-Head (5.3s)

{
  "name": "Luna",
  "class": "Mage",
  "level": 5,
  "hp": 80, "mp": 150, "attack": 30, "defense": 20, "speed": 60,
  "special": ["Merafire", "Meraice", "Merahooo"],
  "description": "A cool, slightly mature mage. Calm normally, but unleashes powerful magic in battle."
}

Findings

JSON format was stable across all 4 images
Level locked at 5 for every image. Little variation in stats
Recognized the wings and moon, generating “Fairy Warrior” and “Moon Shadow Dance” appropriately
“PreCure” as a class doesn’t fit Dragon Quest at all
Invented non-existent spells like “Merahooo” (it seems aware of Dragon Quest spell naming but can’t reproduce it accurately)
Pixel art recognition worked fine
Inference speed was fast at 5-8 seconds
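
The "locked level" observation can be quantified by looking at the spread of each stat across images. A small sketch, using the four outputs listed above; `stat_spread` and `STAT_KEYS` are hypothetical names introduced for illustration.

```python
STAT_KEYS = ["level", "hp", "mp", "attack", "defense", "speed"]

def stat_spread(results):
    """Map each stat to (min, max, number of distinct values) across outputs."""
    spread = {}
    for key in STAT_KEYS:
        values = [r[key] for r in results if key in r]
        if values:
            spread[key] = (min(values), max(values), len(set(values)))
    return spread

# The four Gemma 3 4B outputs listed above, numeric stats only.
gemma3_4b = [
    {"level": 5, "hp": 120, "mp": 150, "attack": 30, "defense": 25, "speed": 60},
    {"level": 5, "hp": 80, "mp": 120, "attack": 15, "defense": 10, "speed": 20},
    {"level": 5, "hp": 120, "mp": 80, "attack": 45, "defense": 30, "speed": 60},
    {"level": 5, "hp": 80, "mp": 150, "attack": 30, "defense": 20, "speed": 60},
]
spread = stat_spread(gemma3_4b)
print(spread["level"])  # a distinct-value count of 1 means the stat never changed
```

The same function can be run over any other model's outputs to compare how much its numbers move between images.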

Qwen2.5-VL 7B

Model: qwen2.5vl:7b (6.0GB)

Image 1: Full Illustration (24.1s)

{
  "name": "Louise",
  "class": "Warrior",
  "level": 10,
  "hp": 150, "mp": 50, "attack": 80, "defense": 70, "speed": 60,
  "special": ["Sword Dance", "Magic Armor", "Healing Potion"],
  "description": "A girl with blue wings. Standing quietly under the night sky, a mysterious warrior."
}

Images 2-4: Pixel Art (3.7-4.2s)

All three pixel art images returned nearly identical stats: class “Warrior”, HP 150, MP 50, Attack 80, Defense 70, Speed 60, with the same skills (“Sword Dance”, “Shield”, “High Speed Move”). The only variation was in names and descriptions.

Findings

JSON format was stable across all images
Class locked to “Warrior” for every image. Visual features (wings, magical vibe) weren’t reflected in the class
Images 3 and 4 returned completely identical stat blocks. Heavy template feel
Skills were nearly identical across images. Little per-image differentiation
Did recognize the wings (description mentions “girl with blue wings”) but didn’t reflect it in stats
Pixel art recognition worked fine, even picking up small details like ribbons
Full illustration took 24 seconds, while pixel art was 3-4 seconds
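
The results here don't explain the gap between 24 seconds and 3-4 seconds. One way to investigate it, offered as a sketch rather than something these tests did, is to read the timing fields that Ollama includes in a non-streaming `/api/generate` response. They are reported in nanoseconds. `timing_breakdown` is a hypothetical helper.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000

def timing_breakdown(ollama_response):
    """Convert Ollama's nanosecond timing fields to seconds."""
    fields = ["total_duration", "load_duration",
              "prompt_eval_duration", "eval_duration"]
    # Fields missing from the response are reported as 0.0 seconds.
    return {
        name: ollama_response.get(name, 0) / NANOSECONDS_PER_SECOND
        for name in fields
    }
```

A large `load_duration` on the first call would point to model loading, while a large `prompt_eval_duration` would point to the image itself.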

Gemma 3 12B

Model: gemma3:12b (8.1GB)

Image 1: Full Illustration (12.2s)

{
  "name": "Ciel Valkyria",
  "class": "Fallen Angel Knight",
  "level": 27,
  "hp": 185, "mp": 112, "attack": 78, "defense": 55, "speed": 93,
  "special": ["Fallen Light Ray", "Sanctuary Revoke", "Charm Song"],
  "description": "A composed fallen angel. Beguiles with beauty and power, sometimes leading others to ruin."
}

Image 2: Pixel Art, Taller (9.2s)

{
  "name": "Kasuga An",
  "class": "School Idol",
  "level": 7,
  "hp": 68, "mp": 42, "attack": 25, "defense": 18, "speed": 35,
  "special": ["Cheer Song", "Flash Dance", "Cute Attack"],
  "description": "The light that illuminates the stage. A girl with idol sparkle and hidden fighting spirit."
}

Image 3: Pixel Art, Shorter (8.2s)

{
  "name": "Kanon",
  "class": "Mage",
  "level": 7,
  "hp": 65, "mp": 120, "attack": 30, "defense": 25, "speed": 70,
  "special": ["Fire", "Ice", "Lightning"],
  "description": "A bright, cheerful mage. Surprisingly strong sense of justice."
}

Image 4: Pixel Art, 2-Head (9.3s)

{
  "name": "Melancholy Hanako",
  "class": "Science Mage",
  "level": 5,
  "hp": 85, "mp": 120, "attack": 30, "defense": 25, "speed": 60,
  "special": ["Frown Breath", "Study-Reading Mana", "Comeback Judgment"],
  "description": "A science-type mage who always looks disgruntled. Wields powerful magic with genius intellect."
}

Findings

JSON format was stable across all images
By far the most creative output among all models tested. Excellent at reflecting image features in classes and skills
Derived “Fallen Angel Knight” from the wings + uniform + moon combination, with thematically consistent skills like “Fallen Light Ray” and “Sanctuary Revoke”
Read the slightly grumpy expression on the 2-head pixel art and generated “Melancholy Hanako”, “Science Mage”, and “Frown Breath”. Impressive expression recognition
Level varied by image mood (27 for the full illustration, 5-7 for pixel art). Avoids monotony
Stat distribution showed personality. Fallen Angel Knight had speed at 93, mages had MP at 120
Inference speed was 8-12 seconds. Slower than 4B but perfectly usable

Llama 3.2 Vision 11B

Model: llama3.2-vision:11b (7.8GB)

Image 1: Full Illustration (19.2s)

{"name": "Hana", "class": "Vampire", "level": 10, "hp": 100, "mp": 50,
 "attack": 20, "defense": 15, "speed": 30,
 "special": ["Blood Drain", "Shadow Bite", "Darkness"],
 "description": "A young vampire with a mysterious past..."}

Image 2: Pixel Art, Taller (17.5s)

Output was Markdown instead of JSON. Skills were placeholder text (“magic1”, “magic2”, “magic3”).

Image 3: Pixel Art, Shorter (15.8s)

{"name": "Eliza", "class": "Magic Girl", "level": 10, "hp": 100, "mp": 80,
 "attack": 15, "defense": 10, "speed": 12,
 "special": ["Fire Flame", "Water Wave", "Wind Whirl", "Electric Light", "Earth Wall"],
 "description": "Genius girl, Eliza wields magic to save the world and conquer all"}

Image 4: Pixel Art, 2-Head (14.2s)

{"name": "Saya", "class": "Witch", "level": 10, "hp": 100, "mp": 50,
 "attack": 20, "defense": 15, "speed": 30,
 "special": ["Fireball", "Ice Wall", "Electric Whip"],
 "description": "A witch being a witch, basically"}

Findings

Responded in English despite receiving a Japanese prompt (Image 1). Poor Japanese instruction-following
Unstable JSON format. Image 2 returned Markdown. Image 3 mixed Markdown headings with JSON
Level 10 and HP 100 fixed across images. Template-like parameters
Descriptions were lazy (“A witch being a witch, basically”)
In the FF-style prompt test (see Prompt Variation Test below), it produced French-sounding hallucinations (“Til de Gint”, “Mars ex Deus”)
In the Carddass-style prompt test, it returned English (“Vampirina”, “The Moonlight”) and output both JSON and Markdown simultaneously
Not suitable for Japanese content generation tasks
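
One option these tests did not try is Ollama's `format` request parameter, which asks the server to constrain the output to JSON. Whether it would fix the Markdown leaks seen here is an open question. The sketch below only shows how the request body would change, and `build_payload` is a hypothetical helper.

```python
import json

def build_payload(model, prompt, image_b64, force_json=True):
    """Build the request body for Ollama's /api/generate endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
    }
    if force_json:
        # Ollama's "format" option asks the server to constrain output to JSON.
        body["format"] = "json"
    return json.dumps(body).encode()
```

Forcing JSON would not address the language problem, so English output for a Japanese prompt would still need a separate check.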

Gemma 4 E4B

Model: gemma4:e4b (9.6GB)

Ollama 0.17.7 returned a 412 error when pulling this model. Version 0.19+ is required; I upgraded to 0.20.6.

Image 1: Full Illustration (28.9s)

{
  "name": "Lily",
  "class": "Wing Bearer (Yokoshi)",
  "level": 25,
  "hp": 2800, "mp": 6500, "attack": 1200, "defense": 1500, "speed": 3500,
  "special": [
    "Heavenly Healing Dance: A light spell that greatly restores all allies' HP.",
    "Azure Wing Strike: A wide-range attack releasing powerful light waves from giant wings.",
    "Feather Sound Rush: A support spell that greatly recovers own Speed and MP while binding enemies."
  ],
  "description": "A mysterious girl wearing a school uniform with giant wings spread from behind..."
}

Images 2-4: Pixel Art (14.1-19.4s)

Stats were more reasonable for pixel art images (HP 240-280, level 15-18), but the model consistently added flavor text to skills and used parenthetical annotations.

Findings

JSON structure was valid, but the model ignored the specified value ranges. Image 1 had HP 2800 and speed 3500, far exceeding the prompt’s HP 1-999 and speed 1-255 limits
Tendency to add flavor text (descriptions, English translations) to skills. Gets in the way when parsing
Creative touches like “Wing Bearer (Yokoshi)” with custom ruby annotations, but doesn’t feel like Dragon Quest
Strongly picked up school elements across all images (recognizing the uniform as “Academy Mage”, “enrolled at the academy”, etc.)
Descriptions ran long. Image 1 spanned 3 sentences
Slowest inference at 14-29 seconds
Comparable recognition accuracy to Gemma 3 12B, but inferior in format compliance
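
The extra parsing this output needs could look something like the sketch below. It clamps stats to a range and strips flavor text and parenthetical annotations from skill names. Only the HP and speed limits come from the findings above; the helper names and the assumption that descriptions follow a colon are mine.

```python
import re

# Only the HP and speed limits are mentioned in the findings above;
# ranges for other stats would need to be added.
STAT_RANGES = {"hp": (1, 999), "speed": (1, 255)}

def strip_flavor(skill):
    """Reduce 'Name: description' or 'Name (note)' to just 'Name'."""
    # Accept both ASCII and full-width punctuation, since output was Japanese.
    name = re.split(r"[:：]", skill, maxsplit=1)[0]
    name = re.sub(r"\s*[(（][^)）]*[)）]\s*$", "", name)
    return name.strip()

def normalize_stats(stats, ranges=STAT_RANGES):
    """Return a copy with stats clamped and skill flavor text removed."""
    cleaned = dict(stats)
    for key, (low, high) in ranges.items():
        if isinstance(cleaned.get(key), int):
            cleaned[key] = max(low, min(high, cleaned[key]))
    if isinstance(cleaned.get("special"), list):
        cleaned["special"] = [
            strip_flavor(s) for s in cleaned["special"] if isinstance(s, str)
        ]
    return cleaned
```

Clamping hides the fact that the model ignored the limits, so logging the original values alongside the cleaned ones may be worthwhile.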

Prompt Variation Test

I also tested all models with FF (Final Fantasy) style and Carddass-style prompts to compare how they handle different output formats.

Tested with the full illustration and 2-head pixel art.

FF-Style Highlights

Model	Job Names	Quality
Gemma 3 12B	“Magick Knight”, “White Mage”	Natural FF vocabulary, good limit breaks
Gemma 3 4B	“Starlight Knight”, “Guardian of the Academy”	More original RPG than FF
Qwen2.5-VL 7B	“Mage” (both images)	Job locked to “Mage”. No FF feel
Llama 3.2 Vision	“Til de Gint” (gibberish)	French-sounding hallucinations
Gemma 4 E4B	“Mystical Dancer”	Mixed skill descriptions into abilities

Carddass-Style Highlights

Model	Best Title	Quality
Gemma 3 12B	“After-School Witch”, “Melancholy Princess”	Excellent titles, print-ready format
Gemma 3 4B	“Maiden of Star Shadows”	Short, punchy flavor text
Qwen2.5-VL 7B	“Moon Fairy”	Extreme stat values (8500+)
Llama 3.2 Vision	“Vampirina”	Returned English, output both JSON and Markdown
Gemma 4 E4B	“Moonlit Wing-Bearer”	Rarity format inconsistent (“Star 5” vs “★★★★”)

Prompt changes don’t fix model-level tendencies: each model’s habits carried over to the FF and Carddass prompts. Gemma 3 12B was stable across all formats. Every model except Llama produced more stable JSON with the Carddass-style prompt, probably because it has fewer fields.

Comparison

Aspect	Gemma 3 4B	Qwen2.5-VL 7B	Gemma 3 12B	Llama 3.2 Vision	Gemma 4 E4B
JSON Stability	Stable	Stable	Stable	Unstable (Markdown leaks)	Valid but ignores ranges
Image Recognition	Wings, moon	Wings, ribbons	Wings, expression, mood	Wings (as “Vampire”)	Wings, uniform, school context
Output Creativity	Medium	Low (all “Warrior”)	High (“Fallen Angel Knight”, “Science Mage”)	Low (template, lazy)	High (“Wing Bearer”) but verbose
Pixel Art Support	Works	Works	Works, reads expressions	Recognizes but JSON breaks	Works
Japanese Support	Good	Good	Good	English leaks frequently	Good
Speed	5-8s	3-24s	8-12s	14-19s	14-29s
Model Size	3.3GB	6.0GB	8.1GB	7.8GB	9.6GB
Format Compliance	Good	Good	Good	Poor	Ranges and skill format break

Gemma 3 12B is the clear winner overall. It has the best balance of JSON stability, creativity, and format compliance.

Gemma 4 E4B matches it in recognition accuracy and creativity but falls short on structured data usability, ignoring value ranges and injecting descriptions into skill fields. Integrating it into a pipeline would require extra parsing.

Llama 3.2 Vision was unusable for this task due to poor Japanese instruction-following and unstable JSON output. Qwen2.5-VL can recognize images but lacks output variety. Gemma 3 4B punches above its weight, but 12B is a clear step up in both recognition and output quality.

Going forward, Gemma 3 12B is the VLM for this pipeline. At 8.1GB, it fits comfortably on an M1 Max with 64GB. Gemma 4 E4B could potentially be tamed with better prompting, but the speed disadvantage makes it a pass for now.