
Can Local Vision LLMs Extract RPG Stats from Character Art?

Ikesan

I got the urge to build a pipeline that converts photos into pixel art and composites them into RPG battle scenes. The end goal is a camera-to-print workflow: snap a photo, pixelate it, generate stats, composite a battle scene, and print. But first, I need to nail down the intermediate steps.

Converting to pixel art is already covered by Qwen Image Edit. The open question is the “look at an image and generate RPG stats” step, which is a job for a Vision LLM (VLM). I ran tests with several local models to find the most practical option.

Goal

Feed character images to local Vision LLMs and get back RPG parameters in JSON format like this:

{
  "name": "Character Name",
  "class": "Class",
  "level": 23,
  "hp": 180,
  "mp": 45,
  "attack": 72,
  "defense": 58,
  "speed": 41,
  "special": ["Skill 1", "Skill 2"],
  "description": "Short description"
}

The model should infer stats from the image’s visual characteristics (outfit, expression, pose, vibe). There’s no right answer, so I’m comparing output quality, creativity, JSON stability, and speed.
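As a rough way to score "JSON stability", each raw model output can be checked against the target shape above. This is a minimal sketch rather than the script used for these tests; `EXPECTED_FIELDS` and `validate_stats` are hypothetical names introduced here for illustration.

```python
import json

# Expected fields and types, taken from the target JSON shown above.
EXPECTED_FIELDS = {
    "name": str,
    "class": str,
    "level": int,
    "hp": int,
    "mp": int,
    "attack": int,
    "defense": int,
    "speed": int,
    "special": list,
    "description": str,
}


def validate_stats(raw_text):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    if isinstance(data.get("special"), list):
        if not all(isinstance(skill, str) for skill in data["special"]):
            problems.append("special must contain only strings")
    return problems
```

An empty list means the output parsed and had every field with the expected type. It says nothing about creativity, which still needs a human read.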

Test Environment

  • M1 Max 64GB
  • Ollama 0.17.7 (upgraded to 0.20.6 for the Gemma 4 test)
  • macOS

Test Images

Four images of the same character in different art styles and proportions.

| Image   | Style                | Description                                   |
|---------|----------------------|-----------------------------------------------|
| Image 1 | Full Illustration    | School uniform + demon wings, moon background |
| Image 2 | Pixel Art (Taller)   | Transparent background                        |
| Image 3 | Pixel Art (Shorter)  | Transparent background                        |
| Image 4 | Pixel Art (2-Head)   | Beige background                              |

The interesting question is how VLMs interpret different art styles, and whether they can generate stats from pixel art at all.

Shared Prompt

All models received the same prompt. Images were sent as base64 via Ollama’s API with stream: false.

Look at this character image and generate Dragon Quest-style RPG stats in JSON format.
Infer plausible parameters from the image's appearance.
Output JSON only, no other text.
{"name": "Character Name", "class": "Class", "level": number, "hp": number,
 "mp": number, "attack": number, "defense": number, "speed": number,
 "special": ["Skill 1", "Skill 2", "Skill 3"],
 "description": "Short description"}

(The actual prompt was in Japanese for consistency.)

API calls were made with a simple Python script:

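import base64
import json
import urllib.request

# English rendering of the shared prompt above (the actual runs used Japanese).
PROMPT = """Look at this character image and generate Dragon Quest-style RPG stats in JSON format.
Infer plausible parameters from the image's appearance.
Output JSON only, no other text.
{"name": "Character Name", "class": "Class", "level": number, "hp": number,
 "mp": number, "attack": number, "defense": number, "speed": number,
 "special": ["Skill 1", "Skill 2", "Skill 3"],
 "description": "Short description"}"""


def generate_rpg_params(model, image_path):
    # Ollama expects images as base64-encoded strings.
    with open(image_path, 'rb') as f:
        img = base64.b64encode(f.read()).decode()
    payload = json.dumps({
        'model': model,
        'prompt': PROMPT,
        'images': [img],
        'stream': False  # return one complete JSON response instead of chunks
    }).encode()
    req = urllib.request.Request(
        'http://localhost:11434/api/generate',
        data=payload,
        headers={'Content-Type': 'application/json'}
    )
    # Returns the full API envelope; the model's text is in its 'response' field.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())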

Results

Gemma 3 4B

Model: gemma3:4b (3.3GB)

Image 1: Full Illustration (8.2s)

{
  "name": "Luna",
  "class": "Fairy Warrior",
  "level": 5,
  "hp": 120, "mp": 150, "attack": 30, "defense": 25, "speed": 60,
  "special": ["Starlight Magic", "Moon Shadow Dance", "Fairy's Whisper"],
  "description": "A fairy dancing through the night sky. Outmaneuvers enemies with swift moves and magic."
}

Image 2: Pixel Art, Taller (5.5s)

{
  "name": "Yui",
  "class": "School Girl",
  "level": 5,
  "hp": 80, "mp": 120, "attack": 15, "defense": 10, "speed": 20,
  "special": ["Bubble", "Prank", "Heal"],
  "description": "A cheerful girl on her way home from school. Kind-hearted and loyal to her friends."
}

Image 3: Pixel Art, Shorter (5.2s)

{
  "name": "Alice",
  "class": "PreCure",
  "level": 5,
  "hp": 120, "mp": 80, "attack": 45, "defense": 30, "speed": 60,
  "special": ["Whistle", "Fresh", "Star"],
  "description": "A spirited girl who'll face any challenge to protect her friends."
}

Image 4: Pixel Art, 2-Head (5.3s)

{
  "name": "Luna",
  "class": "Mage",
  "level": 5,
  "hp": 80, "mp": 150, "attack": 30, "defense": 20, "speed": 60,
  "special": ["Merafire", "Meraice", "Merahooo"],
  "description": "A cool, slightly mature mage. Calm normally, but unleashes powerful magic in battle."
}

Findings

  • JSON format was stable across all 4 images
  • Level locked at 5 for every image. Little variation in stats
  • Recognized the wings and moon, generating “Fairy Warrior” and “Moon Shadow Dance” appropriately
  • “PreCure” as a class doesn’t fit Dragon Quest at all
  • Invented non-existent spells like “Merahooo” (it seems aware of Dragon Quest spell naming but can’t reproduce it accurately)
  • Pixel art recognition worked fine
  • Inference speed was fast at 5-8 seconds
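The "level locked at 5" observation can be checked mechanically by looking for numeric fields that never change across images. The sketch below uses the four Gemma 3 4B stat blocks shown above; `locked_fields` is a hypothetical helper, not part of the original test script.

```python
NUMERIC_FIELDS = ["level", "hp", "mp", "attack", "defense", "speed"]


def locked_fields(outputs):
    """Return the numeric fields whose value is identical in every output."""
    return [
        field
        for field in NUMERIC_FIELDS
        if len({out[field] for out in outputs}) == 1
    ]


# Numeric stats copied from the four Gemma 3 4B results above.
gemma3_4b = [
    {"level": 5, "hp": 120, "mp": 150, "attack": 30, "defense": 25, "speed": 60},
    {"level": 5, "hp": 80, "mp": 120, "attack": 15, "defense": 10, "speed": 20},
    {"level": 5, "hp": 120, "mp": 80, "attack": 45, "defense": 30, "speed": 60},
    {"level": 5, "hp": 80, "mp": 150, "attack": 30, "defense": 20, "speed": 60},
]

print(locked_fields(gemma3_4b))  # ['level']
```

The same check applied to other models' outputs gives a quick signal of template-like behavior before reading the text fields.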

Qwen2.5-VL 7B

Model: qwen2.5vl:7b (6.0GB)

Image 1: Full Illustration (24.1s)

{
  "name": "Louise",
  "class": "Warrior",
  "level": 10,
  "hp": 150, "mp": 50, "attack": 80, "defense": 70, "speed": 60,
  "special": ["Sword Dance", "Magic Armor", "Healing Potion"],
  "description": "A girl with blue wings. Standing quietly under the night sky, a mysterious warrior."
}

Images 2-4: Pixel Art (3.7-4.2s)

All three pixel art images returned nearly identical stats: class “Warrior”, HP 150, MP 50, Attack 80, Defense 70, Speed 60, with the same skills (“Sword Dance”, “Shield”, “High Speed Move”). The only variation was in names and descriptions.

Findings

  • JSON format was stable across all images
  • Class locked to “Warrior” for every image. Visual features (wings, magical vibe) weren’t reflected in the class
  • Images 3 and 4 returned completely identical stat blocks. Heavy template feel
  • Skills were nearly identical across images. Little per-image differentiation
  • Did recognize the wings (description mentions “girl with blue wings”) but didn’t reflect it in stats
  • Pixel art recognition worked fine, even picking up small details like ribbons
  • Full illustration took 24 seconds, while pixel art was 3-4 seconds
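The article does not say how the per-image times were measured or why the first image was so much slower. One way to investigate, sketched here as an assumption rather than the method actually used, is to read the duration fields that Ollama includes in a non-streaming `/api/generate` response. They are reported in nanoseconds; `timing_breakdown` is a hypothetical helper name.

```python
NS_PER_SECOND = 1_000_000_000

# Duration fields reported by Ollama's /api/generate, all in nanoseconds.
DURATION_FIELDS = [
    "total_duration",
    "load_duration",
    "prompt_eval_duration",
    "eval_duration",
]


def timing_breakdown(result):
    """Convert the duration fields of an Ollama response dict to seconds."""
    return {
        name: round(result.get(name, 0) / NS_PER_SECOND, 2)
        for name in DURATION_FIELDS
    }
```

Comparing `load_duration` and `prompt_eval_duration` between the first and later images would show whether the gap comes from loading the model, processing the image, or generating the text.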

Gemma 3 12B

Model: gemma3:12b (8.1GB)

Image 1: Full Illustration (12.2s)

{
  "name": "Ciel Valkyria",
  "class": "Fallen Angel Knight",
  "level": 27,
  "hp": 185, "mp": 112, "attack": 78, "defense": 55, "speed": 93,
  "special": ["Fallen Light Ray", "Sanctuary Revoke", "Charm Song"],
  "description": "A composed fallen angel. Beguiles with beauty and power, sometimes leading others to ruin."
}

Image 2: Pixel Art, Taller (9.2s)

{
  "name": "Kasuga An",
  "class": "School Idol",
  "level": 7,
  "hp": 68, "mp": 42, "attack": 25, "defense": 18, "speed": 35,
  "special": ["Cheer Song", "Flash Dance", "Cute Attack"],
  "description": "The light that illuminates the stage. A girl with idol sparkle and hidden fighting spirit."
}

Image 3: Pixel Art, Shorter (8.2s)

{
  "name": "Kanon",
  "class": "Mage",
  "level": 7,
  "hp": 65, "mp": 120, "attack": 30, "defense": 25, "speed": 70,
  "special": ["Fire", "Ice", "Lightning"],
  "description": "A bright, cheerful mage. Surprisingly strong sense of justice."
}

Image 4: Pixel Art, 2-Head (9.3s)

{
  "name": "Melancholy Hanako",
  "class": "Science Mage",
  "level": 5,
  "hp": 85, "mp": 120, "attack": 30, "defense": 25, "speed": 60,
  "special": ["Frown Breath", "Study-Reading Mana", "Comeback Judgment"],
  "description": "A science-type mage who always looks disgruntled. Wields powerful magic with genius intellect."
}

Findings

  • JSON format was stable across all images
  • By far the most creative output among all models tested. Excellent at reflecting image features in classes and skills
  • Derived “Fallen Angel Knight” from the wings + uniform + moon combination, with thematically consistent skills like “Fallen Light Ray” and “Sanctuary Revoke”
  • Read the slightly grumpy expression on the 2-head pixel art and generated “Melancholy Hanako”, “Science Mage”, and “Frown Breath”. Impressive expression recognition
  • Level varied by image mood (27 for the full illustration, 5-7 for pixel art). Avoids monotony
  • Stat distribution showed personality. Fallen Angel Knight had speed at 93, mages had MP at 120
  • Inference speed was 8-12 seconds. Slower than 4B but perfectly usable

Llama 3.2 Vision 11B

Model: llama3.2-vision:11b (7.8GB)

Image 1: Full Illustration (19.2s)

{"name": "Hana", "class": "Vampire", "level": 10, "hp": 100, "mp": 50,
 "attack": 20, "defense": 15, "speed": 30,
 "special": ["Blood Drain", "Shadow Bite", "Darkness"],
 "description": "A young vampire with a mysterious past..."}

Image 2: Pixel Art, Taller (17.5s)

Output was Markdown instead of JSON. Skills were placeholder text (“magic1”, “magic2”, “magic3”).

Image 3: Pixel Art, Shorter (15.8s)

{"name": "Eliza", "class": "Magic Girl", "level": 10, "hp": 100, "mp": 80,
 "attack": 15, "defense": 10, "speed": 12,
 "special": ["Fire Flame", "Water Wave", "Wind Whirl", "Electric Light", "Earth Wall"],
 "description": "Genius girl, Eliza wields magic to save the world and conquer all"}

Image 4: Pixel Art, 2-Head (14.2s)

{"name": "Saya", "class": "Witch", "level": 10, "hp": 100, "mp": 50,
 "attack": 20, "defense": 15, "speed": 30,
 "special": ["Fireball", "Ice Wall", "Electric Whip"],
 "description": "A witch being a witch, basically"}

Findings

  • Responded in English despite receiving a Japanese prompt (Image 1). Poor Japanese instruction-following
  • Unstable JSON format. Image 2 returned Markdown. Image 3 mixed Markdown headings with JSON
  • Level 10 and HP 100 fixed across images. Template-like parameters
  • Descriptions were lazy (“A witch being a witch, basically”)
  • In the FF-style prompt variation (covered under Prompt Variation Test), it produced French-sounding hallucinations (“Til de Gint”, “Mars ex Deus”)
  • In the Carddass-style prompt variation, it answered in English (“Vampirina”, “The Moonlight”) and output both JSON and Markdown in the same response
  • Not suitable for Japanese content generation tasks
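When a model wraps its JSON in Markdown or extra prose, a pipeline can still try to recover the object before giving up. The following is a minimal sketch using the standard library's `json.JSONDecoder.raw_decode`; `extract_json` is a hypothetical helper, and it cannot help when the output contains no JSON at all, as with the Markdown-only response for Image 2.

```python
import json


def extract_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

A `None` result can be counted as a format failure, which keeps the stability comparison between models consistent.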

Gemma 4 E4B

Model: gemma4:e4b (9.6GB)

Ollama 0.17.7 returned a 412 error when pulling this model. Version 0.19+ is required; I upgraded to 0.20.6.

Image 1: Full Illustration (28.9s)

{
  "name": "Lily",
  "class": "Wing Bearer (Yokoshi)",
  "level": 25,
  "hp": 2800, "mp": 6500, "attack": 1200, "defense": 1500, "speed": 3500,
  "special": [
    "Heavenly Healing Dance: A light spell that greatly restores all allies' HP.",
    "Azure Wing Strike: A wide-range attack releasing powerful light waves from giant wings.",
    "Feather Sound Rush: A support spell that greatly recovers own Speed and MP while binding enemies."
  ],
  "description": "A mysterious girl wearing a school uniform with giant wings spread from behind..."
}

Images 2-4: Pixel Art (14.1-19.4s)

Stats were more reasonable for pixel art images (HP 240-280, level 15-18), but the model consistently added flavor text to skills and used parenthetical annotations.

Findings

  • JSON structure was valid, but ignored specified value ranges. Image 1 had HP 2800 and speed 3500, far exceeding the prompt’s HP 1-999 and speed 1-255 limits
  • Tendency to add flavor text (descriptions, English translations) to skills. Gets in the way when parsing
  • Creative touches like “Wing Bearer (Yokoshi)” with custom ruby annotations, but doesn’t feel like Dragon Quest
  • Strongly picked up school elements across all images (recognizing the uniform as “Academy Mage”, “enrolled at the academy”, etc.)
  • Descriptions ran long. Image 1 spanned 3 sentences
  • Slowest inference at 14-29 seconds
  • Comparable recognition accuracy to Gemma 3 12B, but inferior in format compliance
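The "extra parsing" this kind of output needs might look like the sketch below: flag values outside the stated ranges, and strip flavor text and parenthetical annotations from names. Only the HP and speed ranges are mentioned in this article, so only those are checked. All function names are hypothetical, and the separators are assumed to be ASCII; Japanese output may use full-width colons and parentheses, which would need to be added.

```python
import re

# Only these two ranges are stated in the article; other fields are unchecked.
STAT_RANGES = {"hp": (1, 999), "speed": (1, 255)}


def out_of_range(stats):
    """Return the fields whose value falls outside its stated range."""
    return [
        field
        for field, (low, high) in STAT_RANGES.items()
        if field in stats and not low <= stats[field] <= high
    ]


def clean_label(label):
    """Keep the bare name: drop text after a colon and any parenthetical."""
    name = label.split(":", 1)[0]
    name = re.sub(r"\s*\([^)]*\)", "", name)
    return name.strip()


def clean_stats(stats):
    """Return a copy with class and skill names stripped of annotations."""
    cleaned = dict(stats)
    cleaned["class"] = clean_label(stats["class"])
    cleaned["special"] = [clean_label(skill) for skill in stats["special"]]
    return cleaned


# Values taken from the Gemma 4 E4B result for Image 1 above.
sample = {
    "class": "Wing Bearer (Yokoshi)",
    "hp": 2800,
    "speed": 3500,
    "special": ["Azure Wing Strike: A wide-range attack releasing powerful light waves from giant wings."],
}
print(out_of_range(sample))            # ['hp', 'speed']
print(clean_stats(sample)["class"])    # Wing Bearer
print(clean_stats(sample)["special"])  # ['Azure Wing Strike']
```

Whether to clamp, rescale, or regenerate an out-of-range result is a design choice this sketch leaves open.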

Prompt Variation Test

I also ran every model with Final Fantasy (FF) style and Carddass-style (Bandai trading card) prompts to compare how they handle different output formats, using the full illustration and the 2-head pixel art.

FF-Style Highlights

| Model            | Job Names                                       | Quality                                 |
|------------------|-------------------------------------------------|-----------------------------------------|
| Gemma 3 12B      | “Magick Knight”, “White Mage”                   | Natural FF vocabulary, good limit breaks |
| Gemma 3 4B       | “Starlight Knight”, “Guardian of the Academy”   | More original RPG than FF               |
| Qwen2.5-VL 7B    | “Mage” (both images)                            | Job locked to “Mage”. No FF feel        |
| Llama 3.2 Vision | “Til de Gint” (gibberish)                       | French-sounding hallucinations          |
| Gemma 4 E4B      | “Mystical Dancer”                               | Mixed skill descriptions into abilities |

Carddass-Style Highlights

| Model            | Best Title                                   | Quality                                        |
|------------------|----------------------------------------------|------------------------------------------------|
| Gemma 3 12B      | “After-School Witch”, “Melancholy Princess”  | Excellent titles, print-ready format           |
| Gemma 3 4B       | “Maiden of Star Shadows”                     | Short, punchy flavor text                      |
| Qwen2.5-VL 7B    | “Moon Fairy”                                 | Extreme stat values (8500+)                    |
| Llama 3.2 Vision | “Vampirina”                                  | Returned English, output both JSON and Markdown |
| Gemma 4 E4B      | “Moonlit Wing-Bearer”                        | Rarity format inconsistent (“Star 5” vs “★★★★”) |

Prompt changes don’t fix model-level tendencies. Gemma 3 12B was stable across all formats. The Carddass-style prompt produced more stable JSON from every model except Llama, likely because it asks for fewer fields.
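Format drift like the rarity inconsistency noted for Gemma 4 E4B (“Star 5” vs “★★★★”) can be absorbed with a small normalizer. This is a sketch that only covers the two shapes quoted in the table; the Carddass prompt's actual rarity field is not shown in this article, and `rarity_to_int` is a hypothetical helper.

```python
import re


def rarity_to_int(value):
    """Map a rarity string such as "Star 5" or "★★★★" to a star count."""
    stars = value.count("★")
    if stars:
        return stars
    # Fall back to the first number in the string, if any.
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


print(rarity_to_int("Star 5"))  # 5
print(rarity_to_int("★★★★"))    # 4
```

Anything that matches neither shape returns `None`, so unexpected formats surface instead of being silently guessed.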


Comparison

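| Aspect            | Gemma 3 4B  | Qwen2.5-VL 7B       | Gemma 3 12B                                   | Llama 3.2 Vision            | Gemma 4 E4B                      |
|-------------------|-------------|---------------------|-----------------------------------------------|-----------------------------|----------------------------------|
| JSON Stability    | Stable      | Stable              | Stable                                        | Unstable (Markdown leaks)   | Valid but ignores ranges         |
| Image Recognition | Wings, moon | Wings, ribbons      | Wings, expression, mood                       | Wings (as “Vampire”)        | Wings, uniform, school context   |
| Output Creativity | Medium      | Low (all “Warrior”) | High (“Fallen Angel Knight”, “Science Mage”)  | Low (template, lazy)        | High (“Wing Bearer”) but verbose |
| Pixel Art Support | Works       | Works               | Works, reads expressions                      | Recognizes but JSON breaks  | Works                            |
| Japanese Support  | Good        | Good                | Good                                          | English leaks frequently    | Good                             |
| Speed             | 5-8s        | 3-24s               | 8-12s                                         | 14-19s                      | 14-29s                           |
| Model Size        | 3.3GB       | 6.0GB               | 8.1GB                                         | 7.8GB                       | 9.6GB                            |
| Format Compliance | Good        | Good                | Good                                          | Poor                        | Ranges and skill format break    |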

Gemma 3 12B is the clear winner overall. It has the best balance of JSON stability, creativity, and format compliance.

Gemma 4 E4B matches it in recognition accuracy and creativity but falls short on structured data usability, ignoring value ranges and injecting descriptions into skill fields. Integrating it into a pipeline would require extra parsing.

Llama 3.2 Vision was unusable for this task due to poor Japanese instruction-following and unstable JSON output. Qwen2.5-VL can recognize images but lacks output variety. Gemma 3 4B punches above its weight for its size, but 12B is a clear step up in both recognition and output quality.

Going forward, Gemma 3 12B is the VLM for this pipeline. At 8.1GB, it fits comfortably on an M1 Max with 64GB. Gemma 4 E4B could potentially be tamed with better prompting, but the speed disadvantage makes it a pass for now.