
Rewriting WAI-Anima Character LoRA Training Captions with Natural Language and Hairstyle Tags

Ikesan

In the previous WAI-Anima character LoRA run I concluded that side ponytail position couldn’t be controlled because of an architectural bias in Anima base + Qwen3 TE. The captions for the 53 training images had been reused as-is from the dataset cleaned up for WAI-IL, and at the end of that previous article I’d already noticed they didn’t follow Anima’s recommended caption format. So this round started by fixing the captions first.

Rework strategy

The next-step plan listed at the end of the previous article:

  • Reorder caption tags to match the Anima recommendation
  • Add natural-language sample descriptions that include directional information
  • Add rating and quality tags

The “directional info via natural language” part is the core. In test D from the previous article (pure natural language) it didn’t work at inference, but that was about inference — whether Qwen3 TE actually saw the directional concept during training is a separate question. The previous hypothesis was Qwen3 TE never had any material in the captions from which to extract direction information in the first place. This time we kill that hypothesis from the dataset side.

I summarized the current official guidance from the WAI-Anima v1 page and the Anima preview3-base page on Civitai.

Tag order:

[quality/rating/safety] [1girl/1boy] [character] [series] [artist] [general tags]

Order within each section is free.

The quality prefix is:

masterpiece, best quality, score_7, safe,

Use safe (not the SDXL-era general). Only the score_7 family of score tags keeps underscores; everything else is space-separated and lowercase.
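The formatting rule is mechanical enough to script. A minimal sketch (the helper name is mine, not from any official tooling) that lowercases and de-underscores everything except the score_N family:

```python
def normalize_tag(tag: str) -> str:
    """Lowercase a tag and replace underscores with spaces,
    except for the score_N family, which keeps its underscores."""
    tag = tag.strip().lower()
    if tag.startswith("score_"):
        return tag  # score tags are the only ones allowed underscores
    return tag.replace("_", " ")

print(normalize_tag("Side_Ponytail"))   # side ponytail
print(normalize_tag("score_7"))         # score_7
```

Running this over every comma-separated tag before writing the caption files catches stray Danbooru-style underscores automatically.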

Natural language is recommended but not required. Anima was trained on Danbooru tags, natural-language captions, and the combination of both, so since the training data includes natural-language versions, supplying them gives the model finer-grained conditioning to resolve against.

The official recommended learning rate for a rank-32 LoRA starts at 2e-5. Last time I let AnimaLoraToolkit's default of 1.0e-4 ride (about 5x higher) and the loss progression was fine, so it worked out, but 2e-5〜5e-5 is closer to the actual recommendation.

Decided not to include the year tag

Initially I planned to keep year 2024 in the captions following the previous article’s style, but on re-reading the official documentation:

  • year-family tags are optional (not required, not recommended)
  • Use case: tags like year 2024 or year 2010 to bias the era’s drawing style
  • The model is trained with random tag dropout so not all tags are required

Putting year 2024 into the training captions would tie kanachan to “art style of the 2024 era,” which risks degrading character reproduction at inference time when that tag isn’t included. If you want to keep character reproduction separable from era, you don’t put it on the training side.

→ Don’t include it in training captions. If style tweaking is wanted, just add it on the inference prompt side.

Hairstyle tag policy: abandoning the “absorbed into kanachan” approach

This is the biggest strategy shift this round.

Up until now in both IL and Anima training, I’d avoided putting hairstyle tags like side ponytail into the captions and let the single kanachan trigger absorb “hairstyle, hair color, eye color, all of it.” That worked well in IL — kanachan alone reliably reproduced the hairstyle (the side ponytail came out without explicitly specifying it).

Last round’s Anima training used the same strategy, but the side ponytail’s left/right position broke. Turning off flip_augment didn’t change anything, and additional probing revealed Anima base itself has a “side ponytail = viewer-left” bias (preview3-base + no LoRA shows the same position).

There are two stuck states:

  1. Don’t put hairstyle in captions → maintains kanachan absorption but breaks direction control
  2. Put hairstyle in captions → kanachan’s definition weakens and the hairstyle may become less consistent, but left side ponytail might recover direction control

This time I pick 2. Reasons:

  • The IL “absorbed into kanachan” approach was strong, but had a side effect at inference: it was hard to change the hairstyle. Prompting kanachan, twin tails to get twintails was pulled back to the side ponytail.
  • Promoting hairstyle to independent tags opens the door to variations like kanachan, semi-long, hair down.
  • For Anima, since strategy 1 doesn’t get direction working, strategy 2 is worth trying.
  • Even if hair scatters, specifying left side ponytail every time should restore the original look.

Accepting the trade-off, I switched to learning the hairstyle as an independent tag.

Caption template

I rewrote all 53 captions in this final structure:

masterpiece, best quality, score_7, [safe/sensitive/explicit],
1girl, solo, kanachan,
[2+ sentences of natural language: scene description + bound hair position in image],
[character attribute tags: brown eyes, left side ponytail, ahoge, brown hair, double parted bangs, medium hair, blue scrunchie, (medium breasts)],
[outfit/pose/composition tags],
white background, simple background,

Natural language policy:

  • 2+ sentences for every image
  • Explicitly note the bound hair’s position in the image (right/left side of the image)
  • Avoid danbooru-tag-style side ponytail phrasing on the natural language side; describe it like bound hair with a blue scrunchie instead
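To keep all 53 files consistent with this template, assembly can be scripted. A minimal sketch (the function and parameter names are mine, not from any toolkit) that emits the sections in the fixed order:

```python
def build_caption(rating: str, nl_sentences: list[str],
                  attr_tags: list[str], scene_tags: list[str]) -> str:
    """Assemble a caption in the fixed section order:
    quality/rating -> subject -> natural language -> attributes -> scene -> background."""
    parts = [
        f"masterpiece, best quality, score_7, {rating},",
        "1girl, solo, kanachan,",
        " ".join(nl_sentences),
        ", ".join(attr_tags) + ",",
        ", ".join(scene_tags) + ",",
        "white background, simple background",
    ]
    return "\n".join(parts)

caption = build_caption(
    rating="safe",
    nl_sentences=["A close-up portrait of a young girl with an angry expression."],
    attr_tags=["left side ponytail", "ahoge", "blue scrunchie"],
    scene_tags=["angry", "looking at viewer", "portrait"],
)
print(caption)
```

Generating the skeleton this way leaves only the per-image natural-language sentences and scene tags to write by hand.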

Example (angry):

masterpiece, best quality, score_7, safe,
1girl, solo, kanachan,
A close-up portrait of a young girl with an angry expression and a slight frown looking directly at the viewer.
Her bound hair with a blue scrunchie is visible on the right side of the image,
and a small antenna of hair rises from the top of her head.
brown eyes, left side ponytail, ahoge, brown hair, double parted bangs, medium hair, blue scrunchie,
angry, frown, looking at viewer, portrait, bare shoulders,
white background, simple background

Bound hair position description split by composition:

| Composition | Description |
|---|---|
| Front-facing standing / portrait | visible on the right side of the image |
| From behind | visible on the left side of the image |
| Left profile (facing viewer’s left) | extends from the back of her head |
| Right profile (facing viewer’s right) | visible behind her head |

All 53 training images have a side ponytail on the character’s left side, so the image-side left/right depends on the composition.
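Since the phrase depends only on the composition, the table reduces to a lookup. A sketch (the composition keys are my shorthand, not tags from the captions):

```python
# Character-left side ponytail: how the bound hair reads per composition
DIRECTION_PHRASE = {
    "front": "visible on the right side of the image",
    "back": "visible on the left side of the image",
    "left_profile": "extends from the back of her head",
    "right_profile": "visible behind her head",
}

def hair_phrase(composition: str) -> str:
    """Return the natural-language sentence describing the bound hair position."""
    return f"Her bound hair with a blue scrunchie is {DIRECTION_PHRASE[composition]}."

print(hair_phrase("front"))
```

Tagging each image with its composition once is enough to generate the directional sentence for all 53 files without left/right mistakes.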

brown eyes is excluded from the 6 images where the face isn’t visible (back / bikini / sportswear_back / fullbody_left_2 / left_back / right_back). medium breasts is added only to nude.txt where the chest is directly visible.

Workflow for rewriting 53 captions

Doing 53 by hand guarantees missed spots, so I followed this flow:

  1. Combine into a CSV with 5 columns: filename, image_path, old_caption, new_caption, notes
  2. Convert to XLSX with a Python script, embedding thumbnails in the first column
  3. Open in Numbers and review images and new captions side by side
  4. Apply corrections in bulk via script
  5. Write back to the final txt files
# build_xlsx.py (excerpt)
from pathlib import Path

from openpyxl import Workbook
from openpyxl.drawing.image import Image as XLImage
from PIL import Image as PILImage

# DIR, THUMB_DIR, and rows (dicts parsed from the CSV) are set up earlier in the script
wb = Workbook()
ws = wb.active

for i, row in enumerate(rows, start=2):  # row 1 is the header
    ws.cell(row=i, column=2, value=row["filename"])
    ws.cell(row=i, column=5, value=row["new_caption"])
    img_path = DIR / row["image_path"]
    thumb_path = THUMB_DIR / f"{img_path.stem}_thumb.png"
    # Generate a thumbnail so the XLSX stays a reasonable size
    with PILImage.open(img_path) as im:
        im.thumbnail((160, 320))
        im.save(thumb_path, "PNG")
    xl_img = XLImage(str(thumb_path))
    xl_img.anchor = f"A{i}"  # pin the thumbnail to column A of this row
    ws.add_image(xl_img)
    ws.row_dimensions[i].height = 130

Exporting the CSV without a UTF-8 BOM gets it garbled in Excel (misdetected as Shift_JIS), so going through XLSX is safer in the end. pip install openpyxl Pillow covers the dependencies.
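If you do want Excel to open the CSV directly, writing it with the utf-8-sig codec prepends the BOM that Excel uses to avoid the Shift_JIS misdetection. A quick sketch (the row data is a stand-in):

```python
import csv

rows = [{"filename": "kanachan_angry.png",
         "new_caption": "masterpiece, best quality, score_7, safe, ..."}]

# "utf-8-sig" writes the UTF-8 BOM (EF BB BF) at the start of the file
with open("captions.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "new_caption"])
    writer.writeheader()
    writer.writerows(rows)
```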

Found 2 bugs in the old captions

Mid-rewrite I found 2 cases where the old captions didn’t match the actual images.

| File | Old caption | Actual image |
|---|---|---|
| kanachan_bikini.txt | nude | Pink bikini, from behind |
| kanachan_nude.txt | bikini, pink bikini, from behind | Fully nude, front-facing |

The captions on these two files were completely swapped. The most recent two training runs were therefore feeding these wrong captions. The bikini concept may have learned “nudity-like rendering” while the nude concept may have learned “pink bikini.” A landmine that got missed in the shadow of the side ponytail direction problem.

The new captions match the actual images. The filenames (bikini.png / nude.png) stay; only the captions are corrected to match reality.
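For the record, the mechanical part of the fix is trivial once spotted. A sketch with stand-in files (in practice the correction happened as part of the full rewrite):

```python
from pathlib import Path

def swap_captions(a: Path, b: Path) -> None:
    """Swap the contents of two caption files, leaving the filenames untouched."""
    text_a, text_b = a.read_text(), b.read_text()
    a.write_text(text_b)
    b.write_text(text_a)

# demo with stand-in files mirroring the bug
bikini, nude = Path("kanachan_bikini.txt"), Path("kanachan_nude.txt")
bikini.write_text("nude")
nude.write_text("bikini, pink bikini, from behind")
swap_captions(bikini, nude)
```

The harder part is detection, which is exactly what the thumbnail-beside-caption XLSX review catches.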

Blazer color recognition drift

For the 5 files blazer_angry / pointing / run / stomach / left_side I had originally written brown blazer, but the actual images are distinctly red. Visually, reddish brown fits best (pure brown loses the red, red loses the brown), so I unified on reddish brown blazer as the middle ground.

Reflected the same wording on the natural-language side (brown school blazer → reddish brown school blazer). Color-name drift is hard to nail down in AI-only conversations, but a human eyeballing the files side by side catches it instantly.

Caption diff

Old (angry):

kanachan, 1girl, solo, angry, portrait, front view, white background

New (angry):

masterpiece, best quality, score_7, safe, 1girl, solo, kanachan,
A close-up portrait of a young girl with an angry expression and a slight frown looking directly at the viewer.
Her bound hair with a blue scrunchie is visible on the right side of the image, and a small antenna of hair rises from the top of her head.
brown eyes, left side ponytail, ahoge, brown hair, double parted bangs, medium hair, blue scrunchie,
angry, frown, looking at viewer, portrait, bare shoulders,
white background, simple background

Almost 10x the length, but Anima now receives information through both the natural-language route and the Danbooru-tag route. The captions are now in a state where Qwen3 TE has material to extract direction information from during training.

Side benefit: hairstyle changes might work

Abandoning “absorbed into kanachan” and promoting hairstyle to independent tags opens up some inference-time freedom.

With the IL-trained LoRA, asking for a different hairstyle like kanachan, twin tails got pulled back to side ponytail and broke. Hairstyle had been baked into the kanachan concept, so even with a different hairstyle specified, the priority of base model + original LoRA dragged it back to side ponytail.

In the new captions, left side ponytail is explicitly an independent tag every time, so at inference:

  • kanachan, semi-long hair, hair down → hair down
  • kanachan, twin tails → twintails
  • kanachan, ponytail → regular ponytail

variations like these might work. The flip side is the risk of kanachan alone not stabilizing on side ponytail. Adding left side ponytail every time should restore the original look, so the operational cost is acceptable.

Next verification points

The training-side prep is done. Remaining adjustments:

  • keep_tokens: with keep_tokens: 1, kanachan was pinned at the front before, but in the Anima recommendation [quality] comes first. The keep_tokens value needs to be reconsidered.
  • learning_rate: 1.0e-4 → drop to 2e-5〜5e-5 (closer to official recommendation)
  • flip_augment: false continues (didn’t matter last time but just in case)
  • repeats / epochs unchanged

With everything aligned, retrain and check whether side ponytail position becomes controllable via left side ponytail. The bikini/nude swapped-caption fix may also show up in the learning. That’s the main test.

v1 retrain results

YAML settings:

| Item | Value | Reason |
|---|---|---|
| shuffle_caption | false | Avoid breaking the natural-language block order |
| keep_tokens | 1 | Effectively meaningless with shuffle off |
| learning_rate | 5.0e-5 | Middle of Anima official recommendation (2e-5〜5e-5) |
| flip_augment | false | Continued |
| sample_every | 1 | Want to observe behavior every epoch |
| epochs | 12 | Unchanged |
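The same settings as a config fragment (the key names follow the table; the exact YAML layout may differ between AnimaLoraToolkit versions):

```yaml
# rework v1: deltas from the previous run
shuffle_caption: false   # keep the natural-language block order intact
keep_tokens: 1           # effectively a no-op with shuffling off
learning_rate: 5.0e-5    # middle of the official 2e-5 to 5e-5 range
flip_augment: false      # continued from last time
sample_every: 1          # sample every epoch to observe behavior
epochs: 12               # unchanged
```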

Training finished in about 40 minutes. The per-epoch sample prompt was kanachan, 1girl, solo, left side ponytail, standing, looking at viewer, white background (still danbooru-tag-leaning, no natural-language directional info).

[Per-epoch samples: v1 baseline, ep4, ep6, ep8, ep12]

The result: out of 12 epochs, only 1 epoch (ep6) had the side ponytail come out in the same direction as the training data (viewer-right = character-left). The other 11 epochs went viewer-left (character-right) like before. Even the apparently-correct ep6 is best read as a seed gacha that happened to land — direction control isn’t being held stably.

A couple of other things worth noting:

  • Hair color comes out orange-leaning, not the calmer brown of the training material
  • In the ep7 sample, the ahoge at the top of the head turns into something like animal ears (a strange black-outlined structure)

The likely cause for the hair color: by promoting brown hair to an independent tag, Anima base’s interpretation of “brown hair” (somewhat reddish brown to orange) gets pulled in and diverges from the training material’s actual color. In the previous IL training, kanachan absorbed the hair color too, so the trigger word was directly tied to the actual color of the training material. The trade-off of tag independence shows up here.

Caption re-fix: bring color information back into kanachan

Removed brown hair and brown eyes from the captions. A partial revert toward the previous strategy, but the structural hairstyle tags (side ponytail, ahoge, double parted bangs, medium hair) stay independent.

perl -pi -e 's/\bbrown hair, //g; s/\bbrown eyes, //g' *.txt

Policy:

| Category | Examples | Treatment |
|---|---|---|
| Character core | hair color, eye color | Absorb into kanachan (no need to vary at inference) |
| Variable accessories | outfit, expression, props, pose | Independent tags (vary at inference) |
| Structural tags | hair shape | Independent tags (leave room for twintails etc.) |

blue scrunchie also stayed independent, since the scrunchie might be swapped to a different color or to a ribbon.

The tag left side ponytail doesn’t exist on Danbooru

A major discovery here. Searching Danbooru for left_side_ponytail returns 0 hits. What actually exists:

  • side_ponytail (parent tag)
  • high_side_ponytail / low_side_ponytail (height sub-tags)
  • short_side_ponytail, side_drill (variants)

left side ponytail is a non-existent tag. Anima base hasn’t learned this combination, so it either parses it as separate left + side + ponytail, or it picks up just the side ponytail part and effectively ignores left.

That means the entire “control via direction tag” strategy was built on a false premise. Both the previous IL training and this v1 — from base’s perspective, left side ponytail was just side ponytail + noise. No directional information was ever conveyed.
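This failure mode is easy to guard against: check every caption tag against a known tag vocabulary (e.g. one exported from a Danbooru tag dump) before training. A sketch with a toy vocabulary:

```python
def unknown_tags(caption_tags: list[str], known: set[str]) -> list[str]:
    """Return the tags that don't exist in the known tag vocabulary."""
    return [t for t in caption_tags if t not in known]

# toy vocabulary; in practice this would come from a full Danbooru tag export
known = {"side ponytail", "high side ponytail", "low side ponytail", "ahoge"}

print(unknown_tags(["left side ponytail", "ahoge"], known))  # ['left side ponytail']
```

Run over the original captions, a check like this would have flagged left side ponytail before either training run.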

Fix:

  • Tag side: left side ponytail → side ponytail (use only existing tags)
  • Direction info: fully delegate to the natural-language side (“visible on the right side of the image”)

v2 training: color tags removed + side ponytail + natural-language direction

Re-fixed all 53 captions:

  • Removed brown hair, brown eyes
  • left side ponytail → side ponytail
  • Kept the natural-language direction description

Switched the sample prompt from tag-leaning to natural-language + side ponytail:

masterpiece, best quality, safe, 1girl, solo, kanachan,
Her bound hair with a blue scrunchie is visible on the right side of the image.
side ponytail, ahoge, standing, looking at viewer, white background, simple background

YAML uses output_dir: /workspace/output/rework-v2, output_name: kanachan-waianima-rework-v2 to keep it separate from the previous run. Everything else matches v1.

Observations

[Per-epoch samples: v2 baseline, ep4, ep8, ep10, ep12]
  • baseline (no LoRA, with natural-language direction): side ponytail came out viewer-right (matching the training material). Anima base + natural-language direction works — confirmed. But as a side effect, the subject came out tied up with rope (covered later)
  • epoch 1〜 (LoRA on): direction isn’t pinned in one direction like in v1; left/right swap by epoch. No consistency, but it has escaped the “viewer-left locked” state
  • Hair color clearly improved: the orange-leaning v1 has shifted to the brown of the training material. Removing brown hair and letting kanachan absorb the color was correct

The pattern of a correct baseline but wobbly direction with the LoRA on suggests the LoRA absorbs visual features but doesn’t strongly bind them to the natural-language direction tokens. With only 53 samples this might be the capacity limit of a rank-32 LoRA, or possibly a directional-token resolution issue with the Qwen3 0.6B TE.

bound hair interpreted as “tied-up subject” by base model

In the v2 baseline sample, the girl came out tied with rope. Anima base likely interpreted the prompt’s Her bound hair not as “bound = tied” + “hair” but as “tied-up (state) subject + hair.”

The rope disappears in epochs with LoRA on (the training material has zero rope, so LoRA cancels it), but it can resurface in the baseline state or when LoRA strength is dropped at inference.

Also bound hair is redundant. ponytail already means “tied hair,” so I’ll replace bound hair with side ponytail to simplify.

perl -pi -e 's/\bHer bound hair\b/Her side ponytail/g; s/\bbound hair\b/side ponytail/g' *.txt

Side benefit: the natural-language side now also contains the Danbooru-tag word side ponytail, strengthening correspondence with the tag list.

v3 training: removing the bound expression

Removed all bound hair references from the v2 captions. Same for the sample prompt:

masterpiece, best quality, safe, 1girl, solo, kanachan,
Her side ponytail with a blue scrunchie is visible on the right side of the image.
side ponytail, ahoge, standing, looking at viewer, white background, simple background

output_dir: /workspace/output/rework-v3, output_name: kanachan-waianima-rework-v3 to keep separate from v2. save_every: 1 so every epoch’s LoRA is preserved (v1/v2 saved every 4 epochs which made comparison harder).

v3 results

[Per-epoch samples: v3 baseline, ep4, ep7, ep8, ep9, ep12]

The rope binding is completely gone, baseline included, across every epoch. Removing the bound expression had a clear effect. Hair color stayed at the training-material brown that v2 already achieved — it doesn’t drift. The animal-ear-like artifact at the top of the head from v1 ep7 (the strange black-outlined structure) doesn’t appear in v3 either.

Side ponytail direction still varies between epochs, though. Same as v2 — not pinned in one direction, but not stable either.

Bust-up verification on local ComfyUI

The trainer’s built-in sampler with a minimal prompt isn’t a strong test, so I ran ep7 / ep8 / ep9 through a local ComfyUI on M1 Max with a real prompt. Settings: 832×1024, er_sde + simple, 30 steps, CFG 4.0, LoRA strength model=1.0 / clip=0.8, seed 42 fixed.

Positive is the sample prompt above + white collared shirt, red necktie, upper body, looking at viewer, front view. Negative is the standard set rejecting twintails / nsfw / anatomy breaks.

[Bust-up samples at seed 42: v3 ep7, ep8, ep9]

Only ep8 came out with side ponytail on the viewer-right (matching the natural-language spec). ep7 and ep9 fell to the opposite direction (viewer-left). Looking at ep8 alone, character form / hair color / hairstyle / outfit are all reproduced consistently — the headline success criterion is met.

But it’s unclear whether this is a structural advantage of ep8 or a seed-42 gacha hit.

Additional verification on ep8: full body + motion + LoRA strength

Treating ep8 as the candidate, I checked direction reproducibility with standing pose, running pose, and a LoRA-strength sweep. Same ep8 LoRA, seed 42 fixed.

Standing pose (same prompt + same seed, varying only LoRA strength):

[Standing-pose samples: LoRA strength 0.5 / 0.7 / 1.0]
| LoRA strength | Direction |
|---|---|
| 0.5 | Viewer-right (matches NL) ✅ |
| 0.7 | Viewer-left ❌ |
| 1.0 | Viewer-left ❌ |

Running pose (added running, dynamic pose, motion blur, action shot):

[Running-pose sample: strength 1.0, seed 42]

Running + strength 1.0 — somehow the direction came out viewer-right (correct). With the same strength 1.0, the standing pose fell to viewer-left, but the running pose lands on the correct side. Switching the prompt from standing to running, dynamic pose flips direction without changing seed. So the variables — LoRA strength, seed, prompt — all interact, and you can’t isolate any one of them cleanly.

Hairstyle isn’t baked into kanachan

To delimit what the LoRA actually absorbs, I removed side ponytail and asked for hair down, semi-long hair (with side ponytail, ponytail, blue scrunchie, scrunchie added to negative).

[Hair-down test sample: ep8, seed 42]

Hair came down cleanly. Ponytail and scrunchie are gone. So:

  • ✅ Hairstyle isn’t baked into the kanachan trigger; it fires from the side ponytail tag + natural language
  • ✅ Character core (face, hair color, eye color, body) is absorbed into kanachan
  • ✅ Hairstyle independence (one of the rework goals) is functioning

The boundary of what the LoRA learned is now clear. kanachan = character’s face, color, basic form. side ponytail and similar tags = hairstyle structure. The former is strongly baked, the latter is switchable via tags.

Anima architecture-specific constraints

While digging in I learned this isn’t an AnimaLoraToolkit or rank-32 problem — it’s a known structural issue across the entire Anima architecture.

The Anima official repository discussion reports it as “LoRA causes strong style dilution / override on Anima”:

| LoRA weight | Result |
|---|---|
| 0.4–0.7 | Base knowledge (artist tags etc.) is heavily diluted |
| 0.8–1.0 | Base knowledge is essentially zeroed out |

The cause is Anima’s “CLIP-less” design. Instead of SDXL-era CLIP, it uses Qwen3 0.6B TE, and this powerful text encoding overpowers LoRA adaptation. The moment you apply a LoRA, the diverse visual knowledge base had (including directional rendering) starts to fade through “catastrophic forgetting.”

Official recommended mitigations:

  • Don’t train the LLM Adapter (the layer between TE and DiT). AnimaLoraToolkit’s defaults already exclude this.
  • Lower the learning rate to 1e-5 〜 2e-5 (mine was 5e-5, higher than recommended)
  • Push step count to 12,000+ (mine was 636 steps, about 5% of recommended)
  • Build in the Differential Output Preservation patch (in development)

This run misses the recommended conditions on learning rate and step count by a large margin. The likely picture: catastrophic forgetting prevented direction information from baking in adequately, the LoRA’s directional bias overrode the NL spec, but that override was inaccurate so it landed on viewer-left.

Conclusion

What was achieved:

  • bound hair → side ponytail: Just changing the natural-language phrasing eliminated the “bound = restrained” misinterpretation
  • Removed brown hair / brown eyes: Color information goes back into kanachan, eliminating the divergence with Anima base’s interpretation that tag independence had created
  • left side ponytail → side ponytail: Got rid of a non-existent Danbooru tag
  • Hairstyle independence: Confirmed that kanachan + side ponytail etc. allows hairstyle changes (validated with hair down)

What wasn’t achieved:

  • Side ponytail direction control: All 53 training images are in the same direction, but at LoRA strength 1.0 inference, gacha across epoch / seed / prompt remains
  • The realistic operational fallback is strength 0.5, leaning on base + NL for direction (though LoRA features get diluted)
  • The proper fix is dropping learning rate to 2e-5 and retraining at 12,000+ steps. About 5 hours and an extra $4

The main lesson: per-character LoRAs for Anima cost a lot more to train than SDXL-family ones. The kanachan trigger can land character likeness on its own, but baking in finer control like direction requires a big budget jump from the SDXL-era reflex (a few hundred steps, rank 32).

For now I’ll operate with ep8 + strength 0.5, and wait for longer training runs and the Differential Output Preservation release.