
Pushing WAI-Anima Character LoRA Training to the Official 12,000-Step Recommendation Made Direction Control Worse — Half That at ep150 Hit 100%

Ikesan

In the previous caption rewrite article I concluded that side ponytail direction control is constrained at the Anima architecture level (CLIP-less design plus a Qwen3 0.6B TE whose catastrophic forgetting overpowers LoRA adaptation), not by the LoRA or by rank 32. The official Anima discussion recommends, as a countermeasure, dropping the learning rate to 1e-5 to 2e-5 and extending the step count to 12,000+. This run actually tries that.

The v4 question

Without touching any of the caption design pinned down through v3 (drop bound hair / drop brown hair and brown eyes / use only the real Danbooru tag side ponytail / put direction in natural language), I bumped only the training amount to Anima’s official recommendation level, to isolate whether direction control stabilizes.

Two hypotheses, both useful as conclusions whichever lands.

Hypothesis                                                | Resulting take
More training burns the NL direction signal into the LoRA | “If you make a character LoRA on Anima, pay the cost”
More training doesn’t help                                | “Anima architectural constraint. Wait for Differential Output Preservation”

To allow a clean comparison with v3, only learning_rate and epochs changed. Everything else identical.

YAML diff (v3 → v4)

- # Kanachan LoRA on WAI-Anima v1 - Caption rework v3 (bound -> side ponytail in NL)
+ # Kanachan LoRA on WAI-Anima v1 - Caption rework v4 (long training: 12k+ steps at 2e-5)
+ # v3 captions kept as-is. Only learning_rate and epochs changed.

- epochs: 12
+ epochs: 227

- learning_rate: 5.0e-5
+ learning_rate: 2.0e-5

- output_dir: "/workspace/output/rework-v3"
- output_name: "kanachan-waianima-rework-v3"
+ output_dir: "/workspace/output/rework-v4"
+ output_name: "kanachan-waianima-rework-v4"

save_every: 1 / sample_every: 1 are carried over from v3; per-epoch checkpoints and samples are especially valuable on a long run. flip_augment: false also stays as in v3 (it was inert in earlier verification, but I don’t want to add variables).

Restarted once due to an epoch-count miscalculation

I first kicked it off at epochs: 190 because my runbook’s back-of-the-envelope math put 190 epochs at roughly 12,000 steps. But right after launch the AnimaLoraToolkit log printed:

2026-04-26 14:01:50,085 - INFO - 数据集大小: 212, 每 epoch 步数: 53, 总步数: 10070
(translation: dataset size: 212, steps per epoch: 53, total steps: 10070)

数据集大小: 212 (dataset size) is the preprocessed sample count: repeats: 4 × 53 images = 212. Divided by batch_size: 1 × grad_accum: 4 = effective batch 4, that gives 53 effective steps per epoch. 53 × 190 = 10,070 — just short of the official 12,000+ recommendation.

Bumped epochs to 227 (53 × 227 = 12,031, clearing 12,000). Killing and restarting cost 3-5 minutes and roughly $1-1.5 of extra spend, but catching it at this point produces clean numbers (“just clears official recommendation”) for the article.
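The epoch arithmetic that tripped me up can be sanity-checked in a few lines (a standalone sketch; the helper names are mine, not AnimaLoraToolkit's):

```python
def effective_steps_per_epoch(n_images, repeats, batch_size, grad_accum):
    """Optimizer steps per epoch = (images * repeats) / (batch_size * grad_accum)."""
    return (n_images * repeats) // (batch_size * grad_accum)

def epochs_for_target(target_steps, steps_per_epoch):
    """Smallest epoch count whose total step count clears target_steps."""
    return -(-target_steps // steps_per_epoch)  # ceiling division

spe = effective_steps_per_epoch(n_images=53, repeats=4, batch_size=1, grad_accum=4)
print(spe)                              # 53 steps per epoch
print(spe * 190)                        # 10070 -- the first launch, short of 12,000
print(epochs_for_target(12_000, spe))   # 227
print(spe * 227)                        # 12031 -- the restarted run
```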

Rebuilding the environment

The Network Volume from the previous RunPod run was still intact. The v3 training output, AnimaLoraToolkit, the model bundle (waiANIMA_v10 / VAE / TE), the dataset, even the .npz latent cache — all still alive.

But the pod container is fresh, so the Python environment starts from zero. After the previous gotcha where installing xformers pulled in cu130 and wiped torch, this time I skipped xformers and manually installed only the deps I needed:

pip install einops safetensors transformers diffusers accelerate peft \
  lycoris-lora omegaconf tqdm Pillow numpy lpips pytorch-fid pytorch-msssim \
  scipy scikit-image matplotlib pandas pyyaml psutil rich tiktoken sentencepiece \
  protobuf

pillow-jxlpy had no wheel so I skipped it (it’s for JPEG-XL, not needed here). I missed protobuf initially and the T5Tokenizer sentencepiece conversion blew up, so I added it.

torch sanity check:

python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
# → 2.4.1+cu124 True

OK, the cu124 environment in the Volume still works.

Launch with tmux

Holding an SSH session open for 11 hours is unrealistic, so I launched under tmux. RunPod’s base image didn’t include tmux, so apt install tmux came first.

cd /workspace/AnimaLoraToolkit
tmux new-session -d -s train \
  "python anima_train.py --config ./config/train_kanachan_rework_v4.yaml \
   2>&1 | tee /workspace/output/rework-v4/train.log"

tmux ls showing train: 1 windows (created ...) confirms launch. tqdm progress bars use \r, so they don’t reach the tee’d log, but AnimaLoraToolkit dumps loss / step / speed per step to monitor_data/state.json, which is more accurate to read anyway.

python3 -c 'import json; d=json.load(open("/workspace/AnimaLoraToolkit/monitor_data/state.json")); \
  print("step:", d["step"], "/", d["total_steps"], "speed:", d["speed"])'
# → step: 5 / 12031 speed: 0.31331414944110736

Training speed matches v1 at 0.31 it/s (~170 sec for 53 steps/epoch). GPU 100%, VRAM 10.8GB, temp 69°C, stable.
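The same state.json fields can feed a rough ETA for hands-off monitoring (a sketch; only the step / total_steps / speed keys shown in the snippet above are assumed):

```python
import json

def eta_hours(state):
    """Remaining wall-clock hours at the current it/s."""
    remaining = state["total_steps"] - state["step"]
    return remaining / state["speed"] / 3600

# The snapshot from right after launch:
state = json.loads('{"step": 5, "total_steps": 12031, "speed": 0.31331414944110736}')
print(f"ETA: {eta_hours(state):.1f} h")  # ETA: 10.7 h
```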

Cost breakdown

Item                | Rate     | Estimated time                      | Subtotal
RTX 6000 Ada        | $0.77/h  | ~12.5 hours (training + sample gen) | $9.63
Network Volume 50GB | $0.005/h | ~13 hours                           | $0.07
Failed v1 launch    | $0.77/h  | ~4 minutes                          | $0.05
Launch / prep       | $0.77/h  | ~20 minutes                         | $0.26
Total estimate      |          |                                     | ~$10

sample_every: 1 adds 227 sample generations (~30 sec each = ~1.8 hours), so combined with the pure 10.8 hours of training, 12.5 hours is the realistic estimate.
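The arithmetic behind that figure, as a quick check (rounded; the exact it/s nudges the total between ~12.5 and ~12.7 h):

```python
total_steps = 12_031
speed_it_s  = 0.31                          # it/s, roughly the measured 0.313
train_h  = total_steps / speed_it_s / 3600  # ~10.8 h of pure training
sample_h = 227 * 30 / 3600                  # ~1.9 h of per-epoch sample generation
print(round(train_h + sample_h, 1))         # ~12.7, i.e. the ~12.5 h ballpark
```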

Compared to the $1.22-and-change of v1-v3, the cost balloons in one go. But this is the floor for following Anima’s official recommendation. Set against SDXL, where 53 images × 12 epochs = 636 steps cost $1, an Anima-optimized character LoRA needs about 19× the training amount and roughly 8× the cost on the same source material.

Results

Per-epoch sample (fixed prompt: masterpiece, best quality, safe, 1girl, solo, kanachan, Her side ponytail with a blue scrunchie is visible on the right side of the image. side ponytail, ahoge, standing, looking at viewer, white background, simple background).

Endpoints and middle picked out (baseline = WAI-Anima v1 alone, no LoRA injected; ep1 = right after training started; ep227 = final epoch).

baseline
v4 baseline (no LoRA)
epoch 1
v4 epoch 1
epoch 227
v4 epoch 227

Rough direction hit count from training samples

The trainer’s built-in sampler emits one image per epoch with fixed seed 42 + fixed prompt, so it works for cross-epoch behavior comparison. Manually counting the 51 images from ep177-227, where training looks settled, the side ponytail renders on the viewer-right (matching the training material) in 25 of 51 ≈ 49%.

Roughly a coin flip. Direction wobbling between epochs means the LoRA hasn’t burned in direction information “as a concept”; instead, each epoch’s tiny weight update randomly flips the tug-of-war between base bias and the new signal.

For comparison, running 12,000 steps on the same material under IL (SDXL) would give 226 exposures per image, putting it deep in overfitting territory where the LoRA reproduces the source material near 1:1 (the “only the LoRA shows” state). Anima can’t even bake direction at that depth, which makes for decisive evidence of the catastrophic forgetting hypothesis where Qwen3 0.6B TE overpowers LoRA adaptation.

That said, the training samples are:

  • Single prompt (simplified: NL 1 sentence + few structural tags)
  • Single seed (seed: 42)

So in theory there’s still room for prompt format or seed changes to improve things. To close that gap I ran a bust-up matrix test with production prompts.

Local ComfyUI bust-up verification

Selected ep227 (final) and ep150 (a mid-range epoch including hits).

Selection rationale:

  • That shallow training fails to make hairstyle structural tags work was already shown through v3, so the lead candidates come from deeper epochs
  • Beyond a certain point the face stops moving (overfitting convergence), so picking among deep epochs is safer
  • ep227 is the final epoch; ep150 is a mid-point, neither too shallow nor too deep

Settings: 832×1024, er_sde + simple, 30 steps, CFG 4.0, LoRA strength model=1.0 / clip=0.8.

Three prompt formats for separation:

Format T (tags only)

masterpiece, best quality, score_7, safe, 1girl, solo, kanachan,
side ponytail, ahoge, double parted bangs, medium hair, blue scrunchie,
white collared shirt, red necktie, upper body, looking at viewer, front view,
white background, simple background

Format N (natural language only)

masterpiece, best quality, safe, 1girl, solo, kanachan,
A close-up portrait of a young girl looking at the viewer with a calm expression.
Her side ponytail with a blue scrunchie is visible on the right side of the image,
and a small antenna of hair rises from the top of her head.
She wears a white collared shirt with a red necktie.
white background, simple background

Format TN (both = matches training)

masterpiece, best quality, score_7, safe, 1girl, solo, kanachan,
A close-up portrait of a young girl looking at the viewer with a calm expression.
Her side ponytail with a blue scrunchie is visible on the right side of the image,
and a small antenna of hair rises from the top of her head.
side ponytail, ahoge, double parted bangs, medium hair, blue scrunchie,
white collared shirt, red necktie, upper body, looking at viewer, front view,
white background, simple background

Each format × seed 42 / 100 / 200 = 9 images, two LoRAs (ep227 and ep150) = 18 images total. Goal: separate whether the LoRA fires through tags / through NL / only when both align.
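The matrix itself is easy to enumerate programmatically (a hypothetical sketch; prompt text abbreviated to labels, the actual generations went through ComfyUI):

```python
from itertools import product

loras   = ["ep150", "ep227"]
formats = {"T": "tags only", "N": "natural language only", "TN": "tags + NL"}
seeds   = [42, 100, 200]

# One dict per generation: 2 LoRAs x 3 formats x 3 seeds
runs = [
    {"lora": lora, "format": fmt, "seed": seed}
    for lora, fmt, seed in product(loras, formats, seeds)
]
print(len(runs))  # 18
```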

ep227 (final, 12,031 steps)

Each column = seed 42 / 100 / 200. Green = ponytail rendered as the prompt asks (viewer-right). Red = miss, rendered on viewer-left.

Format T (tags only)

seed 42
ep227 T s42
seed 100
ep227 T s100
seed 200
ep227 T s200

Format N (NL only)

seed 42
ep227 N s42
seed 100
ep227 N s100
seed 200
ep227 N s200

Format TN (both)

seed 42
ep227 TN s42
seed 100
ep227 TN s100
seed 200
ep227 TN s200

ep150 (mid-range, 7,950 steps)

Format T (tags only)

seed 42
ep150 T s42
seed 100
ep150 T s100
seed 200
ep150 T s200

Format N (NL only)

seed 42
ep150 N s42
seed 100
ep150 N s100
seed 200
ep150 N s200

Format TN (both)

seed 42
ep150 TN s42
seed 100
ep150 TN s100
seed 200
ep150 TN s200

Direction hit-rate summary

LoRA                 | T (tags only) | N (NL only) | TN (both)  | Total
ep150 (7,950 steps)  | 2/3 (67%)     | 3/3 (100%)  | 3/3 (100%) | 8/9 (89%)
ep227 (12,031 steps) | 0/3 (0%)      | 2/3 (67%)   | 2/3 (67%)  | 4/9 (44%)

Observations

1. ep150 is dramatically better than ep227

The relationship between training amount and direction control is not monotonically increasing. ep150 (~8,000 steps) is the sweet spot, and ep227 (12,000 steps) is clearly in overfitting territory where the direction hit rate falls hard. Anima’s official “12,000+ steps” recommendation is excessive for this case.

ep227 T format 0/3 is symbolic: with no direction info from tags, every output collapses to base bias on viewer-left. The training material has all images with the side ponytail on the character’s left (viewer-right), yet the ep227 LoRA returns the opposite of training at full strength. This isn’t the LoRA having “burned in” direction information; it’s catastrophic forgetting that left only a weak signal opposite to base bias, and even that gets cancelled by seed jitter.

2. NL format is the decisive factor

For both ep150 and ep227, N and TN (formats containing NL) give a higher direction hit rate than T (tags only). This means the natural-language sentence in training, Her side ponytail with a blue scrunchie is visible on the right side of the image, was burned in as a direction signal. The tag side ponytail carries no left/right information (as confirmed in v3), so this is the effect of putting NL into the training caption showing up.

3. Character shape is sharper at ep150

ep227 outputs have slightly softer features and the kanachan-ness is fading. ep150 has the ahoge curl, the scrunchie color saturation, and eye details closer to the training material. In the overfitting region, even character core gets dragged back toward the base average — that may be what’s happening.

4. Even format T alone passes 67% at ep150

Even with just the side ponytail tag (no NL), ep150 gets 2/3. So even the tag column alone has accumulated some direction information into the LoRA. But at ep227 it drops to 0/3, so overfitting destroys this accumulation too.

Consistency with the 49% from training samples

The training-sample reading (49% over ep177-227) and the production test (44% at ep227) line up. Training samples use a fixed prompt (simplified, close to N format) with seed 42 only, so the right direct comparison is with the ep227 N format aggregate from the production tests. Under simplified NL prompts with seed jitter, ep227 sits stably at a bad ~50%.

Pinning down the sweet spot via epoch scan

The ep150 vs ep227 gap was unexpected, so I scanned to fill in the middle and outside. Using N format (where the hit rate gap is most visible), I ran ep100 / ep120 / ep180 / ep200 with seeds 42 / 100 / 200, three images each.

ep100 (5,300 steps) — still too shallow

seed 42
ep100 N s42
seed 100
ep100 N s100
seed 200
ep100 N s200

ep120 (6,360 steps) — beginning to lift off

seed 42
ep120 N s42
seed 100
ep120 N s100
seed 200
ep120 N s200

ep180 (9,540 steps) — plateau alongside ep150

seed 42
ep180 N s42
seed 100
ep180 N s100
seed 200
ep180 N s200

ep200 (10,600 steps) — collapse, with artifacts

seed 42
ep200 N s42
seed 100
ep200 N s100
seed 200
ep200 N s200

All three try to render a single side ponytail on the viewer-left, but the side ponytail itself is ghosted out, half-rendered and translucent. Not double-rendered, and not bidirectional drift either. The LoRA can no longer commit to the shape of the side ponytail and melts toward the base-bias direction.

N-format hit-rate curve

epoch | total steps | exposures per image | hit/3 | hit rate
100   | 5,300       | 400                 | 0/3   | 0%
120   | 6,360       | 480                 | 1/3   | 33%
150   | 7,950       | 600                 | 3/3   | 100%
180   | 9,540       | 720                 | 3/3   | 100%
200   | 10,600      | 800                 | 0/3   | 0% ❌
227   | 12,031      | 908                 | 2/3   | 67%

“Exposures per image” = how many times the LoRA saw a single training image. With 53 source images × repeats 4, that’s repeats 4 × epoch count.
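As a check, the table's step and exposure columns follow directly from this run's two constants:

```python
# For this dataset: exposures per image = repeats x epochs,
# total steps = 53 effective steps/epoch x epochs.
REPEATS, STEPS_PER_EPOCH = 4, 53

for epoch, hits in [(100, 0), (120, 1), (150, 3), (180, 3), (200, 0), (227, 2)]:
    exposures   = REPEATS * epoch
    total_steps = STEPS_PER_EPOCH * epoch
    print(f"ep{epoch}: {total_steps} steps, {exposures} exposures/image, {hits}/3 hits")
```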

T format (tags only) sweep over the same range

To isolate how much direction information the LoRA itself holds when the NL path is cut, I generated the same epochs in T format with seeds 42 / 100 / 200. Added ep215 (between ep200 and ep227, post-collapse state).

ep100 (T) — shallow, full miss

seed 42
ep100 T s42
seed 100
ep100 T s100
seed 200
ep100 T s200

ep120 (T) — still full miss

seed 42
ep120 T s42
seed 100
ep120 T s100
seed 200
ep120 T s200

ep180 (T) — plateau alongside ep150 T

seed 42
ep180 T s42
seed 100
ep180 T s100
seed 200
ep180 T s200

ep200 (T) — collapses like N, ghost in the same place as N

seed 42
ep200 T s42
seed 100
ep200 T s100
seed 200
ep200 T s200

ep215 (T) — ghost gone, shape returned, but direction stays base-bias

seed 42
ep215 T s42
seed 100
ep215 T s100
seed 200
ep215 T s200

Full matrix (all eps × all formats)

Phase 4 completed the TN format sweep, filling all 3-format × 7-epoch cells.

epoch | T         | N          | TN         | Notes
100   | 0/3       | 0/3        | 0/3        | All formats dead (training too shallow)
120   | 0/3       | 1/3        | 1/3        | Only N and TN beginning to lift
150   | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) | Sweet spot
180   | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) | Sweet spot
200   | 0/3       | 0/3        | 0/3        | All formats dead (ghosting)
215   | 0/3       | 1/3 (33%)  | 0/3        | Sole exception: TN < N
227   | 0/3       | 2/3 (67%)  | 2/3 (67%)  | N = TN partial recovery, T dead

Even in T format, ep150-180 forms the same plateau. The hit rate (67%) is lower than N’s 100%, but clearly higher than at every other epoch (all 0%). The sweet-spot region is structural and crosses formats.

The notable behavioral split is at ep227, where N partially recovers to 67% but T stays at 0%. The partial recovery at ep227 happens only via the NL path — meaning the LoRA keeps a partial response to NL direction descriptions to the end, but tags can no longer pull out direction information. An asymmetric process.

ep215 T has the ep200 ghosting gone and the shape restored, but direction is fixed to base bias (viewer-left). Reads as a 3-stage transition: collapse → shape recovery → direction loss.

TN format (tags + NL) sweep

Filling out TN format across all eps. ep150 / ep227 already shown (both equal to N). New scans for ep100 / ep120 / ep180 / ep200.

ep100 (TN) — full miss

seed 42
ep100 TN s42
seed 100
ep100 TN s100
seed 200
ep100 TN s200

ep120 (TN) — same 1/3 lift-off as N

seed 42
ep120 TN s42
seed 100
ep120 TN s100
seed 200
ep120 TN s200

ep180 (TN) — perfect 3/3, alongside ep150

seed 42
ep180 TN s42
seed 100
ep180 TN s100
seed 200
ep180 TN s200

ep200 (TN) — all formats collapsed, ghosting persists

seed 42
ep200 TN s42
seed 100
ep200 TN s100
seed 200
ep200 TN s200

ep215 in N / TN

To isolate transitional behavior, also generated ep215 in N / TN.

ep215 (N) — entry to partial recovery, 1/3

seed 42
ep215 N s42
seed 100
ep215 N s100
seed 200
ep215 N s200

ep200 (0%) → ep215 (33%) → ep227 (67%) — N format hit rate recovers in a roughly linear fashion. From the catastrophic-forgetting bottom (ep200), the LoRA is partially regaining response to the NL signal.

ep215 (TN) — worse than N, 0/3

seed 42
ep215 TN s42
seed 100
ep215 TN s100
seed 200
ep215 TN s200

Here the asymmetry appears: ep215 TN (0/3) is worse than ep215 N (1/3). In the sweet-spot region, N and TN match (ep150 both 3/3, ep227 both 2/3). But in the collapse region, the T tags become an interference source dragging down the N direction signal. The tag side ponytail carries no direction information; combined with overfitted weights drifting toward base bias, the only association left seems to be between the tag and the viewer-left side-ponytail pattern. Adding the T tags in TN format reactivates that association and overwrites the NL signal — that’s the read.

So:

  • Sweet spot (ep150-180): NL and tags cooperate, TN ≈ N (both high hit)
  • Overfitting boundary (ep215): NL is barely alive, tags become harmful, TN < N
  • End-stage recovery (ep227): only NL partially recovered, T path completely dead, TN ≈ N (same level)

The reason only ep215 shows “TN < N”: overfitting has burned in a strong association between side ponytail (the word) and viewer-left. Even with the NL direction signal alive, the side ponytail in the tag column reactivates the “tilt left” known pattern and overwrites the NL right side of the image. ep200 is broken to the point the LoRA can’t hold the shape, so no signal works at all. ep227 has training progressed enough that NL wins again. Each stage shows distinct behavior.

Observations (from the scan results)

5. The hit-rate curve isn’t unimodal

ep150-180 plateau → bottom at ep200 (0%, with ghosting) → partial recovery at ep227 (67%) — non-monotonic motion. Around ep200 is a transitional zone where the LoRA can’t commit to the side-ponytail shape itself, melting translucently toward the base bias direction. Beyond that (ep227) shape returns to some extent, but direction is dragged back to base bias (viewer-left) and gambling persists.

6. Even at the same epoch, expression / overall impression shifts with format

Lining up ep180 in 3 formats reveals that, separately from direction control, expression rendering varies by format. Format T (tags only, no expression specified) skews neutral / slightly stoic; format N (NL with calm expression) gives a soft, settled expression; TN sits in between. NL clamps not just direction but also overall impression; without it, the base falls back to default interpretation and wobbles.

Practical implication:

  • Picking ep150 / ep180 by direction hit rate alone still gets you an additional expression gamble at format T runtime
  • N format has direction 100% × stable expression = doubly stable, so as a deployment recipe, “ep150 or ep180 + N format + minimize seed jitter” is the optimum
  • T format combines 67% direction with expression jitter, so it’s a double gamble — not for production

7. T format also plateaus on the same range, but ep227 doesn’t recover

Re-sweeping with the tags-only prompt, ep150 / ep180 both land at 2/3 (67%): worse than N format, but clearly higher than any other epoch. The sweet spot is structural and format-independent — within the range, the LoRA can pull direction information partially even from a tag-only column. Outside it, nothing (0%).

ep227 behavior splits cleanly between T and N:

  • N format: 67% (partial recovery)
  • T format: 0% (no recovery)

Meaning the weak direction signal still alive in the ep227 LoRA only fires through the NL path. The tag side ponytail carries no direction information, so for a LoRA thinned out by overfitting, it provides no handhold and falls to base bias. So the post-ep200 collapse is an asymmetric process where the tag→visual mapping breaks down first.

8. 600-720 exposures per image is the stable region

Both ep150 (600) and ep180 (720) are flat at 100%. A bit more than the SDXL/IL rule of thumb of 200-400 per image, but reasonable given the strength of Anima’s base bias. Below this is signal-starved (ep100/ep120), above it overfits and breaks (ep200) — that looks like a guideline for small character LoRAs.

9. The “12,000+” official recommendation is a complete overshoot

Anima’s official 12,000+ steps is, I’m guessing, an absolute step count assuming a large dataset (hundreds to thousands of images). Applied to a 53-image character LoRA, ep227 hits 908 exposures per image, which is overfitting territory. Operating standards should be set in “exposures per image”, divided by source count.

Comparison with training material / WAI-IL LoRA

Checking how faithful ep150 (the v4 sweet spot) is to training material and the WAI-IL character LoRA. Prompts aligned to smile expression, generated with N format + seed 42.

training source
training source smile
WAI-IL v16 LoRA
WAI-IL v16 LoRA smile
v4 ep150 (Anima)
v4 ep150 smile

Looking at the three side by side, both LoRAs reflect the training source heavily and the differences are subtle, but:

  • WAI-IL v16 leans anime-stylized, the eye sparkle is strong. Comic-like saturation
  • v4 ep150 sits closer to the training source’s calmer color palette. The way the ahoge stands up, the scrunchie’s presence, the gentleness around the eyes feel closer to the training source
  • Hair color is closer to the training source’s soft brown on v4. WAI-IL is slightly more vivid

In terms of training-source fidelity, v4 ep150 sits closer to the training source. Likely because Anima’s base style is calmer than WAI-IL’s and is a better match for the training material. WAI-IL-based LoRAs end up with the IL’s inherent flair and comic-leaning saturation laid onto the face.

The flip side is that for “match the IL aesthetic” use cases, a WAI-IL-based LoRA is the right call. So this is less about superiority and more about the operational principle: the base model’s style rides on top, so pick the base for the style you want to land on. In light of this article’s original motivation (porting the character into the Anima ecosystem), it reinforces the conclusion that training a separate Anima character LoRA, distinct from the IL one, is worth it.

Conclusion

Set up as a binary between hypothesis A (training amount solves it) and hypothesis B (it doesn’t), but the result was a third answer.

  • Hypothesis A half-right: Increasing training from v3’s 636 steps to ep150-180’s 8,000-9,500 steps lifts the direction hit rate from ~50% to 100% (N format). Clearly effective
  • Hypothesis B half-right: But pushing further (ep200 / 10,600 steps) cratered to 0%, and ep227 / 12,031 only recovered to 67%. Anima’s “12,000+ steps” recommendation is completely excessive for a 53-image dataset

Practical conclusion:

  • ep150-180 + NL format prompt gives 100%, so an Anima character LoRA is workable
  • 600-720 exposures per image is the stable region (more than SDXL/IL’s rule of thumb of 200-400, accounting for Anima’s base bias strength)
  • Above 800 per image, catastrophic forgetting tears it apart (proven at ep200)

Operational guideline:

  • Aim training at (epoch × repeats) ≈ 600-720. For 53 images that’s ep150-180; for 100 images ep80-90; for 30 images ep270-320 — back-calculate from source count
  • Inference must use NL prompts that include direction descriptions. Tag-only breaks direction control
  • Training captions must include NL too (the v3 direction was correct)
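The back-calculation in the first bullet can be sketched as follows (my own helper, not from the toolkit; it reproduces the example epoch ranges by holding total optimizer steps in the sweet-spot band observed in this run, assuming repeats: 4 and effective batch 4):

```python
def epoch_range(n_images, repeats=4, eff_batch=4,
                target_steps=(7_950, 9_540)):
    """Epoch band that lands total steps in this run's observed sweet spot."""
    steps_per_epoch = n_images * repeats / eff_batch
    lo, hi = target_steps
    return round(lo / steps_per_epoch), round(hi / steps_per_epoch)

print(epoch_range(53))   # (150, 180)
print(epoch_range(100))  # (80, 95)  -- the article rounds this to ep80-90
print(epoch_range(30))   # (265, 318) -- the article rounds this to ep270-320
```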

Differences from the official documentation:

  • The official “12,000+ steps” line is presumably an absolute step count that assumes a large dataset (hundreds to thousands of images); it doesn’t transfer to small character LoRAs. Think in “exposures per image”, divided by source count
  • learning_rate 1e-5 to 2e-5: 2e-5 (the upper bound) was fine. Probably no need to go lower (1e-5 untested)
  • Differential Output Preservation discussion still seems valid, but within this article’s scope you can deploy without waiting for it

Avenues for further verification

  1. Strength sweep: Run ep150 / ep180 at strength 0.5 / 0.7 / 1.0 to see if low strength still holds shape
  2. Hairstyle independence retest: Demand hair down, semi-long hair at ep150 and check whether the v3-confirmed property “hairstyle isn’t burned into kanachan” still holds at ep150
  3. ep150 vs ep180 finals: Detailed character comparison (face, ahoge, scrunchie color) to decide which is the right kanachan within the sweet spot
  4. i2i pipeline check: Use ep150 in IL → Anima i2i for character preservation, closing out the original motivation (porting character into Anima ecosystem)

For now, ep150 or ep180 + NL prompt is enough for actual operation.


Surveying the whole process in time and cost:

Stage                                                                 | Time        | Cost
RunPod GPU v4 training (12,031 steps)                                 | ~12 hours   | ~$10
Local ComfyUI initial bust-up (ep150 / ep227, all formats, 18 images) | ~90 min     | electricity
Phase 1: N format sweep (ep100/120/180/200 × 3 seeds = 12 images)     | ~60 min     | same
Phase 2: T format sweep + ep215 all formats (21 images)               | ~105 min    | same
Phase 3: ep150 smile comparison (3 images)                            | ~15 min     | same
Phase 4: TN format sweep (ep100/120/180/200 × 3 seeds = 12 images)    | ~60 min     | same
Total                                                                 | ~17.5 hours | $10+

Total verification image count is 51 + 229 training-time samples = 280 images. About 70 of them are embedded in the article. Same face, same composition over and over, so I almost achieved Gestaltzerfall while writing it.

The verification side ran on local ComfyUI on M1 Max, 4-5 minutes per image. RTX 6000 Ada would be 30-60 seconds, so running everything on RunPod could have shortened verification time by 3+ hours. Trading off the overhead of building a ComfyUI environment, uploading LoRAs, downloading results, and watching the billing meter, running it locally in a familiar setup was just easier — that’s the trade-off.

A normal person wouldn’t do this just to bake one character LoRA. Usually you’d produce ep5 and ep10, say “yeah that works”, and call it done. Because Anima’s direction control didn’t bend the way I wanted, this turned into a chain that spans v1 through v4, four articles deep, before landing here.

But it was worth pushing through. Without going this deep, the basic questions — “is an Anima character LoRA even practical to begin with?”, “is following the official recommendation mechanically the right answer?” — never get an answer. Lining up 51 images is what produced reusable operational knowledge: “the sweet spot is 600-720 exposures per image”, “NL is decisive”, “the ep215 TN < N exception”.

Probably not many people fall into the same situation, but anyone who tried to bake a character LoRA on Anima and hit “direction won’t move”, “it broke from overfitting”, or “tags alone aren’t doing anything” should find a shortcut just from the data points in this article. And it’s also, hopefully, a partial answer to the universal question of why is drawing pictures with generative AI so painful.

The Anima series in chronological order. Links are scattered through the body, but easy to miss, so re-listed here.