
A roundup of Birchlabs's posts


ML Engineer at Anlatan (@novelaiofficial). co-author of HDiT (Hourglass Diffusion Transformers). works on diffusion models and LLMs. studying Japanese.
birchlabs.co.uk

Following: 191 Followers: 4507

it changed this one a bit more drastically, but hey we went down to one person


generating out-of-distribution (bigger) image sizes by fiddling with self-attn softmax.
(incorrectly) implemented advice from Kharr at … it worked anyway.
left=usual
right=smaller softmax denominator; topk(preferred_key_tokens) attn scores per query
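a minimal numpy sketch of the right-hand variant (function name, the `denom_scale` knob, and all parameter names are illustrative assumptions, not the actual implementation): keep only each query's top-k key scores, then shrink the softmax denominator by a constant factor.

```python
import numpy as np

def topk_softmax_attn(scores, k, denom_scale=1.0):
    """Sketch: per-query top-k attention with a shrunken softmax denominator.

    scores: (queries, keys) pre-softmax attention logits.
    Keep each query's k largest key scores, mask the rest to -inf,
    then softmax; denom_scale < 1 inflates the resulting weights.
    """
    # threshold = each query's k-th largest score
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # numerically-stable softmax with a scaled denominator
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    denom = e.sum(axis=-1, keepdims=True) * denom_scale
    return e / denom

rng = np.random.default_rng(0)
probs = topk_softmax_attn(rng.standard_normal((4, 16)), k=8, denom_scale=0.9)
```

with `denom_scale=0.9` each query's weights sum to 1/0.9 rather than 1, which is one way to read "smaller softmax denominator".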


can we use this to make images *larger* than those in the training distribution? fix "double body parts"?
trying to evaluate. need to implement mem-efficient attn version first (or use Mac); fp32 quantile() uses lots of VRAM.
median (50%ile) fixes body shape. encouraging.


dynamic-thresholding latents in pixel space.
at sigmas≥1.1: we decode to pixel space, do Imagen-style thresholding, encode to latents.
trained a tiny latent decoder + RGB encoder on VAE outputs (you could call this distillation).
left = CFG 30, usual
right = dynthreshed
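for reference, the Imagen-style thresholding step applied in pixel space looks roughly like this (numpy sketch; the surrounding tiny latent->RGB decode and RGB->latent encode are not shown, and the percentile is an assumed default):

```python
import numpy as np

def dynamic_threshold(x, percentile=99.5):
    """Imagen-style dynamic thresholding, per sample.

    x: pixel-space prediction. Clamp to the given percentile of |x|,
    then rescale back into [-1, 1]; only shrinks when the range
    actually exceeds [-1, 1].
    """
    s = np.percentile(np.abs(x), percentile)
    s = max(s, 1.0)
    return np.clip(x, -s, s) / s
```

at high CFG scales the prediction overshoots [-1, 1]; this squashes the outliers instead of hard-clipping everything, which is what rescues the saturated CFG-30 sample.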


if I abort CFG too early (here's a more aggressive cutoff, at sigma=1.4), then medium/fine details are solved without CFG and look more like the "most likely unconditional prediction". lost eyelashes and blush.
left = full CFG
right = CFG until sigma=1.4
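a sketch of the per-step guidance with a sigma cutoff (names are mine; I'm assuming "without CFG" means falling back to the conditional prediction alone, i.e. CFG scale 1):

```python
import numpy as np

def guided_denoise(uncond, cond, sigma, cfg_scale=7.5, cfg_cutoff_sigma=1.4):
    """Apply CFG only at high sigmas (coarse/structural steps).

    Below the cutoff, return the conditional prediction alone,
    so medium/fine details are solved without guidance.
    """
    if sigma >= cfg_cutoff_sigma:
        return uncond + cfg_scale * (cond - uncond)
    return cond
```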


brings bokeh backgrounds into focus


latent channels' means drift from 0 with each sampling step (especially when CFG is applied).
I re-centered denoising outputs on each latent channel's mean. it didn't help with CFG clipping, but I think it brings out high-frequency/low-sigma details (grass blades, leaves).
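the re-centering itself is a one-liner (numpy sketch; function name is mine):

```python
import numpy as np

def recenter_channels(latents):
    """Subtract each latent channel's spatial mean.

    latents: (channels, height, width). Counters the per-channel
    drift away from 0 that accumulates over sampling steps.
    """
    return latents - latents.mean(axis=(-2, -1), keepdims=True)
```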


here's how my latent thresholding technique preserves dynamic range at high CFG scales
we mimic the dynamic range of known-good CFG7.5.
center latent channels on means, clamp out 99.9%ile outliers, multiply by ratio "CFG7.5's max / my 99.9%ile", un-center.
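the steps above can be sketched per latent channel like so (numpy; function and parameter names are mine — `ref_max` stands in for "CFG7.5's max"):

```python
import numpy as np

def threshold_latents(latents, ref_max, percentile=99.9):
    """Mimic a known-good dynamic range at high CFG scales.

    latents: (channels, height, width).
    1. center each channel on its mean
    2. clamp outliers beyond the given percentile of |centered|
    3. rescale so that percentile maps to ref_max
    4. un-center
    """
    means = latents.mean(axis=(-2, -1), keepdims=True)
    centered = latents - means
    q = np.percentile(np.abs(centered), percentile,
                      axis=(-2, -1), keepdims=True)
    clamped = np.clip(centered, -q, q)
    return clamped * (ref_max / q) + means
```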


had an art night
I brought VSCode, an ssh tunnel and a wireless mouse


multi-cond guidance, without cubic easing.
since it's a linear interpolation schedule: we have more frames at the midpoints between conditions, where the mixing is least coherent.
still: it's a visual interpolation, so the motion's easy to track.
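a numpy sketch of that linear interpolation schedule (names are mine): frames are spaced uniformly along the piecewise-linear path through the condition embeddings, so the midpoint mixes get as much screen time as the endpoints — cubic easing would instead dwell near the conditions themselves.

```python
import numpy as np

def lerp_cond_schedule(conds, n_frames):
    """Uniform frames along a piecewise-linear path through conditions.

    conds: (n_conds, dim) condition embeddings.
    Returns (n_frames, dim) interpolated embeddings.
    """
    conds = np.asarray(conds, dtype=float)
    t = np.linspace(0, len(conds) - 1, n_frames)
    i = np.clip(t.astype(int), 0, len(conds) - 2)   # segment index
    frac = (t - i)[:, None]                          # position within segment
    return conds[i] * (1 - frac) + conds[i + 1] * frac
```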
