got #stablediffusion's UNet compiled to CoreML, targeting all compute units (including Neural Engine).
replaced self-attention with the MultiHeadAttention that Apple optimized for the Neural Engine.
not faster yet (need to replace cross-attention too).
https://t.co/9tdKdGIK7X
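the conversion looks roughly like this; load_unet and replace_self_attention are hypothetical stand-ins (the latter swapping in a conv2d-based attention like Apple's ml-ane-transformers MultiHeadAttention):

```python
import torch
import coremltools as ct

unet = load_unet()             # hypothetical loader for SD's UNet
replace_self_attention(unet)   # hypothetical: swap in ANE-friendly attention
unet.eval()

example = (torch.randn(2, 4, 64, 64),   # latents
           torch.tensor([500.0]),       # timestep
           torch.randn(2, 77, 768))     # text embeddings
traced = torch.jit.trace(unet, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=t.shape) for t in example],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,   # CPU + GPU + Neural Engine
)
mlmodel.save("unet.mlpackage")
```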
k-diffusion added Brownian tree noise sampling, increasing the stability of convergence.
10, 15, 20, 25, 30, 35, 40, 100 step counts:
left = default noise
right = Brownian noise
default strategy has it jumping all over the place, but Brownian sampling is stable. #stablediffusion
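for reference, passing a Brownian tree noise sampler into a k-diffusion ancestral sampler looks roughly like this (denoiser and the sigma range are assumptions from a typical SD setup):

```python
import torch
from k_diffusion.sampling import (BrownianTreeNoiseSampler, get_sigmas_karras,
                                  sample_dpmpp_2s_ancestral)

sigmas = get_sigmas_karras(n=15, sigma_min=0.03, sigma_max=14.6, device='cpu')
x = torch.randn(1, 4, 64, 64) * sigmas[0]

# Brownian tree noise keeps the noise consistent across step counts,
# so increasing steps refines the same image instead of changing it
noise_sampler = BrownianTreeNoiseSampler(x, sigma_min=0.03, sigma_max=14.6, seed=42)
samples = sample_dpmpp_2s_ancestral(denoiser, x, sigmas, noise_sampler=noise_sampler)
```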
skipping the clamping-by-%ile and just denormalizing CFG20's latents to CFG7.5's abs().max() is very similar to reducing cond_scale, but not quite.
think it works out something like:
(uncond + (cond - uncond) * 20)/(20/7.5)
versus
uncond + (cond - uncond) * (20/(20/7.5))
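the "not quite": the first form also shrinks the uncond term by 7.5/20, while the second keeps uncond at full strength. a quick numeric check (a sketch; shapes and values are illustrative):

```python
import torch

torch.manual_seed(0)
uncond = torch.randn(4)
cond = torch.randn(4)

# denormalizing the CFG-20 result into CFG-7.5's range (roughly):
a = (uncond + (cond - uncond) * 20) / (20 / 7.5)
# simply reducing cond_scale to 7.5:
b = uncond + (cond - uncond) * (20 / (20 / 7.5))

# a = uncond * (7.5/20) + (cond - uncond) * 7.5
# b = uncond           + (cond - uncond) * 7.5
print(a - b)  # == uncond * (7.5/20 - 1); not zero
```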
99.99999999999%ile dynthresh
I think this shows that "clamping out n%ile outliers" is only important when you have excessive outliers. the rest of the battle is "what range of values do you span"; hence denormalizing the latents to CFG7.5's abs().max() gives us a safer range.
each step: compute CFG7.5. for each channel: flatten, center on mean, grab abs().max().
compute CFG20. for each channel: flatten, center on mean, compute the 99.x%ile of abs(); take the larger of that %ile or CFG7.5's max. clamp the channel by it, normalize, multiply by CFG7.5's max.
code coming
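until the real code lands, here's a minimal sketch of the recipe above. the function name, the double-precision quantile, and re-adding the per-channel mean at the end are my assumptions:

```python
import torch

def dynthresh(uncond, cond, cfg_big=20.0, cfg_ref=7.5, pct=0.9999999999999):
    # run CFG at both scales (uncond/cond: [batch, channels, height, width])
    big = uncond + (cond - uncond) * cfg_big
    ref = uncond + (cond - uncond) * cfg_ref

    # flatten each channel and center it on its mean
    b = big.flatten(2)
    r = ref.flatten(2)
    b_means = b.mean(dim=2, keepdim=True)
    b_centered = b - b_means
    r_centered = r - r.mean(dim=2, keepdim=True)

    # known-good ceiling: CFG7.5's per-channel abs().max()
    r_max = r_centered.abs().max(dim=2, keepdim=True).values

    # 99.x%ile of CFG20's abs(); keep whichever ceiling is larger
    q = torch.quantile(b_centered.abs().double(), pct, dim=2, keepdim=True)
    ceiling = torch.maximum(q.to(b.dtype), r_max)

    # clamp outliers, normalize to [-1, 1], rescale into CFG7.5's range
    clamped = b_centered.clamp(-ceiling, ceiling)
    out = clamped / ceiling * r_max + b_means
    return out.reshape(big.shape)
```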
made a new algorithm for dynamic thresholding in #stablediffusion
enables us to set CFG scale high (e.g. 20) without clipping, keeping dynamic range / subtlety in shadows and highlights.
we refer to a known-good CFG (7.5)'s dynamic range, which helps us pick a ceiling.
detail to follow
got the official DPM-Solver++ sampler working with #stablediffusion on Mac.
today, Cheng Lu added a trick to improve performance at <15 steps; k-diffusion probably doesn't have this yet.
dynthresh only works in pixel space; it remains an unsolved problem on latents.
https://t.co/22rDBzDWiW
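roughly, assuming the API in Cheng Lu's dpm-solver repo (model, betas, cond, uncond come from your own Stable Diffusion setup):

```python
import torch
from dpm_solver_pytorch import NoiseScheduleVP, model_wrapper, DPM_Solver

noise_schedule = NoiseScheduleVP(schedule='discrete', betas=betas)

model_fn = model_wrapper(
    model, noise_schedule,
    model_type="noise",                  # SD's UNet predicts epsilon
    guidance_type="classifier-free",
    condition=cond,
    unconditional_condition=uncond,
    guidance_scale=7.5,
)

solver = DPM_Solver(model_fn, noise_schedule, algorithm_type="dpmsolver++")
x_T = torch.randn(1, 4, 64, 64)
latents = solver.sample(x_T, steps=10, order=2, method='multistep')
```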
with DPM-Solver++(2M) sampler, we get coherent images in 5 sampler steps!
and these aren't Heun steps (where n steps = 2n-1 model calls), this is just 5 model calls! less than 3.5 secs on Mac!
Katherine released this implementation yesterday in k-diffusion — great work as usual!
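usage is simple; assuming a k-diffusion-wrapped denoiser (e.g. CompVisDenoiser) and typical SD sigmas:

```python
import torch
from k_diffusion.sampling import get_sigmas_karras, sample_dpmpp_2m

sigmas = get_sigmas_karras(n=5, sigma_min=0.03, sigma_max=14.6, device='cpu')
x = torch.randn(1, 4, 64, 64) * sigmas[0]

# 5 steps = 5 model calls; DPM-Solver++(2M) reuses the previous step's
# output instead of doing a Heun-style second evaluation
samples = sample_dpmpp_2m(denoiser, x, sigmas)
```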
classifier-free guidance:
ask the model to denoise Gaussian noise.
no condition: model predicts a salad.
shrine maiden condition: model predicts graffiti of faces.
CFG is "what makes shrine maiden different from salad", multiplied by your guidance scale.
repeat this every sampler step.
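as a one-liner sketch (names illustrative):

```python
import torch

def cfg_step(uncond: torch.Tensor, cond: torch.Tensor, scale: float) -> torch.Tensor:
    # (cond - uncond) is "what makes shrine maiden different from salad";
    # amplify it by the guidance scale on top of the unconditional prediction
    return uncond + (cond - uncond) * scale
```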