With other methods, you can just take a low-resolution image as the seed for the next stage and increase the resolution in a coarse-to-fine manner.
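
The coarse-to-fine idea can be sketched roughly as follows. This is only an illustration: `refine` is a hypothetical stand-in for whatever optimizer runs at each stage, and the nearest-neighbour `upsample2x` and the doubling schedule are assumptions, not something a particular method prescribes.

```python
def upsample2x(img):
    """Nearest-neighbour 2x upsampling of a 2D list of pixel values."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]  # duplicate each pixel horizontally
        out.append(wide)
        out.append(list(wide))                     # duplicate the row vertically
    return out


def refine(img, steps):
    """Hypothetical placeholder for the per-stage optimizer.

    A real pipeline would run `steps` optimization iterations here;
    this stub returns the image unchanged so the sketch stays runnable.
    """
    return img


def coarse_to_fine(seed, num_stages, steps, refine_fn=refine):
    """Refine at low resolution, then repeatedly upsample and refine again."""
    img = refine_fn(seed, steps)
    for _ in range(num_stages - 1):
        img = upsample2x(img)          # the previous result seeds the next stage
        img = refine_fn(img, steps)
    return img


seed = [[0, 1], [2, 3]]                # tiny 2x2 "image"
result = coarse_to_fine(seed, num_stages=3, steps=10)
print(len(result), len(result[0]))     # 8 8
```

With the identity `refine`, the output is just the seed upsampled twice; swapping in a real optimizer as `refine_fn` is where each stage would add detail.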

With CLIP+VQGAN, this doesn't really work: everything great about the original is lost after 50 iterations.
