The basic idea here is to train MLPs to predict a CLIP image embedding from a text embedding, and vice versa; then normalize and add in the embeddings.
@assadollahi @GreatDismal
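
A rough sketch of the "normalize and add" step, in plain Python. It reads the idea as: L2-normalize the original embedding and the MLP's cross-modal prediction, then sum them elementwise. That reading is an assumption, as are the helper names (`l2_normalize`, `mlp_forward`, `combine`), the one-hidden-layer ReLU architecture, and the toy 2-d weights, which stand in for a trained text-to-image MLP and real CLIP embeddings. Training is not shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP with ReLU (hypothetical architecture)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def combine(embedding, predicted):
    """Normalize both vectors, then add them elementwise (assumed reading)."""
    a = l2_normalize(embedding)
    b = l2_normalize(predicted)
    return [x + y for x, y in zip(a, b)]

# Made-up 2-d "text embedding" and weights; real CLIP embeddings are much larger.
text_emb = [3.0, 4.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[0.0, 1.0], [1.0, 0.0]]
b2 = [0.0, 0.0]

predicted_image_emb = mlp_forward(text_emb, w1, b1, w2, b2)
combined = combine(text_emb, predicted_image_emb)
```

The same `combine` call would apply in the other direction, with an image embedding and a predicted text embedding.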