
Stage 2 training #100

Open
1 of 5 tasks
isamu-isozaki opened this issue Jul 23, 2023 · 0 comments

isamu-isozaki (Collaborator) commented Jul 23, 2023

Once the PR for MaxViT is done, I'm thinking of adding the super-resolution conditioning part below.
Some steps I think are needed:

  • In train_muse.py's prepare_inputs_and_labels function, interpolate pixel values to 256x256 and get tokens using the f16 VQGAN for low resolution, and to 512x512 with the f8 VQGAN for high resolution. We can use precomputed embeddings here.
  • Then, we might want a SuperResTransformer class which takes as attributes:
      • the TransformerLayers for low resolution,
      • the MaxVitTransformerLayers for high resolution,
      • and a projection layer plus a concatenation layer joining the low-res and text embeddings.
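As a rough sketch of the steps above (all names here are hypothetical: `resize_nearest` stands in for the pixel interpolation done before VQGAN tokenization, and the NumPy matrices stand in for the real TransformerLayers / MaxVitTransformerLayers; the actual implementation would use PyTorch):

```python
import numpy as np

def resize_nearest(pixels, size):
    """Nearest-neighbor resize of an (H, W, C) pixel array to (size, size, C).
    Stand-in for the interpolation step before f16/f8 VQGAN tokenization."""
    h, w, _ = pixels.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return pixels[rows][:, cols]

class SuperResTransformer:
    """Sketch of the proposed class: it would hold the low-res layers, the
    high-res MaxViT layers, and a projection + concatenation step that joins
    the low-res and text embeddings into the high-res stream."""

    def __init__(self, dim_low, dim_text, dim_high, seed=0):
        rng = np.random.default_rng(seed)
        # projections into the high-res model's hidden size (assumed shapes)
        self.low_proj = rng.standard_normal((dim_low, dim_high)) * 0.02
        self.text_proj = rng.standard_normal((dim_text, dim_high)) * 0.02

    def condition(self, low_res_emb, text_emb):
        # project both streams to dim_high, then concatenate along the
        # sequence axis so the high-res layers can attend over both
        low = low_res_emb @ self.low_proj
        txt = text_emb @ self.text_proj
        return np.concatenate([txt, low], axis=0)
```

For example, conditioning a high-res stream on 10 low-res tokens and 5 text tokens would yield a combined sequence of 15 tokens in the high-res hidden size.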