Hi mvsplat team. I have been reading through your paper (https://arxiv.org/abs/2403.14627) and I'm looking for more detail on the cross-view transformer blocks.
To construct the cost volumes, we first extract multi-view image features with a CNN and Transformer architecture. Specifically, a shallow ResNet-like CNN is first used to extract 4× downsampled per-view image features. Then, we use a multi-view Transformer with self and cross-attention layers to exchange information between different views.
Skimming through the code, it looks like this means you take each per-view feature map (H, W, C), turn it into a sequence of tokens by a simple reshape into (H*W, C), and then run a transformer over these tokens, right? And the only other special part is the generation of the shifted-window masks for the Swin-style attention? Something like the sketch below is what I have in mind.
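To make my reading concrete, here is a minimal PyTorch sketch of what I understand is happening: flatten each view's (H, W, C) features into (H*W, C) tokens, then alternate self-attention within a view and cross-attention between views. This is only an illustration of my understanding, not your code; the CrossViewBlock class, shapes, and head count are made up, and the Swin-style shifted-window masking is left out for brevity.

```python
import torch
import torch.nn as nn


class CrossViewBlock(nn.Module):
    """One self-attention + cross-attention block over flattened view tokens (sketch)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        # tokens_*: (B, H*W, C) token sequences, one per view.
        # Self-attention: exchange information within each view.
        a = tokens_a + self.self_attn(self.norm1(tokens_a), self.norm1(tokens_a),
                                      self.norm1(tokens_a))[0]
        b = tokens_b + self.self_attn(self.norm1(tokens_b), self.norm1(tokens_b),
                                      self.norm1(tokens_b))[0]
        # Cross-attention: each view queries the other view's tokens.
        a = a + self.cross_attn(self.norm2(a), self.norm2(b), self.norm2(b))[0]
        b = b + self.cross_attn(self.norm2(b), self.norm2(a), self.norm2(a))[0]
        return a, b


# Flatten (B, C, H, W) CNN features into (B, H*W, C) tokens per view.
B, C, H, W = 1, 128, 64, 96          # example 4x-downsampled feature resolution
feat_a = torch.randn(B, C, H, W)
feat_b = torch.randn(B, C, H, W)
tok_a = feat_a.flatten(2).transpose(1, 2)   # (B, H*W, C)
tok_b = feat_b.flatten(2).transpose(1, 2)   # (B, H*W, C)
tok_a, tok_b = CrossViewBlock(dim=C)(tok_a, tok_b)
```

Is this roughly what the multi-view transformer is doing, with the shifted-window masking being the main difference from plain global attention?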
Thanks for the work.