Hi mvsplat team. I have been reading through your paper (https://arxiv.org/abs/2403.14627) and I'm looking for more detail on the cross-view transformer blocks.
To construct the cost volumes, we first extract multi-view image features with a CNN and Transformer architecture. Specifically, a shallow ResNet-like CNN is first used to extract 4× downsampled per-view image features. Then, we use a multi-view Transformer with self and cross-attention layers to exchange information between different views.
Skimming through the code, it looks like this means you take each per-view feature map (H, W, C), turn it into a sequence of tokens by a simple reshape into (H*W, C), and then run a transformer over these tokens, right? And the only other special part is the generation of the shifted-window masks for the Swin-style attention? Something like the sketch below is what I have in mind.
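To make my reading concrete, here is a minimal PyTorch sketch of what I understand is happening: flatten each view's (H, W, C) features into (H*W, C) tokens, then alternate self-attention within a view and cross-attention between views. This is only an illustration of my understanding, not your code; the CrossViewBlock class, shapes, and head count are made up, and the Swin-style shifted-window masking is left out for brevity.

```python
import torch
import torch.nn as nn


class CrossViewBlock(nn.Module):
    """One self-attention + cross-attention block over flattened view tokens (sketch)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        # tokens_*: (B, H*W, C) token sequences, one per view.
        # Self-attention: exchange information within each view.
        a = tokens_a + self.self_attn(self.norm1(tokens_a), self.norm1(tokens_a),
                                      self.norm1(tokens_a))[0]
        b = tokens_b + self.self_attn(self.norm1(tokens_b), self.norm1(tokens_b),
                                      self.norm1(tokens_b))[0]
        # Cross-attention: each view queries the other view's tokens.
        a = a + self.cross_attn(self.norm2(a), self.norm2(b), self.norm2(b))[0]
        b = b + self.cross_attn(self.norm2(b), self.norm2(a), self.norm2(a))[0]
        return a, b


# Flatten (B, C, H, W) CNN features into (B, H*W, C) tokens per view.
B, C, H, W = 1, 128, 64, 96          # example 4x-downsampled feature resolution
feat_a = torch.randn(B, C, H, W)
feat_b = torch.randn(B, C, H, W)
tok_a = feat_a.flatten(2).transpose(1, 2)   # (B, H*W, C)
tok_b = feat_b.flatten(2).transpose(1, 2)   # (B, H*W, C)
tok_a, tok_b = CrossViewBlock(dim=C)(tok_a, tok_b)
```

Is this roughly what the multi-view transformer is doing, with the shifted-window masking being the main difference from plain global attention?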
Thanks for the work.