GenericSelfAttention, biases are inconsistent to SelfAttentionLayer #234

albertz · 2022-11-02T19:20:27Z

I noticed that nn.SelfAttention differs a bit from SelfAttentionLayer: SelfAttentionLayer does not have biases for the qkv and proj linear projections, while nn.SelfAttention currently does.

This is relevant for Conformer (e.g. #233) and Transformer.

albertz · 2022-11-02T19:43:36Z

Well, maybe having those biases is actually standard? E.g. in Fairseq:
https://github.com/facebookresearch/fairseq/blob/b4001184f49ed0e20d619b54bb3d43088fabf990/fairseq/modules/multihead_attention.py#L123-L131

albertz · 2022-11-02T19:48:37Z

Also used by default in PyTorch nn.MultiheadAttention (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention).

albertz · 2022-11-02T19:53:07Z

Note that SelfAttentionLayer was designed to be 1:1 equivalent to the Tensor2Tensor code, and looking at the current T2T code (I think here), there seems to be no bias there either. So that was the original Transformer. But it has evolved since then, and I think Fairseq is probably much more widely used.

albertz · 2022-11-02T19:55:33Z

In ESPNet, bias is also used:
https://github.com/espnet/espnet/blob/a65cc78de7e18c867f4be5fc0b9b695875c78c70/espnet/nets/pytorch_backend/transformer/attention.py#L32-L34

albertz · 2022-11-02T19:56:32Z

So, as they seem to be standard nowadays, I think having them enabled is ok.

albertz · 2022-11-02T19:57:45Z

@patrick-wilken Are you aware of this?

albertz · 2022-11-02T21:47:28Z

I added the option with_bias, so you can specify it explicitly. The default is still True now.

patrick-wilken · 2022-11-04T12:28:52Z

No, I wasn't. You won't find papers discussing what difference it makes, right? Maybe I should try it out, biases seem like well-spent parameters. 😄
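For a rough sense of how few parameters the biases cost, here is a minimal sketch that counts the parameters of the qkv and proj projections with and without biases. It assumes the key, value and output dimensions all equal the model dimension; the function name `attention_projection_params` is made up for illustration and is not part of returnn-common or RETURNN.

```python
def attention_projection_params(model_dim: int, with_bias: bool = True) -> int:
    """Count the parameters of the qkv and proj linear projections.

    Illustrative assumption: total key dim, total value dim and output dim
    are all equal to model_dim.
    """
    qkv_weights = model_dim * 3 * model_dim  # one fused projection for q, k, v
    proj_weights = model_dim * model_dim  # output projection
    total = qkv_weights + proj_weights
    if with_bias:
        # One bias vector per projection output: 3 * model_dim for qkv,
        # model_dim for proj.
        total += 3 * model_dim + model_dim
    return total


if __name__ == "__main__":
    dim = 512
    n_with = attention_projection_params(dim, with_bias=True)
    n_without = attention_projection_params(dim, with_bias=False)
    print("extra parameters from biases:", n_with - n_without)
    print("relative increase:", (n_with - n_without) / n_without)
```

Under these assumptions the biases add 4 * model_dim parameters on top of 4 * model_dim² weights, i.e. a relative increase of 1 / model_dim per attention block.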

albertz · 2022-11-04T13:24:07Z

Note that this with_bias was added to returnn-common. It's not available in SelfAttentionLayer.

albertz mentioned this issue Nov 2, 2022

Create good Conformer baselines #233

Open

albertz added this to the first-release milestone Nov 2, 2022

albertz changed the title ~~GenericSelfAttention, biases are non-standard~~ GenericSelfAttention, biases are inconsistent to SelfAttentionLayer Nov 2, 2022

albertz closed this as completed Nov 2, 2022
