Create good Conformer baselines #233
Comments
I'll get right on it. May be diverted to other tasks but will try to find the time.
Note that I'm simultaneously also working on this. But there are so many different things to test here that this should not be a problem. Specifically, my current setting is a BPE-based monotonic transducer (RNA-like) on Switchboard, and I compare some old Conformer config vs
I noticed that the
We should also check param init. And also look at other frameworks' code, e.g. here in Fairseq:
On rel pos enc, I'm collecting an overview here: https://github.com/rwth-i6/returnn_common/wiki/Relative-positional-encoding
@albertz a few questions:
You will find a couple of recipes on returnn-experiments, for example:

I also have an adapted variant where I embed this old net dict in returnn-common here:

I was also able to fully replicate this config now in pure returnn-common. That means I wrote a checkpoint converter script and verified that I get exactly the same outputs after every layer:
I didn't think too much about this yet. Maybe it makes sense when you really want to see the exact differences.

The first step to really know that we implemented exactly the same model (or rather: that we can configure our model such that it exactly matches some ESPnet variant) is to import the model parameters and verify that we get exactly the same outputs, layer by layer. This is some work but mostly straightforward and very systematic, and it usually already leads to lots of interesting insights about differences that we did not really realize before.

The next step is to see that we get similar training behavior, in terms of learning curves and all other numbers. This is trickier because it cannot really be checked systematically anymore. For this, you need to match the param init scheme, optimizer settings, other regularization settings, etc. We never really did this, but it would actually be a very interesting experiment, because some people have observed that there are big differences in training behavior.
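A minimal sketch of that layer-by-layer check, assuming both setups dump their intermediate layer outputs for the same input to `.npy` files (the directory layout and layer names here are hypothetical):

```python
import numpy as np

# Hypothetical layout: one .npy file per encoder layer and per framework for the same input,
# e.g. dumps/espnet/enc_layer_03.npy vs. dumps/returnn/enc_layer_03.npy.
layer_names = [f"enc_layer_{i:02d}" for i in range(12)]

for name in layer_names:
    ref = np.load(f"dumps/espnet/{name}.npy")
    new = np.load(f"dumps/returnn/{name}.npy")
    assert ref.shape == new.shape, f"{name}: shape mismatch {ref.shape} vs {new.shape}"
    # Bit-exact equality is too strict across frameworks; allow small numerical differences.
    np.testing.assert_allclose(new, ref, rtol=1e-5, atol=1e-5, err_msg=name)
    print(f"{name}: ok, max abs diff {np.abs(new - ref).max():.2e}")
```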
Better ask Eugen (@curufinwe) what datasets he recommends.
Also better ask Eugen about this.
Btw, you can see all my recent Conformer experiments here: https://github.com/rwth-i6/i6_experiments/tree/main/users/zeyer/experiments/exp2022_07_21_transducer/exp_fs_base

Basically I create a new file there for every experiment I run. Also see the readme there with some notes.
I noticed that we do not have dropout after the self-attention:

```python
# MHSA
x_mhsa_ln = self.self_att_layer_norm(x_ffn1_out)
x_mhsa = self.self_att(x_mhsa_ln, axis=spatial_dim)
x_mhsa_out = x_mhsa + x_ffn1_out
```

This is different from the standard Transformer.
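For comparison, a self-contained PyTorch sketch of the standard Transformer self-attention sub-block, which applies dropout to the attention output before the residual add (plain `nn.MultiheadAttention`, no relative positional encoding; this is only an illustration, not the returnn-common code):

```python
import torch
from torch import nn

class MHSABlock(nn.Module):
    """Pre-LN self-attention sub-block as in the standard Transformer."""

    def __init__(self, dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(dim)
        self.self_att = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)  # this dropout is what the snippet above is missing

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x_ln = self.layer_norm(x)
        att, _ = self.self_att(x_ln, x_ln, x_ln, need_weights=False)
        return x + self.dropout(att)  # dropout on the attention output, then residual add
```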
Regarding param init, in ESPnet there is almost no code at all for this, meaning it mostly uses the PyTorch defaults. The main exception is

```python
torch.nn.init.xavier_uniform_(self.pos_bias_u)
torch.nn.init.xavier_uniform_(self.pos_bias_v)
```

for the biases, which are directly created there.
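For reference, a small sketch contrasting the two: what the PyTorch default for `nn.Linear` (its `reset_parameters()`) does, and the explicit Xavier-uniform init applied to the extra rel-pos bias parameters; the tensor shapes below are just example values:

```python
import math
import torch
from torch import nn

# PyTorch default, as done by nn.Linear.reset_parameters() (and thus inherited by ESPnet):
linear = nn.Linear(512, 512)
nn.init.kaiming_uniform_(linear.weight, a=math.sqrt(5))
fan_in = linear.weight.shape[1]
bound = 1.0 / math.sqrt(fan_in)
nn.init.uniform_(linear.bias, -bound, bound)

# Explicit init for the extra relative-position attention biases
# (shape (num_heads, head_dim) = (8, 64) is just an example here):
pos_bias_u = nn.Parameter(torch.empty(8, 64))
pos_bias_v = nn.Parameter(torch.empty(8, 64))
nn.init.xavier_uniform_(pos_bias_u)
nn.init.xavier_uniform_(pos_bias_v)
```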
On some internal data, and maybe also Switchboard and Librispeech.

Using `nn.Conformer`. Making it somewhat more standard if possible, and then deviating from it when it makes sense. Also compare it to earlier Conformer recipes, and earlier BLSTM recipes. Make sure the conditions are sane for comparison, e.g. same number of epochs.
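For orientation on what "standard" means here, a compact, self-contained PyTorch sketch of the block structure from the original Conformer paper (Gulati et al. 2020): two half-step feed-forward modules around MHSA and a convolution module, each sub-module with pre-LayerNorm, dropout and a residual connection, plus a final LayerNorm. This is purely illustrative (vanilla `nn.MultiheadAttention` instead of relative positional encoding; all module names are my own), not the returnn-common `nn.Conformer` API:

```python
import torch
from torch import nn

class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv + GLU, depthwise conv, BN, Swish, pointwise conv."""

    def __init__(self, dim: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_conv1 = nn.Conv1d(dim, 2 * dim, 1)
        self.depthwise_conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise_conv2 = nn.Conv1d(dim, dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = self.norm(x).transpose(1, 2)  # -> (batch, dim, time) for Conv1d
        x = nn.functional.glu(self.pointwise_conv1(x), dim=1)
        x = nn.functional.silu(self.batch_norm(self.depthwise_conv(x)))  # Swish
        x = self.pointwise_conv2(x)
        return self.dropout(x.transpose(1, 2))

def ffn(dim: int, mult: int = 4, dropout: float = 0.1) -> nn.Sequential:
    """Feed-forward module with pre-LayerNorm, Swish, and dropout."""
    return nn.Sequential(
        nn.LayerNorm(dim), nn.Linear(dim, mult * dim), nn.SiLU(), nn.Dropout(dropout),
        nn.Linear(mult * dim, dim), nn.Dropout(dropout))

class ConformerBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = ffn(dim, dropout=dropout)
        self.mhsa_norm = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.mhsa_dropout = nn.Dropout(dropout)
        self.conv = ConvModule(dim, dropout=dropout)
        self.ffn2 = ffn(dim, dropout=dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                # Macaron-style half-step FFN
        y = self.mhsa_norm(x)
        att, _ = self.mhsa(y, y, y, need_weights=False)
        x = x + self.mhsa_dropout(att)            # MHSA with dropout and residual
        x = x + self.conv(x)                      # convolution module with residual
        x = x + 0.5 * self.ffn2(x)                # second half-step FFN
        return self.final_norm(x)
```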
When we have that, we should also change the `Conformer` defaults to something reasonable.

I think our earlier Conformer recipes (there are several variants floating around in our group...) are somewhat non-standard compared to `nn.Conformer`, but we never really compared them systematically, and also `nn.Conformer` is not really well tested yet. See our wiki on relative positional encoding for further references.

References, and related: