Network differs for consecutive calls of Conformer Module. #257

Open
JackTemaki opened this issue Mar 14, 2023 · 4 comments
JackTemaki commented Mar 14, 2023

When starting a fresh training run, the network is already constructed twice, because the network description apparently differs between consecutive calls:

The diff is:

dict diff:
['encoder'] dict diff:
['encoder'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape'] ['extra_deps'] list diff len: len self: 1, len other: 2
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape_0'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape_0'] ['extra_deps'] list diff len: len self: 1, len other: 2
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['dot'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['dot'] ['from'] list diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['dot'] ['from'] [0] self: 'base:relative_positional_encoding' != other: 'base:relative_positional_encoding/sin'
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] ['subnetwork'] ['output'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] ['subnetwork'] ['output'] ['from'] self: 'sin' != other: 'concat'
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] ['subnetwork'] ['output'] ['out_shape'] set diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['relative_positional_encoding'] ['subnetwork'] ['output'] ['out_shape']   Dim{F'conformer-enc-default-out-dim'(512)} not in other
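
Each line of the diff above is a key path into the nested net dict, followed by the mismatch found at that position. The following is a minimal sketch of how such a path-prefixed diff can be produced for plain dicts and lists. It is not RETURNN's actual implementation; the name nested_diff and the toy dicts are hypothetical and only mimic the shape of the log.

```python
from typing import Any, List


def nested_diff(a: Any, b: Any, path: str = "") -> List[str]:
    """Return readable differences between two nested dict/list structures."""
    prefix = path + " " if path else ""
    if isinstance(a, dict) and isinstance(b, dict):
        lines: List[str] = []
        # Walk the union of keys so that missing keys are reported as well.
        for key in sorted(set(a) | set(b), key=str):
            sub_path = f"{prefix}[{key!r}]"
            if key not in a:
                lines.append(f"{sub_path} not in self")
            elif key not in b:
                lines.append(f"{sub_path} not in other")
            else:
                lines.extend(nested_diff(a[key], b[key], sub_path))
        return lines
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        if len(a) != len(b):
            return [f"{prefix}list diff len: len self: {len(a)}, len other: {len(b)}"]
        lines = []
        for i, (x, y) in enumerate(zip(a, b)):
            lines.extend(nested_diff(x, y, f"{prefix}[{i}]"))
        return lines
    if a != b:
        return [f"{prefix}self: {a!r} != other: {b!r}"]
    return []


# Toy net dicts (hypothetical), shaped like two entries of the log above.
first = {"output": {"class": "copy", "from": "sin"}, "reshape": {"extra_deps": ["a"]}}
second = {"output": {"class": "copy", "from": "concat"}, "reshape": {"extra_deps": ["a", "b"]}}

for line in nested_diff(first, second):
    print(line)
```

Reading the real log the same way: the 'from' entries show that a layer is wired to a different input in the second construction, and the 'extra_deps' entries show that a layer has gained one additional dependency.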

The network code can be found under:
https://github.com/rwth-i6/i6_experiments/blob/main/users/rossenbach/experiments/librispeech/librispeech_100_attention/rc_conformer_2023/rc_networks/conformer_aed_trial.py

The network is constructed via:

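def get_network(epoch, **kwargs):
    # Reset the root name context before every construction.
    nn.reset_default_root_name_ctx()
    # network_kwargs is not a parameter of this function; it is taken from the enclosing scope.
    net = construct_network(epoch=epoch, **network_kwargs)
    return nn.get_returnn_config().get_net_dict_raw_dict(net)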

However, epoch is not used inside construct_network:

def construct_network(
        epoch: int,
        audio_features: nn.Data,
        bpe_labels: nn.Data,
        **kwargs
):
    net = ConformerAEDModel(
        bpe_size=bpe_labels.sparse_dim,
        audio_feature_dim=audio_features.dim_tags[audio_features.feature_dim_axis],
        **kwargs
    )

    out = net(
        audio_features=nn.get_extern_data(audio_features),
        audio_time=audio_features.dim_tags[audio_features.time_dim_axis],
        bpe_labels=nn.get_extern_data(bpe_labels),
        bpe_time=bpe_labels.dim_tags[bpe_labels.time_dim_axis]
    )
    out.mark_as_default_output()

    return net
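
Since epoch is unused, two consecutive calls should return equal net dicts. A minimal sketch of checking that property in isolation follows. The builders leaky_build and clean_build are hypothetical toy stand-ins, not the Conformer model, and the leaking counter is only one conceivable way for consecutive calls to differ; it is not meant as a diagnosis of this issue.

```python
import copy
import itertools
from typing import Any, Callable, Dict

NetDict = Dict[str, Any]


def build_twice_and_compare(build: Callable[[int], NetDict]) -> bool:
    """Build the net dict for two consecutive epochs and report whether both are equal."""
    first = copy.deepcopy(build(1))
    second = copy.deepcopy(build(2))
    return first == second


# Module-level counter that is never reset between calls (toy example).
_global_counter = itertools.count()


def leaky_build(epoch: int) -> NetDict:
    # The layer name depends on state that survives across calls.
    name = f"param_{next(_global_counter)}"
    return {name: {"class": "variable"}, "output": {"class": "copy", "from": name}}


def clean_build(epoch: int) -> NetDict:
    # The counter is created fresh on every call, so names are reproducible.
    counter = itertools.count()
    name = f"param_{next(counter)}"
    return {name: {"class": "variable"}, "output": {"class": "copy", "from": name}}


print("leaky builder reproducible:", build_twice_and_compare(leaky_build))
print("clean builder reproducible:", build_twice_and_compare(clean_build))
```

In the real setup the same comparison could be run on get_network(epoch=1) and get_network(epoch=2) outside of training, which would separate the construction code from anything the training loop does in between.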

The full log can be found under:
https://gist.github.com/JackTemaki/bc24ac9d5ced81c823a0b94fa0871720


albertz commented Mar 14, 2023

Can you link the full log in a Gist? Also, can you at least post the complete diff log here? There was a bit more output for this.

JackTemaki commented

This is the complete diff log!


albertz commented Mar 14, 2023

In the log you sent me earlier, you had:

reinit because network description differs. Diff:
dict diff:
['encoder'] dict diff:
['encoder'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['output'] dict diff:
['encoder'] ['subnetwork'] ['conformer_encoder_layer'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['output'] ['from'] self: 'Parameter.initial_11' != other: 'Parameter.initial_0'
['encoder'] ['subnetwork'] ['layers'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['output'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['output'] ['out_shape'] set diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['output'] ['out_shape']   Dim{F'conformer-enc-default-out-dim'(512)} not in other
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['output'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['output'] ['from'] self: 'slice_nd' != other: 'pad'
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape'] ['extra_deps'] list diff len: len self: 1, len other: 2
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape_0'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['reshape_0'] ['extra_deps'] list diff len: len self: 1, len other: 2
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['slice_nd'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['slice_nd'] ['out_shape'] set diff value: self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['slice_nd'] ['out_spatial_dim'] self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['RelPosSelfAttention._rel_shift'] ['subnetwork'] ['slice_nd'] ['size'] self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['add_1'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['add_1'] ['from'] list diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['add_1'] ['from'] [1] self: 'RelPosSelfAttention._rel_shift' != other: 'RelPosSelfAttention._rel_shift/slice_nd'
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['add_1'] ['out_shape'] set diff value: self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['att'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['att'] ['reduce'] self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['att_weights'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['att_weights'] ['axis'] self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['att_weights'] ['out_shape'] set diff value: self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['dot'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['dot'] ['out_shape'] set diff value: self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['dropout'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['dropout'] ['dropout_axis'] self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['dropout'] ['out_shape'] set diff value: self: Dim{'1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))+-1+(ceildiv_right(ceildiv_right(ceildiv_right(-199+audio_features_time+-200, 160), 3), 2))'[B]} != other: Dim{'conformer_aed_model/encoder/input_layer/time2_conv:out-spatial-dim0:kv'[B]}
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['dot'] dict diff:
['encoder'] ['subnetwork'] ['layers'] ['subnetwork'] ['0'] ['subnetwork'] ['self_att'] ['subnetwork'] ['linear_pos'] ['subnetwork'] ['dot'] ['from'] list diff:
... (189 diffs not shown)

JackTemaki commented

I reduced num_layers from 12 to 1 so that the log is more readable!
