Move core nn functions to RETURNN #252

Open · albertz opened this issue Feb 13, 2023 · 8 comments

albertz commented Feb 13, 2023

As a result of rwth-i6/returnn#1120 and rwth-i6/returnn#1264, we want to move core functions over to RETURNN, as this should become the common compatibility layer for different backends such as TensorFlow and PyTorch.

These are mostly the functions in nn/_generated_layers.py, but this in turn also means moving over some of the other code logic, such as the naming logic, etc.


albertz commented Feb 25, 2023

On the RETURNN side, the new frontend API is supposed to cover exactly those core nn functions. Its development has already started, although it is quite minimal so far and mostly targets low-level TF and low-level PyTorch.

Moving the nn core functions to RETURNN means using the high-level layer-based TF frontend for Tensor and implementing those frontend API functions in returnn/tf/frontend_layers.

The compare function is already there, with a small inconsistency; it does not break too much code though, so this should be fine.

This also means rewriting nn.Tensor to now use the RETURNN Tensor class.

The RETURNN Tensor class would also be extended with all the math ops like __add__, which just call the frontend API, as we currently have it in nn.Tensor.
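
For illustration, a minimal sketch (not the actual RETURNN code) of this dispatching pattern; the stubbed combine function and its kind argument are assumptions here:

    # Minimal sketch: math ops on a Tensor-like class just dispatch to frontend API
    # functions (stubbed out here), as nn.Tensor currently does. Names are illustrative.

    class _FrontendStub:
        @staticmethod
        def combine(a, b, kind: str):
            print(f"frontend.combine({a!r}, {b!r}, kind={kind!r})")
            return a  # placeholder result

    class TensorSketch:
        def __init__(self, name: str):
            self.name = name

        def __repr__(self):
            return f"TensorSketch({self.name!r})"

        def __add__(self, other):
            # delegate to the backend-agnostic frontend function
            return _FrontendStub.combine(self, other, kind="add")

        def __mul__(self, other):
            return _FrontendStub.combine(self, other, kind="mul")

    a, b = TensorSketch("a"), TensorSketch("b")
    c = a + b  # prints the dispatched frontend call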

albertz added commits that referenced this issue (Feb 26, 2023)

albertz commented Feb 26, 2023

About replacing nn.Tensor: We need to go through all the current attribs and methods, check their usages, and make sure the same thing works with the RETURNN Tensor. The most obvious question: what would be the raw tensor type? nn.NameCtx? It seems NameCtx lacks a few things which are currently in nn.Tensor, such as:

  • Does it handle the subclass PrevTensorRef correctly?

Can we move those things over to NameCtx? Let's just try to do that step by step and see.
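
To make the question concrete, a tiny hypothetical sketch (with stub classes only, no real RETURNN/RC imports) of the intended relationship, i.e. the generic Tensor wrapping an nn.NameCtx as its raw tensor:

    # Hypothetical sketch only: the generic Tensor wraps a backend-specific raw tensor,
    # which for the layers-based (net dict) backend would be an nn.NameCtx.

    class NameCtxStub:
        """stand-in for nn.NameCtx / the layer reference"""
        def __init__(self, layer_name: str):
            self.layer_name = layer_name

    class TensorStub:
        """stand-in for the RETURNN Tensor class"""
        def __init__(self, name: str, raw_tensor=None):
            self.name = name
            self.raw_tensor = raw_tensor  # would be a NameCtx in the net-dict backend

    x = TensorStub("x", raw_tensor=NameCtxStub("linear_out"))
    assert isinstance(x.raw_tensor, NameCtxStub)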

We need to be careful about cases where tensor.name_ctx is reassigned. This can happen via Tensor._replace_by and NameCtx.move_layer_ref_here.

The only usage of Tensor._replace_by is in prepare_for_config_serialization, in this code:

                if root_mod_call.layer_ref is not None:
                    assert not self.layer_ref  # not sure. maybe just reset?
                    assert root_mod_call.layer.layer_dict["class"] == "subnetwork"
                    sub_out = root_mod_call.children.pop("output")
                    assert sub_out.layer.layer_dict["class"] == "copy"
                    sub_real_out = sub_out.layer.layer_dict["from"]
                    assert isinstance(sub_real_out, nn.Tensor)
                    # noinspection PyProtectedMember
                    sub_out.layer._replace_by(sub_real_out)
                    # noinspection PyProtectedMember
                    root_mod_call.layer._replace_by(sub_real_out)

Tensor._assign_parent_name_ctx is called in NameCtx._remove_unused_and_assign_parents_and_handle_subnets. _assign_parent_name_ctx can probably be moved to NameCtx.

NameCtx.move_layer_ref_here is only in PrevTensorRef.assign_new_cur_layer_name_ctx:

    def assign_new_cur_layer_name_ctx(self, cur_layer_name_ctx: nn.NameCtx):
        """
        Changes self.name_ctx to new name_ctx.
        """
        prev_layer_name = f"prev:{cur_layer_name_ctx.name}"
        assert prev_layer_name not in cur_layer_name_ctx.parent.children
        prev_layer_name_ctx = cur_layer_name_ctx.parent.get_child(prev_layer_name)
        prev_layer_name_ctx.move_layer_ref_here(self)
        assert self.name_ctx is prev_layer_name_ctx
        self.cur_layer_name_ctx = cur_layer_name_ctx

And this is only called via Loop.assign()._map_ref_to_name_ctx():

                    prev_ref.assign_new_cur_layer_name_ctx(layer_ref.name_ctx)

albertz added several commits that referenced this issue (Feb 27-28, 2023): "This is for preparation of RETURNN Tensor usage. #252"

albertz commented Mar 8, 2023

feature_dim / batch_dim: remove them if possible? Why do we need them?

feature_dim is used quite often. The code might become too verbose if we don't have it? There are also not really any ambiguous cases about this, I think, but I'm unsure. Edit: For now, we might just introduce feature_dim as an alias and delay this; however, this also means we need feature_dim_axis (a rough sketch of the alias idea follows below). Edit: See rwth-i6/returnn#1273 for a more in-depth discussion on feature_dim.
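
A possible sketch of that alias, purely illustrative and not the actual RETURNN implementation (the attribute names dims and feature_dim_axis are assumptions here):

    # Sketch: feature_dim as a read-only alias that resolves feature_dim_axis
    # to the corresponding dim tag.

    class TensorSketch:
        def __init__(self, dims, feature_dim_axis=None):
            self.dims = tuple(dims)
            self.feature_dim_axis = feature_dim_axis

        @property
        def feature_dim(self):
            """the dim tag at feature_dim_axis, or None if unset"""
            if self.feature_dim_axis is None:
                return None
            return self.dims[self.feature_dim_axis]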

batch_dim only has a single use in stochastic_depth and this probably should be refactored. So we can remove batch_dim. (Done)

There is also nn.batch_dim, which might be removed as well. Currently there is some usage, but not too much; mostly it is for defining extern data, e.g. via nn.get_extern_data (see the sketch below).
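
For reference, a typical place where nn.batch_dim shows up, sketched from the current RC extern-data usage (exact signatures may differ):

    from returnn_common import nn

    time_dim = nn.SpatialDim("time")
    in_dim = nn.FeatureDim("input", 40)

    # nn.batch_dim is mostly needed for declaring the extern data template
    data = nn.get_extern_data(nn.Data("data", dim_tags=[nn.batch_dim, time_dim, in_dim]))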

Then there is nn.single_step_dim. It might have some valid use cases.

albertz added several commits (including to rwth-i6/returnn) that referenced this issue (Mar 10, 2023): "We want to remove Tensor.batch_dim. #252"

albertz commented Mar 10, 2023

__copy__ / __deepcopy__, no-op for Tensor

This is somewhat unclear. There is a comment pointing to: #215 (comment)

In there, the argument was that Tensor (for returnn-common) is immutable, thus __copy__/__deepcopy__ can be a no-op. However, I'm not sure whether this is a strict requirement.

For graph-based frameworks (RETURNN, TensorFlow graph mode), it's true that a tensor is immutable; copy(x) on a TF tensor will also be a no-op and return the same object. But this is not necessarily true for eager-mode frameworks (PyTorch, NumPy): there, copy(x) will create a new instance with really separate memory allocation (at least as presented to the user; maybe it uses copy-on-write internally). TF eager mode is an exception: tensors are also immutable there, and copy(x) remains a no-op.

Wrt. our frontend API, tensors are also immutable: there is no function that would operate in place on them. Parameter will be an exception, though. And immutable is meant only conceptually, referring to raw_tensor. In practice, the Tensor object is mutable and allows reassigning attribs, etc. As it is used widely, this probably cannot be changed anymore.

Also, for any developer writing code with Tensor and the frontend API, it might be unexpected if copy(x) does not return a copy. If it were really truly immutable, this would not matter at all. But as this is not the case, it might be dangerous.

Maybe a Tensor could also have a flag marking it immutable, and if this flag is enabled, it would really disallow any modifications? (Btw, I wonder a bit how TF does this internally in tf.Tensor. Usually I would expect that you can still get access to internal objects and modify them. However, I tried a bit, and I'm really not able to modify any internal data of a tf.Tensor, e.g. x._id.)
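
A small sketch of how such an immutability flag could be enforced, purely illustrative and not a concrete proposal:

    # Sketch: an explicit immutability flag enforced in __setattr__.

    class TensorSketch:
        def __init__(self, name: str):
            self.name = name
            self._frozen = False

        def freeze(self):
            """mark as immutable; any further attribute assignment raises"""
            self._frozen = True

        def __setattr__(self, key, value):
            if getattr(self, "_frozen", False):
                raise AttributeError(f"Tensor is immutable, cannot set {key!r}")
            super().__setattr__(key, value)

    t = TensorSketch("x")
    t.freeze()
    # t.name = "y"  # would raise AttributeError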

Or should copy() return a copy of our Tensor object? But what about the raw_tensor then? Also copy that (just using copy(raw_tensor), whatever that returns)? Again, the copy logic of Parameter can be different.
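
Both options sketched side by side, with stub classes; neither is claimed to be the chosen design:

    import copy

    class NoOpCopyTensor:
        """graph-style view: conceptually immutable, so copy returns the same object"""
        def __copy__(self):
            return self

        def __deepcopy__(self, memo):
            return self

    class ShallowCopyTensor:
        """alternative: return a new Tensor object; what to do with raw_tensor stays open"""
        def __init__(self, raw_tensor=None):
            self.raw_tensor = raw_tensor

        def __copy__(self):
            # or also copy(self.raw_tensor)? Parameter would need different logic.
            return type(self)(raw_tensor=self.raw_tensor)

    a = NoOpCopyTensor()
    assert copy.copy(a) is a  # no-op, same object
    b = ShallowCopyTensor(raw_tensor=object())
    assert copy.copy(b) is not b and copy.copy(b).raw_tensor is b.raw_tensor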


albertz commented Mar 16, 2023

Now rwth-i6/returnn#1277 was merged, which simplifies the code move.


albertz commented Mar 23, 2023

A bit on the planned process here: We would copy over bit by bit to RETURNN and integrate it into the RETURNN net dict backend of the RETURNN frontend (rwth-i6/returnn#1120). We would not copy all at once. We also would not touch existing code here in RC for now.

I'm not sure if there is a good way here for RC to gradually use the moved code.

There are also some changes:

  • Module does not have parents and _calls anymore (Frontend API and PyTorch backend returnn#1120 (comment)). It should be possible to infer layer names without that as well, but this needs some new code. This might result in different net dicts, but that should not matter anyway; there should be no visible differences for the user.

In general, as we have not finalized the RC nn API yet, changes are OK, even to the public API.


albertz commented Mar 27, 2023

I will probably rename NameCtx to Layer, at least for the code on RETURNN side.

albertz added a commit to rwth-i6/returnn that referenced this issue Apr 18, 2023
#1120

Copied from RETURNN-common.
rwth-i6/returnn_common#252

On namespace (rf.encoder.conformer.Conformer), see:
#1120 (comment)

This is currently untested.

albertz commented Apr 19, 2023

A bit on the process here: this is work in progress now. We have already moved a lot over, step by step, function by function, module by module. Even ConformerEncoder has been moved over now (rwth-i6/returnn#1120 (comment)). Next things are the Transformer, beam search (which needs a new API), AED/Transducer models, and SpecAugment. I'm not sure what is really missing after that; maybe then this can be considered finished.
