Proposed implementation phases:
- Phase 1: Use `convert_module_to_engine` to produce standalone TRT engines from PyTorch modules, then, outside the library, figure out how to populate refit settings programmatically and refit the engine.
- Phase 2 [MVP]: Support the same workflow inside the Torch-TRT context, and start to prototype the UX for end users (only needs to support fully supported models, but should work end to end).
- Phase 3: Support advanced features of Torch-TensorRT, e.g. dynamic shape and fallback to PyTorch, and support use with libraries like Hugging Face and `torch.compile`.
- Extensions: Demos of interesting workflows (LoRAs, offsite compilation, caching, etc.)
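The out-of-library refit flow in Phase 1 might look roughly like the sketch below. It assumes an engine serialized from `convert_module_to_engine` with the refit flag enabled, and that a mapping from state_dict keys to TRT weight names already exists; the TensorRT calls shown (`trt.Refitter`, `get_all_weights`, `set_named_weights`, `refit_cuda_engine`) are from the TRT Python API, while the function name and file handling are purely illustrative.

```python
import numpy as np

def refit_standalone_engine(engine_path, named_weights):
    """Deserialize a refittable TRT engine and swap in new weights.

    `named_weights` maps TRT weight names to numpy arrays; producing that
    mapping from a PyTorch state_dict is the part Phase 1 would figure out.
    """
    import tensorrt as trt  # deferred; requires a TensorRT install and a GPU

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

    refitter = trt.Refitter(engine, logger)
    # Only weights the engine actually exposes can be refit
    for name in refitter.get_all_weights():
        if name in named_weights:
            refitter.set_named_weights(name, np.ascontiguousarray(named_weights[name]))
    # Returns False if any required weights were left unset
    return refitter.refit_cuda_engine()
```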
Model Weight Refit
TL;DR
TensorRT supports updating engine weights after compilation via the `nvinfer1::IRefitter` class, available in both the C++ and Python APIs. This could be a beneficial feature to bring into Torch-TensorRT, specifically the FX path, since models which are pre-compiled and saved can easily be refit with new training weights, so long as the model architecture is unchanged.
Goals and Use Cases
Model weight refit will assist greatly in reducing time spent compiling models with Torch-TensorRT, since models would only need to be compiled once per architecture, and subsequent weight updates can be propagated into the compiled model post-compilation, without the overhead of recompiling. This also enables pre-compilation of a model architecture prior to training, which could allow for inference timing estimates ahead of sometimes-lengthy model training.
Proposed APIs / UX
Users of this feature (in FX) would compile their model via the FX API, for example with `torch_tensorrt.fx.compile(...)`, then save their model. Later, when loading the compiled model, if the weights have been updated from their original values, the user could call the Model Weight Refit function to refit the stored weights.
Example Workflow
See `TensorRT/examples/fx/fx2trt_example.py`, lines 23 to 146 at commit `deda87b`, for a sample workflow of compiling and saving a model via FX2TRT.
The additional step, as per the proposed API, would be to call a refit function on the compiled module with the new weights, e.g. `refit_weights(split_mod, weights)` (the name and signature are illustrative).
This function would parse the input weights dictionary, determine which of those to assign to which submodule from the splitter (assuming the model was not fully compiled in TRT), and assign the weights accordingly, using the TRT Python API for TRT-accelerated modules, and the Torch API for non-accelerated modules.
The `weights` argument could potentially be the output of `model.state_dict()` after training, or a different format.
Internal Implementation
Design
A function would be added to FX2TRT, for example, as:
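One possible sketch of such a function is below, assuming the FX splitter's `_run_on_acc_*` naming convention for TRT-accelerated submodules and hypothetical `engine`/`logger` attributes on the TRT submodule wrapper. The TensorRT calls used (`trt.Refitter`, `get_all_weights`, `set_named_weights`, `refit_cuda_engine`) exist in the TRT Python API; everything else is illustrative, not the final design.

```python
def refit_trt_engine(engine, new_weights, logger):
    """Refit a built TensorRT engine in place via the TRT Python API.

    `new_weights` maps TRT weight names to arrays. Assumes the engine was
    built with the REFIT flag; mapping state_dict keys to TRT weight names
    is glossed over here and would need a helper of its own.
    """
    import tensorrt as trt  # deferred so the dispatch logic below stays importable

    refitter = trt.Refitter(engine, logger)
    for name in refitter.get_all_weights():
        if name in new_weights:
            refitter.set_named_weights(name, new_weights[name])
    assert refitter.refit_cuda_engine(), "refit failed; were any weights left unset?"


def refit_weights(split_model, weights):
    """Route entries of a flat `weights` dict (e.g. a state_dict) to each
    submodule of a split model: TensorRT for accelerated submodules, the
    Torch API for the rest."""
    for name, submodule in split_model.named_children():
        # Keep only this submodule's entries, with the submodule prefix stripped
        sub_weights = {k[len(name) + 1:]: v for k, v in weights.items()
                       if k.startswith(name + ".")}
        if not sub_weights:
            continue
        if name.startswith("_run_on_acc"):
            # Hypothetical attributes on the TRT submodule wrapper
            refit_trt_engine(submodule.engine, sub_weights, submodule.logger)
        else:
            submodule.load_state_dict(sub_weights)
```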
As mentioned above, `refit_weights` would parse the input weights, determine which of those to assign to which submodule in the compiled model, and assign the weights accordingly, respecting the module boundaries of TRT and Torch. It would likely need a few helper functions.
Extensions Required to Core API implementations
The existing library should not require many changes, as this add-on would simply add functionality while preserving existing core APIs.
Details specific for TorchScript Support
TorchScript Python API support is more challenging in this particular case, since the weight Tensor objects would need to be transferred from Python to C++. A similar design would function well, for example `ts_model.refit_weights(weights)`; however, the FX path would make a better MVP, since the implementation could stay strictly in Python via the tensorrt Python API.
Details specific for FX support
See above
Implementation Phases
- Prototype - Small/Medium: Verify that `model.state_dict()` for a newly-trained model contains sufficient information to update the weights of an existing model
- MVP (1.4.0) - Medium: Implement the `refit_weights` function, including refitting weights with multiple TRT-accelerated submodules and multiple Torch/FX non-accelerated submodules
- Extension Phase 1 [Potential] - Medium: Implement `refit_weights(...)` for the TorchScript API, if transferring the `state_dict` from Python would be feasible
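For the prototype phase, the `state_dict` sufficiency check could start as a pure-Python comparison of parameter names and shapes between the originally compiled model and the newly trained one. This is a sketch under the assumption that matching names and shapes is a necessary precondition for refit; the function name is hypothetical.

```python
def check_refit_compatible(old_state_dict, new_state_dict):
    """Report mismatches that would prevent refitting an engine compiled
    from `old_state_dict` with the weights in `new_state_dict`.

    Returns an empty list when the two state_dicts have identical
    parameter names and shapes (i.e. the architecture is unchanged).
    """
    problems = []
    old_keys, new_keys = set(old_state_dict), set(new_state_dict)
    for key in sorted(old_keys - new_keys):
        problems.append(f"missing parameter: {key}")
    for key in sorted(new_keys - old_keys):
        problems.append(f"unexpected parameter: {key}")
    for key in sorted(old_keys & new_keys):
        old_shape = getattr(old_state_dict[key], "shape", None)
        new_shape = getattr(new_state_dict[key], "shape", None)
        if old_shape != new_shape:
            problems.append(f"shape mismatch for {key}: {old_shape} vs {new_shape}")
    return problems
```

A real implementation would operate on `torch.Tensor` values, but only the `.shape` attribute is consulted, so the check is cheap and framework-agnostic.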