Integration of FX and TS shared components #1372
-
Extension 1: Wrap the ATen converter library and let the FX frontend use it.
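As a rough illustration of the pattern, here is a minimal sketch of a shared, decorator-populated converter registry keyed on ATen ops that the FX frontend could dispatch into. The names `ATEN_CONVERTERS`, `aten_converter`, and `get_converter` are hypothetical, not existing torch_tensorrt APIs:

```python
# Hypothetical sketch of wrapping a shared ATen converter registry so the
# FX frontend can dispatch into it. ATEN_CONVERTERS and aten_converter are
# illustrative names, not existing torch_tensorrt APIs.
from typing import Any, Callable, Dict

import torch

ATEN_CONVERTERS: Dict[Any, Callable[..., Any]] = {}


def aten_converter(target: Any) -> Callable:
    """Register a converter for an ATen op so any frontend can look it up."""

    def register(fn: Callable) -> Callable:
        ATEN_CONVERTERS[target] = fn
        return fn

    return register


@aten_converter(torch.ops.aten.relu.default)
def convert_relu(network, target, args, kwargs, name):
    # A real converter would emit a TensorRT activation layer here.
    ...


def get_converter(target: Any):
    # The FX frontend would consult the shared ATen registry first and fall
    # back to its own acc_ops converters when no entry exists.
    return ATEN_CONVERTERS.get(target)
```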
-
If a TS-specific API is used, FX should throw a warning saying as much.
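For example, a minimal sketch of such a warning in the FX path; `TS_ONLY_ARGS` and `fx_compile` are hypothetical names used only for illustration:

```python
import warnings

# Hypothetical: settings the TS frontend honors but FX does not.
TS_ONLY_ARGS = {"require_full_compilation", "torch_executed_modules"}


def fx_compile(module, **kwargs):
    for key in TS_ONLY_ARGS & kwargs.keys():
        warnings.warn(
            f"'{key}' is a TorchScript-specific setting and will be "
            "ignored by the FX frontend."
        )
    ...  # proceed with FX lowering
```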
-
@narendasan what is remaining on this list?
-
Gaps (the end goal is that FX is a drop-in replacement for TorchScript):
Note: FX to TRT on Dynamo no longer works.
-
Unifying APIs cont.

This is the current settings struct for FX:

```python
import dataclasses as dc
from typing import List, Optional, Set, Type

from torch import nn
from torch.fx.passes.pass_manager import PassManager

from .input_tensor_spec import InputTensorSpec
from .passes.lower_basic_pass import fuse_permute_linear, fuse_permute_matmul
from .utils import LowerPrecision


@dc.dataclass
class LowerSettingBasic:
    """
    Basic class for lowering.
    max_batch_size: The maximum batch size for the lowering job.
        If run with TensorRT lowering, this is the maximum
        batch size which can be used at execution time,
        and also the batch size for which the ICudaEngine
        will be optimized.
        If run with AITemplate lowering, this is the max batch_size
        for the model.
    lower_precision: lower precision dtype during lowering.
    min_acc_module_size (int): minimal number of nodes for an accelerated submodule.
    ast_rewriter_allow_list (Optional[Set[nn.Module]]): Optional allow list of
        modules that need AST rewriting. This aims to eliminate input variables
        involved in exception-checking control flow.
    leaf_module_list (Optional[Set[nn.Module]]): Optional leaf module list;
        these modules will not be traced into.
    verbose_profile (bool): verbosity of profiler, defaults to False.
    """

    max_batch_size: int = 2048
    lower_precision: LowerPrecision = LowerPrecision.FP32
    min_acc_module_size: int = 10
    ast_rewriter_allow_list: Optional[Set[Type[nn.Module]]] = None
    leaf_module_list: Optional[Set[Type[nn.Module]]] = None
    verbose_profile: bool = False
    is_aten: bool = False


@dc.dataclass
class LowerSetting(LowerSettingBasic):
    """
    Basic configuration for the lowering stack.
    Args:
        input_specs: Specs for inputs to the engine; can either be a single size or a
            range defined by Min, Optimal, Max sizes.
        explicit_batch_dimension: Use explicit batch dimension during lowering.
        explicit_precision: Use explicit precision during lowering.
        max_workspace_size: The maximum workspace size. The maximum GPU temporary
            memory which the TensorRT engine can use at execution time.
        strict_type_constraints: Require the TensorRT engine to strictly follow data type
            settings at execution time.
        customized_fuse_pass: List of customized passes to apply during the lowering process.
        lower_basic_fuse_pass: Enable basic fuse passes during lowering, i.e. fuse multiple
            operations as (a->b->c->d)=>(e). Current basic fuse patterns are:
            permute->linear
            permute->matmul
        verbose_log: Enable TensorRT engine verbose log mode.
        algo_selector: Enable the TensorRT algorithm selector at execution time.
        timing_cache_prefix: TensorRT timing cache file path. The TensorRT engine will use the
            timing cache file at execution time if a valid timing cache file is provided.
        save_timing_cache: Save updated timing cache data into the timing cache file if the
            timing cache file is provided.
        cuda_graph_batch_size (int): CUDA graph batch size, defaults to -1.
        preset_lowerer (str): when specified, use a preset logic to build the
            instance of Lowerer.
        opt_profile_replica (int): the number of opt profiles set for the TensorRT engine;
            only used by explicit batch dim with dynamic shape mode. In general, we use a
            2-GPU setting with 2 streams on each. Set the total number to 8 as a safe
            default value.
        dynamic_batch: enable dynamic shape in TRT with dim=-1 for the 1st dimension.
        tactic_sources: tactic sources for TensorRT kernel selection. Defaults to None,
            meaning all possible tactic sources.
        correctness_atol: absolute tolerance for the correctness check.
        correctness_rtol: relative tolerance for the correctness check.
        use_experimental_rt: Uses the next generation TRTModule which supports both Python
            and TorchScript based execution (including in C++).
    """

    input_specs: List[InputTensorSpec] = dc.field(default_factory=list)
    explicit_batch_dimension: bool = True
    explicit_precision: bool = False
    max_workspace_size: int = 1 << 30
    strict_type_constraints: bool = False
    customized_fuse_pass: PassManager = dc.field(
        default_factory=lambda: PassManager.build_from_passlist([])
    )
    lower_basic_fuse_pass: PassManager = dc.field(
        default_factory=lambda: PassManager.build_from_passlist(
            [fuse_permute_matmul, fuse_permute_linear]
        )
    )
    verbose_log: bool = False
    algo_selector = None
    timing_cache_prefix: str = ""
    save_timing_cache: bool = False
    cuda_graph_batch_size: int = -1
    preset_lowerer: str = ""
    opt_profile_replica: int = 8
    dynamic_batch: bool = True
    tactic_sources: Optional[int] = None
    correctness_atol: float = 0.1
    correctness_rtol: float = 0.1
    use_experimental_rt: bool = False
```

And this is the current TorchScript compile spec struct:

```python
def compile(
    module: torch.jit.ScriptModule,
    inputs=[],
    input_signature=None,
    device=Device._current_device(),
    disable_tf32=False,
    sparse_weights=False,
    enabled_precisions=set(),
    refit=False,
    debug=False,
    capability=_enums.EngineCapability.default,
    num_avg_timing_iters=1,
    workspace_size=0,
    dla_sram_size=1048576,
    dla_local_dram_size=1073741824,
    dla_global_dram_size=536870912,
    calibrator=None,
    truncate_long_and_double=False,
    require_full_compilation=False,
    min_block_size=3,
    torch_executed_ops=[],
    torch_executed_modules=[],
) -> torch.jit.ScriptModule:
    ...  # body omitted
```

Some of these APIs simply need to be renamed in FX. Others, like Device and Input, need to be replaced. The underlying groundwork is already complete, but the code path in FX needs to be changed to accept these new classes.
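As a rough sketch of the renaming work, a hypothetical translation table pairing some TS compile() keywords with their current FX LowerSetting counterparts. The table and helper are illustrative only, and value conversions (e.g. the enabled_precisions set to the lower_precision enum) are elided:

```python
# Hypothetical mapping from TS compile() keywords to FX LowerSetting fields.
# The pairings follow the two structs quoted above; the dict and helper are
# illustrative, not an existing torch_tensorrt API.
TS_TO_FX_NAMES = {
    "workspace_size": "max_workspace_size",
    "min_block_size": "min_acc_module_size",
    "debug": "verbose_log",
    "enabled_precisions": "lower_precision",  # set -> enum conversion elided
}


def ts_kwargs_to_lower_setting(**kwargs) -> "LowerSetting":
    """Translate TS-style keyword arguments into LowerSetting field names."""
    translated = {TS_TO_FX_NAMES.get(k, k): v for k, v in kwargs.items()}
    return LowerSetting(**translated)
```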
MVP - M:
-
TL;DR
There are a lot of redundant components in both the FX and TS frontends. We should unify them.
Goal(s)
There is quite a bit of code duplication, as the TorchScript and FX frontends developed fairly independently. Now that we are a shared project, we should seek to unify generic components to maximize compatibility between the two frontends.
Usecases
Proposed APIs / UX
There should be very little change at the user level, except perhaps for the specific helper classes used for common tasks like describing input shapes.
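For instance, the TS frontend's existing Input class already describes static and dynamic shapes, so a unified user-facing helper could look like the following (constructing Input this way is current torch_tensorrt API; having the FX path accept it is the proposal, not current behavior):

```python
import torch
import torch_tensorrt

# torch_tensorrt.Input is the TS frontend's existing shape helper; under
# this proposal the same object would be accepted by the FX path as well.
inp = torch_tensorrt.Input(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
    dtype=torch.half,
)
```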
Internal Implementation
Design
At first glance, there are places where some APIs are superior to others. For instance, I believe that the TRTEngine provided by the TS runtime provides, with some improvement, distinct benefits over directly calling the TRT APIs in an nn.Module, including serializability, scriptability post-compilation, C++-only execution, and a lightweight runtime. InputTensorSpec and Input perform the same role and could be merged. CompileSpec and LowerSetting use different names for various features and should be standardized, though they remain frontend specific. That is, there should be a torch_tensorrt.fx.CompileSpec and a torch_tensorrt.ts.CompileSpec which are similar, using the same names for the same data, but which can have specializations specific to their respective IRs. The logger should be standardized between the two frontends as well.
Extensions Required to Core API implementations
Currently redundant components [FX/TS]:
Implementation Phases
Prototype
Demonstrate FX and TS using a unified set of internal APIs.
MVP
1.3.0
Merging all listed redundant components