Casting Semantics #412

narendasan · 2021-03-29T19:46:04Z

narendasan
Mar 29, 2021
Collaborator

Casting

There are some cases where users may want the control the specific precision layers and tensors exist at. PyTorch has APIs to do this by casting tensors and the weights of modules. We need to determine a way to map PyTorch casting semantics to TensorRT in a way that is understandable.

Tensor Casting

Tensor casting means casting a tensor that is a input or product of an operation.

PyTorch API

import torch

x = torch.randn(100)
y = x.half #x.to(torch.half)

TorchScript Representation

Casting operations like this in the course of a TorchScript graph are represented by aten::to

Handling Casting in TRTorch

For tensor casting TensorRT provides the IdentityLayer. Presumably all we need to do is have a converter that takes aten::to and maps it to this layer, then applies the right setLayerPrecision on this layer.

Layer Casting

Layer casting means casting the weight tensors of a module

PyTorch API

import torch
from torch import nn
from nn import functional as F

class Model(nn.Module):
 	def __init__(self):
    	super(nn.Module)

      
      self.conv1 = nn.Conv2d(3, 3, 3)
      self.pooling1 = nn.max_pool2d(2, 2)
      self.conv2 = nn.Conv2D(5, 5, 5)
      self.pooling2 = nn.max_pool2d(2, 2)
    
    	# conv2 in fp16
      self.conv2 = self.conv2.half() # Keep this in FP16 in TRTorch, should trigger layer precision
      
  def forward(x: Tensor):
  		z = self.pooling1(F.relu(self.conv1(x)))
      z = z.half() # z.to(torch.half) # explicit cast, handled by to operator and identity layer 
      return self.pooling2(F.relu(self.conv2(z)))

TorchScript Representation

Casting operations on modules directly effect the owned tensors of the module and nothing else. When you run model.half() you are really just casting the weight tensors to FP16 in the same way as you do above manually. Therefore there is not any TorchScript representation other than the fact that the weights are in FP16. (Need to fully verify this)

Handling Casting in TRTorch

Since the only signal we have about explicit layer precisions in TRTorch is the type of the weights, we therefore need converters to inspect and respect the layer precision of the weights that they convert. This would augment the converter contract to add this responsibility. Ideally we could find some way to automate it or make it a one liner. I think we have Torch to TensorRT datatype casting supported already so something like

layer->setLayerPrecision(weight_tensor.dtype);

Might be sufficient.

Necessary Precision API changes

I think the current op precision API is misleading and confusing. It makes people believe that networks will run in one type instead of what it is really doing which is enabling new precisions to be selected. We should change the API to be something closer to a set of additional enabled data types, while still clearly conveying to people that FP32 will always be an option unless they specify strict_types

~~op_precision~~ enabled_precisions -> set(torch.half, torch.int8) #Under assumption that fp32 is enabled by default

With this new API, we would need to add the following check in the ConversionCtx, we would also need to change the behavior that adds FP16 whenever people enable INT8

if strict_types and size of set != 1 then throw error

We should also take this chance to add fuller coverage for TensorRT APIs including adding Tensor Layer options and allowing users to specify input data type

enabled_precisions: (),
strict_types: False,
disable_tf32: False,
input_data_type: torch.Float,
tensor_layout:

narendasan · 2021-03-29T19:46:20Z

narendasan
Mar 29, 2021
Collaborator Author

cc: @peri044

0 replies

peri044 · 2021-04-26T17:13:55Z

peri044
Apr 26, 2021
Collaborator

Some additional detail:

Explicit Precision Control

Case 1 : Layer Casting

Pytorch graph:

class Cast(torch.nn.Module):
    def __init__(self):
        super(Cast, self).__init__()
        self.conv1 = nn.Conv2d(3, 3, 3)
        self.pool = nn.MaxPool2d(3, 1, 1)
        self.conv1 = self.conv1.half()

    def forward(self, x):
        out = self.conv1(x)
        out = self.pool(out)
        return out

Torchscript graph:

graph(%x.1 : Tensor):
  %2 : int[] = prim::Constant[value=[3, 3]]()
  %3 : bool = prim::Constant[value=0]() # /home/dperi/Downloads/py3/lib/python3.6/site-packages/torch/nn/modules/pooling.py:163:57
  %4 : int[] = prim::Constant[value=[0, 0]]()
  %5 : int[] = prim::Constant[value=[1, 1]]()
  %6 : int = prim::Constant[value=1]() # /home/dperi/Downloads/py3/lib/python3.6/site-packages/torch/nn/modules/conv.py:395:45
  %self.conv1.bias : Half(3, strides=[1], requires_grad=0, device=cuda:0) = prim::Constant[value= 0.0480  0.1194 -0.0651 [ CUDAHalfType{3} ]]()
  %self.conv1.weight : Half(3, 3, 3, 3, strides=[27, 9, 3, 1], requires_grad=0, device=cuda:0) = prim::Constant[value=<Tensor>]()
  %9 : Tensor = aten::_convolution(%x.1, %self.conv1.weight, %self.conv1.bias, %5, %4, %5, %3, %4, %6, %3, %3, %3, %3)
  %10 : Tensor = aten::max_pool2d(%9, %2, %5, %5, %5, %3) # /home/dperi/Downloads/py3/lib/python3.6/site-packages/torch/nn/functional.py:659:11
  return (%10)

.half() on a module converts its weights into FP16.

Current limitations:

a) In the above graph, the conv1 is set to FP16. This means, the weights and bias of conv1 are already converted to FP16 in the Torchscript graph.

x (fp16) -> (Conv1.half()) -> output

In TRT, if we set the builder precision to FP32, but provide FP16 input to x, then the conv1 still runs in FP32. The TRT graph looks like this

x(fp16) -> Reformat (fp32) -> (Conv1) -> Reformat(fp16) ->output (fp16)

This doesn't satisfy our requirement that conv1 should be running in FP16.

b) If we set the layer precision to FP16 (conv1->setPrecision(fp16)), but the builder precision to FP32, the compilation fails with this error

fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

c) If we set the layer precision to FP16 and builder precision to FP16, then this makes all the other layers in network to run in FP16 as well if possible. This is an overkill.

For ex: The input graph is

x (fp16) -> Conv1 -> max_pool -> output

Here both Conv1 and max_pool runs in FP16. But we might want max_pool to run in FP32.

One solution is to set builder precision to FP16, conv1 precision to FP16 and max_pool precision to FP32. We also need need strict_types to be true here.

TRTorch handling:

enabled_precision flag

This can be a set of precisions that are enabled during inference.

If strict_types is True, enabled_precision can only have 1 value in the set, otherwise throw error.

If strict_type is False, enabled_precision can take on multiple values. For eg: {fp16, fp32}.

The builder precision should exactly take on values from this enabled_precision set.

For example, if enabled_precision = {fp32, fp16}, the TRT network builder would have

cfg->setFlag(nvinfer1::BuilderFlag::kFP32);
cfg->setFlag(nvinfer1::BuilderFlag::kFP16);

Here are possible cases and our solutions :
If the graph is input -> Conv1.half() -> max_pool -> output

a) enabled_precision={fp32} and strict_types=True, then the user provided .half() would be ignored. conv1 and max_pool would run in FP32, cast the conv weights to FP32 and throw a warning.
b) enabled_precision={fp32, fp16}. (implied strict_type=False ), then the user provided .half() would be considered. So conv1 and max_pool would run in FP16 (if they are fastest) else they would run in FP32.
conv1-> setPrecision(fp16) # This is already done by enabled_precision but still be explicit.
Once we detect half weights, we explicitly setOutputType -> half explicitly. Check if enabled precision also has fp16 here. If not, this is a) and .half() would be ignored, cast it to FP32 and throw a warning

c) enabled_precision={fp16} and strict_types=False, then the user_provided .half() would be considered. conv1 and max_pool would run in FP16 (if they are fastest) else they would run in FP32. setOutputType -> half explicitly instead and check enabled_precision.

d) enabled_precision={fp16} and strict_types=True, then the user provided .half() is enforced. conv1 and maxpool will run in FP16.

e) enabled_precision={fp32} and strict_types=False, then this case is a) the user provided .half() would be ignored. conv1 and max_pool would run in FP32 and throw a warning.

When should we enable setPrecision for any layer ?

When a user has .half() , Check the datatypes for the inputs consumed by the layer. Also check enabled _precision and then only add layer->setPrecision().

So in all converters, macro which does the above functionality.

Correcting the cast is one macro (for weights and bias).
Another macro which actually does the cast based on input datatypes.

pytorch : x (fp16) -> max_pool - runs in fp16

tensorrt : no explicit signal.

a) ignore_pytorch_casting=True -> This would mean TRT would run the max_pool in FP32. So don't follow pytorch and rely on TRT to set the precision/datatypes for layers.

b) ignore_pytorch_casting=False -> This would mean based on pytorch datatypes for intermediate tensors using ITensor()->getType() set the following layers to specific datatypes.

c) Tell users to use .float() or .half() on their output tensors. If a layer should be in FP32/FP16, the input should be in FP32/FP16.

Default behavior :

ignore_pytorch_casting=True. Doesn't matter if they have .float() or to(half) in tensors

If ignore_pytorch_casting=False, follow tensor casting. conv.half() ->output.float() -> max_pool (FP32)
all layer precisions will be determined by inputs of tensors coming into the layer.

Case 2: Tensor Casting

Tensor casting is expressed in Torchscript representation using aten::to operators.

The primary use case for supporting aten::to operators in TRTorch is to handle boolean masks (which get converted to ones and zeros int or float values).

Pytorch graph :

class Cast(torch.nn.Module):
    def __init__(self):
        super(Cast, self).__init__()

    def forward(self, x):
        y = torch.tensor(3, dtype=torch.float16)
        out = x.to(torch.float16) + y
        return out

Torchscript representation:

graph(%x.1 : Tensor):
  %2 : Half(requires_grad=0, device=cpu) = prim::Constant[value={3}]()
  %3 : int = prim::Constant[value=1]() # :0:0
  %4 : int = prim::Constant[value=5]() # cast_tensor.py:21:34
  %5 : None = prim::Constant() # :0:0
  %6 : bool = prim::Constant[value=0]() # :0:0
  %7 : Tensor = aten::to(%x.1, %4, %6, %6, %5) # cast_tensor.py:22:14
  %out.1 : Tensor = aten::add(%7, %2, %3) # cast_tensor.py:22:14
  return (%out.1)

Since the output type of aten::to layer is FP16 here, the builder precision has to be set to FP16, otherwise the error would be

fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

In TRT source code, it checks the following

If the layer output type is set to either FP16 or INT8, the builder precision flag also need to be set correspondingly, otherwise it results in the above error.
If the layer output type is set to for eg: INT32, the network can be constructed using FP32 builder precision flag.

Solution:

So whenever we find any layer output type to be FP16/INT8 using aten::to explicitly, we override the builder precision flag to use corresponding precision if strict types is not True. This way, the precision of all other layers might also get affected.

Possible cases for the above network:

a) if enabled_precisions={fp32} and strict_types=False, we override the builder precision to fp16.

b) if enabled_precision={fp32} and strict_types=True, default TRT error is thrown

fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder

c) if enabled_precision={fp32, fp16} (implied strict_types=False), the network can use FP16 wherever required or fallback to FP32 for all layers.

d) if enabled_precision={fp16}, strict_types=False, the network can use FP16 or FP32 for all layers.

e) if enabled_precision={fp16}, strict_types=True, the network is using only FP16 for all layers.

Case 3: Setting input datatypes in TRTorch

Currently in TRTorch, the input datatype is set based on builder precision. This is not sufficient when the network has multiple inputs and one of the datatypes is either INT (maybe a shape input) or FP16.

ONNX handling :

In ONNX graphs, the input datatypes are inferred from the nodes in the graph.

The problem with Torchscript graphs is the inputs in the graph are torch::jit::Value. We cannot get datatype information from this data structure.
torch::jit::Value::kind() can give either FloatType or IntType but not exact precision (FP32/FP16/INT8/INT32).

Solution:

The solution would be like how TRT used to manually configure inputs in UFF (deprecated).

We would have a input_datatypes field in compile_spec, which would have the datatypes for each input.

LOG_INFO("Number of inputs in the graph: " << input_tensors.size());
TRTORCH_CHECK(input_data_types.size() == input_tensors.size());
for (size_t i = 0; i < input_tensors.size(); i++) {
	auto trt_in = ctx->net->addInput(name.c_str(), input_data_types[i], dims.input_shape);
}

0 replies

narendasan · 2021-06-28T21:21:24Z

narendasan
Jun 28, 2021
Collaborator Author

Brainstorming the new trtorch.Input class to handle format and dtype:

import torch
import trtorch


...

'''
trt_mod = trtorch.compile(ts_mod, {
    "input_shapes": [[1,2,2,2], {
        "min": (1,2,2,2),
        "opt": (3,2,2,2),
        "max": (121, 20, 20, 20)
    }]
})

trt_mod = trtorch.compile(ts_mod, {
    "input_shapes": [[1,2,2,2], {
        "min": (1,2,2,2),
        "opt": (3,2,2,2),
        "max": (121, 20, 20, 20)
    }]
    "input_dtypes": [torch.int32, torch.float32]
    "input_tensor_formats": [torch.contigous_format, torch.channel_last]
})





trt_mod = trtorch.compile(ts_mod, {
    "inputs" : [trtorch.Input((1,2,2,2,), dtype=torch.int32, format=torch.contiguous_format), 
                trtorch.Input({
                    "min": (1,2,2,2),
                    "opt": (3,2,2,2),
                    "max": (121, 20, 20, 20)
                })],
                
})'''

# We selected this option 
trt_mod = trtorch.compile(ts_mod, {
    "inputs" : [trtorch.Input(shape=(1,2,2,2,), dtype=torch.int32, format=torch.contiguous_format), 
                trtorch.Input(min_shape=(1,2,2,2), opt_shape=(3,2,2,2), max_shape=(121, 20, 20, 20), format=torch.channel_last)],
                
})

trt_mod = trtorch.compile(ts_mod, {"inputs": [trtorch.Input(shape=(1,2,2,2)), trtorch.Input(shape=(3,2,2,2))]})

class Input:
    InputRange(Tuple() or Dict())

    def __init__(self, shape, dtype=torch.float32, format=torch.contiguous_format):
    
     # We selected this option: 
    def __init__(self, shape=None, min_shape=None, opt_shape=None, max_shape=None, dtype=torch.float32, format=torch.contiguous_format):
        if not shape and (not min_shape and not opt_shape and not max_shape):
            throw Error()

# Example on how to use tensors as example input for shape, type and format inference 
trt_mod = trtorch.compile(ts_mod, {
    "inputs" : [torch.Tensor((1,2,2,2), dtype=torch.int32, memory_format=torch.contiguous_format)],
                
})

struct Input {
    Input(trtorch::InputRange shape);
    Input(std::vector<uint64_t> shape);
    Input(std::vector<uint64_t> min_shape, std::vector<uint64_t> opt_shape, std::vector<uint64_t> max_shape);
    Input(trtorch::InputRange shape, trtorch::DataType dtype=trtorch::DataType::kFloat32, trtorch::Format format=trtorch::Format::kNCHW);
    Input(std::vector<uint64_t> min_shape, std::vector<uint64_t> opt_shape, std::vector<uint64_t> max_shape dtype=trtorch::DataType::kFloat32, trtorch::Format format=trtorch::Format::kNCHW);
}

auto in1 = trtorch::Input((1,2,2,2))


auto spec = CompileSpec({Input})


trt_mod = trtorch::compile(ts_mod, )


auto in_shape1 = trtorch::Input(trtorch::InputRange({1,2,2,2},{1,2,2,2,},{.2.2.2.2.} ))


auto in_shape1 = trtorch::Input({1,2,2,2,});
auto in_shape1 = trtorch::Input(/*min_shape=*/{1,2,2,2},/*opt_shape=*/{1,2,2,2,},/*max_shape=*/{.2.2.2.2.});
auto in_shape1 = trtorch::Input({1,2,2,2,}, torch::kChar);
auto in_shape1 = trtorch::Input(/*min_shape=*/{1,2,2,2},/*opt_shape=*/{1,2,2,2,},/*max_shape=*/{.2.2.2.2.}, torch::kChar);

auto spec = trtorch::CompileSpec({in_shape1});
spec.enabled_types.push(trtorch::DataType::kFloat16);

auto spec = trtorch::compile(ts_mod, {{trtorch::Input()}});
auto spec = trtorch::compile(ts_mod, {{{1,1,1,1}}});

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Casting Semantics #412

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Casting Semantics #412

narendasan Mar 29, 2021 Collaborator

Casting

Tensor Casting

PyTorch API

TorchScript Representation

Handling Casting in TRTorch

Layer Casting

PyTorch API

TorchScript Representation

Handling Casting in TRTorch

Necessary Precision API changes

Replies: 3 comments

narendasan Mar 29, 2021 Collaborator Author

peri044 Apr 26, 2021 Collaborator

Explicit Precision Control

Case 1 : Layer Casting

Current limitations:

TRTorch handling:

enabled_precision flag

When should we enable setPrecision for any layer ?

Case 2: Tensor Casting

Solution:

Case 3: Setting input datatypes in TRTorch

ONNX handling :

Solution:

narendasan Jun 28, 2021 Collaborator Author

narendasan
Mar 29, 2021
Collaborator

narendasan
Mar 29, 2021
Collaborator Author

peri044
Apr 26, 2021
Collaborator

narendasan
Jun 28, 2021
Collaborator Author