-
❓ Question
Hi, I saw that one of the lowering passes TRTorch has lowers linear to mm + add. I'm wondering what the reason behind this is. Does TensorRT provide better performance with a matmul layer + elementwise sum layer than with a fully connected layer? Or does breaking it down help the fusion process in TensorRT?
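For context, the lowering in question relies on the identity linear(x) = x @ W.T + b. A minimal PyTorch sketch checking it (the shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 16)              # batch of 4, 16 input features
w = torch.randn(8, 16)              # nn.Linear stores weight as (out_features, in_features)
b = torch.randn(8)

y_linear = F.linear(x, w, b)        # what aten::linear computes
y_lowered = torch.mm(x, w.t()) + b  # the mm + add form the pass rewrites to

assert torch.allclose(y_linear, y_lowered, atol=1e-6)
```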
Replies: 2 comments
-
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#optimize-layer
It says: "New development is encouraged to use MatrixMultiply in preference to FullyConnected layers for consistency of interface. Matrix multiplication is generally significantly faster in FP16 Tensor Cores compared to FP32."
Fully connected layers are expressed as convolutions, so I'm not sure there would be any perf difference for the layer dimensions typically used in final classification layers.
One reason to do this in TRTorch is that we usually keep our converter library light. Any operation that can be expressed as a composition of existing converters is handled that way, unless we know for sure there would be a big performance penalty.
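To make the convolution point concrete, here is a minimal sketch (shapes illustrative, not TRTorch code) reproducing a linear layer as a 1x1 convolution:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 16)   # batch of 4, 16 input features
w = torch.randn(8, 16)   # (out_features, in_features)
b = torch.randn(8)

y_linear = F.linear(x, w, b)

# FullyConnected as a convolution: treat features as channels over a
# 1x1 spatial extent and use the linear weights as 1x1 kernels.
y_conv = F.conv2d(x.view(4, 16, 1, 1), w.view(8, 16, 1, 1), b).view(4, 8)

assert torch.allclose(y_linear, y_conv, atol=1e-5)
```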
-
Going to move this thread to discussions 😄