Building Refittable Engines by Default #3204
-
Not sure if it matters but there is a sentence under |
-
Design
Let's define three types of engines:
According to the design of TRT 10, as shown in the diagram above, the transition between the three types of engines is one-way, i.e., A->B->C. We cannot make C go back to B, or B go back to A, other than by rebuilding the engine.
Limitation
For engine caching, we save and load B type engines, which means a cached engine can only be refitted once, because when we refit B, it transitions to C. For now, TRT doesn't support B->A.
API usage
We have three related args:
Since engine caching saves B type engines, we return different types of engines depending on whether a cached engine is used:
When not using a cached engine (building from scratch):
When using a cached engine:
Note that:
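To make the one-way transition from the Design paragraph concrete, here is a minimal sketch that models the engine types as opaque labels. The names `EngineType`, `can_transition`, and `refit_cached` are hypothetical and are not part of TensorRT or Torch-TensorRT. The sketch assumes that only the single forward steps A->B and B->C are possible, and that refitting is only modeled for a cached B engine, since that is the only case the comment spells out.

```python
from enum import Enum


class EngineType(Enum):
    """Opaque labels for the three engine types called A, B, and C above."""
    A = "A"
    B = "B"
    C = "C"


# Assumption: only the forward steps A->B and B->C exist.
_NEXT = {EngineType.A: EngineType.B, EngineType.B: EngineType.C}


def can_transition(src: EngineType, dst: EngineType) -> bool:
    """Return True only for a forward step; anything else needs a rebuild."""
    return _NEXT.get(src) is dst


def refit_cached(engine_type: EngineType) -> EngineType:
    """Model refitting an engine loaded from the cache (saved as type B)."""
    if engine_type is not EngineType.B:
        # A C engine has already been refitted once; other cases are not
        # described in the discussion, so this toy refuses them.
        raise ValueError(f"cannot refit a type {engine_type.value} engine here")
    return EngineType.C


after_refit = refit_cached(EngineType.B)
print(after_refit.value)                                  # C
print(can_transition(EngineType.C, EngineType.B))         # False
```

A second call such as `refit_cached(after_refit)` raises `ValueError`, which mirrors the "refitted only once" limitation for cached engines.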
-
New Plan is
-
Another caveat: TRT has 4 refit-related flags when building an engine:
Only |
-
Building Refittable Engines by Default
TL;DR
By default, build engines whose weights can be changed after build time, in contrast to the current behavior. This restructures the compilation pipeline so that instead of
lowering -> partitioning -> compilation
we do
lowering -> partitioning -> compilation -> refit
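As a rough illustration of the restructured pipeline, the toy below builds an "engine" without weights and only fills them in during a final refit stage. Everything here is a hypothetical stand-in written in plain Python (`lower`, `partition`, `compile_stripped`, `refit`, `run`, and `build` are not Torch-TensorRT functions), and each "layer" simply multiplies its input by one scalar weight.

```python
from typing import Dict, List

Weights = Dict[str, float]


def lower(graph: List[str]) -> List[str]:
    """Toy lowering: normalize layer names."""
    return [name.strip().lower() for name in graph]


def partition(graph: List[str]) -> List[List[str]]:
    """Toy partitioning: keep the whole graph in one segment."""
    return [graph]


def compile_stripped(segment: List[str]) -> dict:
    """Toy compilation: record layer order, leave every weight at zero."""
    return {"layers": list(segment), "weights": {name: 0.0 for name in segment}}


def refit(engine: dict, weights: Weights) -> dict:
    """Toy refit: return a copy of the engine with real weights filled in."""
    filled = {name: weights[name] for name in engine["layers"]}
    return {"layers": list(engine["layers"]), "weights": filled}


def run(engine: dict, x: float) -> float:
    """Each toy layer multiplies its input by that layer's weight."""
    for name in engine["layers"]:
        x = x * engine["weights"][name]
    return x


def build(graph: List[str], weights: Weights) -> List[dict]:
    """lowering -> partitioning -> compilation -> refit."""
    segments = partition(lower(graph))
    return [refit(compile_stripped(seg), weights) for seg in segments]


weights = {"scale_a": 2.0, "scale_b": 3.0}
stripped = compile_stripped(lower(["Scale_A", "Scale_B"]))
engines = build(["Scale_A", "Scale_B"], weights)
print(run(stripped, 1.5))    # 0.0, weights have not been refitted yet
print(run(engines[0], 1.5))  # 9.0
```

The point of the sketch is only the ordering: compilation never sees the weights, and the refit stage is what makes the engine produce meaningful outputs.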
Goal(s)
Refittable engines are more flexible
kREFIT_IDENTICAL engines are as optimized as standard engines
We can build engines without weights in them
Usecases
Proposed APIs / UX
1. Users just want a weight-stripped engine. They can call convert_exported_program_to_trt_engine, specifying strip_engine_weights=True, to get a weight-stripped engine. This is also supported if the engine is loaded from the engine cache. [After this they are TRT users, so we don't care as much about post-compilation workflows]
2. [Most interesting] We want to use weight stripping to keep the engine cache lighter. The implementation of weight-stripped engines is opaque to users. However, if users specify kREFIT or kREFIT_IDENTICAL, the resulting engines would be considered different and cached twice.
3. Users want a weight-stripped compiled program. They just need to call torch.compile() or torch_trt.dynamo.compile() with strip_engine_weights=True. If the compiled program is run on inputs immediately, all the results will be zeros. Calling refit_module_weights() restores the weights. [After this they are still Torch-TRT users, so we care about post-compilation workflows]
Example Workflow
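As a small illustration of the caching caveat in use case 2, the sketch below shows why the same graph built with two different refit flags ends up as two cache entries. The discussion does not say how cache keys are computed, so this assumes the key is a hash over the graph plus the build settings. `cache_key`, the `refit_flag` setting name, and the placeholder engine bytes are all hypothetical.

```python
import hashlib
import json
from typing import Dict


def cache_key(graph_repr: str, settings: Dict[str, str]) -> str:
    """Hash the graph together with its build settings (assumed key scheme)."""
    payload = json.dumps({"graph": graph_repr, "settings": settings}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache: Dict[str, bytes] = {}
graph = "conv -> relu -> linear"

for flag in ("kREFIT", "kREFIT_IDENTICAL"):
    key = cache_key(graph, {"refit_flag": flag})
    # Placeholder payload standing in for a serialized weight-stripped engine.
    cache.setdefault(key, b"<stripped engine bytes>")

print(len(cache))  # 2
```

Because the flag is part of the hashed settings in this sketch, the two builds never share an entry, even though the graph is identical.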
Compilation Options for dynamo.compile:
Today:
Limitations
There are some ops which are not refittable: cumsum and embedding_bag are two examples.
There are 2 options:
cumsum is not refittable) - Might require updating validators so they can return a reason why an op is not supported.
Internal Implementation
Design
There will now be 3 classes of engines:
graph to net def to building weight stripped engine to refit would be as fast as current graph to net def to building engine with weights
weight_name_cache or rebuild the net def, either should be faster than building from scratch
Extensions Required to Core API implementations
Use-case 3 open questions:
TorchTensorRTModule
kREFIT_IDENTICAL?
Data Structures
Implementation Phases
Prototype -
MVP (<TARGET RELEASE VERSION>)
lowering -> partitioning -> compilation -> refit pipeline alongside standard
Extension Phase 1 (<TARGET RELEASE VERSION>)
Extension Phase 2 (<TARGET RELEASE VERSION>)
lowering -> partitioning -> compilation -> refit as default
immutable_weights