Building Refittable Engines by Default #3204

narendasan · 2024-10-02T18:02:31Z

narendasan
Oct 2, 2024
Collaborator

Building Refittable Engines by Default

TL;DR

Build engines whose weights can be changed after build time by default, rather than only on request as today. This restructures the compilation pipeline so that instead of
lowering -> partitioning -> compilation we do lowering -> partitioning -> compilation -> refit

Goal(s)

Refittable engines are more flexible
- kREFIT_IDENTICAL engines are as optimized as standard engines
We can build engines without weights in them
- The cache becomes way smaller
- Any built engine is now cache-able

Usecases

Proposed APIs / UX

1. Users just want a weight-stripped engine. They can call convert_exported_program_to_trt_engine with strip_engine_weights=True to get one. This is also supported when the engine is loaded from the engine cache. [Afterwards they are TRT users, so we don't care as much about post-compilation workflows]
2. [Most interesting] We want to use weight stripping to make the engine cache lighter. The weight-stripped engine is an implementation detail that is opaque to users. However, if users specify kREFIT or kREFIT_IDENTICAL, the results are treated as different engines and each gets its own cache entry.
3. Users want a weight-stripped compiled program. They just need to call torch.compile() or torch_trt.dynamo.compile() with strip_engine_weights=True. If the compiled program is run on inputs immediately, all results will be zeros. Calling refit_module_weights() restores the weights. [Afterwards they are still Torch-TensorRT users, so we care about post-compilation workflows]

Example Workflow

Compilation Options for dynamo.compile:

Today:

make_refittable
reuse_cached_engines =

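import torch
import torch_tensorrt as torch_trt
from torch_tensorrt.dynamo import refit_module_weights

# Export once, then build a weight-stripped engine from the exported program
exp_program = torch.export.export(
    pyt_model, args=inputs, dynamic_shapes={"x": {0: batch}}
)
stripped_gm = torch_trt.dynamo.compile(exp_program, strip_engine_weights=True, ...)
# Refit the stripped module with the weights of the original program
refitted_gm = refit_module_weights(stripped_gm, exp_program)
# Refit the same stripped module with the weights of a second program
exp_program2 = ...
refitted_gm2 = refit_module_weights(stripped_gm, exp_program2)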

Limitations

There are some ops which are not refittable: cumsum and embedding_bag are two examples.
There are 2 options:

Users can keep the engines refittable, but these ops must run in PyTorch
- We need to update dryrun to report why some ops will run in PyTorch (i.e. cumsum is not refittable) - Might require having validators updated to be able to return a reason why op is not supported.
  - (node: Node, settings: CompilationSettings) -> (bool, Optional[str])
Users can go back to the old pipeline: the ops will run in TensorRT, but there is no engine caching, every engine is rebuilt from scratch, and the engine cannot be refit
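A minimal sketch of the validator change suggested in the first option, assuming a validator with the signature shown above. `Node` and `CompilationSettings` here are small hypothetical stand-ins rather than the real Torch-TensorRT or torch.fx classes, and `refit_validator` and `dryrun_report` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (a torch.fx.Node in practice)."""
    target: str


@dataclass
class CompilationSettings:
    """Hypothetical stand-in carrying only the flag this sketch needs."""
    immutable_weights: bool = False


# Ops named in the discussion as not refittable.
NON_REFITTABLE_OPS = {"cumsum", "embedding_bag"}


def refit_validator(
    node: Node, settings: CompilationSettings
) -> Tuple[bool, Optional[str]]:
    """Return (supported, reason); reason is None when the op is supported."""
    if not settings.immutable_weights and node.target in NON_REFITTABLE_OPS:
        return False, f"{node.target} is not refittable"
    return True, None


def dryrun_report(
    nodes: List[Node], settings: CompilationSettings
) -> Dict[str, str]:
    """Map each op that falls back to PyTorch to the reason for the fallback."""
    report: Dict[str, str] = {}
    for node in nodes:
        supported, reason = refit_validator(node, settings)
        if not supported and reason is not None:
            report[node.target] = reason
    return report
```

With refittable engines the report would list cumsum together with its reason. With immutable_weights=True the same op is accepted, which corresponds to the second option.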

Internal Implementation

Design

There will now be 3 classes of engines:

weight strip + refittable (strip_weights + kREFIT) - should move towards this being the default
weight strip + refittable with original weights (strip_weights + kREFIT_IDENTICAL)
non_refittable

With empty cache:
- We expect that graph -> net def -> weight-stripped engine build -> refit will be about as fast as the current graph -> net def -> engine build with weights
- We should be able to reuse the net def from building the weight-stripped engine to speed up refit
With populated cache:
- We should be able to either use the weight_name_cache or rebuild the net def; either should be faster than building from scratch

Extensions Required to Core API implementations

Use-case 3 open questions:

Do we need to track the weight status of a TorchTensorRTModule?
How do we know if a user is trying to refit with weights different from those used for compilation (in the case where the engine was built with kREFIT_IDENTICAL)?

Data Structures

Implementation Phases

Prototype -

We add the strip_engine_weights setting
We add settings for kREFIT vs kREFIT_IDENTICAL, we create distinct cache entries based on this choice
Don't change defaults
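One way to get distinct cache entries per refit mode is to fold the mode into the cache key. This is only a sketch under that assumption: `RefitMode`, `engine_cache_key`, and the string `graph_repr` are hypothetical, and the actual hashing used by the engine cache may look quite different.

```python
import hashlib
from enum import Enum


class RefitMode(Enum):
    """Hypothetical setting values mirroring the two builder flags discussed."""
    REFIT = "kREFIT"
    REFIT_IDENTICAL = "kREFIT_IDENTICAL"


def engine_cache_key(graph_repr: str, refit_mode: RefitMode) -> str:
    """Hash the graph together with the refit mode so each mode gets its own entry."""
    payload = f"{refit_mode.value}\n{graph_repr}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

The same graph then maps to two different keys, one per mode, while repeated builds with the same mode hit the same entry.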

MVP `(<TARGET RELEASE VERSION>)`

We add the lowering -> partitioning -> compilation -> refit pipeline alongside the standard pipeline

Extension Phase 1 `(<TARGET RELEASE VERSION>)`

Support use case 3

Extension Phase 2 `(<TARGET RELEASE VERSION>)`

Switch to lowering -> partitioning -> compilation -> refit as default
Fixed engines would be enabled via a setting like immutable_weights

HolyWu · 2024-10-04T04:38:04Z

HolyWu
Oct 4, 2024

Not sure if it matters but there is a sentence under Known Issues at https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html: There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.

2 replies

zewenli98 Oct 4, 2024
Collaborator

@HolyWu Thanks for the info. I think it makes sense that there is a performance gap between REFIT on and off. Do you know if there's a performance gap between an engine built with REFIT_IDENTICAL and then refit, and a non-refittable engine?

HolyWu Oct 4, 2024

I have no information about that.

zewenli98 · 2024-10-04T23:01:56Z

zewenli98
Oct 4, 2024
Collaborator

Design

Let's define three types of engines:
A: weight-included, refittable engine
B: weight-stripped, refittable engine
C: weight-included, non-refittable engine

According to the design of TRT 10, as shown in the diagram above, the transition between the three types of engines is one-way, i.e., A->B->C. We cannot make C go back to B, or B go back to A, other than by rebuilding the engine.

Limitation

For engine caching, we save and load B type engines, which means a cached engine can only be refitted once: when we refit B, it transitions to C. For now, TRT doesn't support B->A.

API usage

We have three related args:
immutable_weights: bool: Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, strip_engine_weights and refit_identical_engine_weights will be ignored.

strip_engine_weights: bool: Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.

refit_identical_engine_weights: bool: Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.

Since engine caching saves B type engines, we return different types of engines depending on whether a cached engine is used:

When not using cached engine (building from scratch):

if immutable_weights:
    return C engine
else:
    save B engine in engine caching
    if strip_engine_weights:
        return B engine
    else:
        return A engine

When using cached engine:

if immutable_weights:
    non-refittable engines are not cached, so fall back to the normal build path (returns C engine)
else:
    load B engine in engine caching
    if strip_engine_weights:
        return B engine
    else:
        return C engine

Note that:

Non-refittable engines are not supported by engine caching, so they always go through the normal build path.
The last return differs: a newly built engine is returned as an A engine, while a cached engine is returned as a C engine. So users who want to refit an engine multiple times have to disable engine caching to get an A engine, because B can only be refitted once and C cannot be refitted.
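The two branches above can be folded into one function, which makes the A versus C difference easy to see. This is a sketch of the decision logic only: `EngineType` and `returned_engine_type` are hypothetical names, and no engine is built or loaded here.

```python
from enum import Enum


class EngineType(Enum):
    A = "weight-included, refittable"
    B = "weight-stripped, refittable"
    C = "weight-included, non-refittable"


def returned_engine_type(
    immutable_weights: bool, strip_engine_weights: bool, cache_hit: bool
) -> EngineType:
    """Engine type handed back to the user for a given combination of settings."""
    if immutable_weights:
        # Non-refittable engines are never cached, so this is always a fresh build,
        # and strip_engine_weights is ignored.
        return EngineType.C
    if strip_engine_weights:
        # Freshly built or loaded from the cache, the stripped engine is returned as-is.
        return EngineType.B
    # A cached B engine becomes C once refitted; a fresh build can return A.
    return EngineType.C if cache_hit else EngineType.A
```

For example, `returned_engine_type(False, False, cache_hit=True)` yields `EngineType.C`, which is why refitting more than once currently requires disabling engine caching.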

1 reply

narendasan Oct 8, 2024
Collaborator Author

If we hold the original serialized weight-stripped engine, can't we just refit that over and over instead of refitting the live engine?
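A rough sketch of that idea, assuming a fresh copy deserialized from the stored bytes can always be refitted (which is exactly the open question here). `StrippedEngineHolder` is a hypothetical name, and the `deserialize` and `refit` callables are injected placeholders for whatever runtime calls would actually be used.

```python
from typing import Any, Callable, Mapping


class StrippedEngineHolder:
    """Keeps the serialized weight-stripped engine so every refit starts from it."""

    def __init__(
        self,
        serialized_engine: bytes,
        deserialize: Callable[[bytes], Any],
        refit: Callable[[Any, Mapping[str, Any]], Any],
    ) -> None:
        self._serialized = serialized_engine
        self._deserialize = deserialize
        self._refit = refit

    def refit_with(self, weights: Mapping[str, Any]) -> Any:
        """Deserialize a fresh stripped engine and refit that copy, not a live one."""
        engine = self._deserialize(self._serialized)
        return self._refit(engine, weights)
```

Each call to `refit_with` leaves the stored bytes untouched, so the "refittable once" limit would apply per deserialized copy rather than per cached engine.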

narendasan · 2024-10-09T17:43:13Z

narendasan
Oct 9, 2024
Collaborator Author

New Plan is

We always build refittable engines. Engines are either REFIT or REFIT_IDENTICAL; users can opt out using immutable_engine_weights or something?
1. immutable_engine_weights engines will not be cached
2. We want to turn on engine caching by default
3. make_refittable changes to a setting that chooses between REFIT and REFIT_IDENTICAL (we need to choose the right default)
We will add a setting for stripping engine weights to both .compile and .convert, with a note that these engines are only refittable once (the serialized engine needs to be loaded again to refit again)

0 replies

zewenli98 · 2024-10-09T23:06:23Z

zewenli98
Oct 9, 2024
Collaborator

Another caveat:

TRT has 4 refit related flags while building an engine:

REFIT
REFIT_IDENTICAL
REFIT_PROXY (for trt cloud)
REFIT_INDIVIDUAL

Only REFIT is exposed to the engine API, so even if we build an engine with REFIT_IDENTICAL, engine.refittable is still false. That means we cannot use engine.refittable API to tell if an engine is really refittable.
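One possible workaround is to record the refit-related build settings next to the engine and consult that record instead of engine.refittable. The sketch below assumes this approach. `EngineBuildInfo` is a hypothetical structure, not an existing Torch-TensorRT or TensorRT type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineBuildInfo:
    """Hypothetical record of the refit flags used when the engine was built."""
    refit: bool = False
    refit_identical: bool = False

    @property
    def is_refittable(self) -> bool:
        # Either flag makes the engine refittable, regardless of what the
        # engine object itself reports.
        return self.refit or self.refit_identical
```

An engine built with REFIT_IDENTICAL would then be described by `EngineBuildInfo(refit_identical=True)`, whose `is_refittable` is true even though engine.refittable reports false.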

0 replies
