Multi GPU compilation support
TL;DR
An important use case for Torch-TRT is supporting multi-GPU and multi-node execution. The goal is to boost performance by using data parallelism and tensor parallelism to compile the model across multiple GPUs. There are different ways to do this: fully sharded data parallelism, sequence parallelism, and tensor parallelism. Data parallelism examples can be found in /examples/distributed_inference/data_parallel_gpt2 and /examples/distributed_inference/data_parallel_stable_diffusion. This RFC focuses on tensor parallelism, an efficient model parallelism method for large model compilation.

Goal
Accelerate model compilation using tensor parallelism.
Implementation stages
Development of tensor parallel (TP) inference examples.
Compiling the model with tensor parallelism is agnostic to the framework, as long as the model has been sharded properly across the network. The following two frameworks have been explored so far:
torch.distributed.tensor.parallel

Below is an example of TP using torch.distributed.tensor.parallel:
https://github.com/pytorch/TensorRT/pull/3047/files#diffd70d4b88b03c3208178bf10992dfbb8e8ddb4ec6db247bbf42e0c2e706b0d5a5R5-R83
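For orientation, here is a minimal, self-contained sketch of the same idea, not the code from the PR above: a toy two-layer MLP is sharded with a manually chosen column-/row-parallel plan and then compiled with the Torch-TensorRT backend for torch.compile. The ToyMLP module, layer names, and sharding plan are illustrative assumptions; the script is meant to be launched with torchrun, one process per GPU, as shown further below.

```python
# Minimal TP sketch (illustrative, not the PR's example).
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch_tensorrt  # registers the "torch_tensorrt" torch.compile backend
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    """Toy module standing in for a transformer feed-forward block."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.in_proj = nn.Linear(dim, 4 * dim)
        self.out_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.out_proj(torch.relu(self.in_proj(x)))


# torchrun provides LOCAL_RANK / WORLD_SIZE; one process per GPU.
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# 1D device mesh spanning all ranks, used for tensor parallelism.
mesh = init_device_mesh("cuda", (world_size,))
model = ToyMLP().eval().cuda()

# The sharding layout is set manually per model: column-parallel for the
# first projection, row-parallel for the second.
model = parallelize_module(
    model,
    mesh,
    {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()},
)

x = torch.randn(8, 1024, device="cuda")
compiled = torch.compile(model, backend="torch_tensorrt", dynamic=False)
with torch.no_grad():
    out = compiled(x)

dist.destroy_process_group()
```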
We start with this approach, the primary reason being that torch.distributed is more stable and has no dependency on external libraries. Because of this it does not introduce torch.compile graph breaks, and the forward function remains more under our control. The main limitation, however, is that the sharding layout is set manually and will change from model to model.

We run the above using
torchrun --nproc_per_node=2 tensor_parallel.py
Megatron-LM

Megatron-LM uses a JSON configuration file to set various parameters such as model size, sequence length, and parallelism settings. This is a future path to be explored, since it causes dynamic graph breaks in torch.compile; however, it supports the mainstream LLM models via its configuration file.
Wrapping the NCCL ops in the Torch-TensorRT converter library
NCCL (NVIDIA Collective Communications Library) is a library designed to optimize collective communication operations across multiple GPUs and nodes. It is developed by NVIDIA and is widely used in distributed deep learning and high-performance computing to handle communication between GPUs efficiently.
Since NCCL collective communication is not supported in Torch-TRT, the above torch.distributed ops cause graph breaks, leading to slower compilation time compared to Torch in the first iteration while forming the TRT engine. The following are the operations to be supported:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html
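As a rough mapping from those NCCL collectives to what actually appears in the traced graph, the targets below are the functional-collective ops that would need converter support. The list is illustrative rather than exhaustive, and the exact op names can vary between PyTorch versions (older releases expose c10d_functional instead of _c10d_functional).

```python
import torch

# Illustrative (not exhaustive) list of functional-collective graph targets
# that correspond to NCCL collectives and currently fall back to PyTorch,
# causing graph breaks in Torch-TRT:
OPS_NEEDING_CONVERTERS = [
    torch.ops._c10d_functional.all_reduce.default,
    torch.ops._c10d_functional.all_gather_into_tensor.default,
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
    torch.ops._c10d_functional.wait_tensor.default,
]
```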
Below are some examples of how the NCCL ops can be supported in Torch-TRT using an NCCL-based plugin. It uses a TensorRT plugin registered under the namespace TRT_LLM_PLUGIN_NAMESPACE, which wraps the NCCL operations:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/functional.py#L3701-L4003
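To make the wrapping idea concrete, the skeleton below sketches how a Torch-TensorRT dynamo converter could lower one such collective (all_reduce) into a TensorRT plugin layer looked up from the TRT_LLM_PLUGIN_NAMESPACE registry. The plugin name, version, and plugin fields are assumptions for illustration, not the actual interface; the real plugin definitions are in the TRT-LLM functional.py linked above, and the converter decorator/signature reflect recent Torch-TRT versions.

```python
import numpy as np
import tensorrt as trt
import torch
from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter

TRT_LLM_PLUGIN_NAMESPACE = "tensorrt_llm"  # assumed namespace of the TRT-LLM plugins


@dynamo_tensorrt_converter(torch.ops._c10d_functional.all_reduce.default)
def convert_all_reduce(ctx, target, args, kwargs, name):
    """Sketch: replace a functional all_reduce with an NCCL-backed plugin layer."""
    registry = trt.get_plugin_registry()
    # Hypothetical plugin name/version; they must match the creator registered
    # by the TRT-LLM plugin library under TRT_LLM_PLUGIN_NAMESPACE.
    creator = registry.get_plugin_creator("AllReduce", "1", TRT_LLM_PLUGIN_NAMESPACE)
    fields = trt.PluginFieldCollection(
        [
            # Hypothetical field describing the participating ranks.
            trt.PluginField(
                "group", np.array([0, 1], dtype=np.int32), trt.PluginFieldType.INT32
            ),
        ]
    )
    plugin = creator.create_plugin(name, fields)
    # args[0] is assumed to already be converted to a TensorRT ITensor.
    layer = ctx.net.add_plugin_v2([args[0]], plugin)
    return layer.get_output(0)
```

The intent of such a converter is that the collective stays inside the TRT engine as a plugin layer that calls into NCCL at runtime, instead of splitting the graph and falling back to PyTorch.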
Implementing the above directly from the TRT-LLM library
There can be two methods