[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

csehydrogen · 2024-06-18T09:08:28Z

Problem Description

When we use indirect function call in the kernel function as in below:

rccl/src/device/common.h

Line 348 in 53dcfcc

ncclDevFuncTable[ncclShmem.work.header.funcIndex]();

RCCL limits the size of stack to 512:

rccl/src/init.cc

Line 1876 in 53dcfcc

CUDACHECKIGNORE(cudaDeviceSetLimit(cudaLimitStackSize, stackSize));

which is 1024 by default if you does not modify via hipDeviceSetLimit API.

At first I thought the modification is not that important, but it indeed has impact on performance.
With the stack size of 512, collective communication on small data takes around ~10us.
With the stack size of 1024, however, it takes around ~244us, which is more than x20 latency.

The code line is introduced with #684, but there is no explanation on the situation.
Does anybody know why the stack size has an impact on kernel launch overhead?

I also made the following minimal working example which you may use to reproduce the issue.

#include <cstdio>
#include <hip/amd_detail/amd_hip_runtime.h>
#include <hip/hip_runtime.h>
#include <chrono>

#define CHECK_HIP(res) \
  do { \
    hipError_t err = (res); \
    if (err != hipSuccess) { \
      fprintf(stderr, "HIP Error (%s:%d): %s (%s)\n", __FILE__, __LINE__, \
              hipGetErrorName(err), hipGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

__device__ void subkernel0(int *a) { *a = 0xdeadbee0;}
__device__ void subkernel1(int *a) { *a = 0xdeadbee1;}
__device__ void (*subkernels[])(int *a){subkernel0, subkernel1};

__global__ void mainkernel(int *a) {
  subkernels[0](a);
}

size_t measure() {
  int warmup = 30, niter = 100;
  size_t elapsed = 0;
  for (int i = -warmup; i < niter; ++i) {
    int *a;
    CHECK_HIP(hipMalloc(&a, sizeof(int)));
    CHECK_HIP(hipMemset(a, 0, sizeof(int)));

    CHECK_HIP(hipDeviceSynchronize());
    auto start = std::chrono::high_resolution_clock::now();
    mainkernel<<<1, 1>>>(a);
    CHECK_HIP(hipDeviceSynchronize());
    auto end = std::chrono::high_resolution_clock::now();

    int b;
    CHECK_HIP(hipMemcpy(&b, a, sizeof(int), hipMemcpyDeviceToHost));
    if (b != 0xdeadbee0) {
      printf("Error: b = %x\n", b);
      exit(1);
    }
    CHECK_HIP(hipFree(a));
    if (i >= 0) {
      elapsed += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }
  }
  return elapsed / niter;
}

int getStackSize() {
  size_t curStackSize;
  CHECK_HIP(hipDeviceGetLimit(&curStackSize, hipLimitStackSize));
  return curStackSize;
}

int main() {
  for (int i = 500; i <= 700; ++i) {
    CHECK_HIP(hipDeviceSetLimit(hipLimitStackSize, i));
    printf("Stack size: %d, Time: %zu ns\n", getStackSize(), measure());
    //printf("%d,%zu\n", getStackSize(), measure());
  }
  return 0;
}

Operating System

Ubuntu 20.04.5 LTS (Focal Fossa)

CPU

AMD EPYC 7413 24-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

The text was updated successfully, but these errors were encountered:

thananon · 2024-07-17T16:51:13Z

Hi @csehydrogen , can you still reproduce this with the latest ROCm and RCCL?

I took the reproducer and unable to reproduce it on MI300X (similar time for different stack sizes). I am trying to find MI100 machine to run on.

thananon · 2024-07-17T17:05:28Z

reproduced with MI250X.

Stack size: 684, Time: 12183 ns
Stack size: 685, Time: 12438 ns
Stack size: 686, Time: 12286 ns
Stack size: 687, Time: 12272 ns
Stack size: 688, Time: 12223 ns
Stack size: 689, Time: 218785 ns
Stack size: 690, Time: 218671 ns
Stack size: 691, Time: 218574 ns
Stack size: 692, Time: 218499 ns

Will create internal ticket for this.

csehydrogen · 2024-07-18T02:40:05Z

I no longer have access to MI100. I was unable to reproduce it on MI350X with rocm-6.1.2.
Seems like it's a pre-MI300 problem.

thananon · 2024-07-18T14:25:52Z

Thank you for confirmation. We are aware of this issue on MI100/MI200 series and working on a fix. In the meantime, we recommend to NOT use indirect function call in RCCL.

haripriya-amd assigned thananon Jul 17, 2024

ppanchad-amd added the Under Investigation label Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

csehydrogen commented Jun 18, 2024

thananon commented Jul 17, 2024 •

edited

Loading

thananon commented Jul 17, 2024

csehydrogen commented Jul 18, 2024

thananon commented Jul 18, 2024

[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

Comments

csehydrogen commented Jun 18, 2024

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

thananon commented Jul 17, 2024 • edited Loading

thananon commented Jul 17, 2024

csehydrogen commented Jul 18, 2024

thananon commented Jul 18, 2024

thananon commented Jul 17, 2024 •

edited

Loading