Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: Why does reducing stack size decrease kernel launch overhead? #1216

Open
csehydrogen opened this issue Jun 18, 2024 · 4 comments
Open
Assignees

Comments

@csehydrogen
Copy link

Problem Description

When we use indirect function call in the kernel function as in below:

ncclDevFuncTable[ncclShmem.work.header.funcIndex]();

RCCL limits the size of stack to 512:

rccl/src/init.cc

Line 1876 in 53dcfcc

CUDACHECKIGNORE(cudaDeviceSetLimit(cudaLimitStackSize, stackSize));

which is 1024 by default if you does not modify via hipDeviceSetLimit API.

At first I thought the modification is not that important, but it indeed has impact on performance.
With the stack size of 512, collective communication on small data takes around ~10us.
With the stack size of 1024, however, it takes around ~244us, which is more than x20 latency.

The code line is introduced with #684, but there is no explanation on the situation.
Does anybody know why the stack size has an impact on kernel launch overhead?

I also made the following minimal working example which you may use to reproduce the issue.

#include <cstdio>
#include <hip/amd_detail/amd_hip_runtime.h>
#include <hip/hip_runtime.h>
#include <chrono>

#define CHECK_HIP(res) \
  do { \
    hipError_t err = (res); \
    if (err != hipSuccess) { \
      fprintf(stderr, "HIP Error (%s:%d): %s (%s)\n", __FILE__, __LINE__, \
              hipGetErrorName(err), hipGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

__device__ void subkernel0(int *a) { *a = 0xdeadbee0;}
__device__ void subkernel1(int *a) { *a = 0xdeadbee1;}
__device__ void (*subkernels[])(int *a){subkernel0, subkernel1};

__global__ void mainkernel(int *a) {
  subkernels[0](a);
}

size_t measure() {
  int warmup = 30, niter = 100;
  size_t elapsed = 0;
  for (int i = -warmup; i < niter; ++i) {
    int *a;
    CHECK_HIP(hipMalloc(&a, sizeof(int)));
    CHECK_HIP(hipMemset(a, 0, sizeof(int)));

    CHECK_HIP(hipDeviceSynchronize());
    auto start = std::chrono::high_resolution_clock::now();
    mainkernel<<<1, 1>>>(a);
    CHECK_HIP(hipDeviceSynchronize());
    auto end = std::chrono::high_resolution_clock::now();

    int b;
    CHECK_HIP(hipMemcpy(&b, a, sizeof(int), hipMemcpyDeviceToHost));
    if (b != 0xdeadbee0) {
      printf("Error: b = %x\n", b);
      exit(1);
    }
    CHECK_HIP(hipFree(a));
    if (i >= 0) {
      elapsed += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }
  }
  return elapsed / niter;
}

int getStackSize() {
  size_t curStackSize;
  CHECK_HIP(hipDeviceGetLimit(&curStackSize, hipLimitStackSize));
  return curStackSize;
}

int main() {
  for (int i = 500; i <= 700; ++i) {
    CHECK_HIP(hipDeviceSetLimit(hipLimitStackSize, i));
    printf("Stack size: %d, Time: %zu ns\n", getStackSize(), measure());
    //printf("%d,%zu\n", getStackSize(), measure());
  }
  return 0;
}

Operating System

Ubuntu 20.04.5 LTS (Focal Fossa)

CPU

AMD EPYC 7413 24-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@thananon
Copy link
Contributor

thananon commented Jul 17, 2024

Hi @csehydrogen , can you still reproduce this with the latest ROCm and RCCL?

I took the reproducer and unable to reproduce it on MI300X (similar time for different stack sizes). I am trying to find MI100 machine to run on.

@thananon
Copy link
Contributor

reproduced with MI250X.

Stack size: 684, Time: 12183 ns
Stack size: 685, Time: 12438 ns
Stack size: 686, Time: 12286 ns
Stack size: 687, Time: 12272 ns
Stack size: 688, Time: 12223 ns
Stack size: 689, Time: 218785 ns
Stack size: 690, Time: 218671 ns
Stack size: 691, Time: 218574 ns
Stack size: 692, Time: 218499 ns

Will create internal ticket for this.

@csehydrogen
Copy link
Author

I no longer have access to MI100. I was unable to reproduce it on MI350X with rocm-6.1.2.
Seems like it's a pre-MI300 problem.

@thananon
Copy link
Contributor

Thank you for confirmation. We are aware of this issue on MI100/MI200 series and working on a fix. In the meantime, we recommend to NOT use indirect function call in RCCL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants