-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenCL kernel long time to start execution after enqueue on iGPU #735
Comments
Arshad, Default configuration (no GPU Frequency throttling) demonstrates that performance observations are correct. With your scenario and your hardware i7-13700H CPU and integrated GPU mentioned end-to-end timing is in expected range. VTune logs show buffer submission timing on ~37us. Scenario design, logs show that there is single clEnqueueNDRangeKernel() followed by clFinish() batching is excluded, it triggers workload submission to iGPU through the i915 KMD and hard host synchronization at clFinish call every iteration. VTune instrumentation adds ~5us overhead for CPU/GPU event view timestamps.. Reducing clock to non-default and low frequency value, performance score is affected by several factors. With low GPU clock command stream parsing is extended, same as main kernel execution from 2us to 9us in linear factor. Additionally, on ramp-down there could be timing coincidence with potential render-standby RC6 enter and exit triggered by next clEnque/clFinish call followed by compute context switch. With forced low GPU clock both RC6 and GPU context switch which operates in GPU domain would be more time expensive. Performance observation and impact is valid and expected from Compute Runtime UMD driver perspective. To remove RC6 and batching / context switch effect which add abbreviations to your experiment, please redesign scenario with non-blocking clFlush() instead of clFinish() and disable GPU render standby RC6 in BIOS or on i915 module load time with via modprobe config file: |
Hi, code: ./drivers/gpu/drm/i915/gt/intel_rc6.c $ cat /sys/class/drm/card0/power/rc6_enable Seems like enable_rc6 option is removed since kernel 4.16 (see https://patchwork.freedesktop.org/patch/191386/). Regards, |
Please use BIOS settings to disable RC6 GPU standby capabilities. |
What kernel i915 driver version is used? According to this, only out-of-tree driver (from Intel package repository) supports low latency submission: https://dgpu-docs.intel.com/driver/kernel-driver-types.html EDIT: OOT KMD requires also compute runtime from same repo, or building compute-runtime with |
Hi,
I am experimenting with kernel execution on iGPU for solutions tailored for extremely low-latency workloads, where response times are critical down to the microsecond. While monitoring performance, I've noticed a significant delay from when kernels are queued to when they actually start executing. To isolate this issue, I implemented a basic OpenCL kernel to add two arrays of 256 bytes, queuing it with clEnqueueNDRangeKernel(). The API-level execution takes approximately 156 microseconds per call. However, using VTune's 'GPU Offload' analysis, I see that the kernel execution begins only after 131 microseconds and then runs for about 9 microseconds. This is with iGPU running at 300 Mhz frequency.
I observed that the GPU's operational frequency affects this latency significantly. At a low frequency of 300 MHz, the queue time is much longer compared to when the GPU is at its maximum frequency of 1300 MHz (queued duration drops to 33 microseconds, with a total time of 36 microseconds per call). Despite this improvement at higher frequencies, the initial delay before execution remains concerningly high at 33 microseconds, especially since there are no other compute tasks running simultaneously.
Given this scenario, I'm seeking insights into what might be causing these delays in compute execution startup. Additionally, what strategies could potentially reduce this latency, allowing the kernel to execute nearly immediately after being queued?
CPU Specs: 13th Gen Intel(R) Core(TM) i7-13700H (TDP 45W) on an Asus PN64-E
OS: Ubuntu 22.04
ICD version: 23.52.28202.52-821~22.04
Thank you for any advice or insights you can provide!
Best regards,
Arshad
Screenshot:
iGPU running at Low Frequency 300 Mhz - (~150 us duration per call, actual kernel execution ~9 us)
iGPU running at Higher Frequency 1300 Mhz - (~37 us duration per call, actual kernel execution ~2 us)
The text was updated successfully, but these errors were encountered: