
[RFC] Improve GPU vector interface #82

Open · wants to merge 1 commit into base: CMSSW_10_4_X_Patatrack

Conversation

makortel

Spurred by my earlier dislike of the interface of

namespace GPU {
template <class T> struct SimpleVector {

and the recent discussion with @felicepantaleo about
template <class T, int maxSize> struct VecArray {

I started to think about whether we could improve the interface of a "GPU vector" a bit.

On StackOverflow I came across a pattern where a "GPU class" is split in two:

  • A class owning the GPU memory
  • A wrapper class holding only raw pointers to the GPU memory so that it can be passed to the kernels by value

In this PR I toyed with these ideas for a GPU vector implementation (I hope the unit test is enough to demonstrate how it is used; I'm sure it can be improved further).

I feel the pattern of passing the "structs of device pointers" by value to the kernels would simplify the code, as we could avoid doing a cudaMalloc for the struct itself.
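
For concreteness, here is a minimal sketch of that two-class pattern in plain CUDA C++; the names (GPUVector, GPUVectorView) and the push_back details are illustrative, not the actual code in this PR:

```cpp
#include <cuda_runtime.h>

template <typename T>
struct GPUVectorView {
  // Non-owning: only raw device pointers, so it can be passed to a kernel
  // by value without a cudaMalloc for the struct itself.
  T *data;
  int *size;      // device-resident current size
  int capacity;

  __device__ int push_back(const T &value) {
    int idx = atomicAdd(size, 1);
    if (idx < capacity) {
      data[idx] = value;
      return idx;
    }
    atomicSub(size, 1);  // roll back, the vector is full
    return -1;
  }
};

template <typename T>
class GPUVector {
public:
  explicit GPUVector(int capacity) : capacity_(capacity) {
    cudaMalloc(&data_, sizeof(T) * capacity);
    cudaMalloc(&size_, sizeof(int));
    cudaMemset(size_, 0, sizeof(int));
  }
  ~GPUVector() {
    cudaFree(data_);
    cudaFree(size_);
  }
  GPUVector(const GPUVector &) = delete;
  GPUVector &operator=(const GPUVector &) = delete;

  // The view is cheap to construct and trivially copyable.
  GPUVectorView<T> view() { return GPUVectorView<T>{data_, size_, capacity_}; }

private:
  T *data_;
  int *size_;
  int capacity_;
};

// The kernel takes the view by value, so no device-side allocation of the
// struct is needed:
__global__ void fill(GPUVectorView<int> vec) {
  vec.push_back(static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x));
}

// Host side (sketch): GPUVector<int> vec(1024); fill<<<4, 256>>>(vec.view());
```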

@felicepantaleo @VinInn @fwyzard @rovere

@cmsbot

cmsbot commented Jun 13, 2018

A new Pull Request was created by @makortel (Matti Kortelainen) for CMSSW_10_2_X_Patatrack.

It involves the following packages:

HeterogeneousCore/CUDAUtilities

The following packages do not have a category, yet:

HeterogeneousCore/CUDAUtilities
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbot can you please review it and eventually sign? Thanks.

cms-bot commands are listed here

@makortel
Author

One could actually go one step further and make the interface "safer" with respect to synchronization (i.e. try to avoid having to remember explicit synchronization calls by making them more "automated") by

  • Making the GPUVector only own the memory (and provide data transfer functions for the helper classes)
  • Adding a "host wrapper class" (let's call it a "host view" from now on; similarly, the current GPUVectorWrapper<T> becomes a "device view")
    • Constructing/asking for a "host view" automatically transfers the size from GPU to CPU
      • An async variant can also be provided; the user is then responsible for calling cudaStreamSynchronize before using the "host view"
      • A "no update" variant can also be provided for cases where it is guaranteed that the device vector size has not changed since the "host view" was last constructed

I'm not really sure whether hiding the transfers and synchronizations this way would actually make the code clearer. I can provide an example if there is interest.
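
A rough sketch of what such a "host view" / "device view" split might look like; all names here are hypothetical and not taken from this PR:

```cpp
#include <cuda_runtime.h>

template <typename T>
class GPUVector {
public:
  struct DeviceView {   // passed to kernels by value
    T *data;
    int *size;
    int capacity;
  };
  struct HostView {     // host-side, read-only snapshot of the size
    const int *size_ptr;
    int size() const { return *size_ptr; }
  };

  explicit GPUVector(int capacity) : capacity_(capacity) {
    cudaMalloc(&data_, sizeof(T) * capacity);
    cudaMalloc(&size_, sizeof(int));
    cudaMemset(size_, 0, sizeof(int));
    cudaMallocHost(&hostSize_, sizeof(int));  // pinned, so async copies are valid
    *hostSize_ = 0;
  }
  ~GPUVector() {
    cudaFree(data_);
    cudaFree(size_);
    cudaFreeHost(hostSize_);
  }

  DeviceView deviceView() { return DeviceView{data_, size_, capacity_}; }

  // Synchronous variant: the size transfer is hidden in the call.
  HostView hostView() {
    cudaMemcpy(hostSize_, size_, sizeof(int), cudaMemcpyDeviceToHost);
    return HostView{hostSize_};
  }

  // Async variant: the caller must cudaStreamSynchronize(stream) before use.
  HostView hostViewAsync(cudaStream_t stream) {
    cudaMemcpyAsync(hostSize_, size_, sizeof(int), cudaMemcpyDeviceToHost, stream);
    return HostView{hostSize_};
  }

  // "No update" variant: reuse the size from the previous transfer.
  HostView hostViewNoUpdate() const { return HostView{hostSize_}; }

private:
  T *data_;
  int *size_;
  int *hostSize_;  // pinned host-side copy of the size
  int capacity_;
};
```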

@fwyzard

fwyzard commented Jun 29, 2018

Validation summary

Reference release CMSSW_10_2_0_pre5 at 30c7b03
Development branch CMSSW_10_2_X_Patatrack at 10d59f2
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_0_pre5-PU25ns_102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre5-102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_0_pre5-PU25ns_102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre5-102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

  • reference DQM plots for reference release, workflow 10824.5
  • DQM plots for development release, workflow 10824.5
  • DQM plots for development release, workflow 10824.8 are missing
  • DQM plots for development release, workflow 10824.7
  • DQM plots for development release, workflow 10824.9
  • DQM plots for testing release, workflow 10824.5
  • DQM plots for testing release, workflow 10824.8 are missing
  • DQM plots for testing release, workflow 10824.7
  • DQM plots for testing release, workflow 10824.9
  • DQM comparison for reference workflow 10824.5
  • DQM comparison for workflow 10824.8
  • DQM comparison for workflow 10824.7
  • DQM comparison for workflow 10824.9

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_0_pre5-PU25ns_102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_0_pre5-102X_upgrade2018_realistic_v1-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/798e35cfd563696abc727c08a6c31c28bbabe374/log .

@fwyzard

fwyzard commented Jun 29, 2018

Before spending more time on this, I think we should evaluate if Unified Memory works well enough, as it would probably render these utility classes obsolete.
@makortel , @felicepantaleo , @VinInn let me know if you would still find this useful, even just to play with, and I will merge it.
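
To illustrate why Unified Memory could make this kind of utility class largely unnecessary, here is a minimal sketch (not the code from this PR; error handling omitted): with cudaMallocManaged both the element buffer and the vector struct itself are directly usable from host and device, so there is no need for a separate owner/wrapper split or for explicit size transfers.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

struct ManagedIntVector {
  int *data;
  int size;
  int capacity;

  __device__ void push_back(int value) {
    int idx = atomicAdd(&size, 1);
    if (idx < capacity) data[idx] = value;
  }
};

__global__ void fill(ManagedIntVector *vec) {
  vec->push_back(static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x));
}

int main() {
  ManagedIntVector *vec = nullptr;
  cudaMallocManaged(&vec, sizeof(ManagedIntVector));  // the struct itself is managed
  cudaMallocManaged(&vec->data, 1024 * sizeof(int));  // and so is the element buffer
  vec->size = 0;
  vec->capacity = 1024;

  fill<<<4, 256>>>(vec);
  cudaDeviceSynchronize();

  printf("filled %d elements\n", vec->size);  // read directly on the host, no explicit copy

  cudaFree(vec->data);
  cudaFree(vec);
  return 0;
}
```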

@makortel
Author

I agree that evaluating Unified Memory is more important than testing this "toy" in action, exactly because Unified Memory would make many things much simpler. Although maybe even with Unified Memory we would want a specific vector-like class (or classes separating the ownership from a "view-like" usage) if we want to avoid the memory allocations caused by copying.

(and anyway I intended this PR more for discussion than merging as-is)
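
A rough sketch of that last point, assuming Unified Memory underneath: the owning class stays non-copyable, and a trivially copyable, non-owning view is what gets passed around (e.g. to kernels by value), so no allocation or payload copy happens on the way. The ManagedVector/ManagedVectorView names are hypothetical.

```cpp
#include <cuda_runtime.h>

template <typename T>
struct ManagedVectorView {  // non-owning; copying it copies two words
  T *data;
  int size;
  __host__ __device__ T &operator[](int i) const { return data[i]; }
};

template <typename T>
class ManagedVector {
public:
  explicit ManagedVector(int size) : size_(size) {
    cudaMallocManaged(&data_, sizeof(T) * size);
  }
  ~ManagedVector() { cudaFree(data_); }
  ManagedVector(const ManagedVector &) = delete;             // no accidental deep copies
  ManagedVector &operator=(const ManagedVector &) = delete;

  ManagedVectorView<T> view() const { return ManagedVectorView<T>{data_, size_}; }

private:
  T *data_;
  int size_;
};

// The view is passed by value; nothing is allocated or deep-copied.
__global__ void scale(ManagedVectorView<float> v, float factor) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < v.size) v[i] *= factor;
}
```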

@fwyzard force-pushed the CMSSW_10_2_X_Patatrack branch 2 times, most recently from 48d4372 to a721b31 on August 17, 2018 at 20:51
@fwyzard force-pushed the CMSSW_10_2_X_Patatrack branch 2 times, most recently from 5200bc1 to cf2d1bb on August 30, 2018 at 07:24
fwyzard pushed a commit that referenced this pull request on Nov 1, 2018:
Address code review comments, including modernisation of code
@fwyzard changed the base branch from CMSSW_10_2_X_Patatrack to CMSSW_10_4_X_Patatrack on November 15, 2018 at 08:33