cuBLAS Demos

This repo contains cuBLAS demos from several sources of documentation.

src/

  • cublas_acc_device calls cublasSswap from an OpenACC device kernel. It is from Section 6.2 of PGI's Fortran CUDA Library Interfaces, v. 2017.

  • cublas_stream calls cublasDgemm from the host using multiple streams. It is from OLCF's tutorial, Concurrent Kernels II: Batched Library Calls. It uses a custom Fortran interface to the C cuBLAS v2 functions, apparently because NVIDIA did not provide a Fortran interface to cuBLAS v2 when the tutorial was written.

  • cublas_stream_no_c is a version of cublas_stream that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2. It was written by me, Ron Rahaman.

  • cublas_batch calls cublasDgemmBatched to launch multiple dgemm operations with one call. It is also from OLCF's tutorial, Concurrent Kernels II: Batched Library Calls. Like cublas_stream, it uses a custom Fortran interface to the C cuBLAS v2 functions.

  • cublas_batch_no_c is a version of cublas_batch that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2. It was written by me, Ron Rahaman.

  • cublas_batch_acc is a version of cublas_batch_no_c that uses OpenACC data directives for host/device data transfers. It is intended to demonstrate the use of cuBLAS batched dgemm in a code that uses OpenACC for everything else. It was written by me, Ron Rahaman.
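
As a rough illustration of what the batched demos compute, the sketch below is a plain-C, CPU-only reference for the non-transposed case of the operation that cublasDgemmBatched performs: C[i] = alpha * A[i] * B[i] + beta * C[i] for every matrix in the batch, with column-major storage. It is not code from this repo and it does not call cuBLAS. The names ref_dgemm and ref_dgemm_batched are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (row, col) in a matrix with leading dimension ld. */
#define IDX(row, col, ld) ((size_t)(col) * (size_t)(ld) + (size_t)(row))

/* Hypothetical reference dgemm (no transposes): C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all stored column-major. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *A, int lda,
               const double *B, int ldb,
               double beta, double *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) {
                acc += A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            }
            /* When beta is zero, C is treated as output only and is not read. */
            if (beta == 0.0) {
                C[IDX(i, j, ldc)] = alpha * acc;
            } else {
                C[IDX(i, j, ldc)] = alpha * acc + beta * C[IDX(i, j, ldc)];
            }
        }
    }
}

/* Hypothetical reference for the batched form: A, B, and C are arrays of
 * pointers, one matrix per batch entry, and every entry shares the same
 * dimensions and leading dimensions. A batched library call does this work
 * in one launch; here it is simply a loop over the batch. */
void ref_dgemm_batched(int m, int n, int k, double alpha,
                       const double *const A[], int lda,
                       const double *const B[], int ldb,
                       double beta, double *const C[], int ldc,
                       int batch_count)
{
    assert(batch_count >= 0);
    for (int b = 0; b < batch_count; ++b) {
        ref_dgemm(m, n, k, alpha, A[b], lda, B[b], ldb, beta, C[b], ldc);
    }
}
```

A CPU reference like this can be useful for checking the output of the GPU demos on small inputs. The streamed demos compute the same per-matrix result, but issue one cublasDgemm call per matrix rather than a single batched call.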

data/

Testbeds

The following testbeds were used to gather the results contained here:

  • neddy is a GPU node provided by JLSE at ANL. It contains 1x NVIDIA P100 GPU. Details are found here.

Results

  • cublas_stream_times.p100.csv contains timings from running cublas_stream_no_c on neddy.

  • cublas_batch_times.p100.csv contains timings from running cublas_batch_no_c on neddy.