This repo contains cuBLAS demos from several sources of documentation.
- `cublas_acc_device` calls `cublasSswap` from an OpenACC device kernel. It is from Section 6.2 of PGI's *Fortran CUDA Library Interfaces*, v. 2017.
- `cublas_stream` calls `cublasDgemm` from the host using multiple streams. It is from OLCF's tutorial, *Concurrent Kernels II: Batched Library Calls*. Note that it uses a custom Fortran interface to the C cuBLAS v2 functions; it appears that, when the tutorial was written, NVIDIA did not provide a Fortran interface to cuBLAS v2.
- `cublas_stream_no_c` is a version of `cublas_stream` that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2 (a sketch of this interface appears after this list). It was written by me, Ron Rahaman.
- `cublas_batch` calls `cublasDgemmBatched` to launch multiple dgemm operations with one call. It is also from OLCF's tutorial, *Concurrent Kernels II: Batched Library Calls*. Like `cublas_stream`, it uses a custom Fortran interface to the C cuBLAS v2 functions.
- `cublas_batch_no_c` is a version of `cublas_batch` that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2 (a sketch of the batched interface appears after this list). It was written by me, Ron Rahaman.
- `cublas_batch_acc` is a version of `cublas_batch_no_c` that uses OpenACC data directives for host/device data transfers (a sketch of that pattern appears after this list). It is intended to demonstrate the use of cuBLAS batched dgemm in a code that uses OpenACC for everything else. It was written by me, Ron Rahaman.
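For readers unfamiliar with NVIDIA's Fortran interfaces to cuBLAS v2, here is a minimal CUDA Fortran sketch of the pattern `cublas_stream_no_c` exercises: one cuBLAS handle rebound to a different CUDA stream before each `cublasDgemm` call. This is not the repo's code; the matrix size, stream count, variable names, and compile line are all assumptions.

```fortran
! Sketch only (assumed names and sizes, not the repo's code).
! Compile with something like: pgfortran -Mcuda -Mcudalib=cublas stream_sketch.f90
program stream_dgemm_sketch
  use cudafor
  use cublas_v2
  implicit none
  integer, parameter :: n = 512, nstreams = 4
  real(8), device, allocatable :: dA(:,:,:), dB(:,:,:), dC(:,:,:)
  integer(kind=cuda_stream_kind) :: streams(nstreams)
  type(cublasHandle) :: handle
  real(8) :: alpha = 1.0d0, beta = 0.0d0
  integer :: i, istat

  allocate(dA(n,n,nstreams), dB(n,n,nstreams), dC(n,n,nstreams))
  dA = 1.0d0; dB = 2.0d0; dC = 0.0d0

  istat = cublasCreate(handle)
  do i = 1, nstreams
    istat = cudaStreamCreate(streams(i))
  end do

  ! Rebind the handle to a different stream before each call so the
  ! individual DGEMMs can overlap on the device.
  do i = 1, nstreams
    istat = cublasSetStream(handle, streams(i))
    istat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &
                        alpha, dA(:,:,i), n, dB(:,:,i), n, beta, dC(:,:,i), n)
  end do
  istat = cudaDeviceSynchronize()

  do i = 1, nstreams
    istat = cudaStreamDestroy(streams(i))
  end do
  istat = cublasDestroy(handle)
end program stream_dgemm_sketch
```

The OLCF version of the demo does the same thing, but through its own hand-written interface to the C cuBLAS v2 functions rather than the vendor module.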
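Likewise, a rough sketch of the batched call that `cublas_batch_no_c` uses. Again, this is not the repo's code (sizes and names are invented); the point is that `cublasDgemmBatched` expects device-resident arrays of device pointers, which in CUDA Fortran can be assembled with `c_devloc` and copied to the device before the call.

```fortran
! Sketch only (assumed names and sizes, not the repo's code).
program batched_dgemm_sketch
  use cudafor
  use cublas_v2
  implicit none
  integer, parameter :: n = 64, nbatch = 16
  real(8), device, allocatable :: dA(:,:,:), dB(:,:,:), dC(:,:,:)
  real(8), allocatable         :: hC(:,:,:)
  type(c_devptr)         :: hAptr(nbatch), hBptr(nbatch), hCptr(nbatch)
  type(c_devptr), device :: dAptr(nbatch), dBptr(nbatch), dCptr(nbatch)
  type(cublasHandle) :: handle
  real(8) :: alpha = 1.0d0, beta = 0.0d0
  integer :: i, istat

  allocate(dA(n,n,nbatch), dB(n,n,nbatch), dC(n,n,nbatch), hC(n,n,nbatch))
  dA = 1.0d0; dB = 2.0d0; dC = 0.0d0

  ! cublasDgemmBatched wants an array of device pointers (one per matrix)
  ! that itself lives in device memory: build it on the host, then copy over.
  do i = 1, nbatch
    hAptr(i) = c_devloc(dA(1,1,i))
    hBptr(i) = c_devloc(dB(1,1,i))
    hCptr(i) = c_devloc(dC(1,1,i))
  end do
  dAptr = hAptr; dBptr = hBptr; dCptr = hCptr

  istat = cublasCreate(handle)
  ! One call launches all nbatch DGEMMs.
  istat = cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &
                             alpha, dAptr, n, dBptr, n, beta, dCptr, n, nbatch)
  istat = cublasDestroy(handle)

  hC = dC   ! with A = 1 and B = 2, every element of each C should be 2*n
  print *, 'max error =', maxval(abs(hC - 2.0d0*n))
end program batched_dgemm_sketch
```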
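`cublas_batch_acc` combines the batched call with OpenACC-managed memory. The sketch below shows only the data-directive part of that idea, and with a plain (non-batched) `cublasDgemm` to keep it short: an OpenACC data region owns the host/device transfers, and `host_data use_device` hands the device copies to cuBLAS. It is an illustration under those assumptions, not the repo's code.

```fortran
! Sketch only (assumed names and sizes, not the repo's code).
! Compile with something like: pgfortran -acc -ta=tesla -Mcudalib=cublas acc_sketch.f90
program acc_dgemm_sketch
  use cublas            ! legacy-style interfaces: no handle, character transa/transb
  implicit none
  integer, parameter :: n = 256
  real(8) :: a(n,n), b(n,n), c(n,n)
  real(8) :: alpha = 1.0d0, beta = 0.0d0

  a = 1.0d0; b = 2.0d0; c = 0.0d0

  !$acc data copyin(a, b) copy(c)
  ! Inside host_data, a, b, and c refer to their OpenACC device copies,
  ! so cuBLAS operates directly on device memory; no explicit cudaMemcpy.
  !$acc host_data use_device(a, b, c)
  call cublasDgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)
  !$acc end host_data
  !$acc end data

  ! c was copied back at "end data"; with A = 1 and B = 2 every element is 2*n.
  print *, 'max error =', maxval(abs(c - 2.0d0*n))
end program acc_dgemm_sketch
```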
The following testbeds were used to gather the results contained here:
- neddy is a GPU node provided by JLSE at ANL. It contains 1x NVIDIA P100 GPU. Details are found here.
- `cublas_stream_times.p100.csv` contains timings from running `cublas_stream_no_c` on neddy.
- `cublas_batch_times.p100.csv` contains timings from running `cublas_batch_no_c` on neddy.