cuBLAS Demos

This repo contains cuBLAS demos from several sources of documentation.

src/

cublas_acc_device calls cublasSswap from an OpenACC device kernel. It is from Section 6.2 of PGI's Fortran CUDA Library Interfaces, v. 2017.
cublas_stream calls cublasDgemm from the host using multiple streams. It is from OLCF's tutorial, Concurrent Kernels II: Batched Library Calls. Note that it uses a custom Fortran interface to the C cuBLAS v2 functions. It appears that, when the tutorial was written, NVIDIA did not provide a Fortran interface to cuBLAS v2.
cublas_stream_no_c is a version of cublas_stream that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2. It was written by me, Ron Rahaman.
cublas_batch calls cublasDgemmBatched to launch multiple dgemm operations with one call. It is also from OLCF's tutorial, Concurrent Kernels II: Batched Library Calls. Like cublas_stream, it uses a custom Fortran interface to the C cuBLAS v2 functions.
cublas_batch_no_c is a version of cublas_batch that uses NVIDIA's current (v2017) Fortran interfaces to cuBLAS v2. It was written by me, Ron Rahaman.
cublas_batch_acc is a version of cublas_batch_no_c that uses OpenACC data directives for host/device data transfers. It is intended to demonstrate the use of cuBLAS batched dgemm in a code that uses OpenACC for everything else. It was written by me, Ron Rahaman.
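
For readers unfamiliar with the batched interface, the following is a minimal CPU-only sketch in C of what a batched dgemm computes. It is not code from this repo and it does not call cuBLAS. The names ref_dgemm and ref_dgemm_batched are hypothetical, and the sketch assumes non-transposed, column-major matrices that all share the same dimensions. The real cublasDgemmBatched also takes a handle, transpose flags, and pointers to alpha and beta, and it expects its arrays of matrix pointers to reside in device memory.

```c
#include <stddef.h>

/* Column-major offset of element (row, col) in a matrix with leading dimension ld. */
size_t cm_index(int row, int col, int ld)
{
    return (size_t)row + (size_t)col * (size_t)ld;
}

/* Hypothetical CPU reference for one non-transposed GEMM:
 *   C = alpha * A * B + beta * C
 * where A is m x k, B is k x n, and C is m x n, all stored column-major. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *A, int lda,
               const double *B, int ldb,
               double beta, double *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) {
                acc += A[cm_index(i, p, lda)] * B[cm_index(p, j, ldb)];
            }
            C[cm_index(i, j, ldc)] = alpha * acc + beta * C[cm_index(i, j, ldc)];
        }
    }
}

/* Hypothetical CPU reference for a batched GEMM. A, B, and C are arrays of
 * batch_count matrix pointers; every matrix in the batch has the same shape.
 * The loop below runs the batch entries one after another on the host, purely
 * to show the arithmetic; the batched cuBLAS routine issues them in one call. */
void ref_dgemm_batched(int m, int n, int k, double alpha,
                       const double *const A[], int lda,
                       const double *const B[], int ldb,
                       double beta, double *const C[], int ldc,
                       int batch_count)
{
    for (int b = 0; b < batch_count; ++b) {
        ref_dgemm(m, n, k, alpha, A[b], lda, B[b], ldb, beta, C[b], ldc);
    }
}
```

A reference like this can serve as a host-side check: run it on the same inputs as one of the batched demos and compare the resulting C matrices against the GPU output within a floating-point tolerance.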

The following testbed was used to gather the results contained here:

neddy is a GPU node provided by JLSE at ANL. It contains 1x NVIDIA P100 GPU. Details are found here.

cublas_stream_times.p100.csv contains timings from running cublas_stream_no_c on neddy.
cublas_batch_times.p100.csv contains timings from running cublas_batch_no_c on neddy.