Does the --verify option work? #20
Comments
Hi, thank you for your interest. The --verify option loads content from the input file into a buffer and does a memcmp with what it got when reading from the disk. It's only useful if you have written the same data to the disk before (either using the --write option or the read_blocks sample program with the write option). You can also use the --output option (with or without --verify) to dump what was read from the disk to a file.
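To make the mechanism concrete, here is a minimal sketch of that comparison step. The helper name and signature are illustrative only, not the project's actual verification code:

```
#include <cstring>   // memcmp
#include <vector>

// Illustrative sketch of the --verify idea: the reference buffer loaded from
// the --input file is compared byte for byte against the buffer that was
// filled by reading from the disk. Name and signature are hypothetical.
static bool buffersMatch(const std::vector<unsigned char>& fileContent,
                         const std::vector<unsigned char>& diskContent)
{
    return fileContent.size() == diskContent.size()
        && std::memcmp(fileContent.data(), diskContent.data(), fileContent.size()) == 0;
}
```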
Thank you for your kind answer. I had misunderstood and thought that the verify option compares the data stored on the SSD with the data loaded in GPU memory. What I actually want to know is how we can access the data in GPU memory. Currently, I am studying your latency benchmark. Regarding the latency benchmark, can you kindly advise me how to access the data in GPU memory? Thank you
Hi,

With the --gpu option, the program allocates the memory buffer that the disk writes to or reads from on the GPU. In this code path, the --input option does an extra cudaMemcpy to load GPU memory before the benchmark, and the --verify option (and/or the --output option) does a cudaMemcpy from GPU memory in order to verify that the memory content is the same. Without --input, the buffer is memset to zero.

The most convenient way of verifying, in my opinion, is to use the --input, --verify and --write options. In this case nvm-latency-bench will load the file content into memory, write it to the disk, then read it back from the disk, and finally compare it with the original file content loaded in memory. If you use the --gpu option in addition to --input, --verify and --write, then nvm-latency-bench does the same thing, except that the benchmark buffer lives in GPU memory: the file content is copied to the GPU with cudaMemcpy, written to the disk from there, read back into GPU memory, and finally copied back to the host with cudaMemcpy for the comparison.
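As a rough sketch of that last copy-back-and-compare step, assuming nothing about the project's internals beyond what is described above (the helper is hypothetical):

```
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Sketch of the --gpu + --verify path: the disk read landed in GPU memory,
// so the content is copied back to the host before being compared with the
// original file content. Hypothetical helper, not the project's actual code.
static bool verifyGpuBuffer(const void* gpuBuffer,
                            const std::vector<unsigned char>& fileContent)
{
    std::vector<unsigned char> hostCopy(fileContent.size());

    // Copy what the disk wrote into GPU memory back to host memory
    if (cudaMemcpy(hostCopy.data(), gpuBuffer, hostCopy.size(),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        return false;
    }

    // Compare against the reference content loaded from the --input file
    return std::memcmp(hostCopy.data(), fileContent.data(), hostCopy.size()) == 0;
}
```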
I don't see from your first post that you compiled with CUDA support. The status messages when running the cmake command should confirm where the driver is located. In order for the above to work, you need to point cmake to the Nvidia driver so that building the kernel module can find the necessary symbols from the driver.

P.S. You should also have a look at the nvm-cuda-bench example if you're interested in having the CUDA kernel itself initiate disk reads/writes and access that memory.
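On the broader question of how to access data that sits in GPU memory: once the buffer has been filled (by the disk directly, or via a cudaMemcpy), any CUDA kernel can read it through an ordinary device pointer. A generic, self-contained illustration, unrelated to this project's internals:

```
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel that reads a device buffer and sums its bytes.
__global__ void sum_bytes(const unsigned char* buffer, size_t n, unsigned long long* result)
{
    unsigned long long local = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) {
        local += buffer[i];
    }
    atomicAdd(result, local);
}

int main()
{
    const size_t n = 4096;                 // one 4096-byte block, as in this thread
    unsigned char* buffer;
    unsigned long long* result;
    cudaMalloc(&buffer, n);
    cudaMallocManaged(&result, sizeof(*result));
    *result = 0;

    cudaMemset(buffer, 1, n);              // stand-in for data read from the disk

    sum_bytes<<<1, 256>>>(buffer, n, result);
    cudaDeviceSynchronize();
    printf("sum = %llu\n", *result);

    cudaFree(buffer);
    cudaFree(result);
    return 0;
}
```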
I also want to discuss the benchmark binary with the --verify option. We have tried to run nvm-latency-bench with both the --verify and --write options. The detailed settings are given by the following command:

nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --gpu 0 --output out.out

However, the function "verifyTransfer" still throws the exception. In order to check the output contents, we also used nvm-latency-bench with the --output option, but all bytes in the output file are filled with '0xFF'. Can you advise how to fix this problem?

We also tried the read_blocks sample program with the write option in order to verify the write operation. The detailed setting is as follows:

./nvm-read-blocks --write test.in --ctrl /dev/libnvm0 --block 1 --output out2.out

In this case, we checked that the output file 'out2.out' contains the same data written in the input file 'test.in'. I want to know the difference between the read_blocks and latency-bench programs in terms of the write operation. Thanks for your help.
Yes, this indicates that your system is not able to do PCIe peer-to-peer. There is no definitive list of which architectures support this, but in my experience workstation CPUs such as Xeons and other higher-end CPUs tend to support it, while consumer Core i3 to i7 CPUs do not. What CPU are you using? It is possible to put the disk and the GPU in an expansion chassis with a PCIe switch that supports peer-to-peer, but this is expensive and requires extra equipment.
I am using an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz. Might that be the reason for this problem?
Since the read_blocks example works, I believe so, yes. Reading only 0xFFs from device memory is generally a symptom of that. You can also drop the --gpu option but otherwise use the same options to nvm-latency-bench; if that works, then I'm pretty convinced that this is the issue.
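For example, this would be the command from the earlier comment with only the --gpu option removed, so the buffer stays in host memory while everything else is unchanged:

```
nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --output out.out
```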
Oh... Thank you for your kind answers. I might need to buy a new processor to actually achieve what I want. If you don't mind, can you tell me how much better sending data from the PCIe disk directly to the GPU is than going through main memory? I really want to know how promising sending data directly is. Thank you again.
Before you run off to buy that, please also check that you have a GPU that is able to do GPUDirect. In my experience, most Nvidia Quadro or Tesla GPUs are able to do this, while GeForce/GTX GPUs are not.

It depends on your workload: reading disk data into main memory and then copying it to GPU memory is slower and involves an extra copy. But, as I said, it depends heavily on the scenario. Most NVMe drives are x4 and unable to provide a high bandwidth because of that. If your workload or use case allows it, it is also possible to pipeline the disk I/O for your CUDA program by reading from disk ahead of time; in that case, using GPUDirect offers very little benefit.

So to answer your question, I made this primarily to see if it was possible to do. If you require very low latency or have sporadic disk access that is not easily predicted ahead of time, then this approach will have some benefit. I'm currently in the process of testing with multiple disks in order to fully saturate the x16 PCIe link to a GPU, and I'm also experimenting with doing work on the GPU at the same time (which will affect the GPU memory latency). But from the results I already see, I'm able to achieve maximum disk bandwidth and very low command completion latencies just by bypassing the kernel's block-device implementation. With the Intel Optane 900P and Intel Optane P4800X disks and Quadro P600 and Quadro P620 GPUs, I see up to 2.7 GB/s for reads and around 6-7 microsecond command completion latencies, even when accessing GPU memory.
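To illustrate the pipelining alternative mentioned above, here is a minimal double-buffered sketch using ordinary file reads and cudaMemcpyAsync. It is a generic pattern, not part of this project; process_chunk is a hypothetical kernel and the chunk size is arbitrary:

```
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel that consumes one chunk of disk data.
__global__ void process_chunk(const unsigned char* data, size_t n)
{
    // ... real work on the chunk would go here ...
}

int main()
{
    const size_t CHUNK = 1 << 20;            // 1 MiB per chunk (arbitrary)
    FILE* f = fopen("test.in", "rb");        // file name taken from this thread
    if (f == NULL) {
        return 1;
    }

    unsigned char* host[2];
    unsigned char* dev[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void**) &host[i], CHUNK, cudaHostAllocDefault); // pinned memory for async copies
        cudaMalloc((void**) &dev[i], CHUNK);
        cudaStreamCreate(&stream[i]);
    }

    // Double buffering: while the GPU processes one chunk on one stream, the
    // next chunk is read from disk and copied into the other buffer.
    int buf = 0;
    for (;;) {
        cudaStreamSynchronize(stream[buf]);          // previous work on this buffer is done
        size_t n = fread(host[buf], 1, CHUNK, f);    // read the next chunk ahead of time
        if (n == 0) {
            break;
        }
        cudaMemcpyAsync(dev[buf], host[buf], n, cudaMemcpyHostToDevice, stream[buf]);
        process_chunk<<<1, 256, 0, stream[buf]>>>(dev[buf], n);
        buf ^= 1;                                    // switch buffers
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(host[i]);
        cudaFree(dev[i]);
        cudaStreamDestroy(stream[i]);
    }
    fclose(f);
    return 0;
}
```

The point of the pattern is that the host-side read and copy of one chunk overlap with the kernel working on the previous chunk, hiding much of the I/O time when the access pattern is known ahead of time.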
Thank you very much for such a nice library. |
Thank you! As far as I know, GTX does not support GPUDirect RDMA; only Quadros and Teslas do.
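If you are on a recent CUDA toolkit, you can also query this programmatically instead of going by product line. The attribute below was only added in newer CUDA versions, so treat this as an optional check rather than something every setup in this thread supports:

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device = 0;
    int supported = 0;
    // cudaDevAttrGPUDirectRDMASupported exists only in newer CUDA toolkits
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrGPUDirectRDMASupported, device) == cudaSuccess) {
        printf("GPUDirect RDMA supported: %s\n", supported ? "yes" : "no");
    } else {
        printf("Could not query the attribute on this device\n");
    }
    return 0;
}
```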
Hi, I'm a student interested in GPU data processing.
I appreciate your great effort in implementing RDMA using GPUDirect. It is very useful for studying this research area.
By the way, while trying to run the project, I have run into some problems.
The current case is:
Trying to send data from SSD to GPU directly, without using smartio
Below is my cmake command line:
cmake .. -DCMAKE_BUILD_TYPE=Release -Dno_smartio=true -Dno_smartio_samples=true -Dno_smartio_benchmarks=true
And I tried to run the nvm-latency-bench program with the options below.
When I run the program, both show:
Verifying buffers... FAIL
Unexpected runtime error: Memory buffer differ from file content
Can you please give some advice?
Thank you.