Does the --verify option work? #20
Comments
Hi, thank you for your interest. The --verify option loads content from the input file into a buffer and does a memcmp with what it got when reading from the disk. It's only useful if you have written the same data to the disk before (either using the --write option or the read_blocks sample program with the write option). You can also use the --output option (with or without --verify) to dump what was read from the disk to a file.
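To make the mechanism concrete, here is a minimal sketch of that comparison step. The helper name and signature are illustrative only, not the project's actual verification code:

```
#include <cstring>   // memcmp
#include <vector>

// Illustrative sketch of the --verify idea: the reference buffer loaded from
// the --input file is compared byte for byte against the buffer that was
// filled by reading from the disk. Name and signature are hypothetical.
static bool buffersMatch(const std::vector<unsigned char>& fileContent,
                         const std::vector<unsigned char>& diskContent)
{
    return fileContent.size() == diskContent.size()
        && std::memcmp(fileContent.data(), diskContent.data(), fileContent.size()) == 0;
}
```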
Thank you for your kind answer. I had misunderstood and thought that the verify option compares the data stored on the SSD with the data loaded in GPU memory. What I actually want to know is how we can access the data in GPU memory. Currently, I am studying your latency benchmark. Regarding the latency benchmark, can you kindly advise me how to access the data in GPU memory? Thank you
Hi,

With the --gpu option, the program allocates the memory buffer that the disk writes to or reads from on the GPU. In this code path, the --input option does an extra cudaMemcpy to load GPU memory before the benchmark, and the --verify option (and/or the --output option) does a cudaMemcpy from GPU memory in order to verify that the memory content is the same. Without --input, the buffer is memset to zero.

The most convenient way of verifying, in my opinion, is to use the --input, --verify and --write options. In this case nvm-latency-bench will load the file content into memory, write it to the disk, then read it back from the disk, and finally compare it with the original file content loaded in memory. If you use the --gpu option in addition to --input, --verify and --write, then nvm-latency-bench does the same thing, except that the benchmark buffer lives in GPU memory: the file content is copied to the GPU with cudaMemcpy, written to the disk from there, read back into GPU memory, and finally copied back to the host with cudaMemcpy for the comparison.
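As a rough sketch of that last copy-back-and-compare step, assuming nothing about the project's internals beyond what is described above (the helper is hypothetical):

```
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Sketch of the --gpu + --verify path: the disk read landed in GPU memory,
// so the content is copied back to the host before being compared with the
// original file content. Hypothetical helper, not the project's actual code.
static bool verifyGpuBuffer(const void* gpuBuffer,
                            const std::vector<unsigned char>& fileContent)
{
    std::vector<unsigned char> hostCopy(fileContent.size());

    // Copy what the disk wrote into GPU memory back to host memory
    if (cudaMemcpy(hostCopy.data(), gpuBuffer, hostCopy.size(),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        return false;
    }

    // Compare against the reference content loaded from the --input file
    return std::memcmp(hostCopy.data(), fileContent.data(), hostCopy.size()) == 0;
}
```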
I don't see from your first post that you compiled with CUDA support. The status messages when running the cmake command should confirm where the driver is located. In order for the above to work, you need to point cmake to the Nvidia driver so that building the kernel module can find the necessary symbols from the driver.

P.S. You should also have a look at the nvm-cuda-bench example if you're interested in having the CUDA kernel itself initiate disk reads/writes and access that memory.
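On the broader question of how to access data that sits in GPU memory: once the buffer has been filled (by the disk directly, or via a cudaMemcpy), any CUDA kernel can read it through an ordinary device pointer. A generic, self-contained illustration, unrelated to this project's internals:

```
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel that reads a device buffer and sums its bytes.
__global__ void sum_bytes(const unsigned char* buffer, size_t n, unsigned long long* result)
{
    unsigned long long local = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) {
        local += buffer[i];
    }
    atomicAdd(result, local);
}

int main()
{
    const size_t n = 4096;                 // one 4096-byte block, as in this thread
    unsigned char* buffer;
    unsigned long long* result;
    cudaMalloc(&buffer, n);
    cudaMallocManaged(&result, sizeof(*result));
    *result = 0;

    cudaMemset(buffer, 1, n);              // stand-in for data read from the disk

    sum_bytes<<<1, 256>>>(buffer, n, result);
    cudaDeviceSynchronize();
    printf("sum = %llu\n", *result);

    cudaFree(buffer);
    cudaFree(result);
    return 0;
}
```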
I also want to discuss the benchmark binary with the --verify option. We have tried to run nvm-latency-bench with both the --verify and --write options. The detailed settings are given by the following command:

nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --gpu 0 --output out.out

However, the function "verifyTransfer" still throws the exception. In order to check the output contents, we also used nvm-latency-bench with the --output option, but all bytes in the output file are filled with '0xFF'. Can you advise how to fix this problem?

We also tried the read_blocks sample program with the write option in order to verify the write operation. The detailed setting is as follows:

./nvm-read-blocks --write test.in --ctrl /dev/libnvm0 --block 1 --output out2.out

In this case, we checked that the output file 'out2.out' contains the same data written in the input file 'test.in'. I want to know the difference between the read_blocks and latency-bench programs in terms of the write operation. Thanks for your help.
Yes, this indicates that your system is not able to do PCIe peer-to-peer. There is no definitive list of which architectures support this, but in my experience workstation CPUs such as Xeons and other higher-end CPUs tend to support it, while consumer Core i3 to i7 CPUs do not. What CPU are you using? It is possible to put the disk and the GPU in an expansion chassis with a PCIe switch that supports peer-to-peer, but this is expensive and requires extra equipment.
I am using an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz. Might that be the reason for this problem?
Since the read_blocks example works, I believe so, yes. Reading only 0xFFs from device memory is generally a symptom of that. You can also drop the --gpu option but otherwise use the same options to nvm-latency-bench; if that works, then I'm pretty convinced that this is the issue.
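For example, this would be the command from the earlier comment with only the --gpu option removed, so the buffer stays in host memory while everything else is unchanged:

```
nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --output out.out
```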
Oh... Thank you for your kind answers. I might need to buy a new processor to actually achieve what I want. If you don't mind, can you tell me how much better sending data from the PCIe disk directly to the GPU is than going through main memory? I really want to know how promising sending data directly is. Thank you again.
Before you run off to buy that, please also check that you have a GPU that is able to do GPUDirect. In my experience, most Nvidia Quadro or Tesla GPUs are able to do this, while GeForce/GTX GPUs are not.

It depends on your workload: reading disk data into main memory and then copying it to GPU memory is slower and involves an extra copy. But, as I said, it depends heavily on the scenario. Most NVMe drives are x4 and unable to provide a high bandwidth because of that. If your workload or use case allows it, it is also possible to pipeline the disk I/O for your CUDA program by reading from disk ahead of time; in that case, using GPUDirect offers very little benefit.

So to answer your question, I made this primarily to see if it was possible to do. If you require very low latency or have sporadic disk access that is not easily predicted ahead of time, then this approach will have some benefit. I'm currently in the process of testing with multiple disks in order to fully saturate the x16 PCIe link to a GPU, and I'm also experimenting with doing work on the GPU at the same time (which will affect the GPU memory latency). But from the results I already see, I'm able to achieve maximum disk bandwidth and very low command completion latencies just by bypassing the kernel's block-device implementation. With the Intel Optane 900P and Intel Optane P4800X disks and Quadro P600 and Quadro P620 GPUs, I see up to 2.7 GB/s for reads and around 6-7 microsecond command completion latencies, even when accessing GPU memory.
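To illustrate the pipelining alternative mentioned above, here is a minimal double-buffered sketch using ordinary file reads and cudaMemcpyAsync. It is a generic pattern, not part of this project; process_chunk is a hypothetical kernel and the chunk size is arbitrary:

```
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel that consumes one chunk of disk data.
__global__ void process_chunk(const unsigned char* data, size_t n)
{
    // ... real work on the chunk would go here ...
}

int main()
{
    const size_t CHUNK = 1 << 20;            // 1 MiB per chunk (arbitrary)
    FILE* f = fopen("test.in", "rb");        // file name taken from this thread
    if (f == NULL) {
        return 1;
    }

    unsigned char* host[2];
    unsigned char* dev[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void**) &host[i], CHUNK, cudaHostAllocDefault); // pinned memory for async copies
        cudaMalloc((void**) &dev[i], CHUNK);
        cudaStreamCreate(&stream[i]);
    }

    // Double buffering: while the GPU processes one chunk on one stream, the
    // next chunk is read from disk and copied into the other buffer.
    int buf = 0;
    for (;;) {
        cudaStreamSynchronize(stream[buf]);          // previous work on this buffer is done
        size_t n = fread(host[buf], 1, CHUNK, f);    // read the next chunk ahead of time
        if (n == 0) {
            break;
        }
        cudaMemcpyAsync(dev[buf], host[buf], n, cudaMemcpyHostToDevice, stream[buf]);
        process_chunk<<<1, 256, 0, stream[buf]>>>(dev[buf], n);
        buf ^= 1;                                    // switch buffers
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(host[i]);
        cudaFree(dev[i]);
        cudaStreamDestroy(stream[i]);
    }
    fclose(f);
    return 0;
}
```

The point of the pattern is that the host-side read and copy of one chunk overlap with the kernel working on the previous chunk, hiding much of the I/O time when the access pattern is known ahead of time.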
Thank you very much for such a nice library. |
Thank you! As far as I know, GTX does not support GPUDirect RDMA; only Quadros and Teslas do.
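If you are on a recent CUDA toolkit, you can also query this programmatically instead of going by product line. The attribute below was only added in newer CUDA versions, so treat this as an optional check rather than something every setup in this thread supports:

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device = 0;
    int supported = 0;
    // cudaDevAttrGPUDirectRDMASupported exists only in newer CUDA toolkits
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrGPUDirectRDMASupported, device) == cudaSuccess) {
        printf("GPUDirect RDMA supported: %s\n", supported ? "yes" : "no");
    } else {
        printf("Could not query the attribute on this device\n");
    }
    return 0;
}
```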
Hi, I'm a student interested in GPU data processing.
I appreciate your great effort in implementing RDMA using GPUDirect. It is very useful for studying this research area.
By the way, while trying to run the project, I have run into some problems.
The current case is:
Trying to send data from SSD to GPU directly, without using smartio
Below is my cmake command line:
cmake .. -DCMAKE_BUILD_TYPE=Release -Dno_smartio=true -Dno_smartio_samples=true -Dno_smartio_benchmarks=true
And I tried to run the nvm-latency-bench program with the options below.
When I run the program, both show:
Verifying buffers... FAIL
Unexpected runtime error: Memory buffer differ from file content
Can you please give some advice?
Thank you.