Higher abstraction level? #196
Hi Rene,

The main design goal of VexCL was to make GPGPU development as painless as possible, but since VexCL is based on standard APIs and does not hide the underlying objects from its users, it should be possible to use the more advanced features of its backends.
What exactly do you have in mind? OpenCL 2.0 pipes, or organizing individual kernels/tasks into a dataflow-like structure? In the former case, I have no idea how to expose the functionality in a generic way, so for now the only choice is writing custom kernels. In the latter case, one could in principle use OpenCL command queues and events to explicitly describe dependencies between tasks. There is no direct support for this in VexCL at this point either.
I think you can already do this if you dedicate one command queue to computations and another to data transfers.
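For instance, with the plain Khronos C++ bindings (a minimal sketch, not VexCL-specific API; error handling omitted), the two-queue setup could look like this:

```cpp
#include <CL/cl.hpp>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    cl::Context context(devices);

    // Two independent in-order queues on the same device:
    cl::CommandQueue compute_q (context, devices[0]);
    cl::CommandQueue transfer_q(context, devices[0]);

    // Kernels would go to compute_q, reads/writes to transfer_q; events
    // are then needed to order work between the two streams.
}
```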
Could you please open an issue with a specific problem?
Excuse me, I was not detailed enough. I meant: do you plan to support events somehow, so that it is possible to synchronize queues?
No, I didn't mean C++ drawbacks in VexCL, but in Marrow. VexCL is very nice and I have already learned a lot. I already made some small changes, like supporting pinned memory in the device vector (it is twice as fast on my NVIDIA card) and using the `map()` function of the `opencl::device_vector`. But before modifying the `opencl::kernel` I want to be sure you don't plan to modify it with synchronization events :)
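For reference, a minimal sketch of the usual OpenCL pinned-memory pattern (plain Khronos C++ bindings; this is not the actual wrapper code from the change mentioned above):

```cpp
#include <CL/cl.hpp>

// Allocate a buffer the driver can pin, then map it to obtain a host
// pointer; transfers through such a pointer can be DMA'd directly on
// most platforms, which is where the speedup comes from.
void pinned_example(cl::Context &context, cl::CommandQueue &queue, size_t bytes) {
    cl::Buffer pinned(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes);

    void *host_ptr = queue.enqueueMapBuffer(
            pinned, CL_TRUE /* blocking */, CL_MAP_WRITE, 0, bytes);

    // ... fill host_ptr and enqueue kernels/transfers that use the buffer ...

    queue.enqueueUnmapMemObject(pinned, host_ptr);
    queue.finish();
}
```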
One way I can think of making this possible is to introduce a thin wrapper for the expression being assigned to. Say, something like:

```cpp
let(x, null, eventA) = sin(y) * cos(y); // submit the kernel, return its event in eventA
let(z, eventA, null) = x * y;           // submit the kernel, make sure it starts after eventA is done
```

This would only make sense if the kernels are submitted to separate queues within the same context. That is, … If you are willing to work on this, I can provide further implementation hints.
I think this constructor of …
Yes, of course using pinned memory is already possible. I just wrote a wrapper which acts like the `map()` function, but adds a functor parameter.
So I could write something like this: …
If I understand correctly, the wrapper then works with a user event which is set to complete after the kernel itself has finished. And it is not possible to forward the given event to …
Isn't it the same as `this->load(buffer.map(), buffer.size());`?
lol yes, it took me two minutes to see it. Seems I'm thinking much too complicated.
We could introduce a method for the …
This way I could also calculate a histogram with a custom kernel in the processing queue, and wait for the `histo_ready_event` in the download queue to transfer it back to the host while the processing queue is doing other things. Sounds like a good extension to me.
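In plain OpenCL terms the flow could look like this (a sketch only; `histo_kernel`, `histo_buf`, and the queue objects are illustrative names assumed to be set up elsewhere):

```cpp
#include <CL/cl.hpp>
#include <vector>

// Enqueue the histogram kernel on the processing queue and download the
// result on a separate queue, gated by an event, so processing continues.
void download_histogram(cl::CommandQueue &processing_q, cl::CommandQueue &download_q,
        cl::Kernel &histo_kernel, cl::Buffer &histo_buf,
        size_t n, size_t bins, cl_uint *host_histo)
{
    cl::Event histo_ready_event;

    processing_q.enqueueNDRangeKernel(histo_kernel, cl::NullRange,
            cl::NDRange(n), cl::NullRange, NULL, &histo_ready_event);

    std::vector<cl::Event> wait(1, histo_ready_event);
    download_q.enqueueReadBuffer(histo_buf, CL_FALSE /* non-blocking */, 0,
            bins * sizeof(cl_uint), host_histo, &wait, NULL);

    // processing_q may keep doing other work here; the read starts only
    // once histo_ready_event completes.
}
```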
This only works with vector expressions (no additive expressions, no multiexpressions) for now. refs #196
This should be enough to organize dependency graph in a generic way, without changing the existing API. see #196
After some thought I decided not to implement … Instead, I've provided thin backend-independent wrappers for events, markers, and barriers. Using those, one can create a dependency graph of arbitrary complexity in a generic way. See the example in tests/events.cpp. There, … The same approach could be used to make transfers overlap with compute.
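For reference, the underlying OpenCL 1.2 mechanism such wrappers can build on looks like this in the plain C++ bindings (see tests/events.cpp for the actual VexCL-level API):

```cpp
#include <CL/cl.hpp>
#include <vector>

// Chain two queues: queue_b will not run anything enqueued after the
// barrier until the marker event recorded on queue_a has completed.
void chain_queues(cl::CommandQueue &queue_a, cl::CommandQueue &queue_b) {
    cl::Event a_done;
    queue_a.enqueueMarkerWithWaitList(NULL, &a_done);

    std::vector<cl::Event> wait(1, a_done);
    queue_b.enqueueBarrierWithWaitList(&wait, NULL);

    // Commands submitted to queue_b from here on start only after all
    // work enqueued to queue_a before the marker has finished.
}
```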
Looks good and understandable. Is the `finish()` required?
That `finish()` is only there to make sure x has some predictable values before …

Cheers,
Yes, of course, but `x = 2;` is a second kernel and can only start after x has been completely filled with 1, if I understood those asynchronous kernels correctly. So I think it's superfluous to call it.
`y = x` has no such restriction and may start before `x = 1` has finished. On my …
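To spell out the point in plain OpenCL terms (a sketch; the kernel and queue names are illustrative):

```cpp
#include <CL/cl.hpp>

// q1 is an in-order queue: kernel_a and then kernel_b run strictly in
// submission order, so no finish() is needed between them. Across q1 and
// q2, however, there is no implicit ordering: kernel_c may start before
// kernel_a/kernel_b finish unless an event dependency is added explicitly.
void ordering_demo(cl::CommandQueue &q1, cl::CommandQueue &q2,
        cl::Kernel &kernel_a, cl::Kernel &kernel_b, cl::Kernel &kernel_c,
        size_t n)
{
    q1.enqueueNDRangeKernel(kernel_a, cl::NullRange, cl::NDRange(n), cl::NullRange);
    q1.enqueueNDRangeKernel(kernel_b, cl::NullRange, cl::NDRange(n), cl::NullRange);

    q2.enqueueNDRangeKernel(kernel_c, cl::NullRange, cl::NDRange(n), cl::NullRange);
}
```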
f9baf22 lets the user explicitly choose which queue to submit a transfer operation into. tests/events.cpp shows an example of using several queues to overlap compute and transfer, and of using events to make sure the data is actually ready.
Okay ;), strange that I never missed it, but it is of course useful.
It won't work for the Boost.Compute backend, which is stricter than the Khronos C++ bindings regarding const correctness.
That could be a good idea, I'll think about it. I did not want to introduce extra checks into the original implementation of the read/write methods, but the cost should be negligible w.r.t. the OpenCL/CUDA API calls.
After doing some things and tests with VexCL, I will very soon be in the position that I need more. In particular, I'm thinking about stream processing and pipelining, overlapped transfers, and more, and I found there is a lot to do to make this work correctly.
Searching the internet, I found a good publication about an algorithmic framework called Marrow (source @ Bitbucket), which handles such things.
But after a quick look I saw a lot of C++ drawbacks, e.g. there is no const correctness at all.
Denis, do you think there is a way of combining these two worlds?