Higher abstraction level? #196
Hi Rene,

The main design goal of VexCL was to make GPGPU development as painless as possible, but since VexCL is based on standard APIs and does not hide the underlying objects from its users, it should be possible to use the more advanced features of its backends.
What exactly do you have in mind? OpenCL 2.0 pipes, or organizing individual kernels/tasks into a dataflow-like structure? In the former case, I have no idea how to expose the functionality in a generic way, so for now the only choice is writing custom kernels. In the latter case, one could in principle use OpenCL command queues and events to explicitly describe dependencies between tasks. There is no direct support for this in VexCL at this point either.
I think you can already do this if you dedicate one command queue to computations and another to data transfers.
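For instance, with the plain Khronos C++ bindings (a minimal sketch, not VexCL-specific API; error handling omitted), the two-queue setup could look like this:

```cpp
#include <CL/cl.hpp>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    cl::Context context(devices);

    // Two independent in-order queues on the same device:
    cl::CommandQueue compute_q (context, devices[0]);
    cl::CommandQueue transfer_q(context, devices[0]);

    // Kernels would go to compute_q, reads/writes to transfer_q; events
    // are then needed to order work between the two streams.
}
```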
Could you please open an issue with a specific problem?
Excuse me, I was not detailed enough. I meant: do you plan to support events somehow, so that it is possible to synchronize queues?
No, I didn't mean C++ drawbacks in VexCL, but in Marrow. VexCL is very nice and I have already learned a lot. I already made some small changes, like supporting pinned memory in the device vector (it is twice as fast on my NVIDIA card) and using the `map()` function of the `opencl::device_vector`. But before modifying the `opencl::kernel` I want to be sure you don't plan to modify it with synchronization events :)
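For reference, a minimal sketch of the usual OpenCL pinned-memory pattern (plain Khronos C++ bindings; this is not the actual wrapper code from the change mentioned above):

```cpp
#include <CL/cl.hpp>

// Allocate a buffer the driver can pin, then map it to obtain a host
// pointer; transfers through such a pointer can be DMA'd directly on
// most platforms, which is where the speedup comes from.
void pinned_example(cl::Context &context, cl::CommandQueue &queue, size_t bytes) {
    cl::Buffer pinned(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes);

    void *host_ptr = queue.enqueueMapBuffer(
            pinned, CL_TRUE /* blocking */, CL_MAP_WRITE, 0, bytes);

    // ... fill host_ptr and enqueue kernels/transfers that use the buffer ...

    queue.enqueueUnmapMemObject(pinned, host_ptr);
    queue.finish();
}
```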
One way I can think of making this possible is to introduce a thin wrapper for the expression being assigned to. Say, something like:

```cpp
let(x, null, eventA) = sin(y) * cos(y); // submit the kernel, return its event in eventA
let(z, eventA, null) = x * y;           // submit the kernel, make sure it starts after eventA is done
```

This would only make sense if the kernels are submitted to separate queues within the same context. That is, … If you are willing to work on this, I can provide further implementation hints.
I think this constructor of …
Yes, of course using pinned memory is already possible. I just wrote a wrapper which acts like the `map()` function, but adds a functor parameter.
So I could write something like this: …
If I understand correctly, the wrapper then works with a user event which is set to complete after the kernel itself has finished. And it is not possible to forward the given event to …
Isn't it the same as `this->load(buffer.map(), buffer.size());`?
lol yes, it took me two minutes to see it. Seems I'm thinking much too complicated.
We could introduce a method for the …
This way I could also calculate a histogram with a custom kernel in the processing queue, and wait for the `histo_ready_event` in the download queue to transfer it back to the host while the processing queue is doing other things. Sounds like a good extension to me.
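In plain OpenCL terms the flow could look like this (a sketch only; `histo_kernel`, `histo_buf`, and the queue objects are illustrative names assumed to be set up elsewhere):

```cpp
#include <CL/cl.hpp>
#include <vector>

// Enqueue the histogram kernel on the processing queue and download the
// result on a separate queue, gated by an event, so processing continues.
void download_histogram(cl::CommandQueue &processing_q, cl::CommandQueue &download_q,
        cl::Kernel &histo_kernel, cl::Buffer &histo_buf,
        size_t n, size_t bins, cl_uint *host_histo)
{
    cl::Event histo_ready_event;

    processing_q.enqueueNDRangeKernel(histo_kernel, cl::NullRange,
            cl::NDRange(n), cl::NullRange, NULL, &histo_ready_event);

    std::vector<cl::Event> wait(1, histo_ready_event);
    download_q.enqueueReadBuffer(histo_buf, CL_FALSE /* non-blocking */, 0,
            bins * sizeof(cl_uint), host_histo, &wait, NULL);

    // processing_q may keep doing other work here; the read starts only
    // once histo_ready_event completes.
}
```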
This only works with vector expressions (no additive expressions, no multiexpressions) for now. refs #196
This should be enough to organize dependency graph in a generic way, without changing the existing API. see #196
After some thought I decided not to implement … Instead, I've provided thin backend-independent wrappers for events, markers, and barriers. Using those, one can create a dependency graph of arbitrary complexity in a generic way. See the example in tests/events.cpp. There, … The same approach could be used to make transfers overlap with compute.
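For reference, the underlying OpenCL 1.2 mechanism such wrappers can build on looks like this in the plain C++ bindings (see tests/events.cpp for the actual VexCL-level API):

```cpp
#include <CL/cl.hpp>
#include <vector>

// Chain two queues: queue_b will not run anything enqueued after the
// barrier until the marker event recorded on queue_a has completed.
void chain_queues(cl::CommandQueue &queue_a, cl::CommandQueue &queue_b) {
    cl::Event a_done;
    queue_a.enqueueMarkerWithWaitList(NULL, &a_done);

    std::vector<cl::Event> wait(1, a_done);
    queue_b.enqueueBarrierWithWaitList(&wait, NULL);

    // Commands submitted to queue_b from here on start only after all
    // work enqueued to queue_a before the marker has finished.
}
```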
Looks good and understandable. Is the `finish()` required?
That `finish()` is only there to make sure x has some predictable values before …

Cheers,
Yes, of course, but `x = 2;` is a second kernel and can only start after x has been completely filled with 1, if I understood those asynchronous kernels correctly. So I think it's superfluous to call it.
`y = x` has no such restriction and may start before `x = 1` has finished. On my …
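To spell out the point in plain OpenCL terms (a sketch; the kernel and queue names are illustrative):

```cpp
#include <CL/cl.hpp>

// q1 is an in-order queue: kernel_a and then kernel_b run strictly in
// submission order, so no finish() is needed between them. Across q1 and
// q2, however, there is no implicit ordering: kernel_c may start before
// kernel_a/kernel_b finish unless an event dependency is added explicitly.
void ordering_demo(cl::CommandQueue &q1, cl::CommandQueue &q2,
        cl::Kernel &kernel_a, cl::Kernel &kernel_b, cl::Kernel &kernel_c,
        size_t n)
{
    q1.enqueueNDRangeKernel(kernel_a, cl::NullRange, cl::NDRange(n), cl::NullRange);
    q1.enqueueNDRangeKernel(kernel_b, cl::NullRange, cl::NDRange(n), cl::NullRange);

    q2.enqueueNDRangeKernel(kernel_c, cl::NullRange, cl::NDRange(n), cl::NullRange);
}
```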
f9baf22 lets the user explicitly choose which queue to submit a transfer operation into. tests/events.cpp shows an example of using several queues to overlap compute and transfer, and of using events to make sure the data is actually ready.
Okay ;), strange that I never missed it, but it is of course useful.
It won't work for the Boost.Compute backend, which is stricter than the Khronos C++ bindings regarding const correctness.
That could be a good idea, I'll think about it. I did not want to introduce extra checks into the original implementation of the read/write methods, but the cost should be negligible w.r.t. the OpenCL/CUDA API calls.
After doing some things and tests with VexCL, I will very soon be in the position that I need more. In particular, I'm thinking about stream processing and pipelining, overlapped transfers, and more, and I found there is a lot to do to make this work correctly.
Searching the internet, I found a good publication about an algorithmic framework called Marrow (source @ Bitbucket), which handles such things.
But after a quick look I saw a lot of C++ drawbacks, e.g. there is no const correctness at all.
Denis, do you think there is a way of combining these two worlds?