device-linear-multifrag execution mode #648
Comments
The basic idea looks very reasonable. But before implementation, I'd try to measure the possible gain by running some queries on GPU with different fragment sizes. By changing the fragment size you can run the same query with multi-fragment kernels, or with single-fragment kernels by setting a big enough fragment size (and maybe a config flag?). Then you'd be able to decide if the implementation complexity is worth it.
@akroviakov, could you please post the preliminary data here as well?
To see the benefit and compare to the current multifrag, one must be able to (1) linearize on GPU one by one and (2) modify some kernel (I'm thinking of the HT builder) to make it somewhat "linear". I'm currently working on linearizing on GPU, which needs something like
Can you run the same query on the same data imported with different fragment sizes? E.g., you can import 100M rows with fragment_size=1'000'000 and with fragment_size=100'000'000. The first would use a multi-fragment kernel with 100 fragments, and the other can be run with a single fragment.
Quick data on a simple group-by over 10'000'000 rows:
The difference due purely to memory layout is indeed not noticeable.
I'm rooting for the approach. I see no reason to linearize the data on the CPU. I also think kernel data sizes should be arbitrary. Here are some thoughts on the topic.

**Interleaved execution**

The multi-fragment execution approach potentially allows for interleaving data copying with execution. Having all the data linearized assumes (though not necessarily, but we imply it when we say the kernel will be simplified) that the iteration space for threading is now the whole linearized memory chunk (instead of a single fragment, as it is now). The proposed approach can employ the technique as well, but at a coarser granularity: interleaving will happen between kernels (a rough sketch of this overlap follows at the end of this comment). The tradeoff affects latency and is probably not a considerable point, since any latency-sensitive (small data) query is perfectly fine on the CPU.

**Memory management on re-fragmenting**

First, let's assume the data is small enough (for the other case, see below). In such a case there can be a situation where we would need to move an additional fragment to or from the device. A naïve approach would reallocate the device buffer and move the data again to preserve linearity. My take on it is twofold:
For larger datasets, a proportion on GPU would rather mean how many kernels we schedule to GPU, while each of them may occupy "all" (probably half, to allow interleaved execution?) of the available memory. In this case, there's no need for fine-grained fragment management.
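The coarse-grained copy/compute overlap mentioned above could look roughly like the following. This is only a sketch under CUDA-runtime assumptions; `launch_fragment_kernel`, the double-buffered device region, and the event handling are hypothetical placeholders, not HDK code.

```cpp
// Sketch only: two CUDA streams and a double-buffered device region, so the
// H2D copy of fragment i+1 can overlap with the kernel for fragment i.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

void launch_fragment_kernel(const void* dev_frag, size_t bytes, cudaStream_t s);  // hypothetical

void run_interleaved(const std::vector<const void*>& host_frags,
                     const std::vector<size_t>& frag_bytes,
                     void* dev_buf[2]) {
  cudaStream_t copy_stream, exec_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&exec_stream);

  const size_t n = host_frags.size();
  std::vector<cudaEvent_t> copied(n), done(n);
  for (size_t i = 0; i < n; ++i) {
    cudaEventCreate(&copied[i]);
    cudaEventCreate(&done[i]);
  }

  for (size_t i = 0; i < n; ++i) {
    void* buf = dev_buf[i % 2];
    // This buffer was last used by kernel i-2; wait for it before overwriting.
    if (i >= 2) cudaEventSynchronize(done[i - 2]);
    cudaMemcpyAsync(buf, host_frags[i], frag_bytes[i],
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copied[i], copy_stream);
    // The kernel for fragment i waits only for its own copy, not for all copies.
    cudaStreamWaitEvent(exec_stream, copied[i], 0);
    launch_fragment_kernel(buf, frag_bytes[i], exec_stream);
    cudaEventRecord(done[i], exec_stream);
  }
  cudaStreamSynchronize(exec_stream);
}
```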
This is an expected result, since for a simple group-by query you replace a global loop with more work for each thread, essentially saving a few always-taken branch instructions. The benefit of simplicity, as we discussed, may come from those checks (or rather their absence) in the HT build.
Does it mean linear-multifrag can be beneficial only when we have proper runtime/codegen to remove the multifrag costs we don't need when running on a single linear chunk? Then I'd suggest working on this first, to get an opportunity to measure the real benefit of big linear chunks through experiments with fragment size. A separate issue, I guess, is the efficiency of data fetch to GPU devices. Currently, we can make unnecessary copies of fragment data to CPU memory prior to copying it to the GPU, even when we fetch a single fragment. In many cases a fragment is not zero-copy accessible on the CPU, and therefore we linearize it just to copy it to the GPU. If we could change that to copy zero-copy-accessible fragment parts directly to the GPU, we could save some time and CPU memory. After that, the solution might be extended to multi-frag linearized fetches.
I think so. I don't see much reason for a linear chunk to be significantly superior in most common cases. It does save us some overhead on branching, but that's about it. Joins, however, are a different story. The issue with data fetch is an evident one, I agree.
A few thoughts...

On GPU execution, we already support multi-fragment kernels. Therefore, we could have arbitrarily sized fragments on GPU (yes, there is some outer-loop overhead, but that seems negligible, and let's ignore copy overheads for now), except that we need per-fragment metadata to properly size data structures - though, because we do not have any concept of sharding, we actually don't size data structures for anything more than the full data. So, multi-frag execution of arbitrary size on GPU should be doable today quite easily.

As an aside, linearization cost is very high for varlen data. JoinColumnIterator was created to support joins on varlen data across multiple fragments. It was never finished, though, because the payload of the hash table really needs the ability to index into the data. At the end of the day, though, it is just a multi-index, and I cannot imagine the overhead of adding to two integers is much higher than the overhead of adding to one (a rough illustration of the two addressing schemes follows this comment).

On CPU, we do not support multi-fragment kernels because it seems redundant. On GPU you have kernel launch overhead; on CPU, not so much. That being said, it probably would not be too hard to add multi-fragment capability. But I think that is probably misguided. What you really want is to be able to assign each kernel some set of rows to compute - those rows may come from 10 fragments if the kernel is simple, or from less than one fragment if the kernel is complex (e.g. sub-fragments). The metadata problem is as above, right? You only care about whole-table stats, not per-fragment stats, at the end of the day. So, if you could (1) arbitrarily slice storage and assign it to kernels and (2) parametrize the kernels to generically take a set of rows to process, you could have both multi-fragment kernels and sub-fragments with just one code path (and again, likely minimal overhead - but it would be interesting to test).

There is one place where per-fragment metadata can be incredibly useful, though, and that is fragment skipping. If we ingest data in date order, for example (lots of data is generated in date order - think of point-of-sale records, for one), and then query on a date range, we can avoid even loading a lot of data if we have per-fragment metadata. Since we have to compute metadata anyway, it may make sense to leave the concept of fragment in storage and use the per-fragment metadata for fragment skipping on filter queries. I don't know how prevalent this is in our use cases or benchmarks, though.
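To make the "adding to two integers vs. adding to one" comparison concrete, here is a rough illustration of the two addressing schemes. The types are made up for the example; this is not HDK's actual JoinColumnIterator.

```cpp
// Illustrative only -- the shape of the two addressing schemes being compared.
#include <cstddef>
#include <cstdint>

struct FragmentedColumn {
  const int64_t* const* frag_data;  // one base pointer per fragment
  const size_t* frag_rows;          // row count per fragment
  size_t num_frags;
};

// Multi-fragment addressing: the "multi-index" is (fragment, row-in-fragment);
// advancing it costs an extra compare-and-branch at every step.
struct MultiFragPos {
  size_t frag = 0;
  size_t row = 0;
  void advance(const FragmentedColumn& col) {
    if (++row >= col.frag_rows[frag]) {  // per-step boundary check
      row = 0;
      ++frag;
    }
  }
  int64_t load(const FragmentedColumn& col) const {
    return col.frag_data[frag][row];
  }
};

// Linearized addressing: one base pointer, one integer, no branch.
inline int64_t load_linear(const int64_t* linear_col, size_t global_row) {
  return linear_col[global_row];
}
```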
In relation to the idea: if it has to be sent to the GPU anyway and we know the total size of what we are sending, we can assemble the data linearly without any additional cost, can't we?
We iterate there using `operator++`, which checks whether we are inside the chunk for each row of each column. Isn't such a branch a performance penalty on GPU? Maybe it concerns only this unfinished iterator.
That is the biggest issue: we have columns that are chunked (ArrowStorage ChunkedArray) because we read CSVs in 20MB blocks. That means that, e.g., some partition columns of the 8GB (20Mil. rows) Taxi dataset get sliced into ~407 arrays (of 48.5-50k rows each). With parquet it is a bit better, but still ~153 arrays (~131k rows per array). Getting arbitrary slices (i.e., fragments) would in the end mean redundant linearized materialization of these arrays. Currently we materialize fragments that require linearization (so most of the fragments). Any fragment load onto GPU also materializes on CPU (memory levels), if it requires linearization. The big plan is to move away from fragments and operate directly on these arrow chunks, which will be zero-copy for CPU, so there is no actual need to materialize data on CPU. But on GPU we have to materialize data anyway, so why not walk over the chunks and directly assemble them into a linear GPU buffer (i.e., a set of rows of arbitrary size)? We can stick to fixed-size types for now in this execution mode. This way CPU (with multifragment or "multichunk") still has good granularity, and GPU can easily slide through its set of rows. This way both CPU and GPU will be in their optimal setting for a step/pipeline execution, regardless of the fragment/chunk size.
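A minimal sketch of that "walk over chunks and assemble directly into a linear GPU buffer" idea, assuming a CUDA runtime and fixed-size types. `HostChunk` is a made-up stand-in for an arrow chunk view, and real code would go through HDK's buffer/memory manager instead of raw `cudaMalloc`.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical view of one arrow chunk of a fixed-size-type column.
struct HostChunk {
  const void* data;
  size_t num_rows;
};

// Copy each CPU-side chunk to its offset in a single device allocation, so
// the GPU kernel sees one contiguous column regardless of host chunking.
void* assemble_column_on_device(const std::vector<HostChunk>& chunks,
                                size_t elem_size, cudaStream_t stream) {
  size_t total_rows = 0;
  for (const auto& c : chunks) total_rows += c.num_rows;

  void* dev_col = nullptr;
  cudaMalloc(&dev_col, total_rows * elem_size);

  size_t offset = 0;
  for (const auto& c : chunks) {
    cudaMemcpyAsync(static_cast<char*>(dev_col) + offset * elem_size,
                    c.data, c.num_rows * elem_size,
                    cudaMemcpyHostToDevice, stream);
    offset += c.num_rows;
  }
  return dev_col;  // caller frees the whole linear buffer at once
}
```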
This is supposed to be part of implementing this execution mode for GPU.
Aren't we doing it when deciding which fragments to skip on dispatch somewhere in
It's true, we have to adjust offsets for varlen data. But we pay this price anyway even for CPU now because arrow chunks don't match fragments and we make an adjusted copy of the offset buffer on fetch.
Actually, we do support it. When the aggregation result is so big that the reduction kills any gain of multiple kernels, we run a single multi-fragment kernel on CPU.
Yes, this is the goal. An additional challenge here is to use zero-copy fetch as much as possible, i.e., follow arrow chunks for imported tables, or ResultSets when working with execution results. Following existing arrow chunks in ArrowStorage can be quite tricky, though. Different columns might have different chunking in our arrow data, because some columns keep the original split while others were linearized during null/data transformations. The resulting batches would actually depend on the set of rows we want to use in a query. The easiest solution is to always use the most granular split we have for all columns. I guess it should be OK for CPU unless chunks become extremely small.
Well, one issue I can think of is that you lose your notion of where you are in the original dataset in "storage". Consider
I don't follow - why can't we just maintain Arrow chunks of whatever size is optimal, and slice them globally using some notion of a "virtual fragment" when we fetch data for the query?
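For what it's worth, such a "virtual fragment" could be as simple as a global row range resolved against whatever chunk sizes the column happens to have. A rough sketch with made-up names, not an existing HDK structure:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical "virtual fragment": a global [start, end) row range resolved
// to slices of the arrow chunks that happen to back the column.
struct ChunkSlice {
  size_t chunk_idx;
  size_t first_row;   // offset inside the chunk
  size_t num_rows;
};

std::vector<ChunkSlice> resolve_virtual_fragment(
    const std::vector<size_t>& chunk_sizes, size_t start, size_t end) {
  std::vector<ChunkSlice> slices;
  size_t chunk_begin = 0;
  for (size_t i = 0; i < chunk_sizes.size() && chunk_begin < end; ++i) {
    const size_t chunk_end = chunk_begin + chunk_sizes[i];
    if (chunk_end > start) {
      const size_t lo = std::max(start, chunk_begin);
      const size_t hi = std::min(end, chunk_end);
      slices.push_back({i, lo - chunk_begin, hi - lo});
    }
    chunk_begin = chunk_end;
  }
  return slices;
}
```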
I agree that having a linear buffer for a column on GPU, where we place loaded fragments one after another, doesn't make a lot of sense in case there is
I was thinking about sticking to the current multifrag infrastructure of fragment bookkeeping, but adding a LINEAR flag to the kernel that would tell the GPU to treat the data as one fragment. So given an index of a qualifying tuple, we can easily get the fragID and offset, because we know where we started, fragments were loaded in their sequential order, and only the last fragment can be underfilled (a small sketch of that mapping is at the end of this comment).

I would also like to bring up another aspect of linearity on GPU: GPU sharing between different HDK instances in heterogeneous mode. Heterogeneous mode implies that we want to have more fragments than there are logical CPU cores, to actually share the workload, right? So the more logical cores a CPU has, the more we put the GPU at a disadvantage, as it now has more outer-loop iterations. This seems negligible, but only up to a point: for example, in a system with 64 logical cores and 128 fragments, 3 out of 4 taxi queries (40Mil. rows) run up to 2x slower on GPU than when we have 16 fragments. Now consider the following scenario: same dataset, HDK instance (1) is running many queries on GPU (even if partially), and HDK instance (2) also wants to run queries on GPU. As expected, the larger the data proportion we run on GPU in (1), the slower we are.

Do you think this is caused by what we are discussing here in the context of linearity on device, or could it be something else?
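Under the assumptions above (fragments appended in load order, uniform fragment size, only the last one possibly underfilled), recovering the fragment id and local offset from a global row index is just a division and a remainder. A small illustrative sketch; the names are not HDK's:

```cpp
#include <cstdint>

// Fragments were copied to the device back-to-back in their original order,
// and every fragment except possibly the last holds exactly `frag_size` rows.
struct FragPos {
  int32_t frag_id;
  int64_t offset;
};

inline FragPos to_fragment_position(int64_t global_row,
                                    int64_t frag_size,
                                    int32_t first_frag_id) {
  return FragPos{first_frag_id + static_cast<int32_t>(global_row / frag_size),
                 global_row % frag_size};
}
```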
I don't know why
As of now, HDK's heterogeneity looks like this:
The currently disabled multifragment option does the following:
Why is current multifragment good:
Why is it not that good:
Fragment bookkeeping inside the kernel (`JoinColumnIterator& operator++()` and related for loops) likely eats some performance from the `executePlan` stage, which is often the costliest.

Basic idea:
The best case for the kernel is if all columns are linear (simple loops over dense data). All GPU-assigned fragments have to be materialized anyway, so why not place them linearly?
We can add a new execution mode where fragments (regardless of their position in host memory) are assembled linearly on GPU. This way the fragments become transparent to the GPU kernel: it will treat them as one big fragment. We still know the address of each fragment and can get it back to the CPU, but we cannot free individual fragments, only the whole linear buffer at once. Such behavior is not different from just using a big fragment size.
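A rough sketch of what the host-side bookkeeping for such a mode might look like (assumed names, not existing HDK structures): one device allocation per column plus the start offset of each fragment, so a fragment can still be located and copied back, but only the whole buffer can be freed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Kernels see a single (dev_base, total_rows) pair and iterate it as one
// fragment; the host keeps per-fragment offsets for copy-back, but memory can
// only be released by freeing the whole linear buffer.
struct LinearDeviceColumn {
  void* dev_base = nullptr;             // single device allocation
  size_t total_rows = 0;
  size_t elem_size = 0;
  std::vector<int32_t> frag_ids;        // fragments, in load order
  std::vector<size_t> frag_row_offset;  // start row of each fragment

  const void* fragment_ptr(size_t i) const {
    return static_cast<const char*>(dev_base) +
           frag_row_offset[i] * elem_size;
  }
};
```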
Why is it supposed to be better?
What's the catch?
This won't bring any benefit for CPU, as it is expensive and redundant (with regard to memory) to linearize fragments there, and likely pointless anyway, since each CPU thread processes one fragment (which is already linear) at a time.
When does this execution mode make sense:
Fetching chunks takes noticeably less time than `executePlan`, so possible savings in `executePlan` might justify a small increase in latency caused by chunk placement.

Summary:
GPUs will be able to treat any number of fragments as one big fragment of arbitrary size, which aligns with the goal of heterogeneity and with the GPU's preferred programming model. But there's a price: memory management granularity + bandwidth. The question is whether the price is justified. After all, it will be just an option, so if bandwidth is the bottleneck, one could simply switch to the current way of multifragment execution. But if one knows that a workload needs only a handful of columns, why not get the most out of the hardware?
The purpose of the related work is to see if this brings noticeable benefits at all; if not, then we will at least know where not to look.
Any tips, suggestions, and criticism are welcome.