Large reduced result size #136
Good question. I should have clarified it in the documentation.
Actually, this is not the case. As long as you have a "semantically associative" reducing function, you can return a different type. A mutating reducing function `op!` is also fine, since its pure counterpart `op(x, y) = op!(deepcopy(x), y)` is associative. For example:

```julia
julia> reduce(append!, Map(identity), 1:10; init=OnInit(() -> Float64[]))
10-element Array{Float64,1}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0
```

Note that I use `OnInit` so that each basecase of the parallel reduce starts from its own freshly allocated `Float64[]` instead of sharing a single `init` value across tasks.

See also: https://tkf.github.io/Transducers.jl/dev/examples/tutorial_parallel/#Example:-ad-hoc-histogram-1

There is also an interface for using a different reducing function for the basecases and for combining their results, but that's not documented yet.
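To make the associativity claim concrete, here is a minimal sketch (the names `op!` and `op` are illustrative, not part of the Transducers.jl API):

```julia
# op! mutates its first argument; op is its pure, copying counterpart.
op!(acc, x) = append!(acc, x)
op(acc, x) = op!(deepcopy(acc), x)

# op is associative: both groupings concatenate in the same left-to-right
# order. This is what makes the mutating op! safe to use in a parallel
# reduce, provided each basecase owns its accumulator (init=OnInit(...)).
a, b, c = [1.0], [2.0], [3.0]
@assert op(op(a, b), c) == op(a, op(b, c))
```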
I looked in the documentation and I didn't quite "get" it. Trying to abstract the various parallel loops in my lower-level code, I see the following recurring pattern: compute independent steps in parallel, merge each step's result into a per-thread accumulator, and finally merge the per-thread accumulators into a single result.

This method maximizes parallelism and minimizes memory allocations. The interface to such a generic map/reduce loop would require roughly the functions sketched below.

Naturally, the result of the `merge_into` functions must not depend on the invocation order. This describes the behavior when using multiple threads. When also using multiple processes, once all threads in a process are done, the per-process result has to be sent back and merged on the invoking process, so in this case we also need a way to move accumulators across process boundaries.
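A hypothetical sketch of that interface (`make_accumulator` is my name for the allocation step; `compute_step` and `merge_into!` follow the names used above; none of this is an existing API):

```julia
# Hypothetical threads-only version of the pattern described above.
function threaded_mapreduce(make_accumulator, compute_step, merge_into!, steps)
    nt = Threads.nthreads()
    accs = [make_accumulator() for _ in 1:nt]      # one accumulator per thread
    Threads.@threads for i in eachindex(steps)
        # NOTE: threadid-indexed state is subtle on Julia >= 1.7, where tasks
        # can migrate between threads; this is only a sketch of the idea.
        acc = accs[Threads.threadid()]
        merge_into!(acc, compute_step(steps[i]))   # per-thread, no locking
    end
    for t in 2:nt                                  # final serial merge
        merge_into!(accs[1], accs[t])
    end
    return accs[1]
end

# Usage: sum of squares, with a one-element vector as the accumulator.
result = threaded_mapreduce(() -> [0.0], i -> i^2, (a, x) -> a .+= x, 1:1000)
```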
I also find myself with a second type of parallel loop, where the overall result is an array, or a set of arrays, possibly memory-mapped to a file. Each iteration/step writes into different entries of the array, so there is no reduction step at all. If the result array is just a normal in-memory array, then one simply needs to run all the iterations/steps on threads in the current process (still allocating the result array(s) up front; see the sketch below). Of course, the compute_step function must be written such that each element of the target in-memory array(s) is written by only one such call.

If the result array is memory-mapped from a file, we can distribute the work across multiple processes. This requires an additional step that constructs the memory-mapped result arrays in each process, and again the compute_step function must be written such that each element is written by only one call.

Side note: I'm worried about resource management in the distributed scenarios. File descriptors are a finite resource the GC isn't really aware of. I'd love to have an additional cleanup function, which would be invoked when each process is done with its work. But Julia doesn't seem to allow me to put anything useful in such functions. There's not even a hint, that I know of, to tell the GC "this object is probably not used by anyone anymore, please try to collect it soon to release the non-memory resources associated with it".

At any rate, I found these patterns to be generic; they cropped up in many places in my code base (Python, horrible, don't ask...). At the same time, these patterns are certainly at a lower abstraction level than the classic map/reduce, fold, etc. For example, they refer to a specific process/thread topology, so the distributed API is different from the threaded API: explicit global data, constructing the memory-mapped result arrays per process, etc. I find it difficult to figure out from the parallel Transducers documentation how to map these patterns to the API, if that is possible at all. Would the Transducers library reduce the pain of creating an API for such patterns? Or would it make more sense to implement them directly on top of the language primitives?
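A minimal threads-only sketch of this second, in-place pattern (the name `compute_step!` and the ownership convention are illustrative):

```julia
# Each step writes only the entries of `result` that it owns, so no merge
# step is needed; @threads provides the implicit barrier at the end.
function threaded_fill!(compute_step!, result, steps)
    Threads.@threads for i in eachindex(steps)
        compute_step!(result, steps[i])  # must touch only indices owned by step i
    end
    return result
end

# Usage: step i owns exactly entry i of the preallocated result array.
ys = Vector{Float64}(undef, 100)
threaded_fill!((r, i) -> r[i] = sqrt(i), ys, 1:100)
```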
What you describe here is implemented in Transducers.jl's reduce. By the way, in Julia, the granularity of thread programming is the task, not the thread. So you can (or rather have to) let the runtime decide which thread actually runs each task.

You can just use side effects in the reducing function (see the sketch below). I don't think you need a separate API for this, as long as you can ensure each index is set only once.

I think Referenceables.jl, used in the NDReducibles.jl examples, is an interesting way to ensure this property.

My impression is that these are lower-level (or equivalent-level) APIs than transducers and reduce. Also, I think it'd be better to factor out most of your API as a separate "resource pool manager" and try not to mix it with the data-parallel part. Then maybe it can be combined with Transducers.jl by acquiring per-basecase (per-task) resources in `OnInit`.
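A sketch of the side-effect approach using the Transducers.jl API (`right` is the reducing function that simply returns its second argument; `ys` is an illustrative output array, and exact signatures may differ across versions):

```julia
using Transducers
using Transducers: right  # right(l, r) == r; discards the accumulated value

xs = 1:10
ys = zeros(length(xs))

# Every input index is visited exactly once, so writing ys[i] from inside
# the transducer is safe; the fold's own result is thrown away.
foldl(right, Map(i -> (ys[i] = xs[i]^2; nothing)), xs)

# The thread-parallel version is the same idea, e.g.
# reduce(right, Map(i -> (ys[i] = xs[i]^2; nothing)), xs).
```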
Does that mean that Transducers perform a dynamically scheduled reduction tree between the threads (not necessarily a balanced tree) and that when using multiple processes, the invoking process uses its own multiple threads in a final dynamically scheduled reduction tree to obtain the final result?
This may be the core of why it is unclear to me how to map my scenarios to existing APIs. To fully optimize the code I’d like to explicitly manage per-process, per-thread and of course per-task state. Per-process state is shared by tasks running on multiple threads in parallel, per-thread state is shared by tasks running serially on the same thread, and per-task state should be allocated only once and reset for each task. It isn’t clear to me how to create such a setup.
Agreed.
You mean, the part that deals with the per-process/thread/task state? Sure, that sounds reasonable. That should bridge the gap between what I have in mind and all sorts of existing APIs. I need to figure out the exact API that makes sense here...
The end result is correct, but it's a bit different in that I don't do any scheduling in Transducers.jl; I just let Julia's task scheduler handle it. The recursion looks roughly like this:

```julia
using Base.Threads: @spawn

function _reduce(rf, init, xs)
    if issmall(xs)                      # small enough for a single task
        return foldl(rf, init, xs)      # sequential basecase
    else
        left, right = halve(xs)         # split the input in two
        task = @spawn _reduce(rf, init, right)  # right half on another task
        a = _reduce(rf, init, left)     # left half on the current task
        b = fetch(task)
        return combine(rf, a, b)        # merge the two sub-results
    end
end
```

(`issmall`, `halve`, and `combine` are internal helpers.)
Yes, exactly. BTW, just FYI, I think it's conceivable that a ...
The simple code described above is fine. It is an example of "static scheduling" (a balanced binary tree). If one is super-picky about optimization, it is possible to squeeze out some additional performance by chopping the loop into "small" regions, processing them in parallel (in arbitrary order), and merging their results as they arrive (in whatever order). The resulting reduction tree wouldn't be balanced, and would include merges between non-adjacent sub-ranges. I agree this is probably overkill for many usages, as the static scheduling is pretty fast.

The same issues occur when some of the reductions are done on other worker processes: merging the results returned from each such sub-process need not be done in order (adjacent ranges) but could be done opportunistically. Again, this would greatly complicate the code; the actual saving might not be worth it in many cases, but might be justified in others.
No. It is not static scheduling, since Julia's scheduler is depth-first.

Thanks to depth-first scheduling, the tree does not have to be balanced. That's why I can implement an efficient, deterministically terminatable parallel reduce with such a small amount of code. But if you want control over much lower-level scheduling details, I understand that the current API is not enough. Maybe you are interested in: RFC: Make an abstraction for custom task parallel schedulers (GSoC proposal) - Internals & Design - JuliaLang

Note that this requires the reducing function to be commutative (as well as associative). If you can impose this constraint, maybe JuliaLang/julia#34185 is a nice resource. A sketch of such an opportunistic, commutative merge follows below.
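For illustration, a minimal sketch of the opportunistic merge described above, assuming a commutative and associative `op` (the function names are illustrative, not Transducers.jl API):

```julia
# Process chunks in parallel and merge partial results in completion order.
# Correct only if op is both commutative and associative.
function opportunistic_reduce(op, f, chunks)
    results = Channel{Any}(length(chunks))
    for c in chunks
        Threads.@spawn put!(results, f(c))  # whichever chunk finishes first
    end
    acc = take!(results)
    for _ in 2:length(chunks)
        acc = op(acc, take!(results))       # merge in arrival order
    end
    return acc
end

# Usage: sum of squares over four chunks (+ is commutative).
opportunistic_reduce(+, c -> sum(i^2 for i in c), [1:250, 251:500, 501:750, 751:1000])
```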
I was imprecise. When I said "static" I meant the order and structure of the reduction tree, not the timing of the operations. That is, by "dynamic" I meant performing the reductions in an arbitrary order. This is especially important when merging the results from worker processes. With a balanced, in-order reduction tree, the last slow process requires an additional log(number of workers) reductions after it is done; with out-of-order reductions, it requires only a single final reduction.
Thanks for the explanation. It clarifies a lot. Indeed, I was regarding log(number of workers) as "not a big deal". But I understand that it can be a big deal when you have an unpredictably varying workload per element.

Can't you still do this with two final reductions, without assuming that the reducing function is commutative? I think you just have to track what is on the left and what is on the right. This kind of dynamic/non-deterministic scheduler is not implemented yet. But that would be an interesting addition.
It has to be the full log(n), because you can't perform in advance any of the reductions on the path from the slow process's result to the root (the process with its neighbor leaf, that result with the neighboring merge of two process results, etc.). In my case, >16 machines, 5 reductions, that's not too bad, but it still stings a bit; Amdahl is an unforgiving taskmaster :-)
I don't think so. Consider a map-reduce whose chunk results are `a`, `b`, `c`, and `d` (in input order), where `b` is the last to arrive. While waiting, you can already compute `op(c, d)`, since `c` and `d` are adjacent. After `b` arrives, only two reductions remain: `op(b, op(c, d))` and then `op(a, ...)`. In general, you can pre-merge everything to the left and everything to the right of the late chunk, so at most two final reductions are needed, without ever swapping operands.
You are right, I was wrong. I was thinking about an inflexible reduction tree (the trivial code you posted above). But if you only enforce the constraint that reductions are applied in order, you can "rotate" the tree so that a late-arriving result would indeed need only two reductions.
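A sketch of that order-preserving strategy (illustrative code, not an existing API): keep finished partial results keyed by the index range they cover, and merge each new result with its immediate neighbors as soon as they are contiguous, so `op` must be associative but need not be commutative:

```julia
function ordered_opportunistic_reduce(op, f, chunks)
    results = Channel{Tuple{Int,Int,Any}}(length(chunks))
    for (i, c) in enumerate(chunks)
        Threads.@spawn put!(results, (i, i, f(c)))
    end
    done = Dict{Int,Tuple{Int,Any}}()  # left index => (right index, value)
    left = Dict{Int,Int}()             # right index => left index
    for _ in 1:length(chunks)
        (lo, hi, v) = take!(results)
        while haskey(done, hi + 1)     # absorb finished neighbors on the right
            (h2, v2) = pop!(done, hi + 1)
            v, hi = op(v, v2), h2
        end
        while haskey(left, lo - 1)     # absorb finished neighbors on the left
            l2 = pop!(left, lo - 1)
            (_, v2) = pop!(done, l2)
            v, lo = op(v2, v), l2
        end
        done[lo] = (hi, v)
        left[hi] = lo
    end
    return done[1][2]                  # the segment now covers every chunk
end

# Usage: string concatenation is associative but not commutative.
ordered_opportunistic_reduce(*, c -> join(string.(c)), [1:3, 4:6, 7:9])  # "123456789"
```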
Do I understand correctly that the reduce function must take two instances of some type and return a new instance of the same type?
I have cases where the operations work on large data (e.g., vectors with tens or hundreds of thousands of elements). In such cases, it is useful to have the abstraction of an accumulator, which is initialized to a zero state and can sum into itself the results of multiple steps.

For example, this allows allocating one such accumulator per thread, which would collect the results of all the steps executed on that thread. Once these are complete, the threads can perform a parallel reduction tree, ending with the final result in the accumulator of the first thread, which would be returned.
This also scales when combining multi-processing and multi-threading (a hypothetical `dtreduce`). The final result from each worker would be sent back to the invoking process and summed into its accumulator (this can also be done in parallel using multiple threads). This would minimize both garbage-collection churn and data movement between processes, for optimal performance.
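For what it's worth, this accumulator abstraction maps fairly directly onto the `OnInit` idiom shown earlier; a sketch using the Transducers.jl API (the element function `f` and the sizes are illustrative):

```julia
using Transducers

f(i) = fill(Float64(i), 10_000)   # each step produces a large vector

# Each basecase allocates one accumulator via OnInit and sums step results
# into it in place; basecase results are then merged pairwise with the same
# reducing function. Only O(number of tasks) accumulators are ever allocated.
acc_plus!(acc, x) = (acc .+= x; acc)
result = reduce(acc_plus!, Map(f), 1:100; init = OnInit(() -> zeros(10_000)))
```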