RFC: a queue thingy for batching work with multiple threads + tunable throughput #34185
Conversation
Is this a threaded lazy unordered map? If there is going to be …

Out of curiosity, why not stitch those functions together and turn them into a pipeline using …?
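(Illustratively, and only my guess at the suggestion: compose the per-stage functions into one function before handing it to the queue. `load_input`, `transform`, and `write_output` below are hypothetical stage functions.)

```julia
# one fused function per work item, instead of separate queued stages
process = write_output ∘ transform ∘ load_input
```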
```julia
put!(channel, f(something(x)))
return false
end
fetch(task) && break
```
Why do you need to `@spawn` if you `fetch(task)` right away?
It's been a while since I reasoned through this so hopefully I don't screw it up 😅
This is basically our trick to allow for migration of work between threads for each work item (though I don't know what the status is w.r.t. the scheduler actually being able to do this right now). The inner spawn for each work item keeps chunks of work from being locked to the specific thread of the outer spawn, while the outer spawn provides the actual parallelism and is throughput-limited in the desired manner.
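A minimal, self-contained sketch of this pattern (my illustration, not the PR's implementation; `staged_map`, `ntasks`, and `buffersize` are made-up names):

```julia
using Base.Threads: @spawn

# Each outer task repeatedly takes an item, spawns an inner task for it, and
# immediately fetches the result. The inner @spawn makes every work item its
# own task (so, in principle, the scheduler could migrate it to another
# thread), while the number of outer tasks and the output channel size bound
# throughput and buffering.
function staged_map(work, input::Channel; ntasks::Int = 4, buffersize::Int = 8)
    output = Channel(buffersize)
    workers = map(1:ntasks) do _
        @spawn for x in input              # iterating a Channel is safe across tasks
            put!(output, fetch(@spawn work(x)))
        end
    end
    @spawn begin
        foreach(wait, workers)
        close(output)
    end
    return output
end
```

Iterating the returned channel then yields results as they complete, in whatever order they finish.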
Thanks for the explanation. I didn't know this trick.
```julia
put!(channel, f(initial[1]))
state = initial[2]
isdone = false
next = () -> begin
```
Isn't it better to do this with another `Channel` ("share memory by communicating")? Is this for performance?
> Isn't it better to do this with another `Channel`
Ah, that does sound like a better idea! I'll try it out
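A rough sketch of the alternative being suggested (my illustration, not the PR's code): feed the work items through an input `Channel` instead of sharing a stateful `next`/`isdone` closure between tasks.

```julia
# A producer task pushes items into a Channel; worker tasks just take! from it
# (or iterate it). The Channel's internal locking replaces the hand-rolled
# shared iteration state.
function make_input(itr; buffersize::Int = 0)
    return Channel(buffersize) do ch
        for x in itr
            put!(ch, x)
        end
    end
end
```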
Kind of. I guess the main detail is that it's throughput-limited by picking a given channel size; the channel is otherwise filled up as eagerly as possible.
You definitely could just fuse together a single large pipeline and then pass it to this thing. The benefit of stringing them together is just being able to manually tune a desired resource profile (esp. w.r.t. memory consumption) for each stage.
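As a toy, runnable illustration of that knob (my example, not from the PR):

```julia
expensive_step(x) = (sleep(0.01); x^2)   # stand-in for real work

# The producer runs eagerly until 4 unconsumed results are queued, then blocks
# in put! until the consumer catches up -- the channel size directly caps how
# many intermediate results can pile up in memory.
results = Channel(4) do ch
    for x in 1:100
        put!(ch, expensive_step(x))
    end
end

for r in results      # each iteration takes one result, unblocking the producer
    println(r)
end
```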
Hmm... I guess I still don't understand this. If you fuse the pipeline then there is no need to sacrifice the memory consumption, as the intermediate objects can be GC'ed right away. Why would you want to tune something when you can have ideal memory consumption for free? You can also limit the CPU usage in the threaded map which receives a fused pipeline (so no extra memory consumption).
I would rather borrow from the C++ parallel STL, which gives you execution policies as arguments to functions; semantically this is still a …
I think I half agree. But I'd argue that, for higher order functions like …
I started experimenting with the "unordered reduce" route in JuliaFolds/Transducers.jl#112.
I think maybe this makes more sense if you don't necessarily know what your total resource pool/workload is going to be up front (e.g. if your computation backs a service or whatever), and you have the possibility of producers queueing up results far faster than consumers can consume them, especially if the producers are computationally cheap + produce large intermediate artifacts compared to consumers. If you have a sense of the "granularity" of the workload/resource pool (e.g. you know upfront you can allocate a new node with …), …

I recently saw dask/distributed#2602 (linked from that backpressure blog post that's been making the rounds), and was like "hey, that's the thing I'm very, very naively trying to allow callers to manually configure via this …"
I get that backpressure is important, but I still don't get why you want to introduce it at intermediate processing levels. Isn't it like you are introducing the problem (possible excessive memory use due to buffering) and then solving it right away (by bounded …)?

I find thinking in terms of consumer/producer is not really an optimal mindset when it comes to data-parallel computation. This may just be my skewed view, but it is probably a byproduct of stretching the iterator framework to parallel computation. I think this is a very limiting approach, as iterators are inherently "stateful." Rather, I think switching to the transducer approach is much better, as parallelism is pretty straightforward since you do fusion/composition on the function side as opposed to the data side. For example, you can automatically get a program that respects data locality if you fuse the pipeline. On the other hand, you may be moving around data constantly across CPUs or machines if you do it with the consumer/producer approach. I'd imagine working around it would require you to build a pretty non-trivial scheduling system.

Having said that, transducer-based approaches like JuliaFolds/Transducers.jl#112 do not disallow finer throttling. You can always shove …
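A small sketch of the fusion-on-the-function-side point (my example, using Transducers.jl's `Map`/`Filter` transducers and its fold entry points; the threaded `foldxt` call is an assumption about the current API rather than something from this comment):

```julia
using Transducers

# The stages compose into a single transducer, so each element flows through
# the whole pipeline in turn -- there are no intermediate collections to
# buffer or garbage-collect between stages.
xf = Map(x -> x^2) |> Filter(iseven) |> Map(x -> x ÷ 2)

foldl(+, xf, 1:1000)     # fused sequential fold
foldxt(+, xf, 1:1000)    # same fused pipeline, folded with threads
```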
The latest Transducers.jl has `channel_unordered`:

```julia
using Transducers

makeinput() = Channel() do ch
    for i in 1:3
        put!(ch, i)
    end
end

output = channel_unordered(Map(x -> x + 1), makeinput())
output = channel_unordered(x + 1 for x in makeinput())  # equivalent
```

You can also combine filter and flatten like …
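Presumably something along these lines (my guess at the combined call, using Transducers.jl's `Filter` and `MapCat`; not the comment's original snippet):

```julia
# keep only odd inputs, then emit two outputs per kept input,
# all flowing through the same unordered output channel
output = channel_unordered(Filter(isodd) |> MapCat(x -> (x, x + 1)), makeinput())
```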
Awesome stuff @tkf! I'm perfectly happy with this living in Transducers, seems a natural fit :)
This is a ~~draft~~ (EDIT: D'oh... I forgot to actually switch to draft mode and it seems like GitHub won't let you switch once opened) RFC PR because I'm not sure whether it should be in Base or not (or even in this file); I'll bother with actually adding tests etc. if the consensus is that it is appropriate to live here. Otherwise, I'll just plop it into its own package.

I'm not sure what to actually name this; we've been calling it `batch_channel` but I kind of hate that name. Jameson called it a fair-scheduled queue, so maybe there's a better name that can be derived from that.

Anyway, I've found this function to be a pretty nice little self-contained abstraction for stringing together throughput-limited, multithreaded work pipelines. It really makes building batch-process pipelines convenient; for example, see the following pseudocode:
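The sketch below is my reconstruction of the general shape rather than the PR's actual example; the `batch_channel(f, input; ntasks, buffersize)` signature and the stage functions are assumptions.

```julia
# Hypothetical staged pipeline: each stage gets its own task count and buffer
# size, so parallelism and memory pressure can be tuned stage by stage.
blobs   = batch_channel(download_blob,       url_list; ntasks = 8, buffersize = 32)
tables  = batch_channel(parse_and_clean,     blobs;    ntasks = 4, buffersize = 8)
results = batch_channel(fit_expensive_model, tables;   ntasks = 2, buffersize = 2)

for r in results
    push!(all_results, r)    # consume results as they complete (unordered)
end
```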
Is there any desire to make this available as a built-in threading abstraction? Regardless, would love help bikeshedding the name 😁
Thanks to @vtjnash + @SimonDanisch for tweaks to the original implementation to render this amenable to thread migration (once that's a thing), and @ararslan for replacing my little destructuring hack with proper `Some` usage.