Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow batch() option to not process empty windows? #75

Open
ddebrunner opened this issue Mar 2, 2016 · 9 comments
Open

Allow batch() option to not process empty windows? #75

ddebrunner opened this issue Mar 2, 2016 · 9 comments
Labels

Comments

@ddebrunner
Copy link
Contributor

The current batch implementation will process empty windows when the window is the last N seconds.

I wonder if this is the desired or expected behaviour. I see batching as breaking a stream into batches of tuples, thus processing an empty one doesn't seem to make much sense.

Batch could have an option to allow processing of empty windows, defaulting to false, or just not support it.

Throughts?

@dlaboss
Copy link
Member

dlaboss commented Mar 2, 2016

Doesn't feel like it's worth the addition... though maybe that's just because I haven't had to live with it as it is :-)

I imagine some uses would need to know that nothing was captured during the last N sec window and react as appropriate for the app (e.g., log, publish, generate a special tuple).

Seems clear/natural to expect the batcher is called every N sec with whatever tuples arrived during that time, hence the list may be empty.

Seems simple enough for uses that don't care about that case, if it doesn't fallout naturally from their processing, to just make the check. e.g.,

TStream s = s.last(10, SECONDS).batch(
                                      tuples -> { if (tuples.isEmpty()) return null;
                                      ... process the batch
                                      };

In a common case where "process the batch" involves a math3 analytic (e.g., SUM), maybe we define those analytics to return null if the input list is empty thereby eliminating having to write an explicit check?

@ddebrunner
Copy link
Contributor Author

I somewhat agree with @dlaboss , but while it's simple to have the isEmpty() it doesn't help reduce cpu cost of potentially many schedules of a task that ends up doing nothing.

Maybe it's a YAGNI, and maybe it's more than just a boolean, but potentially a mode, e.g.:

  • Always process empty partition
  • Only process the first in a sequence of empty partition.
  • Never process empty partition
  • ...

Maybe also it's a property of the window, so it applies to any window operation. If it's empty then do nothing for any window operation. This might also be easier to add in the future without breaking compatibility or having to have multiple overloads of batch().

@vdogaru
Copy link
Contributor

vdogaru commented Mar 2, 2016

I agree that implementing this as an empty partition policy is easy to add without breaking compatibility, but I think of handling empty windows as an attribute of the aggregate function rather than of the window.

Assume I want to support multiple aggregate functions triggered by the same window. To do that without extending Quarks, I have to create identical windows for each aggregate function. If this becomes unacceptable (e.g. high memory consumption if windows are large), and the alternative is to update the current window implementation to support a list of aggregates per window, then a per-window policy would then apply to all the aggregates.

@ddebrunner
Copy link
Contributor Author

@vdogaru
?> To do that without extending Quarks, I have to create identical windows for each aggregate function.

Are you saying that is the case now, or if the empty partition policy was per window?

@vdogaru
Copy link
Contributor

vdogaru commented Mar 2, 2016

I am saying that adding an empty partition policy per window would force all the aggregates triggered by that window to share the window's policy.

@ddebrunner
Copy link
Contributor Author

Right, but that doesn't force each aggregation to have its own window. Only ones with different policies.

@wmarshall484
Copy link
Contributor

that doesn't force each aggregation to have its own window

Currently, each time TWindow.aggregate is called, it creates a new Aggregate<T,U,K> oplet which contains its own Window. So in that sense, each aggregation does have its own window.

@ddebrunner
Copy link
Contributor Author

@wmarshall484 A mere implementation detail, from the api they use a single window.

The implementation could be changed to have a single oplet for a window.

@ddebrunner
Copy link
Contributor Author

Another situation that may not matter for Quarks being on the edge, but when there are thousands of partitions processing empty partitions can add up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants