-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle poisoned work items by marking them as failed #1175
base: main
Are you sure you want to change the base?
Handle poisoned work items by marking them as failed #1175
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @ItalyPaleAle, I like these changes as a first step in helping us deal with poison messages more generally. Some of the cases we normally deal with are on the work item fetch side (failing to deserialize received work items, for example) but my hope is that we can at least borrow some of the ideas here in separate PRs to also address those cases.
I added some comments about things I think we should fix or discuss before merging. They should be pretty minor.
FYI @davidmrdavid since this is a topic you've also spent a lot of time thinking about. Note that the motive in this case is for a different class of poison messages than what you were thinking about, but there might be some common pieces we can consider reusing.
FYI @sebastianburckhardt as well.
EventIds.OrchestrationPoisoned, | ||
InstanceId, | ||
ExecutionId, | ||
Details, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check for null here? Passing null
to WriteEvent
for any parameter will cause the entire log event to be lost. If I'm reading the code correctly, this comes from the WorkItemPoisonedException.Message
property, which doesn't have any guards against null values.
ExecutionId, | ||
Name, | ||
TaskEventId, | ||
Details, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about null checks.
/// <returns>Cloned object</returns> | ||
public OrchestrationRuntimeState Clone() | ||
{ | ||
return new OrchestrationRuntimeState(this.Events) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is technically a deep clone operation. The state of the two objects may potentially be different. For example, if the object being cloned as a set of history events in both the NewEvents
and PastEvents
lists, then the cloned object will instead have all the history event objects only in the PastEvents
list.
I suggest either we carefully document that this is not an exact clone or give this method a different name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you reckon the issue is with the description/comment only, or should the behavior be changed too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion. Here's what I think the code would need to be to do a proper deep clone*.
var cloned = new OrchestrationRuntimeState(this.PastEvents)
{
CompressedSize = this.CompressedSize,
Size = this.Size,
Status = this.Status,
};
foreach (HistoryEvent e in this.NewEvents)
{
cloned.AddEvent(e);
}
return cloned;
*this is not technically a "full deep clone" since we're not cloning the history events and I expect some history events may be mutable. Might be worth mentioning that in the comments.
-1, | ||
// Guaranteed to be not null because of the "when" clause in the catch block | ||
scheduledEvent!.EventId, | ||
poisonedException.Message, string.Empty), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to also populate the failureDetails
parameter of the TaskFailedEvent
constructor since there will be some consumers that use that instead of the (legacy) reason
+ details
values.
orchestratorMessages: Array.Empty<TaskMessage>(), | ||
timerMessages: Array.Empty<TaskMessage>(), | ||
continuedAsNewMessage: null, | ||
instanceState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update instanceState
to also reflect the fact that this instance is now in a failed state?
When orchestrations and activity tasks fail with un-retriable errors (by any worker), they should not be abandoned; instead, they should be marked as completed and with state failed.
This is implemented by catching exceptions of type
WorkItemPoisonedException
which can be raised downstream, for example by a persistence store provider. When that happens, instead of abandoning the work item, the dispatcher marks it as completed.