Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle poisoned work items by marking them as failed #1175

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ItalyPaleAle
Copy link
Member

When orchestrations and activity tasks fail with un-retriable errors (by any worker), they should not be abandoned; instead, they should be marked as completed and with state failed.

This is implemented by catching exceptions of type WorkItemPoisonedException which can be raised downstream, for example by a persistence store provider. When that happens, instead of abandoning the work item, the dispatcher marks it as completed.

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @ItalyPaleAle, I like these changes as a first step in helping us deal with poison messages more generally. Some of the cases we normally deal with are on the work item fetch side (failing to deserialize received work items, for example) but my hope is that we can at least borrow some of the ideas here in separate PRs to also address those cases.

I added some comments about things I think we should fix or discuss before merging. They should be pretty minor.

FYI @davidmrdavid since this is a topic you've also spent a lot of time thinking about. Note that the motive in this case is for a different class of poison messages than what you were thinking about, but there might be some common pieces we can consider reusing.

FYI @sebastianburckhardt as well.

EventIds.OrchestrationPoisoned,
InstanceId,
ExecutionId,
Details,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check for null here? Passing null to WriteEvent for any parameter will cause the entire log event to be lost. If I'm reading the code correctly, this comes from the WorkItemPoisonedException.Message property, which doesn't have any guards against null values.

ExecutionId,
Name,
TaskEventId,
Details,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about null checks.

/// <returns>Cloned object</returns>
public OrchestrationRuntimeState Clone()
{
return new OrchestrationRuntimeState(this.Events)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is technically a deep clone operation. The state of the two objects may potentially be different. For example, if the object being cloned as a set of history events in both the NewEvents and PastEvents lists, then the cloned object will instead have all the history event objects only in the PastEvents list.

I suggest either we carefully document that this is not an exact clone or give this method a different name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you reckon the issue is with the description/comment only, or should the behavior be changed too?

Copy link
Member

@cgillum cgillum Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion. Here's what I think the code would need to be to do a proper deep clone*.

var cloned = new OrchestrationRuntimeState(this.PastEvents)
{
    CompressedSize = this.CompressedSize,
    Size = this.Size,
    Status = this.Status,
};

foreach (HistoryEvent e in this.NewEvents)
{
    cloned.AddEvent(e);
}

return cloned;

*this is not technically a "full deep clone" since we're not cloning the history events and I expect some history events may be mutable. Might be worth mentioning that in the comments.

-1,
// Guaranteed to be not null because of the "when" clause in the catch block
scheduledEvent!.EventId,
poisonedException.Message, string.Empty),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to also populate the failureDetails parameter of the TaskFailedEvent constructor since there will be some consumers that use that instead of the (legacy) reason + details values.

orchestratorMessages: Array.Empty<TaskMessage>(),
timerMessages: Array.Empty<TaskMessage>(),
continuedAsNewMessage: null,
instanceState);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update instanceState to also reflect the fact that this instance is now in a failed state?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants