Handle poisoned work items by marking them as failed #1175

ItalyPaleAle · 2024-11-11T17:32:41Z

When orchestrations and activity tasks fail with errors that cannot be retried successfully by any worker, they should not be abandoned; instead, they should be marked as completed with a failed state.

This is implemented by catching exceptions of type WorkItemPoisonedException, which can be raised downstream, for example by a persistence store provider. When that happens, the dispatcher marks the work item as completed instead of abandoning it.
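The control flow described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `WorkItemOutcome`, `DispatcherSketch`, and `ProcessAsync` are hypothetical names, and the `WorkItemPoisonedException` class here is a local stand-in for the type the PR introduces.

```csharp
using System;
using System.Threading.Tasks;

// Local stand-in for the exception type named in this PR (assumption).
public class WorkItemPoisonedException : Exception
{
    public WorkItemPoisonedException(string message) : base(message) { }
}

// Hypothetical outcomes a dispatcher could choose between.
public enum WorkItemOutcome { Completed, CompletedAsFailed, Abandoned }

public static class DispatcherSketch
{
    // Runs a work item and decides what should happen to it afterwards.
    public static async Task<WorkItemOutcome> ProcessAsync(Func<Task> processWorkItem)
    {
        try
        {
            await processWorkItem();
            return WorkItemOutcome.Completed;
        }
        catch (WorkItemPoisonedException)
        {
            // Retrying on any worker would fail the same way, so the item is
            // completed with a failed state instead of being abandoned.
            return WorkItemOutcome.CompletedAsFailed;
        }
        catch (Exception)
        {
            // Any other failure may be transient: abandon so it can be retried.
            return WorkItemOutcome.Abandoned;
        }
    }
}
```

The more specific `catch` clause comes first, so only exceptions that explicitly signal a poisoned work item skip the abandon-and-retry path.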

cgillum

Thanks for this PR, @ItalyPaleAle. I like these changes as a first step in helping us deal with poison messages more generally. Some of the cases we normally deal with are on the work item fetch side (failing to deserialize received work items, for example), but my hope is that we can at least borrow some of the ideas here in separate PRs to also address those cases.

I added some comments about things I think we should fix or discuss before merging. They should be pretty minor.

FYI @davidmrdavid since this is a topic you've also spent a lot of time thinking about. Note that the motive in this case is for a different class of poison messages than what you were thinking about, but there might be some common pieces we can consider reusing.

FYI @sebastianburckhardt as well.

cgillum · 2024-11-11T23:09:14Z

src/DurableTask.Core/Logging/StructuredEventSource.cs

+                    EventIds.OrchestrationPoisoned,
+                    InstanceId,
+                    ExecutionId,
+                    Details,


Do we need to check for null here? Passing null to WriteEvent for any parameter will cause the entire log event to be lost. If I'm reading the code correctly, this comes from the WorkItemPoisonedException.Message property, which doesn't have any guards against null values.
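One way to add the guard being asked about is to coalesce each string argument before it reaches `WriteEvent`. The sketch below is illustrative only: `SketchEventSource`, its source name, and event ID 1 are hypothetical and do not come from DurableTask.Core; only `EventSource`, `EventAttribute`, and `WriteEvent` are real .NET APIs.

```csharp
using System.Diagnostics.Tracing;

// Hypothetical event source; the name and event ID are made up for this sketch.
[EventSource(Name = "Sketch-PoisonedWorkItems")]
public sealed class SketchEventSource : EventSource
{
    public static readonly SketchEventSource Log = new SketchEventSource();

    [Event(1, Level = EventLevel.Error)]
    public void OrchestrationPoisoned(string InstanceId, string ExecutionId, string Details)
    {
        // Replace null with an empty string so that no null value is ever
        // passed to WriteEvent.
        this.WriteEvent(
            1,
            InstanceId ?? string.Empty,
            ExecutionId ?? string.Empty,
            Details ?? string.Empty);
    }
}
```

With this shape, a call such as `SketchEventSource.Log.OrchestrationPoisoned("instance", "execution", null)` logs an empty details string rather than passing null through.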

cgillum · 2024-11-11T23:09:27Z

src/DurableTask.Core/Logging/StructuredEventSource.cs

+                    ExecutionId,
+                    Name,
+                    TaskEventId,
+                    Details,


Same question about null checks.

cgillum · 2024-11-11T23:10:42Z

src/DurableTask.Core/OrchestrationRuntimeState.cs

+        /// <returns>Cloned object</returns>
+        public OrchestrationRuntimeState Clone()
+        {
+            return new OrchestrationRuntimeState(this.Events)


I don't think this is technically a deep clone operation. The state of the two objects may differ. For example, if the object being cloned has a set of history events in both the NewEvents and PastEvents lists, then the cloned object will instead have all the history event objects only in the PastEvents list.

I suggest either we carefully document that this is not an exact clone or give this method a different name.

ItalyPaleAle · 2024-11-11

Do you reckon the issue is with the description/comment only, or should the behavior be changed too?

cgillum · 2024-11-12 (edited)

I don't have a strong opinion. Here's what I think the code would need to look like to do a proper deep clone*:

    var cloned = new OrchestrationRuntimeState(this.PastEvents)
    {
        CompressedSize = this.CompressedSize,
        Size = this.Size,
        Status = this.Status,
    };

    foreach (HistoryEvent e in this.NewEvents)
    {
        cloned.AddEvent(e);
    }

    return cloned;

*this is not technically a "full deep clone" since we're not cloning the history events and I expect some history events may be mutable. Might be worth mentioning that in the comments.
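The difference between the two approaches can be shown with a toy model. `ToyRuntimeState` below is a hypothetical stand-in, not the real OrchestrationRuntimeState: it assumes, as described in this thread, that the constructor files every event it receives under PastEvents and that AddEvent appends to NewEvents.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for OrchestrationRuntimeState (assumption, for illustration only).
public class ToyRuntimeState
{
    public List<string> PastEvents { get; } = new List<string>();
    public List<string> NewEvents { get; } = new List<string>();

    // All events, past first and then new.
    public IEnumerable<string> Events => PastEvents.Concat(NewEvents);

    // Every event passed to the constructor is treated as a past event.
    public ToyRuntimeState(IEnumerable<string> pastEvents)
    {
        PastEvents.AddRange(pastEvents);
    }

    public void AddEvent(string e) => NewEvents.Add(e);

    // Mirrors the PR's Clone(): new events end up in PastEvents.
    public ToyRuntimeState CloneFlattened() => new ToyRuntimeState(Events);

    // Mirrors the suggested version: the past/new split is preserved.
    public ToyRuntimeState ClonePreservingSplit()
    {
        var cloned = new ToyRuntimeState(PastEvents);
        foreach (string e in NewEvents)
        {
            cloned.AddEvent(e);
        }
        return cloned;
    }
}
```

Both clones share the same event objects with the original, which is the "not a full deep clone" caveat from the footnote above.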

cgillum · 2024-11-11T23:13:48Z

src/DurableTask.Core/TaskActivityDispatcher.cs

+                            -1,
+                            // Guaranteed to be not null because of the "when" clause in the catch block
+                            scheduledEvent!.EventId,
+                            poisonedException.Message, string.Empty),


We need to also populate the failureDetails parameter of the TaskFailedEvent constructor since there will be some consumers that use that instead of the (legacy) reason + details values.

cgillum · 2024-11-11T23:28:24Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

+                    orchestratorMessages: Array.Empty<TaskMessage>(),
+                    timerMessages: Array.Empty<TaskMessage>(),
+                    continuedAsNewMessage: null,
+                    instanceState);


Do we need to update instanceState to also reflect the fact that this instance is now in a failed state?

ItalyPaleAle added 2 commits November 8, 2024 12:59

Handle poisoned activity task

6222ab3

Handle poisoned orchestration work items

c77207c

cgillum requested changes Nov 11, 2024
