Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHA-RL4b5][Room monitoring] Contributors detach on channel FAILED state #405

Open
sacOO7 opened this issue Nov 15, 2024 · 5 comments
Open
Labels
bug Something isn't working

Comments

@sacOO7
Copy link

sacOO7 commented Nov 15, 2024

┆Issue is synchronized with this Jira Task by Unito

@sacOO7 sacOO7 changed the title [CHA-RL4b5][Room monitoring] contributors detach on channel FAILED state not atomic [CHA-RL4b5][Room monitoring] Contributors detach on channel FAILED state Nov 15, 2024
@sacOO7
Copy link
Author

sacOO7 commented Nov 15, 2024

Also, not sure about the comment

// We'll make a best effort at detaching all the other channels

Feels like a soft detach, where RoomStatus will be inconsistent with underlying channel states.
Because in this case, we don't care if detach fails for one or more remaining contributors.
I think operation should be retried till all channels go into Detached or Failed state 🤔
WDYT

@sacOO7
Copy link
Author

sacOO7 commented Nov 15, 2024

Documentation does mention that user intervention required for failed state ->

/**
* The room is currently detached and will not attempt to re-attach. User intervention is required.
*/
Failed = 'failed',

But Room Failed is inconsistent with underlying channel states. Impl. for the same is reflected at multiple places in room-lifecycle. Above is one of the cases.

Currently, we mark RoomStatus as Failed when one of the channels goes into the Failed state. Ideally, this shouldn't be the case. RoomStatus should represent deterministic channel states.

@sacOO7
Copy link
Author

sacOO7 commented Nov 15, 2024

  • Attached = 'attached' -> all channels in ATTACHED state

  • Detached = 'detached' -> all channels in DETACHED or FAILED state

  • Suspended = 'suspended' -> This state represents ongoing retry operation till one of the channel goes into FAILED state.

  • Failed = 'failed' -> This currently represents when one of the channels enters the Failed state. Ideally, this shouldn't happen. Imagine one channel is in a Failed state and another in a Suspended state. If users don't explicitly ATTACH, Suspended channels will automatically reattach after reconnection, while channels in the Failed state will not reattach. It shouldn't be the case that a Failed room is receiving messages on some channels but not others. RoomStatus should be set to Failed only when all channels go into the Failed or Detached state. This ensures no messages are received in the Failed state. Although not an urgent use-case, it will show deterministic behavior.

  • Currently, when the Room state is Failed, we don't always try full runDownChannels ( make sure channels go into either Detached or Failed state ):

  1. Partial runDownChannels (current case, room monitoring) -> Link
  2. Partial runDownChannels ( windown for suspended channels)-> Link
  3. Full runDownChannels ( first failed attach )-> Link
  4. Full runDownChannels ( attach inside retry )-> Link

Currently, we exhibit inconsistent behavior for the Room Failed state.

  • Released = 'released' -> Represents channels in either detached or failed state. Also, prohibits from performing Attach and Detach operations.

@sacOO7
Copy link
Author

sacOO7 commented Nov 17, 2024

Btw, I just went through the spec, and found out cases when channel goes in Failed state.

  1. When Connection Enters Failed State (RTL3a) - In this case, all channels enter Failed state, along with given channel. Partial runDownChannels impl. assumes this is the case, so there' no need to make sure all channels are into FAILED/ DETACHED state.

  2. Lack of permissions during channel ATTACH (RTL4e) - During channel ATTACH, if user doesn't have enough permissions related to the channel, then channel can go into FAILED state. Impl. for the same in ably-java and ably-go.
    Full runDownChannels impl. assumes this is the case, all remaining channels are forced into DETACHED/FAILED state.

  3. Automatic Token renewal with lack of permissions for one or more channels - Please note that, ATTACHED channel can also go into Failed state if renewed token doesn't have permission to support the same. In such a case, Partial runDownChannels can lead to inconsistent states for given channels.

  • So, it's better to support Full runDownChannels( atomic internal op - CHA-RL7a1) for all cases.

@AndyTWF
Copy link
Collaborator

AndyTWF commented Nov 18, 2024

Currently, we perform the normal _doChannelWindDown instead of executing it inside _mtx.runExclusive.

An oversight, which we can fix.

I think operation should be retried till all channels go into Detached or Failed state 🤔

Seems reasonable - and shouldn't be too hard.

RoomStatus should be set to Failed only when all channels go into the Failed or Detached state. This ensures no messages are received in the Failed state. Although not an urgent use-case, it will show deterministic behaviour.

I think that we could argue either way here - what's more important, to tell the user straight away that something has failed, or perhaps have them wait X seconds for the detach procedure to complete (assuming some bad network conditions)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

2 participants