HDDS-11780. Increase client write retry when SCM is in safe mode #7470

Draft
wants to merge 2 commits into
base: master

ArafatKhan2198
Contributor

What changes were proposed in this pull request?

Root Cause:

  • DataNode Registration Delays: Each DataNode requires approximately 30 seconds to register with the leader SCM due to the heartbeat interval.
  • SCM Restart and Leadership Retention: In scenarios where the restarted SCM retains leadership after an election, the SCM loses its in-memory state. DataNodes and pipelines must re-register, leading to delays in exiting safe mode.
  • Dependency on HealthyPipelineSafeModeRule: This rule requires DataNodes to report pipeline health, which can be delayed due to slow DataNode registration, network latency, or the time needed for pipelines to stabilize.
  • These factors combined caused the SCM to take slightly over a minute to exit safe mode, impacting write operations during this transition.

Current Mechanism:

  • The handleSubmitRequestAndSCMSafeModeRetry method manages write requests (e.g., block allocation or key creation) during SCM safe mode by:
      • Catching the "SCM in safe mode" exception.
      • Retrying the operation after a defined wait interval.
      • Allowing a limited number of retries while waiting for the SCM to exit safe mode.
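
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual Ozone code: `ScmSafeModeException`, `SafeModeRetry`, and `callWithRetry` are hypothetical names, and the sketch assumes one initial attempt followed by up to `retryCount` retries with a fixed sleep in between.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the "SCM in safe mode" error; not an Ozone class.
class ScmSafeModeException extends Exception {
  ScmSafeModeException(String message) {
    super(message);
  }
}

final class SafeModeRetry {
  private SafeModeRetry() {
  }

  // Runs the request once, then retries up to retryCount times, sleeping
  // waitMs between attempts. Only the safe-mode error triggers a retry;
  // any other exception propagates immediately.
  static <T> T callWithRetry(Callable<T> request, int retryCount, long waitMs)
      throws Exception {
    int retriesUsed = 0;
    while (true) {
      try {
        return request.call();
      } catch (ScmSafeModeException e) {
        if (retriesUsed >= retryCount) {
          throw e; // retry budget exhausted, surface the failure
        }
        retriesUsed++;
        Thread.sleep(waitMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger attempts = new AtomicInteger();
    // Simulated request: fails twice with the safe-mode error, then succeeds.
    String result = callWithRetry(() -> {
      if (attempts.incrementAndGet() < 3) {
        throw new ScmSafeModeException("SCM in safe mode");
      }
      return "allocated";
    }, 5, 10L);
    System.out.println(result + " after " + attempts.get() + " attempts");
  }
}
```

With this shape, the longest a client waits before giving up is bounded by the retry count times the wait interval, plus whatever time each request itself takes.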

Proposed Change:

  • Config Update: Increase BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS from 3000 ms to 5000 ms and BLOCK_ALLOCATION_RETRY_COUNT from 5 to 15.
  • This extends the total wait time for retries from ~25–30 seconds to ~75 seconds.
  • Impact: Write operations no longer fail prematurely when SCM takes longer to exit safe mode, which improves client resilience.
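
As a quick check of the retry budget, the sleep time alone is the retry count multiplied by the wait interval. The sketch below uses the old and new constant values from this PR's diff; `RetryBudget` and `totalWaitMs` are hypothetical names used only for illustration. It counts only the sleeps between retries, not the time spent inside each request, so observed totals can be higher.

```java
final class RetryBudget {
  private RetryBudget() {
  }

  // Upper bound on time spent sleeping between retries, in milliseconds.
  static long totalWaitMs(int retryCount, long waitMs) {
    return Math.multiplyExact((long) retryCount, waitMs);
  }

  public static void main(String[] args) {
    // Old values from the diff: 5 retries, 3000 ms apart.
    System.out.println("old: " + totalWaitMs(5, 3000L) + " ms");   // old: 15000 ms
    // New values from the diff: 15 retries, 5000 ms apart.
    System.out.println("new: " + totalWaitMs(15, 5000L) + " ms");  // new: 75000 ms
  }
}
```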

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11715

How was this patch tested?

@ArafatKhan2198 ArafatKhan2198 changed the title HDDS-11780 Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes. Nov 22, 2024
@adoroszlai adoroszlai changed the title HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes. HDDS-11780. Increase client write retry when SCM is in safe mode Nov 22, 2024
- public static final int BLOCK_ALLOCATION_RETRY_COUNT = 5;
- public static final int BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS = 3000;
+ public static final int BLOCK_ALLOCATION_RETRY_COUNT = 15;
+ public static final int BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS = 5000;

We should wait 1 second between retries and retry 90 times.