-
Notifications
You must be signed in to change notification settings - Fork 506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-11243. SCM SafeModeRule Support EC. #7008
Conversation
@@ -199,6 +201,11 @@ public void onMessage(final ContainerReportFromDatanode reportFromDatanode, | |||
// list | |||
processMissingReplicas(datanodeDetails, expectedContainersInDatanode); | |||
containerManager.notifyContainerReportProcessing(true, true); | |||
if (reportFromDatanode.isRegister()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the CONTAINER_REPORT
is completed, we send the message to CONTAINER_REGISTRATION_REPORT
to ensure that the container count is accurate.
@errose28 @siddhantsangwan Can you help review this pr? Thank you very much! |
@slfan1989 Thanks for taking this up, I was earlier thinking of fixing this myself. I'll review the PR soon. |
@siddhantsangwan Can you help review this pr? Thank you very much! The unit test errors are not caused by our changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slfan1989 I've reviewed this partly. Have some comments below.
if (replicationConfig != null && replicationConfig instanceof ECReplicationConfig) { | ||
ECReplicationConfig ecReplicationConfig = (ECReplicationConfig) replicationConfig; | ||
int data = ecReplicationConfig.getData(); | ||
if (uuids != null && uuids.size() > data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Ratis, just one replica per container is required. So for EC, data number of Datanodes should be sufficient. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. For EC, the amount of data we have is already sufficient. I will improve the code.
if (ratisContainerMap.containsKey(containerID)) { | ||
ratisContainerDNsMap.computeIfAbsent(containerID, key -> Sets.newHashSet()); | ||
ratisContainerDNsMap.get(containerID).add(datanodeUUID); | ||
if (!reportedConatinerIDSet.contains(containerID)) { | ||
Set<UUID> uuids = ratisContainerDNsMap.get(containerID); | ||
if (uuids != null && uuids.size() >= 1) { | ||
ratisContainerWithMinReplicas.getAndAdd(1); | ||
reportedConatinerIDSet.add(containerID); | ||
getSafeModeMetrics() | ||
.incCurrentContainersWithOneReplicaReportedCount(); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really understand this change. It seems correct, but is there any reason this logic isn't the same as before? Why do we need to track Datanodes in a set for Ratis containers? Is it because ratisContainerDNsMap
and reportedConatinerIDSet
are going to be used somewhere else as well? Or is it done this way just so it's similar to the EC logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the question!
I didn't really understand this change. It seems correct, but is there any reason this logic isn't the same as before?
The previous logic was correct. I made this modification for two reasons:
- To align with the EC's logic and improve code readability.
- To facilitate the retrieval of additional data in future pr. For example, this will allow users not only to understand the progress but also to identify which
containers
have not reported and whichDataNodes
are included in the reported containers.
Why do we need to track Datanodes in a set for Ratis containers? Is it because ratisContainerDNsMap and reportedConatinerIDSet are going to be used somewhere else as well?
The type of ratisContainerDNsMap
is Map<Long, Set<UUID>>
, where the key is the ContainerId
. The reason for using a Set
as the value is to avoid retaining duplicate DN information, as we may encounter the same DN registering multiple times.
Or is it done this way just so it's similar to the EC logic?
Here's one reason; it has already been explained in the previous comment.
Can we modify it this way? The original code contains some insufficient information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddhantsangwan Can you help review this PR again? Thank you very much!
I improved some of the code, made it less repetitive, and added some comments.
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/ContainerSafeModeRule.java
Outdated
Show resolved
Hide resolved
long ratisCutOff = (long) Math.ceil(ratisMaxContainer * safeModeCutoff); | ||
long ecCutOff = (long) Math.ceil(ecMaxContainer * safeModeCutoff); | ||
|
||
getSafeModeMetrics().setNumContainerWithOneReplicaReportedThreshold(ratisCutOff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's set EC metrics as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve this part of the code.
private void reInitializeRule() { | ||
containerMap.clear(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like most of the code inside this method is the same as before. If possible, let's refactor this to avoid repetition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve this part of the code.
@@ -75,10 +79,18 @@ public void setNumContainerWithOneReplicaReportedThreshold(long val) { | |||
this.numContainerWithOneReplicaReportedThreshold.set(val); | |||
} | |||
|
|||
public void setNumContainerWithECDataReplicaReportedThreshold(long val) { | |||
this.numContainerWithECDataReplicaReportedThreshold.incr(val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use set() instead of incr().
Thank you very much for reviewing this PR! I will respond to your questions as soon as possible. |
@@ -1695,6 +1695,15 @@ | |||
</description> | |||
</property> | |||
|
|||
<property> | |||
<name>hdds.scm.safemode.reported.datanode.pct</name> | |||
<value>0.90</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 90% is too much and a significant difference from previous config, If there is a cluster say with under-utilized DN's on which which there is no data in ~~30-40% of total DN's , Safemode would still wait for these to be registered. IMO DatanodeSafemodeRule is to ensure there are datanodes available for a write to go through. We already do the check to see if enough containers are available for reading in the containerSafemodeRule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for helping review the code! From my personal perspective, I believe we should still have an optional configuration to control this. You made a valid point—0.9
might be a relatively large value, but if only one DN is registered and the rule passes, it seems a bit too lenient. We set the default value to 0.1
. Do you think that would be acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 0.1 sounds good, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the Datanode safe mode rule is meant to ensure writes work. So that means we only need one Datanode to be present as Ozone still allows single replica writes.
@errose28 @siddhantsangwan @sadanand48 @adoroszlai Thank you all very much for paying attention to this PR! To facilitate a better review of this PR, I'll summarize some additional information for your reference. Background: In our production environment, we use EC (Erasure Coding) strategy, and we have written a lot of EC data. Solution Process: Our hope is that once the SCM meets the safe mode criteria, it can switch to become the leader SCM, and users' access will no longer report errors. We found that there are some issues with the two rules for safe mode.
Through our familiarity and understanding of the code, we found that
In DataNodeSafeModeRule, the default condition for exiting the rule is that only one DataNode is registered. I think it would be better if we could configure a proportional parameter to control when we can exit. Therefore, we added the parameter |
if (pipeLineDnSet.contains(dnUUID) || !registeredDnSet.contains(dnUUID)) { | ||
registeredDnSet.add(dnUUID); | ||
registeredDns = registeredDnSet.size(); | ||
unRegisteredDn.remove(dnUUID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@errose28 Regarding the issue we discussed together in HDDS-11481, I plan to add a variable here to store unRegisteredDn
and display it in the status. Do you think this approach is acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using the the persisted pipeline membership to determine which nodes have not been seen yet? That should work. It won't catch the cases where a new DN not in any pipelines has not registered yet but it at least provides more information.
I'm not sure that putting all nodes back in the unregistered list on refresh it the correct behavior though, since nodes that have already registered should remain accounted for by the rule on refresh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your message! I have made adjustments to this part of the logic. During re-registration, only DNs that are not in the registeredDnSet
will be placed in the unRegisteredDn
.
@siddhantsangwan A friendly ping! Supporting EC Container recognition is crucial for SafeMode. I’d like to continue contributing to gain recognition for this change. What additional tasks can I pursue? I would appreciate any guidance or suggestions you can provide. |
@slfan1989 thanks for updating! I'll be able to review this and have a discussion with you next week, probably on Tuesday. |
* @param isEcContainer true, means ECContainer, false, means not ECContainer. | ||
*/ | ||
private void recordReportedContainer(long containerID, boolean isEcContainer) { | ||
if (!reportedContainerIDSet.contains(containerID)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to get rid of reportedContainerIDSet
and just use the ratis and ec maps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! We can indeed remove reportedContainerIDSet, and I will improve the code.
return 1; | ||
} | ||
|
||
private void initContainerDNsMap(long containerID, Map<Long, Set<UUID>> containerDNsMap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming this to putInContainerDNsMap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve the code.
|
||
private AtomicLong containerWithMinReplicas = new AtomicLong(0); | ||
private Set<Long> reportedContainerIDSet = new HashSet<>(); | ||
private Map<Long, ContainerInfo> ratisContainerMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really ratisContainerMap
and ecContainerMap
to be maps at all? As far as I can see, all we need are the container IDs. We can then just use container manager to get the container info object when needed. There's probably at least a couple of GBs of overhead for storing references to billions of ContainerInfo
objects in the map, which we don't really need. It can just be a set of container IDs.
Going a step further, I feel like we don't need the List of ContainerInfo
objects that's being passed into the constructor of this class. Ultimately, all we need is a mapping from container id to container info for all container IDs. So the constructor should either have that as an argument, or just the container manager, since the container manager can simply be used to get any information we need about the containers in the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion is also very reasonable. I’ve been using ContainerInfo
solely to retrieve the minimum replica count of the container. I can optimize ratisContainerMap
and ecContainerMap
so that these two variables only store the mapping between ContainerID
and its corresponding minimum replica count.
ReplicationConfig replicationConfig = container.getReplicationConfig(); | ||
|
||
if (checkContainerState(containerState) && container.getNumberOfKeys() > 0) { | ||
if (replicationConfig instanceof RatisReplicationConfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more intuitive to do something like:
container.getReplicationType().equals(HddsProtos.ReplicationType.RATIS)
} | ||
} | ||
} | ||
|
||
private void reInitializeRule() { | ||
containerMap.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was also clearing this map but the new code isn't; can you check if we need to clear the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I carefully readed the code, and indeed we should also preserve the logic for cleaning up the Map. I have added the relevant logic in the initializeRule
method.
Thanks @slfan1989 for working on this.
Converted the PR to draft until then. Also removed some from requested reviewer list. I don't think having 8 people review the same patch makes sense. |
5db5824
to
1c16cb2
Compare
@siddhantsangwan Thank you very much for reviewing the code! I have made the changes according to your suggestions and added unit tests. The unit tests cover common EC types, such as EC-3-2-1024K and EC-6-3-1024K. I would appreciate it if you could find some time to review this PR again. I have submitted the code to my personal repository, and the CI(https://github.com/slfan1989/ozone/actions/runs/11771095727) shows that all checks have passed. I have now changed the status of this PR to "Ready for Review." |
@nandakumar131 can you please review as well? |
@adoroszlai Thank you very much for reviewing this PR! This improvement is very important to us. Currently, when we restart the SCM, it cannot determine whether the EC Container has finished reporting because, similar to the Ratis 3-replica Container, the SCM considers the Container ready as soon as just one replica reports successfully. This results in an issue where we are unable to promote the SCM to leader when it has just restarted and has already exited safe mode. This PR has been in use internally for several months, and I personally believe it has met expectations. Currently, we have fully transitioned our internal Ozone cluster to the EC-6-3-1024K strategy (meaning there is almost no 3-replica data in the cluster, with only a small amount, less than 10PB, as exceptions). This decision was driven by cost considerations, as we have already stored over 100PB of data. I sincerely hope we can continue to push this PR forward. If there are any suggestions for improvement, I will continue to make the necessary changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to suggest relatively simple code changes to reduce duplication in ContainerSafeModeRule
: create an instance for each replication type. See: adoroszlai@fb815f3
@adoroszlai Thank you very much for the code modifications you provided! I am currently reviewing this part of the code and optimizing the PR based on your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slfan1989 thanks for your sustained efforts. I've left some more comments below, mostly regarding maintaining unnecessary data in memory. At high scale, these memory optimisations will make a big difference!
private double maxContainer; | ||
|
||
private AtomicLong containerWithMinReplicas = new AtomicLong(0); | ||
private Map<Long, Integer> ratisContainerMinReplicaMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This map is not needed as far as I can tell. It can just be a Set of Ratis container ids (long), since the min replica count for ratis containers is always 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion! I have improved the relevant code.
|
||
private AtomicLong containerWithMinReplicas = new AtomicLong(0); | ||
private Map<Long, Integer> ratisContainerMinReplicaMap; | ||
private Map<Long, Set<UUID>> ratisContainerDNsMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why maintain a mapping for ratis containers? If we use the set of container id that I mentioned above, we can simply remove a container id from the set when a datanode reports having a replica of that container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have improved the code based on the method you suggested.
private AtomicLong containerWithMinReplicas = new AtomicLong(0); | ||
private Map<Long, Integer> ratisContainerMinReplicaMap; | ||
private Map<Long, Set<UUID>> ratisContainerDNsMap; | ||
private Map<Long, Integer> ecContainerMinReplicaMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to maintain this mapping either, as far as I can tell. Whenever we have the container id and need the replication factor of that container, we can use a container manager method to get it. That'll be a constant time (O(1)) lookup for container manager on average.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the logic has also been improved.
@slfan1989 I'm not sure about the Datanode safe mode rule related improvements in this pull request. Logically it's a separate change from adding EC safe mode support, and so it should have a different jira and pull request. I feel it requires more thinking and I'm not sure how carefully others have reviewed it. It's best suited for a different PR. So in the interest of time, I suggest removing those changes from this PR and introducing them in a separate PR. That way, we'll be able to merge this PR sooner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commits look good to me. With this, all the EC container safe mode related changes are ready to be merged IMO. If we can remove the datanode rule related changes and have a green CI run, we'll be able to merge the PR.
@siddhantsangwan Thank you for your continued improvement suggestions! I will remove the datanode rule related changes in this PR. |
@slfan1989 test failure looks related, can you take a look? Also please let me know once it's ready for a final review. |
@siddhantsangwan I have fixed the errors in the unit tests and am waiting for the CI to pass. Once that’s done, I will ask you to help with another review. Thank you again! |
e3b150c
to
8291802
Compare
@siddhantsangwan I have rechecked the code, and this version is the final one. I have also rebased the code. Could you please review it again? Thank you very much! |
Please try to avoid force-push when updating the PR. Here are some great articles that explain why: https://developers.mattermost.com/blog/submitting-great-prs/#4-avoid-force-pushing |
@adoroszlai Thank you for providing this information! I will pay attention to this detail in future development to avoid issues caused by force-pushing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on the tests.
...dds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/safemode/TestSCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
...dds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/safemode/TestSCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
...dds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/safemode/TestSCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending green CI.
Merged, thanks everyone! |
Thank you all for your support in helping us complete this PR. I greatly appreciate everyone’s valuable feedback throughout the process. @siddhantsangwan's professionalism left a strong impression on me—many thanks once again. I’d also like to extend my gratitude to @adoroszlai for their continued assistance. I’ve learned a lot throughout this process. |
What changes were proposed in this pull request?
We aim for SCM to immediately switch to leader once it exits safe mode. Currently, due to certain issues, we need to wait for at least one full container report from a DataNode before proceeding with the switch.
Currently, SCM SafeMode has the following issues:
DataNodeSafeModeRule
cannot effectively verify the registration status of DataNodes. In most cases, as long as there are more than one DataNode, this rule passes. Therefore, we need to strengthen this rule.ContainerSafeModeRule
does not support verification of EC (Erasure Coding) Containers.EC
Containers differ significantly fromRATIS/THREE
Containers because EC Containers require determining how many replicas are needed based on the EC type. For instance, forEC-6-3-1024K
, we need to ensure that the Container reports having all 6 replicas before it can provide services.This PR aims to enhance and improve the above two points.
For code Improve:
For the registration of Datanodes, we need to obtain the complete list of Datanodes from SCM. This list can be retrieved from the Pipeline. I pass PipelineManager as a parameter into DataNodeSafeModeRule to calculate the number of Datanodes.
Enhance replica validation for EC containers. Obtain the required replicas based on ECReplicationConfig. Consider container reporting complete only when sufficient replicas have been reported.
Modify the message sending location of ContainerSafeModeRule.
ozone/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMDatanodeProtocolServer.java
Lines 241 to 256 in 311245b
There are some issues in this part of the code. The handling of
NODE_REGISTRATION_CONT_REPORT
andCONTAINER_REPORT
is asynchronous. There is a scenario whereNODE_REGISTRATION_CONT_REPORT
processing completes, butCONTAINER_REPORT
processing does not. This still leads to insufficient EC replicas issue.I adjusted the sending position of NODE_REGISTRATION_CONT_REPORT (requiring the message to be sent only after CONTAINER_REPORT processing completes) and introduced a new type, CONTAINER_REGISTRATION_REPORT, to distinguish it.
Page display:
What is the link to the Apache JIRA
JIRA: HDDS-11243: SCM SafeModeRule Support EC.
How was this patch tested?
Junit Test
&Production environment validation