Enable out-of-service taint in FAR #92
Conversation
Hi @k-keiichi-rh. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for submitting the PR, and for your first contribution to FAR. This one would be a very nice enhancement to FAR.
Left some small nits, and one unit test failed (it seems the timeout was too short).
=> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).
Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or power-off state when FAR adds the out-of-service taint. The only supported fencing action that FAR has is reboot,
which powers the node off and then back on. Therefore, the node won't be in your desired state after the fence agent succeeds.
One more thing to raise/add is whether we want to do a validation test of the Kubernetes version (similar to what SNR does), since the out-of-service taint is fairly new in the community and not supported in old versions.
We can use the same approach as SNR that we discussed here. There are the following cases after the
I will add this topic to my todo list. Thank you for sharing it.
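For reference, a version gate similar to what SNR does could look roughly like the sketch below. The helper name, package, and the minimum version ("1.26") are assumptions for illustration, not the actual SNR or FAR code.

```go
package utils

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/discovery"
)

// minK8sVersionForOutOfServiceTaint is an assumed minimum version; the real
// requirement should be taken from the upstream feature's release notes.
var minK8sVersionForOutOfServiceTaint = version.MustParseGeneric("1.26")

// IsOutOfServiceTaintSupported checks whether the cluster's Kubernetes version
// is new enough for the out-of-service taint (hypothetical helper).
func IsOutOfServiceTaintSupported(dc discovery.DiscoveryInterface) (bool, error) {
	info, err := dc.ServerVersion()
	if err != nil {
		return false, fmt.Errorf("failed to get server version: %w", err)
	}
	v, err := version.ParseGeneric(info.GitVersion)
	if err != nil {
		return false, fmt.Errorf("failed to parse server version %q: %w", info.GitVersion, err)
	}
	return v.AtLeast(minK8sVersionForOutOfServiceTaint), nil
}
```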
SGTM
Force-pushed from c2e03f9 to 9e34b6a.
Thank you for addressing my last comments!
I have added some more comments :) Mostly minor nits on phrasing, a missing log, consts, and simulating the deletion of the pod and VA.
Please add a new commit after each review, so it will be easier to review the changes between the last review and the new commits.
I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different: SNR:
FAR:
I may not understand your point correctly, so please let me confirm it just in case.
In the current OutOfService remediation in SNR, the out-of-service taint is added to the node that becomes healthy after the node reboot. However, the out-of-service taint is deleted right after checking that there are no stateful workloads on the node. So should we avoid adding the out-of-service taint to the healthy node by checking whether the SNR CR is being deleted by NHC/MHC?
If the node becomes healthy again, the FAR CR is deleted by NHC/MHC and the recovery action (deleting the out-of-service taint) is also executed. In this case, the healthy node won't have the out-of-service taint, so the node will come back to the cluster again.
I thought we already do this, but just double checked the code, and we don't. /cc @mshitrit fyi
@slintes: GitHub didn't allow me to request PR reviews from the following users: fyi. Note that only medik8s members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I agree with you. I will apply this fix to the out-of-service taint remediation and check that the change has no side effects.
As far as I have checked the effect of the out-of-service taint, putting the taint is not an issue and has no side effect. In the "After rebooting" phase of SNR, the failed node has both the normal NoExecute taint and the NoSchedule taint, and we expect that there are no stateful workloads on the node. So the out-of-service taint won't do anything.
Ok, then my concerns were wrong, and that makes the SNR topic much less urgent. Sorry for the noise and thanks for the discussion!
For the SNR-related discussion, let's continue here: medik8s/self-node-remediation#159
Force-pushed from 9e34b6a to fd8b075.
@razo7 Thank you for taking the time to review. By the way, are my replies to your comments visible?
No, I can't see your replies since you haven't submitted your review. Please see here for how to submit them.
Sorry that I had not submitted my review.
I have addressed all your comments, and I have no questions about them.
Mostly straightforward PR :) Needs rebase though.
Some comments inline.
pkg/utils/resources.go (outdated)
@@ -72,3 +72,31 @@ func DeleteResources(ctx context.Context, r client.Client, nodeName string) error {
	return nil
}

func IsResourceDeletionCompleted(r client.Client, nodeName string) bool {
Please pass a Context to this function and use it in the API calls, similar to the function above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do this.
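A minimal sketch of what the suggested change might look like, based on the snippet quoted above (not the final code):

```go
// Sketch: IsResourceDeletionCompleted takes a Context and passes it to the API
// calls, like DeleteResources above, instead of using context.Background().
func IsResourceDeletionCompleted(ctx context.Context, r client.Client, nodeName string) bool {
	pods := &corev1.PodList{}
	if err := r.List(ctx, pods); err != nil {
		log.Error(err, "failed to get pod list")
		return false
	}
	// ... the remaining checks (e.g. VolumeAttachments) would use ctx as well
	return true
}
```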
pkg/utils/resources.go
Outdated
pods := &corev1.PodList{} | ||
if err := r.List(context.Background(), pods); err != nil { | ||
log.Error(err, "failed to get pod list") | ||
return false |
Would it make sense to return an error here, to be able to differentiate between "something went wrong" and "pods not deleted yet" where this function is called?
I agree. I will change it.
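If the helper returned an error, the caller in the reconcile loop could tell the two cases apart roughly as in the sketch below; the helper signature, receiver type, import path, and requeue interval are assumptions for illustration, not the actual FAR code.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/medik8s/fence-agents-remediation/pkg/utils"
)

// checkResourceDeletion is a hypothetical helper showing how the reconciler could
// distinguish "something went wrong" (err != nil) from "pods not deleted yet" (!completed).
func (r *FenceAgentsRemediationReconciler) checkResourceDeletion(ctx context.Context, nodeName string) (ctrl.Result, error) {
	completed, err := utils.IsResourceDeletionCompleted(ctx, r.Client, nodeName)
	if err != nil {
		// something went wrong - return the error so the request is retried with backoff
		return ctrl.Result{}, err
	}
	if !completed {
		// pods or volume attachments are not deleted yet - requeue and re-check later
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// all tracked resources are gone - safe to remove the out-of-service taint now
	return ctrl.Result{}, nil
}
```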
	// remove out-of-service taint when using OutOfServiceTaint remediation
	if far.Spec.RemediationStrategy == v1alpha1.OutOfServiceTaintRemediationStrategy {
		r.Log.Info("Removing OutOfService taint", "Fence Agent", far.Spec.Agent, "Node Name", req.Name)
		if !utils.IsResourceDeletionCompleted(r.Client, req.Name) {
Do we really need this check? Are we sure that always all Pods get the DeletionTimestamp? What about Pods which tolerate the taint?
Do we really need this check?
I think the ResourceDeletionRemediationStrategy explicitly and forcefully deletes all of the pods, so we have a way to confirm that the terminating pods are deleted by checking the result of the deletion. For that strategy, I am 100% sure we don't need this check.
However, with OutOfServiceTaintRemediationStrategy, I have a 1% doubt about whether the terminating pods are deleted.
If NHC identifies that the node has become healthy, the control plane or kubelet deletes the terminating pods. So we can expect that there are no terminating pods at this stage, and we may not need this check.
However, we can't control the behavior of the control plane or kubelet, and we can only expect that the terminating pods are deleted indirectly by them, compared to the ResourceDeletionRemediationStrategy.
This remaining 1% was the reason I thought we needed this check.
But I may be overthinking it, so I will drop this change.
Are we sure that always all Pods get the DeletionTimestamp?
What about Pods which tolerate the taint?
The current out-of-service taint focuses only on the terminating pods which have a DeletionTimestamp, in order to enable workloads to fail over to another node. If we can confirm that there are no terminating pods, it means we can move all workloads on the failed node to another node. If we cannot, we need to improve the out-of-service taint code in k8s.
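To make that 1% concrete, the kind of check under discussion might look roughly like the sketch below: it only looks at pods bound to the failed node that still carry a DeletionTimestamp. The helper name and error handling are assumptions, not the exact code in this PR.

```go
package utils

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasTerminatingPodsOnNode reports whether any pod scheduled to nodeName is still
// terminating (DeletionTimestamp set but the pod object not yet removed).
// Hypothetical helper for illustration only.
func hasTerminatingPodsOnNode(ctx context.Context, r client.Client, nodeName string) (bool, error) {
	pods := &corev1.PodList{}
	if err := r.List(ctx, pods); err != nil {
		return false, fmt.Errorf("failed to list pods: %w", err)
	}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName == nodeName && pod.DeletionTimestamp != nil {
			// still terminating - the control plane has not cleaned it up yet
			return true, nil
		}
	}
	return false, nil
}
```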
Force-pushed from 908f415 to 92ac6a1.
will review the e2e test tomorrow, 2 comments inline
Force-pushed from 24fe752 to 9fe49ca.
@slintes Thank you again for the comments. I have addressed them.
@k-keiichi-rh fyi, we have a CI outage at the moment, e2e tests are expected to fail until further notice 🙁
/test all
CI is working again.
I left one remark in the reconcile code, and there is an issue in the e2e test.
Besides that, lgtm :)
There is duplicated code in the e2e test, but we will clean it up in a follow-up in order to get this in for the next release...
Force-pushed from 9fe49ca to 45d1fb9.
/retest
/lgtm
/test 4.15-openshift-e2e
/test 4.14-openshift-e2e
/test 4.15-openshift-e2e
@frajamomo: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unhold
This PR adds a new remediation strategy based on kubernetes/enhancements#1116.
The following is the new remediation strategy for the out-of-service taint (the taint itself is sketched after this list):
=> Ensure that no workloads are executed after rebooting the failed node
=> After rebooting, there are no stateless workloads left on the failed node which were not evicted by the taint
=> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).
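For reference, the out-of-service taint this strategy relies on is the one defined by the upstream non-graceful node shutdown feature. A sketch of how it is typically expressed in Go follows; the exact value and effect FAR applies may differ.

```go
import (
	corev1 "k8s.io/api/core/v1"
)

// outOfServiceTaint follows the upstream non-graceful node shutdown feature;
// the exact value/effect used by FAR may differ (illustrative sketch only).
var outOfServiceTaint = &corev1.Taint{
	Key:    "node.kubernetes.io/out-of-service",
	Value:  "nodeshutdown",
	Effect: corev1.TaintEffectNoExecute,
}
```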
[ToDo]
ECOPROJECT-1326