
Enable out-of-service taint in FAR #92

Merged: 5 commits merged into medik8s:main on Apr 19, 2024

Conversation

@k-keiichi-rh (Contributor) commented Oct 12, 2023

This PR adds a new remediation strategy based on kubernetes/enhancements#1116.

The new remediation strategy with the out-of-service taint works as follows (a hedged sketch of the step 4 taint is shown after the list):

  1. One of the nodes fails.
  2. FAR adds a NoExecute taint to the failed node.
    => Ensures that no workloads run on the failed node after it is rebooted.
  3. FAR reboots the failed node via the fence agent.
    => After the reboot, no stateless workloads remain on the failed node, including those that were not evicted by the taint.
  4. FAR sets the out-of-service taint.
    => This taint expects the node to be in a shutdown or power-off state (not in the middle of restarting).
  5. After the failed node becomes healthy again, the NoExecute taint from step 2 and the out-of-service taint from step 4 are removed, and the node becomes schedulable again.
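For reference, here is a minimal Go sketch of the step 4 taint. The helper name is illustrative (FAR's real taint helpers live in pkg/utils/taints.go); the key/value pair follows the upstream Kubernetes convention for the non-graceful node shutdown feature.

package utils

import (
	corev1 "k8s.io/api/core/v1"
)

// outOfServiceTaint returns the taint applied in step 4 above.
// Key and value follow the upstream convention; kubectl would apply the same
// taint as node.kubernetes.io/out-of-service=nodeshutdown:NoExecute.
func outOfServiceTaint() *corev1.Taint {
	return &corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	}
}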

[ToDo]

ECOPROJECT-1326

@openshift-ci bot (Contributor) commented Oct 12, 2023

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@razo7 (Member) left a comment

Thank you for submitting the PR, and for your first contribution to FAR. This would be a very nice enhancement to FAR.
I left a few small nits, and one unit test failed (it looks like the timeout was too short).

=> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or power-off state when FAR adds the out-of-service taint. The only fencing action FAR currently supports is reboot, which powers the node off and then on again. Therefore, the node won't be in your desired state after the fence agent succeeds.

One more thing to raise is whether we want to add a validation of the Kubernetes version (similar to what SNR does), since the out-of-service taint is fairly new upstream and not supported in older versions.

@k-keiichi-rh (Contributor, Author) commented Oct 17, 2023

=> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or power-off state when FAR adds the out-of-service taint. The only supported fencing action that FAR has is reboot which would do power off and then power on. Therefore, the node won't be in your desired state after the fence agent succeeds.

We can use the same approach as SNR, which we discussed in medik8s/self-node-remediation#17 (comment).

There are the following cases after the reboot action in FAR:

  1. The failed node is rebooted and becomes healthy again.
    => The out-of-service taint doesn't take effect; it is ignored. The node can report its status to the control plane, and the control plane can delete the stateful workloads instead of the out-of-service taint.
  2. The failed node is rebooted but stays unhealthy (the node cannot report its status to the control plane).
    => The out-of-service taint takes effect.
    => The taint triggers deletion of the stateful workloads.
  3. The failed node is not rebooted, because either the power-off or the subsequent power-on failed.
    => The out-of-service taint doesn't take effect because it is never added to the failed node.
    => FAR checks the result of executing the fence agent; if it failed, exponential backoff is triggered (see the sketch after this list).
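To illustrate case 3: in controller-runtime, returning an error from Reconcile requeues the request, and the default per-item rate limiter retries with exponential backoff. A minimal sketch, assuming the reconciler type is named FenceAgentsRemediationReconciler; the helper name is illustrative and not code from this PR.

package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// handleFenceAgentResult is an illustrative helper, not code from this PR.
// When the fence agent fails, returning the error makes controller-runtime
// requeue the request, and the default rate limiter applies exponential
// backoff between retries.
func (r *FenceAgentsRemediationReconciler) handleFenceAgentResult(fenceErr error) (ctrl.Result, error) {
	if fenceErr != nil {
		return ctrl.Result{}, fenceErr
	}
	// Success: remediation continues with the out-of-service taint (step 4).
	return ctrl.Result{}, nil
}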

One more thing to raise/add is whether we want to do a validation test of the Kubernetes version (similar to what SNR does) since the out-of-service taint is fairly new in the community and not supported in old versions.

I will add this topic to my todo list. Thank you for sharing it.
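If such a validation is added, one possible shape is sketched below. The function name and the minimum version (1.26, where the NodeOutOfServiceVolumeDetach feature gate is enabled by default) are assumptions for illustration; SNR's actual check may differ.

package utils

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// checkOutOfServiceTaintSupport queries the API server version and rejects
// clusters older than an assumed minimum of 1.26.
func checkOutOfServiceTaintSupport(config *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return err
	}
	info, err := dc.ServerVersion()
	if err != nil {
		return err
	}
	major, err := strconv.Atoi(info.Major)
	if err != nil {
		return err
	}
	// Minor may carry a "+" suffix on some distributions (e.g. "26+").
	minor, err := strconv.Atoi(strings.TrimSuffix(info.Minor, "+"))
	if err != nil {
		return err
	}
	if major > 1 || (major == 1 && minor >= 26) {
		return nil
	}
	return fmt.Errorf("out-of-service taint requires Kubernetes >= 1.26, found %s.%s", info.Major, info.Minor)
}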

@razo7 (Member) commented Oct 25, 2023

We can use the same approach of SNR we discussed in medik8s/self-node-remediation#17 (comment).

SGTM

@k-keiichi-rh k-keiichi-rh changed the title [WIP] Enable out-of-service taint in FAR Enable out-of-service taint in FAR Oct 26, 2023
@k-keiichi-rh k-keiichi-rh changed the title Enable out-of-service taint in FAR [WIP] Enable out-of-service taint in FAR Oct 26, 2023
@razo7 (Member) left a comment

Thank you for addressing my last comments!

I have added some more comments :) Mostly minor nits on phrasing, a missing log, consts, and simulating the deletion of the pod and VolumeAttachment (VA).
Please add a new commit after each review, so it is easier to see what changed since the last review.

@slintes (Member) commented Oct 28, 2023

=> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or power-off state when FAR adds the out-of-service taint. The only supported fencing action that FAR has is reboot which would do power off and then power on. Therefore, the node won't be in your desired state after the fence agent succeeds.

We can use the same approach of SNR we discussed in here.

I'm not sure the same arguments that were made for SNR apply to FAR. The timing is different:

SNR:

  • remediation starts, node reboot is triggered
  • some time expires
  • node reboot completed
  • some more time expires until safeTimeToAssumeNodeRebooted is reached
  • taint is added only when node is still unhealthy

FAR:

  • remediation starts, node reboot is triggered
  • the taint is added immediately
  • some time expires
  • node reboot completed
  • node might be healthy now but has the taint already. I understand this should be avoided.

@k-keiichi-rh (Contributor, Author) commented:
I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different:

I may not be understanding your point correctly, so please let me confirm, just in case.

SNR:

  • remediation starts, node reboot is triggered
  • some time expires
  • node reboot completed
  • some more time expires until safeTimeToAssumeNodeRebooted is reached
  • taint is added only when node is still unhealthy

In the current OutOfService remediation in SNR, the out-of-service taint is added to the node that becomes healthy after the node reboot. However, the out-of-service taint is deleted right after checking that there is no stateful workload left on the node.

So should we avoid adding the out-of-service taint to the healthy node by checking if the SNR CR is being deleted by NHC/MHC?

FAR:

  • remediation starts, node reboot is triggered
  • the taint is added immediately
  • some time expires
  • node reboot completed
  • node might be healthy now but has the taint already. I understand this should be avoided.

If the node becomes healthy again, the FAR CR is deleted by NHC/MHC, and the recovery action (deleting the out-of-service taint) is executed as part of that. In this case the healthy node won't have the out-of-service taint, so it will come back to the cluster again.
So we can avoid the situation where the healthy node keeps the out-of-service taint after rebooting. (A rough sketch of this cleanup-on-deletion flow follows.)
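A rough sketch of that cleanup-on-deletion flow, assuming the reconciler type is FenceAgentsRemediationReconciler, the module path in the import is as shown, and the FAR CR name matches the node name; removeRemediationTaints is a hypothetical helper, not necessarily what this PR implements.

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/medik8s/fence-agents-remediation/api/v1alpha1"
)

// reconcileDeletion is an illustrative sketch: once NHC/MHC deletes the FAR
// CR because the node is healthy again, skip any further fencing and only
// remove the taints that were added during remediation.
func (r *FenceAgentsRemediationReconciler) reconcileDeletion(ctx context.Context, far *v1alpha1.FenceAgentsRemediation) (ctrl.Result, error) {
	if far.GetDeletionTimestamp() == nil {
		// The CR is not being deleted; normal remediation continues elsewhere.
		return ctrl.Result{}, nil
	}
	// removeRemediationTaints is a hypothetical helper that deletes the
	// NoExecute and out-of-service taints from the node named after the CR.
	if err := r.removeRemediationTaints(ctx, far.Name); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}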

@slintes (Member) commented Nov 2, 2023

I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different:

I may not understand your point correctly. So please let me confirm it just in case.

SNR:

  • remediation starts, node reboot is triggered
  • some time expires
  • node reboot completed
  • some more time expires until safeTimeToAssumeNodeRebooted is reached
  • taint is added only when node is still unhealthy

In the current OutOfService remediation in SNR, the out-of-service taint is added to the node who becomes healthy after the node reboot. However the out-of-service taint is deleted right after checking if there is no stateful workload on the node.

So should we avoid adding the out-of-service taint to the healthy node by checking if the SNR CR is being deleted by NHC/MHC?

I thought we already do this, but just double checked the code, and we don't.
Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
I understood that rebooting is already a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".
Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?

/cc @mshitrit fyi

FAR:

  • remediation starts, node reboot is triggered
  • the taint is added immediately
  • some time expires
  • node reboot completed
  • node might be healthy now but has the taint already. I understand this should be avoided.

If the node becomes healthy again, the FAR CR is deleted by NHC/MHC and the recovery action(deleting the out-of-service taint) is also executed. In this case, the healthy node won't have the out-of-service taint. So the node will come back to the cluster again. So we can avoid the situation that the healthy node will keep having the out-of-service taint after rebooting.

@openshift-ci bot (Contributor) commented Nov 2, 2023

@slintes: GitHub didn't allow me to request PR reviews from the following users: fyi.

Note that only medik8s members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k-keiichi-rh (Contributor, Author) commented:
Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
I understood that rebooting already is a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".

I agree with you. I will apply this to the out-of-service taint remediation and check that the change has no side effects.
As for stopping any further fencing action on the healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane. So we don't need to do anything in SNR.
If yes, does the same apply to the ResourceDeletion remediation as well as the OutOfServiceTaint remediation?

Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?

As far as I have checked the effect of the out-of-service taint, adding the taint is not an issue and has no side effect.

In the "After rebooting" phase of SNR, the failed node has both the normal NoExecute taint and the NoSchedule taint, and we expect that there are no stateful workloads on the node. So the out-of-service taint won't do anything.

@slintes (Member) commented Nov 3, 2023

Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
I understood that rebooting already is a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".

I agree with you. I will apply this to the out-of-service taint remediation and check that the change has no side effects. As for stopping any further fencing action on the healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane. So we don't need to do anything in SNR. If yes, does the same apply to the ResourceDeletion remediation as well as the OutOfServiceTaint remediation?

  • Not sure I understand: what do you mean by "the control-plane should handle the fencing action"?
  • I think we need to do some "cleanup" in SNR, e.g. removing taints which were already set in the pre-reboot phase. Maybe we can just switch to the fencing-completed phase directly; it should do everything we need for cleanup?
  • Yes, I think the same applies for the ResourceDeletion strategy.
  • Before changing anything, we should wait for a comment from @mshitrit.
  • I will create an issue for SNR to have the discussion at the right place 🙂

Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?

As far as I checked the effect of the out-of-service taint, putting the taint is not an issue and has no side effect.

In the "After rebooting" phase of SNR, the failed node has both the normal NoExecute taint and the NoSchedule taint and we expect that there are no stateful workloads on the node. So the out-of-service taint won't do anything.

Ok, then my concerns were wrong, and that makes the SNR topic much less urgent. Sorry for the noise and thanks for the discussion!

@slintes (Member) commented Nov 3, 2023

For the SNR-related discussion, let's continue here: medik8s/self-node-remediation#159

@k-keiichi-rh (Contributor, Author) commented:
I have added some more comments :) Mostly minor nits on phrasing, missing log, consts, and simulating the deletion of pod and VA. Please add a new commit after each review so it will be easier to review the changes between the commits and the last review.

@razo7 Thank you for taking the time to review.
I have addressed your comments. Please take a look.

By the way, are my replies to your comments visible?
My replies are noted with the "Pending" tag.

@razo7 (Member) commented Nov 5, 2023

By the way, are my replies to your comments visible?
My replies are noted with the "Pending" tag.

No, I can't see your replies since you haven't submitted your review. Please see here for how to submit them.

@k-keiichi-rh (Contributor, Author) left a comment

Sorry, I had not submitted my review.
I have now addressed all your comments and have no questions about them.

@slintes (Member) left a comment

Mostly straightforward PR :) Needs rebase though.
Some comments inline.

@@ -72,3 +72,31 @@ func DeleteResources(ctx context.Context, r client.Client, nodeName string) error

return nil
}

func IsResourceDeletionCompleted(r client.Client, nodeName string) bool {
Member comment:

please pass a Context to this function and use it in the API calls, similar to the function above

Author reply:

I will do this.

pods := &corev1.PodList{}
if err := r.List(context.Background(), pods); err != nil {
	log.Error(err, "failed to get pod list")
	return false
Member comment:

Would it make sense to return an error here, to be able to differentiate between "something went wrong" and "pods not deleted yet" where this function is called?

Author reply:

I agree. I will change it.
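Putting both suggestions together, the function might end up looking roughly like this. This is a sketch under the assumption that it checks terminating pods and VolumeAttachments on the given node; the exact shape in the PR may differ.

package utils

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// IsResourceDeletionCompleted reports whether all terminating pods and
// VolumeAttachments on the given node are gone. It takes a Context for the
// API calls and returns an error, so callers can tell "something went wrong"
// apart from "resources not deleted yet".
func IsResourceDeletionCompleted(ctx context.Context, r client.Client, nodeName string) (bool, error) {
	pods := &corev1.PodList{}
	if err := r.List(ctx, pods); err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName == nodeName && pod.DeletionTimestamp != nil {
			// A pod on the node is still terminating.
			return false, nil
		}
	}
	vas := &storagev1.VolumeAttachmentList{}
	if err := r.List(ctx, vas); err != nil {
		return false, err
	}
	for _, va := range vas.Items {
		if va.Spec.NodeName == nodeName {
			// A VolumeAttachment is still bound to the node.
			return false, nil
		}
	}
	return true, nil
}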

// remove out-of-service taint when using OutOfServiceTaint remediation
if far.Spec.RemediationStrategy == v1alpha1.OutOfServiceTaintRemediationStrategy {
	r.Log.Info("Removing OutOfService taint", "Fence Agent", far.Spec.Agent, "Node Name", req.Name)
	if !utils.IsResourceDeletionCompleted(r.Client, req.Name) {
Member comment:

Do we really need this check? Are we sure that all Pods always get the DeletionTimestamp? What about Pods which tolerate the taint?

Author reply:

Do we really need this check?

I think the ResourceDeletionRemediationStrategy tries to forcefully delete all of the pods explicitly, and we have a way to confirm the terminating pods are deleted by checking the result of the deletion. So there I am 100% sure we don't need this check.

However, in OutOfServiceTaintRemediationStrategy, I have 1% doubt about whether the terminating pods are deleted.
If NHC identifies that the node has become healthy, the control-plane or kubelet deletes the terminating pods. So we can expect there to be no terminating pod at this stage and may not need this check.
However, we can't control the behavior of the control-plane or kubelet, and compared to the ResourceDeletionRemediationStrategy we can only expect, indirectly, that the terminating pods are deleted by them.

This remaining 1% was the reason why I thought we need this check.
But I may be thinking about it too much, so I will drop this change.

Are we sure that always all Pods get the DeletionTimestamp?
What about Pods which tolerate the taint?

The current out-of-service taint focuses only on the terminating pods, i.e. those which have the DeletionTimestamp, to enable workloads to fail over to another node. If we can confirm that there is no terminating pod, it means we can move all workloads on the failed node to another node. If we cannot, we need to improve the out-of-service taint code in k8s.

@slintes (Member) left a comment

will review the e2e test tomorrow, 2 comments inline

@k-keiichi-rh force-pushed the ecoproject-1326 branch 2 times, most recently from 24fe752 to 9fe49ca on April 9, 2024 14:56
@k-keiichi-rh (Contributor, Author) commented:
@slintes Thank you for the comments again. I have addressed them.
All of the changes since my last commit are here: https://github.com/k-keiichi-rh/fence-agents-remediation/commits/ecoproject-1326-with-review/

@slintes (Member) commented Apr 9, 2024

@k-keiichi-rh fyi, we have a CI outage at the moment, e2e tests are expected to fail until further notice 🙁

@slintes (Member) commented Apr 11, 2024

/test all

@slintes (Member) left a comment

CI is working again.
I left one remark in the reconcile code, and there is an issue in the e2e test.
Besides that, lgtm :)

@slintes (Member) commented Apr 12, 2024

There is duplicated code in the e2e test, but we will clean it up in a follow-up in order to get this in for the next release...

@razo7 (Member) commented Apr 18, 2024

/retest

@mshitrit (Member) commented:
/lgtm
/hold
Since we are past Code Freeze, waiting for a QE green light before merging.

@mshitrit (Member) commented:
/test 4.15-openshift-e2e

@mshitrit (Member) commented:
/test 4.14-openshift-e2e

@k-keiichi-rh (Contributor, Author) commented:
/test 4.15-openshift-e2e

@frajamomo commented:
/lgtm

[two screenshots attached]

@openshift-ci bot (Contributor) commented Apr 19, 2024

@frajamomo: changing LGTM is restricted to collaborators

In response to this:

/lgtm


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit (Member) commented:
/unhold

@mshitrit merged commit 7f4a492 into medik8s:main on Apr 19, 2024
22 checks passed
6 participants