The Node Healthcheck operator is configured by creating one or more NHC custom resources. An example CR for remediating worker nodes with the Self Node Remediation operator looks like this:
```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-snr-worker
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  remediationTemplate: # Note: mutually exclusive with escalatingRemediations
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: <SNR namespace>
    name: self-node-remediation-automatic-strategy-template
  escalatingRemediations: # Note: mutually exclusive with remediationTemplate
    - remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        namespace: <SNR namespace>
        name: self-node-remediation-automatic-strategy-template
      order: 1
      timeout: 300s
    # Note: the remediator below is an example only, it doesn't exist
    - remediationTemplate:
        apiVersion: reprovision.example.com/v1
        kind: ReprovisionRemediationTemplate
        namespace: example
        name: reprovision-remediation-template
      order: 2
      timeout: 30m
  minHealthy: "51%"
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
```
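If the CR is saved to a file, it can be created like any other resource (the filename below is arbitrary):

```shell
oc apply -f nhc-snr-worker.yaml
```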
| Field | Mandatory | Default Value | Description |
|---|---|---|---|
| selector | yes | n/a | A LabelSelector for selecting nodes to observe. See details below. |
| remediationTemplate | yes, but mutually exclusive with escalatingRemediations | n/a | An ObjectReference to a remediation template provided by a remediation provider. See details below. |
| escalatingRemediations | yes, but mutually exclusive with remediationTemplate | n/a | A list of ObjectReferences to remediation templates, each with an order and a timeout. See details below. |
| minHealthy | no | 51% | The minimum number of healthy nodes selected by this CR for allowing further remediation. Percentage or absolute number. |
| pauseRequests | no | n/a | A list of strings. See details below. |
| unhealthyConditions | no | [{type: Ready, status: False, duration: 300s}, {type: Ready, status: Unknown, duration: 300s}] | A list of UnhealthyConditions, which define node unhealthiness. See details below. |
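As a brief illustration, minHealthy can also be given as an absolute number of nodes instead of the percentage used in the example CR above (the value here is arbitrary):

```yaml
minHealthy: 3
```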
The selector selects the nodes which should be observed. For its syntax, have a look at the LabelSelector docs.
A common selector for worker nodes looks like this:
```yaml
selector:
  matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
```
The reason for excluding `master` and `control-plane` nodes instead of selecting `worker` nodes is that this also prevents potentially unwanted remediation of control plane nodes in compact clusters, where nodes have both roles.
For remediating control plane nodes use this:
```yaml
selector:
  matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
```
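Any valid LabelSelector can be used. For example, for observing only an explicitly labeled subset of nodes, a matchLabels based selector would work as well (the label key below is made up for illustration):

```yaml
selector:
  matchLabels:
    example.com/observed-by-nhc: "true"
```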
Note
On older clusters you have to use `master` in the above example. You can't use both `master` and `control-plane`, because the expressions are evaluated with a logical "AND".
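On such older clusters, the control plane selector from above accordingly looks like this:

```yaml
selector:
  matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: Exists
```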
Warning
- Having a configuration which selects both worker and control plane nodes is strongly discouraged, because control plane nodes have special handling in NHC and potentially in remediators!
- Multiple configurations must not select an overlapping node set! This can lead to unwanted remediations.
The remediation template is an ObjectReference to a remediation template provided by a remediation provider. Mandatory fields are `apiVersion`, `kind`, `name` and `namespace`.
Note
This field is mutually exclusive with `spec.escalatingRemediations`.
Note that some remediators work with the template being created in any namespace, others require it to be in their installation namespace.
Also, some remediators install a remediation template by default, which can be used by NHC. This is e.g. the case for the SelfNodeRemediation remediator. The example CR above shows how to use its automatic strategy template.
For other remediators you might need to create a template manually. Please check their documentation for details.
For more details on the remediation template, and the remediation CRs created by NHC based on the template, see below.
EscalatingRemediations is a list of RemediationTemplates with an order and timeout field. Instead of creating just one remediation CR and waiting indefinitely for the node to get healthy, using this field offers the ability to try multiple remediators one after another.

The `order` field determines the order in which the remediations are invoked (lower order = earlier invocation). The `timeout` field determines when the next remediation is invoked.
There are optional features available when using escalating remediations:
- When running into a timeout, NHC signals this to the remediator by adding a "remediation.medik8s.io/nhc-timed-out" annotation to the remediation CR. The remediator can use this to cancel its efforts.
- The other way around, when the remediator fails to remediate the node, it can set a status condition of type "Succeeded" with status "False" on the remediation CR. NHC will then try the next remediator without waiting for the configured timeout to occur. Both mechanisms are sketched below.
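For illustration only, here is how a remediation CR might look with both mechanisms visible at once (the kind, names, annotation value and condition reason are hypothetical, and in practice only one of the two would apply at a time):

```yaml
apiVersion: remediator.company.io/v1
kind: MyRemediation
metadata:
  name: unhealthy-node-name
  namespace: test-namespace
  annotations:
    # added by NHC when the configured timeout was reached;
    # the value shown here is illustrative
    remediation.medik8s.io/nhc-timed-out: "true"
status:
  conditions:
    # set by the remediator when it gave up early;
    # NHC then escalates without waiting for the timeout
    - type: Succeeded
      status: "False"
      reason: RemediationFailed # hypothetical reason
```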
Note
- This field is mutually exclusive with `spec.remediationTemplate`
- All other notes about remediation templates made above apply here as well
This is a list of conditions for identifying unhealthy nodes. Each condition has a mandatory type, status and duration. Type and status are compared with the node's status conditions. When they match for the time defined in duration, remediation will start. The list entries are evaluated with a logical "OR".
Typically, the Ready condition is used, and the node is considered unhealthy when the condition status is "False" (quoted because it needs to be a string value) or Unknown for some time. This is also the default, which is set when the field is empty at CR creation:
```yaml
unhealthyConditions:
  - type: Ready
    status: "False"
    duration: 300s
  - type: Ready
    status: Unknown
    duration: 300s
```
Warning
Be careful with the value of the duration field. While it's a common desire to remediate as quickly as possible, a too low duration can trigger unneeded remediation (which typically means a reboot) in case the cluster only has a short "hiccup", or when the node needs a longer time on startup to get healthy. For finding the best value, you need to consider the host's reboot time, the startup time of the Kubernetes components and user workloads, and the downtime tolerance of the user workloads.
When `pauseRequests` has at least one value set, no new remediation will be started, while ongoing remediations keep running.
It's recommended to use descriptive pause reasons like "performing cluster upgrade".
Updating `pauseRequests` on the command line works like this:
```shell
oc patch nhc/<name> --patch '{"spec":{"pauseRequests":["pause for cluster upgrade by @admin"]}}' --type=merge
```
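To resume remediation, clear the list again. With a JSON merge patch, setting the field to null removes it:

```shell
oc patch nhc/<name> --patch '{"spec":{"pauseRequests":null}}' --type=merge
```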
The status section of the NodeHealthCheck custom resource provides detailed information about what the operator is doing. It contains these fields:
| Field | Description |
|---|---|
| observedNodes | The number of nodes observed according to the selector. |
| healthyNodes | The number of observed healthy nodes. |
| inFlightRemediations | **DEPRECATED** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes. |
| unhealthyNodes | A list of unhealthy nodes and their remediations. See details below. |
| conditions | A list of conditions representing NHC's current state. Currently the only type used is "Disabled"; it is true when the controller detects problems which prevent it from working correctly. See the workflow page for further information. |
| phase | A short, human-readable representation of NHC's current state. Known phases are Disabled, Paused, Remediating and Enabled. |
| reason | A longer, human-readable explanation of the phase. |
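For a quick check on the command line, single status fields can be read with a jsonpath query, e.g.:

```shell
oc get nhc/<name> -o jsonpath='{.status.phase}'
```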
The `unhealthyNodes` status field holds structured data for keeping track of ongoing remediations. When a node has recovered and is healthy again, the status will be cleaned up. This replaces the deprecated `inFlightRemediations` field.
An example:
```yaml
status:
  # skip other fields here...
  unhealthyNodes:
    - name: unhealthy-node-name
      remediations:
        - resource:
            apiVersion: self-node-remediation.medik8s.io/v1alpha1
            kind: SelfNodeRemediation
            namespace: <SNR namespace>
            name: unhealthy-node-name
            uid: abcd-1234...
          started: 2023-03-20T15:05:05+01:00
          timedOut: 2023-03-20T15:10:05+01:00 # timed out
        # when using `escalatingRemediations`, the next remediator will be appended:
        - resource:
            apiVersion: reprovision.example.com/v1
            kind: ReprovisionRemediation
            namespace: example
            name: unhealthy-node-name
            uid: bcde-2345...
          started: 2023-03-20T15:10:07+01:00
          # no timeout set: ongoing remediation
```
There are two kinds of remediation resources involved:
- at least one remediation template CR, which needs to exist when configuring an NHC CR.
- the remediation CRs created by NHC for unhealthy nodes, which are processed by the remediator providing the related CRDs.
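Both kinds can be inspected with the usual tools. For the Self Node Remediation example above that would be, e.g. (the resource names depend on the remediator):

```shell
oc get selfnoderemediationtemplates --all-namespaces
oc get selfnoderemediations --all-namespaces
```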
As mentioned above, the NHC CR references one or more remediation templates. While NHC does not know the template's CRD, it expects the CR to have a specific structure, which looks like this:
```yaml
apiVersion: remediator.company.io/v1
kind: MyRemediationTemplate
metadata:
  name: test-name
  namespace: test-namespace
spec:
  template:
    spec: {}
```
Note

- `kind` must have a "Template" suffix.
- `spec` must contain the nested `template.spec` fields. The inner spec can be empty as in the above example, or have any content like here:
```yaml
spec:
  template:
    spec:
      strategy: reboot
      timeout: 5m
      extraParams:
        foo: bar
        importantNumber: 42
```
When NHC detects an unhealthy node, it will create a CR based on this template, following these steps:
- same `apiVersion`
- same `kind`, but with the "Template" suffix stripped
- same `namespace`
- the name will be the unhealthy node's name
- the spec will be a copy of `spec.template.spec`
- an owner reference will be set to the NHC CR
- another owner reference will be set to the node's machine if available (currently on OKD and OpenShift only)
For the above template, a remediation CR will look like this:
apiVersion: remediator.company.io/v1
kind: MyRemediation
metadata:
name: unhealthy-node-name
namespace: test-namespace
ownerReferences:
- kind: NodeHealthCheck
apiVersion: remediation.medik8s.io/v1alpha1
name: nhc-snr-worker
uid: some-uid
spec:
strategy: reboot
timeout: 5m
In order to allow NHC to read template CRs, and to create/read/update/delete remediation CRs, it's recommended to use role aggregation. For this the remediator needs to create a ClusterRole with the needed rules, and label it with `rbac.ext-remediation/aggregate-to-ext-remediation: "true"`.
Example:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    rbac.ext-remediation/aggregate-to-ext-remediation: "true"
  name: my-aggregated-remediation-role
rules:
  - apiGroups:
      - remediator.company.io
    resources:
      - myremediationtemplates
    verbs:
      - get
  - apiGroups:
      - remediator.company.io
    resources:
      - myremediations
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
```
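Whether NHC actually received the aggregated permissions can be verified with an impersonated access check, for example (the service account is a placeholder, it depends on how and where NHC is deployed):

```shell
oc auth can-i create myremediations.remediator.company.io \
  --as=system:serviceaccount:<NHC namespace>:<NHC service account>
```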