Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alerts to notify vertical or horizontal scaling #2866

Conversation

aruniiird
Copy link
Contributor

@aruniiird aruniiird commented Oct 22, 2024

Now CPU usage high alerts are categorized to TWO different sections,
First section: where we have high CPU usage due to high MDS requests rate: at this point we need to scale horzontally by adding more mds pods.
Second section: where we have only CPU usage high: at this point we need to add scale vertically by adding more resources (CPU, memory) to the pods.

@aruniiird aruniiird marked this pull request as draft October 22, 2024 08:36
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2024
@aruniiird
Copy link
Contributor Author

Converting this PR to draft, as we have to add/update the runbooks links at https://github.com/openshift/runbooks repo

@aruniiird
Copy link
Contributor Author

Created a PR: openshift/runbooks#217, to add the new files to the runbooks repo

@aruniiird aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 75b3bc3 to 4afc8a5 Compare November 11, 2024 19:26
@aruniiird aruniiird marked this pull request as ready for review November 13, 2024 08:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2024
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
severity_level: warning
expr: |
(label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit of explanation about the expression?

  • Why are we comparing with 0.67? 1000?
  • Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High request leading to high cpu usage can be helped with offloading the it to multiple mds, and low requests but still high CPU usage might be because of lack of resources. That's what I understand from the alert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@umangachapagain , @weirdwiz , we have made a change where we are not changing the MDSCPUUsage alert expression (except for a minor cosmetic change), but we are changing the description and runbook_url link according to the mds request load.

High request leading to high cpu usage can be helped with offloading the it to multiple mds, and low requests but still high CPU usage might be because of lack of resources. That's what I understand from the alert

@weirdwiz , yes you are absolutely right.

@umangachapagain ,

Why are we comparing with 0.67? 1000?

About the 0.67 not really sure, as this was already there for the existing MDSCPUUsageHigh alert, which we never changed. A logical conclusion I draw here is, if you are using more than ≈70%-ntage of CPU for past 6 hours, then it is considered as a sign of high CPU usage.

Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

Now by keeping 67% as our CPU threshold, the number 1000 was reached during testing, when (approx) 1000 or more requests were hitting the MD server, we saw a gradual CPU usage rise and in a 1hr window frame it reaches the CPU threshold.
That means if the rate of mds-requests is approx 1000 reqs / sec for an hour we see CPU usage crosses 67% threshold.

PS: please see the new changes, here we are not doing much modification to the expression, but making description and runbook_url text changes according to the query (for past 6hrs rate query: rate(ceph_mds_request[6h]))

annotations:
description: |-
Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage.
Please consider increasing the CPU request for the {{ $labels.pod }} pod as described in the runbook.
This may help to process more requests and thus evict more items from cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either remove this statement, or word it with more assurity. "This may help" is not a good response to an alert IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think incedental affects should be consolidated to the runbooks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
severity_level: warning
expr: |
(label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should think about the delta for calculating the rate, are we countering for short bursts of high requests and then silence, or are we looking at the scenario where there is consistently high request rate?

If it's the latter, the delta will work appropriately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are looking for a consistent high requests rate.
Now the delta is brought up to 6 hrs (the time of waiting period). We now moved the mds-request rate query to the annotation part, so that at the time of it being fired (that is after 6hrs) the rate will give appropriate description and runbook_url link. As you have mentioned (about higher the delta lower the jitter/error-rate), through a 6h delta span we should not have unnecessary variances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, so we're considering the full time span, only at the moment the link is displayed to the user
rather than flip flopping. Pretty neat

@aruniiird aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 4afc8a5 to 418a011 Compare November 14, 2024 09:38
Now CPU usage high alerts are categorized to TWO different scenarios,
First scenario: where we have high CPU usage due to high rate of mds
requests coming in:
Solution: at this point we need to scale horizontally
Second section: where we have only CPU usage high:
Solution: at this point we need to add more resources to the existing
mds pods, thus scaling vertically.

Signed-off-by: Arun Kumar Mohan <[email protected]>
@aruniiird aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 418a011 to 377ceb8 Compare November 14, 2024 13:57
@aruniiird
Copy link
Contributor Author

Screenshot of a sample alert to show how description and runbook_url link is shown

Vertical scaling example
image

Horizontal scaling example
image

Copy link
Contributor

@weirdwiz weirdwiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

openshift-ci bot commented Nov 15, 2024

@weirdwiz: changing LGTM is restricted to collaborators

In response to this:

LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2024
Copy link
Contributor

openshift-ci bot commented Nov 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aruniiird, umangachapagain, weirdwiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 15, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 07cbcfa into red-hat-storage:main Nov 15, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants