
Investigate on & Improve Latency Metrics #46

Open
y-eight opened this issue Dec 18, 2023 · 1 comment
Assignees
Labels
area/checks (Issues/PRs related to Checks), bug (Something isn't working), refactoring (Refactoring of existing code)

Comments

@y-eight
Contributor

y-eight commented Dec 18, 2023

Problem to investigate & solve

Currently, 3 different latency metrics are available.

  • Counter
  • Latency time
  • Histogram

If the health check fails (internally), the latency is reported as 0, and so is the status code.

This might be acceptable for the counter and the latency-time metric, but it is questionable for the histogram: every failed check adds a 0-second observation, so the lowest buckets fill up with values that do not represent real latencies.
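For illustration, a minimal sketch of a recording path that produces this behavior, assuming prometheus/client_golang; the metric and function names are placeholders and the latency-time metric is modelled as a gauge, which may not match sparrow's actual code:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	checksTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sparrow_latency_count"},
		[]string{"target"},
	)
	// The latency-time metric is modelled as a gauge here (assumption).
	latencySeconds = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "sparrow_latency_seconds"},
		[]string{"target"},
	)
	// Default buckets (0.005 … 10s) match the exposition output below.
	latencyHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "sparrow_latency_duration",
			Help: "Latency of targets in seconds",
		},
		[]string{"target"},
	)
)

// record is called for every check result. On an internal failure the
// measured duration is zero, so the histogram still receives a 0s
// observation, which lands in the smallest bucket (le="0.005").
func record(target string, duration time.Duration) {
	checksTotal.WithLabelValues(target).Inc()
	latencySeconds.WithLabelValues(target).Set(duration.Seconds())
	latencyHistogram.WithLabelValues(target).Observe(duration.Seconds())
}
```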

Example with 2 errors and 308 total requests:

# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 288
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 307
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 308
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 120.39378972299998
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 308

As @puffitos stated in #45, we should probably solve this with labelling or a separate set of metrics, e.g. a label for the check's state.
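A rough sketch of the labelling idea, with the label name `status` and its values as assumptions rather than a decided design:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// A "status" label on the histogram lets failed checks be filtered out of
// (or selected in) latency queries. Label name and values are assumptions.
var latencyHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "sparrow_latency_duration",
		Help: "Latency of targets in seconds",
	},
	[]string{"target", "status"},
)

// observe records the check result together with its outcome.
func observe(target string, duration time.Duration, err error) {
	status := "ok"
	if err != nil {
		status = "error"
	}
	latencyHistogram.WithLabelValues(target, status).Observe(duration.Seconds())
}
```

Dashboards could then restrict latency queries to status="ok", at the cost of one extra set of histogram series per target.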

@y-eight added the refactoring (Refactoring of existing code) label Dec 18, 2023
@lvlcn-t added the area/checks (Issues/PRs related to Checks) label Jan 3, 2024
@niklastreml added the bug (Something isn't working) label Jan 12, 2024
@niklastreml
Contributor

Maybe we should create an extra metric for failed requests and move those failed requests there. This would fix the issue with the buckets filling up and also provide an easy way to monitor failed requests.
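A minimal sketch of that approach, again assuming prometheus/client_golang; the failure-counter name is a placeholder:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	latencyHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "sparrow_latency_duration",
			Help: "Latency of targets in seconds",
		},
		[]string{"target"},
	)
	// Counts internally failed checks instead of recording them as latency.
	failedChecks = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sparrow_check_failures_total"},
		[]string{"target"},
	)
)

// record observes latency only for successful checks; failures increment
// the failure counter instead of adding an artificial 0s observation.
func record(target string, duration time.Duration, err error) {
	if err != nil {
		failedChecks.WithLabelValues(target).Inc()
		return
	}
	latencyHistogram.WithLabelValues(target).Observe(duration.Seconds())
}
```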

@niklastreml self-assigned this Jan 12, 2024