RFC: health endpoint #46
Signed-off-by: Alexander Bakardzhiev <[email protected]>
@alexbakar I think it might help you to elaborate on why this is beneficial beyond what
to determine the state of every part of it. So it would be nice if the parts of
this system have a common way to return their health/availability status.

Currently there is no easy way for external (monitoring) system to understand
Adding some more context to this for those outside the team:
The way that we (the Concourse team itself) have been dealing with the "is this installation good or not?" question has been mostly through the metrics that we expose, and through SLIs, which verify healthiness by sending specific workloads to the installation that attest, end to end, whether it's working properly. Two examples: https://github.com/concourse/oxygen-mask, which relies on another Concourse installation sending probes to the first, and https://github.com/cirocosta/slirunner, which doesn't require another installation but is k8s- and Prometheus-first.
What I like about this kind of approach is that you can keep it external to the main system and thus build the integrations however you want (e.g., slirunner exposes the information through Prometheus, but if you'd prefer to have an endpoint being hit when things fail, you could build that too, all while still leveraging "the core concourse" under the hood). It also highlights which kinds of user workflows are currently broken / degraded 🤔
thanks for submitting! left some questions / thoughts 😁
# Proposal

Concourse can expose a public HTTP endpoint called "health" endpoint that gives
Would you mind elaborating a bit more on what you think the detection of "healthiness" would look like?
Also, do you think this endpoint should be public (in the sense of requiring no credentials at all) and display per-node information? I think a problem there is that it'd conflict with the current auth requirements that we have for someone reading /api/v1/workers.
Thank you guys for the feedback. It's very useful.
@alexbakar we (the Concourse team) are focusing on paying more attention to RFCs and shepherding them to some form of resolution. I was wondering if you've had the chance to put any more thought into this/whether you're still interested in this topic?
Hi there, I'd like to drop my thoughts here on this topic... 😄 I see two kinds of monitoring for two kinds of installations.
For the first option, of course you'd want a full-on monitoring system like Prometheus, with alerts etc., that executes specific workloads to test the various components and retrieves metrics. However, for the second (and other) case(s), such a monitoring system would be overkill. Maybe you'd want to use a simple "ping" or "http content" check alerting tool (think something like https://github.com/iloire/WatchMen). For such an alerting tool, an equally simple health check would be greatly appreciated. A clear endpoint for health/status checks:
If something were amiss with the workers, for example, an admin could then zoom in by checking
Just some thoughts...
Adding a proposal for a basic health endpoint for a Concourse cluster.