
RFC: health endpoint #46

Open
wants to merge 2 commits into master

Conversation

@alexbakar

Adding a proposal for a basic health endpoint for a Concourse cluster.

Signed-off-by: Alexander Bakardzhiev <[email protected]>
@ari-becker

@alexbakar I think it might help you to elaborate on why this is beneficial beyond what /api/v1/info already provides (which is what's currently used in the Helm chart liveness/readiness probes).
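
For reference, /api/v1/info returns a small JSON document along these lines (the values here are illustrative, not from a real deployment):

{
  "version": "7.4.0",
  "worker_version": "2.3",
  "external_url": "https://ci.example.com"
}

The chart's liveness/readiness probes effectively just check that this endpoint responds successfully.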

@cirocosta (Member) commented on this part of the RFC:

    Currently there is no easy way for an external (monitoring) system to
    determine the state of every part of it. So it would be nice if the parts
    of this system have a common way to return their health/availability status.

Adding some more context to this for those outside the team:

the way that we (the Concourse team itself) have been dealing with the "is this installation good or not?" question has mostly been through the metrics that we expose, and through SLIs: https://github.com/concourse/oxygen-mask (which relies on another Concourse installation sending probes to the first) and https://github.com/cirocosta/slirunner (which doesn't require another installation, but is k8s- and Prometheus-first). Both take the approach of verifying healthiness by sending specific workloads to the installation, attesting in an end-to-end fashion whether it's properly working or not.

what I like about this kind of approach is that you can keep it external to the main system, and thus build the integrations to it however you want (e.g., slirunner exposes the information through Prometheus, but if you'd prefer to have an endpoint hit when things fail, you could build that too, all while still leveraging "the core Concourse" under the hood), as well as highlighting which kinds of user workflows are currently broken / degraded 🤔

@cirocosta (Member) left a comment:

thanks for submitting! left some questions / thoughts 😁


@cirocosta (Member) commented on this part of the RFC:

    # Proposal

    Concourse can expose a public HTTP endpoint called "health" endpoint that gives

Would you mind elaborating a bit more on how you see the detection of "healthiness" working?

Also, do you think this endpoint should be public (in the sense of requiring no credentials at all) and have per-node information being displayed? I think a problem there is that it'd conflict with the current auth requirements that we have for someone reading /api/v1/workers.

@alexbakar (Author)

Thank you guys for the feedback. It's very useful.
@ari-becker AFAIK /api/v1/info provides information about the Concourse version, the workers' version, and the external URL. My idea was to have a way to also get the overall status of the nodes. And since the endpoint is meant to be public (no authentication required), this information could only be very basic: db/web/workers and their status.
@cirocosta Thanks, the explanation really makes sense. I agree that the proposition I made is probably already covered by /api/v1/workers (as there is plenty of information about the workers, including their state) and conflicts with the auth requirements for it.
I also like the idea of verifying the healthiness by sending specific workloads to it that would attest in an end-to-end fashion (slirunner).
I have to think about these topics and reconsider my proposal. I will send my comments then.

@aoldershaw

@alexbakar we (the Concourse team) are focusing on paying more attention to RFCs and shepherding them to some form of resolution. I was wondering if you've had the chance to put any more thought into this, and whether you're still interested in this topic?

@mvdkleijn

Hi there,

I'd like to drop my thoughts here on this topic... 😄 I see two kinds of monitoring for two kinds of installations.

  1. A full-on, 8000 user, 800 pipeline, pedal to the metal, HA, 99.9999% uptime critical Concourse installation. (exaggeration on purpose 😉 )
  2. A simple, small, quick-n-dirty, 3 user, 5 pipeline, 98.0% uptime Concourse install.

For the first option, of course you'd want a full-on monitoring system like Prometheus, with alerts etc., that executes specific workloads to test the various components and retrieves metrics.

However, for the second (and other) case(s), such a monitoring system would be overkill. Maybe you'd want to use a simple "ping" or "http content" check alerting tool (think something like https://github.com/iloire/WatchMen).

For such an alerting tool, an equally simple health check would be greatly appreciated.

A clear endpoint for health/status checks: /api/v1/health or /api/v1/status
A clear HTTP response code: 200 for all OKs, 503 in case one item is not OK, 500 if none are OK.
A clear and simple JSON response like:

{
  "atc": "ok",
  "db": "ok",
  "workers": "ok",
  "timestamp": 1619710516
}

If something were amiss with the workers for example, an admin could then zoom in by checking /api/v1/workers.
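
To make the semantics concrete, here's a rough sketch of such a handler (Go; the check functions and names are hypothetical, not existing Concourse code):

package health

import (
	"encoding/json"
	"net/http"
	"time"
)

// healthHandler builds a handler from a set of named component checks
// (e.g. "atc", "db", "workers" - each a hypothetical probe function).
func healthHandler(checks map[string]func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Collect per-component status plus a timestamp, mirroring
		// the JSON shape proposed above.
		body := map[string]interface{}{"timestamp": time.Now().Unix()}
		ok := 0
		for name, check := range checks {
			if check() {
				body[name] = "ok"
				ok++
			} else {
				body[name] = "failing"
			}
		}
		// Map the overall state to the proposed status codes:
		// 200 if all are ok, 503 if some are not, 500 if none are.
		status := http.StatusOK
		switch {
		case ok == 0:
			status = http.StatusInternalServerError
		case ok < len(checks):
			status = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		json.NewEncoder(w).Encode(body)
	}
}

A WatchMen-style tool would then only need to GET the endpoint and alert on any non-200 response.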

Just some thoughts...

@aoldershaw aoldershaw removed their assignment Aug 27, 2021