
RFC: health endpoint #46

Open
wants to merge 2 commits into master

Conversation

@alexbakar

Adding a proposal for a basic health endpoint for a Concourse cluster.

Signed-off-by: Alexander Bakardzhiev <[email protected]>
@ari-becker

@alexbakar I think it might help you to elaborate on why this is beneficial beyond what /api/v1/info already provides (which is what's currently used in the Helm chart liveness/readiness probes).
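
For reference, /api/v1/info returns a small JSON document along these lines (the values here are illustrative, not from a real deployment):

{
  "version": "7.4.0",
  "worker_version": "2.3",
  "external_url": "https://ci.example.com"
}

The chart's liveness/readiness probes effectively just check that this endpoint responds successfully.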

@cirocosta (Member) commented on this part of the RFC:

    Currently there is no easy way for an external (monitoring) system to
    determine the state of every part of it. So it would be nice if the parts
    of this system have a common way to return their health/availability status.

Adding some more context to this for those outside the team:

the way that we (the Concourse team itself) have been dealing with the "is this installation good or not?" question has mostly been through the metrics that we expose, and through SLIs: https://github.com/concourse/oxygen-mask (which relies on another Concourse installation sending probes to the first) and https://github.com/cirocosta/slirunner (which doesn't require another installation, but is k8s- and Prometheus-first). Both take the approach of verifying healthiness by sending specific workloads to the installation, attesting in an end-to-end fashion whether it's properly working or not.

what I like about this kind of approach is that you can keep it external to the main system, and thus build the integrations to it however you want (e.g., slirunner exposes the information through Prometheus, but if you'd prefer to have an endpoint hit when things fail, you could build that too, all while still leveraging "the core Concourse" under the hood), as well as highlighting which kinds of user workflows are currently broken / degraded 🤔

@cirocosta (Member) left a comment:

thanks for submitting! left some questions / thoughts 😁


@cirocosta (Member) commented on this part of the RFC:

    # Proposal

    Concourse can expose a public HTTP endpoint called "health" endpoint that gives

Would you mind elaborating a bit more on how you see the detection of "healthiness" working?

Also, do you think this endpoint should be public (in the sense of requiring no credentials at all) and have per-node information being displayed? I think a problem there is that it'd conflict with the current auth requirements that we have for someone reading /api/v1/workers.

@alexbakar (Author)

Thank you guys for the feedback. It's very useful.
@ari-becker AFAIK /api/v1/info provides information about the Concourse version, the workers' version, and the external URL. My idea was to have a way to also get the overall status of the nodes. And since the endpoint is meant to be public (no authentication required), this information could only be very basic: db/web/workers and their status.
@cirocosta Thanks, the explanation really makes sense. I agree that the proposition I made is probably already covered by /api/v1/workers (as there is plenty of information about the workers, including their state) and conflicts with the auth requirements for it.
I also like the idea of verifying the healthiness by sending specific workloads to it that would attest in an end-to-end fashion (slirunner).
I have to think about these topics and reconsider my proposal. I will send my comments then.

@aoldershaw

@alexbakar we (the Concourse team) are focusing on paying more attention to RFCs and shepherding them to some form of resolution. I was wondering if you've had the chance to put any more thought into this, and whether you're still interested in this topic?

@mvdkleijn

Hi there,

I'd like to drop my thoughts here on this topic... 😄 I see two kinds of monitoring for two kinds of installations.

  1. A full-on, 8000 user, 800 pipeline, pedal to the metal, HA, 99.9999% uptime critical Concourse installation. (exaggeration on purpose 😉 )
  2. A simple, small, quick-n-dirty, 3 user, 5 pipeline, 98.0% uptime Concourse install.

For the first option, of course you'd want a full-on monitoring system like Prometheus, with alerts etc., that executes specific workloads to test the various components and retrieves metrics.

However, for the second (and other) case(s), such a monitoring system would be overkill. Maybe you'd want to use a simple "ping" or "http content" check alerting tool (think something like https://github.com/iloire/WatchMen).

For such an alerting tool, an equally simple health check would be greatly appreciated.

A clear endpoint for health/status checks: /api/v1/health or /api/v1/status
A clear HTTP response code: 200 for all OKs, 503 in case one item is not OK, 500 if none are OK.
A clear and simple JSON response like:

{
  "atc": "ok",
  "db": "ok",
  "workers": "ok",
  "timestamp": 1619710516
}

If something were amiss with the workers for example, an admin could then zoom in by checking /api/v1/workers.
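
To make the semantics concrete, here's a rough sketch of such a handler (Go; the check functions and names are hypothetical, not existing Concourse code):

package health

import (
	"encoding/json"
	"net/http"
	"time"
)

// healthHandler builds a handler from a set of named component checks
// (e.g. "atc", "db", "workers" - each a hypothetical probe function).
func healthHandler(checks map[string]func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Collect per-component status plus a timestamp, mirroring
		// the JSON shape proposed above.
		body := map[string]interface{}{"timestamp": time.Now().Unix()}
		ok := 0
		for name, check := range checks {
			if check() {
				body[name] = "ok"
				ok++
			} else {
				body[name] = "failing"
			}
		}
		// Map the overall state to the proposed status codes:
		// 200 if all are ok, 503 if some are not, 500 if none are.
		status := http.StatusOK
		switch {
		case ok == 0:
			status = http.StatusInternalServerError
		case ok < len(checks):
			status = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		json.NewEncoder(w).Encode(body)
	}
}

A WatchMen-style tool would then only need to GET the endpoint and alert on any non-200 response.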

Just some thoughts...

@aoldershaw aoldershaw removed their assignment Aug 27, 2021