Optionally export metrics for external monitoring #530

isuldor · 2019-09-18T00:07:32Z

WIP

use config.go for default values
refactor server.go function to stakepoold.go method
shutdown context
readme one-liner
prometheus quickstart guide
documentation with full samples of systemd units and useful prometheus alerts

I've added optional Prometheus instrumentation (--prometheus) for monitoring and outage alerts (#525). The intention is to add reliability to a vsp by enabling the operator to quickly detect outages. An external monitoring system can be set up to monitor stakepoold instances, in addition to the existing api health check endpoint provided by dcrstakepool. The advantage to detecting stakepoold outages is that they generally happen before dcrstakepool falls over. This will likely become increasingly critical once the planned changes for fee addresses is implemented. The Prometheus client libraries also include parsing and bridges for other monitoring systems, such as InfluxDB, Sensu, and more. The Prometheus exposition format is also being used as the basis for OpenMetrics.

I think my error handling within the goroutine is suboptimal. I'd be happy to receive feedback on that or anything else. The configurable port 9101 was to just follow the port that Prometheus node_exporter uses (it's not set in stone by any means though).

When enabled, you should be able to connect to http://localhost:9101 and see your metrics:

# HELP dcr_block_height Returns the latest block height
# TYPE dcr_block_height gauge
dcr_block_height 268862
# HELP dcr_peer_connections Returns a count of connections from peers
# TYPE dcr_peer_connections gauge
dcr_peer_connections 23
# HELP dcr_tickets_live Returns a count of live tickets
# TYPE dcr_tickets_live gauge
dcr_tickets_live 42
# HELP dcr_tickets_missed Returns a count of missed tickets
# TYPE dcr_tickets_missed gauge
dcr_tickets_missed 4

I'll be writing a guide on how to set up some useful Prometheus alerts along with a Grafana dashboard.

One useful metric gathered here is the block height metric. It's pulled from dcrwallet, so it ensures both dcrd and dcrwallet are working with the tip of the chain. My testnet nodes are fairly fragile and this metric catches any of their failures including transient ones where processes are running but are actually hung. On top of that the prometheus alertmanager can basically confirm that an error is happening a number of times before sending the actual alert, which is pretty useful for weeding out any short-term false positives. My promql expression to create alert looks like this:

- alert: TestnetBlockHeightStalled
        expr: dcr_block_height{group="testnet-voters"} < on() group_left max(dcr_block_height{group="testnet-voters"})
        for: 15m
        annotations:
          summary: Node is behind on best block height

And the other particularly useful metric is the live ticket count. You can monitor whether all nodes agree. When they don't, dcrstakepool becomes unresponsive. Live tickets are also the primary responsibility of the voting nodes, so it's pretty relevant whether they all agree about the current state of tickets. I'll write some documentation about how to monitor these metrics, and how to address issues when those alarms sound off.

Note: cve-2019-16276 is relevant here as we're using net/http. Patched Go versions are Go 1.13.1 and 1.12.10.

dajohi · 2019-09-18T18:10:23Z

env GO111MODULE=on go mod tidy

commit go.mod and go.sum changes.

JoeGruffins

The error handling looks ok to me.

It works!

Also, I think you should make a section in the main readme, otherwise noone will know about it. The readme needs to be cleaned up anyway (#508 ), so don't worry too much about where.

backend/stakepoold/server.go

backend/stakepoold/config.go

backend/stakepoold/server.go

jholdstock · 2019-09-23T11:42:24Z

Hey @isuldor, I added a few in-line comments, and also I agree with Joe that we should mention this feature in README. Maybe just a one-liner in README.md which links to a new document under docs/. This could include usage instructions and some sample Prometheus configs, like the one you posted above.

Also wanted to make you aware of https://github.com/lightninglabs/lndmon which already implements a Prometheus exporter for lnd. It should be possible to port this to dcrlnd really easily

JoeGruffins

Left some inline comments. Looking good though!

backend/stakepoold/config.go

backend/stakepoold/stakepool/stakepool.go

jholdstock · 2019-10-15T11:11:24Z

Code is looking good at this point. Just awaiting the doco updates

JoeGruffins

Where did your changes to sample-stakepoold.conf go? we need those. Be sure to leave a small comment about how to input the times, same as the config.

backend/stakepoold/server.go

backend/stakepoold/stakepool/stakepool.go

isuldor · 2019-10-25T01:41:27Z

How do I squash my commits? I've mostly just been cheating and rolling back HEAD, but I should probably learn to do it proper.

JoeGruffins · 2019-10-25T03:12:12Z

The problem here is that you merged master onto your branch instead of rebasing your branch with master.

I only ever use the command line, but I think what you need to do is something like,
git checkout master
git fetch upstream/master (note: you may not have upstream set as so)
git merge upstream/master

git checkout prometheus_instrumentation
git rebase -i master

And here delete all the comments that aren't yours
Save and quit

You will probably have merge conflicts, fix conflicts and add the files
git add foo.go
git rebase --continue

until
git status
reborts a clean status

then
git push --force

at any time during rebase, you can revert to before the rebase using
git rebase --abort

JoeGruffins · 2019-10-26T02:32:59Z

Ohhh. This looks awesome. Give me a little time to test it out.

jholdstock · 2019-10-26T11:54:53Z

Hey @isuldor,

The /metrics endpoint seems to work well for me, but I am having some issues when I try to shutdown stakepoold. Using CTRL+C I get

2019-10-26 12:28:44.435 [INF] STPK: Received signal (interrupt).  Shutting down...

and then the process just hangs indefinitely. I believe it is because the call to http.ListenAndServe is blocking and does not return. I'm chatting to @JoeGruffins about how best to resolve this, as he has been touching the shutdown signals recently.

JoeGruffins · 2019-10-26T12:06:43Z

We are waiting for a shutdown and then shutting down a server here

dcrstakepool/server.go

Line 296 in 39945e3

// Cleanly shutdown server on interrupt signal.

I think this before listening, and also listening in a go routine may be best

isuldor · 2019-10-29T00:44:01Z

It now handles shutdown signals. I have to push back on running the http server in a thread as I see no precedent for doing this (at least dcrstakepool doesn't appear to do that). And the Golang manual explains that the http server spins off each request into their own thread already, so it seems like overkill.

JoeGruffins

Following the guides, I was able to set up a monitoring service on a different machine that emailed me when my geographically separated stakepoold hit an alert condition.

Proof

One this that I'm not sure of is clicking on the "View in alert manager" doesn't seem to do that. Maybe a mistake on my part.

I think this is fantastic and needed work. The documentation is truly professional grade, imo. I left a few inline comments, but nothing big.

However there are some unrelated services in Samples that I think deserve seperate prs. dcrd and stakepool services, and the accompanying bash script. If it's not too much trouble, could you make a new pr for those?

Again, I think you put a lot of thought and effort into this, and it looks great to me.

docs/samples/wait-for-dcrd.sh

backend/stakepoold/stakepool/stakepool.go

docs/prometheus-quickstart.md

docs/prometheus-examples.md

isuldor · 2019-10-31T00:29:39Z

I've changed up the error handling a bit so that it will continue on errors. ie. It should wait for the next interval period to try again. The two rpc methods I'm calling already print errors, so I opted to use log.Debugf so it won't print the same error twice when debugging is off.

"View in alert manager" doesn't seem to do that

Yeah I don't recall off the top of my head which config variable it uses for the alert source. It just defaults to the prometheus server's hostname for me.

JoeGruffins

Looks good to me! Thank you for answering my concerns.

One thing going forward, it would be better if you did not squash your commits after reviewing has started, as it is rather difficult to see what has changed. I have your past commits in my reflog, so I can see what has changed, but other's will not have that locally, and you can't see the diff on github (afaik). You can squash in the end after approval, or just let @dajohi do it for you. He enjoys it... idk... maybe.

isuldor · 2019-12-09T21:53:24Z

For what its worth, I've been running this in production for a few weeks now with a 30 second interval and have had zero missed votes in that time.

JoeGruffins · 2019-12-10T03:53:55Z

@dajohi

JoeGruffins requested changes Sep 23, 2019

View reviewed changes

backend/stakepoold/server.go Outdated Show resolved Hide resolved

backend/stakepoold/server.go Outdated Show resolved Hide resolved

backend/stakepoold/config.go Outdated Show resolved Hide resolved

jholdstock requested changes Sep 23, 2019

View reviewed changes

backend/stakepoold/config.go Outdated Show resolved Hide resolved

backend/stakepoold/server.go Outdated Show resolved Hide resolved

backend/stakepoold/server.go Outdated Show resolved Hide resolved

isuldor closed this Oct 13, 2019

isuldor reopened this Oct 13, 2019

isuldor requested review from JoeGruffins and jholdstock October 14, 2019 04:17

JoeGruffins reviewed Oct 14, 2019

View reviewed changes

backend/stakepoold/config.go Outdated Show resolved Hide resolved

backend/stakepoold/stakepool/stakepool.go Outdated Show resolved Hide resolved

backend/stakepoold/stakepool/stakepool.go Outdated Show resolved Hide resolved

isuldor requested a review from JoeGruffins October 14, 2019 16:29

JoeGruffins reviewed Oct 15, 2019

View reviewed changes

backend/stakepoold/server.go Show resolved Hide resolved

backend/stakepoold/stakepool/stakepool.go Show resolved Hide resolved

isuldor requested a review from JoeGruffins October 25, 2019 01:37

JoeGruffins requested changes Oct 30, 2019

View reviewed changes

isuldor requested a review from JoeGruffins October 31, 2019 00:50

JoeGruffins approved these changes Oct 31, 2019

View reviewed changes

jholdstock approved these changes Nov 24, 2019

View reviewed changes

Brian Baligad added 2 commits January 31, 2020 07:55

Optionally export Prometheus metrics

d986c32

rebase to master

a331114

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optionally export metrics for external monitoring #530

Optionally export metrics for external monitoring #530

isuldor commented Sep 18, 2019 •

edited

Loading

dajohi commented Sep 18, 2019

JoeGruffins left a comment •

edited

Loading

jholdstock commented Sep 23, 2019

JoeGruffins left a comment

jholdstock commented Oct 15, 2019

JoeGruffins left a comment

isuldor commented Oct 25, 2019 •

edited

Loading

JoeGruffins commented Oct 25, 2019 •

edited

Loading

JoeGruffins commented Oct 26, 2019

jholdstock commented Oct 26, 2019

JoeGruffins commented Oct 26, 2019

isuldor commented Oct 29, 2019

JoeGruffins left a comment

isuldor commented Oct 31, 2019 •

edited

Loading

JoeGruffins left a comment •

edited

Loading

isuldor commented Dec 9, 2019

JoeGruffins commented Dec 10, 2019

Optionally export metrics for external monitoring #530

Are you sure you want to change the base?

Optionally export metrics for external monitoring #530

Conversation

isuldor commented Sep 18, 2019 • edited Loading

dajohi commented Sep 18, 2019

JoeGruffins left a comment • edited Loading

Choose a reason for hiding this comment

jholdstock commented Sep 23, 2019

JoeGruffins left a comment

Choose a reason for hiding this comment

jholdstock commented Oct 15, 2019

JoeGruffins left a comment

Choose a reason for hiding this comment

isuldor commented Oct 25, 2019 • edited Loading

JoeGruffins commented Oct 25, 2019 • edited Loading

JoeGruffins commented Oct 26, 2019

jholdstock commented Oct 26, 2019

JoeGruffins commented Oct 26, 2019

isuldor commented Oct 29, 2019

JoeGruffins left a comment

Choose a reason for hiding this comment

isuldor commented Oct 31, 2019 • edited Loading

JoeGruffins left a comment • edited Loading

Choose a reason for hiding this comment

isuldor commented Dec 9, 2019

JoeGruffins commented Dec 10, 2019

isuldor commented Sep 18, 2019 •

edited

Loading

JoeGruffins left a comment •

edited

Loading

isuldor commented Oct 25, 2019 •

edited

Loading

JoeGruffins commented Oct 25, 2019 •

edited

Loading

isuldor commented Oct 31, 2019 •

edited

Loading

JoeGruffins left a comment •

edited

Loading