Optionally export metrics for external monitoring #530
env GO111MODULE=on go mod tidy
commit go.mod and go.sum changes.
The error handling looks ok to me.
Also, I think you should make a section in the main readme, otherwise no one will know about it. The readme needs to be cleaned up anyway (#508), so don't worry too much about where.
Hey @isuldor, I added a few in-line comments, and also I agree with Joe that we should mention this feature in README. Maybe just a one-liner in README.md which links to a new document under
Also wanted to make you aware of https://github.com/lightninglabs/lndmon which already implements a Prometheus exporter for lnd. It should be possible to port this to dcrlnd really easily.
Left some inline comments. Looking good though!
Code is looking good at this point. Just awaiting the doco updates.
Where did your changes to sample-stakepoold.conf go? We need those. Be sure to leave a small comment about how to input the times, same as the config.
How do I squash my commits? I've mostly just been cheating and rolling back HEAD, but I should probably learn to do it proper.
The problem here is that you merged master onto your branch instead of rebasing your branch with master. I only ever use the command line, but I think what you need to do is something like,
git checkout prometheus_instrumentation
And here delete all the comments that aren't yours
You will probably have merge conflicts, fix conflicts and add the files until then
at any time during rebase, you can revert to before the rebase using
Ohhh. This looks awesome. Give me a little time to test it out.
Hey @isuldor, The
and then the process just hangs indefinitely. I believe it is because the call to
We are waiting for a shutdown and then shutting down a server here Line 296 in 39945e3
I think this before listening, and also listening in a goroutine may be best
It now handles shutdown signals. I have to push back on running the http server in a goroutine as I see no precedent for doing this (at least dcrstakepool doesn't appear to do that). And the Go net/http documentation explains that the http server already spins off each incoming connection into its own goroutine, so it seems like overkill.
Following the guides, I was able to set up a monitoring service on a different machine that emailed me when my geographically separated stakepoold hit an alert condition.
One thing that I'm not sure of is that clicking on the "View in alert manager" link doesn't seem to do that. Maybe a mistake on my part.
I think this is fantastic and needed work. The documentation is truly professional grade, imo. I left a few inline comments, but nothing big.
However, there are some unrelated services in Samples that I think deserve separate PRs: the dcrd and stakepool services, and the accompanying bash script. If it's not too much trouble, could you make a new PR for those?
Again, I think you put a lot of thought and effort into this, and it looks great to me.
I've changed up the error handling a bit so that it will
Yeah I don't recall off the top of my head which config variable it uses for the alert source. It just defaults to the prometheus server's hostname for me.
Looks good to me! Thank you for answering my concerns.
One thing going forward, it would be better if you did not squash your commits after reviewing has started, as it is rather difficult to see what has changed. I have your past commits in my reflog, so I can see what has changed, but others will not have that locally, and you can't see the diff on github (afaik). You can squash in the end after approval, or just let @dajohi do it for you. He enjoys it... idk... maybe.
For what it's worth, I've been running this in production for a few weeks now with a 30 second interval and have had zero missed votes in that time.
WIP
server.go function to stakepoold.go method

I've added optional Prometheus instrumentation (--prometheus) for monitoring and outage alerts (#525). The intention is to add reliability to a VSP by enabling the operator to quickly detect outages. An external monitoring system can be set up to monitor stakepoold instances, in addition to the existing API health check endpoint provided by dcrstakepool. The advantage of detecting stakepoold outages is that they generally happen before dcrstakepool falls over. This will likely become increasingly critical once the planned changes for fee addresses are implemented. The Prometheus client libraries also include parsing and bridges for other monitoring systems, such as InfluxDB, Sensu, and more. The Prometheus exposition format is also being used as the basis for OpenMetrics.

I think my error handling within the goroutine is suboptimal. I'd be happy to receive feedback on that or anything else. The configurable port 9101 was to just follow the port that Prometheus node_exporter uses (it's not set in stone by any means though).
When enabled, you should be able to connect to http://localhost:9101 and see your metrics.

I'll be writing a guide on how to set up some useful Prometheus alerts along with a Grafana dashboard.
One useful metric gathered here is the block height metric. It's pulled from dcrwallet, so it ensures both dcrd and dcrwallet are working with the tip of the chain. My testnet nodes are fairly fragile and this metric catches any of their failures, including transient ones where processes are running but are actually hung. On top of that, the Prometheus Alertmanager can basically confirm that an error is happening a number of times before sending the actual alert, which is pretty useful for weeding out any short-term false positives. My PromQL expression to create the alert looks like this:
And the other particularly useful metric is the live ticket count. You can monitor whether all nodes agree. When they don't, dcrstakepool becomes unresponsive. Live tickets are also the primary responsibility of the voting nodes, so it's pretty relevant whether they all agree about the current state of tickets. I'll write some documentation about how to monitor these metrics, and how to address issues when those alarms sound off.
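The two ideas above, checking that all nodes agree on the live ticket count and only alerting once a problem has persisted for several evaluations, can be sketched in plain Go as follows. This is only an illustration of the logic, not how Prometheus or Alertmanager implement it and not part of this PR; ticketsAgree, debouncer, the node names, and the threshold of 3 are hypothetical.

```go
package main

import "fmt"

// ticketsAgree reports whether every node reports the same live ticket
// count. An empty or single-node sample trivially agrees.
func ticketsAgree(counts map[string]int64) bool {
	first := true
	var want int64
	for _, c := range counts {
		if first {
			want, first = c, false
			continue
		}
		if c != want {
			return false
		}
	}
	return true
}

// debouncer fires only after a check has failed on `threshold`
// consecutive evaluations, which filters out short-lived blips.
type debouncer struct {
	threshold int
	failures  int
}

// observe records one evaluation and reports whether an alert should fire.
func (d *debouncer) observe(healthy bool) bool {
	if healthy {
		d.failures = 0
		return false
	}
	d.failures++
	return d.failures >= d.threshold
}

func main() {
	d := &debouncer{threshold: 3}
	samples := []map[string]int64{
		{"node1": 1200, "node2": 1200}, // agree
		{"node1": 1201, "node2": 1200}, // transient disagreement
		{"node1": 1201, "node2": 1201}, // agree again, counter resets
		{"node1": 1202, "node2": 1201}, // disagreement starts
		{"node1": 1202, "node2": 1201},
		{"node1": 1202, "node2": 1201}, // third consecutive failure
	}
	for i, s := range samples {
		fmt.Printf("eval %d: fire=%v\n", i, d.observe(ticketsAgree(s)))
	}
}
```

With these samples only the last evaluation fires, since the earlier single disagreement is cleared by the following healthy sample. The same debouncing would apply to a block height check that compares the reported height against the previous scrape.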
Note: CVE-2019-16276 is relevant here as we're using net/http. Patched Go versions are Go 1.13.1 and 1.12.10.