Optionally export metrics for external monitoring #530

Open
wants to merge 2 commits into
base: master
11 changes: 7 additions & 4 deletions README.md
@@ -298,21 +298,24 @@ running with `txindex=1`):
dcrctl --wallet stakepooluserinfo "MultiSigAddress" | grep -Pzo '(?<="invalid": \[)[^\]]*' | tr -d , | xargs -Itickethash dcrctl --wallet getrawtransaction tickethash | xargs -Itickethex dcrctl --wallet addticket "tickethex"
```

## Backups, monitoring, security considerations
## Backups, security considerations

- MySQL should be backed up often and regularly (probably at least hourly).
Backups should be transferred off-site. If using binary backups, do a test
restore. For .sql files, verify visually.

- Wallets should never be used for anything else (they should always have a
balance of 0).

## Monitoring

- A monitoring system with alerting should be pointed at dcrstakepool and
tested/verified to be operating properly. There is a hidden /status page
which returns HTTP 500 if the RPC client is shut down. If your monitoring system
supports it, add additional points of verification, such as checking that the
/stats page loads and contains the expected information, or creating a test
account and setting up automated login testing.

- Wallets should never be used for anything else (they should always have a
balance of 0).
- [docs/prometheus-quickstart.md](docs/prometheus-quickstart.md) and [docs/prometheus-examples.md](docs/prometheus-examples.md) describe how to roll your own monitoring system that can detect and notify you about problems.

## Disaster Recovery

64 changes: 37 additions & 27 deletions backend/stakepoold/config.go
@@ -13,6 +13,7 @@ import (
"runtime"
"sort"
"strings"
"time"

"github.com/decred/dcrd/dcrutil/v2"
"github.com/decred/dcrstakepool/internal/version"
@@ -39,6 +40,9 @@ var (
defaultDBName = "stakepool"
defaultDBPort = "3306"
defaultDBUser = "stakepool"

defaultPrometheusWait = time.Minute
defaultPrometheusListen = ":9101"
)

// runServiceCommand is only set to a real function on Windows. It is used
@@ -49,33 +53,36 @@ var runServiceCommand func(string) error
//
// See loadConfig for details on the configuration load process.
type config struct {
HomeDir string `short:"A" long:"appdata" description:"Path to application home directory"`
ShowVersion bool `short:"V" long:"version" description:"Display version information and exit"`
ConfigFile string `short:"C" long:"configfile" description:"Path to configuration file"`
DataDir string `short:"b" long:"datadir" description:"Directory to store data"`
LogDir string `long:"logdir" description:"Directory to log output."`
TestNet bool `long:"testnet" description:"Use the test network"`
SimNet bool `long:"simnet" description:"Use the simulation test network"`
DebugLevel string `short:"d" long:"debuglevel" description:"Logging level for all subsystems {trace, debug, info, warn, error, critical} -- You may also specify <subsystem>=<level>,<subsystem2>=<level>,... to set the log level for individual subsystems -- Use show to list available subsystems"`
ColdWalletExtPub string `long:"coldwalletextpub" description:"The extended public key for addresses to which voting service user fees are sent."`
PoolFees float64 `long:"poolfees" description:"The per-ticket fees the user must send to the voting service with their tickets"`
DBHost string `long:"dbhost" description:"Hostname for database connection"`
DBUser string `long:"dbuser" description:"Username for database connection"`
DBPassword string `long:"dbpassword" description:"Password for database connection"`
DBPort string `long:"dbport" description:"Port for database connection"`
DBName string `long:"dbname" description:"Name of database"`
DcrdHost string `long:"dcrdhost" description:"Hostname/IP for dcrd server"`
DcrdUser string `long:"dcrduser" description:"Username for dcrd server"`
DcrdPassword string `long:"dcrdpassword" description:"Password for dcrd server"`
DcrdCert string `long:"dcrdcert" description:"Certificate path for dcrd server"`
WalletHost string `long:"wallethost" description:"Hostname for wallet server"`
WalletUser string `long:"walletuser" description:"Username for wallet server"`
WalletPassword string `long:"walletpassword" description:"Password for wallet server"`
WalletCert string `long:"walletcert" description:"Certificate path for wallet server"`
NoRPCListen bool `long:"norpclisten" description:"Do not start a gRPC server. User voting preferences update on a ticker"`
RPCListeners []string `long:"rpclisten" description:"Add an interface/port to listen for RPC connections (default port: 9113, testnet: 19113)"`
RPCCert string `long:"rpccert" description:"File containing the certificate file"`
RPCKey string `long:"rpckey" description:"File containing the certificate key"`
HomeDir string `short:"A" long:"appdata" description:"Path to application home directory"`
ShowVersion bool `short:"V" long:"version" description:"Display version information and exit"`
ConfigFile string `short:"C" long:"configfile" description:"Path to configuration file"`
DataDir string `short:"b" long:"datadir" description:"Directory to store data"`
LogDir string `long:"logdir" description:"Directory to log output."`
TestNet bool `long:"testnet" description:"Use the test network"`
SimNet bool `long:"simnet" description:"Use the simulation test network"`
DebugLevel string `short:"d" long:"debuglevel" description:"Logging level for all subsystems {trace, debug, info, warn, error, critical} -- You may also specify <subsystem>=<level>,<subsystem2>=<level>,... to set the log level for individual subsystems -- Use show to list available subsystems"`
ColdWalletExtPub string `long:"coldwalletextpub" description:"The extended public key for addresses to which voting service user fees are sent."`
PoolFees float64 `long:"poolfees" description:"The per-ticket fees the user must send to the voting service with their tickets"`
DBHost string `long:"dbhost" description:"Hostname for database connection"`
DBUser string `long:"dbuser" description:"Username for database connection"`
DBPassword string `long:"dbpassword" description:"Password for database connection"`
DBPort string `long:"dbport" description:"Port for database connection"`
DBName string `long:"dbname" description:"Name of database"`
DcrdHost string `long:"dcrdhost" description:"Hostname/IP for dcrd server"`
DcrdUser string `long:"dcrduser" description:"Username for dcrd server"`
DcrdPassword string `long:"dcrdpassword" description:"Password for dcrd server"`
DcrdCert string `long:"dcrdcert" description:"Certificate path for dcrd server"`
WalletHost string `long:"wallethost" description:"Hostname for wallet server"`
WalletUser string `long:"walletuser" description:"Username for wallet server"`
WalletPassword string `long:"walletpassword" description:"Password for wallet server"`
WalletCert string `long:"walletcert" description:"Certificate path for wallet server"`
NoRPCListen bool `long:"norpclisten" description:"Do not start a gRPC server. User voting preferences update on a ticker"`
RPCListeners []string `long:"rpclisten" description:"Add an interface/port to listen for RPC connections (default port: 9113, testnet: 19113)"`
RPCCert string `long:"rpccert" description:"File containing the certificate file"`
RPCKey string `long:"rpckey" description:"File containing the certificate key"`
Prometheus bool `long:"prometheus" description:"Export metrics for external monitoring. Disabled by default."`
PrometheusWait time.Duration `long:"prometheuswait" description:"How long to wait between metric updates. Valid time units are {s, m, h}."`
PrometheusListen string `long:"prometheuslisten" description:"Address/port to listen for Prometheus scrapes. (default port: 9101)"`
}

// serviceOptions defines the configuration options for the daemon as a service
@@ -261,6 +268,9 @@ func loadConfig() (*config, []string, error) {
PoolFees: defaultPoolFees,
RPCKey: defaultRPCKeyFile,
RPCCert: defaultRPCCertFile,

PrometheusWait: defaultPrometheusWait,
PrometheusListen: defaultPrometheusListen,
}

// Service options which are only added on Windows.
4 changes: 4 additions & 0 deletions backend/stakepoold/server.go
@@ -362,6 +362,10 @@ func runMain(ctx context.Context) error {
}()
}

if cfg.Prometheus {
spd.ExportPrometheusMetrics(ctx, wg, cfg.PrometheusWait, cfg.PrometheusListen)
}

// Wait for CTRL+C to signal goroutines to terminate
wg.Wait()
saveData(spd)
95 changes: 95 additions & 0 deletions backend/stakepoold/stakepool/stakepool.go
@@ -10,6 +10,8 @@ import (
"encoding/hex"
"errors"
"fmt"
"net"
"net/http"
"strings"
"sync"
"time"
@@ -24,6 +26,10 @@ import (
"github.com/decred/dcrd/wire"
"github.com/decred/dcrstakepool/backend/stakepoold/userdata"
"github.com/decred/dcrwallet/wallet/v3/txrules"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
@@ -472,6 +478,95 @@ func (spd *Stakepoold) UpdateUserDataFromMySQL() error {
return nil
}

// ExportPrometheusMetrics enables early detection of outages by periodically
// polling dcrd and dcrwallet and exporting the results to a monitoring system.
// It blocks while the metrics HTTP server is running.
func (spd *Stakepoold) ExportPrometheusMetrics(ctx context.Context, wg *sync.WaitGroup, wait time.Duration, addr string) {
promConnectionCount := promauto.NewGauge(prometheus.GaugeOpts{
Name: "dcr_peer_connections",
Help: "Returns a count of connections from peers",
})
promTicketsLive := promauto.NewGauge(prometheus.GaugeOpts{
Name: "dcr_tickets_live",
Help: "Returns a count of live tickets",
})
promTicketsMissed := promauto.NewGauge(prometheus.GaugeOpts{
Name: "dcr_tickets_missed",
Help: "Returns a count of missed tickets",
})
promBlockHeight := promauto.NewGauge(prometheus.GaugeOpts{
Name: "dcr_block_height",
Help: "Returns the latest block height",
})

// Initialize metrics on startup from dcrd and dcrwallet
dcrdConnectionCount, err := spd.NodeConnection.GetConnectionCount()
if err == nil {
promConnectionCount.Set(float64(dcrdConnectionCount))
}
stakeInfo, err := spd.GetStakeInfo()
if err == nil {
promBlockHeight.Set(float64(stakeInfo.BlockHeight))
promTicketsLive.Set(float64(stakeInfo.Live))
promTicketsMissed.Set(float64(stakeInfo.Missed))
}

// Periodically update metrics in a goroutine
wg.Add(1)
go func() {
defer wg.Done()
for {
select {
case <-ctx.Done():
return
case <-time.After(wait):
dcrdConnectionCount, err := spd.NodeConnection.GetConnectionCount()
if err != nil {
log.Debugf("ExportPrometheusMetrics: unable to retrieve metrics: %v", err)
continue
}
promConnectionCount.Set(float64(dcrdConnectionCount))
stakeInfo, err := spd.GetStakeInfo()
if err != nil {
log.Debugf("ExportPrometheusMetrics: unable to retrieve metrics: %v", err)
continue
}
promBlockHeight.Set(float64(stakeInfo.BlockHeight))
promTicketsLive.Set(float64(stakeInfo.Live))
promTicketsMissed.Set(float64(stakeInfo.Missed))
}
}
}()

// Set up http server to provide metrics to a scraping Prometheus server.
srv := &http.Server{}
http.Handle("/metrics", promhttp.Handler())
listener, err := net.Listen("tcp", addr)
if err != nil {
log.Errorf("Error listening on PrometheusListen address %q: %s", addr, err.Error())
return
}

// Cleanly shutdown http server on interrupt signal.
wg.Add(1)
go func() {
defer wg.Done()
// Wait for shutdown.
<-ctx.Done()

// We received an interrupt signal, shut down.
if err := srv.Shutdown(context.Background()); err != nil {
// Error from closing listeners, or context timeout:
log.Errorf("HTTP server Shutdown: %v", err)
}
}()

// Start the http server.
log.Infof("Exporting metrics on %v", listener.Addr())
if err = srv.Serve(listener); err != http.ErrServerClosed {
log.Errorf("Error exporting metrics: %s", err.Error())
}
}
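
One way to check the exporter by hand is to fetch `/metrics` and read a gauge out of the response. The sketch below parses the Prometheus text exposition format using only the standard library. The helper name `gaugeValue` and the sample values are assumptions for illustration; the metric names are the ones registered in `ExportPrometheusMetrics`.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue (hypothetical helper) scans text exposition output for an
// unlabelled sample called name and returns its value.
func gaugeValue(body, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Sample scrape output; the values are made up for illustration.
	body := `# HELP dcr_block_height Returns the latest block height
# TYPE dcr_block_height gauge
dcr_block_height 412345
# HELP dcr_tickets_live Returns a count of live tickets
# TYPE dcr_tickets_live gauge
dcr_tickets_live 17
`
	if v, ok := gaugeValue(body, "dcr_block_height"); ok {
		fmt.Println("block height:", v)
	}
}
```

Because the gauges here carry no labels, matching on the first whitespace-separated field is enough; labelled metrics would need a fuller parser.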

// vote generates a vote and sends it off to the network. This is a goroutine!
func (spd *Stakepoold) vote(wg *sync.WaitGroup, blockHash *chainhash.Hash, blockHeight int64, w *ticketMetadata) {
start := time.Now()
Binary file added docs/img/prometheus-email-alert.png
Binary file added docs/img/prometheus-first-run.png
44 changes: 44 additions & 0 deletions docs/prometheus-examples.md
@@ -0,0 +1,44 @@
## Prometheus examples
A basic Prometheus setup is described in [prometheus-quickstart.md](prometheus-quickstart.md). This document provides examples of how a permanent Prometheus setup can be deployed. It assumes some familiarity with Linux and systemd. These are examples and need not be followed verbatim. Samples are provided as-is with minimal descriptions; refer to the [official documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) to fully understand them. The following samples are in the `docs/samples` directory:
- alerting.rules.yml
- alertmanager.service
- alertmanager.yml
- blackbox_exporter.service
- blackbox.yml
- prometheus.service
- prometheus.yml

### Voting node health
The `alerting.rules.yml` example can proactively detect potentially transient failures:
- The node appears online, but `dcrd` has stalled (BlockHeightStalled)
- Block height is correct but `dcrwallet` doesn't agree with the other voting nodes about the state of the ticket pool (LiveTicketParity)
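
The two conditions in the list above can be illustrated in Go. This is only a sketch of the idea, not the contents of `alerting.rules.yml`: the function names, the sample window, and the node names are all hypothetical.

```go
package main

import "fmt"

// heightStalled reports whether the block height failed to advance between
// the oldest and newest sample in the window.
func heightStalled(window []int64) bool {
	if len(window) < 2 {
		return false // not enough data to judge
	}
	return window[len(window)-1] <= window[0]
}

// liveTicketsAgree reports whether every node reports the same live ticket
// count.
func liveTicketsAgree(liveByNode map[string]int64) bool {
	first := true
	var want int64
	for _, n := range liveByNode {
		if first {
			want, first = n, false
			continue
		}
		if n != want {
			return false
		}
	}
	return true
}

func main() {
	// Height unchanged across the window: dcrd looks stalled.
	fmt.Println(heightStalled([]int64{412000, 412000, 412000}))
	// Nodes disagree about the number of live tickets.
	fmt.Println(liveTicketsAgree(map[string]int64{"node-a": 17, "node-b": 16}))
}
```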

On my testnet VSP, which uses extremely cheap virtual servers, the first alert frequently catches problems early. The second scenario happens often but resolves itself within a few minutes, so that alert is configured to wait much longer before firing. Consider tuning how long an alert waits before it fires (e.g. start with `for: 5m` and increase until false positives stop). When the duration is tuned appropriately, these alerts can inform you of a problem before `dcrstakepool` itself fails. At a minimum, when you find your `dcrstakepool` instance offline, you can pull up the Prometheus `/alerts` page and drill down to see which specific nodes are unhealthy.

### Systemd service units
These [service units](https://www.freedesktop.org/wiki/Software/systemd/) assume you've set up Prometheus under various `/opt` subdirectories, with binaries stored in `/opt/bin`. They run the daemon under an unprivileged user account. For convenience, `ExecReload` is configured so that you can reload the configuration on the fly, e.g. `systemctl reload prometheus`. Assuming you have already gone through the quickstart, here is how you could set up the `prometheus` daemon:
```bash
sudo mkdir -p /opt/bin
sudo mkdir -p /opt/prometheus/data
sudo cp -r prometheus-2.*.linux-amd64/{consoles,console_libraries,*.yml} /opt/prometheus/
sudo cp prometheus-2.*.linux-amd64/{prometheus,promtool} /opt/bin/
sudo useradd -Us /usr/bin/nologin -Md /opt/prometheus -c "Prometheus daemon" prometheus
sudo chown prometheus:prometheus /opt/prometheus/data
sudo cp prometheus.service /etc/systemd/system
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```

### Mobile push notifications
The provided `alertmanager.yml` example sends alerts to the API of a service called [pushover.net](https://pushover.net/). For a one-time fee, you can use their API to send push notifications directly to your Android or iOS mobile device. Alertmanager supports numerous notification methods, including a generic webhook API, PagerDuty (a significantly more expensive service), OpsGenie (free for up to 5 users), and more.

### Blackbox Exporter
[Blackbox exporter](https://github.com/prometheus/blackbox_exporter/) is a Prometheus daemon you can deploy to monitor HTTP endpoints such as `dcrstakepool` itself. See the job named "health-checks" in `prometheus.yml`. The service unit assumes the configuration lives in `/opt/prometheus`, a directory owned by the `prometheus` user as in the example above.

### Observability with Grafana
Once you've set up alerts, you might find it valuable to visualize your metrics. [Grafana](https://grafana.com/docs/) supports Prometheus as a data source out of the box, and can be installed by adding a third-party repository.

### Node Exporter
To monitor basic system statistics like CPU usage, RAM, disk I/O, etc., there is [Node Exporter](https://github.com/prometheus/node_exporter). It is another daemon you can run on your nodes to export system-level metrics. There is an [example dashboard](https://grafana.com/grafana/dashboards/1860) that visualizes the collected metrics.
