Node monitoring essentials

Sane defaults for Bitcoin and Lightning monitoring: metrics, logs, webhooks, SLOs, alerts, and incident runbooks.

Set service objectives (SLOs)

  • Availability: Node RPC reachable > 99.9% monthly.
  • Freshness: Chain tip lag < 2 blocks for 99% of minutes.
  • Webhook latency: p95 delivery < 15s; success rate > 99.5%.
  • Lightning payments: latency p50 < 2s, p95 < 8s; success rate > 98% in steady state.
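
Each percentage above implies an error budget: the amount of time the objective may be violated before the SLO is missed. A minimal sketch of that arithmetic, assuming a 30-day month (the function name and the window length are illustrative, not part of any tool):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes the objective may be violated over a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - slo)

# Assumption: a 30-day month, i.e. 43,200 minutes in the window.
availability_budget = error_budget_minutes(0.999)  # RPC availability SLO
freshness_budget = error_budget_minutes(0.99)      # chain tip freshness SLO

print(f"availability budget: {availability_budget:.1f} min")
print(f"freshness budget: {freshness_budget:.1f} min")
```

Under these assumptions the availability target leaves roughly 43 minutes of unreachable RPC per month, and the freshness target leaves roughly 432 minutes with lag of 2 blocks or more.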

What to monitor

  • Chain sync: headers/blocks behind, mempool health, peer count.
  • Resources: disk free > 15%, disk I/O, CPU/RAM, file descriptors.
  • Services: bitcoind/lnd/cln processes up, RPC/REST responsiveness.
  • Security: auth failures, config drift, unexpected open ports.
  • Lightning: channel states, inbound/outbound liquidity, failed HTLCs, gossip sync.
  • Webhooks: delivery success, retries, signature verification, end‑to‑end latency.
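
As a rough illustration of the chain sync and disk checks, the sketch below derives lag from the `blocks` and `headers` fields that bitcoind's `getblockchaininfo` RPC returns, and free disk space from the Python standard library. How the RPC result is fetched is left out; the sample dict and the function names are assumptions made for the example.

```python
import shutil

def tip_lag(chain_info: dict) -> int:
    """Blocks the node is behind the best header it has seen."""
    return max(0, chain_info["headers"] - chain_info["blocks"])

def disk_free_percent(path: str = ".") -> float:
    """Free space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

# Hypothetical excerpt of a getblockchaininfo result.
sample = {"blocks": 840000, "headers": 840003}
lag = tip_lag(sample)
free = disk_free_percent(".")
print(f"lag={lag} blocks, disk free={free:.1f}%")
```

Note that this only measures lag against headers the node already knows. A node with no healthy peers can report zero lag while being far behind, so it is worth pairing this with the peer count or an external height reference.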

Dashboards

  • Overview: tip height, lag, peers, mempool size, CPU/RAM/disk.
  • Lightning: channels by state, liquidity heatmap, payment success, HTLC failures.
  • Webhooks: sent, succeeded, failed, p50/p95 latency (see Bitcoin Flux).
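
If the metrics backend does not compute percentiles for you, p50 and p95 can be derived from raw latency samples. The sketch below uses the nearest-rank method, which is one of several common definitions; the sample latencies are made up for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical webhook delivery latencies in seconds.
latencies_s = [0.8, 1.1, 1.3, 2.0, 2.4, 3.1, 4.7, 6.0, 9.5, 21.0]
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
meets_slo = p95 < 15  # webhook latency objective: p95 delivery < 15s
print(p50, p95, meets_slo)
```

With so few samples the p95 is simply the slowest delivery, which is why percentile panels are usually computed over a window with many events.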

Logs & events

  • Ship bitcoind and Lightning node (lnd/cln) logs to centralized storage with retention > 30 days.
  • Tag events with env, service, and host for search.
  • Keep webhook delivery logs with ids and HMAC verification status.
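
A minimal sketch of recording HMAC verification status alongside a delivery id. It assumes the sender signs the raw request body with HMAC-SHA256 and sends the hex digest; the actual scheme, header name, and log field names depend on your webhook provider and are assumptions here.

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed signature scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())

def delivery_log(delivery_id: str, secret: bytes, body: bytes, signature_hex: str) -> dict:
    """Log record with the tags suggested above; field names are illustrative."""
    return {
        "env": "prod",
        "service": "webhooks",
        "delivery_id": delivery_id,
        "hmac_ok": verify(secret, body, signature_hex),
    }

secret = b"example-secret"            # hypothetical shared secret
body = b'{"event":"invoice.paid"}'    # hypothetical payload
record = delivery_log("dlv-1", secret, body, sign(secret, body))
print(record)
```

Using `hmac.compare_digest` rather than `==` avoids leaking timing information about how much of the signature matched.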

Alerting policy

  • Page (critical): node down, lag >= 6 blocks for 5m, disk free < 5%.
  • Ticket (warning): lag 2–5 blocks, disk free < 15%, memory > 90% for 10m.
  • Info: new release available, peer churn > threshold.
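
The page and ticket thresholds above can be expressed as a small classifier. This is a sketch rather than a real alerting rule: in practice the conditions would live in your alerting system, and the function and parameter names are hypothetical. One assumption is made explicit: lag of 6 or more blocks that has not yet lasted 5 minutes is treated as a warning.

```python
def classify(lag_blocks: int, lag_minutes: float, disk_free_pct: float,
             mem_pct: float, mem_minutes: float, node_up: bool = True) -> str:
    """Map current readings to 'page', 'ticket', or 'ok'."""
    # Critical: node down, sustained deep lag, or almost no disk left.
    if not node_up or (lag_blocks >= 6 and lag_minutes >= 5) or disk_free_pct < 5:
        return "page"
    # Warning: any lag of 2+ blocks, low disk, or sustained memory pressure.
    if lag_blocks >= 2 or disk_free_pct < 15 or (mem_pct > 90 and mem_minutes >= 10):
        return "ticket"
    return "ok"

print(classify(lag_blocks=7, lag_minutes=6, disk_free_pct=50, mem_pct=40, mem_minutes=0))
print(classify(lag_blocks=3, lag_minutes=0, disk_free_pct=50, mem_pct=40, mem_minutes=0))
```

Checking the critical conditions first ensures that a reading matching both tiers, such as 4% free disk, pages rather than only opening a ticket.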

Incident runbooks

  1. Lagging node: check peers, bandwidth, and disk I/O; restart the service; if data is corrupted, reindex or restore from a snapshot.
  2. Webhook failures: verify DNS/HTTPS, rotate secret, replay from dead‑letter queue; confirm idempotency on receiver.
  3. Lightning failures: inspect liquidity along the path, retry with MPP, rebalance, or fall back to on‑chain.
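
Replaying from a dead‑letter queue is only safe if the receiver is idempotent, because some of the queued events may already have been processed. A minimal sketch of that check, assuming each event carries a unique `id` and using an in-memory set where a real receiver would need durable storage (class and function names are hypothetical):

```python
class IdempotentReceiver:
    """Processes each event id at most once."""

    def __init__(self):
        self._seen = set()    # assumption: in-memory; use a durable store in production
        self.processed = []

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.processed.append(event)
        return True

def replay(dead_letters, receiver) -> int:
    """Re-deliver queued events; returns how many were newly processed."""
    return sum(1 for event in dead_letters if receiver.handle(event))

receiver = IdempotentReceiver()
receiver.handle({"id": "evt-1"})  # delivered before the outage
newly_processed = replay([{"id": "evt-1"}, {"id": "evt-2"}, {"id": "evt-2"}], receiver)
print(newly_processed, len(receiver.processed))
```

In this run only `evt-2` is new, so the replay processes one event and the duplicates are skipped.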

Recommended tooling

  • Metrics stack (Prometheus + Grafana or managed equivalent).
  • Log shipping (vector/fluentbit) to centralized search.
  • Alerting to on‑call with escalation policy.
  • Bitcoin Flux for invoice/webhook observability.

Tip: Alert on symptoms (lag, failed payments), not only on causes (process down). Users feel symptoms.