Setting Up Prometheus + Grafana for Cosmos Validator Monitoring

Your node is running. Now make sure you actually know when something goes wrong.

Why Monitoring Is Not Optional

Your validator is live. Blocks are being signed. Everything feels fine.

Until it isn’t.

A missed upgrade, a memory leak, a peer count drop, a disk filling up at 3 AM — these are not edge cases. They are the routine reality of running production infrastructure. The difference between a validator with 99.9% uptime and one that gets jailed isn’t always hardware quality or software skill. It’s visibility.

Prometheus + Grafana is the standard monitoring stack for Cosmos validators — battle-tested, widely adopted, and deeply integrated into the Cosmos SDK. This guide assumes your node is already running and walks you through adding a complete monitoring layer on top of it.

Architecture Overview

Before touching a config file, understand what you’re building:

Cosmos SDK metrics — your gaiad process exposes validator-specific metrics over HTTP
node_exporter — exposes host-level metrics (CPU, RAM, disk, network)
Prometheus — scrapes both endpoints and stores time-series data
Grafana — queries Prometheus and renders dashboards and alerts

All four components can run on the same server for a single-node setup, or you can run Prometheus + Grafana on a separate monitoring server for better isolation.

Step 1: Enable Metrics on Your Cosmos Node

Your node exposes Prometheus metrics natively — you just need to turn it on.

Open config.toml:

nano ~/.gaia/config/config.toml

Find the instrumentation section and enable it:

#######################################################
###       Instrumentation Configuration             ###
#######################################################
[instrumentation]

# Enable Prometheus metrics
prometheus = true

# Address to listen on
prometheus_listen_addr = ":26660"

# Maximum number of simultaneous connections
max_open_connections = 3

# Instrumentation namespace
namespace = "cometbft"

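Also enable telemetry in app.toml for application-level metrics (Cosmos SDK v0.47+). The SDK serves these through the node's API server (at /metrics?format=prometheus, port 1317 by default) rather than on port 26660, so the API server must be enabled for them to be reachable: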

[telemetry]
enabled = true
prometheus-retention-time = 600  # seconds

Restart your node to apply:

sudo systemctl restart gaiad

Verify metrics are being exposed:

curl -s localhost:26660/metrics | head -40

You should see output like:

# HELP cometbft_consensus_height Height of the chain
# TYPE cometbft_consensus_height gauge
cometbft_consensus_height 15482300
# HELP cometbft_consensus_validators Number of validators
# TYPE cometbft_consensus_validators gauge
cometbft_consensus_validators 18

If you see this, your node is ready to be scraped.
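To pull a single value out of that output, a small helper can be handy. The sketch below assumes the plain-text exposition format shown above; `metric_value` is a hypothetical helper name, not part of any tool. It matches a metric name with or without a `{label="..."}` suffix and prints the last field of the first matching sample.

```shell
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the value of the first sample whose name is NAME.
# Hypothetical helper; handles samples with or without {labels}.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                                   # skip HELP/TYPE comments
    $1 == name || index($1, name "{") == 1 { print $NF; exit }
  '
}

# Usage (assumes the node from Step 1 is listening on localhost:26660):
#   curl -s localhost:26660/metrics | metric_value cometbft_consensus_height
```

Running it twice a few seconds apart should show the height increasing on a synced node.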

Step 2: Install node_exporter

node_exporter exposes host-level metrics — CPU, memory, disk, network. Essential for catching infrastructure problems before they affect your validator.

# Download node_exporter (v1.7.0 shown here; check the releases page for the current version)
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz

# Move binary
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create dedicated user
sudo useradd -rs /bin/false node_exporter
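The download steps above do not verify the tarball. Since the binary will run on a validator host, it may be worth comparing its SHA-256 digest against the checksum listed on the project's release page before installing it. A minimal sketch, assuming coreutils' `sha256sum` is available; `verify_sha256` is a hypothetical helper name, and the expected digest must be copied from the release page yourself.

```shell
# verify_sha256 FILE EXPECTED_HEX: succeed only if FILE's SHA-256 digest
# equals EXPECTED_HEX. Hypothetical helper built on coreutils' sha256sum.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}

# Usage (EXPECTED_HEX is a placeholder for the digest from the release page):
#   verify_sha256 /tmp/node_exporter-1.7.0.linux-amd64.tar.gz EXPECTED_HEX
```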

Create a systemd service:

sudo nano /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl -s localhost:9100/metrics | head -10

Step 3: Install Prometheus

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar xvfz prometheus-2.50.0.linux-amd64.tar.gz

sudo mv prometheus-2.50.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.50.0.linux-amd64/promtool /usr/local/bin/

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo useradd -rs /bin/false prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Configure Prometheus

Create the main config file:

sudo nano /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    chain: cosmoshub
    validator: your-moniker   # replace with your validator moniker

# Alerting rules (we'll add these in Step 6)
rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  # Cosmos node metrics (CometBFT)
  - job_name: cosmos_node
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: validator

  # Host-level metrics
  - job_name: node_exporter
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: validator
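
A single indentation slip in prometheus.yml is enough to stop Prometheus from starting, so it can help to validate the file with promtool (installed in the previous step) before creating the service. The wrapper below is a sketch; `validate_prom_config` is a hypothetical name, while `promtool check config` is the actual subcommand.

```shell
# validate_prom_config [PATH]: run promtool's config check on PATH
# (defaults to the path used in this guide). Hypothetical wrapper.
validate_prom_config() {
  config="${1:-/etc/prometheus/prometheus.yml}"
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  promtool check config "$config"
}

# Usage:
#   validate_prom_config && echo "config looks valid"
```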

Create the systemd service:

sudo nano /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=0.0.0.0:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Confirm it's running
sudo systemctl status prometheus

Visit http://your-server-ip:9090 — you should see the Prometheus UI. Check Status → Targets to confirm both scrape jobs are UP.
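
The same check can be done from the command line through the Prometheus HTTP API at /api/v1/targets. The helper below is a rough sketch: `count_up_targets` is a hypothetical name, and it simply counts `"health":"up"` occurrences with grep, which assumes the compact JSON Prometheus normally returns. A proper JSON parser such as jq would be more robust.

```shell
# count_up_targets: read the JSON from /api/v1/targets on stdin and print
# how many targets report "health":"up". Grep-based approximation only;
# it does not actually parse the JSON.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage (assumes Prometheus is listening on localhost:9090):
#   curl -s http://localhost:9090/api/v1/targets | count_up_targets
```

With the two scrape jobs configured in this guide, a healthy setup should print 2.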

Step 4: Install Grafana

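# Add Grafana APT repository (apt-key is deprecated; store the key in a keyring instead)
sudo apt-get install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list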

sudo apt-get update
sudo apt-get install -y grafana

sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Access Grafana at http://your-server-ip:3000

Default credentials: admin / admin (you'll be prompted to change on first login).

Add Prometheus as a Data Source

Go to Connections → Data Sources → Add data source
Select Prometheus
Set URL to http://localhost:9090
Click Save & Test — you should see “Data source is working”

Import a Community Dashboard

Rather than building from scratch, use the community Cosmos validator dashboard:

Go to Dashboards → Import
Enter dashboard ID 14914 (Cosmos Validator Dashboard by Chainode Tech)
Select your Prometheus data source
Click Import

This gives you an immediate view of your validator’s key metrics without writing a single PromQL query.

Step 5: Key Metrics to Watch

Once your dashboard is up, these are the metrics that matter most for a Cosmos validator:

Consensus & Signing

# Current block height — should be increasing
cometbft_consensus_height

# Missed blocks — alert if this increases rapidly
cometbft_consensus_validator_missed_blocks

# Total voting power of all validators (your own validator's power is cometbft_consensus_validator_power)
cometbft_consensus_validators_power

# Number of connected peers — alert if drops below 5
cometbft_p2p_peers

Node Health

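# Is the node catching up? (0 = synced, 1 = catching up)
# Newer CometBFT releases may expose this as cometbft_consensus_block_syncing; check your /metrics output
cometbft_consensus_fast_syncing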

# Mempool size
cometbft_mempool_size

# Time between this block and the previous one (seconds)
cometbft_consensus_block_interval_seconds

Host-Level (via node_exporter)

# CPU usage %
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage %
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage %
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

# Disk space consumed over the last hour (bytes); useful for forecasting when you'll run out
-delta(node_filesystem_avail_bytes{mountpoint="/"}[1h])
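
Forecasting from a growth rate is simple division: available bytes divided by bytes consumed per hour gives the hours left, assuming consumption stays constant. The sketch below works that arithmetic in shell; `hours_until_full` is a hypothetical helper, and the figures in the usage line are made-up examples, not measurements.

```shell
# hours_until_full AVAIL_BYTES BYTES_PER_HOUR: integer hours left before the
# filesystem fills, assuming a constant consumption rate (linear forecast).
hours_until_full() {
  if [ "$2" -le 0 ]; then
    echo "inf"   # not growing (or shrinking): no projected fill time
    return 0
  fi
  echo $(( $1 / $2 ))
}

# Usage with illustrative numbers (500 GB free, 2 GB consumed per hour):
#   hours_until_full 500000000000 2000000000
```

With those example numbers the result is 250 hours, a little over ten days of headroom.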

Step 6: Set Up Alerting Rules

This is where monitoring becomes operational. Grafana dashboards tell you what’s happening — alerts tell you when to wake up.

Prometheus Alert Rules

Create /etc/prometheus/alerts.yml:

groups:
  - name: cosmos_validator
    rules:

      # Node is down
      - alert: ValidatorNodeDown
        expr: up{job="cosmos_node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Validator node is unreachable"
          description: "Prometheus cannot scrape the validator node. It may be down."

      # Node is catching up (not in sync)
      - alert: ValidatorNotSynced
        expr: cometbft_consensus_fast_syncing == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Validator is catching up"
          description: "Node has been in fast sync mode for over 5 minutes."

      # Peer count too low
      - alert: LowPeerCount
        expr: cometbft_p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }} peers"
          description: "Validator has fewer than 5 peers. Network connectivity may be degraded."

      # Missed blocks: more than 5 in the last 10 minutes
      - alert: MissingBlocks
        expr: increase(cometbft_consensus_validator_missed_blocks[10m]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Validator missing blocks"
          description: "{{ $value }} blocks missed in the last 10 minutes."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"

      # Disk filling up
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80%: {{ $value }}%"
          description: "At current growth rate, plan for expansion soon."

      # Critical disk
      - alert: DiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Disk nearly full at {{ $value }}%"

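Validate the rules, then reload Prometheus to pick them up:

# Verify the rules file is valid before reloading
promtool check rules /etc/prometheus/alerts.yml

# The unit file above defines no ExecReload, so send SIGHUP to reload the config
# (or use: sudo systemctl restart prometheus)
sudo systemctl kill -s HUP prometheus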

Configure Grafana Alert Notifications

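To receive alerts via Telegram (most popular among Cosmos validators), add a Grafana contact point. Note that contact points deliver notifications for Grafana-managed alert rules; the Prometheus rules above are evaluated by Prometheus itself and need Alertmanager, or equivalent rules recreated in Grafana, before they notify anyone: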

Create a Telegram bot via @BotFather and note the bot token
Get your chat ID by messaging @userinfobot
In Grafana: Alerting → Contact Points → Add contact point
Select Telegram, enter your bot token and chat ID
Test the connection
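
Before entering the token and chat ID in Grafana, it can be useful to confirm they work by sending a message directly through the Telegram Bot API's sendMessage method. A minimal sketch using curl; the function names are hypothetical and the token and chat ID in the usage line are placeholders.

```shell
# telegram_send_url TOKEN: build the Bot API sendMessage URL for a bot token.
telegram_send_url() {
  printf 'https://api.telegram.org/bot%s/sendMessage' "$1"
}

# telegram_test TOKEN CHAT_ID: send a short test message to CHAT_ID.
telegram_test() {
  curl -fsS --max-time 10 "$(telegram_send_url "$1")" \
    --data-urlencode "chat_id=$2" \
    --data-urlencode "text=Test message from validator monitoring"
}

# Usage (placeholder values; substitute your own):
#   telegram_test "123456:ABC-EXAMPLE" "987654321"
```

If the message does not arrive, the token or chat ID is likely wrong, and the Grafana contact point would fail for the same reason.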

For PagerDuty or email, Grafana supports both natively under the same menu.

Step 7: Secure Your Monitoring Stack

By default, Prometheus and Grafana are exposed on all interfaces. Lock them down:

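# Firewall rules: allow Grafana (3000) and Prometheus (9090) only from your IP
sudo ufw allow from YOUR_IP to any port 3000
sudo ufw allow from YOUR_IP to any port 9090

# Or bind Prometheus to localhost only: set this flag in the service ExecStart,
# then run: sudo systemctl daemon-reload && sudo systemctl restart prometheus
--web.listen-address=127.0.0.1:9090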

# Change Grafana default admin password immediately
# Settings → Profile → Change Password

# Disable Grafana anonymous access
sudo nano /etc/grafana/grafana.ini
# [auth.anonymous]
# enabled = false

If you’re running Prometheus + Grafana on a separate monitoring server, use a private network or WireGuard tunnel between your validator and monitoring host — never expose port 26660 publicly.
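
To confirm what is actually reachable, the output of `ss -tln` can be filtered for the stack's ports that are bound to a wildcard address. The filter below is a sketch: `exposed_ports` is a hypothetical helper, the port list reflects the defaults used in this guide, and a wildcard bind only means the firewall is the sole barrier, not that the port is necessarily open to the internet.

```shell
# exposed_ports: read `ss -tln` output on stdin and print the monitoring
# ports (3000, 9090, 9100, 26660) that listen on a wildcard address.
exposed_ports() {
  awk '
    $1 == "LISTEN" {
      n = split($4, parts, ":")          # local address is the 4th column
      port = parts[n]
      addr = substr($4, 1, length($4) - length(port) - 1)
      if (port ~ /^(3000|9090|9100|26660)$/ &&
          (addr == "0.0.0.0" || addr == "*" || addr == "[::]"))
        print port
    }' | sort -un
}

# Usage:
#   ss -tln | exposed_ports
```

An empty result means none of the four ports is listening on all interfaces.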

What Your Dashboard Should Show at a Glance

A well-configured Grafana dashboard should answer these questions within 5 seconds of opening:

✅ Is my node synced and signing blocks?
✅ How many peers am I connected to?
✅ What’s my CPU / RAM / disk usage?
✅ Have I missed any blocks in the last hour?
✅ Is disk usage trending toward a problem?

If you can’t answer all five at a glance, your dashboard needs work.

Common Issues

Metrics endpoint returns nothing: Check that prometheus = true is set in config.toml and that the node has been restarted. Confirm the port with ss -tlnp | grep 26660.

Prometheus targets show as DOWN: Usually a firewall issue or wrong port. Check ufw status and verify the metrics endpoint is reachable from where Prometheus is running.

Grafana shows “No data”: Check the data source URL and that Prometheus is actually receiving data. Run a raw query in Prometheus UI first to confirm metrics exist.

node_exporter metrics missing: Confirm the service is running with systemctl status node_exporter and that port 9100 is reachable.
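
Most of these checks come down to whether each component answers on its port. The sketch below probes all four in one pass, assuming a single-server setup with the default ports from this guide. The function names are hypothetical; /-/healthy and /api/health are the built-in health endpoints of Prometheus and Grafana respectively.

```shell
# monitoring_endpoints: name and URL of each component, one per line.
# Ports are the defaults used in this guide; adjust if yours differ.
monitoring_endpoints() {
  cat <<'EOF'
cosmos_node http://localhost:26660/metrics
node_exporter http://localhost:9100/metrics
prometheus http://localhost:9090/-/healthy
grafana http://localhost:3000/api/health
EOF
}

# check_stack: probe every endpoint and print UP or DOWN for each.
check_stack() {
  monitoring_endpoints | while read -r name url; do
    if curl -fsS --max-time 3 -o /dev/null "$url" 2>/dev/null; then
      echo "$name UP"
    else
      echo "$name DOWN"
    fi
  done
}

# Usage:
#   check_stack
```

Any component reported as DOWN is the place to start with systemctl status and the service logs.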

Summary

The stack takes about 30–45 minutes to set up end-to-end. Once it’s running, you’ll wonder how you ever operated a validator without it.

Monitoring doesn’t prevent problems — but it ensures you’re never the last one to know when something goes wrong.

Setting Up Prometheus + Grafana for Cosmos Validator Monitoring was originally published in Vitwit on Medium.