Mainstream options

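Tool                  Best for                                         Notes
Prometheus + Grafana  Cloud-native / containerized / modern projects   De facto industry standard; Prometheus is a CNCF Graduated project
Zabbix                Traditional ops / many device types              Long-established; email alerts, rich templates
Netdata               Detailed single-host monitoring                  Works out of the box; attractive visualization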

This article uses Prometheus + Grafana, consistent with ops-corp/19.

Architecture

Monitored hosts (node_exporter) ← scrape ← Prometheus (central server)
                                               ↓
                                           Grafana (visualization)

1. Install node_exporter (on every monitored host)

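📌 The version number in the download link below will go out of date. Before you start, check the node_exporter releases page for the current latest version and set the VER variable to it.

# Download (set VER to the latest version on the releases page, e.g. 1.8.2)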
VER=1.8.2
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v${VER}/node_exporter-${VER}.linux-amd64.tar.gz
tar -xzf node_exporter-${VER}.linux-amd64.tar.gz
sudo mv node_exporter-${VER}.linux-amd64/node_exporter /usr/local/bin/

# System user (no login shell) that the service runs as
sudo useradd -rs /usr/sbin/nologin node_exporter

# systemd unit
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Verify:

curl http://localhost:9100/metrics | head
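
To check one specific metric rather than skimming the first lines, a small filter can help. The following is a minimal sketch, not part of node_exporter: metric_value is a hypothetical helper name, and it only matches samples that carry no labels, such as node_load1.

```shell
#!/usr/bin/env bash
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of the
# first unlabelled sample whose metric name is exactly NAME.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage against a live exporter (assumes node_exporter listens on localhost:9100):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```

Comment lines (# HELP, # TYPE) are skipped automatically because their first field is "#", not the metric name.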

Allow the monitoring server to reach port 9100 through the firewall:

sudo ufw allow from <monitoring-server-IP> to any port 9100

2. Install Prometheus (on the monitoring server)

📌 As before: get the latest version number from the Prometheus releases page and set VER to it.

# Set VER to the latest version on the releases page
VER=2.55.1
cd /tmp
curl -LO https://github.com/prometheus/prometheus/releases/download/v${VER}/prometheus-${VER}.linux-amd64.tar.gz
tar -xzf prometheus-${VER}.linux-amd64.tar.gz

sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus-${VER}.linux-amd64/{prometheus,promtool} /usr/local/bin/
# Prometheus 3.x tarballs no longer ship these example console files; skip this line on 3.x
sudo cp -r prometheus-${VER}.linux-amd64/{consoles,console_libraries} /etc/prometheus/

sudo useradd -rs /usr/sbin/nologin prometheus
sudo chown -R prometheus: /etc/prometheus /var/lib/prometheus

/etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    static_configs:
      - targets:
          - '192.168.1.100:9100'
          - '192.168.1.101:9100'
          - '192.168.1.102:9100'

  # You can also scrape an application's own /metrics endpoint.
  # Use the app's real port: 3000 is also Grafana's default port.
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:3000']

systemd unit /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
Restart=on-failure

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

Open http://<monitoring-server-IP>:9090 to see Prometheus's built-in UI.

3. Install Grafana

sudo apt install -y software-properties-common apt-transport-https
sudo wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update
sudo apt install grafana
sudo systemctl enable --now grafana-server

Open http://<monitoring-server-IP>:3000. The default login is admin / admin, and you are asked to change the password on first login.

4. Connect Grafana to Prometheus

In Grafana:

  1. Left-hand menu → Configuration → Data sources → Add data source (Connections → Data sources in Grafana 10 and later)
  2. Select Prometheus
  3. Set URL to http://localhost:9090
  4. Click Save & Test
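
The same data source can also be declared in a file instead of being clicked together in the UI. The following is a sketch that assumes Grafana's file-based provisioning and the default provisioning directory of the apt package; render_datasource is a hypothetical helper name and the output file name is arbitrary.

```shell
#!/usr/bin/env bash
# render_datasource [URL]
# Prints a Grafana data source provisioning file for a Prometheus server.
# URL defaults to http://localhost:9090, the address used in the steps above.
render_datasource() {
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${1:-http://localhost:9090}
    isDefault: true
EOF
}

# Usage on the monitoring server (assumed default path for apt installs):
#   render_datasource | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```

Keeping the data source in a file makes the setup reproducible when the monitoring server is rebuilt.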

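5. Import the node_exporter dashboard

The community publishes many ready-made dashboards:

  1. Left-hand menu → Dashboards → New → Import
  2. Enter ID 1860 (Node Exporter Full, the most widely used)
  3. Select the Prometheus data source you just added → Import

You immediately get CPU, memory, disk, network, and load graphs for every host.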

6. Add alerts (together with Alertmanager)

See ops-corp/22-alerting for details. A minimal example:

/etc/prometheus/rules.yml:

groups:
  - name: node_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has stayed above 80% for 5 minutes"

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
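
To make the two thresholds concrete, here is the arithmetic behind each expression worked through with made-up numbers. This is only an illustration of the formulas; busy_percent and free_percent are hypothetical helper names, and Prometheus evaluates the real expressions itself.

```shell
#!/usr/bin/env bash
# busy_percent IDLE_FRACTION
# Mirrors the HighCPU expression: 100 - (average idle rate * 100).
busy_percent() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# free_percent AVAIL SIZE
# Mirrors the DiskFull expression: (avail / size) * 100.
free_percent() {
  awk -v avail="$1" -v size="$2" 'BEGIN { printf "%.1f\n", avail / size * 100 }'
}

busy_percent 0.15     # CPUs idle 15% of the time -> 85.0, above the 80 threshold
free_percent 5 100    # 5 units free out of 100   -> 5.0, below the 10 threshold
```

In both cases the alert only fires once the condition has held for the full 5 minutes given by the for: field.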

Reference it from prometheus.yml:

rule_files:
  - "rules.yml"

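Reload the configuration by sending SIGHUP (the unit file above defines no ExecReload, so systemctl reload would be rejected):

sudo systemctl kill -s HUP prometheus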
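
A syntax error in prometheus.yml or rules.yml is easier to catch before the running server is asked to re-read them. The following is a sketch that validates first and signals second; reload_prometheus is a hypothetical helper name, and it assumes promtool was copied to /usr/local/bin as in step 2 and that the service is named prometheus.

```shell
#!/usr/bin/env bash
# reload_prometheus [CONFIG]
# Validates CONFIG (and the rule files it references) with promtool, then
# sends SIGHUP so the running server re-reads its configuration.
# Returns 2 if promtool is not installed, 1 if validation fails.
reload_prometheus() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 2
  fi
  promtool check config "${1:-/etc/prometheus/prometheus.yml}" || return 1
  sudo systemctl kill -s HUP prometheus
}

# Usage:
#   reload_prometheus
```

If validation fails, the server keeps running with its previous configuration because no signal is sent.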

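Monitoring the application itself

To monitor at the application level, the application must expose a /metrics endpoint in the Prometheus text format.

Node: install prom-client; Python: prometheus_client; Go: prometheus/client_golang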
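
For orientation, this is roughly what such an endpoint returns for a single counter. The sketch below only prints the text format by hand and is not a substitute for the client libraries; render_counter and the metric name myapp_requests_total are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# render_counter NAME HELP VALUE
# Prints one counter in the Prometheus text exposition format:
# a HELP line, a TYPE line, then the sample itself.
render_counter() {
  printf '# HELP %s %s\n' "$1" "$2"
  printf '# TYPE %s counter\n' "$1"
  printf '%s %s\n' "$1" "$3"
}

# myapp_requests_total is a hypothetical metric name.
render_counter myapp_requests_total "Total requests handled." 42
```

The client libraries generate this output for you and also handle labels, histograms, and concurrency, which is why a real application should use them.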

Details are in ops-corp/19-prometheus-grafana.

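Alternatives

If you would rather not run the stack yourself:

  • Cloudflare monitoring (free)
  • Better Stack / UptimeRobot (lightweight uptime checks)
  • Grafana Cloud Free (the free tier is enough for personal projects)
  • Datadog / New Relic (commercial; expensive but comprehensive)

Next: Getting started with Docker containers.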