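The Ops Monitoring Tool I've Relied On for 3 Years: A Hands-On Guide to Setting Up Grafana + Prometheus

In operations work, one thing matters more than handling failures: spotting the problem before the failure happens. To do that, you need a monitoring system that is pleasant to use.

I have used Zabbix, Nagios, and Cacti, and finally settled on the Grafana + Prometheus combination, which I have now run for three years. Here I am writing up what I learned about setting it up and using it, so you can avoid the detours I took.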

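Why Grafana + Prometheus?

First, why not the other options:

Zabbix: powerful, but complex to configure, with a dated interface and a steep learning curve

Nagios: a veteran tool, but hard to extend and short on modern features

Commercial options (Datadog/New Relic): pleasant to use, but expensive, and your data lives on someone else's servers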

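Advantages of Grafana + Prometheus:

Fully open source and free

A rich ecosystem, with plenty of ready-made Exporters and Dashboards

Prometheus stores metrics in a time-series database designed for monitoring, so queries are fast

Grafana has an attractive interface and strong visualization capabilities

Grafana supports many data sources, not just Prometheus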

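Architecture

[Monitored servers]
  ├── Node Exporter (system metrics)
  ├── MySQL Exporter (database metrics)
  └── Nginx Exporter (web service metrics)
         ↓ HTTP Pull
[Prometheus Server]
  ├── Stores time-series data
  ├── Evaluates alerting rules
  └── Pushes alerts to Alertmanager
         ↓
[Alertmanager]
  └── Sends alerts to DingTalk / WeCom / email
         ↓
[Grafana]
  └── Queries Prometheus and renders Dashboards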

Step 1: Install Prometheus

# Download Prometheus (Linux amd64 as an example)
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
mv prometheus-2.51.0.linux-amd64 /opt/prometheus

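Create the configuration file /opt/prometheus/prometheus.yml:

global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alerting rules are evaluated

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # This hostname must resolve from the Prometheus host;
          # use localhost:9093 if Alertmanager runs on the same machine.
          - alertmanager:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus scraping its own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter on each monitored server
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          env: 'production'

  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.10:9104']

  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.1.10:9113']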

Create the systemd service:

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then create the service user, hand it the install directory, and start the service:

useradd -r -s /bin/false prometheus
chown -R prometheus:prometheus /opt/prometheus
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
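Once the service is running, it helps to confirm that it is healthy and to have a safe way to reload it after editing the configuration. The following is a minimal sketch, assuming the /opt/prometheus install path and the localhost:9090 address used above. `prom_healthy` and `prom_check_and_reload` are hypothetical helper names. `promtool` ships in the same tarball as Prometheus, and the `/-/reload` endpoint only responds because the service was started with `--web.enable-lifecycle`.

```shell
# Assumed defaults matching the install above; override through the environment.
PROM_DIR="${PROM_DIR:-/opt/prometheus}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Succeeds if the server answers its health endpoint.
prom_healthy() {
  curl -fsS "$PROM_URL/-/healthy" > /dev/null
}

# Validate prometheus.yml with promtool, then ask the running server to reload it.
# Returns 1 without reloading if validation fails or promtool cannot be run.
prom_check_and_reload() {
  "$PROM_DIR/promtool" check config "$PROM_DIR/prometheus.yml" || return 1
  curl -fsS -X POST "$PROM_URL/-/reload"
}
```

After sourcing these definitions, `prom_healthy && echo ok` checks the server, and `prom_check_and_reload` is a reasonable habit after every edit to prometheus.yml or the rule files, since a broken config is rejected before the reload is attempted.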

Step 2: Install Node Exporter (on the monitored servers)

# Install on every monitored server
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

Node Exporter listens on port 9100 by default and exposes system metrics such as CPU, memory, disk, and network.
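What Prometheus pulls from that port is plain text served at /metrics. The sketch below shows the shape of that text and how to pick one value out of it. The sample values are made up for illustration, and `sample_metrics` and `metric_value` are hypothetical helpers that only handle metrics without labels.

```shell
# Illustrative excerpt of the text format an exporter serves at /metrics.
sample_metrics() {
  cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
EOF
}

# Print the value of an unlabelled metric from exposition-format text on stdin.
# Lines starting with "#" are HELP/TYPE comments and never match a metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

sample_metrics | metric_value node_load1   # prints 0.42
```

Against a real exporter the same helper works on live data, for example `curl -s http://192.168.1.10:9100/metrics | metric_value node_load1`, which is a quick way to confirm the exporter is reachable before Prometheus scrapes it.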

Step 3: Install Grafana

# Ubuntu/Debian
apt install -y apt-transport-https software-properties-common

wget -q -O - https://packages.grafana.com/gpg.key | apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | tee /etc/apt/sources.list.d/grafana.list

apt update
apt install grafana

systemctl enable grafana-server
systemctl start grafana-server

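Grafana listens on port 3000 by default. The default login is admin/admin, and Grafana prompts you to change the password on first login.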

Add the Prometheus data source:

1. Go to Configuration → Data Sources
2. Click Add data source
3. Select Prometheus
4. Set the URL to `http://localhost:9090`
5. Click Save & Test
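The same data source can also be created without clicking through the UI, which is handy when you rebuild Grafana often. This is a rough sketch using Grafana's HTTP API (`POST /api/datasources`). It assumes Grafana on localhost:3000 with the default admin/admin login, and `datasource_payload` and `add_prometheus_datasource` are hypothetical helper names.

```shell
# Assumed defaults; override through the environment if yours differ.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Build the JSON body for a Prometheus data source.
# $1 = data source name, $2 = Prometheus URL
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}' "$1" "$2"
}

# Create the data source through the Grafana HTTP API using basic auth.
add_prometheus_datasource() {
  curl -fsS -u "${GRAFANA_USER:-admin}:${GRAFANA_PASS:-admin}" \
    -H 'Content-Type: application/json' \
    -X POST "$GRAFANA_URL/api/datasources" \
    -d "$(datasource_payload "${1:-Prometheus}" "${2:-http://localhost:9090}")"
}
```

Calling `add_prometheus_datasource` with no arguments mirrors the five UI steps above. If you have already changed the admin password, set `GRAFANA_PASS` first.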

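Step 4: Import ready-made Dashboards

Grafana has a large collection of community-contributed Dashboards, so you don't have to draw your own from scratch.

Recommended Dashboard IDs:

Node Exporter Full: ID 1860 - comprehensive system monitoring

MySQL Overview: ID 7362 - MySQL monitoring

Nginx: ID 9614 - Nginx monitoring

Redis: ID 11835 - Redis monitoring

To import one:

1. Click + → Import in the left sidebar
2. Enter the Dashboard ID
3. Select the data source
4. Click Import

Within a few minutes you have a professional-looking monitoring wall.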

Step 5: Configure alerting rules

# /opt/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      # CPU usage above 80% for 5 minutes
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage: {{ $value | humanize }}%"

      # Available memory below 10%
      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory: {{ $value | humanize }}%"

      # Disk usage above 85%
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage: {{ $value | humanize }}%"

      # Server down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down: {{ $labels.instance }}"
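The `for:` clause in these rules is what keeps a short spike from paging you: an alert first goes pending, and only fires once the expression has stayed true for the whole duration. The toy simulation below illustrates that behavior under simplified assumptions (one series, fixed evaluation interval). `alert_states` is a hypothetical helper, not part of Prometheus.

```shell
# Simulate the `for:` clause for a "value > threshold" rule.
# Args: $1 = threshold, $2 = for-duration in seconds, $3 = evaluation interval in seconds.
# Reads one sampled value per evaluation from stdin and prints the alert state for each.
alert_states() {
  awk -v thr="$1" -v hold="$2" -v step="$3" '
    {
      if ($1 > thr) {
        # First breach starts the clock; later breaches advance it.
        if (!active) { active = 1; since = 0 } else { since += step }
        print (since >= hold ? "firing" : "pending")
      } else {
        # Any evaluation below the threshold resets the alert.
        active = 0
        print "inactive"
      }
    }'
}

# Threshold 80, for: 30s, evaluated every 15s.
printf '%s\n' 70 85 90 95 60 | alert_states 80 30 15
```

With these inputs the states are inactive, pending, pending, firing, inactive: the value has to stay above 80 across three consecutive evaluations (30 seconds) before the alert fires, and one sample below the threshold resets it. Before loading real rules, `promtool check rules /opt/prometheus/rules/node_alerts.yml` catches syntax errors.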

Step 6: Configure Alertmanager to send DingTalk alerts

# /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'dingtalk'

receivers:
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true

Paired with the `prometheus-webhook-dingtalk` tool, this configuration delivers alerts to a DingTalk group.
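To check the notification path end to end without waiting for a real incident, you can hand Alertmanager a hand-made alert through its v2 HTTP API (`POST /api/v2/alerts`). The sketch below assumes Alertmanager is reachable at localhost:9093. `test_alert_payload` and `send_test_alert` are hypothetical helpers, and the alert name and instance are placeholders.

```shell
# Assumed address; override through the environment if Alertmanager runs elsewhere.
AM_URL="${AM_URL:-http://localhost:9093}"

# Build a one-element alert list in the JSON shape the v2 API accepts.
# $1 = alertname label, $2 = instance label
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# Post the test alert; it is then routed like any alert coming from Prometheus.
send_test_alert() {
  curl -fsS -H 'Content-Type: application/json' \
    -X POST "$AM_URL/api/v2/alerts" \
    -d "$(test_alert_payload "${1:-ManualTest}" "${2:-test-host:9100}")"
}
```

With the route above, a message should reach the DingTalk group roughly `group_wait` (30s) after `send_test_alert` runs. If nothing arrives, the webhook URL or the `prometheus-webhook-dingtalk` process is the first place to look.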

Useful PromQL queries

You only get the full power of Prometheus once you learn PromQL:

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk read throughput (bytes/s)
irate(node_disk_read_bytes_total[5m])

# Inbound network traffic (bytes/s)
irate(node_network_receive_bytes_total{device!="lo"}[5m])

# HTTP request rate (Nginx)
irate(nginx_http_requests_total[5m])

# MySQL QPS
irate(mysql_global_status_queries[5m])
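Two small helpers make these queries less abstract. The first reproduces by hand the arithmetic behind the CPU query: `irate` takes the last two samples of a counter in the range and divides the increase by the elapsed time. The second runs any PromQL expression through the Prometheus HTTP API (`/api/v1/query`). Both are sketches: `cpu_usage_pct` and `prom_query` are hypothetical names, and the localhost:9090 address is assumed from the setup above.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# CPU usage in percent for one core, from two samples of the idle-seconds counter.
# Args: $1 = idle counter at t1, $2 = t1 (seconds), $3 = idle counter at t2, $4 = t2 (seconds)
cpu_usage_pct() {
  awk -v i1="$1" -v t1="$2" -v i2="$3" -v t2="$4" \
    'BEGIN { printf "%.1f\n", 100 - (i2 - i1) / (t2 - t1) * 100 }'
}

# Run an instant PromQL query and print the raw JSON response.
# curl -G with --data-urlencode takes care of escaping braces and quotes.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# The idle counter grew by 3 seconds over a 15-second window: 20% idle, 80% busy.
cpu_usage_pct 1000 0 1003 15   # prints 80.0
```

For example, `prom_query 'up'` returns one result per scrape target, with a value of 1 for targets that are reachable and 0 for those that are not.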

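Final thoughts

Set up Grafana + Prometheus once, and it keeps paying off for a long time.

Once it is running, you can:

View CPU, memory, disk, and network status for all servers in real time

Receive alerts before a failure happens and deal with it early

Look back through historical data to analyze the root cause of a failure

Plan capacity and predict when you will need to scale out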
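As a feel for what the capacity-planning arithmetic looks like, here is a deliberately simple two-point extrapolation. It is only a sketch: `days_until_full` is a hypothetical helper, and the numbers are made up. For real data, PromQL's built-in `predict_linear` function fits a line over many samples instead of two.

```shell
# Estimate days until a filesystem is full from two usage readings.
# Args: $1 = total size, $2 = used earlier, $3 = used now, $4 = days between readings
# Sizes can be in any unit as long as all three use the same one.
days_until_full() {
  awk -v size="$1" -v used_then="$2" -v used_now="$3" -v days_apart="$4" 'BEGIN {
    growth = (used_now - used_then) / days_apart   # units per day
    if (growth <= 0) { print "never"; exit }        # flat or shrinking usage
    printf "%.1f\n", (size - used_now) / growth
  }'
}

# A 500 GB disk went from 300 GB to 350 GB used in 10 days: 5 GB/day, 150 GB left.
days_until_full 500 300 350 10   # prints 30.0
```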

Where to go next:

Add Loki for log aggregation

Add Jaeger for distributed tracing

Build a complete observability platform
