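The Ops Monitoring Stack I Have Used for 3 Years: A Hands-On Guide to Setting Up Grafana + Prometheus
In operations work, one thing matters more than handling incidents: spotting problems before they turn into incidents. To do that, you need a monitoring system that is pleasant to use.
I have used Zabbix, Nagios, and Cacti, and finally settled on the Grafana + Prometheus combination, which I have now run for three years. This post collects what I learned from setting it up and using it, so you can avoid some detours.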
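Why Grafana + Prometheus?
First, why not the other options:
- Zabbix: powerful, but complex to configure, with a dated interface and a steep learning curve
- Nagios: a veteran tool, but hard to extend and showing its age
- Commercial products (Datadog/New Relic): easy to use, but expensive, and your data lives on someone else's servers
What Grafana + Prometheus offers instead:
- Fully open source and free
- A rich ecosystem, with plenty of ready-made exporters and dashboards
- A time-series database designed for monitoring, with fast queries
- A good-looking Grafana interface with strong visualization features
- Support for many data sources, not just Prometheus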
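Architecture
[Monitored servers]
├── Node Exporter (system metrics)
├── MySQL Exporter (database metrics)
└── Nginx Exporter (web server metrics)
        ↓ HTTP Pull
[Prometheus Server]
├── Stores time-series data
├── Evaluates alerting rules
└── Pushes alerts to Alertmanager
        ↓
[Alertmanager]
└── Sends alerts to DingTalk / WeCom / email
        ↓
[Grafana]
└── Queries Prometheus and renders dashboards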
Step 1: Install Prometheus
# Download Prometheus (Linux amd64 as an example)
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
mv prometheus-2.51.0.linux-amd64 /opt/prometheus
Create the configuration file /opt/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          env: 'production'

  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.10:9104']

  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.1.10:9113']
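Before starting the server, it can help to validate the file. The release tarball ships a `promtool` binary next to `prometheus`, and `promtool check config` checks the config along with the rule files it references. The sketch below assumes the /opt/prometheus install path used above; the function name `check_prom_config` and the `PROMTOOL` variable are my own illustrative names, not part of Prometheus.

```shell
# Path to promtool; assumed to sit next to the prometheus binary in /opt/prometheus.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"

# Validate a Prometheus config file (hypothetical helper).
# $1: config path, defaults to the file created above.
# Returns promtool's exit status: 0 when the file is valid.
check_prom_config() {
    config_file="${1:-/opt/prometheus/prometheus.yml}"
    "$PROMTOOL" check config "$config_file"
}
```

A typical call would be `check_prom_config && echo "config OK"`; a non-zero exit status means promtool reported a problem.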
Create the systemd service:
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then create the service user and start Prometheus:
useradd -r -s /bin/false prometheus
chown -R prometheus:prometheus /opt/prometheus
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
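After `systemctl start`, the server may need a moment before it answers queries. Prometheus exposes a readiness endpoint at `/-/ready`; the sketch below polls it with `curl`. The address localhost:9090 is taken from the scrape config above, while `wait_for_prometheus` and `PROM_URL` are hypothetical names used only for this example.

```shell
# Base URL of the Prometheus server; localhost:9090 is assumed from the config above.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Poll the readiness endpoint until it answers successfully or attempts run out.
# $1: maximum number of attempts (default 10); waits 1 second after each failure.
wait_for_prometheus() {
    attempts="${1:-10}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors such as 503 while the server is starting
        if curl -fsS -o /dev/null "$PROM_URL/-/ready"; then
            echo "Prometheus is ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "Prometheus did not become ready after $attempts attempts" >&2
    return 1
}
```

Running `wait_for_prometheus 30` right after the start command gives the server up to about half a minute; if it fails, `journalctl -u prometheus` is the next place to look.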
Step 2: Install Node Exporter (on the monitored servers)
# Install on every server you want to monitor
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

Node Exporter listens on port 9100 by default and exposes system metrics such as CPU, memory, disk, and network.
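To see what Prometheus will actually scrape, you can read the exporter's `/metrics` endpoint yourself. It returns plain text, one sample per line, with `# HELP` and `# TYPE` comment lines in between. The sketch below pulls one value out of that text with awk. `metric_value` is a hypothetical helper, and it assumes samples carry no trailing timestamp, so the value is the last field on the line.

```shell
# Print the value of the first sample whose metric name matches $1.
# Reads Prometheus text exposition format from stdin.
# Assumption: no trailing timestamp, so the value is the last field.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                  # skip HELP/TYPE comment lines
        {
            split($1, parts, "{")      # separate the metric name from its label set
            if (parts[1] == name) { print $NF; exit }
        }
    '
}
```

For example, `curl -s http://192.168.1.10:9100/metrics | metric_value node_load1` should print the current 1-minute load average of that host, assuming the exporter is reachable at the address from the scrape config.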
Step 3: Install Grafana
# Ubuntu/Debian
apt install -y apt-transport-https software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install grafana
systemctl enable grafana-server
systemctl start grafana-server

Grafana listens on port 3000 by default; the default login is admin/admin.
Add the Prometheus data source:
1. Go to Configuration → Data Sources
2. Click Add data source
3. Choose Prometheus
4. Set the URL to `http://localhost:9090`
5. Click Save & Test
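If you would rather not click through the UI on every new install, Grafana can also load data sources from provisioning files at startup. The sketch below writes such a file; it is one possible layout, not the only one. The helper name `write_prometheus_datasource` and the file name `prometheus.yaml` are my own choices, and the directory /etc/grafana/provisioning/datasources is the usual location for the deb package, so check your install if it differs.

```shell
# Write a Grafana data source provisioning file (hypothetical helper).
# $1: target directory, e.g. /etc/grafana/provisioning/datasources
# $2: Prometheus URL, defaults to the one used in the UI steps above
write_prometheus_datasource() {
    target_dir="$1"
    prom_url="${2:-http://localhost:9090}"
    mkdir -p "$target_dir" || return 1
    # The heredoc body is YAML; $prom_url is expanded by the shell
    cat > "$target_dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $prom_url
    isDefault: true
EOF
}
```

After `write_prometheus_datasource /etc/grafana/provisioning/datasources`, a `systemctl restart grafana-server` makes Grafana pick the file up.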
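Step 4: Import ready-made dashboards
Grafana has a large collection of community-contributed dashboards, so you do not have to draw everything from scratch.
Recommended dashboard IDs:
- Node Exporter Full: ID 1860, full system monitoring
- MySQL Overview: ID 7362, MySQL monitoring
- Nginx: ID 9614, Nginx monitoring
- Redis: ID 11835, Redis monitoring
To import one:
1. Click + → Import in the left sidebar
2. Enter the dashboard ID
3. Choose the data source
4. Click Import
A few minutes later you have a professional-looking monitoring dashboard.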
Step 5: Configure alerting rules
# /opt/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      # CPU usage above 80% for 5 minutes
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage: {{ $value | humanize }}%"

      # Available memory below 10%
      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory: {{ $value | humanize }}%"

      # Disk usage above 85%
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage: {{ $value | humanize }}%"

      # Server down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server down: {{ $labels.instance }}"
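Editing rule files on a running server is where typos tend to bite, so I find it useful to check before reloading. The sketch below runs `promtool check rules` and only then posts to `/-/reload`, which works because the service above starts Prometheus with `--web.enable-lifecycle`. The names `reload_rules`, `PROMTOOL`, and `PROM_URL` are hypothetical, and the paths are assumed from the earlier steps.

```shell
# Assumed locations, based on the install steps in this post.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Check the given rule files and, only if they pass, ask Prometheus to reload.
# $@: one or more rule file paths
reload_rules() {
    if ! "$PROMTOOL" check rules "$@"; then
        echo "rule check failed; not reloading" >&2
        return 1
    fi
    # POST /-/reload is only available with --web.enable-lifecycle
    curl -fsS -X POST "$PROM_URL/-/reload" && echo "reload requested"
}
```

A call such as `reload_rules /opt/prometheus/rules/*.yml` keeps a broken rule file from ever reaching the running server.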
Step 6: Configure Alertmanager to send DingTalk alerts
# /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'dingtalk'

receivers:
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true

Combined with the `prometheus-webhook-dingtalk` tool, this delivers alerts to a DingTalk group.
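Rather than waiting for a real outage to find out whether notifications arrive, you can push a synthetic alert into Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`). This is a minimal sketch: it assumes Alertmanager is reachable on localhost:9093, the helper names are hypothetical, and the payload builder does no JSON escaping, so keep quotes and backslashes out of the arguments.

```shell
# Alertmanager base URL; localhost:9093 is an assumption for this example.
AM_URL="${AM_URL:-http://localhost:9093}"

# Build the JSON body for one synthetic alert.
# $1: alert name, $2: instance label, $3: severity (default: warning)
# No JSON escaping is done; arguments must not contain quotes or backslashes.
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},"annotations":{"summary":"Test alert for %s"}}]\n' \
        "$1" "$2" "${3:-warning}" "$2"
}

# Post the synthetic alert to Alertmanager's v2 API.
send_test_alert() {
    test_alert_payload "$@" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
            --data-binary @- "$AM_URL/api/v2/alerts"
}
```

With `group_wait: 30s` from the route above, `send_test_alert TestAlert demo:9100` should produce a DingTalk message roughly half a minute later, provided the webhook bridge is listening on port 8060.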
Useful PromQL queries
Learning PromQL is what lets you get the real power out of Prometheus:
# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk read throughput
irate(node_disk_read_bytes_total[5m])

# Network traffic (inbound)
irate(node_network_receive_bytes_total{device!="lo"}[5m])

# HTTP request rate (Nginx)
irate(nginx_http_requests_total[5m])

# MySQL QPS
irate(mysql_global_status_queries[5m])
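The CPU query is easier to trust once you see the arithmetic behind it. `node_cpu_seconds_total{mode="idle"}` is a counter of idle seconds, and `irate()` takes the last two samples in the window and divides the increase by the time between them. The sketch below redoes that calculation for a single core with made-up sample values; `cpu_usage_pct` is a hypothetical helper, and the real query additionally averages across all cores of an instance.

```shell
# CPU usage (%) of one core from two samples of its idle counter.
# $1, $2: first sample value (idle seconds) and its timestamp (seconds)
# $3, $4: second sample value and its timestamp
cpu_usage_pct() {
    awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
        idle_rate = (v2 - v1) / (t2 - t1)   # idle seconds per second, as irate() computes
        printf "%.1f\n", 100 - idle_rate * 100
    }'
}

# Illustrative values: the idle counter grows by 3 s over one 15 s scrape interval
cpu_usage_pct 1000 0 1003 15   # prints 80.0
```

An idle rate of 0.2 means the core was idle 20% of the time, so usage is 80%, right at the threshold of the HighCPUUsage rule.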
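Closing thoughts
Grafana + Prometheus is a stack you set up once and benefit from for a long time.
Once it is running, you can:
- See the CPU, memory, disk, and network status of every server in real time
- Get alerted before a failure happens and deal with it early
- Look back through historical data to find the root cause of an incident
- Plan capacity and predict when you will need to scale
From there you can go further:
- Add Loki for log aggregation
- Add Jaeger for distributed tracing
- Build out a complete observability platform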