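The company currently has two business lines, so two Kubernetes clusters were built to deploy them separately. Each cluster runs its own containerized Prometheus to monitor its workloads. To make the data easier to analyze and alerts more timely, the Prometheus data from both clusters is aggregated into a single outer Prometheus, and phone-call alerting is added for high-severity incidents. How federation works is not repeated here; example configurations for each component follow: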
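
For orientation, federation here amounts to the outer Prometheus scraping each inner server's `/federate` endpoint, passing one `match[]` query parameter per series selector. The following is a minimal sketch of the request URL that results from the selectors and targets used in this article's prometheus.yml. The function name `federate_url` is made up for illustration, and the `http` scheme is assumed.

```python
from urllib.parse import urlencode

# Selectors and targets copied from the prometheus.yml in this article.
MATCH_SELECTORS = [
    '{job="prometheus"}',
    '{job="etcd"}',
    '{job=~"kubernetes-.*"}',
    '{job=~"kube-.*"}',
]
TARGETS = ["192.168.7.74:9090", "192.168.7.47:9090"]


def federate_url(target, selectors):
    """Build the URL the outer Prometheus would scrape on one inner server.

    Hypothetical helper for illustration; the http scheme is an assumption.
    """
    # Each selector becomes its own URL-encoded match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return "http://{}/federate?{}".format(target, query)


if __name__ == "__main__":
    for target in TARGETS:
        print(federate_url(target, MATCH_SELECTORS))
```

Opening one of the printed URLs in a browser or with curl is a quick way to check that an inner Prometheus actually exposes the selected series before the outer server is configured.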

1. Outer aggregating Prometheus (binary deployment) configuration examples:

  • Prometheus configuration example
# file: prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files: # configure the rule files here as needed
  - "/opt/prometheus/rules/*.rules"
  - "/opt/prometheus/rules/*.yaml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:

  # static_configs:
  # - targets: ['localhost:9090']
  - job_name: 'federate'
    # scrape_interval: 15s

    honor_labels: true
    metrics_path: '/federate'

    params: # note: the job values must match the job names on the inner (scraped) Prometheus servers
      'match[]':
        - '{job="prometheus"}'
        - '{job="etcd"}'
        - '{job=~"kubernetes-.*"}'
        - '{job=~"kube-.*"}'
        # - '{job=~"kubelet.*"}'

    static_configs: # addresses of the inner Prometheus servers to collect from
      - targets:
          - '192.168.7.74:9090'
          - '192.168.7.47:9090'

# file: prometheus.service

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
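# --storage.tsdb.retention sets how long data is kept (newer Prometheus releases name this flag --storage.tsdb.retention.time)
# --web.enable-lifecycle enables configuration hot reload over HTTP
# Comments are kept out of the command itself: text after a trailing backslash breaks the line continuation.
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention=168h \
  --web.enable-lifecycle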

[Install]
WantedBy=multi-user.target
  • Alertmanager configuration example
# file: alertmanager.yml

global:
  smtp_smarthost: "smtp.server.org:25"
  smtp_from: "[email protected]"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "password"
  smtp_require_tls: false
templates:
  - '/opt/alertmanager/templates/*.tmpl'
route:
  receiver: 'email_alert'
  group_by: ['alertname', 'instance', 'service', 'severity'] # labels used to group alerts
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 3m # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 5m # how long to wait before re-sending a notification for alerts that are still firing
  routes:
    - match: # routing rule that selects the notification channel
        severity: critical # label to match
      receiver: 'multi_alert'
receivers: # notification channels
  - name: 'email_alert' # email
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'multi_alert' # multiple channels
    webhook_configs:
      - send_resolved: true # phone-call alert
        url: 'http://192.168.7.105:8765/phoneWarn/callUp/v1/prome'
    email_configs: # email
      - to: '[email protected]'
        send_resolved: true


# file: alertmanager.service

[Unit]
Description=Prometheus Alertmanager Service
Wants=network-online.target
After=network.target

[Service]
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
--config.file /opt/alertmanager/alertmanager.yml \
--storage.path /opt/alertmanager/data
Restart=always

[Install]
WantedBy=multi-user.target
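
To make the effect of the `route` section in alertmanager.yml concrete, here is a minimal sketch that mimics its receiver selection in plain Python. It is not Alertmanager's implementation, and the function name `pick_receiver` is made up for illustration. It only reflects that `match` compares label values for exact equality, so an alert labeled `severity: Warning` (as in the sample payload later in this article) would fall through to the default email receiver rather than trigger a phone call.

```python
def pick_receiver(labels):
    """Return the receiver name that the route tree in alertmanager.yml would select.

    Hypothetical helper: it hard-codes the single child route from the config.
    """
    # Child route: `match` requires the label value to equal "critical" exactly
    # (the comparison is case-sensitive).
    if labels.get("severity") == "critical":
        return "multi_alert"
    # No child route matched, so the top-level receiver applies.
    return "email_alert"


if __name__ == "__main__":
    print(pick_receiver({"alertname": "pod_container_restart", "severity": "critical"}))
    print(pick_receiver({"alertname": "pod_container_restart", "severity": "Warning"}))
```

In practice this means the alerting rules on the Prometheus side need to set `severity` to exactly `critical` for the alerts that should place a call.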

2. Phone alert configuration and examples

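  • How phone alerting works: the alert that Prometheus sends to Alertmanager is forwarded through a webhook, repackaged, and pushed to an existing phone-call API (here, our company's own wrapper around the Alibaba Cloud voice call API).
  • Example of the raw alert payload as delivered to the webhook (it must be repackaged before calling the phone API)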
{
  "status": "firing",
  "groupLabels": {
    "alertname": "pod_container_restart"
  },
  "groupKey": "{}:{alertname=\"pod_container_restart\"}",
  "commonAnnotations": {
    "description": "Pod coredns-6fc7b84544-58ff6 in namespace kube-system has a container resstart for more than 5 times"
  },
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "k8s_app": "kube-state-metrics",
        "container": "coredns",
        "severity": "Warning",
        "kubernetes_namespace": "kube-system",
        "namespace": "kube-system",
        "instance": "192.168.171.25:8080",
        "job": "kube-state-metrics",
        "alertname": "pod_container_restart",
        "pod": "coredns-6fc7b84544-58ff6"
      },
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://k8s-master-71:9090/graph?g0.expr=kube_pod_container_status_restarts_total+%3E+5&g0.tab=1",
      "startsAt": "2019-08-27T19:22:54.730949237+08:00",
      "annotations": {
        "description": "Pod coredns-6fc7b84544-58ff6 in namespace kube-system has a container resstart for more than 5 times"
      }
    }
  ],
  "version": "4",
  "receiver": "phone_alert",
  "externalURL": "http://k8s-master-71:9093",
  "commonLabels": {
    "k8s_app": "kube-state-metrics",
    "container": "coredns",
    "severity": "Warning",
    "kubernetes_namespace": "kube-system",
    "namespace": "kube-system",
    "instance": "192.168.171.25:8080",
    "job": "kube-state-metrics",
    "alertname": "pod_container_restart",
    "pod": "coredns-6fc7b84544-58ff6"
  }
}
  • Example code wrapping the phone alert function:
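# url: http://192.168.7.105:8765/phoneWarn/callUp/v1/prome
# Assumes app, api_prefix, call_method and logger are defined elsewhere in the service.
import json

from flask import jsonify, request


@app.route(api_prefix + 'prome', methods=['POST'])
def prome_alert():
    content_name = "您好"  # greeting spoken at the start of the call ("Hello")
    phonenums = ["131xxxxxxxx", "141xxxxxxxx"]
    alert_info = json.loads(request.data)
    # "容器集群" means "container cluster"; it is spoken before the alert name
    call_content = "容器集群" + alert_info['groupLabels']['alertname']
    alert_status = alert_info['status']
    results = []  # one entry per phone call; stays empty when no call is placed
    if alert_status == "firing":
        for phonenum in phonenums:
            params = {"taskname": call_content, "name": content_name, "phonenum": phonenum}
            # tts_call is our own wrapper around the Alibaba Cloud API; it is assumed to return a JSON string
            results.append(json.loads(call_method.tts_call(phonenum, json.dumps(params))))
    elif alert_status == "resolved":
        # Resolved notifications are only logged; no phone call is placed
        logger.info(json.dumps({"result": "resolved, no phone call placed"}, ensure_ascii=False))
    return jsonify({"results": results})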
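
The repackaging step can also be kept separate from Flask and the phone API, which makes it easy to check against a sample payload. The following is a minimal sketch under that assumption. `build_call_params` is a hypothetical helper, not part of the original service, and it only covers the fields the handler above reads.

```python
import json


def build_call_params(payload, phonenum, greeting="您好"):
    """Return the JSON string for one phone call, or None if no call is due.

    Hypothetical helper for illustration; field names follow the handler above.
    """
    # Only firing notifications place a call; resolved ones are skipped.
    if payload.get("status") != "firing":
        return None
    alertname = payload.get("groupLabels", {}).get("alertname", "unknown")
    params = {
        "taskname": "容器集群" + alertname,  # "container cluster" + alert name
        "name": greeting,                    # greeting spoken first ("Hello")
        "phonenum": phonenum,
    }
    return json.dumps(params, ensure_ascii=False)


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "groupLabels": {"alertname": "pod_container_restart"},
    }
    print(build_call_params(sample, "131xxxxxxxx"))
```

Keeping this logic in a pure function means the webhook handler only has to parse the request, loop over the phone numbers, and pass each result to the phone API wrapper.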

References: