
Kubernetes: k8s Operations - Common Prometheus Monitoring Rules


Monitoring Items

Service priority:

  • 1

Review goals:

  • A monitoring entry point for each application, so that on-call staff can quickly analyze and inspect it
  • Whether the core alerts are in place:
    • System alerts
    • Business alerts

Application Systems

| Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metrics | Business Metrics |
|----------|-------------|-------------------------------|-------------------------------|-----------------|---------------|---------------|------------------|---------------|------------------|
|          |             |                               |                               |                 |               |               |                  |               |                  |

Middleware and Infrastructure

| Service | Application Object | Monitoring Entry Point |
|---------|--------------------|------------------------|
|         |                    |                        |

Pod

OOMEvents

https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/

When a container is killed because of OOMKilled, the container's termination reason is set to OOMKilled and a gauge is emitted: kube_pod_container_status_last_terminated_reason → Gauge. Describes the last reason the container was in the terminated state.

When the OOMKill comes from a child process rather than the main process, this metric is not emitted, so a more reliable approach is to listen for Kubernetes OOMKill events and build metrics on top of them.

Kubernetes 1.24 added a new metric: container_oom_events_total → Counter. Describes the container's OOM events.

# Prometheus: fetch the counter of the container's OOM events.
container_oom_events_total{name="<some-container>"}

# OR, if your cAdvisor is below v0.39.1:
# Prometheus: fetch the gauge of containers terminated by OOMKilled in a specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
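
A minimal alerting-rule sketch built on the counter above; the rule name, 5-minute window, and severity label are illustrative assumptions, not from the source:

groups:
- name: pod-oom
  rules:
  - alert: ContainerOOMKilled
    # The counter increases whenever the kernel OOM-kills a process in the container.
    expr: increase(container_oom_events_total[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed'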

Low-capacity alerts

If an application has 10 pods and 8 of them can handle normal traffic, then 80% can be a suitable threshold. In other cases where the total number of pods is low, the alert can instead be based on how many pods must stay alive.

# Use Prometheus as data source
kube_deployment_status_replicas_available{namespace="$PROJECT"} / kube_deployment_spec_replicas{namespace="$PROJECT"}
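
Wrapped as an alerting rule with the 80% threshold from the example above (the rule name and for duration are assumptions):

groups:
- name: deployment-capacity
  rules:
  - alert: DeploymentLowCapacity
    # Fires when fewer than 80% of a deployment's desired replicas are available for 5 minutes.
    expr: kube_deployment_status_replicas_available{namespace="$PROJECT"} / kube_deployment_spec_replicas{namespace="$PROJECT"} < 0.8
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has less than 80% of its desired replicas available'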

Pod container restart rate too high

# prometheus
increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])

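# Fires when any container in the matched namespaces (excluding the filebeat-sidecar container) restarts within a 2-minute window.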
sum by(pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar",namespace=~"poker"}[2m])) >= 1
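
Expressed as an alerting rule, the hourly restart query above might look like the following sketch; the rule name and the 3-restart threshold are illustrative assumptions:

groups:
- name: pod-restarts
  rules:
  - alert: PodRestartRateTooHigh
    # Three or more container restarts within the last hour.
    expr: increase(kube_pod_container_status_restarts_total{namespace="$PROJECT"}[1h]) >= 3
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.container }} of {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last hour'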

Other

P1

  • Cluster PersistentVolume abnormal

    The alert fires when the number of abnormal PersistentVolumes in the cluster is greater than 0 (a rules sketch for this and the CrashLooping alert follows this list).

    Cluster PersistentVolume is abnormal. PersistentVolume: {{$labels.persistentvolume}}, current status: {{$labels.phase}}
    
  • Cluster Pod CrashLooping

    The alert fires when a Pod in the cluster is CrashLooping, i.e. the Pod restarts 3 or more times within 5 minutes.

    Cluster Pod is CrashLooping. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}
    
  • Cluster Pod status abnormal

    The alert fires when a Pod in the cluster is in an abnormal state (Pending, Unknown, or Failed) more than 0 times within 3 minutes.

    min_over_time(sum by (namespace,pod,phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed",job="_kube-state-metrics"})[3m:1m]) > 0
    
    Cluster Pod status is abnormal. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, Pod status: {{$labels.phase}}
    
  • Cluster DaemonSet scheduling abnormal

    The alert fires when the number of DaemonSet scheduling errors in the cluster is greater than 0.

    Cluster DaemonSet scheduling is abnormal.
    
  • Cluster DaemonSet run-status abnormal

    The alert fires when the DaemonSet run-status error rate in the cluster is greater than 0%.

    Cluster DaemonSet run status is abnormal.
    
  • Cluster Deployment replica start-up abnormal

    The alert fires when a Deployment's available replicas differ from its desired replicas.

    kube_deployment_spec_replicas{job="_kube-state-metrics"} != kube_deployment_status_replicas_available{job="_kube-state-metrics"}
    
    Cluster Deployment replica start-up is abnormal. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
    
  • Cluster Job failed

    The alert fires when the number of failed Jobs in the cluster is greater than 0.

    Cluster Job execution failed. Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}}
    
  • Node Status Abnormal

    The alert fires when a node's status is abnormal.

    Node {{$labels.node}} has been unavailable for more than 10 minutes
    
  • Deployment Pod Availability

    The alert fires when a Deployment has more than 9 instances and the proportion of available instances drops below 70%.

    Namespace: {{$labels.namespace}} / Deployment: {{$labels.deployment}} Pod availability {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, number of currently unavailable Pods: {{ $value }}
    
  • Pod Status Abnormal

    The alert fires when a Pod's status is abnormal.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes
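
A sketch of how the PersistentVolume and CrashLooping alerts above could be written as Prometheus rules; the metric names come from kube-state-metrics, while the rule names and severity labels are assumptions:

groups:
- name: p1-cluster-alerts
  rules:
  - alert: PersistentVolumeAbnormal
    # kube-state-metrics exposes one series per PV phase; Failed or Pending counts as abnormal.
    expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
    labels:
      severity: P1
    annotations:
      summary: 'PersistentVolume {{ $labels.persistentvolume }} is in phase {{ $labels.phase }}'
  - alert: PodCrashLooping
    # Mirrors the rule text above: 3 or more restarts within 5 minutes.
    expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
    labels:
      severity: P1
    annotations:
      summary: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'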
    

P2

  • Node Memory Usage

    The alert fires when node memory usage is greater than 90% (a rule sketch follows this list).

    Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%
    
  • Pod Startup Timeout Failure

    The alert fires when a Pod fails to start within the timeout.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
    
  • Pod Frequent Restart

    The alert fires when a Pod restarts more than 3 times within 5 minutes.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} restarted more than {{ $labels.metrics_params_value }} times within ${metrics_params_time} minutes, current restarts: {{ $value }}
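
Assuming node_exporter metrics are available, the node-memory rule above could be sketched as follows; the 5-minute for duration is an assumption:

groups:
- name: p2-node-alerts
  rules:
  - alert: NodeMemoryUsageHigh
    # Memory usage = 1 - MemAvailable/MemTotal, as a percentage; 90% matches the rule text.
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: P2
    annotations:
      summary: 'Node {{ $labels.instance }} memory usage is {{ printf "%.2f" $value }}%'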
    

P3

  • Node Disk Usage

    The alert fires when node disk usage is greater than 90%.

    Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%
    
  • Node CPU Usage

    The alert fires when node CPU usage is greater than 90%.

    Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
    
  • Job Execution Failed

    The alert fires when a Job's execution fails.

    Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}} execution failed
    
  • Container Memory Usage

    The alert fires when container memory usage is greater than 80% (a sketch of the container-level rules follows this list).

    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
    
  • Container CPU Usage

    The alert fires when container CPU usage is greater than 80%.

    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
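
The container-level rules above can be expressed against cAdvisor and kube-state-metrics data. Below is a sketch that assumes resource limits are set on the containers; the 80% thresholds mirror the rule text, everything else is illustrative:

groups:
- name: p3-container-alerts
  rules:
  - alert: ContainerMemoryUsageHigh
    # Working-set memory as a percentage of the container's memory limit
    # (the container!~"|POD" matcher drops aggregate and pause-container series).
    expr: |
      sum by (namespace, pod, container) (container_memory_working_set_bytes{container!~"|POD"})
        / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        * 100 > 80
    for: 5m
    labels:
      severity: P3
    annotations:
      summary: 'Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} memory usage is {{ printf "%.2f" $value }}%'
  - alert: ContainerCPUUsageHigh
    # CPU usage rate as a percentage of the container's CPU limit.
    expr: |
      sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!~"|POD"}[5m]))
        / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu"})
        * 100 > 80
    for: 5m
    labels:
      severity: P3
    annotations:
      summary: 'Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} CPU usage is {{ printf "%.2f" $value }}%'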