Kubernetes: k8s Operations - Common Prometheus Monitoring Rules
- TAGS: Kubernetes
Monitoring Items
Service priority:
- 1
Review goals:
- A monitoring entry point for each application, so that on-call staff can quickly inspect and analyze it
- Whether the core alerts are in place:
  - System alerts
  - Business alerts
Application Systems
| Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metrics | Business Metrics |
|---|---|---|---|---|---|---|---|---|---|
Middleware and Infrastructure
| Service | Monitored Object | Monitoring Entry Point |
|---|---|---|
Kubernetes
Pod
Kubernetes Container oom killer
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
- alert: KubernetesContainerOomKiller
  expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/
When a container is killed because of an OOMKill, its termination reason is set to OOMKilled and kube-state-metrics exposes a gauge:
kube_pod_container_status_last_terminated_reason → Gauge
Describes the last reason the container was in the terminated state.
This metric is not emitted when the OOMKill comes from a child process rather than the main process, so a more reliable approach is to listen for Kubernetes OOMKill events and build a metric on top of them.
Kubernetes 1.24 added a new metric, container_oom_events_total:
container_oom_events_total → counter
Describes the container’s OOM events.
# prometheus, fetch the counter of the containers OOM events.
container_oom_events_total{name="<some-container>"}
# OR if your cadvisor is below v3.9.1
# prometheus, fetch the gauge of the containers terminated by OOMKilled in the specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
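Where container_oom_events_total is available, an alert can be built directly on that counter instead of the restart/last-terminated-reason combination above. A minimal sketch (the alert name, 10-minute window, and threshold are assumptions, not part of the original rules):
- alert: ContainerOomEvent
  # Fires when cAdvisor has recorded at least one OOM event for a container in the last 10 minutes.
  expr: increase(container_oom_events_total{container!=""}[10m]) > 0
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container OOM event (container {{ $labels.container }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} reported {{ $value }} OOM event(s) in the last 10 minutes."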
Kubernetes pod crash looping
Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Pod CrashLooping
- The alert condition is met when a Pod in the cluster enters CrashLooping, i.e. restarts 3 or more times within 5 minutes.
- A Pod in the cluster is CrashLooping. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}
Alternative expressions:
increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])
sum by (pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar", namespace=~"poker"}[2m])) >= 1
Kubernetes Pod not healthy
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.
- alert: KubernetesPodNotHealthy
  # expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="kube-state-metrics"})[3m:1m]) > 0
  for: 3m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Pod status abnormal
- The alert condition is met when a Pod in the cluster has been in an abnormal (non-running) phase more than 0 times within 3 minutes.
- A Pod in the cluster is in an abnormal state. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, phase: {{$labels.phase}}
Container High Memory usage
Container Memory usage is above 90%
- alert: ContainerHighMemoryUsage
  expr: |-
    (
      sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container, namespace, pod)
      /
      sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_limits{}) by (container, namespace, pod)
      * 100
    ) > 90
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High Memory usage (container {{ $labels.container }})
    description: "Container Memory usage is above 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container Memory Usage
- The alert condition is met when container memory usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
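The rule above depends on the cluster:namespace:pod_memory:active:kube_pod_container_resource_limits recording rule from the kube-prometheus mixin. Where that recording rule is not deployed, a roughly equivalent expression can be built from raw metrics; this is a sketch and the label matchers are assumptions that may need adjusting:
sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container, namespace, pod)
  /
sum(kube_pod_container_resource_limits{resource="memory", job="kube-state-metrics"}) by (container, namespace, pod)
  * 100 > 90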
Container High CPU utilization
Container CPU utilization is above 95%
- alert: ContainerHighCpuUtilization
  expr: |-
    (
      sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{container!=""}) by (container, namespace, pod)
      /
      sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits{container!=""}) by (container, namespace, pod)
      * 100
    ) > 95
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High CPU utilization (container {{ $labels.container }})
    description: "Container CPU utilization is above 95%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container CPU Usage
- The alert condition is met when container CPU usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
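As with memory, both series in the CPU rule come from kube-prometheus mixin recording rules. A sketch built from raw metrics instead (the label matchers and the 5m rate window are assumptions; adjust to your environment):
sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}[5m])) by (container, namespace, pod)
  /
sum(kube_pod_container_resource_limits{resource="cpu", job="kube-state-metrics"}) by (container, namespace, pod)
  * 100 > 95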
Pod Status Abnormal
When a Pod is in an abnormal state, the alert condition is met.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes
Pod Startup Timeout Failure
When a Pod fails to start within the timeout, the alert condition is met.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} has failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
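Neither of these two conditions comes with an expression in this note. For the startup-timeout case, a minimal sketch based on the kube-state-metrics waiting-reason gauge (the alert name, the excluded reason, and the 15m window are assumptions):
- alert: KubernetesPodStartupTimeout
  # A container has been stuck in a waiting state (e.g. ImagePullBackOff, CreateContainerConfigError) for 15 minutes.
  expr: sum by (namespace, pod, reason) (kube_pod_container_status_waiting_reason{job="kube-state-metrics", reason!="ContainerCreating"}) > 0
  for: 15m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Pod startup timeout (pod {{ $labels.pod }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has failed to start for more than 15 minutes, wait reason: {{ $labels.reason }}"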
Workload
Deployment Pod Availability
- alert: LowDeploymentAvailability
  # The expression returns the number of unavailable Pods; the filter conditions are joined on labels.
  expr: |
    # Deployments with an available-replica ratio < 70% and more than 9 desired replicas
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"}
      -
      kube_deployment_status_replicas_available{job="kube-state-metrics"}
    )
    # filter conditions
    and on(namespace, deployment)
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"} > 9
      and
      (
        kube_deployment_status_replicas_available{job="kube-state-metrics"}
        /
        kube_deployment_spec_replicas{job="kube-state-metrics"}
      ) < 0.7
    )
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    description: |-
      Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }}
      Total Replicas: {{ with printf "kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Available Replicas: {{ with printf "kube_deployment_status_replicas_available{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Pod Availability: {{ with printf "100 * kube_deployment_status_replicas_available{namespace='%s',deployment='%s'} / kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment $labels.namespace $labels.deployment | query }}{{ . | first | value | printf "%.2f" }}%{{ end }}
      Unavailable Pods: {{ $value }}
    summary: "Deployment {{ $labels.deployment }} has low availability"
If an application has 10 Pods and 7 of them can carry normal traffic, 70% can be a suitable threshold. In the other case, when the total number of Pods is low, the alert can instead be based on how many Pods should be alive; see the sketch below.
kube_deployment_status_replicas_available{} / kube_deployment_spec_replicas{} < 0.7
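For the low-replica case mentioned above, a minimal sketch that alerts on the absolute number of live Pods rather than a ratio (the alert name, the "fewer than 10 desired replicas" cutoff, and the minimum of 2 available replicas are assumptions):
- alert: LowDeploymentAvailabilitySmall
  # Small deployments (2-9 desired replicas) with fewer than 2 available Pods.
  expr: |
    kube_deployment_status_replicas_available{job="kube-state-metrics"} < 2
    and
    kube_deployment_spec_replicas{job="kube-state-metrics"} < 10
    and
    kube_deployment_spec_replicas{job="kube-state-metrics"} >= 2
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: "Deployment {{ $labels.deployment }} has fewer live Pods than expected"
    description: "Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }} has only {{ $value }} available replica(s)."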
Kubernetes Deployment replicas mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch
- alert: KubernetesDeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Deployment replicas mismatch (deployment {{ $labels.deployment }})
    description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Deployment replica mismatch
The alert condition is met when a Deployment in the cluster fails to bring up the desired number of replicas.
kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
A Deployment in the cluster failed to bring up all replicas. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
Kubernetes Job failed
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete
- alert: KubeJobFailed
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
      Removing failed job after investigation should clear this alert.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed
    summary: Job failed to complete.
  expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"} > 0
  for: 15m
  labels:
    severity: warning
Cluster Job failed
The alert condition is met when the number of failed Jobs in the cluster is greater than 0.
A Job in the cluster failed. Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}}
Kubernetes Job not completed
Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time
- alert: KubeJobNotCompleted
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
      than {{ "43200" | humanizeDuration }} to complete.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted
    summary: Job did not complete in time
  expr: |-
    time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
      and
    kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200
  labels:
    severity: warning
Job Execution Failed
When a Job execution fails, the alert condition is met.
Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}} execution failed
Cluster DaemonSet scheduling abnormal
The alert condition is met when the DaemonSet misscheduling error count in the cluster is greater than 0.
A DaemonSet in the cluster is misscheduled.
Cluster DaemonSet running state abnormal
The alert condition is met when the DaemonSet running-state error rate in the cluster is greater than 0%.
A DaemonSet in the cluster is in an abnormal running state; see the sketches below.
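Neither DaemonSet condition above has an expression in this note. Minimal sketches based on the kube-state-metrics DaemonSet series (the alert names, durations, and severities are assumptions):
- alert: KubernetesDaemonsetMisscheduled
  # At least one DaemonSet Pod is running on a node where it is not supposed to run.
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet misscheduled (daemonset {{ $labels.daemonset }})
    description: "Namespace: {{ $labels.namespace }} / DaemonSet: {{ $labels.daemonset }} has {{ $value }} misscheduled Pod(s)."

- alert: KubernetesDaemonsetRolloutAbnormal
  # Fewer ready DaemonSet Pods than desired for more than 10 minutes.
  expr: |
    kube_daemonset_status_number_ready{job="kube-state-metrics"}
      /
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet running state abnormal (daemonset {{ $labels.daemonset }})
    description: "Namespace: {{ $labels.namespace }} / DaemonSet: {{ $labels.daemonset }} ready ratio is {{ $value | humanizePercentage }}."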
Node
KubePersistentVolumeInodesFillingUp
- alert: PersistentVolume_FillingUp
  annotations:
    description: 'warning | Based on recent sampling, the PersistentVolume claimed
      by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }}
      is expected to fill up within four days. Currently {{ $value | humanizePercentage }}
      is used [Duration Time: 1h].'
    summary: '[Infra] PersistentVolume is filling up'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"}
        /
        kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning
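The heading above mentions inode exhaustion, but the rule only tracks bytes. A companion sketch for inodes using the corresponding kubelet volume-stats series (the alert name and thresholds mirror the bytes rule and are assumptions):
- alert: PersistentVolumeInodes_FillingUp
  annotations:
    description: 'warning | The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }}
      in Namespace {{ $labels.namespace }} is expected to run out of inodes within four days.
      Currently {{ $value | humanizePercentage }} of its inodes are used.'
    summary: '[Infra] PersistentVolume is running out of inodes'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*"}
        /
        kubelet_volume_stats_inodes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning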
Cluster PersistentVolume abnormal
The alert condition is met when the number of abnormal PersistentVolumes in the cluster is greater than 0.
A PersistentVolume in the cluster is abnormal. PersistentVolume: {{$labels.persistentvolume}}, current phase: {{$labels.phase}}
Node Status Abnormal
When a Node's status is abnormal, the alert condition is met.
Node {{$labels.node}} has been unavailable for more than 10 minutes
Node Memory Usage
When Node memory usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%
Node Disk Usage
When Node disk usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%
Node CPU Usage
When Node CPU usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
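None of the node-level conditions above comes with an expression in this note. Minimal PromQL sketches using node-exporter and kube-state-metrics series (the label filters, rate window, and fstype exclusions are assumptions; adjust to your environment):
# Node not Ready (pair with a `for: 10m` clause in the alert rule)
kube_node_status_condition{job="kube-state-metrics", condition="Ready", status="true"} == 0

# Node memory usage > 90%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90

# Node disk usage > 90%
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90

# Node CPU usage > 90%
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90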