Kubernetes: k8s Operations - Common Prometheus Monitoring Rules
- TAGS: Kubernetes
Monitoring Items
Service priority:
- 1
Review goals:
- A monitoring entry point for each application, so that on-call staff can quickly inspect and analyze it
- Whether the core alerts are in place:
  - System alerts
  - Business alerts
Application Systems
| Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metrics | Business Metrics |
|---|---|---|---|---|---|---|---|---|---|
Middleware and Infrastructure
| Service | Monitored Object | Monitoring Entry Point |
|---|---|---|
Kubernetes
Pod
Kubernetes Container oom killer
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
- alert: KubernetesContainerOomKiller
  expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/
When a container is killed because of an OOMKill, its termination reason is set to OOMKilled and kube-state-metrics exposes a gauge:
kube_pod_container_status_last_terminated_reason → Gauge
Describes the last reason the container was in the terminated state.
This metric is not emitted when the OOMKill comes from a child process rather than the main process, so a more reliable approach is to listen for Kubernetes OOMKill events and build a metric on top of them.
Kubernetes 1.24 added a new metric, container_oom_events_total:
container_oom_events_total → counter
Describes the container’s OOM events.
# prometheus, fetch the counter of the containers OOM events.
container_oom_events_total{name="<some-container>"}
# OR if your cadvisor is below v3.9.1
# prometheus, fetch the gauge of the containers terminated by OOMKilled in the specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
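Where container_oom_events_total is available, an alert can be built directly on that counter instead of the restart/last-terminated-reason combination above. A minimal sketch (the alert name, 10-minute window, and threshold are assumptions, not part of the original rules):
- alert: ContainerOomEvent
  # Fires when cAdvisor has recorded at least one OOM event for a container in the last 10 minutes.
  expr: increase(container_oom_events_total{container!=""}[10m]) > 0
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container OOM event (container {{ $labels.container }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} reported {{ $value }} OOM event(s) in the last 10 minutes."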
Kubernetes pod crash looping
Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Pod CrashLooping
- The alert condition is met when a Pod in the cluster enters CrashLooping, i.e. restarts 3 or more times within 5 minutes.
- A Pod in the cluster is CrashLooping. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}
Alternative expressions:
increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])
sum by (pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar", namespace=~"poker"}[2m])) >= 1
Kubernetes Pod not healthy
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.
- alert: KubernetesPodNotHealthy
  # expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="kube-state-metrics"})[3m:1m]) > 0
  for: 3m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Pod status abnormal
- The alert condition is met when a Pod in the cluster has been in an abnormal (non-running) phase more than 0 times within 3 minutes.
- A Pod in the cluster is in an abnormal state. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, phase: {{$labels.phase}}
Container High Memory usage
Container Memory usage is above 90%
- alert: ContainerHighMemoryUsage
  expr: |-
    (
      sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container, namespace, pod)
      /
      sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_limits{}) by (container, namespace, pod)
      * 100
    ) > 90
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High Memory usage (container {{ $labels.container }})
    description: "Container Memory usage is above 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container Memory Usage
- The alert condition is met when container memory usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
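The rule above depends on the cluster:namespace:pod_memory:active:kube_pod_container_resource_limits recording rule from the kube-prometheus mixin. Where that recording rule is not deployed, a roughly equivalent expression can be built from raw metrics; this is a sketch and the label matchers are assumptions that may need adjusting:
sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container, namespace, pod)
  /
sum(kube_pod_container_resource_limits{resource="memory", job="kube-state-metrics"}) by (container, namespace, pod)
  * 100 > 90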
Container High CPU utilization
Container CPU utilization is above 95%
- alert: ContainerHighCpuUtilization
  expr: |-
    (
      sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{container!=""}) by (container, namespace, pod)
      /
      sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits{container!=""}) by (container, namespace, pod)
      * 100
    ) > 95
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High CPU utilization (container {{ $labels.container }})
    description: "Container CPU utilization is above 95%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container CPU Usage
- The alert condition is met when container CPU usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
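As with memory, both series in the CPU rule come from kube-prometheus mixin recording rules. A sketch built from raw metrics instead (the label matchers and the 5m rate window are assumptions; adjust to your environment):
sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}[5m])) by (container, namespace, pod)
  /
sum(kube_pod_container_resource_limits{resource="cpu", job="kube-state-metrics"}) by (container, namespace, pod)
  * 100 > 95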
Pod Status Abnormal
When a Pod is in an abnormal state, the alert condition is met.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes
Pod Startup Timeout Failure
When a Pod fails to start within the timeout, the alert condition is met.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} has failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
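Neither of these two conditions comes with an expression in this note. For the startup-timeout case, a minimal sketch based on the kube-state-metrics waiting-reason gauge (the alert name, the excluded reason, and the 15m window are assumptions):
- alert: KubernetesPodStartupTimeout
  # A container has been stuck in a waiting state (e.g. ImagePullBackOff, CreateContainerConfigError) for 15 minutes.
  expr: sum by (namespace, pod, reason) (kube_pod_container_status_waiting_reason{job="kube-state-metrics", reason!="ContainerCreating"}) > 0
  for: 15m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Pod startup timeout (pod {{ $labels.pod }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has failed to start for more than 15 minutes, wait reason: {{ $labels.reason }}"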
Workload
Deployment Pod Availability
- alert: LowDeploymentAvailability
  # The expression returns the number of unavailable Pods; the filter conditions are joined on labels.
  expr: |
    # Deployments with an available-replica ratio < 70% and more than 9 desired replicas
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"}
      -
      kube_deployment_status_replicas_available{job="kube-state-metrics"}
    )
    # filter conditions
    and on(namespace, deployment)
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"} > 9
      and
      (
        kube_deployment_status_replicas_available{job="kube-state-metrics"}
        /
        kube_deployment_spec_replicas{job="kube-state-metrics"}
      ) < 0.7
    )
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    description: |-
      Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }}
      Total Replicas: {{ with printf "kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Available Replicas: {{ with printf "kube_deployment_status_replicas_available{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Pod Availability: {{ with printf "100 * kube_deployment_status_replicas_available{namespace='%s',deployment='%s'} / kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment $labels.namespace $labels.deployment | query }}{{ . | first | value | printf "%.2f" }}%{{ end }}
      Unavailable Pods: {{ $value }}
    summary: "Deployment {{ $labels.deployment }} has low availability"
If an application has 10 Pods and 7 of them can carry normal traffic, 70% can be a suitable threshold. In the other case, when the total number of Pods is low, the alert can instead be based on how many Pods should be alive; see the sketch below.
kube_deployment_status_replicas_available{} / kube_deployment_spec_replicas{} < 0.7
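For the low-replica case mentioned above, a minimal sketch that alerts on the absolute number of live Pods rather than a ratio (the alert name, the "fewer than 10 desired replicas" cutoff, and the minimum of 2 available replicas are assumptions):
- alert: LowDeploymentAvailabilitySmall
  # Small deployments (2-9 desired replicas) with fewer than 2 available Pods.
  expr: |
    kube_deployment_status_replicas_available{job="kube-state-metrics"} < 2
    and
    kube_deployment_spec_replicas{job="kube-state-metrics"} < 10
    and
    kube_deployment_spec_replicas{job="kube-state-metrics"} >= 2
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: "Deployment {{ $labels.deployment }} has fewer live Pods than expected"
    description: "Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }} has only {{ $value }} available replica(s)."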
Kubernetes Deployment replicas mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch
- alert: KubernetesDeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Deployment replicas mismatch (deployment {{ $labels.deployment }})
    description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Cluster Deployment replica mismatch
The alert condition is met when a Deployment in the cluster fails to bring up the desired number of replicas.
kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
A Deployment in the cluster failed to bring up all replicas. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
Kubernetes Job failed
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete
- alert: KubeJobFailed
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
      Removing failed job after investigation should clear this alert.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed
    summary: Job failed to complete.
  expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"} > 0
  for: 15m
  labels:
    severity: warning
Cluster Job failed
The alert condition is met when the number of failed Jobs in the cluster is greater than 0.
A Job in the cluster failed. Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}}
Kubernetes Job not completed
Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time
- alert: KubeJobNotCompleted
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
      than {{ "43200" | humanizeDuration }} to complete.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted
    summary: Job did not complete in time
  expr: |-
    time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
      and
    kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200
  labels:
    severity: warning
Job Execution Failed
When a Job execution fails, the alert condition is met.
Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}} execution failed
Cluster DaemonSet scheduling abnormal
The alert condition is met when the DaemonSet misscheduling error count in the cluster is greater than 0.
A DaemonSet in the cluster is misscheduled.
Cluster DaemonSet running state abnormal
The alert condition is met when the DaemonSet running-state error rate in the cluster is greater than 0%.
A DaemonSet in the cluster is in an abnormal running state; see the sketches below.
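Neither DaemonSet condition above has an expression in this note. Minimal sketches based on the kube-state-metrics DaemonSet series (the alert names, durations, and severities are assumptions):
- alert: KubernetesDaemonsetMisscheduled
  # At least one DaemonSet Pod is running on a node where it is not supposed to run.
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet misscheduled (daemonset {{ $labels.daemonset }})
    description: "Namespace: {{ $labels.namespace }} / DaemonSet: {{ $labels.daemonset }} has {{ $value }} misscheduled Pod(s)."

- alert: KubernetesDaemonsetRolloutAbnormal
  # Fewer ready DaemonSet Pods than desired for more than 10 minutes.
  expr: |
    kube_daemonset_status_number_ready{job="kube-state-metrics"}
      /
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet running state abnormal (daemonset {{ $labels.daemonset }})
    description: "Namespace: {{ $labels.namespace }} / DaemonSet: {{ $labels.daemonset }} ready ratio is {{ $value | humanizePercentage }}."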
Node
KubePersistentVolumeInodesFillingUp
- alert: PersistentVolume_FillingUp
  annotations:
    description: 'warning | Based on recent sampling, the PersistentVolume claimed
      by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }}
      is expected to fill up within four days. Currently {{ $value | humanizePercentage }}
      is used [Duration Time: 1h].'
    summary: '[Infra] PersistentVolume is filling up'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"}
        /
        kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning
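The heading above mentions inode exhaustion, but the rule only tracks bytes. A companion sketch for inodes using the corresponding kubelet volume-stats series (the alert name and thresholds mirror the bytes rule and are assumptions):
- alert: PersistentVolumeInodes_FillingUp
  annotations:
    description: 'warning | The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }}
      in Namespace {{ $labels.namespace }} is expected to run out of inodes within four days.
      Currently {{ $value | humanizePercentage }} of its inodes are used.'
    summary: '[Infra] PersistentVolume is running out of inodes'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*"}
        /
        kubelet_volume_stats_inodes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning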
Cluster PersistentVolume abnormal
The alert condition is met when the number of abnormal PersistentVolumes in the cluster is greater than 0.
A PersistentVolume in the cluster is abnormal. PersistentVolume: {{$labels.persistentvolume}}, current phase: {{$labels.phase}}
Node Status Abnormal
When a Node's status is abnormal, the alert condition is met.
Node {{$labels.node}} has been unavailable for more than 10 minutes
Node Memory Usage
When Node memory usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%
Node Disk Usage
When Node disk usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%
Node CPU Usage
When Node CPU usage is greater than 90%, the alert condition is met.
Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
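None of the node-level conditions above comes with an expression in this note. Minimal PromQL sketches using node-exporter and kube-state-metrics series (the label filters, rate window, and fstype exclusions are assumptions; adjust to your environment):
# Node not Ready (pair with a `for: 10m` clause in the alert rule)
kube_node_status_condition{job="kube-state-metrics", condition="Ready", status="true"} == 0

# Node memory usage > 90%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90

# Node disk usage > 90%
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90

# Node CPU usage > 90%
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90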