Kubernetes: k8s Operations - Common Prometheus Monitoring Rules

Monitoring Items

Service priority:

  • 1

Review goals:

  • A monitoring entry point for each application, so on-call staff can analyze and review quickly
  • Whether the core alerts are in place:
    • System alerts
    • Business alerts

Application Systems

| Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metrics | Business Metrics |
|          |             |                               |                               |                 |               |               |                  |               |                  |

Middleware and Infrastructure

| Service | Application Object | Monitoring Entry |
|         |                    |                  |

Kubernetes

Pod

Kubernetes Container oom killer

Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.

- alert: KubernetesContainerOomKiller
  expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"     

https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/

When a container is killed because of OOM, its termination reason is set to OOMKilled, and the following gauge is emitted:

kube_pod_container_status_last_terminated_reason → Gauge

Describes the last reason the container was in the terminated state.

This metric is not emitted when the OOM kill hits a child process rather than the main process, so a more reliable approach is to listen for Kubernetes OOMKill events and build a metric from them.

Kubernetes 1.24 added a new metric, container_oom_events_total:

container_oom_events_total → counter

Describes the container’s OOM events.

# prometheus, fetch the counter of the containers OOM events.
container_oom_events_total{name="<some-container>"}

# OR if your cadvisor is below v3.9.1
# prometheus, fetch the gauge of the containers terminated by OOMKilled in the specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
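
For clusters where the newer counter is available, a minimal alert sketch based on it (the alert name, threshold, and labels are illustrative assumptions, not part of the original rule set):

- alert: KubernetesContainerOomEvent
  # sketch: fire when the cAdvisor OOM counter increased in the last 10 minutes
  expr: increase(container_oom_events_total{container!=""}[10m]) > 0
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container OOM event (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} reported OOM events in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"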

Kubernetes pod crash looping

Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping

- alert: KubernetesPodCrashLooping                                                                                                                                                         
  expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Cluster pod CrashLooping

  • When a pod in the cluster is crash looping, i.e. it has restarted 3 or more times within 5 minutes, the alert condition is met.
  • A pod in the cluster is crash looping. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}

Other ways to compute it

increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])

sum by(pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar",namespace=~"poker"}[2m])) >= 1
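
Either expression can be wrapped in a rule directly; a minimal sketch using the changes()-based variant (the alert name, the excluded filebeat-sidecar container, and the zero for: duration are assumptions):

- alert: KubernetesPodRestarting
  # sketch: fire as soon as any container (except the log sidecar) restarts
  expr: sum by (pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar"}[2m])) >= 1
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Pod restarting (pod {{ $labels.pod }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} restarted in the last 2 minutes.\n  VALUE = {{ $value }}"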

Kubernetes Pod not healthy

Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.

- alert: KubernetesPodNotHealthy
  #expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  expr: min_over_time(sum by (namespace,pod,phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed",job="kube-state-metrics"})[3m:1m]) > 0
  for: 3m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Cluster pod status abnormal

  • When a pod in the cluster has been in an abnormal phase (Pending, Unknown, or Failed) more than 0 times within 3 minutes, the alert condition is met.
  • A cluster pod is in an abnormal state. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, phase: {{$labels.phase}}

Container High Memory usage

Container Memory usage is above 90%

- alert: ContainerHighMemoryUsage
  expr: |-
    (
    sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor",  container!="", image!=""}) by (container,namespace,pod) 
    / 
    sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_limits{}) by (container,namespace,pod)
    * 100
    ) > 90
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High Memory usage (container {{ $labels.container }})
    description: "Container Memory usage is above 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Container Memory Usage

  • When container memory usage is greater than 80%, the alert condition is met.
    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
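
If the cluster:namespace:pod_memory:active:kube_pod_container_resource_limits recording rule is not installed, a rough equivalent of the 80% variant can be written against the raw metrics; this is a sketch assuming kube-state-metrics v2.x label names (resource="memory", unit="byte"):

- alert: ContainerMemoryUsageHigh
  # sketch: working-set memory vs. the container memory limit, without recording rules
  expr: |-
    (
    sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container,namespace,pod)
    /
    sum(kube_pod_container_resource_limits{job="kube-state-metrics", resource="memory", unit="byte"}) by (container,namespace,pod)
    * 100
    ) > 80
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container memory usage above 80% (container {{ $labels.container }})
    description: "Container memory usage is above 80% of its limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"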

Container High CPU utilization

Container CPU utilization is above 95%

- alert: ContainerHighCpuUtilization
  expr: |-
    (
    sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{container!=""}) by (container,namespace,pod) 
    /
    sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits{container!=""}) by (container,namespace,pod) 
    * 100
    ) >95
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High CPU utilization (container {{ $labels.container }})
    description: "Container CPU utilization is above 95%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Container CPU Usage

  • When container CPU usage is greater than 80%, the alert condition is met.
    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
  • Pod Status Abnormal

    When a pod's status is abnormal, the alert condition is met.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes

  • Pod Startup Timeout Failure (see the sketch after this list)

    When a pod fails to start within the timeout, the alert condition is met.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} has failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
    
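
A minimal sketch for the startup-timeout case, built on kube_pod_container_status_waiting_reason (the 15-minute window, the excluded ContainerCreating reason, and the alert name are assumptions):

- alert: KubernetesPodStartupTimeout
  # sketch: container stuck in a waiting state (e.g. ImagePullBackOff, CreateContainerConfigError)
  expr: sum by (namespace, pod, container, reason) (kube_pod_container_status_waiting_reason{reason!="ContainerCreating"}) > 0
  for: 15m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Pod failed to start (pod {{ $labels.pod }})
    description: "Namespace: {{ $labels.namespace }} / Pod: {{ $labels.pod }} has failed to start for more than 15 minutes, wait reason: {{ $labels.reason }}"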

Workload

Deployment Pod Availability

- alert: LowDeploymentAvailability
  # The expression returns the number of unavailable Pods, while other values are correlated via labels.
  expr: |
    # Select Deployments whose available-replica ratio is < 70% and whose desired replicas are > 9
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"}
      -
      kube_deployment_status_replicas_available{job="kube-state-metrics"}
    )
    # Additional filter conditions
    and on(namespace, deployment)
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"} > 9
      and
      (
        kube_deployment_status_replicas_available{job="kube-state-metrics"} 
        /
        kube_deployment_spec_replicas{job="kube-state-metrics"}
      ) < 0.7
    )
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    description: |-
      Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }}
      Total Replicas: {{ with printf "kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Available Replicas: {{ with printf "kube_deployment_status_replicas_available{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Pod Availability: {{ with printf "100 * kube_deployment_status_replicas_available{namespace='%s',deployment='%s'} / kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment $labels.namespace $labels.deployment | query }}{{ . | first | value | printf "%.2f" }}%{{ end }}
      Unavailable Pods: {{ $value }}
    summary: "Deployment {{ $labels.deployment }} has low availability"

If an application has 10 pods and 7 of them are enough to carry normal traffic, 70% is a suitable threshold. In the other case, when the total number of pods is low, the alert can instead be based on how many pods must be alive.

kube_deployment_status_replicas_available{} / kube_deployment_spec_replicas{} < 0.70
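
For the low-replica case, a sketch that alerts on an absolute minimum of available pods (the minimum of 2 and the $PROJECT/$APP variables are placeholders):

# sketch: for small deployments, alert when fewer than N pods are available (N=2 is a placeholder)
kube_deployment_status_replicas_available{namespace="$PROJECT", deployment=~".*$APP.*"} < 2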

Kubernetes Deployment replicas mismatch

Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch

- alert: KubernetesDeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Deployment replicas mismatch (deployment {{ $labels.deployment }})
    description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  • Cluster Deployment replicas failed to come up

    When a Deployment's available replicas do not match its desired replicas, the alert condition is met.

    kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}

    Deployment replicas mismatch in the cluster. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
    
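
A common refinement, used by the upstream kubernetes-mixin rules, is to fire only when the Deployment is not in the middle of a rollout; a sketch of that variant:

- alert: KubernetesDeploymentReplicasMismatch
  # sketch: only fire when the Deployment is not actively rolling out (replicas_updated unchanged for 10m)
  expr: |-
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"}
        >
      kube_deployment_status_replicas_available{job="kube-state-metrics"}
    )
    and
    (
      changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[10m])
        ==
      0
    )
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Deployment replicas mismatch (deployment {{ $labels.deployment }})
    description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas.\n  VALUE = {{ $value }}"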

Kubernetes Job failed

Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete

- alert: KubeJobFailed
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
      Removing failed job after investigation should clear this alert.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed
    summary: Job failed to complete.
  expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"}  > 0
  for: 15m
  labels:
    severity: warning
  • Cluster Job failed

    When the number of failed Jobs in the cluster is greater than 0, the alert condition is met.

    A cluster Job failed. Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}}
    

Kubernetes Job not completed

Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time

- alert: KubeJobNotCompleted
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
      than {{ "43200" | humanizeDuration }} to complete.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted
    summary: Job did not complete in time
  expr: |-
    time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
      and
    kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200
  labels:
    severity: warning
  • Job Execution Failed

    When a Job fails to execute, the alert condition is met.

    Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}} execution failed

  • Cluster DaemonSet misscheduled (see the sketch after this list)

    When the number of misscheduled DaemonSet pods in the cluster is greater than 0, the alert condition is met.

    DaemonSet pods in the cluster are scheduled onto nodes where they should not run

  • Cluster DaemonSet rollout/readiness abnormal

    When the rate of DaemonSet pods that are scheduled but not ready is greater than 0%, the alert condition is met.

    DaemonSet pods in the cluster are scheduled but not running/ready
    
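
The two DaemonSet conditions above are usually expressed with kube-state-metrics; a sketch covering both (alert names, durations, and the team label are assumptions):

- alert: KubernetesDaemonsetMisscheduled
  # sketch: pods running on nodes they are not supposed to run on
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet misscheduled (daemonset {{ $labels.daemonset }})
    description: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} misscheduled pods."

- alert: KubernetesDaemonsetRolloutStuck
  # sketch: percentage of ready pods vs. desired pods below 100%
  expr: |-
    kube_daemonset_status_number_ready{job="kube-state-metrics"}
      /
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
  for: 15m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: DaemonSet rollout stuck (daemonset {{ $labels.daemonset }})
    description: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has only {{ $value }}% of desired pods ready."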

Node

KubePersistentVolumeFillingUp

- alert: PersistentVolume_FillingUp
  annotations:
    description: 'warning | Based on recent sampling, the PersistentVolume claimed
      by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }}
      is expected to fill up within four days. Currently {{ $value | humanizePercentage
      }} is used [Duration Time: 1h].'
    summary: '[Infra] PersistentVolume is filling up'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning
  • Cluster PersistentVolume abnormal

    When the number of PersistentVolumes in an abnormal state is greater than 0, the alert condition is met.

    A cluster PersistentVolume is abnormal. PersistentVolume: {{$labels.persistentvolume}}, current phase: {{$labels.phase}}

  • Node Status Abnormal

    When a node's status is abnormal, the alert condition is met.

    Node {{$labels.node}} has been unavailable for more than 10 minutes

  • Node Memory Usage

    When node memory usage is greater than 90%, the alert condition is met.

    Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%

  • Node Disk Usage

    When node disk usage is greater than 90%, the alert condition is met.

    Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%

  • Node CPU Usage (node-level rule sketches follow after this list)

    When node CPU usage is greater than 90%, the alert condition is met.

    Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
    
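
The node-level conditions above can be expressed with kube-state-metrics and node_exporter; a sketch with the 90% thresholds mentioned (alert names, durations, and label filters are assumptions):

- alert: NodeNotReady
  # sketch: node has not reported Ready for 10 minutes
  expr: kube_node_status_condition{job="kube-state-metrics", condition="Ready", status="true"} == 0
  for: 10m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: Node not ready (node {{ $labels.node }})
    description: "Node {{ $labels.node }} has been unavailable for more than 10 minutes."

- alert: NodeMemoryUsageHigh
  # sketch: percentage of memory in use on the node
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Node memory usage high (instance {{ $labels.instance }})
    description: "Node {{ $labels.instance }} memory usage is above 90%.\n  VALUE = {{ $value }}"

- alert: NodeDiskUsageHigh
  # sketch: filesystem usage per device, ignoring tmpfs/overlay mounts
  expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Node disk usage high (instance {{ $labels.instance }})
    description: "Node {{ $labels.instance }} disk {{ $labels.device }} usage is above 90%.\n  VALUE = {{ $value }}"

- alert: NodeCpuUsageHigh
  # sketch: CPU busy percentage averaged over all cores
  expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 90
  for: 10m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Node CPU usage high (instance {{ $labels.instance }})
    description: "Node {{ $labels.instance }} CPU usage is above 90%.\n  VALUE = {{ $value }}"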