
Kubernetes: k8s Operations - Common Prometheus Monitoring Rules


Monitoring Items

Service priority:

  • 1

Review goals:

  • A monitoring entry point for each application, so that on-call staff can quickly analyze and inspect it
  • Whether the core alerts are in place:
    • System alerts
    • Business alerts

Application Systems

| Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metrics | Business Metrics |
|----------|-------------|-------------------------------|-------------------------------|-----------------|---------------|---------------|------------------|---------------|------------------|
|          |             |                               |                               |                 |               |               |                  |               |                  |

Middleware and Infrastructure

| Service | Application Object | Monitoring Entry Point |
|---------|--------------------|------------------------|
|         |                    |                        |

Pod

OOMEvents

https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/

When a container is killed because of OOMKilled, the container's termination reason is set to OOMKilled and a gauge is emitted: kube_pod_container_status_last_terminated_reason → Gauge. Describes the last reason the container was in the terminated state.

When the OOMKill comes from a child process rather than the main process, this metric is not emitted, so a more reliable approach is to listen for Kubernetes OOMKill events and build metrics on top of them.

Kubernetes 1.24 added a new metric: container_oom_events_total → Counter. Describes the container's OOM events.

# Prometheus: fetch the counter of the container's OOM events.
container_oom_events_total{name="<some-container>"}

# OR, if your cAdvisor is below v0.39.1:
# Prometheus: fetch the gauge of containers terminated by OOMKilled in a specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
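
A minimal alerting-rule sketch built on the counter above; the rule name, 5-minute window, and severity label are illustrative assumptions, not from the source:

groups:
- name: pod-oom
  rules:
  - alert: ContainerOOMKilled
    # The counter increases whenever the kernel OOM-kills a process in the container.
    expr: increase(container_oom_events_total[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed'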

Low-capacity alerts

If an application has 10 pods and 8 of them can handle normal traffic, then 80% can be a suitable threshold. In other cases where the total number of pods is low, the alert can instead be based on how many pods must stay alive.

# Use Prometheus as data source
kube_deployment_status_replicas_available{namespace="$PROJECT"} / kube_deployment_spec_replicas{namespace="$PROJECT"}
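
Wrapped as an alerting rule with the 80% threshold from the example above (the rule name and for duration are assumptions):

groups:
- name: deployment-capacity
  rules:
  - alert: DeploymentLowCapacity
    # Fires when fewer than 80% of a deployment's desired replicas are available for 5 minutes.
    expr: kube_deployment_status_replicas_available{namespace="$PROJECT"} / kube_deployment_spec_replicas{namespace="$PROJECT"} < 0.8
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has less than 80% of its desired replicas available'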

Pod container restart rate too high

# prometheus
increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])

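# Fires when any container in the matched namespaces (excluding the filebeat-sidecar container) restarts within a 2-minute window.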
sum by(pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar",namespace=~"poker"}[2m])) >= 1
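
Expressed as an alerting rule, the hourly restart query above might look like the following sketch; the rule name and the 3-restart threshold are illustrative assumptions:

groups:
- name: pod-restarts
  rules:
  - alert: PodRestartRateTooHigh
    # Three or more container restarts within the last hour.
    expr: increase(kube_pod_container_status_restarts_total{namespace="$PROJECT"}[1h]) >= 3
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.container }} of {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last hour'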

Other

P1

  • Cluster PersistentVolume abnormal

    The alert fires when the number of abnormal PersistentVolumes in the cluster is greater than 0 (a rules sketch for this and the CrashLooping alert follows this list).

    Cluster PersistentVolume is abnormal. PersistentVolume: {{$labels.persistentvolume}}, current status: {{$labels.phase}}
    
  • Cluster Pod CrashLooping

    The alert fires when a Pod in the cluster is CrashLooping, i.e. the Pod restarts 3 or more times within 5 minutes.

    Cluster Pod is CrashLooping. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}
    
  • Cluster Pod status abnormal

    The alert fires when a Pod in the cluster is in an abnormal state (Pending, Unknown, or Failed) more than 0 times within 3 minutes.

    min_over_time(sum by (namespace,pod,phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed",job="_kube-state-metrics"})[3m:1m]) > 0
    
    Cluster Pod status is abnormal. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, Pod status: {{$labels.phase}}
    
  • Cluster DaemonSet scheduling abnormal

    The alert fires when the number of DaemonSet scheduling errors in the cluster is greater than 0.

    Cluster DaemonSet scheduling is abnormal.
    
  • Cluster DaemonSet run-status abnormal

    The alert fires when the DaemonSet run-status error rate in the cluster is greater than 0%.

    Cluster DaemonSet run status is abnormal.
    
  • Cluster Deployment replica start-up abnormal

    The alert fires when a Deployment's available replicas differ from its desired replicas.

    kube_deployment_spec_replicas{job="_kube-state-metrics"} != kube_deployment_status_replicas_available{job="_kube-state-metrics"}
    
    Cluster Deployment replica start-up is abnormal. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
    
  • Cluster Job failed

    The alert fires when the number of failed Jobs in the cluster is greater than 0.

    Cluster Job execution failed. Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}}
    
  • Node Status Abnormal

    The alert fires when a node's status is abnormal.

    Node {{$labels.node}} has been unavailable for more than 10 minutes
    
  • Deployment Pod Availability

    The alert fires when a Deployment has more than 9 instances and the proportion of available instances drops below 70%.

    Namespace: {{$labels.namespace}} / Deployment: {{$labels.deployment}} Pod availability {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, number of currently unavailable Pods: {{ $value }}
    
  • Pod Status Abnormal

    The alert fires when a Pod's status is abnormal.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes
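
A sketch of how the PersistentVolume and CrashLooping alerts above could be written as Prometheus rules; the metric names come from kube-state-metrics, while the rule names and severity labels are assumptions:

groups:
- name: p1-cluster-alerts
  rules:
  - alert: PersistentVolumeAbnormal
    # kube-state-metrics exposes one series per PV phase; Failed or Pending counts as abnormal.
    expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
    labels:
      severity: P1
    annotations:
      summary: 'PersistentVolume {{ $labels.persistentvolume }} is in phase {{ $labels.phase }}'
  - alert: PodCrashLooping
    # Mirrors the rule text above: 3 or more restarts within 5 minutes.
    expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
    labels:
      severity: P1
    annotations:
      summary: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'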
    

P2

  • Node Memory Usage

    The alert fires when node memory usage is greater than 90% (a rule sketch follows this list).

    Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%
    
  • Pod Startup Timeout Failure

    The alert fires when a Pod fails to start within the timeout.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
    
  • Pod Frequent Restart

    The alert fires when a Pod restarts more than 3 times within 5 minutes.

    Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}} restarted more than {{ $labels.metrics_params_value }} times within ${metrics_params_time} minutes, current restarts: {{ $value }}
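
Assuming node_exporter metrics are available, the node-memory rule above could be sketched as follows; the 5-minute for duration is an assumption:

groups:
- name: p2-node-alerts
  rules:
  - alert: NodeMemoryUsageHigh
    # Memory usage = 1 - MemAvailable/MemTotal, as a percentage; 90% matches the rule text.
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: P2
    annotations:
      summary: 'Node {{ $labels.instance }} memory usage is {{ printf "%.2f" $value }}%'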
    

P3

  • Node Disk Usage

    The alert fires when node disk usage is greater than 90%.

    Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%
    
  • Node CPU Usage

    The alert fires when node CPU usage is greater than 90%.

    Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
    
  • Job Execution Failed

    The alert fires when a Job's execution fails.

    Namespace: {{$labels.namespace}}/Job: {{$labels.job_name}} execution failed
    
  • Container Memory Usage

    The alert fires when container memory usage is greater than 80% (a sketch of the container-level rules follows this list).

    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
    
  • Container CPU Usage

    The alert fires when container CPU usage is greater than 80%.

    Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
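
The container-level rules above can be expressed against cAdvisor and kube-state-metrics data. Below is a sketch that assumes resource limits are set on the containers; the 80% thresholds mirror the rule text, everything else is illustrative:

groups:
- name: p3-container-alerts
  rules:
  - alert: ContainerMemoryUsageHigh
    # Working-set memory as a percentage of the container's memory limit
    # (the container!~"|POD" matcher drops aggregate and pause-container series).
    expr: |
      sum by (namespace, pod, container) (container_memory_working_set_bytes{container!~"|POD"})
        / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        * 100 > 80
    for: 5m
    labels:
      severity: P3
    annotations:
      summary: 'Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} memory usage is {{ printf "%.2f" $value }}%'
  - alert: ContainerCPUUsageHigh
    # CPU usage rate as a percentage of the container's CPU limit.
    expr: |
      sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!~"|POD"}[5m]))
        / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu"})
        * 100 > 80
    for: 5m
    labels:
      severity: P3
    annotations:
      summary: 'Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} CPU usage is {{ printf "%.2f" $value }}%'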