Kubernetes: k8s 运维篇-Prometheus 常用监控规则
- TAGS: Kubernetes
Monitoring Items
Service priority:
- 1
Review goals:
- A monitoring entry point for each application, so on-call staff can quickly analyze and inspect it
- Whether the core alerts are in place:
  - System alerts
  - Business alerts
Application Systems
Category | Application | Incoming API TPS/RT/ErrorRate | Outgoing API TPS/RT/ErrorRate | Pod CPU/MEM/JVM | MySQL Metrics | Redis Metrics | RocketMQ Metrics | Kafka Metric | Business Metric |
---|---|---|---|---|---|---|---|---|---|
Middleware and Infrastructure
Service | Monitored Object | Monitoring Entry |
---|---|---|
Kubernetes
Pod
Kubernetes Container oom killer
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
```yaml
- alert: KubernetesContainerOomKiller
  expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
  for: 0m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
https://songrgg.github.io/operation/how-to-alert-for-Pod-Restart-OOMKilled-in-Kubernetes/
When a container is killed because of OOMKilled, the container's termination reason is set to OOMKilled and the following gauge is emitted:
kube_pod_container_status_last_terminated_reason → Gauge
Describes the last reason the container was in the terminated state.
This metric is not emitted when the OOMKill comes from a child process rather than the main process, so a more reliable approach is to listen for Kubernetes OOMKill events and build metrics on top of them.
Kubernetes 1.24 added a new metric, container_oom_events_total:
container_oom_events_total → counter
Describes the container’s OOM events.
```promql
# prometheus, fetch the counter of the containers OOM events.
container_oom_events_total{name="<some-container>"}

# OR if your cadvisor is below v3.9.1
# prometheus, fetch the gauge of the containers terminated by OOMKilled in the specific namespace.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="$PROJECT"}
```
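Building on the counter above, a minimal alert sketch (assuming your cAdvisor version exposes container_oom_events_total; the alert name, threshold, and label filter are illustrative):

```yaml
# Hedged sketch: fire when a container records any OOM event within 10 minutes.
# Assumes container_oom_events_total is scraped from the kubelet/cAdvisor endpoint.
- alert: ContainerOomEvent
  expr: increase(container_oom_events_total{container!=""}[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Container OOM event ({{ $labels.namespace }}/{{ $labels.pod }})
    description: "Container {{ $labels.container }} reported {{ $value }} OOM event(s) in the last 10 minutes."
```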
Kubernetes pod crash looping
Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
```yaml
- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[5m]) >= 3
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Cluster Pod CrashLooping anomaly
- The alert condition is met when a cluster Pod is CrashLooping, i.e. the Pod has restarted 3 or more times within 5 minutes.
- Cluster Pod CrashLooping anomaly. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}
Alternative expressions
```promql
increase(kube_pod_container_status_restarts_total{namespace="$PROJECT", pod=~".*$APP.*"}[1h])

sum by(pod, namespace, container) (changes(kube_pod_container_status_restarts_total{container!="filebeat-sidecar",namespace=~"poker"}[2m])) >= 1
```
Kubernetes Pod not healthy
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.
```yaml
- alert: KubernetesPodNotHealthy
  # expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  expr: min_over_time(sum by (namespace,pod,phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed",job="kube-state-metrics"})[3m:1m]) > 0
  for: 3m
  labels:
    severity: critical
    team: ops
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 3 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Cluster Pod status abnormal
- The alert condition is met when a cluster Pod has been in an abnormal state (Pending, Unknown, or Failed) more than 0 times within 3 minutes.
- Cluster Pod status is abnormal. Namespace: {{$labels.namespace}}, Pod: {{$labels.pod}}, Pod phase: {{$labels.phase}}
Container High Memory usage
Container Memory usage is above 90%
```yaml
- alert: ContainerHighMemoryUsage
  expr: |-
    (
      sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", container!="", image!=""}) by (container,namespace,pod)
      /
      sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_limits{}) by (container,namespace,pod)
      * 100
    ) > 90
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High Memory usage (container {{ $labels.container }})
    description: "Container Memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Container Memory Usage
- The alert condition is met when container memory usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
Container High CPU utilization
Container CPU utilization is above 95%
```yaml
- alert: ContainerHighCpuUtilization
  expr: |-
    (
      sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{container!=""}) by (container,namespace,pod)
      /
      sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits{container!=""}) by (container,namespace,pod)
      * 100
    ) > 95
  for: 2m
  labels:
    severity: warning
    team: ops
  annotations:
    summary: Container High CPU utilization (container {{ $labels.container }})
    description: "Container CPU utilization is above 95%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Container CPU Usage
- The alert condition is met when container CPU usage is greater than 80%.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%
Pod Status Abnormal
The alert condition is met when the Pod status is abnormal.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} has stayed in the {{$labels.phase}} state for more than 10 minutes
Pod Startup Timeout Failure
The alert condition is met when a Pod fails to start within the timeout.
Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} failed to start for more than 15 minutes, wait reason: {{$labels.reason}}
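The platform rule above does not come with a PromQL expression; a minimal sketch, assuming kube-state-metrics exposes kube_pod_container_status_waiting_reason (the alert name and the reason filter are illustrative):

```yaml
# Hedged sketch: fire when a container has been stuck in a waiting state
# (e.g. ImagePullBackOff, CrashLoopBackOff, CreateContainerConfigError) for 15 minutes.
- alert: PodStartupTimeout
  expr: sum by (namespace, pod, reason) (kube_pod_container_status_waiting_reason{reason!="ContainerCreating"}) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Pod startup timeout ({{ $labels.namespace }}/{{ $labels.pod }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been waiting ({{ $labels.reason }}) for more than 15 minutes."
```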
Workload
Deployment Pod Availability
```yaml
- alert: LowDeploymentAvailability
  # The expression returns the number of unavailable Pods; other values are pulled in via labels.
  expr: |
    # Deployments with available replica ratio < 70% and total replicas > 9
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"}
      -
      kube_deployment_status_replicas_available{job="kube-state-metrics"}
    )
    # Filter conditions
    and on(namespace, deployment)
    (
      kube_deployment_spec_replicas{job="kube-state-metrics"} > 9
      and
      (
        kube_deployment_status_replicas_available{job="kube-state-metrics"}
        /
        kube_deployment_spec_replicas{job="kube-state-metrics"}
      ) < 0.7
    )
  for: 5m
  labels:
    severity: warning
    team: ops
  annotations:
    description: |-
      Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }}
      Total Replicas: {{ with printf "kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Available Replicas: {{ with printf "kube_deployment_status_replicas_available{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment | query }}{{ . | first | value }}{{ end }}
      Pod Availability: {{ with printf "100 * kube_deployment_status_replicas_available{namespace='%s',deployment='%s'} / kube_deployment_spec_replicas{namespace='%s',deployment='%s'}" $labels.namespace $labels.deployment $labels.namespace $labels.deployment | query }}{{ . | first | value | printf "%.2f" }}%{{ end }}
      Unavailable Pods: {{ $value }}
    summary: "Deployment {{ $labels.deployment }} has low availability"
```
If an application has 10 Pods and 7 of them can handle the normal traffic, 70% can be a suitable threshold. In the other case, when the total number of Pods is low, the alert can instead be based on how many Pods must stay alive (a sketch of that variant follows the formula below).

```promql
kube_deployment_status_replicas_available{} / kube_deployment_spec_replicas{} < 0.7
```
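For the low-replica case, a minimal sketch based on the absolute number of available Pods (the alert name and the threshold of 2 are illustrative assumptions):

```yaml
# Hedged sketch: for small Deployments, alert on the absolute number of available Pods
# instead of a percentage.
- alert: DeploymentTooFewAvailablePods
  expr: |-
    kube_deployment_status_replicas_available{job="kube-state-metrics"} < 2
    and
    kube_deployment_spec_replicas{job="kube-state-metrics"} >= 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Deployment {{ $labels.deployment }} has too few available Pods
    description: "Namespace: {{ $labels.namespace }} / Deployment: {{ $labels.deployment }} has only {{ $value }} available Pod(s)."
```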
Kubernetes Deployment replicas mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch
```yaml
- alert: KubernetesDeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Deployment replicas mismatch (deployment {{ $labels.deployment }})
    description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Cluster Deployment replica start-up anomaly
The alert condition is met when a cluster Deployment (stateless workload) fails to bring up its replicas.

```promql
kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
```

Cluster Deployment replica start-up anomaly. Namespace: {{$labels.namespace}}, Deployment: {{$labels.deployment}}
Kubernetes Job failed
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete
```yaml
- alert: KubeJobFailed
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed
    summary: Job failed to complete.
  expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"} > 0
  for: 15m
  labels:
    severity: warning
```
Cluster Job run failed
The alert condition is met when the number of failed cluster Job runs is greater than 0.
Cluster Job execution failed. Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}}
Kubernetes Job not completed
Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time
```yaml
- alert: KubeJobNotCompleted
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted
    summary: Job did not complete in time
  expr: |-
    time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
      and
    kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200
  labels:
    severity: warning
```
Job Execution Failed
The alert condition is met when a Job execution fails.
Namespace: {{$labels.namespace}} / Job: {{$labels.job_name}} execution failed
Cluster DaemonSet scheduling anomaly
The alert condition is met when the number of DaemonSet scheduling errors is greater than 0.
Cluster DaemonSet scheduling anomaly
Cluster DaemonSet rollout status anomaly
The alert condition is met when the DaemonSet rollout error rate is greater than 0%.
Cluster DaemonSet rollout status anomaly (see the PromQL sketches below)
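Neither of these platform rules ships with a PromQL expression here; minimal sketches using kube-state-metrics DaemonSet metrics (alert names, durations, and thresholds are illustrative assumptions):

```yaml
# Hedged sketch: DaemonSet Pods scheduled onto nodes where they should not run.
- alert: DaemonsetMisscheduled
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} is misscheduled
    description: "{{ $value }} Pod(s) of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to."

# Hedged sketch: DaemonSet Pods that are scheduled but not ready.
- alert: DaemonsetRolloutStuck
  expr: |-
    kube_daemonset_status_number_ready{job="kube-state-metrics"}
    /
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} rollout is stuck
    description: "Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are ready."
```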
Node
KubePersistentVolumeFillingUp
```yaml
- alert: PersistentVolume_FillingUp
  annotations:
    description: 'warning | Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is used [Duration Time: 1h].'
    summary: '[Infra] PersistentVolume is filling up'
  expr: |-
    sum by (persistentvolumeclaim, instance, namespace) (
      (
        kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"}
        /
        kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*"}
      ) > 0.85
      and
      kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*"} > 0
      and
      predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*"}[6h], 4 * 24 * 3600) < 0
    )
  for: 1h
  labels:
    severity: warning
```
Cluster PersistentVolume anomaly
The alert condition is met when the number of abnormal PersistentVolumes is greater than 0.
Cluster PersistentVolume anomaly. PersistentVolume: {{$labels.persistentvolume}}, current phase: {{$labels.phase}}
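No expression is listed for this rule; a minimal sketch, assuming kube-state-metrics exposes kube_persistentvolume_status_phase (the alert name and duration are illustrative):

```yaml
# Hedged sketch: alert on PersistentVolumes stuck in the Failed or Pending phase.
- alert: KubePersistentVolumeErrors
  expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: PersistentVolume {{ $labels.persistentvolume }} is in an abnormal state
    description: "PersistentVolume {{ $labels.persistentvolume }} is in phase {{ $labels.phase }}."
```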
Node Status Abnormal
The alert condition is met when the Node status is abnormal.
Node {{$labels.node}} has been unavailable for more than 10 minutes
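A minimal sketch of this node-availability check, assuming kube-state-metrics exposes kube_node_status_condition (the alert name is illustrative):

```yaml
# Hedged sketch: node has not been Ready for 10 minutes.
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Node {{ $labels.node }} is not ready
    description: "Node {{ $labels.node }} has been NotReady for more than 10 minutes."
```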
Node Memory Usage
The alert condition is met when Node memory usage is greater than 90%.
Node {{ $labels.instance }} memory usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current memory usage {{ $value }}%
Node Disk Usage
The alert condition is met when Node disk usage is greater than 90%.
Node {{ $labels.instance }} disk {{ $labels.device }} usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current disk usage {{ $value }}%
Node CPU Usage
The alert condition is met when Node CPU usage is greater than 90%.
Node {{ $labels.instance }} CPU usage {{$labels.metrics_params_opt}} {{$labels.metrics_params_value}}%, current CPU usage {{ $value }}%
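The three node-usage rules above reference template labels rather than raw PromQL; equivalent expressions sketched with standard node_exporter metrics (the thresholds follow the rules above, the filesystem filter and rate window are assumptions):

```promql
# Memory usage above 90%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90

# Disk usage above 90% (real filesystems only)
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90

# CPU usage above 90% (averaged per instance over 5 minutes)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
```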