Prometheus monitoring alert rules include ARMS alert rules, K8s alert rules, MongoDB alert rules, MySQL alert rules, Nginx alert rules, and Redis alert rules.

ARMS alert rules

 
| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| PodCpu75 | `100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_cpu_cores, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75` | 7 | Pod CPU usage is greater than 75%. |
| PodMemory75 | `100 * (sum(container_memory_working_set_bytes) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_memory_bytes, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75` | 5 | Pod memory usage is greater than 75%. |
| pod_status_no_running | `sum (kube_pod_status_phase{phase!="Running"}) by (pod,phase)` | 5 | The pod is not in the Running state. |
| PodMem4GbRestart | `(sum (container_memory_working_set_bytes{id!="/"})by (pod_name,container_name) /1024/1024/1024)>4` | 5 | Pod memory usage is greater than 4 GB. |
| PodRestart | `sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) >0` | 5 | The pod restarted. |
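
The PodCpu75, PodMemory75, and PodMem4GbRestart conditions all reduce to simple arithmetic on a usage value. The following is a minimal Python sketch of that arithmetic, not ARMS code; the function names are hypothetical and chosen only for illustration.

```python
GIB = 1024 ** 3  # the expression divides by 1024 three times


def usage_percent(used: float, limit: float) -> float:
    """Usage as a percentage of the configured limit (100 * used / limit)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


def exceeds_75_percent(used: float, limit: float) -> bool:
    """Mirrors the '> 75' comparison in PodCpu75 and PodMemory75."""
    return usage_percent(used, limit) > 75


def exceeds_4_gib(working_set_bytes: float) -> bool:
    """Mirrors the '/1024/1024/1024 > 4' comparison in PodMem4GbRestart."""
    return working_set_bytes / GIB > 4
```

Both comparisons are strict, so a pod sitting at exactly 75% of its limit, or at exactly 4 GiB, does not satisfy the condition.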

K8s alert rules

 
| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| KubeStateMetricsListErrors | `(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01` | 15 | Errors occur in metric list operations. |
| KubeStateMetricsWatchErrors | `(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01` | 15 | Errors occur in metric watch operations. |
| NodeFilesystemAlmostOutOfSpace | `( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )` | 60 | The node filesystem is almost out of space. |
| NodeFilesystemSpaceFillingUp | `( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )` | 60 | The node filesystem space is filling up. |
| NodeFilesystemFilesFillingUp | `( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )` | 60 | The node filesystem files (inodes) are filling up. |
| NodeFilesystemAlmostOutOfFiles | `( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )` | 60 | The node filesystem is almost out of files (inodes). |
| NodeNetworkReceiveErrs | `increase(node_network_receive_errs_total[2m]) > 10` | 60 | Node network receive errors occur. |
| NodeNetworkTransmitErrs | `increase(node_network_transmit_errs_total[2m]) > 10` | 60 | Node network transmit errors occur. |
| NodeHighNumberConntrackEntriesUsed | `(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75` |  | A large number of conntrack entries are in use. |
| NodeClockSkewDetected | `( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 ) or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0 )` | 10 | Clock skew is detected. |
| NodeClockNotSynchronising | `min_over_time(node_timex_sync_status[5m]) == 0` | 10 | The clock is not synchronizing. |
| KubePodCrashLooping | `rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0` | 15 | The pod is crash looping. |
| KubePodNotReady | `sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending\|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0` | 15 | The pod is not ready. |
| KubeDeploymentGenerationMismatch | `kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"}` | 15 | The Deployment generation does not match. |
| KubeDeploymentReplicasMismatch | `( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"} ) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 )` | 15 | The Deployment replicas do not match. |
| KubeStatefulSetReplicasMismatch | `( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"} ) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 )` | 15 | The StatefulSet replicas do not match. |
| KubeStatefulSetGenerationMismatch | `kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"}` | 15 | The StatefulSet generation does not match. |
| KubeStatefulSetUpdateNotRolledOut | `max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"} )` | 15 | The StatefulSet update has not been rolled out. |
| KubeDaemonSetRolloutStuck | `kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00` | 15 | The DaemonSet rollout is stuck. |
| KubeContainerWaiting | `sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0` | 60 | The container is waiting. |
| KubeDaemonSetNotScheduled | `kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0` | 10 | DaemonSet pods are not scheduled. |
| KubeDaemonSetMisScheduled | `kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0` | 15 | DaemonSet pods are misscheduled. |
| KubeCronJobRunning | `time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600` | 60 | A CronJob takes more than 1 hour to complete. |
| KubeJobCompletion | `kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0` | 60 | The Job has not completed. |
| KubeJobFailed | `kube_job_failed{job="kube-state-metrics"} > 0` | 15 | The Job failed. |
| KubeHpaReplicasMismatch | `(kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0` | 15 | The HPA replicas do not match. |
| KubeHpaMaxedOut | `kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"}` | 15 | The HPA is running at its maximum number of replicas. |
| KubeCPUOvercommit | `sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)` | 5 | CPU is overcommitted. |
| KubeMemoryOvercommit | `sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{}) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes)` | 5 | Memory is overcommitted. |
| KubeCPUQuotaOvercommit | `sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5` | 5 | The CPU quota is overcommitted. |
| KubeMemoryQuotaOvercommit | `sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5` | 5 | The memory quota is overcommitted. |
| KubeQuotaExceeded | `kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90` | 15 | Resource quota usage exceeds 90% of the hard limit. |
| CPUThrottlingHigh | `sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 )` | 15 | CPU throttling is high. |
| KubePersistentVolumeFillingUp | `kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"} < 0.03` | 1 | The persistent volume is running out of capacity. |
| KubePersistentVolumeErrors | `kube_persistentvolume_status_phase{phase=~"Failed\|Pending",job="kube-state-metrics"} > 0` | 5 | The persistent volume is in an error state. |
| KubeVersionMismatch | `count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns\|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1` | 15 | Component versions do not match. |
| KubeClientErrors | `(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m])) by (instance, job)) > 0.01` | 15 | Client errors occur. |
| KubeAPIErrorBudgetBurn | `sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)` | 2 | Too many API errors (error budget burn). |
| KubeAPILatencyHigh | `( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left() ( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"} > 1` | 5 | API latency is high. |
| KubeAPIErrorsHigh | `sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05` | 10 | Too many API errors. |
| KubeClientCertificateExpiration | `apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800` |  | A client certificate is about to expire. |
| AggregatedAPIErrors | `sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2` |  | Aggregated API errors occur. |
| AggregatedAPIDown | `sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0` | 5 | The aggregated API is down. |
| KubeAPIDown | `absent(up{job="apiserver"} == 1)` | 15 | The API server is down. |
| KubeNodeNotReady | `kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0` | 15 | The node is not ready. |
| KubeNodeUnreachable | `kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1` | 2 | The node is unreachable. |
| KubeletTooManyPods | `max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node) > 0.95` | 15 | Too many pods are running. |
| KubeNodeReadinessFlapping | `sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2` | 15 | The node readiness status changes too often. |
| KubeletPlegDurationHigh | `node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10` | 5 | The PLEG duration is too long. |
| KubeletPodStartUpLatencyHigh | `histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60` | 15 | Pod startup latency is high. |
| KubeletDown | `absent(up{job="kubelet", metrics_path="/metrics"} == 1)` | 15 | The kubelet is down. |
| KubeSchedulerDown | `absent(up{job="kube-scheduler"} == 1)` | 15 | The Kube Scheduler is down. |
| KubeControllerManagerDown | `absent(up{job="kube-controller-manager"} == 1)` | 15 | The Controller Manager is down. |
| TargetDown | `100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10` | 10 | Targets are down. |
| NodeNetworkInterfaceFlapping | `changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2` | 2 | The network interface status changes too often. |
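
NodeFilesystemSpaceFillingUp and NodeFilesystemFilesFillingUp rely on `predict_linear(...[6h], 24*60*60) < 0`, which fits a straight line to the last 6 hours of samples and extrapolates it 24 hours ahead. The following Python sketch shows the idea with an ordinary least-squares fit. It is an illustration, not Prometheus's implementation: the helper names are hypothetical, the timestamp of the last sample is assumed to be the evaluation time, and the read-only filesystem check from the rule is left out.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Least-squares line through the samples, evaluated horizon_s seconds
    after the last sample (assumed here to be the evaluation time)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    now = samples[-1][0]
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must have distinct timestamps")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # fitted value at the evaluation time
    return intercept + slope * horizon_s


def space_filling_up(avail_percent: float, samples: Sequence[Sample]) -> bool:
    """Less than 40% available and predicted to hit zero within 24 hours."""
    return avail_percent < 40 and predict_linear(samples, 24 * 60 * 60) < 0
```

For example, a filesystem that loses one unit of free space per hour and has 4 units left is predicted to reach -20 units in 24 hours, so the prediction part of the condition holds.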

MongoDB alert rules

 
| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| MongodbReplicationLag | `avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10` | 5 | Replication lag is too long. |
| MongodbReplicationHeadroom | `(avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0` | 5 | Replication headroom is insufficient. |
| MongodbReplicationStatus3 | `mongodb_replset_member_state == 3` | 5 | The replication state is 3 (RECOVERING). |
| MongodbReplicationStatus6 | `mongodb_replset_member_state == 6` | 5 | The replication state is 6 (UNKNOWN). |
| MongodbReplicationStatus8 | `mongodb_replset_member_state == 8` | 5 | The replication state is 8 (DOWN). |
| MongodbReplicationStatus10 | `mongodb_replset_member_state == 10` | 5 | The replication state is 10 (REMOVED). |
| MongodbNumberCursorsOpen | `mongodb_metrics_cursor_open{state="total_open"} > 10000` | 5 | Too many cursors are open. |
| MongodbCursorsTimeouts | `sum(increase(mongodb_metrics_cursor_timed_out_total[10m])) > 100` | 5 | Too many cursors are timing out. |
| MongodbTooManyConnections | `mongodb_connections{state="current"} > 500` | 5 | Too many connections. |
| MongodbVirtualMemoryUsage | `(sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3` | 5 | Virtual memory usage is too high. |
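
MongodbReplicationLag and MongodbReplicationHeadroom are related: the lag is the primary's optime minus the average optime of the secondaries, and the headroom is the oplog window minus that lag. The following Python sketch works through the arithmetic with made-up timestamps; the function names are hypothetical.

```python
from typing import Sequence


def replication_lag(primary_optime: float, secondary_optimes: Sequence[float]) -> float:
    """Primary optime minus the average secondary optime, in seconds."""
    if not secondary_optimes:
        raise ValueError("need at least one secondary")
    return primary_optime - sum(secondary_optimes) / len(secondary_optimes)


def replication_headroom(oplog_window_s: float, lag_s: float) -> float:
    """Oplog window (tail timestamp minus head timestamp) minus the current lag."""
    return oplog_window_s - lag_s


def lag_too_long(lag_s: float) -> bool:
    return lag_s > 10       # threshold used by MongodbReplicationLag


def headroom_exhausted(headroom_s: float) -> bool:
    return headroom_s <= 0  # threshold used by MongodbReplicationHeadroom
```

Once the headroom reaches zero, the secondaries lag by at least the full oplog window, so entries they still need may already have been overwritten.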

MySQL alert rules

 
| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| MySQL is down | `mysql_up == 0` | 1 | MySQL is down. |
| open files high | `mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75` | 1 | The number of open files is high. |
| Read buffer size is bigger than max. allowed packet size | `mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet` | 1 | The read buffer size exceeds the maximum allowed packet size. |
| Sort buffer possibly missconfigured | `mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024` | 1 | The sort buffer may be misconfigured. |
| Thread stack size is too small | `mysql_global_variables_thread_stack <196608` | 1 | The thread stack is too small. |
| Used more than 80% of max connections limited | `mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8` | 1 | More than 80% of the connection limit is used. |
| InnoDB Force Recovery is enabled | `mysql_global_variables_innodb_force_recovery != 0` | 1 | Force recovery is enabled. |
| InnoDB Log File size is too small | `mysql_global_variables_innodb_log_file_size < 16777216` | 1 | The log file is too small. |
| InnoDB Flush Log at Transaction Commit | `mysql_global_variables_innodb_flush_log_at_trx_commit != 1` | 1 | `innodb_flush_log_at_trx_commit` is not set to 1. |
| Table definition cache too small | `mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache` | 1 | The table definition cache is too small. |
| Table open cache too small | `mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100` | 1 | The table open cache is too small. |
| Thread stack size is possibly too small | `mysql_global_variables_thread_stack < 262144` | 1 | The thread stack may be too small. |
| InnoDB Buffer Pool Instances is too small | `mysql_global_variables_innodb_buffer_pool_instances == 1` | 1 | There are too few buffer pool instances. |
| InnoDB Plugin is enabled | `mysql_global_variables_ignore_builtin_innodb == 1` | 1 | The InnoDB plugin is enabled. |
| Binary Log is disabled | `mysql_global_variables_log_bin != 1` | 1 | The binary log is disabled. |
| Binlog Cache size too small | `mysql_global_variables_binlog_cache_size < 1048576` | 1 | The binlog cache is too small. |
| Binlog Statement Cache size too small | `mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0` | 1 | The binlog statement cache is too small. |
| Binlog Transaction Cache size too small | `mysql_global_variables_binlog_cache_size <1048576` | 1 | The binlog transaction cache is too small. |
| Sync Binlog is enabled | `mysql_global_variables_sync_binlog == 1` | 1 | Sync binlog is enabled. |
| IO thread stopped | `mysql_slave_status_slave_io_running != 1` | 1 | The IO thread stopped. |
| SQL thread stopped | `mysql_slave_status_slave_sql_running == 0` | 1 | The SQL thread stopped. |
| Mysql_Too_Many_Connections | `rate(mysql_global_status_threads_connected[5m])>200` | 5 | Too many connections. |
| Mysql_Too_Many_slow_queries | `rate(mysql_global_status_slow_queries[5m])>3` | 5 | Too many slow queries. |
| Slave lagging behind Master | `rate(mysql_slave_status_seconds_behind_master[1m]) >30` | 1 | The slave is lagging behind the master. |
| Slave is NOT read only (Please ignore this warning indicator.) | `mysql_global_variables_read_only != 0` | 1 | The slave is not read-only. |
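
Several MySQL rules compare a variable against a raw byte count such as `196608` or `16777216`, which is hard to read at a glance. The following Python sketch converts those thresholds into binary units. The `format_bytes` helper and the `THRESHOLDS` mapping are hypothetical and exist only for this illustration; the numbers are copied from the expressions above.

```python
def format_bytes(n: int) -> str:
    """Render a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024
    raise AssertionError("unreachable")


# Byte thresholds taken from the MySQL rule expressions.
THRESHOLDS = {
    "thread_stack (too small)": 196608,
    "thread_stack (possibly too small)": 262144,
    "innodb_log_file_size": 16777216,
    "binlog_cache_size": 1048576,
    "innodb_sort_buffer_size": 256 * 1024,
    "read_buffer_size": 4 * 1024 * 1024,
}

if __name__ == "__main__":
    for name, threshold in THRESHOLDS.items():
        print(f"{name}: {format_bytes(threshold)}")
```

Read this way, the two thread stack rules use 192 KiB and 256 KiB, the InnoDB log file rule uses 16 MiB, and the binlog cache rules use 1 MiB.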

Nginx alert rules

 
| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| NginxHighHttp4xxErrorRate | `sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5` | 5 | The HTTP 4xx error rate is too high. |
| NginxHighHttp5xxErrorRate | `sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5` | 5 | The HTTP 5xx error rate is too high. |
| NginxLatencyHigh | `histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10` | 5 | Latency is too high. |
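
NginxLatencyHigh compares a 99th-percentile latency estimate from `histogram_quantile` against 10 seconds. The following Python sketch shows how such an estimate can be derived from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not Prometheus's implementation: the bucket layout in the usage note is made up, and each bucket is passed as an `(upper bound, cumulative count)` pair, where the upper bound corresponds to the `le` label.

```python
import math
from typing import Sequence, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: report the highest finite bound.
                return ordered[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    raise AssertionError("unreachable: the +Inf bucket always matches")
```

With buckets `[(1, 10), (5, 60), (10, 90), (inf, 100)]`, the median is interpolated inside the 1 to 5 second bucket, while the 99th percentile falls into the `+Inf` bucket and is reported as the highest finite bound. The estimate can therefore never exceed the largest finite bucket boundary, which is worth keeping in mind when choosing a threshold.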

Redis alert rules

 
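| Alert name | Expression | Data collection time (minutes) | Trigger condition |
| --- | --- | --- | --- |
| RedisDown | `redis_up == 0` | 5 | Redis is down. |
| RedisMissingMaster | `count(redis_instance_info{role="master"}) == 0` | 5 | The master is missing. |
| RedisTooManyMasters | `count(redis_instance_info{role="master"}) > 1` | 5 | There are too many masters. |
| RedisDisconnectedSlaves | `count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1` | 5 | Slaves are disconnected. |
| RedisReplicationBroken | `delta(redis_connected_slaves[1m]) < 0` | 5 | Replication is broken. |
| RedisClusterFlapping | `changes(redis_connected_slaves[5m]) > 2` | 5 | Replica connections change too often. |
| RedisMissingBackup | `time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24` | 5 | The backup is missing. |
| RedisOutOfMemory | `redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90` | 5 | Redis is running out of memory. |
| RedisTooManyConnections | `redis_connected_clients > 100` | 5 | Too many connections. |
| RedisNotEnoughConnections | `redis_connected_clients < 5` | 5 | Not enough connections. |
| RedisRejectedConnections | `increase(redis_rejected_connections_total[1m]) > 0` | 5 | Connections were rejected. |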
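
The RedisMissingMaster, RedisTooManyMasters, and RedisMissingBackup conditions describe simple checks on the master count and on the age of the last RDB save. The following Python sketch restates the intent of those checks outside PromQL; the function names and the list-of-roles input are hypothetical.

```python
import time
from typing import List, Optional, Sequence

ONE_DAY_S = 60 * 60 * 24  # same constant as in the RedisMissingBackup expression


def master_alerts(roles: Sequence[str]) -> List[str]:
    """Return the master-related alert names that apply to the given instance roles."""
    masters = sum(1 for role in roles if role == "master")
    alerts = []
    if masters == 0:
        alerts.append("RedisMissingMaster")
    if masters > 1:
        alerts.append("RedisTooManyMasters")
    return alerts


def backup_missing(last_save_ts: float, now: Optional[float] = None) -> bool:
    """True when the last RDB save is more than 24 hours old."""
    current = time.time() if now is None else now
    return current - last_save_ts > ONE_DAY_S
```

A healthy setup with one master and any number of slaves yields no master alerts, and a save made exactly 24 hours ago does not yet count as missing because the comparison is strict.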