2022 Study Log 0427【High IO on a K8S Cluster Taking Services Down: Troubleshooting Notes】
Time to celebrate: today I finally solved a problem that had been bothering me for over a month!
【Background】: A K8S cluster deployed with minikube on a 2-core / 2 GB Tencent Cloud S5 server, running two Deployments (nodejs and grafana) and one DaemonSet (filebeat).
【Symptoms】:
1. At first, grafana was frequently unreachable and the master node showed NotReady.
2. docker ps showed the apiserver and scheduler containers restarting frequently (roughly every 6 minutes, without affecting the services).
3. docker ps also showed restart records for kube-proxy (the root cause of the avalanche: once proxy crashed, the master went NotReady).
4. Disk IO was regularly maxed out, sometimes to the point where I could not even log in to the server; a reboot bought about an hour of normal operation before the whole cycle repeated (a sketch of how to confirm the saturation follows this list).
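For completeness, here is a minimal sketch of how the IO saturation can be confirmed; these are standard sysstat/iotop tools rather than commands copied from my original session:

# Per-device utilization and latency; %util pinned near 100% means the disk is saturated
iostat -x 1 5

# Per-process read/write rates, to find out who is generating the IO
pidstat -d 1 5

# Interactive view of the top IO consumers (requires the iotop package)
iotop -oPa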
【Error log summary】:
The logs are long, so only representative, non-repeating error entries are excerpted here.
apiserver: at first I only looked at the pod level (kubectl logs on the apiserver pod) and forgot to check the container level.
http: TLS handshake error from 172.17.0.36:37502: read tcp 172.17.0.36:8443-\u003e172.17.0.36:37502: read: connection reset by peer
{"log":"I0423 21:50:56.150640 1 log.go:172] http: TLS handshake error from 172.17.0.36:37756: EOF\n","stream":"stderr","time":"2022-04-23T21:50:56.150805501Z"}
{"log":"E0423 21:50:56.602528 1 status.go:71] apiserver received an error that is not an metav1.Status: \u0026errors.errorString{s:\"context canceled\"}\n","stream":"stderr","time":"2022-04-23T21:50:56.60266719Z"}
scheduler: same story as the apiserver. These two containers were restarting unusually often, yet at first I somehow paid no attention and simply wrote it off as instability.
E0423 22:54:15.673221 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: Get https://control-plane.minikube.internal:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s: context deadline exceeded
I0423 22:54:17.350345 1 leaderelection.go:277] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0423 22:54:17.491836 1 server.go:244] leaderelection lost
controller:
W0423 22:52:23.100798 1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0423 22:52:41.395453 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: Get https://control-plane.minikube.internal:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
I0423 22:52:43.318502 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0423 22:52:43.662661 1 controllermanager.go:279] leaderelection lost
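The F-level "leaderelection lost" lines are the direct reason for these restarts: when a control-plane component cannot renew its lock through the apiserver in time, it exits deliberately and gets restarted by kubelet. A minimal sketch for inspecting the locks (generic commands, not taken from my original notes; which object holds the lock depends on the K8s version):

# Lease-based lock (coordination.k8s.io), which the scheduler log above refers to
kubectl -n kube-system get lease kube-scheduler -o yaml

# Endpoints-based lock, which this controller-manager still uses; the holder is
# recorded in the control-plane.alpha.kubernetes.io/leader annotation
kubectl -n kube-system get endpoints kube-controller-manager -o yaml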
etcd: this is where I got misled. I kept assuming the slow range requests meant etcd's log writes were overwhelming the disk.
That led to a big detour: I kept trying to tune etcd and even downloaded etcd-client.
ETCD tuning plan
(A single node does not actually need Raft, but I pushed on with it for the sake of learning.)
The tuning covered three areas (a rough sketch follows the list):
1. Raise the etcd process's disk IO priority
2. Increase the snapshot frequency to reduce WAL accumulation and speed up compaction
3. Network tuning, which I'll skip here
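A minimal sketch of what the first two items looked like in practice. The flag values are illustrative, and the manifest and cert paths assume minikube's default layout (/etc/kubernetes/manifests and /var/lib/minikube/certs); adjust to your own setup:

# 1. Raise the running etcd process's IO priority (best-effort class, highest level)
ionice -c 2 -n 0 -p "$(pgrep -x etcd)"

# 2. Snapshot/compaction flags added to the etcd static pod manifest
#    (e.g. /etc/kubernetes/manifests/etcd.yaml; values are illustrative)
#    --snapshot-count=5000            # snapshot more often, keep fewer WAL entries in memory
#    --auto-compaction-retention=1    # compact key history every hour

# Sanity check with etcdctl afterwards
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/minikube/certs/etcd/ca.crt \
  --cert=/var/lib/minikube/certs/etcd/server.crt \
  --key=/var/lib/minikube/certs/etcd/server.key \
  endpoint status -w table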
Well, in the end it proved useless; the problem had nothing to do with etcd tuning.
{"log":"2022-04-23 21:47:38.959628 I | embed: rejected connection from \"127.0.0.1:44528\" (error \"read tcp 127.0.0.1:2379-\u003e127.0.0.1:44528: read: connection reset by peer\", ServerName \"\")\n","stream":"stderr","time":"2022-04-23T21:47:38.959858512Z"}
{"log":"2022-04-23 21:47:47.873513 W | etcdserver: read-only range request \"key:\\\"/registry/services/endpoints/kube-system/kube-scheduler\\\" \" with result \"range_response_count:1 size:590\" took too long (1.083816676s) to execute\n","stream":"stderr","time":"2022-04-23T21:47:47.873683358Z"}
Using the container exit times to reconstruct the error timeline reveals some dependencies between the failures:
the earlier a container was recreated, the earlier it had crashed in the previous round.
From this, leaving proxy aside, the crash order was: controller > scheduler > apiserver.
[root@VM-0-36-centos ~]# docker ps -a | grep xit
1bf9080b4c48 "/dashboard --insecu…" 15 minutes ago Exited (2) 15 minutes ago k8s_kubernetes-dashboard_kubernetes-dashboard-696dbcc666-42fns_kubernetes-dashboard_2568a5cd-72d1-433b-b5b2-27885d2d943e_42
395c8890dd51 "kube-apiserver --ad…" 19 minutes ago Exited (0) 15 minutes ago k8s_kube-apiserver_kube-apiserver-vm-0-36-centos_kube-system_e83e2db116420e21a35b9d31a383202d_91
7fbee5fd1757 "kube-scheduler --au…" 20 minutes ago Exited (255) 15 minutes ago k8s_kube-scheduler_kube-scheduler-vm-0-36-centos_kube-system_c63a370803ea358d14eb11f27c64756f_240
0e7d81be25d9 "kube-controller-man…" 27 minutes ago Exited (255) 15 minutes ago k8s_kube-controller-manager_kube-controller-manager-vm-0-36-centos_kube-system_0d5c3746cb0a798a6fc95c8dab3bff0b_245
7aee6e991973 "/usr/bin/dumb-init …" 3 hours ago Exited (137) 2 hours ago k8s_nginx-ingress-controller_nginx-ingress-controller-6d746cd945-f67xn_ingress-nginx_0c01ad47-3f93-4d82-892e-e87b6a361db5_0
28bb20e1ce79 "/coredns -conf /etc…" 3 hours ago Exited (0) 20 minutes ago k8s_coredns_coredns-546565776c-4ct42_kube-system_c389cc98-3f18-4351-8e5e-346b374a47a2_54
1e61dd0aef3d "/coredns -conf /etc…" 3 hours ago Exited (0) 20 minutes ago k8s_coredns_coredns-546565776c-2sfml_kube-system_c3870b1b-66cb-4179-808e-6956ecc92ebe_51
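docker ps only reports exit times at minute granularity; to order the crashes precisely you can pull the exact timestamps from docker inspect. A small sketch (not part of the original session; the filter and format strings are standard docker CLI):

# Print "finish-time  name  exit-code" for every exited container, oldest first
docker ps -a -q --filter "status=exited" |
  xargs docker inspect -f '{{.State.FinishedAt}}  {{.Name}}  exit={{.State.ExitCode}}' |
  sort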
The most important evidence was the kube-proxy container's log. I did not save it at the time and cannot reproduce it now, but it read roughly like this:
dial tcp 172.***** connect reset
In fact the apiserver logs had already hinted at this: something was wrong with internal communication on the 172.x network.
【Subnet conflict】
At this point an article saved me.
My server's private network subnet was conflicting with the container network's address range! At first I did not take it seriously; it did not seem directly related to the IO problem, so what good would fixing it do?
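Note that the 172.17.0.36 address in the apiserver log above sits squarely inside docker's default 172.17.0.0/16 bridge range, which is exactly the kind of overlap to look for. A minimal sketch for checking it (generic commands, not from my original notes):

# The host's private address and subnet (the interface name may differ)
ip -4 addr show eth0

# docker's default bridge subnet, typically 172.17.0.0/16
docker network inspect bridge -f '{{(index .IPAM.Config 0).Subnet}}'

# Overlapping 172.x routes here mean VPC traffic and container traffic are
# competing for the same address range
ip route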
But in a nothing-to-lose, try-anything spirit, I switched the server's private subnet. (After the switch, the ConfigMaps and other resources I had configured were gone and had to be recreated.)
Bringing minikube back up also threw an error; running minikube delete and starting over fixed it:
couldn't retrieve DNS addon deployments:
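Roughly, the recovery was just to throw away the stale cluster state and bootstrap again (the --driver value below is an assumption; reuse whatever flags you started with originally):

# Wipe the broken cluster state, then bootstrap a fresh one
minikube delete
minikube start --driver=none   # driver/flags are an assumption, match your original setup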
After this change the services magically came up and stayed up; it has now been two hours and the docker environment is clean. I'll keep watching.
So the subnet conflict was the real culprit all along!