Troubleshooting K8s Failures
Problem 1: etcd and apiserver fail to start
Symptom:
[root@hdss7-22 supervisord.d]# supervisorctl status
etcd-server-22 FATAL Exited too quickly (process log may have details)
kube-apiserver-7-22 FATAL Exited too quickly (process log may have details)
kube-controller-manager-7-22 RUNNING pid 1032, uptime 0:10:51
kube-kubelet-7-22 RUNNING pid 1031, uptime 0:10:51
kube-proxy-7-22 RUNNING pid 1036, uptime 0:10:51
kube-scheduler-7-22 RUNNING pid 1039, uptime 0:10:51
Analysis:
1. The kube-apiserver log shows that it cannot connect to port 2379 on the etcd host:
[root@hdss7-22]# tail -200 /data/logs/kubernetes/kube-apiserver/
grpc: addrConn.createTransport failed to connect to {10.4.7.22:2379 0 <nil>}
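2. This explains the apiserver failure: apiserver stores its data in etcd, and because etcd cannot start, apiserver cannot reach it and fails to start as well. Fixing etcd should therefore fix everything.
3. The etcd log shows no obvious error; only the last line points to the problem:
[root@hdss7-22]# tail -200 /data/logs/etcd-server/etcd.stdout.log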
2020-11-26 16:45:46.362582 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:46.363653 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:46.363686 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:46.363691 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:46.363719 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:46.363738 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:46.368358 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:46.370241 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:46.408948 I | mvcc: restore compact to 666642
2020-11-26 16:45:46.409020 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
2020-11-26 16:45:47.443185 I | etcdmain: etcd Version: 3.1.20
2020-11-26 16:45:47.443266 I | etcdmain: Git SHA: 992dbd4d1
2020-11-26 16:45:47.443271 I | etcdmain: Go Version: go1.8.7
2020-11-26 16:45:47.443275 I | etcdmain: Go OS/Arch: linux/amd64
2020-11-26 16:45:47.443279 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2020-11-26 16:45:47.443325 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-11-26 16:45:47.443347 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:47.444338 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:47.444370 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:47.444375 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:47.444402 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:47.444420 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:47.449024 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:47.450752 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:47.452869 I | mvcc: restore compact to 666642
2020-11-26 16:45:47.452932 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
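4. Searching online for that last line turns up a matching case: the etcd data was corrupted after the server shut down uncleanly (unexpected power loss, power cord pulled). See https://blog.csdn.net/dengxiafubi/article/details/102627341
Check the etcd member list: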
[root@hdss7-22 ]# cd /opt/etcd/
[root@hdss7-22 etcd]# ./etcdctl member list
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
; error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
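The fatal entry from step 3 can also be filtered out of the log instead of being read off the end of 200 lines. The following is a minimal sketch, assuming the "DATE TIME LEVEL | message" layout seen in the log above, where the level letter C marks the critical entry; the function name etcd_critical_lines is made up for illustration.

```shell
# Print only the critical-level ("C |") messages of an etcd log,
# each distinct message once.
# Assumes the "DATE TIME LEVEL | message" layout shown in the log above.
etcd_critical_lines() {
    # $1: path to the etcd log file
    awk '$3 == "C" && $4 == "|" {
        msg = $0
        sub(/^[^|]*\| /, "", msg)      # strip timestamp and level
        if (!(msg in seen)) { seen[msg] = 1; print msg }
    }' "$1"
}
```

Usage: etcd_critical_lines /data/logs/etcd-server/etcd.stdout.log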
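Fix:
A corrupted etcd node can be removed from the cluster and then re-added as follows:
1. On the failed node, stop the etcd service and move the corrupted etcd data out of the way
[root@hdss7-22 ]# supervisorctl stop etcd-server-22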
[root@hdss7-22 etcd]# ll /data/etcd/etcd-server/member/
drwx------ 2 etcd etcd 62 11月 26 17:42 snap
drwx------ 2 etcd etcd 64 11月 26 17:18 wal
[root@hdss7-22 ]# mkdir -p ~/etcd-backup && mv /data/etcd/etcd-server/member/* ~/etcd-backup/
2. Restart etcd
[root@hdss7-22 ]# systemctl restart supervisord
[root@hdss7-22 ]# supervisorctl update
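The steps above only clear the data directory; the "remove, then re-add" part happens against the cluster itself. Here is a rough sketch of that part, assuming it is run on a healthy etcd node, that etcdctl there can reach the local client URL without extra TLS flags, and that the member ID is read from "member list" first. The function name etcd_readd_member and the ETCDCTL override variable are illustrative, not part of the original setup.

```shell
# Sketch: remove a broken member and re-add it, run on a HEALTHY etcd node.
# ETCDCTL can be overridden; the default path follows the /opt/etcd layout above.
ETCDCTL="${ETCDCTL:-/opt/etcd/etcdctl}"

etcd_readd_member() {
    # $1: member ID taken from "etcdctl member list"
    # $2: member name, $3: peer URL of the node being re-added
    "$ETCDCTL" member remove "$1" &&
    "$ETCDCTL" member add "$2" "$3"
}
```

Usage: etcd_readd_member <member-id> etcd-server-7-22 https://10.4.7.22:2380. After the add, a re-joining node is normally started with --initial-cluster-state existing in its startup script, so that it joins the current cluster rather than bootstrapping a new one.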
Follow-up:
That should have resolved it, but etcd still would not start. The log above also contains this line, which looked suspicious:
peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
Checking the etcd log again:
[root@hdss7-22]# tail -200 /data/logs/etcd-server/etcd.stdout.log
2020-11-26 16:45:52.536182 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:52.537232 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:52.537264 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:52.537269 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:52.537290 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:52.537368 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:52.542002 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:52.543972 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:52.547173 I | mvcc: restore compact to 666642
2020-11-26 16:45:52.547254 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
2020-11-26 16:49:28.787552 I | etcdmain: etcd Version: 3.1.20
2020-11-26 16:49:28.787694 I | etcdmain: Git SHA: 992dbd4d1
2020-11-26 16:49:28.787710 I | etcdmain: Go Version: go1.8.7
2020-11-26 16:49:28.787715 I | etcdmain: Go OS/Arch: linux/amd64
2020-11-26 16:49:28.787720 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2020-11-26 16:49:28.787779 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-11-26 16:49:28.787808 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
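The ./etcdctl member list error above also involves 127.0.0.1:2379, the address in the log line embed: listening for client requests on 127.0.0.1:2379.
A web search for listening for client requests on 127.0.0.1:2379 (https://blog.csdn.net/weixin_30896657/article/details/97033139) suggests that
etcd needs to talk to its own port 2379 on the local host (http://127.0.0.1:2379). However, the startup script already contains that address: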
[root@hdss7-22 etcd]# cat /opt/etcd/etcd-server-startup.sh
--listen-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \
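I redeployed everything and the problem remained. The log also hinted at something CA-related, so I changed https to http and dropped SSL authentication altogether.
Fix: change every https to http
1. [root@hdss7-22 etcd]# vi /opt/etcd/etcd-server-startup.sh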
--listen-peer-urls http://10.4.7.22:2380 \
--listen-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \
--initial-advertise-peer-urls http://10.4.7.22:2380 \
--advertise-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \
--initial-cluster etcd-server-7-12=http://10.4.7.12:2380,etcd-server-7-21=http://10.4.7.21:2380,etcd-server-7-22=http://10.4.7.22:2380 \
2. apiserver is the client and etcd is the server, so the etcd addresses that apiserver connects to must also change from https to http
[root@hdss7-22 etcd]# vi /opt/kubernetes/server/bin/kube-apiserver-startup.sh
--etcd-servers http://10.4.7.12:2379,http://10.4.7.21:2379,http://10.4.7.22:2379 \
3. Restart
[root@hdss7-22 ]# systemctl restart supervisord
[root@hdss7-22 ]# supervisorctl update
At this point all services start normally, but the underlying https problem is still unresolved.
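Because the fix depends on every URL being switched, one leftover https:// in either startup script is easy to miss. A small check can list any that remain. This is only a sketch; the function name no_https_left is hypothetical.

```shell
# Print every line of the given files that still contains an https:// URL,
# prefixed with file name and line number.
# Returns 0 only when no such line is left.
no_https_left() {
    # /dev/null is added so grep always prints the file name, even for one file
    if grep -n 'https://' "$@" /dev/null; then
        return 1
    fi
    return 0
}
```

Usage: no_https_left /opt/etcd/etcd-server-startup.sh /opt/kubernetes/server/bin/kube-apiserver-startup.sh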
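Problem 2: A fault keeps the NIC from coming up, and apiserver fails to start after the NIC is changed
NIC failure:
The NIC could not be brought up with ifconfig even though its configuration was correct. systemctl restart network failed, and journalctl -xe reported RTNETLINK answers: File exists
Solutions:
Option 1: a conflict with the NetworkManager service. This is easy to fix: stop the service with service NetworkManager stop, disable it at boot with chkconfig NetworkManager off, then reboot.
Option 2: the MAC address does not match the configuration file. This is also easy to fix: edit the MAC address in /etc/udev/rules.d/70-persistent-net.rules so that it matches the one in /etc/sysconfig/network-scripts/ifcfg-eth0.
I spent a long time on both options without success. In the end, a command from some obscure post, ip addr flush dev eth0, fixed it.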
apiserver startup failure:
Check the apiserver.stdout.log log:
error: unable to find suitable network address.error='no default routes found in
"/proc/net/route" or "/proc/net/ipv6_route"'. Try to set the AdvertiseAddress directly or
provide a valid BindAddress to fix this
Solution: the default gateway is not configured. Suppose the default gateway should be 10.4.7.254:
[root@hdss7-21 kube-apiserver]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.0.0        0.0.0.0         255.0.0.0       U     0      0        0 ens33
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 ens33
172.7.21.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.7.22.0      10.4.7.22       255.255.255.0   UG    0      0        0 ens33
Add the default gateway on the command line: route add default gw 10.4.7.254
[root@hdss7-21 kube-apiserver]# route add default gw 10.4.7.254
[root@hdss7-21 kube-apiserver]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.4.7.254      0.0.0.0         UG    0      0        0 ens33
10.0.0.0        0.0.0.0         255.0.0.0       U     0      0        0 ens33
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 ens33
172.7.21.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.7.22.0      10.4.7.22       255.255.255.0   UG    0      0        0 ens33
Once the gateway is confirmed to be in place, start apiserver again; it now runs correctly (route -n shows whether the default gateway has been configured).
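The route -n check can be scripted so that the gateway is added only when it is missing. A minimal sketch, assuming the route -n column layout shown above (Flags in the fourth column); the function name has_default_route is made up for illustration.

```shell
# Succeeds if "route -n" style output on stdin contains a default route,
# i.e. a row whose destination is 0.0.0.0 and whose flags include G.
has_default_route() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { found = 1 } END { exit !found }'
}
```

Usage: route -n | has_default_route || route add default gw 10.4.7.254. A route added this way does not survive a reboot; on a system that uses network-scripts, a GATEWAY= entry in the interface's ifcfg file is the usual way to make it permanent.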
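Problem 3: After deploying Prometheus Server, the targets list is incomplete
[Figure: targets page in the normal state] When the problem occurs, the page may list only etcd.
Check the container log:
[Solved] The log reports cannot list resource \"services\" in API group \"\" at the cluster scope; granting Prometheus the missing cluster access permissions fixes it.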
rbac.yaml before the change:
(omitted)
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
(omitted)
rbac.yaml after the change:
(omitted)
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
(omitted)
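After applying the updated rules, the new permissions can be verified without waiting for the targets page to refresh. The sketch below asks the API server through kubectl auth can-i, impersonating the Prometheus service account. The namespace and service account name are passed in by the caller and are assumptions, as are the function name check_prometheus_rbac and the KUBECTL override variable; the user running it needs impersonation rights (for example cluster-admin).

```shell
# Ask the API server whether the Prometheus service account may
# get/list/watch each resource granted by the updated rule.
# nodes/metrics is a subresource and is left out of this simple loop.
KUBECTL="${KUBECTL:-kubectl}"

check_prometheus_rbac() {
    # $1: namespace of the service account, $2: service account name
    for res in nodes services endpoints pods; do
        for verb in get list watch; do
            printf '%s %s: ' "$verb" "$res"
            "$KUBECTL" auth can-i "$verb" "$res" --all-namespaces \
                --as="system:serviceaccount:$1:$2"
        done
    done
}
```

Usage: check_prometheus_rbac <namespace> prometheus. Each line should end in yes once the ClusterRole change has taken effect; a no for list services corresponds to the error seen in the container log.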