Problem 1: etcd and apiserver fail to start

Symptoms:

[root@hdss7-22 supervisord.d]# supervisorctl status
etcd-server-22                            FATAL     Exited too quickly (process log may have details)
kube-apiserver-7-22                   FATAL     Exited too quickly (process log may have details)
kube-controller-manager-7-22   RUNNING   pid 1032, uptime 0:10:51
kube-kubelet-7-22                      RUNNING   pid 1031, uptime 0:10:51
kube-proxy-7-22                        RUNNING   pid 1036, uptime 0:10:51
kube-scheduler-7-22                 RUNNING   pid 1039, uptime 0:10:51
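
To pick the failing programs out of a longer status listing, a small filter along these lines can help. This is only a sketch: the function name not_running is made up for illustration, and it assumes the state is always the second whitespace-separated column, as in the output above.

```shell
#!/usr/bin/env bash
# Print the names of supervisor programs whose state is not RUNNING.
# Reads `supervisorctl status`-style lines on stdin:
#   <program-name> <STATE> <details...>
not_running() {
  awk '$2 != "RUNNING" { print $1 }'
}

# Possible use on a live host (assumes supervisorctl is on PATH):
#   supervisorctl status | not_running
```

On the listing above this would leave just the etcd and kube-apiserver entries, which is where the analysis below starts.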

Analysis:

1. The kube-apiserver log shows that it cannot connect to port 2379 on the etcd machine
[root@hdss7-22]#tail -200 /data/logs/kubernetes/kube-apiserver/
grpc: addrConn.createTransport failed to connect to {10.4.7.22:2379 0 <nil>}

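2. This tells us what is going on: apiserver needs to send its data to etcd, and because etcd cannot start, apiserver cannot reach it and fails to start as well. Fixing etcd therefore fixes both failures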
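
Before digging into logs, it can be worth confirming that nothing is listening on the etcd client port at all. The following is a rough sketch using bash's /dev/tcp pseudo-device; port_open is a hypothetical helper name, and the 2-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 2 seconds,
# non-zero otherwise. Relies on bash's /dev/tcp redirection.
port_open() {
  local host=$1 port=$2
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Possible use with the address from the apiserver log:
#   port_open 10.4.7.22 2379 && echo "etcd reachable" || echo "etcd NOT reachable"
```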

3. The etcd log contains no substantive error except for the last line, which is the only lead
[root@hdss7-22]#tail -200 /data/logs/etcd-server/etcd.stdout.log

2020-11-26 16:45:46.362582 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:46.363653 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:46.363686 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:46.363691 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:46.363719 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:46.363738 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:46.368358 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:46.370241 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:46.408948 I | mvcc: restore compact to 666642
2020-11-26 16:45:46.409020 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
2020-11-26 16:45:47.443185 I | etcdmain: etcd Version: 3.1.20
2020-11-26 16:45:47.443266 I | etcdmain: Git SHA: 992dbd4d1
2020-11-26 16:45:47.443271 I | etcdmain: Go Version: go1.8.7
2020-11-26 16:45:47.443275 I | etcdmain: Go OS/Arch: linux/amd64
2020-11-26 16:45:47.443279 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2020-11-26 16:45:47.443325 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-11-26 16:45:47.443347 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:47.444338 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:47.444370 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:47.444375 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:47.444402 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:47.444420 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:47.449024 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:47.450752 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:47.452869 I | mvcc: restore compact to 666642
2020-11-26 16:45:47.452932 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key

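4. Searching online for that last line turns up a matching solution: the etcd data was corrupted after the server shut down uncleanly (unexpected power loss or the power being pulled) (https://blog.csdn.net/dengxiafubi/article/details/102627341).

Check the etcd member list: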
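
Since the decisive line is buried among repeated informational messages, filtering by log level makes it easier to spot. The sketch below assumes the log format shown above (date, time, a one-letter level, then " | "), where C marks critical and E marks error entries; etcd_fatal_lines is a name invented for this example.

```shell
#!/usr/bin/env bash
# Print only critical (C) and error (E) entries from an etcd log on stdin.
# Expected line shape: "2020-11-26 16:45:46.409020 C | mvcc: ..."
etcd_fatal_lines() {
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:.]+ [CE] [|] '
}

# Possible use:
#   tail -200 /data/logs/etcd-server/etcd.stdout.log | etcd_fatal_lines
```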
[root@hdss7-22 ]# cd /opt/etcd/
[root@hdss7-22 etcd]# ./etcdctl member list
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
; error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused

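Fix:
The damaged etcd node can be removed and then rejoined to the cluster as follows:
1. On the failed node, stop the etcd service and move the corrupted etcd data out of the way
[root@hdss7-22 ]# supervisorctl stop etcd-server-22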
[root@hdss7-22 etcd]# ll /data/etcd/etcd-server/member/
drwx------ 2 etcd etcd 62 11月 26 17:42 snap
drwx------ 2 etcd etcd 64 11月 26 17:18 wal
[root@hdss7-22 ]#mv /data/etcd/etcd-server/member/* ~/etcd-backup
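
Note that mv with several sources fails if the target directory does not exist yet, so the backup directory has to be created first. A slightly more defensive version of this step might look like the sketch below; backup_member_dir and the timestamped directory name are illustrative choices, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Move everything under an etcd member directory (snap/, wal/) into a fresh
# timestamped backup directory and print the backup path.
backup_member_dir() {
  local member_dir=$1 backup_root=$2
  local dest="${backup_root}/member-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  mv "$member_dir"/* "$dest"/ || return 1
  echo "$dest"
}

# Possible use, after etcd has been stopped:
#   backup_member_dir /data/etcd/etcd-server/member ~/etcd-backup
```

On a multi-member cluster, the usual procedure also removes the broken member from a healthy node (etcdctl member remove, then etcdctl member add) and restarts it with --initial-cluster-state existing. Those steps are not shown in these notes, so treat that as general background rather than what was done here.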

2. Restart etcd
[root@hdss7-22 ]#systemctl restart supervisord
[root@hdss7-22 ]#supervisorctl update

Further troubleshooting:

That should have resolved it, but etcd still would not start. The following line in the log above also looked suspicious.
peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true

Check the etcd log again:
[root@hdss7-22]#tail -200 /data/logs/etcd-server/etcd.stdout.log

2020-11-26 16:45:52.536182 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2020-11-26 16:45:52.537232 I | embed: listening for peers on https://10.4.7.12:2380
2020-11-26 16:45:52.537264 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2020-11-26 16:45:52.537269 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2020-11-26 16:45:52.537290 I | embed: listening for client requests on 127.0.0.1:2379
2020-11-26 16:45:52.537368 I | embed: listening for client requests on 10.4.7.12:2379
2020-11-26 16:45:52.542002 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-11-26 16:45:52.543972 I | etcdserver: recovered store from snapshot at index 890089
2020-11-26 16:45:52.547173 I | mvcc: restore compact to 666642
2020-11-26 16:45:52.547254 C | mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key
2020-11-26 16:49:28.787552 I | etcdmain: etcd Version: 3.1.20
2020-11-26 16:49:28.787694 I | etcdmain: Git SHA: 992dbd4d1
2020-11-26 16:49:28.787710 I | etcdmain: Go Version: go1.8.7
2020-11-26 16:49:28.787715 I | etcdmain: Go OS/Arch: linux/amd64
2020-11-26 16:49:28.787720 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2020-11-26 16:49:28.787779 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-11-26 16:49:28.787808 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true

In addition, the ./etcdctl member list failure above corresponds to the log line embed: listening for client requests on 127.0.0.1:2379.

An online search for listening for client requests on 127.0.0.1:2379 (https://blog.csdn.net/weixin_30896657/article/details/97033139) suggests that
etcd needs to talk to port 2379 on its own host (http://127.0.0.1:2379). But the startup script already contains this.
[root@hdss7-22 etcd]# cat /opt/etcd/etcd-server-startup.sh 
--listen-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \   

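I then redeployed and repeated the steps, which did not help. Looking at the log again, it hinted at something related to the CA, so I simply changed https to http and dropped SSL authentication.

Fix: change every https to http
1. [root@hdss7-22 etcd]# vi /opt/etcd/etcd-server-startup.sh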
  --listen-peer-urls http://10.4.7.22:2380 \
  --listen-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.4.7.22:2380 \
  --advertise-client-urls http://10.4.7.22:2379,http://127.0.0.1:2379 \
  --initial-cluster  etcd-server-7-12=http://10.4.7.12:2380,etcd-server-7-21=http://10.4.7.21:2380,etcd-server-7-22=http://10.4.7.22:2380 \

2. apiserver is the client and etcd is the server, so change the etcd addresses that apiserver connects to from https to http
[root@hdss7-22 etcd]# vi /opt/kubernetes/server/bin/kube-apiserver-startup.sh
    --etcd-servers http://10.4.7.12:2379,http://10.4.7.21:2379,http://10.4.7.22:2379 \
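
When switching schemes by hand across several flags, it is easy to leave one URL behind. A quick way to check is to list the distinct URL schemes that remain in a startup script. This is a minimal sketch; url_schemes is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the distinct URL schemes (http and/or https) found in the text on
# stdin, one per line. More than one line of output means the schemes are mixed.
url_schemes() {
  grep -oE 'https?://' | tr -d ':/' | sort -u
}

# Possible use:
#   url_schemes < /opt/etcd/etcd-server-startup.sh
#   url_schemes < /opt/kubernetes/server/bin/kube-apiserver-startup.sh
```

If either script prints both http and https, at least one URL was missed.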

3. Restart
[root@hdss7-22 ]#systemctl restart supervisord
[root@hdss7-22 ]#supervisorctl update

At this point all services start normally, but the underlying https problem is still unresolved.

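Problem 2: A fault prevented the NIC from coming up, and after the NIC was fixed apiserver failed to start

NIC failure:

The NIC could not be brought up with ifconfig even though its configuration was correct. systemctl restart network failed, and journalctl -xe reported RTNETLINK answers: File exists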

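Solutions:

Option 1: A conflict with the NetworkManager service. This is easy to fix: stop the service with service NetworkManager stop, disable it at boot with chkconfig NetworkManager off, then reboot.
Option 2: The MAC address does not match the configuration file. This is also easy to fix: edit /etc/udev/rules.d/70-persistent-net.rules so that its MAC address matches the one in /etc/sysconfig/network-scripts/ifcfg-eth0.
I spent a long time on both methods without success. In the end, an obscure forum post suggested the command ip addr flush dev eth0, and that fixed it.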
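
The step that finally worked can be wrapped with a dry-run switch, so the commands can be reviewed before they touch a live interface. This is a sketch only: flush_and_restart and the DRY_RUN variable are invented for this example, and flushing addresses over an SSH session on that same interface will drop the connection.

```shell
#!/usr/bin/env bash
# Flush all addresses from a device, then restart the network service.
# With DRY_RUN set to a non-empty value, the commands are printed instead
# of being executed.
flush_and_restart() {
  local dev=$1
  local run=${DRY_RUN:+echo}
  $run ip addr flush dev "$dev" && $run systemctl restart network
}

# Possible use:
#   DRY_RUN=1 flush_and_restart eth0   # only print the commands
#   flush_and_restart eth0             # actually run them (needs root)
```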

apiserver startup failure:

Check the apiserver.stdout.log log:
error: unable to find suitable network address.error='no default routes found in
"/proc/net/route" or "/proc/net/ipv6_route"'. Try to set the AdvertiseAddress directly or 
provide a valid BindAddress to fix this

Solution: no default gateway was configured. Suppose the default gateway is 10.4.7.254
[root@hdss7-21 kube-apiserver]# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.0.0        0.0.0.0         255.0.0.0       U     0      0        0 ens33
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 ens33
172.7.21.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.7.22.0      10.4.7.22       255.255.255.0   UG    0      0        0 ens33

Add the default gateway on the command line: route add default gw 10.4.7.254
[root@hdss7-21 kube-apiserver]# route add default gw 10.4.7.254

[root@hdss7-21 kube-apiserver]# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.4.7.254      0.0.0.0         UG    0      0        0 ens33
10.0.0.0        0.0.0.0         255.0.0.0       U     0      0        0 ens33
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 ens33
172.7.21.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.7.22.0      10.4.7.22       255.255.255.0   UG    0      0        0 ens33

Once the gateway is confirmed (route -n shows whether the default gateway is set), start apiserver again; it now runs correctly.
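
The check for a default route can also be scripted, for example as a guard before starting apiserver. The sketch below parses route -n style output from stdin; has_default_route is a hypothetical helper, and it treats a row with destination 0.0.0.0 and the G flag as the default route.

```shell
#!/usr/bin/env bash
# Return 0 if the `route -n`-style table on stdin contains a default route
# (destination 0.0.0.0 with the gateway flag G), non-zero otherwise.
has_default_route() {
  awk '$1 == "0.0.0.0" && $4 ~ /G/ { found = 1 } END { exit !found }'
}

# Possible use:
#   route -n | has_default_route || echo "no default gateway configured"
```

A route added with route add (or the iproute2 equivalent, ip route add default via 10.4.7.254) generally does not survive a reboot or network restart. On systems that use /etc/sysconfig/network-scripts, setting GATEWAY in the interface's ifcfg file is the usual way to make it permanent.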

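Problem 3: After deploying Prometheus Server, the targets list is incomplete

In the normal state all targets are shown (screenshot in the original post); in the failure case, possibly only etcd is shown.

Check the container log:

[Solved] The log reports: cannot list resource \"services\" in API group \"\" at the cluster scope". Changing the cluster access permissions granted to prometheus fixes it.

rbac.yaml before the change:

(omitted)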
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
(omitted)

After the change:

(omitted)
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
(omitted)
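
After applying the updated rbac.yaml (for example with kubectl apply -f rbac.yaml), the new permissions can be verified without waiting for Prometheus to rediscover its targets. The sketch below uses kubectl auth can-i with impersonation; check_prometheus_rbac is a made-up helper, and the service account name passed to it is an assumption, since the ServiceAccount and binding sections of rbac.yaml are omitted above.

```shell
#!/usr/bin/env bash
# For each resource Prometheus needs to discover, print whether the given
# identity may list it across all namespaces.
check_prometheus_rbac() {
  local subject=$1 res
  for res in nodes services endpoints pods; do
    printf '%s: ' "$res"
    # can-i exits non-zero when the answer is "no"; keep checking the rest.
    kubectl auth can-i list "$res" --all-namespaces --as="$subject" || true
  done
}

# Possible use (replace <namespace> with the namespace Prometheus runs in):
#   check_prometheus_rbac system:serviceaccount:<namespace>:prometheus
```

With the original role, the check would be expected to report no for these resources, matching the "cannot list resource" error; after the change each line should read yes.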
