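Problem description
After a power outage shut the machines down, the Kubernetes cluster went down. Every kubectl command fails with: The connection to the server ip:6443 was refused - did you specify the right host or port. Restarting kubelet does not bring the cluster back; etcd fails while reading its data because the data file is corrupted.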

Troubleshooting
First, check whether the services are running and whether they report any errors.

[root@pengfei-master1 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since 三 2023-07-19 17:47:06 CST; 35min ago
Docs: https://kubernetes.io/docs/
Main PID: 3087 (kubelet)
Tasks: 14
Memory: 46.8M
CGroup: /system.slice/kubelet.service
└─3087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/con…

7月 19 18:22:36 pengfei-master1 kubelet[3087]: E0719 18:22:36.296785 3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.105683 3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.149596 3087 eviction_manager.go:256] "Eviction manager: failed to get summary stats" err="fa…not found"
kubelet is running but reports errors: it cannot find the master node. The full kubelet journal shows the same error.

[root@pengfei-master1 ~]# journalctl -u kubelet
7月 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.711540 3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.813511 3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
Next, confirm whether the api-server has gone down; if it has, check its logs.

Check whether the apiserver is down
Note: if you use Docker as the container runtime, run docker ps -a | grep kube-apiserver instead.

[root@pengfei-master1 ~]# crictl ps -a| grep kube-apiserver
I0719 18:02:28.083463 4551 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
e6284e624f40a 4d2edfd10d3e3 2 minutes ago Exited kube-apiserver 62 0b9b24371d25f kube-apiserver-pengfei-master1
The output shows that the apiserver container has exited.
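The check above can be scripted. The following is a minimal sketch, not part of the original write-up: container_state is a hypothetical helper that reads `crictl ps -a` output on stdin and prints the state column for the named container. It assumes the column layout shown in the output above.

```shell
# Hypothetical helper: print the state (Running, Exited, Created, Unknown)
# of the first listed container whose NAME column equals $1.
# Input: `crictl ps -a` output on stdin.
container_state() {
  awk -v name="$1" '
    {
      state = ""; found = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "Running" || $i == "Exited" || $i == "Created" || $i == "Unknown") state = $i
        if ($i == name) found = 1
      }
      if (found && state != "") { print state; exit }
    }'
}

# Example usage on a control-plane node (not executed here):
#   crictl ps -a 2>/dev/null | container_state kube-apiserver
#   crictl logs e6284e624f40a    # container ID taken from the output above
```

Redirecting stderr hides the endpoint deprecation warning. crictl logs followed by the container ID prints that container's log, which is the quickest way to see why it exited.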

Check whether etcd is down
Note: if you use Docker as the container runtime, run docker ps -a | grep etcd instead.

[root@pengfei-master1 ~]# crictl ps -a| grep etcd
I0719 18:03:10.350597 4563 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
9bc45fb63f604 a8a176a5d5d69 5 minutes ago Exited etcd 90 614e62c0c8ed0 etcd-pengfei-master1
The output shows that the etcd container has exited as well.

Next, read the etcd log to find out why it exited.

/var/log/pods/kube-system_etcd
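(Use tab completion here to finish the path, then read the errors in the log and address whatever they point to.)
In my case the path is /var/log/pods/kube-system_etcd-pengfei-master1_df91a367268810494a84463207726090/etcd
Error output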
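To avoid scrolling through the whole log, the first panic line can be pulled out directly. This is a minimal sketch: first_panic is a hypothetical helper, and it assumes the container log lines start with a timestamp, a stream name, and a tag, each followed by a single space.

```shell
# Hypothetical helper: print the first panic message from a container log
# read on stdin, with the "<timestamp> <stream> <tag>" prefix removed.
first_panic() {
  grep -m1 'panic:' | cut -d' ' -f4-
}

# Example usage (not executed here); the glob stands in for the pod
# directory found via tab completion:
#   cat /var/log/pods/kube-system_etcd-*/etcd/*.log | first_panic
```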

2023-07-19T18:03:17.046456705+08:00 stderr F panic: freepages: failed to get all reachable pages (page 700: multiple references)
2023-07-19T18:03:17.046481181+08:00 stderr F
2023-07-19T18:03:17.046483923+08:00 stderr F goroutine 78 [running]:
2023-07-19T18:03:17.046485818+08:00 stderr F go.etcd.io/bbolt.(*DB).freepages.func2(0xc00007c5a0)
2023-07-19T18:03:17.046487431+08:00 stderr F /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
2023-07-19T18:03:17.046524506+08:00 stderr F created by go.etcd.io/bbolt.(*DB).freepages
2023-07-19T18:03:17.046528444+08:00 stderr F /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
etcd hit an error while reading its data and failed to start, which in turn prevents the api-server from starting.

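The etcd data file is corrupted, so the data needs to be restored. This is a lab environment with no etcd backup, so the only option was to reset the cluster.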

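Note: in production, etcd must be deployed for high availability and backed up regularly; otherwise a failure like this one leaves you with nothing to restore from.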
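A regular backup can be as simple as a scheduled snapshot. The sketch below is one possible approach, not what was used in this environment. It assumes etcdctl is installed on the host, that etcd listens on https://127.0.0.1:2379, and that the certificates sit in the kubeadm default directory /etc/kubernetes/pki/etcd. The function names and the backup directory are hypothetical.

```shell
# Assumed kubeadm default certificate directory; adjust for your cluster.
ETCD_PKI=${ETCD_PKI:-/etc/kubernetes/pki/etcd}

# Build a snapshot file name from a backup directory and a timestamp.
snapshot_path() {
  printf '%s/etcd-snapshot-%s.db\n' "$1" "$2"
}

# Save a snapshot into $1 (default: /var/backups/etcd, a hypothetical path).
backup_etcd() {
  dest=$(snapshot_path "${1:-/var/backups/etcd}" "$(date +%Y%m%d-%H%M%S)")
  mkdir -p "$(dirname "$dest")"
  ETCDCTL_API=3 etcdctl snapshot save "$dest" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/server.crt" \
    --key="$ETCD_PKI/server.key"
}

# Example usage on a control-plane node (not executed here):
#   backup_etcd /var/backups/etcd
```

Copy the snapshot files off the node; a backup stored on the same disk as the etcd data would not have survived this incident.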

Reset the k8s cluster
Run this on every machine
kubeadm reset

Delete $HOME/.kube

rm -rf $HOME/.kube
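The reset steps can be collected into one function with a dry-run switch, so the commands can be reviewed before anything is deleted. This is a sketch under assumptions: run and reset_node are hypothetical names, and removing /etc/cni/net.d is an extra step added here because kubeadm reset reports that it does not clean up CNI configuration.

```shell
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_node() {
  run kubeadm reset -f          # -f skips the confirmation prompt
  run rm -rf /etc/cni/net.d     # CNI config is not removed by kubeadm reset
  run rm -rf "$HOME/.kube"
}

# Example usage (not executed here):
#   reset_node                # preview only
#   DRY_RUN=0 reset_node      # actually reset this machine
```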
Initialize the cluster
Recent article: K8s cluster 1.25.0 installation, based on Containerd and the calico network plugin

Run this on the master node only
[root@master ~]# kubeadm init \
--kubernetes-version=v1.17.4 \
--pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--image-repository registry.aliyuncs.com/google_containers \
--apiserver-advertise-address=192.168.1.50

Create the required files

[root@master ~]# mkdir -p $HOME/.kube
[root@master ~]# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master ~]# sudo chown $(id -u):$(id -g) $HOME/.kube/config
The following only needs to be run on the worker nodes
kubeadm join 192.168.0.100:6443 --token awk15p.t6bamck54w69u4s8 \
--discovery-token-ca-cert-hash sha256:a94fa09562466d32d29523ab6cff122186f1127599fa4dcd5fa0152694f17117
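The token and hash in a join command are specific to each kubeadm init run, and bootstrap tokens expire after 24 hours by default, so a saved join command may stop working. A fresh one can be printed on the master at any time. In this sketch, print_join_command and valid_ca_hash are hypothetical helper names; the second one only checks that a hash has the expected shape before it is pasted onto a node.

```shell
# Print a fresh join command; run on the master (control-plane) node.
print_join_command() {
  kubeadm token create --print-join-command
}

# Succeed if $1 looks like a CA cert hash: "sha256:" plus 64 hex digits.
valid_ca_hash() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Example usage (not executed here):
#   print_join_command
#   valid_ca_hash "sha256:..." && echo "hash format looks fine"
```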
Check the node information on the master
[root@master ~]# kubectl get nodes
NAME     STATUS   ROLES    AGE     VERSION
master   Ready    master   3m22s   v1.17.4
v6       Ready    <none>   84s     v1.17.4
v7       Ready    <none>   91s     v1.17.4
