Pitfalls encountered while setting up a k8s cluster
1. Error when creating Kubernetes certificates
Environment: self-signing the APIServer SSL certificate for the k8s cluster + master1 configuration.
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kubernetes-csr.json | cfssljson -bare kubernetes
The command failed with:
Failed to load config file: {"code":5200,"message":"failed to unmarshal configuration: invalid character '\"' after array element"}
Failed to parse input: unexpected end of JSON input
The cause: ca-config.json and ca-csr.json were empty. After filling in their contents and re-running, the commands succeeded.
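For reference, a minimal sketch of the two files (the profile name "kubernetes" must match the -profile flag used below; the names fields are placeholders, adapt them to your environment):
cat > ca-config.json <<'EOF'
{
  "signing": {
    "default": { "expiry": "87600h" },
    "profiles": {
      "kubernetes": {
        "expiry": "87600h",
        "usages": ["signing", "key encipherment", "server auth", "client auth"]
      }
    }
  }
}
EOF
cat > ca-csr.json <<'EOF'
{
  "CN": "kubernetes",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [
    { "C": "CN", "ST": "Beijing", "L": "Beijing", "O": "k8s", "OU": "System" }
  ]
}
EOF
# With these in place, the gencert commands go through: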
cfssl gencert -initca ca-csr.json | cfssljson -bare ca -
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes server-csr.json | cfssljson -bare server
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy
The setup followed someone else's guide; original link: 搭建一个完整的Kubernetes集群–自签APIServer SSL证书+master1配置
But when I got to systemctl start kube-scheduler, it failed:
Failed to start kube-scheduler.service: Unit not found.
Checking the kube-apiserver status then showed:
[root@localhost kubernetes]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Sat 2020-08-15 15:43:51 CST; 10s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 56292 ExecStart=/usr/bin/kube-apiserver $KUBE_LOGTOSTDERR $KUBE_LOG_LEVEL $KUBE_ETCD_SERVERS $KUBE_API_ADDRESS $KUBE_API_PORT $KUBELET_PORT $KUBE_ALLOW_PRIV $KUBE_SERVICE_ADDRESSES $KUBE_ADMISSION_CONTROL $KUBE_API_ARGS (code=exited, status=255)
Main PID: 56292 (code=exited, status=255)
Aug 15 15:43:50 localhost.localdomain systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Aug 15 15:43:50 localhost.localdomain systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 15 15:43:50 localhost.localdomain systemd[1]: kube-apiserver.service failed.
Aug 15 15:43:51 localhost.localdomain systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
Aug 15 15:43:51 localhost.localdomain systemd[1]: Stopped Kubernetes API Server.
Aug 15 15:43:51 localhost.localdomain systemd[1]: start request repeated too quickly for kube-apiserver.service
Aug 15 15:43:51 localhost.localdomain systemd[1]: Failed to start Kubernetes API Server.
Aug 15 15:43:51 localhost.localdomain systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 15 15:43:51 localhost.localdomain systemd[1]: kube-apiserver.service failed.
The journal indicated that kube-scheduler had failed to start, taking kube-apiserver down with it.
At this point kubectl get pods also failed:
E0919 14:57:21.242964 414467 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0919 14:57:21.243436 414467 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0919 14:57:21.245295 414467 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0919 14:57:21.246960 414467 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0919 14:57:21.248726 414467 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
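(This localhost:8080 error simply means kubectl cannot reach any apiserver: here it was because kube-apiserver was down, but the same symptom appears when kubectl merely lacks a kubeconfig, in which case pointing it at the admin config is enough:)
export KUBECONFIG=/etc/kubernetes/admin.conf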
With kubectl failing, check the kubelet logs via journalctl -f -u kubelet, or grep /var/log/messages:
[root@localhost rpm]# tail -100f /var/log/messages|grep kube
Jun 24 23:35:39 localhost kubelet: F0624 23:35:39.190803 25600 server.go:190] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
Jun 24 23:35:39 localhost systemd: kubelet.service: main process exited, code=exited, status=255/n/a
Jun 24 23:35:39 localhost systemd: Unit kubelet.service entered failed state.
Jun 24 23:35:39 localhost systemd: kubelet.service failed.
Jun 24 23:35:49 localhost systemd: kubelet.service holdoff time over, scheduling restart.
Jun 24 23:35:49 localhost systemd: Started kubelet: The Kubernetes Node Agent.
Jun 24 23:35:49 localhost systemd: Starting kubelet: The Kubernetes Node Agent...
The cause: kubelet could not find its config file /var/lib/kubelet/config.yaml; it had somehow been deleted. Running the following regenerates it:
kubeadm init
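(In hindsight, a full re-init is heavy-handed; if only the kubelet config is missing, the kubelet-start phase alone should regenerate /var/lib/kubelet/config.yaml, though I did not test this lighter route here:)
kubeadm init phase kubelet-start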
Then check the directory:
[root@localhost rpm]# cd /var/lib/kubelet/
[root@localhost kubelet]# ls -l
total 12
-rw-r--r--. 1 root root 1609 Jun 24 23:40 config.yaml
-rw-r--r--. 1 root root 40 Jun 24 23:40 cpu_manager_state
drwxr-xr-x. 2 root root 61 Jun 24 23:41 device-plugins
-rw-r--r--. 1 root root 124 Jun 24 23:40 kubeadm-flags.env
drwxr-xr-x. 2 root root 170 Jun 24 23:41 pki
drwx------. 2 root root 6 Jun 24 23:40 plugin-containers
drwxr-x---. 2 root root 6 Jun 24 23:40 plugins
drwxr-x---. 7 root root 210 Jun 24 23:41 pods
[root@localhost kubelet]#
After restarting, kubelet ran normally:
[root@localhost kubelet]# systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2021-06-24 23:40:36 PDT; 4min 31s ago
Docs: http://kubernetes.io/docs/
Main PID: 26379 (kubelet)
Memory: 40.1M
CGroup: /system.slice/kubelet.service
└─26379 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubele...
Jun 24 23:44:46 localhost.localdomain kubelet[26379]: W0624 23:44:46.705701 26379 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 24 23:44:46 localhost.localdomain kubelet[26379]: E0624 23:44:46.705816 26379 kubelet.go:2110] Container runtime network not ready: NetworkReady=fals...itialized
Jun 24 23:44:51 localhost.localdomain kubelet[26379]: W0624 23:44:51.706635 26379 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 24 23:44:51 localhost.localdomain kubelet[26379]: E0624 23:44:51.706743 26379 kubelet.go:2110] Container runtime network not ready: NetworkReady=fals...itialized
Jun 24 23:44:56 localhost.localdomain kubelet[26379]: W0624 23:44:56.708389 26379 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 24 23:44:56 localhost.localdomain kubelet[26379]: E0624 23:44:56.708506 26379 kubelet.go:2110] Container runtime network not ready: NetworkReady=fals...itialized
Jun 24 23:45:01 localhost.localdomain kubelet[26379]: W0624 23:45:01.710355 26379 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 24 23:45:01 localhost.localdomain kubelet[26379]: E0624 23:45:01.710556 26379 kubelet.go:2110] Container runtime network not ready: NetworkReady=fals...itialized
Jun 24 23:45:06 localhost.localdomain kubelet[26379]: W0624 23:45:06.712081 26379 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 24 23:45:06 localhost.localdomain kubelet[26379]: E0624 23:45:06.712241 26379 kubelet.go:2110] Container runtime network not ready: NetworkReady=fals...itialized
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost kubelet]#
However, systemctl start kube-scheduler still failed, and the apiserver would not come up:
Failed to start kube-scheduler.service: Unit not found.
Since I am not a Kubernetes specialist and only use k8s to host services, I simply re-ran the setup from scratch.
Master initialization:
kubeadm init --node-name=master --image-repository=registry.aliyuncs.com/google_containers --cri-socket=unix:///var/run/cri-dockerd.sock --apiserver-advertise-address=<server-IP> --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12
After that, everything worked.
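As kubeadm's post-init output reminds you, point kubectl at the new admin kubeconfig before using the cluster:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config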
A side note:
During kubeadm init, port 10259 was reported as already in use; the init hung, and re-running it still failed with the same port conflict.
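To confirm which process holds the port, something like ss (or lsof) works:
ss -lntp | grep 10259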
This showed that the port was held by the previously installed kube-apiserver service, so I ran:
systemctl stop kube-apiserver
After stopping it, the initialization succeeded.
Personally, if you need to run services on k8s without knowing it deeply (like me), and the cluster carries no business or critical services, consider restarting the cluster from scratch; see the original article: Kubernetes集群如何重启
2. kubeadm reset finds multiple CRI endpoints on the host
[root@master kubernetes]# kubeadm reset
Found multiple CRI endpoints on the host. Please define which one do you wish to use by setting the 'criSocket' field in the kubeadm configuration file: unix:///var/run/containerd/containerd.sock, unix:///var/run/cri-dockerd.sock
Append --cri-socket=unix:///var/run/cri-dockerd.sock to the command (substitute the CRI endpoint you actually use).
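For example, with cri-dockerd:
kubeadm reset --cri-socket=unix:///var/run/cri-dockerd.sock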
3. Regenerating the admin.conf file with kubeadm
Run:
kubeadm init phase kubeconfig admin
If this also reports multiple CRI endpoints, append --cri-socket=unix:///var/run/cri-dockerd.sock as in section 2 (substitute the CRI endpoint you actually use).
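In full, again assuming cri-dockerd:
kubeadm init phase kubeconfig admin --cri-socket=unix:///var/run/cri-dockerd.sock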
4. The master node stays NotReady
journalctl -xeu kubelet showed lines like:
"Unable to update cni config"
The cause: no network plugin was installed. Download kube-flannel.yml or calico.yaml and apply it with kubectl apply -f kube-flannel.yml or kubectl apply -f calico.yaml.
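For example (the manifest URLs move between releases; these are the upstream locations commonly used at the time of writing, so verify them against each project's docs):
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# or Calico, pinning the version you actually want:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml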
5. Pods stuck in Pending
Check whether the node carries a taint that the Pod does not tolerate.
Running kubectl describe node master revealed a taint on the master.
I deleted the affected Deployment with kubectl, re-created it with a matching toleration, and the Pod went to Running; both options are sketched below.
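A sketch of both routes (the node name "master" and the taint key are assumptions; read the actual key from the Taints line of the describe output):
# 1) See what taint is on the node
kubectl describe node master | grep Taints
# 2a) Remove the taint outright (the trailing "-" deletes it); the key below is
#     the common control-plane taint, substitute the one your node carries
kubectl taint nodes master node-role.kubernetes.io/master:NoSchedule-
# 2b) ...or keep the taint and add a matching toleration under
#     spec.template.spec of the Deployment:
#     tolerations:
#     - key: node-role.kubernetes.io/master
#       operator: Exists
#       effect: NoSchedule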
A recommended read on this topic; original link: Pod 一直处于 Pending 状态
Finally, a very good article on troubleshooting k8s errors; original link: Kubernetes集群管理-故障排错指南