Notes on a k8s v1.17 outage [expired certificates]
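Basic information
Environment:
vbox virtual machines
admin 192.168.56.3
node1 192.168.56.4
node2 192.168.56.5
Version:
1.17
Symptom:
After the virtual machines booted, kubectl get po reported that it could not connect to the apiserver. Checking the listening ports showed that nothing was listening on 6443: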
[root@admin ~ 11:58:45]$netstat -tln |grep 6443
[root@admin ~ 11:59:12]$
Troubleshooting
Check the kubelet service status. It keeps restarting, and the PID keeps changing:
[root@admin ~ 13:20:48]$systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2024-05-15 13:20:22 CST; 34s ago
Docs: https://kubernetes.io/docs/
Main PID: 2894 (kubelet)
Tasks: 23
Memory: 40.8M
CGroup: /system.slice/kubelet.service
├─2894 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=reg...
└─3144 /opt/cni/bin/calico
May 15 13:20:55 admin kubelet[2894]: E0515 13:20:55.967529 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.067700 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.176891 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.277128 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.379297 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.479875 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.581421 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.682193 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.783861 2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.884883 2894 kubelet.go:2263] node "admin" not found
Eventually it ends up in a failed state:
[root@admin ~ 13:22:34]$systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2024-05-15 13:31:30 CST; 2s ago
Docs: https://kubernetes.io/docs/
Process: 11179 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 11179 (code=exited, status=255)
May 15 13:31:30 admin systemd[1]: Unit kubelet.service entered failed state.
May 15 13:31:30 admin systemd[1]: kubelet.service failed.
The kubelet log shows that it cannot connect to the apiserver:
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.320171 22948 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.374682 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.475293 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.575481 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.676269 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.776713 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.877310 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.978229 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.078848 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.179013 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.279419 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.379966 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.480930 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.586956 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.687451 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.787881 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.889449 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.991167 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.091457 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.147951 22948 trace.go:116] Trace[746406383]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46 (started: 2024-05-15 13:47:06.52086871 +0800 CST m=+61.834308778) (total time: 10.627068067s):
May 15 13:47:17 admin kubelet[22948]: Trace[746406383]: [10.627068067s] [10.627068067s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.147964 22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://192.168.56.3:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dadmin&limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148130 22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get https://192.168.56.3:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.148163 22948 trace.go:116] Trace[2040972696]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/kubelet.go:458 (started: 2024-05-15 13:47:06.914382679 +0800 CST m=+62.227822762) (total time: 10.233774272s):
May 15 13:47:17 admin kubelet[22948]: Trace[2040972696]: [10.233774272s] [10.233774272s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148167 22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://192.168.56.3:6443/api/v1/nodes?fieldSelector=metadata.name%3Dadmin&limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.148196 22948 trace.go:116] Trace[543846238]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2024-05-15 13:47:07.112090149 +0800 CST m=+62.425530232) (total time: 10.036100643s):
May 15 13:47:17 admin kubelet[22948]: Trace[543846238]: [10.036100643s] [10.036100643s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148199 22948 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.CSIDriver: Get https://192.168.56.3:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148244 22948 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://192.168.56.3:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148279 22948 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://192.168.56.3:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/admin?timeout=10s: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.201038 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.317900 22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.434319 22948 kubelet.go:2263] node "admin" not found
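Most of the lines above repeat the same node "admin" not found message, which is only a side effect of the apiserver being unreachable. A minimal sketch for collapsing such a log into distinct error messages with counts, so the connection refused lines stand out. The function name summarize_kubelet_errors is made up for illustration, and the input is assumed to be klog-formatted lines like the ones shown here:

```shell
# Collapse klog error lines read from stdin into "count message" pairs.
# Hypothetical helper; assumes lines shaped like
#   ... kubelet[PID]: E0515 13:47:15.374682 22948 kubelet.go:2263] message
summarize_kubelet_errors() {
  grep -E ' E[0-9]{4} ' |                  # keep only klog error lines (severity E + MMDD)
    sed -E 's/^.*\.go:[0-9]+\] //' |       # drop everything up to the "file.go:line] " prefix
    sed -E 's/^.*: (dial tcp .*)$/\1/' |   # reduce request errors to the network failure itself
    sort | uniq -c | sort -rn              # count identical messages, most frequent first
}
```

It could be fed with something like journalctl -u kubelet --no-pager | summarize_kubelet_errors.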
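Check the control plane container status
Looking at the control plane containers through docker (d below is presumably a shell alias for docker) shows that controller-manager, etcd, and scheduler all started successfully.
Only the apiserver container fails to start and keeps being recreated: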
[root@admin ~ 12:00:01]$d ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2864d001eb03 0cae8d5cc64c "kube-apiserver --ad…" 18 seconds ago Up 18 seconds k8s_kube-apiserver_kube-apiserver-admin_kube-system_7080dfd5265027f761070484838677be_169
63755f762ebc 5eb3b7486872 "kube-controller-man…" About an hour ago Up About an hour k8s_kube-controller-manager_kube-controller-manager-admin_kube-system_9ea7baeb1ddc4fd0384e5e9496eb4bf1_85
b4a09a395aeb registry.aliyuncs.com/google_containers/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-controller-manager-admin_kube-system_9ea7baeb1ddc4fd0384e5e9496eb4bf1_74
de4d9d70e8f9 78c190f736b1 "kube-scheduler --au…" About an hour ago Up About an hour k8s_kube-scheduler_kube-scheduler-admin_kube-system_ef597d905c3006a0826f3e90c95561d5_85
61c6450c9bad 303ce5db0e90 "etcd --advertise-cl…" About an hour ago Up About an hour k8s_etcd_etcd-admin_kube-system_280de17b76b115c8fe2656f963856a33_76
b147f01426b8 registry.aliyuncs.com/google_containers/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-scheduler-admin_kube-system_ef597d905c3006a0826f3e90c95561d5_74
dfe0fe24ecb5 registry.aliyuncs.com/google_containers/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-apiserver-admin_kube-system_7080dfd5265027f761070484838677be_74
e06500d4ba65 registry.aliyuncs.com/google_containers/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_etcd-admin_kube-system_280de17b76b115c8fe2656f963856a33_74
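The trailing number in each container name is a useful hint here. With the docker runtime, kubelet names containers k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt>, so _169 on the apiserver against _85 on the others means it has been restarted far more often. A small sketch that extracts that attempt counter; restart_counts is a hypothetical helper name, and it assumes one container name per line on stdin:

```shell
# Print "attempt container" for each kubelet-managed container name on stdin,
# highest attempt first. Pause ("POD") sandbox containers are skipped.
restart_counts() {
  awk -F_ '$1 == "k8s" && $2 != "POD" { print $NF, $2 }' | sort -rn
}
```

For example: docker ps --format '{{.Names}}' | restart_counts.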
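The container log shows that the TLS handshake with etcd fails because a certificate has expired. But everything was fine two days ago, so why would it suddenly be invalid?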
[root@admin ~ 13:53:33]$d logs daf
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0515 05:53:28.128881 1 server.go:596] external host was not specified, using 192.168.56.3
I0515 05:53:28.129007 1 server.go:150] Version: v1.17.0
I0515 05:53:28.472443 1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0515 05:53:28.472454 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0515 05:53:28.472951 1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0515 05:53:28.472957 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0515 05:53:28.474094 1 client.go:361] parsed scheme: "endpoint"
I0515 05:53:28.474119 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 <nil>}]
W0515 05:53:28.476416 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
I0515 05:53:29.472266 1 client.go:361] parsed scheme: "endpoint"
I0515 05:53:29.472294 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 <nil>}]
W0515 05:53:29.476935 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0515 05:53:29.479691 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0515 05:53:30.480415 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
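Checking the certificate by hand confirms it: the certificate is valid for only one year (kubeadm issues these certificates with a one-year validity by default), and that year has just run out: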
[root@admin /etc/kubernetes/pki 18:35:36]$openssl x509 -in apiserver.crt -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 5970826557764692186 (0x52dca31e970868da)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=kubernetes
Validity
Not Before: May 12 00:21:35 2023 GMT
Not After : May 11 00:21:35 2024 GMT
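kubeadm generates the certificates under /etc/kubernetes/pki together when the control plane is initialised, so the leaf certificates usually expire at about the same time. A hedged sketch for checking the whole directory rather than one file at a time, relying only on openssl x509 -checkend and -enddate. The helper names cert_expires_within and report_expiring_certs are illustrative and not part of kubeadm:

```shell
# Succeed (exit 0) if the certificate file $1 is expired or expires within $2 seconds.
# Note: a file that openssl cannot parse is also reported as expiring.
cert_expires_within() {
  [ -r "$1" ] || return 2                  # unreadable file: treat as "not reported"
  if openssl x509 -checkend "$2" -noout -in "$1" >/dev/null 2>&1; then
    return 1                               # still valid beyond the window
  fi
  return 0
}

# Print every *.crt under directory $1 that expires within $2 seconds (default: 30 days).
report_expiring_certs() {
  find "$1" -name '*.crt' | sort | while read -r crt; do
    if cert_expires_within "$crt" "${2:-2592000}"; then
      printf '%s\t%s\n' "$crt" "$(openssl x509 -enddate -noout -in "$crt" | cut -d= -f2)"
    fi
  done
}
```

Running report_expiring_certs /etc/kubernetes/pki ahead of time would have flagged the problem a month early. kubeadm of this generation also has kubeadm alpha certs check-expiration, which should report similar information.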
Manual renewal
$kubeadm alpha certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration
W0623 18:41:11.297634 8269 validation.go:28] Cannot validate kube-proxy config - no validator is available
W0623 18:41:11.297665 8269 validation.go:28] Cannot validate kubelet config - no validator is available
certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
[root@admin /etc/kubernetes/pki 18:46:34]$openssl x509 -in apiserver.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 4838880848778631499 (0x4327287e95e4194b)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=kubernetes
Validity
Not Before: May 12 00:21:35 2023 GMT
Not After : Jun 23 10:41:11 2025 GMT
Copy the new config into .kube:
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
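The copy is needed because admin.conf embeds the admin client certificate (the first certificate listed in the renew output above), so an older $HOME/.kube/config still carries the expired one. A sketch for confirming which certificate a kubeconfig holds. It assumes the certificate is inlined as base64 under client-certificate-data, as in kubeadm-generated files, and kubeconfig_cert_enddate is a made-up helper name:

```shell
# Print the notAfter date of the first client certificate embedded in kubeconfig file $1.
kubeconfig_cert_enddate() {
  awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" |  # base64 blob of the PEM cert
    base64 -d |                                                     # back to PEM
    openssl x509 -noout -enddate                                    # prints notAfter=<date>
}
```

Comparing kubeconfig_cert_enddate /etc/kubernetes/admin.conf with kubeconfig_cert_enddate "$HOME/.kube/config" shows whether the copy is still stale.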
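Checking the pod status shows the apiserver and the other pods gradually recovering, but some of them are stuck in ImagePullBackOff.
Inspecting one of the es pods shows that pulling its image from Docker Hub times out. Well, since June the registry mirrors in mainland China seem to have gone down all at once.
In the end I had to resort to a bit of "magic": I downloaded the images on my Azure VM in Singapore (Southeast Asia) and pushed them to my private image registry on Tencent Cloud. From now on I will just use the private registry.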
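The relay described above comes down to pull, retag, push, run from a machine that can reach Docker Hub. A minimal sketch, assuming docker is already logged in to the private registry. registry.example.com/mirror and the image tag in the usage line are placeholders rather than the actual registry or image, and mirror_ref / mirror_image are hypothetical helper names:

```shell
# Map a source image reference $2 to its name under private registry prefix $1,
# keeping only the last path component and its tag.
mirror_ref() {
  printf '%s/%s\n' "$1" "${2##*/}"
}

# Pull image $2 from its original registry, retag it for registry $1, and push it there.
mirror_image() {
  target="$(mirror_ref "$1" "$2")"
  docker pull "$2" &&
    docker tag "$2" "$target" &&
    docker push "$target"
}
```

For example, mirror_image registry.example.com/mirror elasticsearch:7.17.0 would push registry.example.com/mirror/elasticsearch:7.17.0, which the pod spec can then reference instead of the Docker Hub name.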