Basic information

Environment:

VirtualBox virtual machines

admin  192.168.56.3

node1 192.168.56.4

node2 192.168.56.5

Version:

Kubernetes 1.17

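Symptom:

After the virtual machines booted, kubectl get po reported that it could not connect to the apiserver. Checking the port showed that nothing was listening on 6443: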

[root@admin ~ 11:58:45]$netstat -tln |grep 6443
[root@admin ~ 11:59:12]$

Troubleshooting

Check the kubelet service status. It keeps restarting, and the PID changes each time:

[root@admin ~ 13:20:48]$systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2024-05-15 13:20:22 CST; 34s ago
     Docs: https://kubernetes.io/docs/
 Main PID: 2894 (kubelet)
    Tasks: 23
   Memory: 40.8M
   CGroup: /system.slice/kubelet.service
           ├─2894 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=reg...
           └─3144 /opt/cni/bin/calico

May 15 13:20:55 admin kubelet[2894]: E0515 13:20:55.967529    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.067700    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.176891    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.277128    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.379297    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.479875    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.581421    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.682193    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.783861    2894 kubelet.go:2263] node "admin" not found
May 15 13:20:56 admin kubelet[2894]: E0515 13:20:56.884883    2894 kubelet.go:2263] node "admin" not found

Eventually it ends up in a failed state:

[root@admin ~ 13:22:34]$systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2024-05-15 13:31:30 CST; 2s ago
     Docs: https://kubernetes.io/docs/
  Process: 11179 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 11179 (code=exited, status=255)

May 15 13:31:30 admin systemd[1]: Unit kubelet.service entered failed state.
May 15 13:31:30 admin systemd[1]: kubelet.service failed.

The kubelet log shows that it cannot connect to the apiserver:

May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.320171   22948 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.374682   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.475293   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.575481   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.676269   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.776713   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.877310   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:15 admin kubelet[22948]: E0515 13:47:15.978229   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.078848   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.179013   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.279419   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.379966   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.480930   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.586956   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.687451   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.787881   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.889449   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:16 admin kubelet[22948]: E0515 13:47:16.991167   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.091457   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.147951   22948 trace.go:116] Trace[746406383]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46 (started: 2024-05-15 13:47:06.52086871 +0800 CST m=+61.834308778) (total time: 10.627068067s):
May 15 13:47:17 admin kubelet[22948]: Trace[746406383]: [10.627068067s] [10.627068067s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.147964   22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://192.168.56.3:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dadmin&limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148130   22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get https://192.168.56.3:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.148163   22948 trace.go:116] Trace[2040972696]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/kubelet.go:458 (started: 2024-05-15 13:47:06.914382679 +0800 CST m=+62.227822762) (total time: 10.233774272s):
May 15 13:47:17 admin kubelet[22948]: Trace[2040972696]: [10.233774272s] [10.233774272s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148167   22948 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://192.168.56.3:6443/api/v1/nodes?fieldSelector=metadata.name%3Dadmin&limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: I0515 13:47:17.148196   22948 trace.go:116] Trace[543846238]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2024-05-15 13:47:07.112090149 +0800 CST m=+62.425530232) (total time: 10.036100643s):
May 15 13:47:17 admin kubelet[22948]: Trace[543846238]: [10.036100643s] [10.036100643s] END
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148199   22948 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.CSIDriver: Get https://192.168.56.3:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148244   22948 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://192.168.56.3:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 192.168.56.3:6443: connect: connection refused

May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.148279   22948 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://192.168.56.3:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/admin?timeout=10s: dial tcp 192.168.56.3:6443: connect: connection refused
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.201038   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.317900   22948 kubelet.go:2263] node "admin" not found
May 15 13:47:17 admin kubelet[22948]: E0515 13:47:17.434319   22948 kubelet.go:2263] node "admin" not found
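
Most of the log above is the repeated node "admin" not found message, which is most likely a side effect of the kubelet being unable to read its Node object. The connection refused errors against 192.168.56.3:6443 are the informative part. A minimal sketch for hiding the noise, assuming a POSIX shell and grep; filter_kubelet_noise is a hypothetical helper name, not part of any tool:

```shell
# filter_kubelet_noise: hypothetical helper. Reads kubelet log lines on stdin
# and drops the repetitive 'node "<name>" not found' entries so that the
# remaining errors (for example "connection refused") stand out.
filter_kubelet_noise() {
  grep -v '] node "[^"]*" not found$'
}

# Intended use on the control-plane node (not run here):
#   journalctl -u kubelet --no-pager --since "10 min ago" | filter_kubelet_noise
```

The pattern only matches lines where the message directly follows the source location, so longer messages such as the eviction manager error are kept.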

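Check the control-plane container status

Checking the control-plane containers with docker shows that controller-manager, etcd and scheduler all started successfully.

Only the apiserver container fails to start, and it is recreated over and over: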

[root@admin ~ 12:00:01]$d ps
CONTAINER ID        IMAGE                                               COMMAND                  CREATED             STATUS              PORTS               NAMES
2864d001eb03        0cae8d5cc64c                                        "kube-apiserver --ad…"   18 seconds ago      Up 18 seconds                           k8s_kube-apiserver_kube-apiserver-admin_kube-system_7080dfd5265027f761070484838677be_169
63755f762ebc        5eb3b7486872                                        "kube-controller-man…"   About an hour ago   Up About an hour                        k8s_kube-controller-manager_kube-controller-manager-admin_kube-system_9ea7baeb1ddc4fd0384e5e9496eb4bf1_85
b4a09a395aeb        registry.aliyuncs.com/google_containers/pause:3.1   "/pause"                 About an hour ago   Up About an hour                        k8s_POD_kube-controller-manager-admin_kube-system_9ea7baeb1ddc4fd0384e5e9496eb4bf1_74
de4d9d70e8f9        78c190f736b1                                        "kube-scheduler --au…"   About an hour ago   Up About an hour                        k8s_kube-scheduler_kube-scheduler-admin_kube-system_ef597d905c3006a0826f3e90c95561d5_85
61c6450c9bad        303ce5db0e90                                        "etcd --advertise-cl…"   About an hour ago   Up About an hour                        k8s_etcd_etcd-admin_kube-system_280de17b76b115c8fe2656f963856a33_76
b147f01426b8        registry.aliyuncs.com/google_containers/pause:3.1   "/pause"                 About an hour ago   Up About an hour                        k8s_POD_kube-scheduler-admin_kube-system_ef597d905c3006a0826f3e90c95561d5_74
dfe0fe24ecb5        registry.aliyuncs.com/google_containers/pause:3.1   "/pause"                 About an hour ago   Up About an hour                        k8s_POD_kube-apiserver-admin_kube-system_7080dfd5265027f761070484838677be_74
e06500d4ba65        registry.aliyuncs.com/google_containers/pause:3.1   "/pause"                 About an hour ago   Up About an hour                        k8s_POD_etcd-admin_kube-system_280de17b76b115c8fe2656f963856a33_74

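The container log shows that the TLS handshake with etcd fails because a certificate has expired. But everything was working two days ago, so why would it suddenly stop being valid?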

[root@admin ~ 13:53:33]$d logs daf
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0515 05:53:28.128881       1 server.go:596] external host was not specified, using 192.168.56.3
I0515 05:53:28.129007       1 server.go:150] Version: v1.17.0
I0515 05:53:28.472443       1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0515 05:53:28.472454       1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0515 05:53:28.472951       1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0515 05:53:28.472957       1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0515 05:53:28.474094       1 client.go:361] parsed scheme: "endpoint"
I0515 05:53:28.474119       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0  <nil>}]
W0515 05:53:28.476416       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
I0515 05:53:29.472266       1 client.go:361] parsed scheme: "endpoint"
I0515 05:53:29.472294       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0  <nil>}]
W0515 05:53:29.476935       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0515 05:53:29.479691       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0515 05:53:30.480415       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
 

Inspecting the certificate manually confirms that it was only valid for one year and has now expired:

[root@admin /etc/kubernetes/pki 18:35:36]$openssl x509 -in apiserver.crt  -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 5970826557764692186 (0x52dca31e970868da)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kubernetes
        Validity
            Not Before: May 12 00:21:35 2023 GMT
            Not After : May 11 00:21:35 2024 GMT
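
The Not After date (May 11, 2024) is a few days before the failure on May 15, 2024, which matches the x509 error in the apiserver log. Rather than reading each certificate by eye, the remaining lifetime can be computed. This is a rough sketch, assuming GNU date and the openssl CLI; days_left and report_certs are hypothetical helper names:

```shell
# days_left NOT_AFTER [NOW_EPOCH]
# Whole days from NOW_EPOCH (default: the current time) until the given
# notAfter date. A negative result means the certificate has already expired.
# Relies on GNU date being able to parse openssl's date format.
days_left() {
  dl_end=$(date -u -d "$1" +%s) || return 1
  dl_now=${2:-$(date -u +%s)}
  echo $(( (dl_end - dl_now) / 86400 ))
}

# report_certs DIR: print the days left for every *.crt under DIR and DIR/etcd.
report_certs() {
  for crt in "$1"/*.crt "$1"/etcd/*.crt; do
    [ -f "$crt" ] || continue
    not_after=$(openssl x509 -in "$crt" -noout -enddate | cut -d= -f2)
    printf '%6s days  %s\n' "$(days_left "$not_after")" "$crt"
  done
}

# Intended use on the control-plane node (not run here):
#   report_certs /etc/kubernetes/pki
```

On this kubeadm version, kubeadm alpha certs check-expiration prints a similar summary. Note that CA certificates normally have a much longer validity than the leaf certificates signed by them.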

Renew manually

$kubeadm alpha certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

W0623 18:41:11.297634    8269 validation.go:28] Cannot validate kube-proxy config - no validator is available
W0623 18:41:11.297665    8269 validation.go:28] Cannot validate kubelet config - no validator is available
certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

 [root@admin /etc/kubernetes/pki 18:46:34]$openssl x509  -in apiserver.crt -noout  -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4838880848778631499 (0x4327287e95e4194b)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kubernetes
        Validity
            Not Before: May 12 00:21:35 2023 GMT
            Not After : Jun 23 10:41:11 2025 GMT

Copy the new config into ~/.kube:

cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
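
The copy is needed because the admin kubeconfig embeds its client certificate, and the renewal output above reports that this embedded certificate was renewed in /etc/kubernetes/admin.conf. An older copy under ~/.kube would still carry the expired one. A small sketch for checking which certificate a kubeconfig holds, assuming a kubeadm-style file with a single user and inline client-certificate-data; kubeconfig_client_cert is a hypothetical helper name:

```shell
# kubeconfig_client_cert FILE: print the PEM client certificate embedded in a
# kubeconfig, i.e. the decoded value of the first client-certificate-data key.
kubeconfig_client_cert() {
  awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" | base64 -d
}

# Intended use (not run here): compare the expiry dates of the two copies.
#   kubeconfig_client_cert /etc/kubernetes/admin.conf | openssl x509 -noout -enddate
#   kubeconfig_client_cert "$HOME/.kube/config"       | openssl x509 -noout -enddate
```

If the two dates differ, the copy in the home directory is stale.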

Checking the pod status shows the apiserver and the other pods gradually recovering, but some of them are stuck in ImagePullBackOff.

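Looking at one of the ES pods shows that pulling its image from Docker Hub timed out. Well, since June the registry mirrors in mainland China seem to have all gone down at once.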

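In the end I had to resort to a workaround: I pulled the images on my Azure VM in Singapore and pushed them to my private image registry on Tencent Cloud. From now on I will just use the private registry.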
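
The pull, retag and push steps can be sketched as follows. This is only an illustration: the image name, tag and registry address are placeholders rather than the ones actually used, and mirror_ref and mirror_image are hypothetical helper names.

```shell
# mirror_ref IMAGE REGISTRY_NS: map an upstream image reference to the
# matching name in a private registry namespace, keeping the name and tag.
mirror_ref() {
  echo "$2/${1##*/}"
}

# mirror_image IMAGE REGISTRY_NS: pull, retag and push one image.
# Run it on a host that can reach Docker Hub and is already logged in
# (docker login) to the private registry.
mirror_image() {
  mi_target=$(mirror_ref "$1" "$2")
  docker pull "$1" && docker tag "$1" "$mi_target" && docker push "$mi_target"
}

# Example with placeholder names (not run here):
#   mirror_image elasticsearch:7.17.0 registry.example.com/myns
```

The image field of the affected workloads then has to point at the new reference, and the nodes need pull credentials if the registry is private.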
