Two clusters on the same cloud platform, with the same OS version, Kubernetes version, Docker version, Kubernetes images, and configuration. Cluster A deployed without a problem; on cluster B the network components misbehave after deployment, and their containers exit abnormally because file operations inside them fail with insufficient permissions. The only difference found so far is that a business user (uid 1000) had already been created on cluster B. Could that be the cause?

Running the image standalone with docker run, file operations inside it work fine, so the problem should be on the Kubernetes side. SELinux was suspected, but it has already been turned off.
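
A pre-existing host account with uid 1000 normally has no effect on what a container can do unless something maps host identities into the container, such as Docker user-namespace remapping or hostPath mounts with shifted ownership. Since docker run alone behaves correctly, the difference is more likely in how the nodes of the failing cluster start containers than in the image itself. A minimal side-by-side checklist, to be run on one failing node and one healthy node (commands are illustrative and assume default paths):

getenforce                                    # double-check SELinux really is Permissive/Disabled on the failing nodes
docker info --format '{{.SecurityOptions}}'   # look for userns or selinux entries
grep -i userns /etc/docker/daemon.json        # userns-remap would shift file ownership inside every container
id 1000                                       # see which account actually holds uid 1000 on the host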

Cluster B (the failing cluster): component status

[root@kfzx-yyfwq01 ~]# uname -a
Linux kfzx-yyfwq01 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@kfzx-yyfwq01 ~]# kubectl get pod -A -o wide
NAMESPACE      NAME                                               READY STATUS      RESTARTS    AGE IP               NODE
default        rocketmq-operator-6f54dbb5d8-xbmmn                 0/1   CrashLoopBackOff 185         16h 10.244.41.135    worker1
ingress-controller ingress-nginx-admission-create-tfdwt           0/1   Completed   0           20h 10.244.61.2      worker2
ingress-controller ingress-nginx-admission-patch-gxfm4            0/1   Completed   2           20h 10.244.61.1      worker2
ingress-controller ingress-nginx-controller-dz4sb                 0/1   CrashLoopBackOff 138         11h 192.168.9.47     worker3
ingress-controller ingress-nginx-controller-nnsrv                 0/1   CrashLoopBackOff 138         11h 192.168.9.45     worker1
ingress-controller ingress-nginx-controller-pzhzg                 0/1   CrashLoopBackOff 138         11h 192.168.9.46     worker2
kube-system    calico-kube-controllers-6ff5c664c4-5jk68           0/1   CrashLoopBackOff 219         13h 10.244.61.7      worker2
kube-system    calico-node-8qbps                                  1/1   Running     0           20h 192.168.9.45     worker1
kube-system    calico-node-bwwhn                                  1/1   Running     3           20h 192.168.9.46     worker2
kube-system    calico-node-xr285                                  1/1   Running     3           20h 192.168.9.47     worker3
kube-system    calico-typha-6576ff658-5wtqq                       1/1   Running     1           20h 192.168.9.46     worker2
kube-system    calico-typha-6576ff658-626vr                       1/1   Running     1           20h 192.168.9.47     worker3
kube-system    calicoctl-hdfp2                                    1/1   Running     1           20h 192.168.9.46     worker2
kube-system    calicoctl-l9wk5                                    1/1   Running     0           20h 192.168.9.45     worker1
kube-system    calicoctl-m75qj                                    1/1   Running     1           20h 192.168.9.47     worker3
[root@kfzx-yyfwq01 ~]# kdesc ingress-nginx-controller|grep Image
    Image:         registry.aliyuncs.com/kubeadm-ha/ingress-nginx_controller:v0.47.0
    Image ID:      docker-pullable://registry.aliyuncs.com/kubeadm-ha/ingress-nginx_controller@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
[root@kfzx-yyfwq01 ~]# kdesc calico-kube-controllers|grep Image
    Image:          registry.aliyuncs.com/kubeadm-ha/calico_kube-controllers:v3.19.1
    Image ID:       docker-pullable://registry.aliyuncs.com/kubeadm-ha/calico_kube-controllers@sha256:904458fe1bd56f995ef76e2c4d9a6831c506cc80f79e8fc0182dc059b1db25a4

Cluster A (the healthy cluster): component status

[root@ytzn-saas06 ~]# uname -a
Linux ytzn-saas06 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# kubectl get pod -A -o wide
ingress-controller ingress-nginx-controller-4k7kq                1/1   Running     0          105d 192.168.9.73     worker1
ingress-controller ingress-nginx-controller-cz6jv                1/1   Running     0        105d 192.168.9.75     worker3
ingress-controller ingress-nginx-controller-pzgb4                1/1   Running     0          105d 192.168.9.74     worker2
kube-system    calico-kube-controllers-6d75fbc96d-cgq4g           1/1   Running     6          105d 10.244.52.10     master2
kube-system    calico-node-2nllh                                  1/1   Running     0          105d 192.168.9.73     worker1
kube-system    calico-node-8jr77                                  1/1   Running     1          105d 192.168.9.70     master1
kube-system    calico-node-hqb6b                                  1/1   Running     21         105d 192.168.9.75     worker3
kube-system    calico-node-kcg44                                  1/1   Running     0          105d 192.168.9.71     master2
kube-system    calico-node-nlnmd                                  1/1   Running     0          105d 192.168.9.72     master3
kube-system    calico-node-thzw2                                  1/1   Running     0          105d 192.168.9.74     worker2
kube-system    calico-typha-6576ff658-8jr9n                       1/1   Running     0           33d 192.168.9.75     worker3
kube-system    calico-typha-6576ff658-d5h8z                       1/1   Running     0          105d 192.168.9.74     worker2
kube-system    calicoctl-8jzgt                                    1/1   Running     1          105d 192.168.9.70     master1
kube-system    calicoctl-hqp24                                    1/1   Running     0          105d 192.168.9.73     worker1

[root@ytzn-saas06 ~]# kdesc ingress-nginx-controller-4k7kq|grep Image
    Image:         registry.aliyuncs.com/kubeadm-ha/ingress-nginx_controller:v0.47.0
    Image ID:      docker-pullable://registry.aliyuncs.com/kubeadm-ha/ingress-nginx_controller@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
[root@ytzn-saas06 ~]# kdesc calico-kube-controllers|grep Image
    Image:          registry.aliyuncs.com/kubeadm-ha/calico_kube-controllers:v3.19.1
    Image ID:       docker-pullable://registry.aliyuncs.com/kubeadm-ha/calico_kube-controllers@sha256:904458fe1bd56f995ef76e2c4d9a6831c506cc80f79e8fc0182dc059b1db25a4

ingress-nginx-controller logs (on the failing cluster)

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v0.47.0
  Build:         7201e37633485d1f14dbe9cd7b22dd380df00a07
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.20.1

-------------------------------------------------------------------------------

I0329 00:31:51.771283       7 flags.go:208] "Watching for Ingress" class="nginx"
W0329 00:31:51.771341       7 flags.go:213] Ingresses with an empty class will also be processed by this Ingress controller
W0329 00:31:51.771634       7 client_config.go:614] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0329 00:31:51.771862       7 main.go:241] "Creating API client" host="https://10.244.64.1:443"
I0329 00:31:51.789762       7 main.go:285] "Running in Kubernetes cluster" major="1" minor="21" git="v1.21.5" state="clean" commit="aea7bbadd2fc0cd689de94a54e5b7b758869d691" platform="linux/amd64"
F0329 00:31:51.962345       7 ssl.go:389] unexpected error storing fake SSL Cert: could not create PEM certificate file /etc/ingress-controller/ssl/default-fake-certificate.pem: open /etc/ingress-controller/ssl/default-fake-certificate.pem: permission denied
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0001b4200, 0x103, 0x1e1)
        k8s.io/klog/v2@v2.4.0/klog.go:1026 +0xb9
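
The fatal line is the controller failing to create its fallback certificate under /etc/ingress-controller/ssl, a directory that ships inside the image and is expected to be writable by the non-root user the controller runs as. A quick comparison of what the image itself reports versus what the failing node produces can show whether that ownership is being lost on the node (illustrative; uses the image tag listed above):

docker run --rm --entrypoint sh registry.aliyuncs.com/kubeadm-ha/ingress-nginx_controller:v0.47.0 -c 'id; ls -ldn /etc/ingress-controller/ssl'

If the directory is owned by the running uid under plain docker run yet the pod still cannot write to it under kubelet, suspicion shifts toward the pod securityContext or the way that node unpacks image layers.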

calico-kube-controllers logs (on the failing cluster)

2023-03-29 00:31:29.141 [INFO][1] main.go 92: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0329 00:31:29.142583       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-03-29 00:31:29.143 [INFO][1] main.go 113: Ensuring Calico datastore is initialized
2023-03-29 00:31:29.157 [INFO][1] main.go 153: Getting initial config snapshot from datastore
2023-03-29 00:31:29.173 [INFO][1] main.go 156: Got initial config snapshot
2023-03-29 00:31:29.173 [INFO][1] watchersyncer.go 89: Start called
2023-03-29 00:31:29.173 [INFO][1] main.go 173: Starting status report routine
2023-03-29 00:31:29.173 [INFO][1] main.go 182: Starting Prometheus metrics server on port 9094
2023-03-29 00:31:29.173 [INFO][1] main.go 418: Starting controller ControllerType="Node"
2023-03-29 00:31:29.173 [INFO][1] node_controller.go 143: Starting Node controller
2023-03-29 00:31:29.173 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2023-03-29 00:31:29.173 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-03-29 00:31:29.173 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied

Both log excerpts point to the same problem: the processes cannot write files inside their containers because of insufficient permissions.
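
Both errors are writes to paths that already exist inside the respective images, so a useful comparison before reinstalling is the effective security settings of the failing pods versus their counterparts on the healthy cluster (illustrative; the pod names are taken from the listings above, and the PodSecurityPolicy check only matters if PSPs are in use):

kubectl -n ingress-controller get pod ingress-nginx-controller-dz4sb -o jsonpath='{.spec.securityContext}{"\n"}{.spec.containers[0].securityContext}{"\n"}'
kubectl -n kube-system get pod calico-kube-controllers-6ff5c664c4-5jk68 -o jsonpath='{.spec.securityContext}{"\n"}{.spec.containers[0].securityContext}{"\n"}'
kubectl get psp 2>/dev/null                   # is a PodSecurityPolicy mutating these pods?

If runAsUser or fsGroup differ between the two clusters even though the manifests are supposedly identical, that difference, rather than the uid 1000 host user, is the first thing to chase.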

Solution

The fault disappeared after the Kubernetes cluster was reinstalled (note: the version was changed from 1.21.5 to 1.21.14). The root cause was never identified.
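
For completeness, a quick sanity check that the reinstalled cluster really runs the new patch release and that the previously crash-looping components stay up (illustrative):

kubectl version --short | grep Server                                     # should now report v1.21.14
kubectl get pod -A | grep -E 'ingress-nginx-controller|calico-kube-controllers'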
