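Installing a 3-node FATE v1.8.0 cluster on k8s
Deploying KubeFATE on a single-master cluster
This guide uses a full k8s cluster, not minikube, to install a FATE cluster on 3 CentOS nodes.
After installing v1.8.0 I could log in to FATEBoard but could not transfer data, and I was unable to solve the problem. I therefore installed v1.7.2 instead. That guide has more specific configuration and clearer steps; see "Installing a 3-node federated learning FATE cluster v1.7.2 on k8s (the most detailed guide, with many pitfalls solved)":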
https://blog.csdn.net/Acecai01/article/details/128253844?spm=1001.2014.3001.5502
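Cluster configuration
The configuration of the 3 nodes is shown in the figure below:
When the latest KubeFATE version was 1.9.0, the k8s and ingress-nginx versions it depended on were: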
Recommended version of dependent software:
Kubernetes: v1.23.5
Ingress-nginx: v1.1.3
Upgrading K8S to 1.23.5
If your cluster's k8s version is higher than 1.19.0, you can skip this step.
Reference blogs
https://blog.csdn.net/RivenDong/article/details/121213109
https://www.cnblogs.com/cloud-yongqing/p/16629666.html
Repeat the following steps as many times as needed to upgrade K8S one release at a time from 1.18.x to 1.23.5.
Master node
yum install -y kubeadm-1.19.16-0 --disableexcludes=kubernetes
kubeadm version
kubectl drain harbor.clife.io --delete-emptydir-data --ignore-daemonsets
kubeadm upgrade plan --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
kubeadm upgrade apply v1.19.16 --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
yum install -y kubelet-1.19.16-0 kubectl-1.19.16-0
systemctl daemon-reload
systemctl restart kubelet
kubectl uncordon harbor.clife.io
Node gpu-51
Run on the master node: kubectl drain gpu-51 --ignore-daemonsets
yum install -y kubeadm-1.20.15-0 --disableexcludes=kubernetes
kubeadm upgrade node
yum install -y kubelet-1.20.15-0 kubectl-1.20.15-0 --disableexcludes=kubernetes
systemctl daemon-reload
systemctl restart kubelet
Run on the master node: kubectl uncordon gpu-51
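kubeadm does not support skipping minor versions, which is why the steps above are repeated once per release. As a minimal sketch, the helper below (upgrade_path is a hypothetical name, not part of kubeadm) only lists the minor releases to pass through; the exact patch version for each round still has to be chosen from what your yum repository offers.

```shell
#!/bin/sh
# Print each minor release to pass through when upgrading one step at a time.
# upgrade_path is a hypothetical helper for planning; it runs no upgrade itself.
upgrade_path() {
    from_minor=$1   # current minor version, e.g. 18 for 1.18.x
    to_minor=$2     # target minor version, e.g. 23 for 1.23.x
    minor=$((from_minor + 1))
    while [ "$minor" -le "$to_minor" ]; do
        echo "1.$minor"
        minor=$((minor + 1))
    done
}

# Rounds needed for the upgrade described in this article.
upgrade_path 18 23
```

Each printed release corresponds to one full round: the master-node steps first, then the drain, upgrade and uncordon steps for every worker node.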
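Removing the old FATE version
If FATE has never been installed on your cluster, skip this step.
Find the previously installed old version of FATE and delete it:
List: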
kubectl get ns
NAME STATUS AGE
default Active 504d
fate-10000 Active 459d
fate-9999 Active 459d
ingress-nginx Active 465d
istio-system Active 497d
kube-fate Active 465d
kube-node-lease Active 504d
kube-public Active 504d
kube-system Active 504d
kubernetes-dashboard Terminating 504d
kubernetes-dashboard2 Active 4d17h
kubesphere-controls-system Active 489d
kubesphere-monitoring-federated Active 489d
kubesphere-monitoring-system Active 489d
minio Active 363d
monitoring Active 362d
seldon Active 159d
seldon-system Active 502d
Delete:
kubectl delete namespace fate-10000
kubectl delete namespace fate-9999
kubectl delete namespace kube-fate
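With many namespaces on the cluster, the FATE-related ones can be picked out of the listing mechanically. This is a sketch only: fate_namespaces is a hypothetical helper, and it assumes the old deployment used the naming seen above (fate-<partyId> and kube-fate).

```shell
#!/bin/sh
# Read `kubectl get ns` output on stdin and print only FATE-related namespaces.
# Assumes namespaces are named fate-<digits> or kube-fate, as in this article.
fate_namespaces() {
    awk '$1 ~ /^(fate-[0-9]+|kube-fate)$/ { print $1 }'
}

# Usage (review the list before deleting anything):
#   kubectl get ns | fate_namespaces
#   kubectl get ns | fate_namespaces | xargs -n 1 kubectl delete namespace
```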
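Downloading KubeFATE
Link: link
Package: kubefate-k8s-v1.8.0.tar.gz
All of the following operations are performed on the Master node.
Deploying ingress-nginx
Reference: https://blog.csdn.net/qq_41296573/article/details/125809696
The deploy.yaml below deploys ingress-nginx (version 1.1.3; the latest at the time was 1.5.0). Downloading it may require a proxy: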
https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.3/deploy/static/provider/cloud/deploy.yaml
The file references 2 images that can only be pulled through a proxy; replace them with mirrors hosted in China (3 occurrences):
k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:31f47c1e202b39fadecf822a9b76370bd4baed199a005b3e7d4d1455f4fd3fe2
Change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v1.1.3
k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
Change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:v1.1.1
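The same substitution can be done with sed instead of editing by hand. The sketch below assumes the image references in deploy.yaml look exactly like the two shown above; mirror_images and the output file name deploy_mirror.yaml are hypothetical.

```shell
#!/bin/sh
# Rewrite the two k8s.gcr.io image references to the Aliyun mirror and drop
# the @sha256 digest, since the mirrored image names differ from upstream.
mirror_images() {
    mirror="registry.cn-hangzhou.aliyuncs.com/google_containers"
    sed -E \
        -e "s#k8s\.gcr\.io/ingress-nginx/controller:(v[0-9.]+)(@sha256:[0-9a-f]+)?#${mirror}/nginx-ingress-controller:\1#" \
        -e "s#k8s\.gcr\.io/ingress-nginx/kube-webhook-certgen:(v[0-9.]+)(@sha256:[0-9a-f]+)?#${mirror}/kube-webhook-certgen:\1#"
}

# Usage:
#   mirror_images < deploy.yaml > deploy_mirror.yaml
#   grep 'image:' deploy_mirror.yaml   # confirm all 3 occurrences changed
```

If you write to a new file this way, apply that file instead of ./deploy.yaml in the next step.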
Then deploy ingress-nginx:
kubectl apply -f ./deploy.yaml
Check whether ingress-nginx was deployed successfully:
[root@harbor kubefate]# kubectl get pods -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-admission-create-zh96h 0/1 Completed 0 2d23h 10.244.1.26 gpu-51 <none> <none>
ingress-nginx-admission-patch-hmgr5 0/1 Completed 1 2d23h 10.244.1.27 gpu-51 <none> <none>
ingress-nginx-controller-6995ffb95b-m87gh 1/1 Running 0 2d18h 172.17.0.8 k8s-node02 <none> <none>
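ingress-nginx was scheduled onto node k8s-node02 rather than the master node. This is normal (even when the command is run on the master, the pod is placed elsewhere).
Run the following command to check whether the configuration took effect: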
kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.196.14 <pending> 80:30428/TCP,443:30338/TCP 16m
ingress-nginx-controller-admission ClusterIP 10.1.32.33 <none> 443/TCP 16m
The EXTERNAL-IP of ingress-nginx-controller is in the pending state. After some research, I followed this blog:
Link: link
Change the EXTERNAL-IP of the ingress-nginx-controller service to the IP of node k8s-node02:
kubectl edit -n ingress-nginx service/ingress-nginx-controller
Add externalIPs at roughly the following position:
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.1.86.240
  clusterIPs:
  - 10.1.86.240
  externalIPs:
  - 10.6.17.106
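As a non-interactive alternative to kubectl edit, the same field can be set with kubectl patch and a JSON merge patch. This is a sketch under the assumption that a single external IP is wanted; external_ip_patch is a hypothetical helper that only builds the patch document.

```shell
#!/bin/sh
# Build a JSON merge patch that sets spec.externalIPs to a single address.
external_ip_patch() {
    printf '{"spec":{"externalIPs":["%s"]}}\n' "$1"
}

# Usage (replace the IP with that of your own ingress-nginx node):
#   kubectl -n ingress-nginx patch service ingress-nginx-controller \
#       --type merge -p "$(external_ip_patch 10.6.17.106)"
```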
Check again; the EXTERNAL-IP is now set:
[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.86.240 10.6.17.106 80:31872/TCP,443:32412/TCP 2d23h
ingress-nginx-controller-admission ClusterIP 10.1.41.173 <none> 443/TCP 2d23h
Deploying the KubeFATE service
1. Load the KubeFATE service image
Next, download the KubeFATE service image v1.4.4:
curl -LO https://github.com/FederatedAI/KubeFATE/releases/download/v1.8.0/kubefate-v1.4.4.docker
Note: the release tag in the URL is v1.8.0, while the image file is v1.4.4.
Then load it into the local Docker environment:
docker load < kubefate-v1.4.4.docker
Create a directory:
mkdir /home/FATE_V180
Copy kubefate-k8s-v1.8.0.tar.gz into the new directory and extract it:
tar -zxvf kubefate-k8s-v1.8.0.tar.gz
The extracted directory contains the executable kubefate, which can be moved into a PATH directory for convenience:
chmod +x ./kubefate && sudo mv ./kubefate /usr/bin
Test whether the kubefate command works:
kubefate version
* kubefate commandLine version=v1.4.4
* kubefate service connection error, resp.StatusCode=404, error: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>404 - Not Found</title>
</head>
<body>
<h1>404 - Not Found</h1>
<script type="text/javascript" src="//wpc.75674.betacdn.net/0075674/www/ec_tpm_bcon.js"></script>
</body>
</html>
The error shown above is expected at this point and is resolved later.
Apply rbac-config.yaml to create the namespace for the KubeFATE service:
kubectl apply -f ./rbac-config.yaml
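Because Docker Hub recently changed its terms of service on download limits (Dockerhub latest limitation), I recommend using the NetEase Cloud image registry in China instead of the official Docker Hub.
1. In kubefate.yaml, change the image federatedai/kubefate:v1.4.4 to hub.c.163.com/federatedai/kubefate:v1.4.4
2. sed 's/mariadb:10/hub.c.163.com\/federatedai\/mariadb:10/g' kubefate.yaml > kubefate_163.yaml
3. sed 's/registry: ""/registry: "hub.c.163.com\/federatedai"/g' cluster.yaml > cluster_163.yaml
Deploy the KubeFATE service in the kube-fate namespace. The required yaml files are already in the working directory, so use kubectl apply directly: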
kubectl apply -f ./kubefate_163.yaml
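[Note] If you deleted kubefate and ingress-nginx and are re-running this step, an error may occur; for a fix see: https://blog.csdn.net/qq_39218530/article/details/115372879
Wait a moment, roughly 10 to 20 seconds, then check whether the KubeFATE service is deployed with: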
kubectl get all,ingress -n kube-fate
One problem that can cause the kubefate pod to crash:
Startup probe failed: Get "http://10.244.1.34:8080/": dial tcp 10.244.1.34:8080: connect: connection refused
If the output resembles the following (in particular, a pod whose STATUS is Running), the KubeFATE service is deployed and running normally:
[root@harbor kubefate]# kubectl get all,ingress -n kube-fate
NAME READY STATUS RESTARTS AGE
pod/kubefate-5bf485957b-9wltd 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-bh774 0/1 ContainerStatusUnknown 1 3d1h
pod/kubefate-5bf485957b-bs8zc 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-cj7j7 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-hn2xm 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-m4hn6 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-ncbc2 0/1 Evicted 0 2d20h
pod/kubefate-5bf485957b-tznw6 1/1 Running 0 2d20h
pod/mariadb-574d4679f8-f5wc2 1/1 Running 0 2d20h
pod/mariadb-574d4679f8-mw9np 0/1 ContainerStatusUnknown 1 3d1h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubefate NodePort 10.1.151.34 <none> 8080:30053/TCP 3d1h
service/mariadb ClusterIP 10.1.150.151 <none> 3306/TCP 3d1h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kubefate 1/1 1 1 3d1h
deployment.apps/mariadb 1/1 1 1 3d1h
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubefate-5bf485957b 1 1 1 3d1h
replicaset.apps/mariadb-574d4679f8 1 1 1 3d1h
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/kubefate nginx example.com 10.6.17.106 80 3d1h
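Adding example.com to the hosts file
Because we access the KubeFATE service through the example.com domain (defined in the ingress; change it if needed), the hosts file must be configured on the machine where the kubefate command line runs (note: not the machine where Kubernetes is, but the machine where ingress-nginx is, as covered in the ingress-nginx installation section above). The FATE clusters deployed below also use example.com as the default domain. If your network has a DNS service, you can point example.com at the master machine's IP address and skip the hosts file. (Be sure to replace the IP address with your own.)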
sudo -- sh -c "echo \"10.6.17.106 example.com\" >> /etc/hosts"
[root@harbor kubefate]# ping example.com
PING example.com (10.6.17.106) 56(84) bytes of data.
64 bytes from k8s-master (10.6.17.106): icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=2 ttl=64 time=0.054 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=3 ttl=64 time=0.050 ms
^C
--- example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.041/0.048/0.054/0.007 ms
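Running the echo command above twice leaves duplicate lines in /etc/hosts. A minimal sketch of an idempotent variant follows; add_host_entry is a hypothetical helper, and the optional third argument exists only so it can be tried on a scratch file first.

```shell
#!/bin/sh
# Append "IP HOSTNAME" to a hosts file only if HOSTNAME is not already mapped.
# Arguments: IP, HOSTNAME, optional hosts file path (defaults to /etc/hosts).
add_host_entry() {
    ip=$1
    host=$2
    file=${3:-/etc/hosts}
    # Look for HOSTNAME as a whole field on any non-comment line.
    if awk -v h="$host" '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
                         END { exit !found }' "$file"; then
        echo "$host is already present in $file"
    else
        printf '%s %s\n' "$ip" "$host" >> "$file"
    fi
}

# Usage (run as root when targeting /etc/hosts):
#   add_host_entry 10.6.17.106 example.com
```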
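Edit config.yaml with vi. Only serviceurl needs to change: append the mapped port, as in serviceurl: example.com:31872 (use the port from your own output). If you have forgotten it, look up the NodePort mapped to port 80 again: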
[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.130.161 10.6.17.106 80:32415/TCP,443:32491/TCP 3d19h
ingress-nginx-controller-admission ClusterIP 10.1.78.36 <none> 443/TCP 3d19h
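The port to append to serviceurl is the number after 80: in the PORT(S) column. As a sketch, the hypothetical helper http_nodeport below extracts it from the listing, assuming the default column layout shown above.

```shell
#!/bin/sh
# Read `kubectl -n ingress-nginx get svc` output on stdin and print the
# NodePort mapped to port 80 of the ingress-nginx-controller service.
http_nodeport() {
    awk '$1 == "ingress-nginx-controller" {
        n = split($5, ports, ",")          # PORT(S) is the 5th column
        for (i = 1; i <= n; i++)
            if (ports[i] ~ /^80:/) {
                sub(/^80:/, "", ports[i])   # drop the service port
                sub(/\/.*$/, "", ports[i])  # drop the /TCP suffix
                print ports[i]
            }
    }'
}

# Usage:
#   kubectl -n ingress-nginx get svc | http_nodeport
```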
After the change, check again; the output is:
[root@harbor kubefate]# kubefate version
* kubefate commandLine version=v1.4.4
* kubefate service version=v1.4.4
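Installing FATE with KubeFATE
As planned, we install 3 federation parties with IDs 9998, 9999 and 10000. In reality these 3 parties would be fully independent, isolated organisations. To simulate that, we first create a separate namespace for each of them on Kubernetes: fate-9998 for party 9998, fate-9999 for party 9999, and fate-10000 for party 10000.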
kubectl create namespace fate-9998
kubectl create namespace fate-9999
kubectl create namespace fate-10000
The examples directory ships 3 preset examples: /kubefate/examples/party-9998/, /kubefate/examples/party-9999/ and /kubefate/examples/party-10000. The cluster.yaml of each party, for example /kubefate/examples/party-9999/cluster.yaml, can be modified as follows:
party-9998:
name: fate-9998
namespace: fate-9998
chartName: fate
chartVersion: v1.8.0
partyId: 9998
registry: "hub.c.163.com/federatedai" # switched to an image registry in China
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
backend: eggroll
ingress:
  fateboard:
    hosts:
    - name: party9998.fateboard.example.com
  client:
    hosts:
    - name: party9998.notebook.example.com
rollsite:
  type: NodePort
  nodePort: 30081
  partyList:
  - partyId: 10000
    partyIp: 10.6.17.104
    partyPort: 30101
  - partyId: 9999
    partyIp: 10.6.17.106
    partyPort: 30091
python:
  type: NodePort
  httpNodePort: 30087
  grpcNodePort: 30082
  logLevel: INFO
servingIp: 10.6.14.13
servingPort: 30085
party-9999:
name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.8.0
partyId: 9999
registry: "hub.c.163.com/federatedai"
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
backend: eggroll
ingress:
  fateboard:
    hosts:
    - name: party9999.fateboard.example.com
  client:
    hosts:
    - name: party9999.notebook.example.com
rollsite:
  type: NodePort
  nodePort: 30091
  partyList:
  - partyId: 10000
    partyIp: 10.6.17.104
    partyPort: 30101
  - partyId: 9998
    partyIp: 10.6.14.13
    partyPort: 30081
python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO
servingIp: 10.6.17.106
servingPort: 30095
party-10000:
name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.8.0
partyId: 10000
registry: "hub.c.163.com/federatedai"
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
backend: eggroll
ingress:
  fateboard:
    hosts:
    - name: party10000.fateboard.example.com
  client:
    hosts:
    - name: party10000.notebook.example.com
rollsite:
  type: NodePort
  nodePort: 30101
  partyList:
  - partyId: 9999
    partyIp: 10.6.17.106
    partyPort: 30091
  - partyId: 9998
    partyIp: 10.6.14.13
    partyPort: 30081
python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO
servingIp: 10.6.17.104
servingPort: 30105
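All three parties live in the same Kubernetes cluster, so every NodePort in these files must be unique and, with the default API server settings, fall inside 30000-32767. The following is a sketch of a pre-install check; check_nodeports is a hypothetical helper, and it assumes the keys nodePort, httpNodePort and grpcNodePort appear one per line as in the files above.

```shell
#!/bin/sh
# Scan the given cluster.yaml files and report NodePort values that are
# duplicated across files or outside the default range 30000-32767.
# Prints nothing when every port is fine.
check_nodeports() {
    cat "$@" | awk '
        $1 ~ /^(nodePort|httpNodePort|grpcNodePort):$/ {
            port = $2 + 0
            if (port < 30000 || port > 32767) print "out of range: " port
            if (seen[port]++) print "duplicate: " port
        }'
}

# Usage (file names as used by the install commands in this article):
#   check_nodeports examples/party-9998/cluster9998.yaml \
#                   examples/party-9999/cluster9999.yaml \
#                   examples/party-10000/cluster10000.yaml
```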
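Installing the FATE clusters
If everything is in order, use kubefate cluster install to deploy the three FATE clusters (for steps where I hit no pitfalls, following the official instructions is enough):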
kubefate cluster install -f ./examples/party-10000/cluster10000.yaml
kubefate cluster install -f ./examples/party-9999/cluster9999.yaml
kubefate cluster install -f ./examples/party-9998/cluster9998.yaml
KubeFATE now creates 3 jobs, one to deploy each of the three FATE clusters. Use kubefate job ls to view the jobs, or watch the cluster status in KubeFATE directly until it becomes Running:
[root@harbor kubefate]# watch kubefate cluster ls
UUID NAME NAMESPACE REVISION STATUS CHART ChartVERSION AGE
7bca70c1-236c-4931-81f8-1350cce579d4 fate-9998 fate-9998 1 Running fate v1.8.0 18m
143378db-b84d-4045-8615-11d36335d5b2 fate-9999 fate-9999 0 Creating fate v1.8.0 17m
d3e27a39-c8de-4615-96f2-29012f3edc68 fate-10000 fate-10000 0 Creating fate v1.8.0 17m
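Because this step pulls about 10 GB of images from the NetEase Cloud registry, the first run takes a while depending on your network (wait patiently until the status becomes Running). Check the download progress with: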
kubectl get po -n fate-9998
kubectl get po -n fate-9999
kubectl get po -n fate-10000
Once all images are downloaded, the result looks like this:
[root@harbor kubefate]# kubectl get po -n fate-9998
NAME READY STATUS RESTARTS AGE
client-7ccbc89559-rfr2l 1/1 Running 0 20m
clustermanager-fcb86747f-z9vq9 1/1 Running 0 20m
mysql-6d546bd578-r5fl2 1/1 Running 0 20m
nodemanager-0-66dfd58cdc-6z7mc 2/2 Running 0 20m
nodemanager-1-7b7c65c685-fz9bb 2/2 Running 0 20m
python-594cd5c47b-5l88p 2/2 Running 0 20m
rollsite-6b77d9f5f7-ll9sv 1/1 Running 0 20m
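With three namespaces to watch, it helps to print only the pods that are not yet fully up. This is a minimal sketch, assuming the default kubectl get po column layout; pods_not_ready is a hypothetical helper.

```shell
#!/bin/sh
# Read `kubectl get po` output on stdin and print pods whose READY count is
# incomplete or whose STATUS is not Running. Prints nothing when all are up.
pods_not_ready() {
    awk 'NR > 1 {
        split($2, ready, "/")   # READY column looks like 1/2
        if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
    }'
}

# Usage:
#   for ns in fate-9998 fate-9999 fate-10000; do
#       kubectl get po -n "$ns" | pods_not_ready
#   done
```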
Verifying the FATE deployment
From the kubefate cluster ls output above, the cluster ID of fate-9998 is 7bca70c1-236c-4931-81f8-1350cce579d4, that of fate-9999 is 143378db-b84d-4045-8615-11d36335d5b2, and that of fate-10000 is d3e27a39-c8de-4615-96f2-29012f3edc68. Use kubefate cluster describe to query a cluster's access details:
[root@harbor kubefate]# kubefate cluster describe 7bca70c1-236c-4931-81f8-1350cce579d4
UUID 7bca70c1-236c-4931-81f8-1350cce579d4
Name fate-9998
NameSpace fate-9998
ChartName fate
ChartVersion v1.8.0
Revision 1
Age 27m
Status Running
Spec backend: eggroll
chartName: fate
chartVersion: v1.8.0
imagePullSecrets:
- name: myregistrykey
imageTag: 1.8.0-release
ingress:
client:
hosts:
- name: party9998.notebook.example.com
fateboard:
hosts:
- name: party9998.fateboard.example.com
ingressClassName: nginx
istio:
enabled: false
modules:
- rollsite
- clustermanager
- nodemanager
- mysql
- python
- fateboard
- client
name: fate-9998
namespace: fate-9998
partyId: 9998
persistence: false
podSecurityPolicy:
enabled: false
pullPolicy: null
python:
grpcNodePort: 30082
httpNodePort: 30087
logLevel: INFO
type: NodePort
registry: hub.c.163.com/federatedai
rollsite:
nodePort: 30081
partyList:
- partyId: 10000
partyIp: 10.6.17.104
partyPort: 30101
- partyId: 9999
partyIp: 10.6.17.106
partyPort: 30091
type: NodePort
servingIp: 10.6.14.13
servingPort: 30085
Info dashboard:
- party9998.notebook.example.com
- party9998.fateboard.example.com
ip: 10.6.17.106
port: 30081
status:
containers:
client: Running
clustermanager: Running
fateboard: Running
mysql: Running
nodemanager-0: Running
nodemanager-0-eggrollpair: Running
nodemanager-1: Running
nodemanager-1-eggrollpair: Running
python: Running
rollsite: Running
deployments:
client: Available
clustermanager: Available
mysql: Available
nodemanager-0: Available
nodemanager-1: Available
python: Available
rollsite: Available
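In the returned content, Info->dashboard contains:
- The Jupyter Notebook address: party9998.notebook.example.com. This is the platform where data scientists do modelling and analysis; it already integrates FATE-Clients.
- The FATEBoard address: party9998.fateboard.example.com. FATEBoard shows the status of the current training.
Checking fate-10000 in the same way shows different dashboard URLs but the same ip, 10.6.17.106, which is the ingress-nginx address. So even a request to party10000.fateboard.example.com goes first to 10.6.17.106 rather than to 10.6.17.104, the host where fate-10000 runs.
Configuring hosts entries on the machine whose browser accesses the FATE clusters
On a Windows machine, add the domain mappings to C:\WINDOWS\system32\drivers\etc\hosts: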
10.6.17.106 party9998.notebook.example.com
10.6.17.106 party9998.fateboard.example.com
10.6.17.106 party9999.notebook.example.com
10.6.17.106 party9999.fateboard.example.com
10.6.17.106 party10000.notebook.example.com
10.6.17.106 party10000.fateboard.example.com
Note that every hostname above is mapped to IP 10.6.17.106.
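The hosts lines follow one pattern per party, so they can be generated rather than typed. The sketch below assumes the partyNNNN.notebook / partyNNNN.fateboard hostnames configured in the cluster.yaml files; party_hosts is a hypothetical helper.

```shell
#!/bin/sh
# Print hosts-file lines for the notebook and FATEBoard hostnames of each
# party, all pointing at the ingress-nginx IP.
# Arguments: ingress IP, then one or more party IDs.
party_hosts() {
    ip=$1
    shift
    for party in "$@"; do
        printf '%s party%s.notebook.example.com\n' "$ip" "$party"
        printf '%s party%s.fateboard.example.com\n' "$ip" "$party"
    done
}

# Lines for the three parties in this article.
party_hosts 10.6.17.106 9998 9999 10000
```

Paste the printed lines into the hosts file of the machine that runs the browser.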
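Log in to party10000's FATEBoard at party10000.fateboard.example.com:32415; the username and password are shown in the figure below:
Problems:
1. After one day, the FATEBoard pages for namespaces fate-9998 and fate-10000 were no longer reachable; only fate-9999's still worked. Checking:
[root@harbor kubefate]# kubectl get pods -n fate-9998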
NAME READY STATUS RESTARTS AGE
client-7ccbc89559-njr2m 1/1 Running 0 3d21h
clustermanager-fcb86747f-8zzh7 1/1 Running 0 3d21h
mysql-6d546bd578-9mfvn 1/1 Running 37 (117m ago) 3d21h
nodemanager-0-66dfd58cdc-76wqc 2/2 Running 0 3d21h
nodemanager-1-7b7c65c685-jb2gs 2/2 Running 0 3d21h
python-594cd5c47b-vl4mb 1/2 CrashLoopBackOff 473 (117s ago) 3d21h
rollsite-6b77d9f5f7-lk6dm 1/1 Running 0 3d21h
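The python pod is in CrashLoopBackOff. It contains two containers, fateboard and ping-mysql; check the logs of its ping-mysql container:
[root@harbor kubefate]# kubectl logs -f python-594cd5c47b-vl4mb -n fate-9998 -c ping-mysql
The logs show that mysql has a problem, so redeploy fate-9998's mysql directly: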
kubectl rollout restart deployment mysql -n fate-9998
Then redeploy fate-9998's python:
kubectl rollout restart deployment python -n fate-9998
Problem solved.
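In the listing above, mysql still showed Running but had restarted 37 times, which is easy to miss when scanning only the STATUS column. As a sketch, the hypothetical helper high_restart_pods below flags pods by restart count; the default threshold of 5 is an arbitrary assumption.

```shell
#!/bin/sh
# Read `kubectl get pods` output on stdin and print pods whose RESTARTS count
# exceeds a threshold (default 5), regardless of their current STATUS.
high_restart_pods() {
    threshold=${1:-5}
    # RESTARTS is the 4th column; "37 (117m ago)" still yields 37 in $4.
    awk -v t="$threshold" 'NR > 1 && $4 + 0 > t { print $1, $4 }'
}

# Usage:
#   kubectl get pods -n fate-9998 | high_restart_pods 10
```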
References:
https://blog.csdn.net/qq_32202885/article/details/125998028
https://blog.csdn.net/haveanybody/article/details/108253667