Reference documentation:

Install Kubeflow v1.3

Note: to install locally, you only need to install MicroK8s and enable the Kubeflow add-on.
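
For reference, a local install looks roughly like the following (a sketch only; the exact snap channel and add-on availability depend on your MicroK8s release):

sudo snap install microk8s --classic
microk8s enable kubeflow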

This guide lists the steps required to install Kubeflow on any conformant Kubernetes cluster (including AKS, EKS, GKE, OpenShift, and any cluster deployed with kubeadm), provided you can access it with kubectl.

1 Install the Juju client

On Linux, install juju via snap with the following command:

snap install juju --classic

Alternatively, on macOS run brew install juju, or download the Windows installer.

2 Connect Juju to your Kubernetes cluster

To operate workloads in your Kubernetes cluster with Juju, you must add the cluster to Juju's list of clouds via the add-k8s command.

If your Kubernetes config file is in the standard location (~/.kube/config on Linux) and you only have one cluster, you can simply run:

juju add-k8s myk8s

Note: follow the earlier article on installing Portainer to obtain the kubeconfig file and install OpenEBS.

If your kubectl config file contains multiple clusters, you can specify the appropriate one by name:

juju add-k8s myk8s --cluster-name=foo

Finally, to use a different config file, you can set the KUBECONFIG environment variable to point to the relevant file. For example:

KUBECONFIG=path/to/file juju add-k8s myk8s

See the Juju documentation for more details.

3 Create a controller

Juju uses a controller to run workloads on the Kubernetes cluster. You can create a controller with the bootstrap command:

juju bootstrap myk8s my-controller

This command will create a few pods in the namespace for my-controller. You can view your controllers with the juju controllers command.
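
A quick check after bootstrap (a sketch; the controller namespace is typically named controller-<controller-name>, i.e. controller-my-controller here, though this may vary by Juju version):

juju controllers
kubectl get pods -n controller-my-controller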

You can read more about controllers in the Juju documentation.

4 Create a model

A model in Juju is a blank canvas where your operators will be deployed; it maps 1:1 to a Kubernetes namespace.

You can create a model and give it a name, e.g. kubeflow, with the add-model command; this also creates a Kubernetes namespace of the same name:

juju add-model kubeflow

You can list your models with the juju models command.
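
As an optional check, you can confirm that the matching namespace was created (assuming kubectl points at the same cluster):

kubectl get namespace kubeflow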

5 Deploy Kubeflow

Requirements:
The minimum resources needed to deploy kubeflow are 50 GB of disk space, 14 GB of RAM, and 2 CPUs available to the Linux machine or VM.
If you have fewer resources, deploy kubeflow-lite or kubeflow-edge instead.

Once you have a model, you can simply juju deploy any of the provided Kubeflow bundles into your cluster, prefixed with cs:.
For example, for the Kubeflow lite bundle, run:

juju deploy cs:kubeflow-lite
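
The same pattern works for the other bundles mentioned above, for example:

juju deploy cs:kubeflow
juju deploy cs:kubeflow-edge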

Congratulations, Kubeflow is installing!

You can watch your Kubeflow deployment with the following command:

watch -c juju status --color

6 Set the URL in the authentication methods

The final step to enable access to the Kubeflow dashboard is to provide the dashboard's public URL to dex-auth and oidc-gatekeeper with the following commands:

juju config dex-auth public-url=http://<URL>
juju config oidc-gatekeeper public-url=http://<URL>

where <URL> is the hostname that the Kubeflow dashboard responds to. For example, in a typical MicroK8s installation this URL is http://10.64.140.43.nip.io. Note that when you set up DNS, you should use the resolvable address used by istio-ingressgateway.
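
One way to obtain that hostname (a sketch; the istio-ingressgateway service name and the kubeflow namespace are assumptions based on this bundle) is to read the LoadBalancer IP and wrap it in nip.io:

kubectl get svc -n kubeflow istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
juju config dex-auth public-url=http://<that IP>.nip.io
juju config oidc-gatekeeper public-url=http://<that IP>.nip.io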

7 Add RBAC roles

Currently, to set up Kubeflow and Istio correctly when RBAC is enabled, you need to grant the istio-ingressgateway operator access to Kubernetes resources. The following command will create the appropriate role:

kubectl patch role -n kubeflow istio-ingressgateway-operator -p '{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"istio-ingressgateway-operator"},"rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]}'
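
You can verify that the patched rules took effect with (an optional check):

kubectl describe role -n kubeflow istio-ingressgateway-operator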

Having trouble?
If you run into any difficulties following these instructions, please file an issue [here](https://github.com/juju-solutions/bundle-kubeflow/issues).


The actual process was as follows:

1 Since the Juju client was already installed, the first step was skipped.

2 Add the cluster to Juju

Since the Kubernetes config file is in the standard location (~/.kube/config on Linux) and there is only one cluster, the cluster was added with the following command:

juju add-k8s myk8s

3 Create the controller:

juju bootstrap myk8s my-controller --debug

Unfortunately, the following error appeared:

 ERROR juju.cmd.juju.commands bootstrap.go:883 failed to bootstrap model: creating controller stack: creating statefulset for controller: timed out waiting for controller pod: pending:  -
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:884 (error details: [{/build/snapcraft-juju-35d6cf/parts/juju/src/cmd/juju/commands/bootstrap.go:983: failed to bootstrap model} {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:667: } {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:298: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/k8s.go:493: creating controller stack} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:502: creating statefulset for controller} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:917: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1051: timed out waiting for controller pod} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1008: pending:  - }])
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:1634 cleaning up after failed bootstrap

The suspected cause was that Kubernetes had been deployed with the standard Charmed Kubernetes #679 bundle, whose worker machines are configured with cores=4 mem=4G root-disk=16G; the disk was likely too small, causing the failure. Three worker nodes with 100 GB disks were therefore redeployed.
Fix:
1 First deploy three VMs on MAAS with "cores=4 mem=4G root-disk=100G". # It is recommended to raise the memory to 8G, since the Portainer setup from the earlier article puts a fairly high memory demand on the cluster.
2 Following the earlier article on scaling nodes in a juju + MAAS Kubernetes deployment on Ubuntu 20.04, replace the small-disk workers one at a time (see the constraints sketch after step 2.4).
2.1 Pause the node with the 16 GB disk, kubernetes-worker/0:

 juju run-action kubernetes-worker/0 pause --wait

2.2 Remove this unit:

juju remove-unit  kubernetes-worker/0

2.3 Add a unit on a 100 GB machine:

juju add-unit kubernetes-worker

2.4 Repeat the steps above two more times.
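
If new units do not automatically land on the larger machines, application-level constraints can be set before adding units (a sketch; the values mirror the VMs deployed above and are an assumption, not part of the original procedure):

juju set-constraints kubernetes-worker cores=4 mem=8G root-disk=100G
juju add-unit kubernetes-worker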

3 Bootstrap the controller again:

juju bootstrap myk8s my-controller --debug

This time it succeeded.

4 Create the model:

juju add-model kubeflow

5 Deploy Kubeflow:

juju deploy kubeflow --debug

After the deployment has been running for a while, checking the status shows many errors. Don't worry: they are caused by the international links being too busy, so the container images cannot be downloaded. The international links are usually idle between roughly 02:00 and 07:00, so it is recommended to schedule the installation with the at command; the install then goes much more smoothly and you do not need to run juju deploy kubeflow --debug repeatedly, as shown below.

at 3:00                          # schedule the job for 03:00
>juju deploy kubeflow --debug
ctrl+d                           # press Ctrl+D to save the job

at -c <job number>               # view the contents of the scheduled job
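
Pending at jobs can also be listed with the standard atq utility:

atq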

Checking in the afternoon, the status looked roughly like this:

juju status
Model     Controller     Cloud/Region  Version  SLA          Timestamp
kubeflow  my-controller  myk8s         2.9.14   unsupported  14:00:40+08:00

App                        Version                    Status   Scale  Charm                 Store       Channel  Rev  OS          Address         Message
admission-webhook          res:oci-image@1abb127      active       1  admission-webhook     charmstore  stable    10  kubernetes  10.152.183.253
argo-controller            res:oci-image@c1746ae      waiting      1  argo-controller       charmstore  stable    51  kubernetes
dex-auth                   res:oci-image@af9c1b3      active       1  dex-auth              charmstore  stable    60  kubernetes  10.152.183.133
istio-ingressgateway                                  waiting      1  istio-ingressgateway  charmstore  stable    20  kubernetes                  Waiting for istio-pilot relation data
istio-pilot                res:oci-image@e3e03b3      waiting      1  istio-pilot           charmstore  stable    20  kubernetes  10.152.183.223
jupyter-controller         res:oci-image@8c7be42      active       1  jupyter-controller    charmstore  stable    56  kubernetes
jupyter-ui                 res:oci-image@af3b8ce      active       1  jupyter-ui            charmstore  stable    10  kubernetes  10.152.183.134
kfp-api                    res:oci-image@8e60840      waiting      1  kfp-api               charmstore  stable    12  kubernetes  10.152.183.121
kfp-db                     mariadb/server:10.3        active       1  mariadb-k8s           charmstore  stable    35  kubernetes  10.152.183.137
kfp-persistence            res:oci-image@9338d08      waiting      1  kfp-persistence       charmstore  stable     9  kubernetes
kfp-schedwf                res:oci-image@4ab6488      waiting      1  kfp-schedwf           charmstore  stable     9  kubernetes
kfp-ui                     res:oci-image@04a4348      waiting      1  kfp-ui                charmstore  stable    12  kubernetes  10.152.183.153
kfp-viewer                 res:oci-image@bae62bf      active       1  kfp-viewer            charmstore  stable     9  kubernetes
kfp-viz                    res:oci-image@c90a581      waiting      1  kfp-viz               charmstore  stable     8  kubernetes  10.152.183.233
kubeflow-dashboard         res:oci-image@126c9a9      waiting      1  kubeflow-dashboard    charmstore  stable    56  kubernetes  10.152.183.32
kubeflow-profiles          res:profile-image@582b8eb  active       1  kubeflow-profiles     charmstore  stable    52  kubernetes  10.152.183.182
kubeflow-volumes           res:oci-image@a325e90      active       1  kubeflow-volumes      charmstore  stable     0  kubernetes  10.152.183.164
minio                      res:oci-image@4707912      waiting      1  minio                 charmstore  stable    55  kubernetes  10.152.183.215
mlmd                       res:oci-image@78eb66d      active       1  mlmd                  charmstore  stable     5  kubernetes  10.152.183.46
oidc-gatekeeper            res:oci-image@9bb01f7      active       1  oidc-gatekeeper       charmstore  stable    54  kubernetes  10.152.183.183
pytorch-operator           res:oci-image@08c3373      waiting      1  pytorch-operator      charmstore  stable    53  kubernetes
seldon-controller-manager  res:oci-image@82fd029      active       1  seldon-core           charmstore  stable    50  kubernetes  10.152.183.113
tfjob-operator             res:oci-image@3fabaf3      active       1  tfjob-operator        charmstore  stable     1  kubernetes

Unit                          Workload  Agent  Address     Ports                                   Message
admission-webhook/0*          active    idle   10.1.44.32  443/TCP
argo-controller/0*            error     idle   10.1.20.90                                          OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/argo-charmers/argo-controller/oci-image@sha256:c1746aec607fac57e7e5006329b58c7a566f042c5bf0cf3cbae192adc5b06bb5": failed commit on ref "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64": "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64" failed size validation: 3920502 != 24609777: failed precondition
dex-auth/0*                   active    idle   10.1.20.39  5556/TCP
istio-ingressgateway/0*       waiting   idle                                                       Waiting for istio-pilot relation data
istio-pilot/0*                error     idle   10.1.20.62  8080/TCP,15010/TCP,15012/TCP,15017/TCP  OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/istio-charmers/istio-pilot/oci-image@sha256:e3e03b31cebfc4c73d4788b83af3339685673970a5c3bf3167db399d39696ed8": failed commit on ref "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e": "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e" failed size validation: 3806023 != 29905420: failed precondition
jupyter-controller/0*         active    idle   10.1.20.48
jupyter-ui/0*                 active    idle   10.1.20.50  5000/TCP
kfp-api/0*                    error     idle   10.1.20.93  8888/TCP,8887/TCP                       OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-api/oci-image@sha256:8e608409f50a332e787923dda2ea4eb5c9f0839a4c9ff3f77d535efa03eac9e9": failed commit on ref "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446": "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446" failed size validation: 5028131 != 45380216: failed precondition
kfp-db/0*                     active    idle   10.1.20.63  3306/TCP                                ready
kfp-persistence/0*            error     idle   10.1.20.91                                          crash loop backoff: back-off 5m0s restarting failed container=ml-pipeline-persistenceagent pod=kfp-persistence-864dc895d5-xshwz_kubeflow(384f9b66-dcc5-45c4-9f23-83592dfbc228)
kfp-schedwf/0*                error     idle   10.1.20.57                                          OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-schedwf/oci-image@sha256:4ab648890dad76ea51fdfb432d95992136127340b832e3d345207b839c6db23e": failed commit on ref "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c": "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c" failed size validation: 3804543 != 21611777: failed precondition
kfp-ui/0*                     error     idle   10.1.20.92  3000/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-ui/oci-image@sha256:04a4348d6b2ec8142cc0a1dd45f738b719fef7cca5c2585ec5b935d43eab1aa8": failed commit on ref "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f": "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f" failed size validation: 4635725 != 28057227: failed precondition
kfp-viewer/0*                 active    idle   10.1.20.65
kfp-viz/0*                    error     idle   10.1.20.68  8888/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-viz/oci-image@sha256:c90a5818043da47448c4230953b265a66877bd143e4bdd991f762cf47e2a16d6": failed commit on ref "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf": "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf" failed size validation: 3811546 != 3978030: failed precondition
kubeflow-dashboard/0*         error     idle   10.1.20.95  8082/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kubeflow-dashboard/oci-image@sha256:126c9a9f0b56c9eaa614cc24f1989f9aa2d47e9cfdce70373f5ce0937a7820e2": failed commit on ref "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231": "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231" failed size validation: 4150633 != 29259154: failed precondition
kubeflow-profiles/0*          active    idle   10.1.20.71  8080/TCP,8081/TCP
kubeflow-volumes/0*           active    idle   10.1.20.72  5000/TCP
minio/0*                      error     idle   10.1.20.77  9000/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/minio-charmers/minio/oci-image@sha256:4707912566436c2c1faeedb8c085a8d40b99cdf4bb0e2414295a8936e573866e": failed commit on ref "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561": "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561" failed size validation: 3872649 != 28593534: failed precondition
mlmd/0*                       active    idle   10.1.20.84  8080/TCP
oidc-gatekeeper/0*            active    idle   10.1.20.94  8080/TCP
pytorch-operator/0*           error     idle   10.1.20.87  8443/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/pytorch-charmers/pytorch-operator/oci-image@sha256:08c3373247c853e804d74041366a3b161334d25b953e233776884ffab9012fc4": failed commit on ref "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6": "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6" failed size validation: 3775605 != 17524427: failed precondition
seldon-controller-manager/0*  active    idle   10.1.20.88  8080/TCP,4443/TCP
tfjob-operator/0*             active    idle   10.1.20.89  8443/TCP

Querying again after 8 a.m. gave a result like this:

juju status
Model     Controller     Cloud/Region  Version  SLA          Timestamp
kubeflow  my-controller  myk8s         2.9.14   unsupported  09:48:06+08:00

App                        Version                    Status   Scale  Charm                 Store       Channel  Rev  OS          Address         Message
admission-webhook          res:oci-image@1abb127      active       1  admission-webhook     charmstore  stable    10  kubernetes  10.152.183.157
argo-controller            res:oci-image@c1746ae      active       1  argo-controller       charmstore  stable    51  kubernetes
dex-auth                   res:oci-image@af9c1b3      active       1  dex-auth              charmstore  stable    60  kubernetes  10.152.183.6
istio-ingressgateway                                  waiting      1  istio-ingressgateway  charmstore  stable    20  kubernetes                  Waiting for Istio Pilot information
istio-pilot                res:oci-image@e3e03b3      active       1  istio-pilot           charmstore  stable    20  kubernetes  10.152.183.223
jupyter-controller         res:oci-image@8c7be42      active       1  jupyter-controller    charmstore  stable    56  kubernetes
jupyter-ui                 res:oci-image@af3b8ce      active       1  jupyter-ui            charmstore  stable    10  kubernetes  10.152.183.214
kfp-api                    res:oci-image@8e60840      active       1  kfp-api               charmstore  stable    12  kubernetes  10.152.183.174
kfp-db                     mariadb/server:10.3        active       1  mariadb-k8s           charmstore  stable    35  kubernetes  10.152.183.129
kfp-persistence            res:oci-image@9338d08      active       1  kfp-persistence       charmstore  stable     9  kubernetes
kfp-schedwf                res:oci-image@4ab6488      active       1  kfp-schedwf           charmstore  stable     9  kubernetes
kfp-ui                     res:oci-image@04a4348      active       1  kfp-ui                charmstore  stable    12  kubernetes  10.152.183.30
kfp-viewer                 res:oci-image@bae62bf      active       1  kfp-viewer            charmstore  stable     9  kubernetes
kfp-viz                    res:oci-image@c90a581      active       1  kfp-viz               charmstore  stable     8  kubernetes  10.152.183.34
kubeflow-dashboard         res:oci-image@126c9a9      active       1  kubeflow-dashboard    charmstore  stable    56  kubernetes  10.152.183.59
kubeflow-profiles          res:profile-image@582b8eb  active       1  kubeflow-profiles     charmstore  stable    52  kubernetes  10.152.183.48
kubeflow-volumes           res:oci-image@a325e90      active       1  kubeflow-volumes      charmstore  stable     0  kubernetes  10.152.183.209
minio                      res:oci-image@4707912      active       1  minio                 charmstore  stable    55  kubernetes  10.152.183.247
mlmd                       res:oci-image@78eb66d      active       1  mlmd                  charmstore  stable     5  kubernetes  10.152.183.167
oidc-gatekeeper            res:oci-image@9bb01f7      active       1  oidc-gatekeeper       charmstore  stable    54  kubernetes  10.152.183.4
pytorch-operator           res:oci-image@08c3373      active       1  pytorch-operator      charmstore  stable    53  kubernetes
seldon-controller-manager  res:oci-image@82fd029      active       1  seldon-core           charmstore  stable    50  kubernetes  10.152.183.215
tfjob-operator             res:oci-image@3fabaf3      active       1  tfjob-operator        charmstore  stable     1  kubernetes

Unit                          Workload  Agent  Address     Ports                                   Message
admission-webhook/0*          active    idle   10.1.29.18  443/TCP
argo-controller/0*            active    idle   10.1.29.67
dex-auth/0*                   active    idle   10.1.29.45  5556/TCP
istio-ingressgateway/0*       waiting   idle                                                       Waiting for Istio Pilot information
istio-pilot/0*                active    idle   10.1.29.68  8080/TCP,15010/TCP,15012/TCP,15017/TCP
jupyter-controller/0*         active    idle   10.1.73.23
jupyter-ui/0*                 active    idle   10.1.29.50  5000/TCP
kfp-api/0*                    active    idle   10.1.29.71  8888/TCP,8887/TCP
kfp-db/0*                     active    idle   10.1.29.66  3306/TCP                                ready
kfp-persistence/0*            active    idle   10.1.29.70
kfp-schedwf/0*                active    idle   10.1.29.47
kfp-ui/0*                     active    idle   10.1.29.69  3000/TCP
kfp-viewer/0*                 active    idle   10.1.29.24
kfp-viz/0*                    active    idle   10.1.29.51  8888/TCP
kubeflow-dashboard/0*         active    idle   10.1.29.72  8082/TCP
kubeflow-profiles/0*          active    idle   10.1.29.56  8080/TCP,8081/TCP
kubeflow-volumes/0*           active    idle   10.1.29.44  5000/TCP
minio/0*                      active    idle   10.1.29.55  9000/TCP
mlmd/0*                       active    idle   10.1.29.41  8080/TCP
oidc-gatekeeper/0*            active    idle   10.1.29.73  8080/TCP
pytorch-operator/0*           active    idle   10.1.29.46  8443/TCP
seldon-controller-manager/0*  active    idle   10.1.29.48  8080/TCP,4443/TCP
tfjob-operator/0*             active    idle   10.1.29.49  8443/TCP