Kubeflow v1.8 Offline Installation
1. Introduction
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
Kubeflow bundles a set of ML-related components and helps you develop and deploy machine learning applications on Kubernetes.
The figure below shows the main Kubeflow components, covering each step of the ML lifecycle on Kubernetes.
2. Core Components
UI - Central Dashboard
The Kubeflow Central Dashboard provides an authenticated web interface for Kubeflow and ecosystem components. It acts as the hub of the machine learning platform and tooling by exposing the UIs of components running in the cluster.
- Home: the central hub to access recent resources, active experiments, and useful documentation.
- Notebook Servers: To manage Notebook servers.
- TensorBoards: To manage TensorBoard servers.
- Models: To manage deployed KFServing models.
- Volumes: To manage the cluster’s Volumes.
- Experiments (AutoML): To manage Katib experiments.
- Experiments (KFP): To manage Kubeflow Pipelines (KFP) experiments.
- Pipelines: To manage KFP pipelines.
- Runs: To manage KFP runs.
- Recurring Runs: To manage KFP recurring runs.
- Artifacts: To track ML Metadata (MLMD) artifacts.
- Executions: To track various component executions in MLMD.
- Manage Contributors: To configure user access sharing across namespaces in the Kubeflow deployment.
The figure below shows Kubeflow as a platform for arranging ML system components on top of Kubernetes:
Model Development - Notebooks
Pipeline Orchestration - KFP
Kubeflow Pipelines (KFP) is a platform for building and deploying portable and scalable machine learning (ML) workflows using Docker containers.
AutoML - Katib
Katib is a machine learning (ML) framework-agnostic project. It can tune the hyperparameters of applications written in any language of the user's choosing and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, and XGBoost.
The figure shows Katib's main features and the optimization frameworks it supports for running various AutoML algorithms.
Model Training - Training Operator
The Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, and XGBoost.
The figure shows the Training Operator's main features and supported ML frameworks.
Pipeline Visual Editor - Elyra (beta)
Inference Serving - KServe/Seldon Core/NVIDIA Triton/BentoML
Several inference-serving frameworks are supported: KServe (officially recommended), Seldon Core (officially recommended), NVIDIA Triton, and BentoML.
KServe enables serverless inference on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks such as TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX, to solve production model-serving use cases.
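As a concrete illustration, the sketch below deploys a minimal KServe InferenceService from the command line. It is only a sketch: the name, namespace, and storageUri are placeholders, and an offline cluster would point storageUri at internal storage (a PVC or in-cluster MinIO/S3 bucket) instead of the public example bucket.
# Minimal KServe InferenceService sketch; name, namespace, and storageUri
# are placeholders -- replace storageUri with a model path reachable offline.
kubectl apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-demo
  namespace: kubeflow-user-example-com
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF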
Feature Store - Feast (alpha)
During training and inference, Feast is used to define, manage, discover, validate, and serve features to models.
3. Installation
Server Planning
Environment
Servers
| Node | Spec |
| --- | --- |
| node1 | 16 cores / 16 GB RAM, 100 GB system disk, 7 × 4 TB data disks |
| node2 | |
| node3 | |
Base Dependencies
Kubernetes v1.26.3 installation with the Docker runtime (offline) - CSDN blog
Kubeflow 1.8
| Component | Description | Version | Link |
| --- | --- | --- | --- |
| docker | | v20.10.5 | |
| docker-compose | Compose is a tool for defining and running multi-container Docker applications. You configure all of an application's services in a YAML file, then create and start them all with a single command. | v2.6.1 | |
| cri-dockerd | | | |
| Kubernetes | | v1.26.3 | https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.26.md |
| kubectl | Kubernetes command-line tool | v1.26 | |
| Kustomize | Kustomize introduces a template-free way to customize application configuration, simplifying the use of off-the-shelf applications; it is now built into kubectl as apply -k. | v5.0.3 | Release kustomize/v5.0.3 · kubernetes-sigs/kustomize · GitHub |
Component Dependency Versions
Kubeflow 1.8 depends on the following components:
| Component | Local Manifests Path | Upstream Revision |
| --- | --- | --- |
| Training Operator | apps/training-operator/upstream | |
| Notebook Controller | apps/jupyter/notebook-controller/upstream | |
| PVC Viewer Controller | apps/pvcviewer-controller/upstream | |
| Tensorboard Controller | apps/tensorboard/tensorboard-controller/upstream | |
| Central Dashboard | apps/centraldashboard/upstream | |
| Profiles + KFAM | apps/profiles/upstream | |
| PodDefaults Webhook | apps/admission-webhook/upstream | |
| Jupyter Web App | apps/jupyter/jupyter-web-app/upstream | |
| Tensorboards Web App | apps/tensorboard/tensorboards-web-app/upstream | |
| Volumes Web App | apps/volumes-web-app/upstream | |
| Katib | apps/katib/upstream | |
| KServe | contrib/kserve/kserve | |
| KServe Models Web App | contrib/kserve/models-web-app | |
| Kubeflow Pipelines | apps/pipeline/upstream | |
| Kubeflow Tekton Pipelines | apps/kfp-tekton/upstream | |
| Component | Local Manifests Path | Upstream Revision |
| --- | --- | --- |
| Istio | common/istio-1-17 | |
| Knative | common/knative/knative-serving, common/knative/knative-eventing | |
| Cert Manager | common/cert-manager | |
Service Installation
Fetching the Release
# Download the release archive
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.8.0.tar.gz -O manifests-1.8.0.tar.gz
# Extract it
tar -zxvf manifests-1.8.0.tar.gz
# List the images the manifests need, so they can be pushed to the internal Harbor
cd manifests-1.8.0
kustomize build example |grep 'image: '|awk '$2 != "" { print $2}' |sort -u
# With the image list in hand, download the images; for registries that are unreachable, pull through a mirror/proxy address
# See the "image files" section of the Kubernetes installation post above for the pull/tag/push workflow
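To automate the pull/tag/push cycle, a minimal sketch follows. It assumes the list printed by the kustomize command above was saved to a hypothetical images.txt, and that 192.168.5.200:5000/ml is the Harbor project used throughout this post; digest-pinned images are re-tagged with the literal tag sha256 to match the kustomization overrides further down.
#!/usr/bin/env bash
# Hypothetical mirroring helper: pull each image, re-tag it under Harbor, push it.
HARBOR=192.168.5.200:5000/ml
while read -r img; do
  [ -z "$img" ] && continue
  docker pull "$img"
  # Images pinned by digest cannot be pushed under a digest reference,
  # so rewrite "@sha256:<digest>" into the literal tag ":sha256".
  target="$HARBOR/${img/@sha256:*/:sha256}"
  docker tag "$img" "$target"
  docker push "$target"
done < images.txt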
The required images are:
docker.io/istio/pilot:1.17.5
docker.io/istio/proxyv2:1.17.5
docker.io/kubeflowkatib/earlystopping-medianstop:v0.16.0
docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v0.16.0
docker.io/kubeflowkatib/file-metrics-collector:v0.16.0
docker.io/kubeflowkatib/katib-controller:v0.16.0
docker.io/kubeflowkatib/katib-db-manager:v0.16.0
docker.io/kubeflowkatib/katib-ui:v0.16.0
docker.io/kubeflowkatib/mxnet-mnist:v0.16.0
docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
docker.io/kubeflowkatib/suggestion-darts:v0.16.0
docker.io/kubeflowkatib/suggestion-enas:v0.16.0
docker.io/kubeflowkatib/suggestion-goptuna:v0.16.0
docker.io/kubeflowkatib/suggestion-hyperband:v0.16.0
docker.io/kubeflowkatib/suggestion-hyperopt:v0.16.0
docker.io/kubeflowkatib/suggestion-optuna:v0.16.0
docker.io/kubeflowkatib/suggestion-pbt:v0.16.0
docker.io/kubeflowkatib/suggestion-skopt:v0.16.0
docker.io/kubeflowkatib/tfevent-metrics-collector:v0.16.0
docker.io/kubeflowmanifestswg/oidc-authservice:e236439
docker.io/kubeflownotebookswg/centraldashboard:v1.8.0
docker.io/kubeflownotebookswg/jupyter-web-app:v1.8.0
docker.io/kubeflownotebookswg/kfam:v1.8.0
docker.io/kubeflownotebookswg/notebook-controller:v1.8.0
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.8.0
docker.io/kubeflownotebookswg/profile-controller:v1.8.0
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboard-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.8.0
docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0
docker.io/metacontrollerio/metacontroller:v2.0.4
docker.io/seldonio/mlserver:1.3.2
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:92967bab4ad8f7d55ce3a77ba8868f3f2ce173c010958c28b9a690964ad6ee9b
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:6d35cc98baa098fc0c5b4290859e363a8350a9dadc31d1191b0b5c9796958223
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:ebf93652f0254ac56600bedf4a7d81611b3e1e7f6526c6998da5dd24cdc67ee1
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.3
gcr.io/ml-pipeline/cache-server:2.0.3
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.3
gcr.io/ml-pipeline/metadata-writer:2.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.3
gcr.io/ml-pipeline/scheduledworkflow:2.0.3
gcr.io/ml-pipeline/viewer-crd-controller:2.0.3
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
gcr.io/ml-pipeline/metadata-envoy:2.0.3
gcr.io/ml-pipeline/argoexec:v3.3.10-license-compliance
gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707
ghcr.io/dexidp/dex:v2.36.0
kserve/kserve-controller:v0.11.1
kserve/lgbserver:v0.11.1
kserve/models-web-app:v0.10.0
kserve/paddleserver:v0.11.1
kserve/pmmlserver:v0.11.1
kserve/sklearnserver:v0.11.1
kserve/xgbserver:v0.11.1
kubeflow/training-operator:v1-855e096
mysql:8.0.29
nvcr.io/nvidia/tritonserver:23.05-py3
python:3.7
pytorch/torchserve-kfs:0.8.2
quay.io/jetstack/cert-manager-cainjector:v1.12.2
quay.io/jetstack/cert-manager-controller:v1.12.2
quay.io/jetstack/cert-manager-webhook:v1.12.2
tensorflow/serving:2.6.2
Image version overrides: example/kustomization.yaml
Images referenced by a sha256 digest have to be re-tagged with a regular tag before they can live in Harbor, so their references must be rewritten; add the overrides shown below.
First, globally replace the image references with the Harbor registry by prepending the Harbor address to every image. I did this with a global search-and-replace in an IDE, matching *.yaml files:
"image: " → "image: 192.168.5.200:5000/ml/"
Watch out for quoted occurrences: exclude them manually, then run the bulk replace; even so, a few references may still be missed.
Note: " docker.io/" (with a leading space) → " 192.168.5.200:5000/ml/docker.io/"
": gcr.io/" → ": 192.168.5.200:5000/ml/gcr.io/"
"image": " → "image": "192.168.5.200:5000/ml/
Add the following to example/kustomization.yaml:
images:
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:92967bab4ad8f7d55ce3a77ba8868f3f2ce173c010958c28b9a690964ad6ee9b
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/controller
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:6d35cc98baa098fc0c5b4290859e363a8350a9dadc31d1191b0b5c9796958223
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:ebf93652f0254ac56600bedf4a7d81611b3e1e7f6526c6998da5dd24cdc67ee1
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/activator
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/controller
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/queue
  newTag: "sha256"
- name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
  newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/webhook
  newTag: "sha256"
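A quick sanity check after the overrides: render the manifests again and confirm no image still points directly at a public registry (run from the manifests-1.8.0 directory).
# Any output here is an image reference that escaped the rewrite.
kustomize build example | grep 'image: ' | grep -v '192.168.5.200:5000/ml' | sort -u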
Installing Kubeflow
# Install
while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# Uninstall
while ! kustomize build example | kubectl delete -f -; do echo "Retrying to delete resources"; sleep 10; done
Configuring Access
HTTPS access:
Configure the certificate
kubeflow-ingressgateway-certs.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-ingressgateway-certs
  namespace: istio-system
spec:
  commonName: 192.168.6.50 # e.g. kubeflow.mydomain.com
  issuerRef:
    kind: ClusterIssuer
    name: kubeflow-self-signing-issuer
  secretName: kubeflow-ingressgateway-certs
Run kubectl apply -f kubeflow-ingressgateway-certs.yaml to create the certificate.
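Before touching the gateway, it is worth confirming the certificate was actually issued; the self-signing issuer should mint the TLS secret within seconds:
# READY should be True, and the TLS secret should exist.
kubectl -n istio-system get certificate kubeflow-ingressgateway-certs
kubectl -n istio-system get secret kubeflow-ingressgateway-certs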
Configure the gateway
gateway.yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kubeflow-gateway
  namespace: kubeflow
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - "*"
    port:
      name: http
      number: 80
      protocol: HTTP
    # Upgrade HTTP to HTTPS
    # tls:
    #   httpsRedirect: true
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-ingressgateway-certs
Run kubectl apply -f gateway.yaml to enable the HTTPS port.
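Assuming the ingress gateway is reachable on the address in the certificate (192.168.6.50 here), a quick curl confirms HTTPS is answering; -k is needed because the certificate is self-signed.
# -k skips verification of the self-signed certificate.
curl -vk https://192.168.6.50/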
HTTP access:
1. kubectl port-forward for temporary access
2. Switch istio-ingressgateway to NodePort and configure the gateway
# Inspect the service
kubectl get svc/istio-ingressgateway -n istio-system
# 1. --address 0.0.0.0 makes the forward reachable from other hosts; without it, only localhost can connect
#    port-forward maps local port 8090 to port 80 of svc/istio-ingressgateway
# Run in the foreground
kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8090:80
# Run in the background
nohup kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8090:80 &
# 2. Change the service type
# Patch directly
kubectl patch service istio-ingressgateway -n istio-system -p '{"spec":{"type":"NodePort"}}'
kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec":{"type":"LoadBalancer"}}'
# Or edit the config by hand
kubectl edit svc istio-ingressgateway -n istio-system
kubectl -n istio-system edit gateway istio-ingressgateway
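After switching to NodePort, the assigned ports can be read straight off the service:
# Print the NodePort assigned to each listener of the ingress gateway.
kubectl -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.nodePort}{"\n"}{end}'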
http://ml01:8090/ — default username: user@example.com, password: 12341234
Troubleshooting
Even after all the steps above, some pods may still fail with ImagePullBackOff because their images cannot be found.
Locate the missing images and pull them into the internal Harbor as well; if some manifests reference a latest tag, change it to match the tag that exists in Harbor.
MySQL 8 fails to start
--initialize specified but the data directory has files in it. Aborting.
Fix: delete the NFS directory backing the mysql-pv-claim volume that MySQL mounts.
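To find which NFS path to clear, follow the claim to its PersistentVolume first; the claim name below matches this deployment, but verify everything in your own cluster before deleting anything:
# Resolve the PV bound to the claim, then read its NFS export path.
PV=$(kubectl -n kubeflow get pvc mysql-pv-claim -o jsonpath='{.spec.volumeName}')
kubectl get pv "$PV" -o jsonpath='{.spec.nfs.server}{":"}{.spec.nfs.path}{"\n"}'
# Then, on the NFS server, remove the contents of that path and let the pod restart.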
If some components fail to install, try deleting the pod with kubectl delete -n kubeflow pod <pod-name>; it will be recreated automatically.
Notebook creation error
Error: [403] Could not find CSRF cookie XSRF-TOKEN in the request.
Cause: the Kubeflow UI is being accessed over http (not https), so the Jupyter web app must run with the environment variable APP_SECURE_COOKIES=false.
Fix:
Run kubectl edit deployments.apps -n kubeflow jupyter-web-app-deployment
and set APP_SECURE_COOKIES to false.
Other modules that show this problem are fixed the same way: find the corresponding Deployment and set APP_SECURE_COOKIES=false. For production, HTTPS access is recommended instead.
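The same change can be made without opening an editor; kubectl set env patches the Deployment in place and triggers a rollout:
# Non-interactive equivalent of the edit above.
kubectl -n kubeflow set env deployment/jupyter-web-app-deployment APP_SECURE_COOKIES=false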
tensorboard-controller-deployment error
Error: failed to start container "kube-rbac-proxy": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: chdir to cwd ("/home/nonroot") set in config.json failed: permission denied: unknown
Fix: edit the Deployment and change the container's securityContext to:
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65534
Failed to pull image "gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707"
Cause:
The kfp-driver image name is hard-coded in the source; this was fixed in gcr.io/ml-pipeline/api-server:2.0.5, where the image can be supplied via a parameter.
Fix:
kubectl -n kubeflow edit deployment.apps/ml-pipeline
# Add the following entries under env:
- name: V2_DRIVER_IMAGE
  value: 192.168.5.200:5000/ml/gcr.io/ml-pipeline/kfp-driver:sha256
- name: V2_LAUNCHER_IMAGE
  value: 192.168.5.200:5000/ml/gcr.io/ml-pipeline/kfp-launcher:sha256
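After saving, confirm the variables landed and the pipeline API server rolled out cleanly:
# The env list should now include V2_DRIVER_IMAGE and V2_LAUNCHER_IMAGE.
kubectl -n kubeflow get deployment ml-pipeline -o jsonpath='{.spec.template.spec.containers[0].env}'
kubectl -n kubeflow rollout status deployment/ml-pipeline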
pip
If pipelines need Python packages but the cluster has no internet access, devpi can be used to build a PyPI proxy/mirror.
# One-off use
pip install -i http://192.168.4.115:8000/root/pypi/+simple/ --trusted-host 192.168.4.115 kserve
# Global configuration
pip config set global.index-url http://192.168.4.115:8000/root/pypi/+simple/
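Since the index is served over plain http, marking the host trusted globally avoids repeating --trusted-host on every install:
# Companion to the global index-url setting above.
pip config set global.trusted-host 192.168.4.115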
4. Usage
1.