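I. Introduction

The Kubeflow project aims to make deploying machine learning (ML) workflows on Kubernetes simple, portable, and scalable.

Kubeflow bundles a set of ML-related components and helps you develop and deploy ML applications on Kubernetes (k8s).

The figure below shows the main Kubeflow components, which cover each step of the ML lifecycle on Kubernetes.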


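II. Core Components

Web UI - Central Dashboard

The Kubeflow Central Dashboard provides an authenticated web interface for Kubeflow and its ecosystem components. It acts as a hub for the ML platform and its tools by exposing the UIs of the components running in the cluster.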

  • Home: The central hub to access recent resources, active experiments, and useful documentation.
  • Notebook Servers: To manage Notebook servers.
  • TensorBoards: To manage TensorBoard servers.
  • Models: To manage deployed KFServing models.
  • Volumes: To manage the cluster’s Volumes.
  • Experiments (AutoML): To manage Katib experiments.
  • Experiments (KFP): To manage Kubeflow Pipelines (KFP) experiments.
  • Pipelines: To manage KFP pipelines.
  • Runs: To manage KFP runs.
  • Recurring Runs: To manage KFP recurring runs.
  • Artifacts: To track ML Metadata (MLMD) artifacts.
  • Executions: To track various component executions in MLMD.
  • Manage Contributors: To configure user access sharing across namespaces in the Kubeflow deployment.

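The figure below shows Kubeflow as a platform for arranging the components of an ML system on top of Kubernetes:

Model Development - Notebook

Pipeline Orchestration - KFP

Kubeflow Pipelines (KFP) is a platform for building and deploying portable, scalable machine learning (ML) workflows using Docker containers.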

AutoML - Katib

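Katib is a project that is agnostic to machine learning (ML) frameworks. It can tune the hyperparameters of applications written in any language the user chooses, and it natively supports many ML frameworks such as TensorFlow, MXNet, PyTorch, and XGBoost.

The figure shows Katib's main features and the optimization frameworks it supports for running various AutoML algorithms.

Model Training - Training Operator

The Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models built with various ML frameworks, such as PyTorch, TensorFlow, and XGBoost.

The figure shows the main features of the Training Operator and the ML frameworks it supports.

Visual Pipeline Editor - Elyra (beta)

Inference Serving - KServe/Seldon Core/NVIDIA Triton/BentoML

Several inference serving frameworks are supported: KServe (officially recommended), Seldon Core (officially recommended), NVIDIA Triton, and BentoML.

KServe enables serverless inference on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks such as TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases.

Feature Store - Feast (alpha)

Feast is used to define, manage, discover, validate, and serve features to models during training and inference.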

III. Build and Installation

Server Planning

Environment

Server

Component

Description

node2

16 vCPUs, 16 GB RAM, 100 GB system disk, 7 x 4 TB data disks

node2

node3

Basic Dependencies

Installing Kubernetes v1.26.3 with the Docker runtime (offline) - CSDN blog

Kubeflow 1.8

Component | Description | Version
docker |  | v20.10.5
docker-compose | Compose is a tool for defining and running multi-container Docker applications. With Compose, you configure all of an application's services in a YAML file and then create and start them all with a single command. | v2.6.1
cri-dockerd |  |
Kubernetes |  | v1.26.3 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.26.md)
kubectl | Kubernetes command-line tool | v1.26
Kustomize | Kustomize introduces a template-free way to customize application configuration, which simplifies the use of off-the-shelf applications. It is now built into kubectl as apply -k. | kustomize/v5.0.3 (GitHub release, kubernetes-sigs/kustomize)

Component Dependency Versions

Kubeflow 1.8 depends on the following components:

Component | Local Manifests Path | Upstream Revision
Training Operator | apps/training-operator/upstream | v1.7.0
Notebook Controller | apps/jupyter/notebook-controller/upstream | v1.8.0
PVC Viewer Controller | apps/pvcviewer-controller/upstream | v1.8.0
Tensorboard Controller | apps/tensorboard/tensorboard-controller/upstream | v1.8.0
Central Dashboard | apps/centraldashboard/upstream | v1.8.0
Profiles + KFAM | apps/profiles/upstream | v1.8.0
PodDefaults Webhook | apps/admission-webhook/upstream | v1.8.0
Jupyter Web App | apps/jupyter/jupyter-web-app/upstream | v1.8.0
Tensorboards Web App | apps/tensorboard/tensorboards-web-app/upstream | v1.8.0
Volumes Web App | apps/volumes-web-app/upstream | v1.8.0
Katib | apps/katib/upstream | v0.16.0
KServe | contrib/kserve/kserve | v0.11.1
KServe Models Web App | contrib/kserve/models-web-app | v0.10.0
Kubeflow Pipelines | apps/pipeline/upstream | 2.0.3
Kubeflow Tekton Pipelines | apps/kfp-tekton/upstream | v2.0.3

Component | Local Manifests Path | Upstream Revision
Istio | common/istio-1-17 | 1.17.3
Knative | common/knative/knative-serving, common/knative/knative-eventing | 1.10.2, 1.10.1
Cert Manager | common/cert-manager | 1.12.2

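Service Installation

Obtaining the Manifests
# Download the release archive (saved under the file name that tar expects below)
wget -O manifests-1.8.0.tar.gz https://github.com/kubeflow/manifests/archive/refs/tags/v1.8.0.tar.gz

# Extract
tar -zxvf manifests-1.8.0.tar.gz

# List the required images so they can be pushed to the internal Harbor registry
cd manifests-1.8.0
kustomize build example | grep 'image: ' | awk '$2 != "" { print $2 }' | sort -u

# With the image list in hand, download the images; if a registry is unreachable, pull through a mirror registry address instead
# For pulling, tagging, and pushing images, see the "image files" part of the Kubernetes installation guide referenced above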
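
The pull, tag, and push loop mentioned in the comments above could look roughly like the following. This is a minimal sketch rather than the script actually used for this install: the function names and the `images.txt` file are hypothetical, the Harbor prefix `192.168.5.200:5000/ml` is the one used later in this post, and digest-pinned images are given the literal tag `sha256` to match the convention used later in this post.

```shell
#!/usr/bin/env bash
# Assumption: Harbor project prefix, taken from later sections of this post.
HARBOR_PREFIX="${HARBOR_PREFIX:-192.168.5.200:5000/ml}"

# Map an upstream image reference to its name in Harbor.
# name@sha256:<digest>  ->  <prefix>/name:sha256
# anything else         ->  <prefix>/<reference unchanged>
harbor_ref() {
  local src="$1"
  case "$src" in
    *@sha256:*) printf '%s/%s:sha256\n' "$HARBOR_PREFIX" "${src%%@*}" ;;
    *)          printf '%s/%s\n' "$HARBOR_PREFIX" "$src" ;;
  esac
}

# Pull, re-tag, and push every image listed in a file (one reference per line).
mirror_images() {
  local list="$1" src dst
  while IFS= read -r src; do
    [ -z "$src" ] && continue          # skip blank lines
    dst="$(harbor_ref "$src")"
    docker pull "$src" && docker tag "$src" "$dst" && docker push "$dst"
  done < "$list"
}
```

Usage, assuming the output of the `kustomize build ... | sort -u` command was saved to `images.txt` and the machine can reach both the upstream registries and Harbor: `mirror_images images.txt`.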

The required images are as follows:

docker.io/istio/pilot:1.17.5
docker.io/istio/proxyv2:1.17.5
docker.io/kubeflowkatib/earlystopping-medianstop:v0.16.0
docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v0.16.0
docker.io/kubeflowkatib/file-metrics-collector:v0.16.0
docker.io/kubeflowkatib/katib-controller:v0.16.0
docker.io/kubeflowkatib/katib-db-manager:v0.16.0
docker.io/kubeflowkatib/katib-ui:v0.16.0
docker.io/kubeflowkatib/mxnet-mnist:v0.16.0
docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
docker.io/kubeflowkatib/suggestion-darts:v0.16.0
docker.io/kubeflowkatib/suggestion-enas:v0.16.0
docker.io/kubeflowkatib/suggestion-goptuna:v0.16.0
docker.io/kubeflowkatib/suggestion-hyperband:v0.16.0
docker.io/kubeflowkatib/suggestion-hyperopt:v0.16.0
docker.io/kubeflowkatib/suggestion-optuna:v0.16.0
docker.io/kubeflowkatib/suggestion-pbt:v0.16.0
docker.io/kubeflowkatib/suggestion-skopt:v0.16.0
docker.io/kubeflowkatib/tfevent-metrics-collector:v0.16.0
docker.io/kubeflowmanifestswg/oidc-authservice:e236439
docker.io/kubeflownotebookswg/centraldashboard:v1.8.0
docker.io/kubeflownotebookswg/jupyter-web-app:v1.8.0
docker.io/kubeflownotebookswg/kfam:v1.8.0
docker.io/kubeflownotebookswg/notebook-controller:v1.8.0
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.8.0
docker.io/kubeflownotebookswg/profile-controller:v1.8.0
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboard-controller:v1.8.0
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.8.0
docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0
docker.io/metacontrollerio/metacontroller:v2.0.4
docker.io/seldonio/mlserver:1.3.2
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:92967bab4ad8f7d55ce3a77ba8868f3f2ce173c010958c28b9a690964ad6ee9b
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:6d35cc98baa098fc0c5b4290859e363a8350a9dadc31d1191b0b5c9796958223
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:ebf93652f0254ac56600bedf4a7d81611b3e1e7f6526c6998da5dd24cdc67ee1
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.3
gcr.io/ml-pipeline/cache-server:2.0.3
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.3
gcr.io/ml-pipeline/metadata-writer:2.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.3
gcr.io/ml-pipeline/scheduledworkflow:2.0.3
gcr.io/ml-pipeline/viewer-crd-controller:2.0.3
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
gcr.io/ml-pipeline/metadata-envoy:2.0.3
gcr.io/ml-pipeline/argoexec:v3.3.10-license-compliance
gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707
ghcr.io/dexidp/dex:v2.36.0
kserve/kserve-controller:v0.11.1
kserve/lgbserver:v0.11.1
kserve/models-web-app:v0.10.0
kserve/paddleserver:v0.11.1
kserve/pmmlserver:v0.11.1
kserve/sklearnserver:v0.11.1
kserve/xgbserver:v0.11.1
kubeflow/training-operator:v1-855e096
mysql:8.0.29
nvcr.io/nvidia/tritonserver:23.05-py3
python:3.7
pytorch/torchserve-kfs:0.8.2
quay.io/jetstack/cert-manager-cainjector:v1.12.2
quay.io/jetstack/cert-manager-controller:v1.12.2
quay.io/jetstack/cert-manager-webhook:v1.12.2
tensorflow/serving:2.6.2
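Changing image versions: example/kustomization.yaml

Some images are referenced by sha256 digest and have to be re-tagged with a version tag, so the image references must be changed before use by adding the content shown further down.

Globally replace the images in the manifests with their Harbor copies by prepending the Harbor registry address to every image reference.

I used a development tool to do the global replacement:

Match YAML files

Replace image: with image: 192.168.5.200:5000/ml/

Note: handle quoted values selectively. Exclude them by hand first, then replace everything else. Some references may still be missed.

Note: replace " docker.io/" (with a leading space) with " 192.168.5.200:5000/ml/docker.io/"

Replace : gcr.io/ with : 192.168.5.200:5000/ml/gcr.io/

Replace "image": " with "image": "192.168.5.200:5000/ml/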

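The same replacement can also be scripted instead of done in an editor. The sketch below is an assumption-laden alternative, not the procedure used for this install: the function name `prefix_images` is made up, and it only covers the two `image:` rules (unquoted YAML values and JSON-style `"image": "..."` pairs). Quoted YAML values and references outside `image:` keys are deliberately left for manual review, as noted above.

```shell
#!/usr/bin/env bash
# Assumption: same Harbor prefix as in the replacement rules above.
HARBOR_PREFIX="${HARBOR_PREFIX:-192.168.5.200:5000/ml}"

# Rewrite image references in one file, in place (a .bak copy is kept).
# Lines that already contain the prefix are skipped, so re-running is harmless.
prefix_images() {
  local file="$1" p="$HARBOR_PREFIX"
  sed -E -i.bak \
    -e "\\#${p}/#b" \
    -e "s#^([[:space:]]*(- )?image:[[:space:]]+)([^\"'[:space:]])#\\1${p}/\\3#" \
    -e "s#(\"image\":[[:space:]]*\")([^\"])#\\1${p}/\\2#" \
    "$file"
}
```

It could be applied to a fresh checkout with something like `find . -name '*.yaml' | while IFS= read -r f; do prefix_images "$f"; done`, followed by a review of `git diff` or the `.bak` files before deploying.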
Add the following to kustomization.yaml:


images:
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:92967bab4ad8f7d55ce3a77ba8868f3f2ce173c010958c28b9a690964ad6ee9b
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/controller
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:6d35cc98baa098fc0c5b4290859e363a8350a9dadc31d1191b0b5c9796958223
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:ebf93652f0254ac56600bedf4a7d81611b3e1e7f6526c6998da5dd24cdc67ee1
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/activator
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/controller
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/queue
    newTag: "sha256"
  - name: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
    newName: 192.168.5.200:5000/ml/gcr.io/knative-releases/knative.dev/serving/cmd/webhook
    newTag: "sha256"
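
Writing these entries by hand is error-prone, so they can be generated from the image list. The following is a minimal sketch under stated assumptions: the function name `digest_overrides` and the `images.txt` file are hypothetical, the input is the upstream (unprefixed) image list produced earlier, and the output simply mirrors the shape of the entries above.

```shell
#!/usr/bin/env bash
# Assumption: same Harbor prefix as used in the entries above.
HARBOR_PREFIX="${HARBOR_PREFIX:-192.168.5.200:5000/ml}"

# Read image references on stdin, one per line. For each digest-pinned
# reference, print an entry for the kustomize `images:` list; tagged
# references are ignored.
digest_overrides() {
  awk -v p="$HARBOR_PREFIX/" '
    /@sha256:/ {
      name = $0
      sub(/@sha256:.*/, "", name)   # strip the digest to get the bare name
      printf "  - name: %s%s\n    newName: %s%s\n    newTag: \"sha256\"\n", p, $0, p, name
    }'
}
```

For example, `{ echo 'images:'; digest_overrides < images.txt; }` prints a block that can be pasted into example/kustomization.yaml.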

Installing Kubeflow
# Install
while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

# Uninstall
while ! kustomize build example | kubectl delete -f -; do echo "Retrying to delete resources"; sleep 10; done

Configuring Access

HTTPS access:

Configure the certificate

kubeflow-ingressgateway-certs.yaml

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-ingressgateway-certs
  namespace: istio-system
spec:
  commonName: 192.168.6.50 #  Ex) kubeflow.mydomain.com
  issuerRef:
    kind: ClusterIssuer
    name: kubeflow-self-signing-issuer
  secretName: kubeflow-ingressgateway-certs

Run kubectl apply -f kubeflow-ingressgateway-certs.yaml to create the certificate.

Configure the gateway

gateway.yaml

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kubeflow-gateway
  namespace: kubeflow
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - "*"
    port:
      name: http
      number: 80
      protocol: HTTP
    # Upgrade HTTP to HTTPS
    # tls:
    #   httpsRedirect: true
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-ingressgateway-certs

Run kubectl apply -f gateway.yaml to open the HTTPS port.

HTTP access:

1. Temporary access with kubectl port-forward

2. Switch istio-ingressgateway to NodePort and configure the gateway

# Inspect the service
kubectl get svc/istio-ingressgateway -n istio-system

# 1. --address 0.0.0.0 allows access from other hosts; without it, only local access works
# port-forward forwards local port 8090 to port 80 of svc/istio-ingressgateway
# Run in the foreground (temporary)
kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8090:80

# Run in the background
nohup kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8090:80 &

# 2. Change the service type
# Patch directly from the command line
kubectl patch service istio-ingressgateway -n istio-system -p '{"spec":{"type":"NodePort"}}'

kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec":{"type":"LoadBalancer"}}'

# Or open the configuration in an editor
kubectl edit svc istio-ingressgateway -n istio-system
kubectl -n istio-system edit gateway istio-ingressgateway

http://ml01:8090/ Default username: user@example.com, password: 12341234

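Troubleshooting

Despite all the steps above, many images were still reported as not found after deployment.

ImagePullBackOff errors

Find the missing image, pull it again, and push it to the internal Harbor registry. If the manifest uses the latest tag, change it to match the tag stored in Harbor.

MySQL 8 component fails to start

--initialize specified but the data directory has files in it. Aborting.

Fix: delete the NFS directory that backs the mysql-pv-claim volume mounted by MySQL.

Some components fail to install

If a component fails, try deleting its pod with kubectl delete -n kubeflow pod; it is then recreated automatically.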
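
To see which pods are candidates for deletion, the STATUS column of `kubectl get pods` can be filtered. The helper below is a minimal sketch: the function name `unhealthy_pods` is made up for illustration, and treating every status other than Running or Completed as unhealthy is an assumption that may be too coarse for pods that are still starting.

```shell
#!/usr/bin/env bash
# Print the names of pods in a namespace (default: kubeflow) whose STATUS
# column is neither Running nor Completed.
# Columns of `kubectl get pods --no-headers`: NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  kubectl get pods -n "${1:-kubeflow}" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}
```

Each printed name can then be passed to `kubectl delete pod -n kubeflow <name>` once the underlying cause (for example a missing image) has been fixed.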

Notebook error
Error: [403] Could not find CSRF cookie XSRF-TOKEN in the request.
Cause: the Kubeflow UI is being accessed over HTTP rather than HTTPS, so the Jupyter web app needs the environment variable APP_SECURE_COOKIES=false.
Fix:
Run kubectl edit deployments.apps -n kubeflow jupyter-web-app-deployment
and set APP_SECURE_COOKIES to false.

If the same problem shows up in other modules, the fix is the same: find the corresponding configuration and set APP_SECURE_COOKIES=false. HTTPS access is recommended for production.
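
As a non-interactive alternative to `kubectl edit`, the variable can be set with `kubectl set env`. This is a sketch, not the method used above: the function name `disable_secure_cookies` is hypothetical, and for modules other than the Jupyter web app you need to look up the Deployment name yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set APP_SECURE_COOKIES=false on one Deployment.
# Changing the pod template makes the Deployment roll out new pods.
disable_secure_cookies() {
  local deployment="$1" namespace="${2:-kubeflow}"
  kubectl set env "deployment/${deployment}" -n "$namespace" APP_SECURE_COOKIES=false
}
```

For the case described above: `disable_secure_cookies jupyter-web-app-deployment`.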

tensorboard-controller-deployment error
Error: failed to start container "kube-rbac-proxy": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: chdir to cwd ("/home/nonroot") set in config.json failed: permission denied: unknown

Fix: edit the Deployment so that the securityContext reads as follows:
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65534

Failed to pull image "gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707"

Cause:

The kfp-driver image name is hard-coded in the source. This was fixed in gcr.io/ml-pipeline/api-server:2.0.5, where the image can be specified through a parameter.

Fix:

kubectl -n kubeflow edit deployment.apps/ml-pipeline
# Add the following entries under env
- name: V2_DRIVER_IMAGE
  value: 192.168.5.200:5000/ml/gcr.io/ml-pipeline/kfp-driver:sha256
- name: V2_LAUNCHER_IMAGE
  value: 192.168.5.200:5000/ml/gcr.io/ml-pipeline/kfp-launcher:sha256

pip

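If the environment has no external network access while running pipelines, devpi can be used to set up a proxy package index.

# One-off use
pip install -i http://192.168.4.115:8000/root/pypi/+simple/ --trusted-host 192.168.4.115 kserve

# Global use (the index is served over plain HTTP, so the host must also be trusted)
pip config set global.index-url http://192.168.4.115:8000/root/pypi/+simple/
pip config set global.trusted-host 192.168.4.115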

IV. Usage

1.
