From Zero to One: Quickly Building a Local Kubeflow AI/Machine Learning Platform on K3s
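Background
Kubeflow is an open-source, Kubernetes-native framework for developing, managing, and running machine learning workloads. It supports many popular machine learning frameworks, including PyTorch and TensorFlow. This article describes how to set up a local Kubeflow machine learning platform on a Mac.
Note: this article installs the deployKF distribution. The local environment is intended for development and testing only and must not be used in production!
For other Kubeflow distributions, see the official documentation: https://www.kubeflow.org/docs/started/installing-kubeflow/
Environment:
OS: macOS 13.1 (amd64)
Docker Desktop: v4.15.0
Although K3s itself needs few resources, the Kubeflow suite has many components, so Docker's resource allocation must be raised to keep Pods from getting stuck in Pending during installation.
Recommended Docker resource settings: 8 CPU cores, 10 GB memory, 40 GB disk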
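As a quick sanity check of these settings, the sketch below compares the CPU count and memory that Docker reports against the suggested values. The function name check_docker_resources and the 9 GiB memory threshold are assumptions made for illustration: the Docker VM may report somewhat less memory than the configured figure, so the check leaves some slack.

```shell
#!/usr/bin/env bash
# Compare Docker's reported resources against the suggested minimums.
# Arguments: <cpu_count> <memory_bytes>
# The thresholds are illustrative assumptions, not deployKF requirements.
check_docker_resources() {
  local cpus="$1" mem_bytes="$2"
  local min_cpus=8 min_mem_gib=9
  # Integer division: bytes -> whole GiB
  local mem_gib=$(( mem_bytes / 1073741824 ))
  if [ "$cpus" -lt "$min_cpus" ] || [ "$mem_gib" -lt "$min_mem_gib" ]; then
    echo "WARN: ${cpus} CPUs, ${mem_gib} GiB memory (suggested: ${min_cpus} CPUs, 10 GB)"
    return 1
  fi
  echo "OK: ${cpus} CPUs, ${mem_gib} GiB memory"
}

# Usage, with Docker running:
#   check_docker_resources \
#     "$(docker info --format '{{.NCPU}}')" \
#     "$(docker info --format '{{.MemTotal}}')"
```

Passing the values as arguments keeps the check separate from the `docker info` call, so it can be tried with arbitrary numbers before touching the Docker settings.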
Installation and deployment steps
1. Install the required CLI tools
brew install bash argocd jq k3d kubectl kustomize
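To confirm that the installation left every tool on the PATH, a small helper along these lines can list whatever is still missing. The function name missing_tools is hypothetical; it relies only on the shell built-in `command -v`.

```shell
#!/usr/bin/env bash
# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage:
#   missing_tools bash argocd jq k3d kubectl kustomize
```

Empty output means all the listed tools were found.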
2. Create the Kubernetes cluster
To keep resource consumption as low as possible, the local cluster runs on K3s (created here with k3d):
k3d cluster create "kubeflow" --image "rancher/k3s:v1.27.10-k3s2"
Check whether the cluster is ready with the following command:
kubectl get -A pods
Healthy output looks similar to this:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system local-path-provisioner-957fdf8bc-cj9l5 1/1 Running 0 2m30s
kube-system coredns-77ccd57875-xzzz4 1/1 Running 0 2m30s
kube-system metrics-server-648b5df564-gwnhq 1/1 Running 0 2m30s
kube-system helm-install-traefik-crd-49l4k 0/1 Completed 0 2m31s
kube-system helm-install-traefik-xrjtd 0/1 Completed 2 2m31s
kube-system svclb-traefik-a79cf0ef-lj4td 2/2 Running 0 89s
kube-system traefik-768bdcdcdd-mr8z8 1/1 Running 0 89s
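Scanning this listing by eye gets tedious once the cluster has dozens of Pods. As a rough sketch, the filter below reads the output of `kubectl get pods -A --no-headers` on standard input and prints only the Pods that are neither Completed nor Running with all containers ready. The function name unready_pods is an assumption for illustration, and the column positions assume the default six-column output shown above.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace/name" for Pods that still need attention.
# Columns assumed: NAMESPACE NAME READY STATUS RESTARTS AGE
unready_pods() {
  awk '
    # Finished one-off Pods (for example Helm install jobs) are fine
    $4 == "Completed" { next }
    {
      # READY looks like "1/2": ready containers / total containers
      split($3, ready, "/")
      if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2
    }
  '
}

# Usage:
#   kubectl get pods -A --no-headers | unready_pods
```

Empty output suggests that every Pod has either completed or is running with all of its containers ready.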
3. Deploy ArgoCD
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes; it lets us automate the Kubeflow deployment.
git clone -b main https://github.com/deployKF/deployKF.git
cd deployKF/argocd-plugin
chmod +x ./install_argocd.sh
bash ./install_argocd.sh
Check whether ArgoCD is ready with the following command:
kubectl get pod -n argocd
Healthy output looks similar to this:
NAME READY STATUS RESTARTS AGE
argocd-redis-69f8795dbd-7v4nn 1/1 Running 0 106s
argocd-applicationset-controller-7b9c4dfb77-7gsf2 1/1 Running 0 106s
argocd-notifications-controller-756764ddd5-jw92c 1/1 Running 0 106s
argocd-server-86f64667bc-7nt7d 1/1 Running 0 105s
argocd-application-controller-0 1/1 Running 0 105s
argocd-dex-server-9b5c6dccd-2p779 1/1 Running 0 106s
argocd-repo-server-5b55578f7c-sfzf4 2/2 Running 0 105s
4. Install the Kubeflow suite
Prepare the following file, deploykf-app-of-apps.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: deploykf-app-of-apps
  namespace: argocd
  labels:
    app.kubernetes.io/name: deploykf-app-of-apps
    app.kubernetes.io/part-of: deploykf
spec:
  project: "default"
  source:
    ## source git repo configuration
    ## - we use the 'deploykf/deploykf' repo so we can read its 'sample-values.yaml'
    ##   file, but you may use any repo (even one with no files)
    ##
    repoURL: "https://github.com/deployKF/deployKF.git"
    targetRevision: "v0.1.4"
    path: "."
    ## plugin configuration
    ##
    plugin:
      name: "deploykf"
      parameters:
        ## the deployKF generator version
        ## - available versions: https://github.com/deployKF/deployKF/releases
        ##
        - name: "source_version"
          string: "0.1.4"
        ## paths to values files within the `repoURL` repository
        ## - the values in these files are merged, with later files taking precedence
        ## - we strongly recommend using 'sample-values.yaml' as the base of your values
        ##   so you can easily upgrade to newer versions of deployKF
        ##
        - name: "values_files"
          array:
            - "./sample-values.yaml"
        ## a string containing the contents of a values file
        ## - this parameter allows defining values without needing to create a file in the repo
        ## - these values are merged with higher precedence than those defined in `values_files`
        ##
        - name: "values"
          string: |
            ##
            ## This demonstrates how you might structure overrides for the 'sample-values.yaml' file.
            ## For a more comprehensive example, see the 'sample-values-overrides.yaml' in the main repo.
            ##
            ## Notes:
            ## - YAML maps are RECURSIVELY merged across values files
            ## - YAML lists are REPLACED in their entirety across values files
            ## - Do NOT include empty/null sections, as this will remove ALL values from that section.
            ##   To include a section without overriding any values, set it to an empty map: `{}`
            ##
            ## --------------------------------------------------------------------------------
            ## argocd
            ## --------------------------------------------------------------------------------
            argocd:
              namespace: argocd
              project: default
            ## --------------------------------------------------------------------------------
            ## kubernetes
            ## --------------------------------------------------------------------------------
            kubernetes:
              {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
            ## --------------------------------------------------------------------------------
            ## deploykf-dependencies
            ## --------------------------------------------------------------------------------
            deploykf_dependencies:
              ## --------------------------------------
              ## cert-manager
              ## --------------------------------------
              cert_manager:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## istio
              ## --------------------------------------
              istio:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## kyverno
              ## --------------------------------------
              kyverno:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
            ## --------------------------------------------------------------------------------
            ## deploykf-core
            ## --------------------------------------------------------------------------------
            deploykf_core:
              ## --------------------------------------
              ## deploykf-auth
              ## --------------------------------------
              deploykf_auth:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## deploykf-istio-gateway
              ## --------------------------------------
              deploykf_istio_gateway:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## deploykf-profiles-generator
              ## --------------------------------------
              deploykf_profiles_generator:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
            ## --------------------------------------------------------------------------------
            ## deploykf-opt
            ## --------------------------------------------------------------------------------
            deploykf_opt:
              ## --------------------------------------
              ## deploykf-minio
              ## --------------------------------------
              deploykf_minio:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## deploykf-mysql
              ## --------------------------------------
              deploykf_mysql:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
            ## --------------------------------------------------------------------------------
            ## kubeflow-tools
            ## --------------------------------------------------------------------------------
            kubeflow_tools:
              ## --------------------------------------
              ## katib
              ## --------------------------------------
              katib:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## notebooks
              ## --------------------------------------
              notebooks:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
              ## --------------------------------------
              ## pipelines
              ## --------------------------------------
              pipelines:
                {} # <-- REMOVE THIS, IF YOU INCLUDE VALUES UNDER THIS SECTION!
  destination:
    server: "https://kubernetes.default.svc"
    namespace: "argocd"
Run the following command to deploy it:
kubectl apply -f ./deploykf-app-of-apps.yaml
Check the ArgoCD status through its web UI:
kubectl port-forward --namespace "argocd" svc/argocd-server 8090:https
Open https://localhost:8090/ in a browser. The username is admin, and the password can be retrieved with the following command:
echo $(kubectl -n argocd get secret/argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d)
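The `base64 -d` step is needed because Kubernetes stores the values under a Secret's `data` field base64-encoded. A minimal illustration with a made-up value (the string below is a hypothetical example, not a real ArgoCD password):

```shell
#!/usr/bin/env bash
# Decode a base64-encoded Secret field read from stdin.
decode_secret_field() {
  base64 -d
}

# "ZXhhbXBsZS1wYXNzd29yZA==" is the base64 encoding of a made-up value.
printf '%s' 'ZXhhbXBsZS1wYXNzd29yZA==' | decode_secret_field
echo
```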
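Because the applications depend on one another, the following script can be used to run the Sync operations in order (if you already cloned the repository in step 3, skip the git clone and change into its scripts directory instead):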
git clone -b main https://github.com/deployKF/deployKF.git
cd deployKF/scripts
chmod +x ./sync_argocd_apps.sh
bash ./sync_argocd_apps.sh
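The script is idempotent, so after a failure it can be rerun until the deployment succeeds. After a successful deployment, the list of running Pods looks similar to this: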
NAMESPACE NAME READY STATUS RESTARTS AGE
argocd argocd-redis-69f8795dbd-x5wtv 1/1 Running 5 (17m ago) 105m
argocd argocd-server-86f64667bc-zfm7m 1/1 Running 4 (17m ago) 73m
argocd argocd-repo-server-5b55578f7c-x26zz 2/2 Running 10 (17m ago) 91m
argocd argocd-notifications-controller-756764ddd5-2fqbr 1/1 Running 5 (17m ago) 89m
argocd argocd-dex-server-9b5c6dccd-bl86m 1/1 Running 5 (17m ago) 91m
argocd argocd-application-controller-0 1/1 Running 5 (17m ago) 91m
argocd argocd-applicationset-controller-7b9c4dfb77-hph2r 1/1 Running 5 (17m ago) 105m
cert-manager cert-manager-c688c56f-w4jts 1/1 Running 5 (17m ago) 109m
cert-manager trust-manager-78766fd9bd-zd5zf 1/1 Running 5 (17m ago) 90m
cert-manager cert-manager-webhook-d45447457-q6cf8 1/1 Running 6 (17m ago) 109m
cert-manager cert-manager-cainjector-59d694bcc7-mrcvg 1/1 Running 6 (17m ago) 109m
deploykf-auth oauth2-proxy-5fd9888b79-tpnrt 2/2 Running 11 (16m ago) 73m
deploykf-auth dex-68c8bf56b9-78d5g 2/2 Running 8 (17m ago) 73m
deploykf-dashboard profile-controller-5575767c76-vshp2 2/2 Running 8 (17m ago) 73m
deploykf-dashboard kfam-api-75b64c9645-sjfcq 2/2 Running 10 (17m ago) 98m
deploykf-dashboard central-dashboard-6b5d9574dc-fmlt4 2/2 Running 10 (17m ago) 98m
deploykf-istio-gateway deploykf-gateway-6ddf8947cc-qz55g 1/1 Running 5 (17m ago) 98m
deploykf-minio deploykf-minio-568b877668-w2wct 2/2 Running 5 (17m ago) 52m
deploykf-mysql deploykf-mysql-0 1/1 Running 5 (17m ago) 109m
istio-system istiod-7b9b6df595-jbztw 1/1 Running 5 (17m ago) 91m
kube-system svclb-deploykf-gateway-7f7cba3a-kkskn 3/3 Running 15 (17m ago) 100m
kube-system metrics-server-648b5df564-gwnhq 1/1 Running 9 (17m ago) 5h43m
kube-system local-path-provisioner-957fdf8bc-cj9l5 1/1 Running 7 (17m ago) 5h43m
kube-system coredns-77ccd57875-xzzz4 1/1 Running 7 (17m ago) 5h43m
kube-system traefik-768bdcdcdd-mr8z8 1/1 Running 7 (17m ago) 5h42m
kube-system svclb-traefik-a79cf0ef-6ksjm 2/2 Running 10 (17m ago) 100m
kubeflow katib-controller-75858c4ddf-hwvkx 1/1 Running 8 (17m ago) 95m
kubeflow ml-pipeline-ui-68b7f6586d-qtjp5 2/2 Running 15 (17m ago) 94m
kubeflow ml-pipeline-persistenceagent-68bbd65f98-tsnqn 2/2 Running 10 (17m ago) 94m
kubeflow katib-ui-d4df8bdb6-2x75p 2/2 Running 10 (17m ago) 95m
kubeflow ml-pipeline-6445d9fb77-dxgv4 2/2 Running 24 (16m ago) 94m
kubeflow admission-webhook-deployment-789dc56fbf-z7cj8 1/1 Running 5 (17m ago) 94m
kubeflow metadata-writer-6f95b9588c-fmx4s 2/2 Running 8 (17m ago) 73m
kubeflow notebook-controller-deployment-649cf9b976-vnvwd 2/2 Running 10 (17m ago) 95m
kubeflow training-operator-7cf5c66858-jf5sr 1/1 Running 3 (17m ago) 43m
kubeflow tensorboards-web-app-deployment-778466f5f6-dmrks 2/2 Running 2 (17m ago) 43m
kubeflow tensorboard-controller-deployment-644f57dd7c-zlxnw 3/3 Running 24 (17m ago) 92m
kubeflow ml-pipeline-scheduledworkflow-578475988-kwz27 2/2 Running 10 (17m ago) 94m
kubeflow volumes-web-app-deployment-588d46bb75-95g6b 2/2 Running 2 (17m ago) 42m
kubeflow ml-pipeline-viewer-crd-6857ccc85c-zl895 2/2 Running 10 (17m ago) 94m
kubeflow metadata-grpc-deployment-566d54d578-wwj9n 2/2 Running 23 (16m ago) 94m
kubeflow ml-pipeline-visualizationserver-7b45b7fd56-s4pxh 2/2 Running 15 (17m ago) 94m
kubeflow cache-server-66d7586749-prmkq 2/2 Running 10 (17m ago) 94m
kubeflow jupyter-web-app-deployment-9c8c779c-hcqvr 2/2 Running 15 (17m ago) 91m
kubeflow katib-db-manager-6998f5bdd8-lrs77 1/1 Running 5 (17m ago) 95m
kubeflow metadata-envoy-deployment-b48db5966-542nh 1/1 Running 5 (17m ago) 94m
kubeflow-argo-workflows argo-workflow-controller-79fc5c6895-2g26t 2/2 Running 10 (17m ago) 98m
kubeflow-argo-workflows argo-server-6d97fb7649-lsfdw 2/2 Running 5 (16m ago) 73m
kyverno kyverno-cleanup-controller-6cb4d5848-hh8nm 1/1 Running 5 (17m ago) 109m
kyverno kyverno-admission-controller-964c74c7d-frknb 1/1 Running 5 (17m ago) 109m
kyverno kyverno-background-controller-796f77c79f-nwhrs 1/1 Running 5 (17m ago) 109m
kyverno kyverno-reports-controller-6d6d98fc96-z7qjv 1/1 Running 5 (17m ago) 109m
kyverno kyverno-admission-controller-964c74c7d-hgtc2 1/1 Running 4 (17m ago) 109m
kyverno kyverno-admission-controller-964c74c7d-x744h 1/1 Running 5 (17m ago) 109m
team-1 ml-pipeline-visualizationserver-677c86b748-nbrr5 2/2 Running 2 (17m ago) 73m
team-1 ml-pipeline-ui-artifact-7749b4f5f6-ld7kl 2/2 Running 10 (17m ago) 94m
team-1-prod ml-pipeline-visualizationserver-677c86b748-hqwsh 2/2 Running 2 (17m ago) 73m
team-1-prod ml-pipeline-ui-artifact-7749b4f5f6-hl6gk 2/2 Running 10 (17m ago) 94m
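Since the sync script is idempotent, rerunning it after a failure can be automated with a bounded retry loop. The helper below is only a sketch: the function name retry and the attempt count and delay in the usage line are arbitrary assumptions, not values recommended by deployKF.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds, up to a maximum number of attempts.
# Arguments: <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Usage, from the deployKF/scripts directory:
#   retry 5 30 bash ./sync_argocd_apps.sh
```

Capping the number of attempts keeps a persistent failure, such as Pods stuck in Pending for lack of resources, from looping forever.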
[Screenshot: ArgoCD UI after synchronization completes, with 20 applications synced]
5. Access the dashboard
Set up port forwarding:
kubectl port-forward \
--namespace "deploykf-istio-gateway" \
svc/deploykf-gateway 8080:http 8443:https
Because the Istio Gateway routes requests to the target service based on the Host header, add the following entries to the local /etc/hosts file:
127.0.0.1 deploykf.example.com
127.0.0.1 argo-server.deploykf.example.com
127.0.0.1 minio-api.deploykf.example.com
127.0.0.1 minio-console.deploykf.example.com
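To avoid duplicate entries when this step is repeated, the lines can be appended only if they are not already there. The sketch below prints the entries missing from a hosts-style file; the function name missing_hosts_entries is hypothetical, and the exact whole-line match (`grep -xF`) assumes entries are written with a single space as shown above.

```shell
#!/usr/bin/env bash
# Print each given entry that is not yet present as a whole line in the file.
# Arguments: <hosts_file> <entry>...
missing_hosts_entries() {
  local hosts_file="$1" entry
  shift
  for entry in "$@"; do
    grep -qxF -- "$entry" "$hosts_file" || echo "$entry"
  done
}

# Usage (writing to /etc/hosts requires root, hence sudo tee):
#   missing_hosts_entries /etc/hosts \
#     "127.0.0.1 deploykf.example.com" \
#     "127.0.0.1 argo-server.deploykf.example.com" \
#     "127.0.0.1 minio-api.deploykf.example.com" \
#     "127.0.0.1 minio-console.deploykf.example.com" \
#     | sudo tee -a /etc/hosts
```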
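Open https://deploykf.example.com:8443/ in a browser and sign in with one of the default accounts:
Administrator: username admin@example.com, password admin
User 1: username user1@example.com, password user1
User 2: username user2@example.com, password user2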
6. Run Jupyter
More features are still being explored…
References
https://www.deploykf.org/guides/local-quickstart/