Kubeflow Usage Demo
Adding a User
Create a Profile. Its metadata.name becomes the namespace name (the namespace is created automatically if it does not exist).
Apply it with kubectl apply -f admin-profile.yaml:
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  ## the profile name will be the namespace name
  ## WARNING: unexpected behavior may occur if the namespace already exists
  name: admin-profile
spec:
  ## the owner of the profile
  ## NOTE: you may wish to make a global super-admin the owner of all profiles
  ## and only give end-users view or modify access to profiles to prevent
  ## them from adding/removing contributors
  owner:
    kind: User
    name: admin@example.com
  ## plugins extend the functionality of the profile
  ## https://github.com/kubeflow/kubeflow/tree/master/components/profile-controller#plugins
  plugins: []
  ## optionally create a ResourceQuota for the profile
  ## https://github.com/kubeflow/kubeflow/tree/master/components/profile-controller#resourcequotaspec
  ## https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/resource-quota-v1/#ResourceQuotaSpec
  resourceQuotaSpec: {}
Generate the user's password hash. Run the command below and type the password when prompted; it prints the bcrypt hash:
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
Fetch the Dex configuration:
kubectl get configmap dex -n auth -o jsonpath='{.data.config\.yaml}' > dex-yaml.yaml
Edit the Dex configuration to add the user entries; after editing it looks like this:
issuer: http://dex.auth.svc.cluster.local:5556/dex
storage:
  type: kubernetes
  config:
    inCluster: true
web:
  http: 0.0.0.0:5556
logger:
  level: "debug"
  format: text
oauth2:
  skipApprovalScreen: true
enablePasswordDB: true
staticPasswords:
- email: user@example.com
  hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
  # https://github.com/dexidp/dex/pull/1601/commits
  # FIXME: Use hashFromEnv instead
  username: user
  userID: "15841185641784"
- email: admin@example.com
  hash: $2y$12$aduHAiTcmHCHm3.5I9w/0OEZT5f20MK79sEOyHf4sMkwPEXEertuW
  username: admin
staticClients:
# https://github.com/dexidp/dex/pull/1664
- idEnv: OIDC_CLIENT_ID
  redirectURIs: ["/authservice/oidc/callback"]
  name: 'Dex Login Application'
  secretEnv: OIDC_CLIENT_SECRET
Recreate the ConfigMap from the modified config.yaml and restart the deployment:
kubectl create configmap dex --from-file=config.yaml=dex-yaml.yaml -n auth --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment dex -n auth
Adding Contributors
If a user wants others to take part in their development, they can add those users to their own namespace. There are two ways.
- Create a contributor RoleBinding
Based on Kubernetes RBAC: https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/rbac/
The example below adds user test@example.com with view permission in the kubeflow-user-example-com namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-test-example-com-clusterrole-view
  namespace: kubeflow-user-example-com # the Profile name
  annotations:
    role: view # the user's role: edit or view
    user: test@example.com # the user's email (case-sensitive)
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-view # the user's role: kubeflow-edit or kubeflow-view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: test@example.com # the user's email (case-sensitive)
- Create a contributor AuthorizationPolicy
Based on Istio's AuthorizationPolicy: https://istio.io/latest/docs/reference/config/security/authorization-policy/
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: user-test-example-com-clusterrole-nonotebook
  namespace: kubeflow-user-example-com
  annotations:
    role: view
    user: test@example.com
spec:
  rules:
  - from:
    - source:
        ## for more information see the KFAM code:
        ## https://github.com/kubeflow/kubeflow/blob/v1.8.0/components/access-management/kfam/bindings.go#L79-L110
        principals:
          ## required for kubeflow notebooks
          ## TEMPLATE: "cluster.local/ns/<ISTIO_GATEWAY_NAMESPACE>/sa/<ISTIO_GATEWAY_SERVICE_ACCOUNT>"
          - "cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"
          ## required for kubeflow pipelines
          ## TEMPLATE: "cluster.local/ns/<KUBEFLOW_NAMESPACE>/sa/<KFP_UI_SERVICE_ACCOUNT>"
          - "cluster.local/ns/kubeflow/sa/ml-pipeline-ui"
    when:
    - key: request.headers[kubeflow-userid]
      values:
      - test@example.com
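The principals above follow Istio's SPIFFE-style identity format, "cluster.local/ns/<NAMESPACE>/sa/<SERVICE_ACCOUNT>". As an illustrative sketch (the helper below is hypothetical, not part of Kubeflow), the two required principals can be derived from a namespace and service-account name:

```python
# Hypothetical helper (not part of Kubeflow): build the Istio principal
# strings used in the AuthorizationPolicy above, following the SPIFFE-style
# template "cluster.local/ns/<NAMESPACE>/sa/<SERVICE_ACCOUNT>".
def istio_principal(namespace: str, service_account: str,
                    trust_domain: str = "cluster.local") -> str:
    """Format the principal Istio derives from a Kubernetes service account."""
    return f"{trust_domain}/ns/{namespace}/sa/{service_account}"

# The two principals required by the policy above:
gateway = istio_principal("istio-system", "istio-ingressgateway-service-account")
kfp_ui = istio_principal("kubeflow", "ml-pipeline-ui")
print(gateway)  # cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account
print(kfp_ui)   # cluster.local/ns/kubeflow/sa/ml-pipeline-ui
```

When deploying with a non-default Istio gateway or pipelines UI service account, substitute those names into the template.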
Submitting Distributed Training Jobs
In an offline environment the images need to be changed.
For example, a PyTorchJob pod includes an initContainer whose image is alpine, so the image address must be modified.
PyTorchJob
Change the training-operator's pytorch-init-container-image argument, as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  labels:
    control-plane: kubeflow-training-operator
spec:
  selector:
    matchLabels:
      control-plane: kubeflow-training-operator
  replicas: 1
  template:
    metadata:
      labels:
        control-plane: kubeflow-training-operator
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
      - command:
        - /manager
        - --pytorch-init-container-image=alpine:3.10
        - --mpi-kubectl-delivery-image=mpioperator/mpi-operator:latest
        image: kubeflow/training-operator:v1-855e096
        name: training-operator
        ports:
        - containerPort: 8080
        env:
        ...
Submit a training job:
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
  namespace: "admin-profile"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            nvidia.com/use-gpuuuid: "GPU-xxxx" # GPU selection
        spec:
          containers:
          - name: pytorch
            image: kubeflow/pytorch-dist-mnist:latest
            args: ["--backend", "nccl"]
            env:
            - name: NVIDIA_VISIBLE_DEVICES # GPU devices visible to the container
              value: "0"
            resources:
              limits:
                nvidia.com/gpu: 1
          tolerations:
          - key: "gpu" # toleration for the GPU node taint
            operator: "Equal"
            value: "on"
            effect: "NoSchedule"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            nvidia.com/use-gpuuuid: "GPU-xxxx" # GPU selection
        spec:
          containers:
          - name: pytorch
            image: kubeflow/pytorch-dist-mnist:latest
            args: ["--backend", "nccl"]
            env:
            - name: NVIDIA_VISIBLE_DEVICES # GPU devices visible to the container
              value: "0"
            resources:
              limits:
                nvidia.com/gpu: 1
          tolerations:
          - key: "gpu" # toleration for the GPU node taint
            operator: "Equal"
            value: "on"
            effect: "NoSchedule"
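Inside the job, the training-operator injects the rendezvous settings (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) as environment variables into each replica. A minimal sketch of how a training script might read them (the defaults are illustrative fallbacks for running outside the cluster):

```python
# Sketch: read the rendezvous environment variables that the
# training-operator injects into each PyTorchJob replica.
# The fallback defaults are illustrative, for local single-process runs.
import os

def dist_config():
    """Collect the distributed-training settings from the environment."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "23456")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

cfg = dist_config()
# torch.distributed.init_process_group("nccl", ...) would consume these values
print(cfg)
```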
MPIJob
Deploy the mpi-operator (CRD version v2beta1):
git clone https://github.com/kubeflow/mpi-operator.git
# On an internal network, change the image addresses first, then deploy
kubectl apply -k manifests/overlays/kubeflow
Submit a training job:
- Build the image from mpi-operator\examples\v2beta1\horovod
- Submit the training job:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist-test14
  namespace: admin-profile
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - image: horovod:v1
            imagePullPolicy: Always
            name: mpi-launcher
            securityContext:
              runAsUser: 0
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
            volumeMounts:
            - name: mnist-volume
              mountPath: /root/
          volumes:
          - name: mnist-volume
            persistentVolumeClaim:
              claimName: mnist-pvc
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - image: horovod:v1
            imagePullPolicy: Always
            name: mpi-worker
            securityContext:
              privileged: true
            resources:
              limits:
                cpu: 2
                memory: 4Gi
            volumeMounts:
            - name: mnist-volume
              mountPath: /root/
          volumes:
          - name: mnist-volume
            persistentVolumeClaim:
              claimName: mnist-pvc
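Note that the -np value passed to mpirun is typically the total slot count, i.e. Worker replicas × slotsPerWorker (2 × 1 = 2 in the job above). A trivial illustrative check (not part of mpi-operator):

```python
# Illustrative check (not mpi-operator code): the mpirun -np value is
# normally worker_replicas * slots_per_worker from the MPIJob spec.
def expected_np(worker_replicas: int, slots_per_worker: int) -> int:
    """Total MPI slots available across the Worker replicas."""
    return worker_replicas * slots_per_worker

# Values from the MPIJob above: 2 workers, 1 slot each -> -np 2
print(expected_np(2, 1))  # 2
```

If the replica count or slotsPerWorker is changed, update the -np argument to match, or mpirun will fail to schedule its processes.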
Katib AutoML
Official example using the Python SDK.
Requires the kubeflow-katib dependency:
pip install kubeflow-katib
Run the following code in a notebook:
# [1] Create the objective function.
def objective(parameters):
    # Import required packages.
    import time
    time.sleep(5)
    # Calculate objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"result={result}")

import kubeflow.katib as katib

# [2] Create hyperparameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# [3] Create a Katib experiment with 12 trials and 2 CPUs per trial.
name = "tune-experiment"
katib.KatibClient().tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
    base_image="docker.io/tensorflow/tensorflow:2.13.0"
)

# [4] Wait for the Katib experiment to complete.
katib.KatibClient().wait_for_experiment_condition(name=name)

# [5] Get the best hyperparameters.
print(katib.KatibClient().get_optimal_hyperparameters(name=name))
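Katib's stdout metrics collector scrapes the trial logs for lines in the <metric-name>=<metric-value> format printed by the objective above. A hypothetical sketch of such parsing (illustrative only, not Katib's actual implementation):

```python
import re

# Hypothetical sketch of parsing "<metric-name>=<metric-value>" lines from
# a trial's stdout; Katib's stdout collector expects this format, but the
# parser below is illustrative only.
METRIC_RE = re.compile(r"^(?P<name>[\w.-]+)=(?P<value>-?\d+(?:\.\d+)?)$")

def parse_metrics(stdout: str) -> dict:
    """Return {metric_name: value} for every metric-formatted line."""
    metrics = {}
    for line in stdout.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            metrics[m.group("name")] = float(m.group("value"))
    return metrics

print(parse_metrics("starting trial\nresult=39.99\ndone"))  # {'result': 39.99}
```

This is why the objective must print the metric named in objective_metric_name ("result" here); otherwise the trial reports no observations.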
Model Registry (model-registry)
Requires the model-registry dependency:
pip install model-registry
The client is then used from a notebook to register models and look up model versions.
KServe Model Serving
Prepare a model:
from sklearn import svm
from sklearn import datasets
from joblib import dump
clf = svm.SVC(gamma='scale')
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
dump(clf, 'model.joblib')
Publish the model:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: test
spec:
  predictor:
    sklearn:
      storageUri: "pvc://kf-test/models/sklearn/1.0/model"
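Once the InferenceService is ready, it can be queried over KServe's v1 inference protocol: POST /v1/models/<name>:predict with an "instances" payload. The sketch below only builds the request; the ingress host is a placeholder you must replace with your cluster's gateway address (and the Host header KServe assigns to the service):

```python
import json

# Sketch of a KServe v1-protocol request to the "sklearn-iris" service
# defined above. <INGRESS_HOST> is a placeholder, not a real address.
MODEL = "sklearn-iris"
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

url = f"http://<INGRESS_HOST>/v1/models/{MODEL}:predict"  # placeholder host
body = json.dumps(payload)
print(url)
print(body)
# A successful response has the shape {"predictions": [...]},
# one prediction per row in "instances".
```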
Istio Service Mesh
The kubeflow namespace has istio-proxy sidecar injection enabled by default:
kind: Namespace
apiVersion: v1
metadata:
  name: kubeflow
  labels:
    control-plane: kubeflow
    istio-injection: enabled # inject the proxy by default for pods in this namespace
    kubernetes.io/metadata.name: kubeflow
How to disable proxy injection for a single pod: set the annotation on the pod's metadata:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    sidecar.istio.io/inject: "false" # disable proxy injection
Use an AuthorizationPolicy to skip authentication for specific URLs:
# skip Istio verification for the /metrics path
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-metric-access
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: notebook-controller # only workloads with the label app: notebook-controller skip verification
  rules:
  - to:
    - operation:
        paths: ["/metrics"]
Notebook Advanced Options: Config Injection, Affinity/Tolerations
Config injection: a PodDefault is similar to Kubernetes' PodPreset and can inject settings into pods:
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: nvidia-gpu-devices-1
  namespace: admin-profile
spec:
  desc: The first gpu
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  selector:
    matchLabels:
      nvidia-gpu-devices-1: "true"
Affinity/tolerations
Configured through the tolerationGroup and affinityConfig entries in the spawner_ui_config.yaml file:
tolerationGroup:
  readOnly: false
  options:
  - groupKey: "group_1"
    displayName: "gpuserver RTX4080 16G x 3"
    tolerations:
    - key: "gpu" # toleration for the GPU node taint
      operator: "Equal"
      value: "on"
      effect: "NoSchedule"