Kubeflow Usage

Adding users

Create a Profile. metadata.name becomes the namespace name (the namespace is created automatically if it does not exist).

Apply it with kubectl apply -f admin-profile.yaml:

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  ## the profile name will be the namespace name
  ## WARNING: unexpected behavior may occur if the namespace already exists
  name: admin-profile
spec:
  ## the owner of the profile
  ## NOTE: you may wish to make a global super-admin the owner of all profiles
  ##       and only give end-users view or modify access to profiles to prevent
  ##       them from adding/removing contributors
  owner:
    kind: User
    name: admin@example.com

  ## plugins extend the functionality of the profile
  ## https://github.com/kubeflow/kubeflow/tree/master/components/profile-controller#plugins
  plugins: []

  ## optionally create a ResourceQuota for the profile
  ## https://github.com/kubeflow/kubeflow/tree/master/components/profile-controller#resourcequotaspec
  ## https://kubernetes.io/docs/reference/kubernetes-api/policy-resources/resource-quota-v1/#ResourceQuotaSpec
  resourceQuotaSpec: {}

Generate the user's password hash. Run the command below, enter the password when prompted, and it prints a bcrypt hash:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Fetch the current Dex configuration:

kubectl get configmap dex -n auth -o jsonpath='{.data.config\.yaml}' > dex-yaml.yaml

Edit the Dex config to add the user entries under staticPasswords; after the change it looks like this:

issuer: http://dex.auth.svc.cluster.local:5556/dex
storage:
  type: kubernetes
  config:
    inCluster: true
web:
  http: 0.0.0.0:5556
logger:
  level: "debug"
  format: text
oauth2:
  skipApprovalScreen: true
enablePasswordDB: true
staticPasswords:
- email: user@example.com
  hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
  # https://github.com/dexidp/dex/pull/1601/commits
  # FIXME: Use hashFromEnv instead
  username: user
  userID: "15841185641784"
- email: admin@example.com
  hash: $2y$12$aduHAiTcmHCHm3.5I9w/0OEZT5f20MK79sEOyHf4sMkwPEXEertuW
  username: admin
staticClients:
# https://github.com/dexidp/dex/pull/1664
- idEnv: OIDC_CLIENT_ID
  redirectURIs: ["/authservice/oidc/callback"]
  name: 'Dex Login Application'
  secretEnv: OIDC_CLIENT_SECRET

Recreate the dex ConfigMap from the updated config.yaml and restart the Deployment:

kubectl create configmap dex --from-file=config.yaml=dex-yaml.yaml -n auth --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment dex -n auth

Adding contributors

A user who wants others to join their work can add them to their namespace. There are two ways:

  1. Add them through the dashboard UI (users added this way get edit permission by default; to restrict a contributor to view-only access you must configure it manually)
  2. Manage contributors manually
    1. Create a contributor RoleBinding

This builds on Kubernetes RBAC: https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/rbac/

The example below grants user test@example.com view access in the kubeflow-user-example-com namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-test-example-com-clusterrole-view
  namespace: kubeflow-user-example-com  # the Profile name
  annotations:
    role: view  # the user's role: edit or view
    user: test@example.com  # the user's email (case-sensitive)
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-view  # kubeflow-view for the view role, kubeflow-edit for edit
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: test@example.com  # the user's email (case-sensitive)
    2. Create a contributor AuthorizationPolicy

This uses Istio's AuthorizationPolicy: https://istio.io/latest/docs/reference/config/security/authorization-policy/

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: user-test-example-com-clusterrole-nonotebook
  namespace: kubeflow-user-example-com
  annotations:
    role: view
    user: test@example.com
spec:
  rules:
    - from:
        - source:
            ## for more information see the KFAM code:
            ## https://github.com/kubeflow/kubeflow/blob/v1.8.0/components/access-management/kfam/bindings.go#L79-L110
            principals:
              ## required for kubeflow notebooks
              ## TEMPLATE: "cluster.local/ns/<ISTIO_GATEWAY_NAMESPACE>/sa/<ISTIO_GATEWAY_SERVICE_ACCOUNT>"
              - "cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"

              ## required for kubeflow pipelines
              ## TEMPLATE: "cluster.local/ns/<KUBEFLOW_NAMESPACE>/sa/<KFP_UI_SERVICE_ACCOUNT>"
              - "cluster.local/ns/kubeflow/sa/ml-pipeline-ui"
      when:
        - key: request.headers[kubeflow-userid]
          values:
            - test@example.com

Submitting distributed training jobs

In an offline environment the image addresses must be changed.

For example, a PyTorchJob includes an initContainer whose image is alpine; that image address must be changed to one reachable from the cluster.

PyTorchJob

Change the --pytorch-init-container-image argument in the training-operator Deployment, as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  labels:
    control-plane: kubeflow-training-operator
spec:
  selector:
    matchLabels:
      control-plane: kubeflow-training-operator
  replicas: 1
  template:
    metadata:
      labels:
        control-plane: kubeflow-training-operator
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - command:
            - /manager
            - --pytorch-init-container-image=alpine:3.10
            - --mpi-kubectl-delivery-image=mpioperator/mpi-operator:latest
          image: kubeflow/training-operator:v1-855e096
          name: training-operator
          ports:
            - containerPort: 8080
          env:
...

Submit a training job:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
  namespace: "admin-profile"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            nvidia.com/use-gpuuuid: "GPU-xxxx" # select a specific GPU by UUID
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              env:
                - name: NVIDIA_VISIBLE_DEVICES # limit which GPU devices are visible
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: 1
          tolerations:
            - key: "gpu" # toleration for the GPU node taint
              operator: "Equal"
              value: "on"
              effect: "NoSchedule"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            nvidia.com/use-gpuuuid: "GPU-xxxx" # select a specific GPU by UUID
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              env:
                - name: NVIDIA_VISIBLE_DEVICES # limit which GPU devices are visible
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: 1
          tolerations:
            - key: "gpu" # toleration for the GPU node taint
              operator: "Equal"
              value: "on"
              effect: "NoSchedule"

MPIJob

Deploy the mpi-operator

CRD version: v2beta1

git clone https://github.com/kubeflow/mpi-operator.git
#in an offline network, change the image addresses first, then deploy
kubectl apply -k manifests/overlays/kubeflow

Submit a training job

  1. Build the image from mpi-operator\examples\v2beta1\horovod
  2. Submit the training job:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist-test14
  namespace: admin-profile
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - image: horovod:v1
            imagePullPolicy: Always
            name: mpi-launcher
            securityContext:
              runAsUser: 0
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
            volumeMounts:
              - name: mnist-volume
                mountPath: /root/
          volumes:
            - name: mnist-volume
              persistentVolumeClaim:
                claimName: mnist-pvc
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - image: horovod:v1
            imagePullPolicy: Always
            name: mpi-worker
            securityContext:
              privileged: true
            resources:
              limits:
                cpu: 2
                memory: 4Gi
            volumeMounts:
              - name: mnist-volume
                mountPath: /root/
          volumes:
            - name: mnist-volume
              persistentVolumeClaim:
                claimName: mnist-pvc

Katib AutoML

The official example uses the Python SDK.

Requires the kubeflow-katib dependency:

pip install kubeflow-katib

Run the following code in a notebook:

# [1] Create the objective function.
def objective(parameters):
    # Import required packages.
    import time
    time.sleep(5)
    # Calculate objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"result={result}")

import kubeflow.katib as katib

# [2] Create hyperparameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# [3] Create a Katib experiment with 12 trials and 2 CPUs per trial.
name = "tune-experiment"
katib.KatibClient().tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
    base_image="docker.io/tensorflow/tensorflow:2.13.0",
)

# [4] Wait for the Katib experiment to finish.
katib.KatibClient().wait_for_experiment_condition(name=name)

# [5] Get the best hyperparameters.
print(katib.KatibClient().get_optimal_hyperparameters(name=name))
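Note that the objective function reports its metric purely through stdout: Katib's metrics collector scans trial logs for lines of the form <metric-name>=<metric-value>, as the comment in the objective notes. A small stdlib illustration of that contract (the regex here is mine for demonstration, not Katib's actual parser):

```python
import re

# A trial's stdout might look like this.
log = """
Epoch 1 done
result=39.84
"""

# Katib's stdout collector expects whole lines of the form <name>=<value>.
pattern = re.compile(r"^(?P<name>[\w.-]+)=(?P<value>[-+]?\d+(?:\.\d+)?)$", re.MULTILINE)
metrics = {m.group("name"): float(m.group("value")) for m in pattern.finditer(log)}
print(metrics)  # {'result': 39.84}
```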
Model registry (model-registry)

Requires the model-registry dependency:

pip install model-registry

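A minimal registration sketch using the model-registry Python client. The service address, port, model name, and version below are placeholder assumptions; check the model-registry service in your deployment, and note this only runs against a live registry:

```python
from model_registry import ModelRegistry

def register_sklearn_model():
    # Hypothetical in-cluster service address; adjust to your deployment.
    registry = ModelRegistry(
        server_address="http://model-registry-service.kubeflow.svc.cluster.local",
        port=8080,
        author="admin@example.com",
        is_secure=False,
    )
    # Register a model version pointing at the stored artifact.
    model = registry.register_model(
        "iris-svc",                                  # assumed model name
        "pvc://kf-test/models/sklearn/1.0/model",    # artifact location
        version="1.0.0",
        model_format_name="sklearn",
        model_format_version="1",
    )
    print(model.id)
```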

KServe model serving

Prepare a model:

from sklearn import svm
from sklearn import datasets
from joblib import dump
clf = svm.SVC(gamma='scale')
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
dump(clf, 'model.joblib')
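The dumped file can be sanity-checked by loading it back before publishing. A self-contained sketch (re-trains the same model, assuming scikit-learn and joblib as above):

```python
from sklearn import svm, datasets
from joblib import dump, load

# Re-create and serialize the model (same as above), then reload it.
iris = datasets.load_iris()
clf = svm.SVC(gamma='scale').fit(iris.data, iris.target)
dump(clf, 'model.joblib')

# Sanity check: the reloaded model predicts like the original.
restored = load('model.joblib')
preds = restored.predict(iris.data[:3])
print(preds)  # the first three iris samples are setosa (class 0)
```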

Deploy the model:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: test
spec:
  predictor:
    sklearn:
      storageUri: "pvc://kf-test/models/sklearn/1.0/model" 
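Once the InferenceService is ready, it accepts the KServe v1 protocol: a JSON body with an "instances" list POSTed to /v1/models/sklearn-iris:predict. Building the request payload (the ingress host is deployment-specific, so it is left as a placeholder):

```python
import json

# Two iris samples, 4 features each, in the v1 "instances" format.
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
body = json.dumps(payload)

# Illustrative endpoint; the actual host depends on your ingress/gateway setup.
url = "http://<INGRESS_HOST>/v1/models/sklearn-iris:predict"
print(body)
```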

Istio service mesh

The kubeflow namespace has istio-proxy sidecar injection enabled by default:

kind: Namespace
apiVersion: v1
metadata:
  name: kubeflow
  labels:
    control-plane: kubeflow
    istio-injection: enabled # sidecar injection is on by default for this namespace
    kubernetes.io/metadata.name: kubeflow

To opt a pod out of sidecar injection, set the annotation on the pod's metadata (for a Deployment this goes under spec.template.metadata):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    sidecar.istio.io/inject: "false" # disable sidecar injection

AuthorizationPolicy: skipping authentication for specific URLs

The policy below lets the /metrics path bypass Istio authentication:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-metric-access
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: notebook-controller   # only workloads matching this label skip authentication
  rules:
    - to:
        - operation:
            paths: ["/metrics"]

Notebook advanced options: config injection, affinity/taints

Config injection: a PodDefault works like the Kubernetes PodPreset and injects pieces of configuration into matching pods.

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: nvidia-gpu-devices-1
  namespace: admin-profile
spec:
  desc: The first GPU
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0"
  selector:
    matchLabels:
      nvidia-gpu-devices-1: "true"
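A pod (for example a Notebook pod) opts into this PodDefault by carrying the label named in the selector; in the Notebooks UI this surfaces as a checkbox under the configurations options. A sketch of the pod-side label, matching the selector above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    nvidia-gpu-devices-1: "true"  # matches the PodDefault selector
```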

Affinity / taints

These are driven by the tolerationGroup and affinityConfig entries in the spawner_ui_config.yaml configuration file:

tolerationGroup:
    readOnly: false
    options:
      - groupKey: "group_1"
        displayName: "gpuserver RTX4080 16G x 3"
        tolerations:
          - key: "gpu" # toleration for the GPU node taint
            operator: "Equal"
            value: "on"
            effect: "NoSchedule"
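The text mentions affinityConfig but only shows tolerationGroup; a sketch of a corresponding affinityConfig entry (keys follow the spawner_ui_config format; the node label key and value here are assumptions for illustration):

```yaml
affinityConfig:
  readOnly: false
  options:
    - configKey: "gpu_node"
      displayName: "Run on a GPU node"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node-type"   # assumed node label
                    operator: "In"
                    values:
                      - "gpu"
```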