Distributed TensorFlow training on Kubernetes with OpenPAI's FrameworkController
Overview:
After FrameworkLauncher, its YARN-based job scheduling tool, OpenPAI added FrameworkController, a Kubernetes-based job scheduling tool that feels similar to Kubeflow's TFJob. Let's first try using FrameworkController on its own.
Environment:
k8s: 1.15.1
docker: 18.09.5
There are no strict OS requirements; both Ubuntu and CentOS work. Other systems have not been verified.
Installing FrameworkController:
Create a Service Account for FrameworkController and bind it to the cluster-admin ClusterRole
kubectl create serviceaccount frameworkcontroller --namespace default
kubectl create clusterrolebinding frameworkcontroller \
--clusterrole=cluster-admin \
--user=system:serviceaccount:default:frameworkcontroller
Write the frameworkcontroller.yaml file
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the service account with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        #env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        #- name: KUBECONFIG
        #  value: {Pod Local KubeConfig File Path}
Create the framework controller
kubectl create -f frameworkcontroller.yaml
Check the creation result
kubectl get pod -n default
NAME READY STATUS RESTARTS AGE
frameworkcontroller-0 1/1 Running 0 30s
Testing FrameworkBarrier:
Create a Service Account and ClusterRole for FrameworkBarrier
kubectl create serviceaccount frameworkbarrier --namespace default
kubectl create clusterrole frameworkbarrier \
--verb=get,list,watch \
--resource=frameworks
kubectl create clusterrolebinding frameworkbarrier \
--clusterrole=frameworkbarrier \
--user=system:serviceaccount:default:frameworkbarrier
Write the frameworkbarrier.yaml file
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: frameworkbarrier
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 0
  taskRoles:
  - name: server
    taskNumber: 10
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            # Using /mnt/frameworkbarrier/injector.sh to inject environment variables,
            # such as:
            # FB_{UpperCase({TaskRoleName})}_IPS=
            #   {Task[0].PodIP},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}
            # FB_{UpperCase({TaskRoleName})}_ADDRESSES=
            #   {Task[0].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT}
            # Note, the environment variable FB_{UpperCase({TaskRoleName})}_PORT should be
            # provided by the caller in advance.
            #
            # User may need to tweak these environment variables to its own
            # input format.
            #
            # User can also write its own injector script to inject other
            # Framework information from the Framework object file:
            # /mnt/frameworkbarrier/framework.json.
            command: [
              "sh", "-c",
              "FB_SERVER_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh && printenv &&
              FB_SERVER_PORT=4002 FB_WORKER_PORT=5002 . /mnt/frameworkbarrier/injector.sh && printenv &&
              sleep 60"]
            ports:
            - containerPort: 4001
            - containerPort: 4002
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          # [PREREQUISITE]
          # User needs to create a service account in the same namespace of this
          # Framework with granted permission for frameworkbarrier, if the k8s
          # cluster enforces authorization.
          # For example, if the cluster enforces RBAC:
          #   kubectl create serviceaccount frameworkbarrier --namespace default
          #   kubectl create clusterrole frameworkbarrier \
          #     --verb=get,list,watch \
          #     --resource=frameworks
          #   kubectl create clusterrolebinding frameworkbarrier \
          #     --clusterrole=frameworkbarrier \
          #     --user=system:serviceaccount:default:frameworkbarrier
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            # Using official image to demonstrate this example.
            image: frameworkcontroller/frameworkbarrier
            # Using k8s inClusterConfig, so usually, no need to specify
            # KUBE_APISERVER_ADDRESS or KUBECONFIG
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
  - name: worker
    taskNumber: 10
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            command: [
              "sh", "-c",
              "FB_SERVER_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh && printenv &&
              FB_SERVER_PORT=4002 FB_WORKER_PORT=5002 . /mnt/frameworkbarrier/injector.sh && printenv &&
              sleep 60"]
            ports:
            - containerPort: 5001
            - containerPort: 5002
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          # [PREREQUISITE]
          # Same as server TaskRole.
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            image: frameworkcontroller/frameworkbarrier
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
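The address format that /mnt/frameworkbarrier/injector.sh injects (described in the comments above) can be sketched in Python. This is a hypothetical simulation for illustration only; the real injector script reads the pod IPs from the Framework object rather than taking them as arguments:

```python
# Sketch of the FB_{ROLE}_IPS / FB_{ROLE}_ADDRESSES variables that the
# barrier injector provides. Hypothetical simulation: the real
# injector.sh resolves pod IPs from /mnt/frameworkbarrier/framework.json.
def inject(task_role_name, pod_ips, port):
    role = task_role_name.upper()
    return {
        f"FB_{role}_IPS": ",".join(pod_ips),
        f"FB_{role}_ADDRESSES": ",".join(f"{ip}:{port}" for ip in pod_ips),
    }

env = inject("server", ["10.244.1.2", "10.244.2.3"], 4001)
print(env["FB_SERVER_ADDRESSES"])  # 10.244.1.2:4001,10.244.2.3:4001
```

Note that the port is supplied by the caller (FB_SERVER_PORT=4001 in the command above), which is why the same injector run can produce different address lists for different ports.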
Deploy FrameworkBarrier
kubectl apply -f frameworkbarrier.yaml -n default
Verify
kubectl get pod -n default
NAME READY STATUS RESTARTS AGE
frameworkbarrier-server-0 0/1 Init:0/1 0 8s
frameworkbarrier-server-1 0/1 Init:0/1 0 8s
frameworkbarrier-server-2 0/1 Init:0/1 0 8s
frameworkbarrier-server-3 0/1 Init:0/1 0 8s
frameworkbarrier-server-4 0/1 Init:0/1 0 8s
frameworkbarrier-server-5 0/1 Init:0/1 0 8s
frameworkbarrier-server-6 0/1 Init:0/1 0 8s
frameworkbarrier-server-7 0/1 Init:0/1 0 8s
frameworkbarrier-server-8 0/1 Init:0/1 0 8s
frameworkbarrier-server-9 0/1 Init:0/1 0 8s
frameworkbarrier-worker-0 0/1 Init:0/1 0 8s
frameworkbarrier-worker-1 0/1 Init:0/1 0 7s
frameworkbarrier-worker-2 0/1 Init:0/1 0 7s
frameworkbarrier-worker-3 0/1 Init:0/1 0 7s
frameworkbarrier-worker-4 0/1 Init:0/1 0 7s
frameworkbarrier-worker-5 0/1 Init:0/1 0 7s
frameworkbarrier-worker-6 0/1 Init:0/1 0 6s
frameworkbarrier-worker-7 0/1 Init:0/1 0 6s
frameworkbarrier-worker-8 0/1 Init:0/1 0 6s
frameworkbarrier-worker-9 0/1 Init:0/1 0 6s
frameworkcontroller-0 1/1 Running 0 1m55s
List the frameworks
kubectl get framework -n default
NAME AGE
frameworkbarrier 69s
Delete the frameworkbarrier
kubectl delete framework frameworkbarrier -n default
Running a distributed TensorFlow job
A distributed job needs a distributed file system. I use NFS here, and the setup is simple: first create a PV and a PVC.
data-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-volume
  labels:
    pv: data-volume
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    server: 192.168.0.10
    path: /data3/nfs-data/pai/
data-volume-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      pv: data-volume
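The PVC is steered to our NFS PV by its label selector: the claim's selector.matchLabels must be a subset of the PV's labels. A minimal sketch of that matching rule (simplified; real binding also checks capacity, access modes, and storage class):

```python
# Simplified sketch of how a PVC's selector.matchLabels picks a PV.
# Real Kubernetes binding also considers capacity, accessModes and
# storageClassName; this only shows the label-subset rule.
def selector_matches(pv_labels, match_labels):
    return all(pv_labels.get(k) == v for k, v in match_labels.items())

pv_labels = {"pv": "data-volume"}
print(selector_matches(pv_labels, {"pv": "data-volume"}))  # True
print(selector_matches(pv_labels, {"pv": "other"}))        # False
```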
Create the PV and PVC
kubectl apply -f data-volume.yaml
kubectl apply -f data-volume-pvc.yaml
Check the creation result
kubectl get pv,pvc -n default
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/data-volume 20Gi RWO Retain Bound default/data-volume 89s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/data-volume Bound data-volume 20Gi RWO 89s
Write tensorflowdistributedtrainingwithcpu.yaml; I use the CPU version here
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: tensorflowdistributedtrainingwithcpu
  namespace: default
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 2
  taskRoles:
  - name: ps
    taskNumber: 2
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          # [PREREQUISITE]
          # User needs to setup the k8s cluster networking model and aware the
          # potential network overhead, if he want to disable the hostNetwork to
          # avoid the coordination of the containerPort usage.
          # And for this example, if the hostNetwork is disabled, it only needs
          # at least 1 node, otherwise, it needs at least 3 nodes since all the
          # 3 workers are specified with the same containerPort.
          # See https://kubernetes.io/docs/concepts/cluster-administration/networking
          hostNetwork: false
          containers:
          - name: tensorflow
            # Using official image to demonstrate this example.
            # The image contains and only contains tensorflow official code.
            image: frameworkcontroller/tensorflow-examples:cpu
            # For the tf_cnn_benchmarks usage, see
            # https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
            workingDir: /tensorflow/benchmarks/scripts/tf_cnn_benchmarks
            # Using /mnt/frameworkbarrier/injector.sh to inject environment variables
            # without the need for image invasion and k8s DNS:
            # FB_{UpperCase({TaskRoleName})}_ADDRESSES=
            #   {Task[0].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT}
            # See more in ./example/framework/extension/frameworkbarrier.yaml
            command: [
              "sh", "-c",
              "FB_PS_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh &&
              python tf_cnn_benchmarks.py --job_name=ps --task_index=${FC_TASK_INDEX}
              --ps_hosts=${FB_PS_ADDRESSES} --worker_hosts=${FB_WORKER_ADDRESSES}
              --variable_update=parameter_server --cross_replica_sync=false
              --model=alexnet --batch_size=8 --num_batches=10
              --device=cpu --local_parameter_device=cpu --data_format=NHWC
              --data_name=cifar10 --data_dir=/mnt/data/cifar-10-batches-py
              --train_dir=/mnt/data/${FC_FRAMEWORK_NAME}/output"]
            ports:
            - containerPort: 4001
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
            - name: data-volume
              mountPath: /mnt/data
          # [PREREQUISITE]
          # User needs to create a service account for frameworkbarrier, if the
          # k8s cluster enforces authorization.
          # See more in ./example/framework/extension/frameworkbarrier.yaml
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            # Using official image to demonstrate this example.
            image: frameworkcontroller/frameworkbarrier
            # Using k8s inClusterConfig, so usually, no need to specify
            # KUBE_APISERVER_ADDRESS or KUBECONFIG
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
          - name: data-volume
            persistentVolumeClaim: # pvc
              claimName: data-volume
            # [PREREQUISITE]
            # User needs to specify his own data-volume for input data and
            # output model.
            # The data-volume must be a distributed shared file system, so that
            # data can be "handed off" between Pods, such as nfs, cephfs or
            # glusterfs, etc.
            # See https://kubernetes.io/docs/concepts/storage/volumes.
            #
            # And then he needs to download and extract the example input data
            # from:
            #   https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
            # to:
            #   {Volume Shared Directory}/cifar-10-batches-py
            #
            # For example:
            #nfs:
            #  server: {NFS Server Host}
            #  path: {NFS Shared Directory}
  - name: worker
    taskNumber: 3
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      # Succeed the FrameworkAttempt immediately if worker's all Tasks succeeded.
      minSucceededTaskCount: 3
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          # [PREREQUISITE]
          # Same as ps TaskRole.
          hostNetwork: false
          containers:
          - name: tensorflow
            image: frameworkcontroller/tensorflow-examples:cpu
            workingDir: /tensorflow/benchmarks/scripts/tf_cnn_benchmarks
            command: [
              "sh", "-c",
              "FB_PS_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh &&
              python tf_cnn_benchmarks.py --job_name=worker --task_index=${FC_TASK_INDEX}
              --ps_hosts=${FB_PS_ADDRESSES} --worker_hosts=${FB_WORKER_ADDRESSES}
              --variable_update=parameter_server --cross_replica_sync=false
              --model=alexnet --batch_size=8 --num_batches=10
              --device=cpu --local_parameter_device=cpu --data_format=NHWC
              --data_name=cifar10 --data_dir=/mnt/data/cifar-10-batches-py
              --train_dir=/mnt/data/${FC_FRAMEWORK_NAME}/output"]
            ports:
            - containerPort: 5001
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
            - name: data-volume
              mountPath: /mnt/data
          # [PREREQUISITE]
          # Same as ps TaskRole.
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            image: frameworkcontroller/frameworkbarrier
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
          - name: data-volume
            persistentVolumeClaim: # pvc
              claimName: data-volume
            # [PREREQUISITE]
            # Same as ps TaskRole.
            #nfs:
            #  server: {NFS Server Host}
            #  path: {NFS Shared Directory}
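In the manifest above, tf_cnn_benchmarks receives the injected addresses via --ps_hosts and --worker_hosts. Conceptually, each task then splits those comma-separated lists into the cluster definition it uses to find its peers. An illustrative sketch of that step (not the benchmark's actual code; the default addresses here are made-up placeholders):

```python
import os

# Illustrative sketch: build the cluster dict (the shape fed to
# tf.train.ClusterSpec) from the barrier-injected env vars.
# The setdefault values are made-up placeholders for running this
# sketch outside a cluster; not the benchmark's actual code.
os.environ.setdefault("FB_PS_ADDRESSES", "10.244.1.2:4001,10.244.1.3:4001")
os.environ.setdefault("FB_WORKER_ADDRESSES",
                      "10.244.2.2:5001,10.244.2.3:5001,10.244.2.4:5001")

cluster = {
    "ps": os.environ["FB_PS_ADDRESSES"].split(","),
    "worker": os.environ["FB_WORKER_ADDRESSES"].split(","),
}
print(len(cluster["ps"]), len(cluster["worker"]))  # 2 3
```

This is why the ps TaskRole has taskNumber: 2 and worker has taskNumber: 3 in the Framework spec: the injected address lists contain exactly one entry per task, and each task identifies itself within the list via FC_TASK_INDEX.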
Before creating the job, download the dataset into the mounted NFS path
cd /data3/nfs-data/pai
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -zxvf cifar-10-python.tar.gz
Create the job
kubectl apply -f tensorflowdistributedtrainingwithcpu.yaml -n default
Check the pods
kubectl get pod -n default
NAME READY STATUS RESTARTS AGE
frameworkcontroller-0 1/1 Running 0 8h
tensorflowdistributedtrainingwithcpu-ps-0 0/1 Init:0/1 0 4s
tensorflowdistributedtrainingwithcpu-ps-1 0/1 Init:0/1 0 4s
tensorflowdistributedtrainingwithcpu-worker-0 0/1 Init:0/1 0 4s
tensorflowdistributedtrainingwithcpu-worker-1 0/1 Init:0/1 0 4s
tensorflowdistributedtrainingwithcpu-worker-2 0/1 Init:0/1 0 4s
Check the framework
kubectl get framework -n default
NAME AGE
tensorflowdistributedtrainingwithcpu 35s
After a few minutes, once all the pods have completed successfully, they exit, and the training results appear in the NFS-mounted path, saved in a directory named after the job. This concludes the deployment and usage of FrameworkController.
For questions, join QQ group: 526855734