Kubernetes is an open-source system for managing containerized applications across multiple hosts on a cloud platform. Its goal is to make deploying containerized applications simple and powerful, and it provides mechanisms for deploying, scheduling, updating, and maintaining applications.

Preparation
Docker: this setup runs Kubernetes on top of Docker, so install Docker first (a command sketch for this and the swap step follows the list).
Disable swap: run sudo swapoff -a; to disable it permanently, open /etc/fstab with sudo vim /etc/fstab and comment out the swap line.
A proxy tool such as Shadowsocks: some of the resources needed later, including the Docker images, are hosted on Google's infrastructure, so you need a way through the firewall.
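A rough sketch of the two preparation steps on Ubuntu (docker.io is the distribution package; Docker's own docker-ce packages also work):

# Install Docker and start it at boot
sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker

# Turn swap off now, and comment out the swap entry in /etc/fstab so it stays off after a reboot
sudo swapoff -a
sudo sed -i -e '/\sswap\s/ s/^#*/#/' /etc/fstab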

Set up an HTTP proxy
Running Shadowsocks usually gives you a SOCKS5 proxy, so we also need an HTTP-to-SOCKS5 bridge; here we use Privoxy.
Install Privoxy first:

sudo apt-get install privoxy
Configure Privoxy: open /etc/privoxy/config and append the following after the last line:

forward-socks5 / 127.0.0.1:1080 .
listen-address 127.0.0.1:8008
This forwards all requests to the local SOCKS5 proxy on port 1080, while Privoxy itself listens on port 8008.
Then restart Privoxy:

sudo service privoxy restart
Then you can use

export http_proxy=http://127.0.0.1:8008
export https_proxy=http://127.0.0.1:8008
to reach resources outside the firewall. Test it with curl https://google.com; if the proxy is configured correctly, the output will be:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
Install kubeadm
There are several ways to install Kubernetes; since this is a single-machine deployment, the official kubeadm tool is the simplest and fastest.
The next few steps use proxychains, which lets us use the SOCKS5 proxy directly from the terminal; here we use proxychains-ng (the next-generation proxychains).

git clone https://github.com/rofl0r/proxychains-ng.git
cd proxychains-ng
./configure --prefix=/usr --sysconfdir=/etc
make
make install
make install-config   # installs the proxychains.conf configuration file
To use it, put proxychains4 in front of any command that needs the proxy, for example:

proxychains4 wget https://google.com
The next step is to download and add the signing key for the Kubernetes packages:

proxychains4 curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
Configure the Kubernetes apt source:

sudo touch /etc/apt/sources.list.d/kubernetes.list
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
Install kubeadm, kubelet, and the other dependencies:

proxychains4 apt-get update
proxychains4 apt-get install -y kubelet kubeadm kubectl kubernetes-cni
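Optionally, you can pin these packages so a routine apt upgrade doesn't move the cluster to a new version unexpectedly:

sudo apt-mark hold kubelet kubeadm kubectl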
Initialize the cluster with kubeadm init
Open a terminal and set the HTTP proxy environment variables first; proxychains4 is of little use here (and the image pulls are done by the Docker daemon anyway).

export http_proxy=http://127.0.0.1:8008
export https_proxy=http://127.0.0.1:8008
export no_proxy=192.168.1.118 # your machine's IP address
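Depending on your environment, it can also help to keep in-cluster traffic off the proxy by adding the service and pod CIDRs to no_proxy (10.96.0.0/12 is kubeadm's default service CIDR; 172.16.0.0/16 matches the pod CIDR passed to kubeadm init below):

export no_proxy=192.168.1.118,127.0.0.1,localhost,10.96.0.0/12,172.16.0.0/16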
We also need to configure a proxy for Docker itself, because the images are hosted on Google's infrastructure. Note that there are two kinds of proxy here: one for the Docker client and one for the Docker daemon (server); don't mix them up. What we configure is the daemon's proxy, since pulling images is done by the daemon, and the proxy address is the Privoxy address from above:

# Create a systemd drop-in directory for the docker service
sudo mkdir -p /etc/systemd/system/docker.service.d

# Create /etc/systemd/system/docker.service.d/http-proxy.conf with the following content
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8008/"

# Create /etc/systemd/system/docker.service.d/https-proxy.conf with the following content
[Service]
Environment="HTTPS_PROXY=http://127.0.0.1:8008/"

# Reload systemd so the changes take effect
sudo systemctl daemon-reload

# Restart the docker service
sudo systemctl restart docker
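You can check that the daemon actually picked up the proxy settings with:

systemctl show --property=Environment docker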
Now run kubeadm init. Before running it, decide which Pod network add-on you are going to use; here we choose Calico:

kubeadm init --pod-network-cidr=172.16.0.0/16
To watch the logs throughout the process, you can use

journalctl -xeu kubelet
It may not succeed on the first attempt. If a previous attempt left state behind, kubeadm reset cleans it up. Once everything is configured correctly, you can re-run kubeadm init with a flag that skips all the preflight checks:

kubeadm init --pod-network-cidr=172.16.0.0/16 --ignore-preflight-errors=all
If initialization succeeds, you will see output like:

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
......
Follow the prompt and run:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
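At this point kubectl can talk to the cluster; the node will show up as NotReady until a network add-on is installed:

kubectl get nodes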
List all the pods with kubectl get pods --all-namespaces:

NAMESPACE     NAME                                   READY     STATUS    RESTARTS   AGE
kube-system   coredns-78fcdf6894-5h7tl               0/1       Pending   0          1h
kube-system   coredns-78fcdf6894-z7vcj               0/1       Pending   0          1h
kube-system   etcd-salamanderpc                      1/1       Running   0          1h
kube-system   kube-apiserver-salamanderpc            1/1       Running   1          1h
kube-system   kube-controller-manager-salamanderpc   1/1       Running   1          1h
kube-system   kube-proxy-brgdx                       1/1       Running   0          1h
kube-system   kube-scheduler-salamanderpc            1/1       Running   1          1h
The coredns pods are still Pending; that's fine, we just haven't installed a Pod network add-on yet. Here we install Calico.
First, install the etcd instance Calico uses:

kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/etcd.yaml
Output:

daemonset "calico-etcd" created
service "calico-etcd" created
Install Calico's RBAC roles:

kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/rbac.yaml
Output:

clusterrole.rbac.authorization.k8s.io "calico-kube-controllers" created
clusterrolebinding.rbac.authorization.k8s.io "calico-kube-controllers" created
clusterrole.rbac.authorization.k8s.io "calico-node" created
clusterrolebinding.rbac.authorization.k8s.io "calico-node" created
Then install Calico itself:

kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/calico.yaml
Output:

configmap "calico-config" created
secret "calico-etcd-secrets" created
daemonset.extensions "calico-node" created
serviceaccount "calico-node" created
deployment.extensions "calico-kube-controllers" created
serviceaccount "calico-kube-controllers" created
Wait for all the pods to become Running:

watch kubectl get pods --all-namespaces
This takes a minute or two:

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE
kube-system   calico-etcd-l9zrs                          1/1       Running   0          1m
kube-system   calico-kube-controllers-65945f849d-kpndn   1/1       Running   0          1m
kube-system   calico-node-5bb4d                          2/2       Running   0          1m
kube-system   coredns-78fcdf6894-5pjcn                   1/1       Running   0          3m
kube-system   coredns-78fcdf6894-f5wtd                   1/1       Running   0          3m
kube-system   etcd-salamanderpc                          1/1       Running   0          2m
kube-system   kube-apiserver-salamanderpc                1/1       Running   0          2m
kube-system   kube-controller-manager-salamanderpc       1/1       Running   0          2m
kube-system   kube-proxy-f6kxr                           1/1       Running   0          3m
kube-system   kube-scheduler-salamanderpc                1/1       Running   0          2m
Deploy a service
Because this is a single node, we would normally need to join worker nodes to run the real workloads; for testing, though, we can run

kubectl taint nodes --all node-role.kubernetes.io/master-
to remove the master taint and lift that restriction (never do this in production).
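You can confirm the taint is gone; the output should show Taints: <none>:

kubectl describe nodes | grep -i taints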

Create a new Deployment file, nginx_deployment.yaml, with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2 # tells deployment to run 2 pods matching the template
  selector: # apps/v1 requires an explicit selector matching the template labels
    matchLabels:
      app: nginx
  template: # create pods using pod definition in this template
    metadata:
      # unlike pod-nginx.yaml, the name is not included in the meta data as a unique name is
      # generated from the deployment name
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.0
        ports:
        - containerPort: 80
A Deployment is the newer object for managing Pods; compared with the Replication Controller it offers more complete functionality (such as rolling updates and rollbacks) and is simpler to use.
Then create the Deployment:

kubectl create -f nginx_deployment.yaml
This creates two pods, each with container port 80 open.

Check the Deployment:

kubectl get deployment
Check the pods that were created (there are two):

kubectl get pods
Output:

NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-67594d6bf6-bwnlz   1/1       Running   0          39m
nginx-deployment-67594d6bf6-frrdx   1/1       Running   0          39m
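One thing the Deployment gives you over a bare Replication Controller is a declarative rolling update; for example, changing the image (nginx:1.15.0 here is only an illustrative tag) rolls the pods over one by one and can be undone:

kubectl set image deployment/nginx-deployment nginx=nginx:1.15.0
kubectl rollout status deployment/nginx-deployment
# roll back if the new version misbehaves
kubectl rollout undo deployment/nginx-deployment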
To be able to access the nginx pods from outside, we need to define a Service (nginx-service.yaml):

kind: Service
apiVersion: v1
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 9898
      targetPort: 80
Create the Service:

kubectl create -f nginx-service.yaml
The Service above exposes port 9898 (as a ClusterIP, so it is reachable from within the cluster and from the node itself; use type NodePort or LoadBalancer for truly external access).
Check the created Service:

kubectl get svc
Output:

NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP    3h
nginx-service   ClusterIP   10.101.10.236   <none>        9898/TCP   23m
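Since it is a ClusterIP Service, you can do a quick check from the node itself (substitute the CLUSTER-IP shown in your own output):

curl http://10.101.10.236:9898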


FfDL

Project: https://github.com/IBM/FfDL

Install Helm and set up Tiller:

# Install Tiller into the cluster
helm init
# Give Tiller a service account with cluster-admin rights, then point the Tiller deployment at it
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
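Once the tiller-deploy pod is Running, helm version should report both a client and a server version:

kubectl -n kube-system get pods | grep tiller
helm version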

kubectl config set-context $(kubectl config current-context) --namespace=ivdai

export VM_TYPE=none
export PUBLIC_IP=<Cluster Public IP>
export NAMESPACE=default


Create an NFS share for the PersistentVolume:

# Create the shared directory
sudo mkdir -p /data-nfs

# Install NFS kernel server
sudo apt update
sudo apt install -y nfs-kernel-server

# Update /etc/exports
echo "/data-nfs *(rw,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports

# Restart NFS kernel server
sudo service nfs-kernel-server restart
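A quick way to confirm the export is active (showmount comes with the nfs-common package):

showmount -e localhost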

test_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels: 
    type: dlaas-static-volume
spec:
  capacity: 
    storage: 200Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /data-nfs 
    server: 192.168.8.110   # address of the NFS server set up above (this machine)

kubectl create -f test_pv.yaml
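A quick sanity check that the volume registered:

kubectl get pv pv0001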

helm install .   # run from the root of the FfDL repository checkout, where the chart lives
kubectl config set-context $(kubectl config current-context) --namespace=$NAMESPACE
kubectl get pods
# NAME                                 READY     STATUS    RESTARTS   AGE
# alertmanager-7cf6b988b9-h9q6q        1/1       Running   0          5h
# etcd0                                1/1       Running   0          5h
# ffdl-lcm-65bc97bcfd-qqkfc            1/1       Running   0          5h
# ffdl-restapi-8777444f6-7jfcf         1/1       Running   0          5h
# ffdl-trainer-768d7d6b9-4k8ql         1/1       Running   0          5h
# ffdl-trainingdata-866c8f48f5-ng27z   1/1       Running   0          5h
# ffdl-ui-5bf86cc7f5-zsqv5             1/1       Running   0          5h
# mongo-0                              1/1       Running   0          5h
# prometheus-5f85fd7695-6dpt8          2/2       Running   0          5h
# pushgateway-7dd8f7c86d-gzr2g         2/2       Running   0          5h
# storage-0                            1/1       Running   0          5h

node_ip=$PUBLIC_IP
grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}')
ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}')
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
s3_port=$(kubectl get service minio -o jsonpath='{.spec.ports[0].nodePort}')

echo "Monitoring dashboard: http://$node_ip:$grafana_port/ (login: admin/admin)"
echo "Web UI: http://$node_ip:$ui_port/#/login?endpoint=$node_ip:$restapi_port&username=test-user"

Using FfDL Local S3 Based Object Storage

node_ip=$PUBLIC_IP
s3_port=$(kubectl get service minio -o jsonpath='{.spec.ports[0].nodePort}')
s3_url=http://$node_ip:$s3_port

export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password; export AWS_DEFAULT_REGION=us-east-1;

s3cmd="aws --endpoint-url=$s3_url s3"
$s3cmd mb s3://trainingdata
$s3cmd mb s3://trainedmodel
$s3cmd mb s3://mnist_lmdb_data
$s3cmd mb s3://dlaas-trained-models

mkdir tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
  test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
  $s3cmd cp tmp/$file s3://trainingdata/$file
done

restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
else
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
fi

CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
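After the job is submitted, new pods for the training run (learner and helper pods) should appear alongside the FfDL services; you can watch for them with:

watch kubectl get pods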

Using Cloud Object Storage

export AWS_ACCESS_KEY_ID=mos
export AWS_SECRET_ACCESS_KEY=mos
s3_url=http://120.79.11.211:8080
s3cmd="aws --endpoint-url=$s3_url s3"

trainingDataBucket=<unique bucket name for training data storage>
trainingResultBucket=<unique bucket name for training result storage>

$s3cmd mb s3://$trainingDataBucket
$s3cmd mb s3://$trainingResultBucket

mkdir tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
  test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
  $s3cmd cp tmp/$file s3://$trainingDataBucket/$file
done

if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/tf_training_data/$trainingDataBucket/ etc/examples/tf-model/manifest.yml
  sed -i '' s/tf_trained_model/$trainingResultBucket/ etc/examples/tf-model/manifest.yml
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
  sed -i '' s/user_name: test/user_name: $AWS_ACCESS_KEY_ID/ etc/examples/tf-model/manifest.yml
  sed -i '' s/password: test/password: $AWS_SECRET_ACCESS_KEY/ etc/examples/tf-model/manifest.yml
else
  sed -i s/tf_training_data/$trainingDataBucket/ etc/examples/tf-model/manifest.yml
  sed -i s/tf_trained_model/$trainingResultBucket/ etc/examples/tf-model/manifest.yml
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
  sed -i s/user_name: test/user_name: $AWS_ACCESS_KEY_ID/ etc/examples/tf-model/manifest.yml
  sed -i s/password: test/password: $AWS_SECRET_ACCESS_KEY/ etc/examples/tf-model/manifest.yml
fi

restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model


# Notes on rebuilding the ffdl-restapi image (user-lyh is the author's personal tag):
docker build -q -t docker.io/ffdl/ffdl-restapi:user-lyh .
# The image packages the restapi binary, which is cross-compiled for Linux like this:
(cd ./restapi/ && (test ! -e main.go || CGO_ENABLED=0 GOOS=linux go build -ldflags "-s -w" -a -installsuffix cgo -o bin/main))


# Point the running ffdl-restapi Deployment at a new image tag
kubectl set image deploy ffdl-restapi ffdl-restapi-container=ffdl/ffdl-restapi:v0.1.1
 
