k8s (Ubuntu) Single-Node Deployment Notes (筑梦之路)
K8s Single-Node Deployment Summary (2021-04-15)
Environment:
GPU: GTX 1660
GPU driver: nvidia-driver 440.100
CUDA version: 10.2
cuDNN version: 8.0
Kubernetes version: 1.21.0
docker-ce version: 19.03
OS: Ubuntu Server 16.04 LTS
IP address: 192.168.30.18
System base environment
1. Set the hostname and add a hosts entry
hostnamectl set-hostname k8s-master
echo "192.168.30.18 k8s-master" >> /etc/hosts
# Disable swap
swapoff -a
# Comment out the swap line in /etc/fstab, then verify:
free -h
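The fstab edit can be scripted with sed. A minimal sketch, run here against a sample file in /tmp; on the real machine point it at /etc/fstab as root:

```shell
# Create a sample fstab with a swap entry (stand-in for /etc/fstab)
cat > /tmp/fstab.sample <<'EOF'
UUID=1234-abcd / ext4 errors=remount-ro 0 1
/swapfile none swap sw 0 0
EOF
# Prefix any line containing a whitespace-delimited "swap" field with '#'
sed -i '/\sswap\s/ s/^/#/' /tmp/fstab.sample
grep swap /tmp/fstab.sample
```

After `swapoff -a` plus this edit, `free -h` should report 0 swap across reboots.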
2. Install the NVIDIA driver
# Install build tools
sudo apt-get update && sudo apt-get install -y gcc make
# Blacklist the default nouveau driver:
vim /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
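The five blacklist lines above can be written in one shot with a heredoc. Sketched here against /tmp; on the real machine write to /etc/modprobe.d/blacklist-nouveau.conf as root:

```shell
# Write the nouveau blacklist file in a single command
cat > /tmp/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
# Sanity check: two blacklist entries
grep -c '^blacklist' /tmp/blacklist-nouveau.conf
```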
sudo update-initramfs -u
reboot   # reboot so the blacklist takes effect
lsmod | grep nouveau   # should produce no output
# Install the NVIDIA driver
chmod +x NVIDIA-Linux*.run
./NVIDIA-Linux*.run
3. Install CUDA
chmod +x cuda-*.run
./cuda-*.run
4. Install cuDNN
tar -zxvf cudnn*.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include/
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
chmod a+r /usr/local/cuda/include/cudnn.h
chmod a+r /usr/local/cuda/lib64/libcudnn*
5. Install Docker
Install docker-ce from the Aliyun mirror
apt-get update
apt-get -y install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
apt-get -y update
apt-get -y install docker-ce
Add the current non-root user to the docker group:
sudo groupadd docker   # create the docker group
sudo gpasswd -a $USER docker   # add the login user to the docker group
newgrp docker   # refresh group membership
docker ps   # verify docker works without sudo
# To install a specific docker-ce version instead:
apt-cache madison docker-ce
apt-get install docker-ce=5:19.03.15~3-0~ubuntu-xenial   # use a version string exactly as printed by apt-cache madison
Configure Docker
cat /etc/docker/daemon.json
(the ## annotations below are explanatory only; the real daemon.json must be plain JSON with no comments)
{
"default-runtime": "nvidia", ## use the NVIDIA runtime by default
"insecure-registries": ["192.168.30.18:5000"], ## internal docker registry address
"registry-mirrors": ["https://bqbd2ve6.mirror.aliyuncs.com"], ## domestic registry mirror
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
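Since daemon.json must be strict JSON, it is worth validating before restarting dockerd. A sketch using a sample copy in /tmp; on the real machine validate /etc/docker/daemon.json itself:

```shell
# Write the daemon.json content (comment-free) to a sample file
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "insecure-registries": ["192.168.30.18:5000"],
  "registry-mirrors": ["https://bqbd2ve6.mirror.aliyuncs.com"],
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
# json.tool exits non-zero on any syntax error (e.g. stray comments)
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "valid JSON"
```

A malformed daemon.json prevents dockerd from starting at all, so this check saves a debugging round-trip.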
# Alternative (not yet verified): set the default runtime on the dockerd command line instead
vim /lib/systemd/system/docker.service
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia
###################################################
Installing Kubernetes
1. Add the apt source
curl https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | apt-key add -
cat /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
sudo apt-get update
2. Install kubeadm
apt-get install -y kubelet kubeadm kubectl
# check the installed version
kubelet --version
3. List the required images
kubeadm config images list --kubernetes-version=v1.21.0
k8s.gcr.io/kube-apiserver:v1.21.0
k8s.gcr.io/kube-controller-manager:v1.21.0
k8s.gcr.io/kube-scheduler:v1.21.0
k8s.gcr.io/kube-proxy:v1.21.0
k8s.gcr.io/pause:3.4.1
k8s.gcr.io/etcd:3.4.13-0
k8s.gcr.io/coredns/coredns:v1.8.0
# Pull the images from a domestic mirror via a script, k8s_get.sh:
#!/bin/bash
KUBE_VERSION=v1.21.0
KUBE_PAUSE_VERSION=3.4.1
ETCD_VERSION=3.4.13-0
DNS_VERSION=v1.8.0
username=registry.cn-hangzhou.aliyuncs.com/google_containers
images=(kube-proxy:${KUBE_VERSION}
kube-scheduler:${KUBE_VERSION}
kube-controller-manager:${KUBE_VERSION}
kube-apiserver:${KUBE_VERSION}
pause:${KUBE_PAUSE_VERSION}
etcd:${ETCD_VERSION}
coredns:${DNS_VERSION}
)
for image in "${images[@]}"
do
  docker pull ${username}/${image}
  docker tag ${username}/${image} k8s.gcr.io/${image}
  docker rmi ${username}/${image}
done
# since v1.21 coredns lives under a subpath upstream, so retag it:
docker tag k8s.gcr.io/coredns:${DNS_VERSION} k8s.gcr.io/coredns/coredns:${DNS_VERSION}
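The loop only rewrites the registry prefix; the repository name and tag stay the same. The mapping can be previewed without touching Docker at all, shown here for two sample image names:

```shell
# Print the pull/tag commands the script would run for each image
mirror=registry.cn-hangzhou.aliyuncs.com/google_containers
for image in kube-proxy:v1.21.0 pause:3.4.1; do
  echo "docker pull ${mirror}/${image}"
  echo "docker tag ${mirror}/${image} k8s.gcr.io/${image}"
done
```

Piping such a preview through a dry run first makes it easy to spot images whose upstream path differs (like coredns on v1.21) before pulling.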
Any images that still fail to pull must be corrected manually.
Initialize k8s:
kubeadm init --apiserver-advertise-address=192.168.30.18 --pod-network-cidr=10.244.0.0/16 --kubernetes-version=v1.21.0
##########################
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.30.18:6443 --token kf2mr2.s2wz5q1vyr4nyom0 \
--discovery-token-ca-cert-hash sha256:b78d180c56b7ace538f57f66099424c8f05a70da087d75b6f13059f87840db7c
############################
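The sha256 value in `--discovery-token-ca-cert-hash` above is the SHA-256 digest of the cluster CA's DER-encoded public key. The sketch below recomputes such a hash on a throwaway self-signed certificate; on the master, run the same pipeline against /etc/kubernetes/pki/ca.crt to verify the join command:

```shell
# Generate a throwaway CA cert (stand-in for /etc/kubernetes/pki/ca.crt)
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-ca.key \
  -out /tmp/demo-ca.crt -days 1 -subj "/CN=demo-ca" 2>/dev/null
# Extract the public key, convert to DER, and hash it
openssl x509 -pubkey -in /tmp/demo-ca.crt -noout \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex
```

This is handy when the original `kubeadm init` output has been lost: the hash can always be rederived from the CA cert, and a fresh token issued with `kubeadm token create`.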
As root:
export KUBECONFIG=/etc/kubernetes/admin.conf
Create the flannel network:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Check node and pod status:
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get pods -n kube-system
kubectl describe node
# Remove the master taint so this single control-plane node can also schedule workloads
kubectl taint node k8s-master node-role.kubernetes.io/master-
------------------------------------------------------------------
Using GPU resources in k8s
k8s-device-plugin github: https://github.com/NVIDIA/k8s-device-plugin
Notes:
Since Kubernetes 1.8, Device Plugins are the officially recommended way to consume GPUs.
The NVIDIA driver must be pre-installed on the node, and the NVIDIA Device Plugin should be deployed as a DaemonSet; only then can Kubernetes discover the nvidia.com/gpu resource.
Because the device plugin exposes GPUs as extended resources, containers requesting GPUs should note that the resulting resource QoS is Guaranteed.
Containers still cannot share a single GPU. Each container may request multiple GPUs, but GPU fractions are not supported.
The node must have nvidia-docker >= 2.0
The GPU driver version must be greater than 361.93
Configure nvidia as Docker's default runtime
Edit /etc/docker/daemon.json and add the "default-runtime": "nvidia" key; the file should then look like this:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Many tutorials say the Accelerators feature gate must be set to true to use GPUs, but that flag was deprecated after 1.11. Likewise, enabling DevicePlugins was only needed before 1.9; since 1.10 it defaults to true.
————————————————
Copyright notice: the section above is from CSDN blogger "hunyxv", under the CC 4.0 BY-SA license; reproduction requires the original link and this notice.
Original link: https://blog.csdn.net/hunyxv/article/details/92988788
Run on the master:
#kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
# pick the plugin release that matches your setup:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
###############################################
Setting up a local registry
mkdir -p /data/registry-data
docker run -d -p 5000:5000 --name=registry --restart=always --privileged=true --log-driver=none -v /data/registry-data/:/var/lib/registry registry:2   # registry:2 stores its data under /var/lib/registry
docker tag gsm_living_check:latest 192.168.30.18:5000/gsm_living_check:latest
docker push 192.168.30.18:5000/gsm_living_check:latest
#####################################################
Application deployment YAML example 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zzqj-faceai-python
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zzqj-faceai-python
  template:
    metadata:
      labels:
        app: zzqj-faceai-python
    spec:
      containers:
      - name: zzqj-faceai-python   ## container name
        image: 192.168.30.18:5000/gsm_living_check   ## image from the local registry
        imagePullPolicy: Never   ## never pull; use the local image
        command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]   ## command run inside the container
        volumeMounts:
        - mountPath: /data/gsm-ai/   ## path inside the container
          name: data   ## volume name
        ports:
        - containerPort: 10100   ## container port
      volumes:
      - name: data   ## volume name
        hostPath:
          path: "/data"   ## host directory
---
apiVersion: v1
kind: Service
metadata:
  name: zzqj-faceai-python
spec:
  type: NodePort
  ports:
  - port: 10100
    nodePort: 30000   ## externally reachable port (NodePort range starts at 30000)
  selector:
    app: zzqj-faceai-python
#############
Application deployment YAML example 2: using the GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tf-gpu
  template:
    metadata:
      labels:
        app: tf-gpu
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:1.15.0-py3-jupyter
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1   ## number of GPUs requested
---
apiVersion: v1
kind: Service
metadata:
  name: tf-gpu
spec:
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30888
    name: jupyter
  selector:
    app: tf-gpu   ## must match the pod label (app:, not name:)
  type: NodePort
#################################
kubectl apply -f test-python.yaml
kubectl apply -f test-gpu.yaml
# query status
kubectl get node
kubectl get pod,svc,deploy --all-namespaces
kubectl describe node
# scale-out example
kubectl scale deploy/nginx-deployment --replicas=5