K8s GPU Cluster
References:
https://zhuanlan.zhihu.com/p/138554103
https://kubernetes.io/zh/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
https://blog.k8s.li/install-k8s-ubuntu18-04.html
https://www.cnblogs.com/kevingrace/p/12778066.html
https://fastzhong.com/posts/k8s-install-kubeadm/
-
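I. Pre-installation Setup
1. Check the GPU driver (it must already be installed): nvidia-smi
2. Check the hostname (it should already be set): hostname
3. Check the time zone and clock (they must be the same on every node): date
4. Check the sleep state (sleep should be disabled): systemctl status sleep.target
5. Check the swap partition (swap must be disabled for kubelet to work properly): free -mh
Turn it off: sudo swapoff -a
6. Load the br_netfilter kernel module: sudo modprobe br_netfilter
Confirm that the kernel has loaded br_netfilter: lsmod | grep br_netfilter
7. Enable bridge-nf-call-iptables. First make br_netfilter load at every boot: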
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
Then make sure net.bridge.bridge-nf-call-iptables is set to 1 in the sysctl configuration (so that iptables on the Linux node sees bridged traffic correctly):
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
Apply the configuration: sudo sysctl --system
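The checks in this section can be collected into one small script. The following is only a sketch under stated assumptions: the function names (swap_active, module_loaded, preflight) are hypothetical and not part of the original guide, and the script reads /proc/swaps, /proc/modules and /proc/sys directly instead of calling free, lsmod or sysctl. A kernel with br_netfilter built in rather than loaded as a module would be reported as missing.

```shell
#!/bin/sh
# Hypothetical preflight helpers. File paths are parameters so the checks
# can be exercised against sample files as well as the real /proc entries.

# swap_active FILE: succeed if FILE (normally /proc/swaps) lists a swap area.
swap_active() {
    # /proc/swaps holds one header line, then one line per active swap area.
    awk 'NR > 1 { n++ } END { exit (n > 0) ? 0 : 1 }' "$1"
}

# module_loaded NAME FILE: succeed if NAME is listed in FILE (normally /proc/modules).
module_loaded() {
    grep -q "^$1 " "$2"
}

# preflight: print one status line per check from this section.
preflight() {
    if swap_active /proc/swaps; then
        echo "swap: ON (run sudo swapoff -a)"
    else
        echo "swap: off"
    fi
    if module_loaded br_netfilter /proc/modules; then
        echo "br_netfilter: loaded"
    else
        echo "br_netfilter: not loaded (run sudo modprobe br_netfilter)"
    fi
    for key in bridge-nf-call-iptables bridge-nf-call-ip6tables; do
        echo "$key: $(cat "/proc/sys/net/bridge/$key" 2>/dev/null || echo unset)"
    done
}
```

Calling preflight on each node before running kubeadm gives a quick view of which of the steps above still need to be repeated.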
-
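II. Installing Docker
1. Refresh the package index (this only lists available updates; nothing is upgraded): sudo apt update
2. Install Docker: sudo apt install docker.io
3. Start Docker and enable it at boot: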
sudo systemctl start docker
sudo systemctl enable docker
4. Install nvidia-docker2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-docker2
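I hit a problem here (ldconfig reported "libcudnn.so.7 is not a symbolic link"), although it has no real impact. To fix it, recreate the links as symbolic links:
cd /usr/local/cuda/lib64/
ls -lha libcudnn*
sudo rm libcudnn.so
sudo rm libcudnn.so.7
sudo ln -s libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
Verify: sudo ldconfig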
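5. Configure the nvidia-docker runtime
Go to /etc/docker and make daemon.json writable: sudo chmod 777 daemon.json (this leaves the file world-writable; editing it with sudo instead avoids that)
Open the file and edit it directly:
# Edit /etc/docker/daemon.json so that it reads as follows:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Make it take effect: sudo pkill -SIGHUP dockerd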
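As an alternative to changing the permissions of daemon.json, the file can be generated elsewhere and installed with root ownership. This is a minimal sketch, assuming the file holds no other settings that need to be preserved; write_nvidia_daemon_json is a hypothetical helper name, and /tmp/daemon.json is an arbitrary scratch path.

```shell
#!/bin/sh
# write_nvidia_daemon_json FILE: write the runtime configuration from step 5 to FILE.
# Hypothetical helper. It overwrites FILE, so merge by hand if the existing
# daemon.json already contains other options.
write_nvidia_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Typical use on a node (root is needed only for the final copy and the reload):
#   write_nvidia_daemon_json /tmp/daemon.json
#   sudo install -m 644 /tmp/daemon.json /etc/docker/daemon.json
#   sudo pkill -SIGHUP dockerd
```

Installing with mode 644 keeps the file readable by everyone but writable only by root.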
-
III. Installing the K8s Master
1. Switch to a mirror repository:
Install the packages that let apt use repositories over HTTPS, then add Aliyun's official GPG key:
sudo apt-get update && sudo apt-get install -y ca-certificates curl software-properties-common apt-transport-https
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
Then add the Kubernetes apt source:
sudo tee /etc/apt/sources.list.d/kubernetes.list <<EOF
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
2. Install kubeadm, kubelet, and kubectl:
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
I hit an error here.
Reference: https://blog.csdn.net/bigdata_mining/article/details/98876435
It suggests the source line deb [arch=amd64] https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial stable main, but afterwards the error no longer appeared.
3. Fetch the kubeadm images
Pulling the images kept failing. First list the images that are required: kubeadm config images list
Then look at the images already in Docker: sudo docker images
Docker starts out with no images. Both sudo kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers and sudo swapoff -a && sudo kubeadm init --kubernetes-version=v1.21.3 --image-repository registry.aliyuncs.com/google_containers failed, so I followed https://www.cnblogs.com/kevingrace/p/12778066.html instead.
First pull each image from the mirror: docker pull registry.aliyuncs.com/google_containers/coredns:1.8.0
Then retag each one with the name kubeadm expects: docker tag registry.aliyuncs.com/google_containers/coredns:1.8.0 k8s.gcr.io/coredns/coredns:1.8.0
Then delete each mirror-named tag: docker rmi registry.aliyuncs.com/google_containers/coredns:1.8.0
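Doing the pull, tag and rmi by hand for every image gets tedious, so the three commands can be wrapped in a loop over the output of kubeadm config images list. The sketch below is not from the referenced article: mirror_image and retag_all are hypothetical names, and it assumes the mirror stores every image directly under google_containers with the same tag that kubeadm prints. That assumption may not hold for every image (the coredns tag in particular can differ), so any image that fails to pull still has to be handled manually.

```shell
#!/bin/sh
# mirror_image IMAGE: map an image name to the same name under the Aliyun mirror.
# Assumption: the mirror keeps images flat under google_containers, tag unchanged.
mirror_image() {
    echo "registry.aliyuncs.com/google_containers/${1##*/}"
}

# retag_all: read image names on stdin (one per line); for each one, pull it
# from the mirror, tag it with the name kubeadm expects, then remove the
# mirror-named tag. DOCKER defaults to "sudo docker"; set DOCKER=echo for a dry run.
retag_all() {
    while IFS= read -r image; do
        [ -n "$image" ] || continue
        mirror=$(mirror_image "$image")
        ${DOCKER:-sudo docker} pull "$mirror" &&
        ${DOCKER:-sudo docker} tag "$mirror" "$image" &&
        ${DOCKER:-sudo docker} rmi "$mirror"
    done
}

# Usage on the node:
#   kubeadm config images list | retag_all
```

Running it first with DOCKER=echo prints the commands without touching Docker, which makes it easy to compare the generated names against sudo docker images.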
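4. Initialize the master
Swap must be off at this point. Specifically: (1) comment out every swap entry in /etc/fstab; (2) reboot the machine with reboot; (3) run sudo swapoff -a.
Initialize: sudo kubeadm init --pod-network-cidr 172.16.0.0/16
The article says "a non-default CIDR is then used (it must differ from the CIDR of the host's LAN)". The reason is not spelled out there; presumably overlapping ranges would make pod addresses collide with LAN addresses when routing.
Copy the admin credentials to your regular (non-sudo) user so that it can run kubectl: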
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
If other errors occur, track each one down individually, then reset with sudo kubeadm reset
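The requirement that the pod CIDR differ from the host LAN's CIDR can be checked mechanically: two IPv4 ranges overlap exactly when they agree on the bits of the shorter prefix. Here is a small sketch of that test, assuming plain dotted-quad addresses without leading zeros; ip_to_int and cidr_overlap are hypothetical helper names, and 192.168.1.0/24 is only a guess at the LAN range based on the join address used later in this guide.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1              # split the dotted quad into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_overlap CIDR1 CIDR2: succeed (exit 0) if the two ranges overlap.
cidr_overlap() {
    a_ip=${1%/*}; a_len=${1#*/}
    b_ip=${2%/*}; b_len=${2#*/}
    # Compare only the bits covered by the shorter of the two prefixes.
    len=$a_len
    if [ "$b_len" -lt "$len" ]; then len=$b_len; fi
    a=$(ip_to_int "$a_ip")
    b=$(ip_to_int "$b_ip")
    drop=$(( 32 - len ))
    [ $(( a >> drop )) -eq $(( b >> drop )) ]
}

# Example: the pod CIDR from step 4 against an assumed LAN range.
#   if cidr_overlap 172.16.0.0/16 192.168.1.0/24; then
#       echo "overlap: choose another --pod-network-cidr"
#   else
#       echo "no overlap"
#   fi
```

With the values in the example the two ranges differ already in their first 16 bits, so the check reports no overlap.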
IV. Installing the Calico Plugin
1. Apply calico.yaml: sudo kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
This reported errors, but ignore them for now.
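V. Installing the K8s Worker Nodes
1. Turn off swap: sudo swapoff -a
2. Load the br_netfilter kernel module: sudo modprobe br_netfilter
3. Confirm that the kernel has loaded br_netfilter: lsmod | grep br_netfilter
4. Enable bridge-nf-call-iptables. First make br_netfilter load at every boot: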
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
5. Make sure net.bridge.bridge-nf-call-iptables is set to 1 in the sysctl configuration (so that iptables on the Linux node sees bridged traffic correctly):
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
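6. Apply the configuration: sudo sysctl --system
7. Refresh the package index (this only lists available updates; nothing is upgraded): sudo apt update
If you get "Packages 403 Forbidden [IP: 91.189.95.83 80]", see:
https://blog.csdn.net/weixin_39682177/article/details/104923782
The fix is to rename every .list file so that it ends in .list.bak: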
cd /etc/apt/sources.list.d
ls
sudo mv xxx.list xxx.list.bak
ls
sudo apt-get update
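When the directory holds several .list files, the rename can be done in a loop instead of one mv per file. This is only a sketch: disable_apt_lists is a hypothetical helper name, and it has to run as root when pointed at /etc/apt/sources.list.d. It disables every third-party source in that directory, so any that are still needed must be renamed back afterwards.

```shell
#!/bin/sh
# disable_apt_lists DIR: rename every *.list file in DIR to *.list.bak.
disable_apt_lists() {
    for f in "$1"/*.list; do
        [ -e "$f" ] || continue    # nothing matched: the pattern stays literal
        mv -- "$f" "$f.bak"
    done
}

# Usage from a root shell:
#   disable_apt_lists /etc/apt/sources.list.d
#   apt-get update
```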
8. Install Docker: sudo apt install docker.io
9. Start Docker and enable it at boot:
sudo systemctl start docker
sudo systemctl enable docker
10. Install nvidia-docker2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-docker2
Error encountered: libcudnn.so.7 is not a symbolic link. To fix it, recreate the links as symbolic links:
cd /usr/local/cuda/lib64/
ls -lha libcudnn*
sudo rm libcudnn.so
sudo rm libcudnn.so.7
sudo ln -s libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
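11. Configure the nvidia-docker runtime
Go to /etc/docker and make daemon.json writable: sudo chmod 777 daemon.json (this leaves the file world-writable; editing it with sudo instead avoids that)
Open the file and edit it directly:
# Edit /etc/docker/daemon.json so that it reads as follows:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Make it take effect: sudo pkill -SIGHUP dockerd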
12. Switch to a mirror repository:
Install the packages that let apt use repositories over HTTPS, then add Aliyun's official GPG key:
sudo apt-get update && sudo apt-get install -y ca-certificates curl software-properties-common apt-transport-https
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
Then add the Kubernetes apt source:
sudo tee /etc/apt/sources.list.d/kubernetes.list <<EOF
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
13. Install kubeadm, kubelet, and kubectl:
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
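If you get the error "The following signatures couldn't be verified because the public key is not available: NO_PUBKEY":
https://blog.csdn.net/qq_38347931/article/details/103086873
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys FEEA9169307EA071 8B57C5C2836F4BEB (these are the keys I was missing; substitute the ones you are missing)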
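The missing key IDs do not have to be copied by hand: they can be pulled out of the apt-get update output. The sketch below assumes the warning text contains "NO_PUBKEY" followed by an uppercase hexadecimal key ID, as in the error quoted above; missing_keys is a hypothetical helper name.

```shell
#!/bin/sh
# missing_keys: read apt-get update output on stdin and print each
# missing key ID once, one per line.
missing_keys() {
    grep -o 'NO_PUBKEY [0-9A-F]*' | awk '{ print $2 }' | sort -u
}

# Usage:
#   sudo apt-get update 2>&1 | missing_keys |
#       xargs -r sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys
```

The xargs -r form (GNU xargs) skips the apt-key call entirely when no key is missing.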
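14. Fetch the kubeadm images
Pulling the images kept failing. First list the images that are required: kubeadm config images list
Pull each image from the mirror: sudo docker pull registry.aliyuncs.com/google_containers/coredns:1.8.0
Look at the images in Docker: sudo docker images
Retag each one with the name kubeadm expects: sudo docker tag registry.aliyuncs.com/google_containers/coredns:1.8.0 k8s.gcr.io/coredns/coredns:1.8.0
Delete each mirror-named tag: sudo docker rmi registry.aliyuncs.com/google_containers/coredns:1.8.0
Make sure the output of sudo docker images matches kubeadm config images list exactly
15. Make sure swap is off.
Specifically: (1) make /etc/fstab writable with sudo chmod 777 fstab; (2) comment out every swap entry in /etc/fstab; (3) reboot the machine with reboot; (4) run sudo swapoff -a.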
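Commenting out the swap entries can also be scripted, which avoids opening up the permissions on /etc/fstab. The following is a sketch under the assumption that swap entries are ordinary fstab lines whose third field (the filesystem type) is swap; comment_out_swap is a hypothetical helper name, and /tmp/fstab.new is an arbitrary scratch path.

```shell
#!/bin/sh
# comment_out_swap FILE: print FILE with every active swap entry commented out.
# Lines that are already comments are passed through unchanged.
comment_out_swap() {
    awk '!/^[[:space:]]*#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
}

# Usage:
#   comment_out_swap /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new          # review the change first
#   sudo cp /etc/fstab /etc/fstab.bak
#   sudo cp /tmp/fstab.new /etc/fstab
```

Because the function only prints the edited text, the original file stays untouched until the result has been reviewed and copied into place.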
VI. Configuring the Nodes
1. Get the master token:
https://blog.csdn.net/weixin_44130081/article/details/103563392
kubeadm token create --print-join-command
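2. Join the cluster
Run the printed command on each worker: sudo kubeadm join 192.168.1.137:6443 --token xxx --discovery-token-ca-cert-hash sha256:xxx
3. You can now view the nodes from the master: kubectl get nodes
kubeadm token list
Many of the nodes turned out to be NotReady.
The cause was calico.yaml (the pod network deployment). Apply it again on the master: sudo kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Check again: kubectl get nodes
4. Once every worker node has joined, run on the master (kubectl needs the raw manifest, not the GitHub HTML page; the branch and path may have changed since this was written):
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
5. Check the nodes
From the master: kubectl describe node <worker hostname> | grep -i -B 7 GPU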
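With more than a few workers it helps to list only the nodes that still need attention. The following is a small sketch that filters the default table printed by kubectl get nodes, assuming STATUS is the second column; not_ready_nodes is a hypothetical helper name. A cordoned node (status Ready,SchedulingDisabled) is also reported, since its status is not exactly Ready.

```shell
#!/bin/sh
# not_ready_nodes: read `kubectl get nodes` output on stdin and print the
# name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage on the master:
#   kubectl get nodes | not_ready_nodes
```

An empty result after re-applying calico.yaml means every node has reached Ready.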
VII. Installing the K8s Dashboard
1. Install the dashboard (kubectl needs the raw manifest, not the GitHub HTML page):
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/aio/deploy/recommended.yaml
2. Access:
See:
https://kuboard.cn/install/install-k8s-dashboard.html#%E8%AE%BF%E9%97%AE
Create the Service Account and ClusterRoleBinding:
kubectl apply -f https://kuboard.cn/install-script/k8s-dashboard/auth.yaml
3. Get the Bearer Token
kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')
4. Start the proxy: kubectl proxy
It only needs to be started once.
5. Log in by opening:
http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
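The describe secret command in step 3 prints the whole secret, while the login page only needs the token value. Here is a sketch that cuts the token out of that output, assuming it appears on a line of the form "token: <value>"; extract_token is a hypothetical helper name.

```shell
#!/bin/sh
# extract_token: read `kubectl describe secret` output on stdin and print
# only the value that follows the "token:" label.
extract_token() {
    awk '$1 == "token:" { print $2 }'
}

# Usage: pipe the command from step 3 into the helper.
#   kubectl -n kubernetes-dashboard describe secret \
#       $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}') |
#       extract_token
```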
VIII. Creating an Nginx Deployment
https://fastzhong.com/posts/k8s-install-kubeadm/
1. Create the deployment: kubectl create deployment nginx --image=nginx
2. Expose the port: kubectl expose deployment nginx --port=80 --type=NodePort
3. Inspect: kubectl get pod,svc -o wide
4. Verify the exposed port:
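One way to verify the port is to read the NodePort from the service listing and request it on any node. The sketch below assumes the default table printed by kubectl get svc, where the fifth column (PORT(S)) looks like 80:31234/TCP; node_port is a hypothetical helper name, and <node-ip> stands for the address of any node in the cluster.

```shell
#!/bin/sh
# node_port SERVICE: read `kubectl get svc` output on stdin and print the
# NodePort of SERVICE, taken from a PORT(S) value such as 80:31234/TCP.
node_port() {
    awk -v svc="$1" '$1 == svc { split($5, part, "[:/]"); print part[2] }'
}

# Usage:
#   port=$(kubectl get svc | node_port nginx)
#   curl -I "http://<node-ip>:$port"
```

A response with an HTTP status line from nginx indicates that the NodePort service is reachable.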