RDMA device-plugin for Kubernetes

Introduction

k8s-rdma-device-plugin is a device plugin for Kubernetes that manages RDMA devices.

RDMA (Remote Direct Memory Access) is a high-performance network protocol, which has the following major advantages:

  • Zero-copy

    Applications can perform data transfer without involving the network software stack; data is sent and received directly to and from application buffers, without being copied between network layers.

  • Kernel bypass

    Applications can perform data transfer directly from userspace without the need to perform context switches.

  • No CPU involvement

    Applications can access remote memory without consuming any CPU on the remote machine. The remote machine's memory is read without any intervention from a remote process (or processor), and the caches in the remote CPU(s) won't be filled with the accessed memory content.

You can read this post to get more information about RDMA.

This plugin allows you to use RDMA devices in containers in a Kubernetes cluster. Moreover, it can work together with sriov-cni to provide high-performance network connections for distributed applications, especially distributed GPU applications such as TensorFlow, Spark, etc.


Installation Steps

The above is the official introduction, which gives a rough idea of what this is. The goal of installing k8s-rdma-device-plugin is to make this kind of high-performance communication network available inside Kubernetes. The concrete installation steps follow:

  1. Install the local InfiniBand driver

    InfiniBand (IB for short) is a high-bandwidth interconnect technology. We connect the two machines with an IB cable, then install the driver.

    • Environment check

      Check whether an IB card is installed locally:

      root@m1:/# lspci |grep Mell
      1a:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5]
      

      If nothing is returned, the server has no IB card installed and none of the following setup is needed.

    • Install dependencies

      apt-get install python-libxml2 gfortran libgfortran3 libnl-route-3-200 dpatch quilt bison swig \
      debhelper automake libltdl-dev chrpath flex autoconf m4 autotools-dev graphviz lsb-core
      

      Resolve any problems that come up while installing these dependencies promptly. Every server is different, but make sure all of these dependencies install successfully.

    • Install the driver

      root@m1:/# cd ./rdma-device-plugin
      root@m1:/# tar zxvf MLNX_OFED_LINUX-4.7-1.0.0.1-ubuntu16.04-x86_64.tgz
      root@m1:/# cd MLNX_OFED_LINUX-4.7-1.0.0.1-ubuntu16.04-x86_64
      root@m1:/# ll
      total 272
      drwxr-xr-x  6 root root   4096 Sep 26  2019 ./
      drwxr-xr-x 15 root root   4096 Oct 22 16:01 ../
      -rw-r--r--  1 root root      7 Sep 26  2019 .arch
      -rwxr-xr-x  1 root root   2605 Sep 26  2019 common_installers.pl*
      -rwxr-xr-x  1 root root   5956 Sep 26  2019 common.pl*
      -rwxr-xr-x  1 root root  24634 Sep 26  2019 create_mlnx_ofed_installers.pl*
      drwxr-xr-x  5 root root   4096 Sep 26  2019 DEBS/
      drwxr-xr-x  2 root root   4096 Sep 26  2019 DEBS_UPSTREAM_LIBS/
      -rw-r--r--  1 root root     12 Sep 26  2019 distro
      drwxr-xr-x  8 root root   4096 Sep 26  2019 docs/
      -rw-r--r--  1 root root    956 Sep 26  2019 LICENSE
      -rw-r--r--  1 root root     12 Sep 26  2019 .mlnx
      -rwxr-xr-x  1 root root  27611 Sep 26  2019 mlnx_add_kernel_support.sh*
      -rwxr-xr-x  1 root root 151310 Sep 26  2019 mlnxofedinstall*
      -rw-r--r--  1 root root   2764 Sep 26  2019 RPM-GPG-KEY-Mellanox
      drwxr-xr-x  2 root root   4096 Sep 26  2019 src/
      -rwxr-xr-x  1 root root  10894 Sep 26  2019 uninstall.sh*
      root@m1:/# ./mlnxofedinstall --force
      ...
      

      If everything goes according to script, the installation should succeed; unfortunately, you may run into all sorts of problems along the way, so please search for and resolve them yourself.

    • Reload the driver

      root@m1:/# /etc/init.d/openibd restart
      
    • Check the IB status

      root@m1:/# ibstat
      CA 'mlx5_0'
      	CA type: MT4119
      	Number of ports: 1
      	Firmware version: 16.24.1000
      	Hardware version: 0
      	Node GUID: 0xb8599f03001212a0
      	System image GUID: 0xb8599f03001212a0
      	Port 1:
      		State: Active
      		Physical state: LinkUp
      		Rate: 56
      		Base lid: 1
      		LMC: 0
      		SM lid: 1
      		Capability mask: 0x2651e84a
      		Port GUID: 0xb8599f03001212a0
      		Link layer: InfiniBand
      

      If this information appears, the IB driver was installed successfully.
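
      MLNX_OFED also ships the ibdev2netdev tool, which is a quick way to confirm which network interface each IB device maps to. A minimal sketch (the exact mapping depends on your hardware; this one matches the mlx5_0/ib0 pairing seen later in the plugin logs):

      root@m1:/# ibdev2netdev
      mlx5_0 port 1 ==> ib0 (Up)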

  2. Test connectivity and performance

    Install the IB driver on the second machine, m2, in the same way.

    • Connectivity

      Testing connectivity requires a server and a client; here we use m1 as the server and m2 as the client.

      • Server

        root@m1:/# ibping -S -C mlx5_0 -P 1 # no output


        -S: run as a server

        -C: the CA (channel adapter) to use

        -P: the port number

      • Client

        root@m2:/# ibping -c 10000 -f -C mlx4_0 -P 1 -L 1
        
        --- m1.(none) (Lid 1) ibping statistics ---
        10000 packets transmitted, 10000 received, 0% packet loss, time 1410 ms
        rtt min/avg/max = 0.038/0.140/3.774 ms
        

        -c: stop after sending 10000 packets

        -f: flood destination (send packets back to back)

        -C: the client's CA

        -P: the server's port

        -L: the server's Base lid (as shown by ibstat on m1)

    • Performance

      Restart the IB service and the subnet manager:

      root@m1:/# /etc/init.d/openibd restart
      root@m1:/# /etc/init.d/opensmd restart
      

      Test the write bandwidth.

      On the first machine, m1, run:

      root@m1:/# ib_write_bw
      
      ************************************
      * Waiting for client to connect... *
      ************************************
      

      On the second machine, m2, run:

      root@m2:/# ib_write_bw m1_ip
      ---------------------------------------------------------------------------------------
                          RDMA_Write BW Test
       Dual-port       : OFF		Device         : mlx4_0
       Number of qps   : 1		Transport type : IB
       Connection type : RC		Using SRQ      : OFF
       TX depth        : 128
       CQ Moderation   : 100
       Mtu             : 2048[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs	 : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x02 QPN 0x021d PSN 0xaf91fe RKey 0x28010100 VAddr 0x007f4732586000
       remote address: LID 0x01 QPN 0x0088 PSN 0xb7c60d RKey 0x009866 VAddr 0x007f60a41d9000
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
      Conflicting CPU frequency values detected: 999.994000 != 1549.358000. CPU Frequency is not max.
       65536      5000             1708.48            1707.75		   0.027324
      ---------------------------------------------------------------------------------------
      

      At this point, both machines print output like the above.

      The read-bandwidth and latency tests work the same way, using ib_read_bw and ib_write_lat/ib_read_lat respectively, as sketched below.
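
      A minimal sketch of the latency test, assuming the same server/client roles as above (ib_read_lat is used the same way):

      root@m1:/# ib_write_lat           # server: waits for a client to connect
      root@m2:/# ib_write_lat m1_ip     # client: connects and prints latency statistics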

      At this point, the IB drivers on both machines are fully installed; next, install the device plugin.

  3. Install rdma-device-plugin

    root@m2:/# cd ./rdma-device-plugin
    root@m2:/# docker load -i carmark_k8s_rdma_device_plugin.tar
    root@m2:/# docker images|grep carmark
    carmark/k8s-rdma-device-plugin   latest   50c33cf119a4    2 years ago      1.31GB
    root@m2:/# cd dockerfile
    root@m2:/# docker build -t carmark/k8s-rdma-device-plugin:latest .
    root@m2:/# cd ../
    root@m2:/# kubectl -n kube-system apply -f rdma-device-plugin.yml
    root@m2:/# kubectl -n kube-system get pods|grep rdma
    rdma-device-plugin-daemonset-4bwlk      1/1     Running   0          15h
    rdma-device-plugin-daemonset-hxqk7      1/1     Running   0          15h
    

    Check the rdma resources:

    root@m2:/# kubectl describe node
    

(Screenshot omitted: the output of kubectl describe node, showing the RDMA resource registered on each node.)
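
Since the screenshot is unavailable, here is a sketch of how to verify the new extended resource and consume it from a pod. The resource name rdma/hca is taken from the plugin's upstream README; confirm it against your own kubectl describe node output before relying on it:

root@m2:/# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable}{"\n"}{end}' | grep rdma
root@m2:/# cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  containers:
  - name: rdma-test
    image: ubuntu:16.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        rdma/hca: 1
EOF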

Here is rdma-device-plugin.yml:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: rdma-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: rdma-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      hostNetwork: true
      containers:
      - image: carmark/k8s-rdma-device-plugin:latest
        imagePullPolicy: IfNotPresent
        name: rdma-device-plugin-ctr
        #args: ["-log-level", "debug"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: sys-class
            mountPath: /sys/class  
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: sys-class
          hostPath:
            path: /sys/class

At this point, the rdma-device-plugin installation is complete.

Below are the steps for building k8s-rdma-device-plugin yourself; take a look if you're interested.


The Bumpy Road

Let's look at the Dockerfile:

FROM carmark/k8s-rdma-device-plugin 

COPY k8s-rdma-device-plugin /usr/local/bin/

ENTRYPOINT ["k8s-rdma-device-plugin"]

The k8s-rdma-device-plugin executable here was compiled with Go. The code downloaded straight from the internet does not build successfully as-is; it needs a few small modifications.

  1. First, install the Go environment on the server (skip if already installed)
root@m1:/# cd rdma-device-plugin
root@m1:/# tar zxvf go1.15.3.linux-amd64.tar.gz -C /usr/local/
root@m1:/# vim ~/.bashrc 
# add the following paths
export GOROOT=/usr/local/go
export GOPATH=/home/goProject
export PATH=$PATH:$GOROOT/bin
root@m1:/# source ~/.bashrc
root@m1:/usr/local/go# go version
go version go1.15.3 linux/amd64
  2. Build k8s-rdma-device-plugin
root@m1:/# mkdir -p /home/goProject/src
root@m1:/# unzip -d /home/goProject/src/ k8s-rdma-device-plugin-master.zip
root@m1:/# cd /home/goProject/src/k8s-rdma-device-plugin
root@m1:/# ll
total 100
drwxr-xr-x  5 root root  4096 Dec 31  2019 ./
drwxr-xr-x 13 root root  4096 Oct 23 11:24 ../
-rwxr-xr-x  1 root root   378 Dec 31  2019 build*
-rw-r--r--  1 root root   118 Dec 31  2019 Dockerfile
-rw-r--r--  1 root root   507 Dec 31  2019 .gitignore
-rw-r--r--  1 root root  4134 Dec 31  2019 Gopkg.lock
-rw-r--r--  1 root root   927 Dec 31  2019 Gopkg.toml
drwxr-xr-x  2 root root  4096 Dec 31  2019 hack/
drwxr-xr-x  2 root root  4096 Dec 31  2019 ibverbs/
-rw-r--r--  1 root root 11358 Dec 31  2019 LICENSE
-rw-r--r--  1 root root  2228 Dec 31  2019 main.go
-rw-r--r--  1 root root  1304 Dec 31  2019 rdma-device-plugin.yml
-rw-r--r--  1 root root  3208 Dec 31  2019 rdma.go
-rw-r--r--  1 root root  4421 Dec 31  2019 README.md
-rw-r--r--  1 root root  6509 Dec 31  2019 server.go
-rw-r--r--  1 root root  2330 Dec 31  2019 sriov.go
-rw-r--r--  1 root root   240 Dec 31  2019 .travis.yml
-rw-r--r--  1 root root   169 Dec 31  2019 types.go
drwxr-xr-x  6 root root  4096 Dec 31  2019 vendor/
-rw-r--r--  1 root root   500 Dec 31  2019 watcher.go

Modify build. With GOPATH=/home/goProject and the source unpacked to /home/goProject/src/k8s-rdma-device-plugin, the plain import path k8s-rdma-device-plugin resolves under $GOPATH/src, so REPO_PATH is set accordingly:

#!/bin/sh
REPO_PATH="k8s-rdma-device-plugin"

export GO15VENDOREXPERIMENT=1
export GOBIN=${PWD}/bin

FMT="*.go"
echo "Checking gofmt..."
fmtRes=$(gofmt -l $FMT)
if [ -n "${fmtRes}" ]; then
    echo -e "gofmt checking failed:\n${fmtRes}"
    exit 255
fi

echo "Building plugins"
go install "$@" ${REPO_PATH}

Also change the import paths in the Go code that reference github.com/hustcat:

rdma.go

"github.com/hustcat/k8s-rdma-device-plugin/ibverbs"
// change to
"k8s-rdma-device-plugin/ibverbs"

Run the build:

root@m1:/# ./build
root@m1:/# ls bin
k8s-rdma-device-plugin

Then run the compiled binary to check that it works:

root@m1:/# bin/k8s-rdma-device-plugin
I1023 11:52:22.554006   86270 main.go:31] Fetching devices.
ibvDevList: [{mlx5_0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/mlx5_0}]
netDevList: [vethf7e27bc7 veth8ceb929c veth31b4a302 enp96s0f0 vethca24852d enp96s0f1 vetha55e1b96 veth71d39aad veth492f0bf9 vethaf32d3a6 veth5f06dcff veth0deb7cf6 vethbb1ed727 veth874fceaa vethbcc0a7e6 veth2fa745a9 veth60889727 vethb7416a73 vetha4154a1b vethfc2bd58b vethc16f6b00 vethf7716b90 veth81218fb6 veth084ab25a veth9f377e8d veth4cea3686 veth2c2cff6c vetha72f5da2 vethfbb5aafd vethf6336b7b veth87f1624f veth8fdc4f8a veth3171c3c4 veth6c474d5f vethc132f493 veth605e82fe veth08aa8528 veth2f65d6b0 veth2b9b279f vethfaea8c1e veth4358a077 veth47ee05e3 vethdb1f63a9 veth699abb19 veth75d06790 veth89cc49c0 veth524565dc veth76dfa640 veth96dfd1b0 veth60a3a19f vethe36ff75e veth1b9fb905 vethff533970 veth39d46ea3 veth1505fe28 vethc85e7e03 veth3df6fbda vethfc30a2e7 veth7b8563e2 veth4f87fa9b ib0 veth661b8ccc vethace37698 veth0e581eb6 veth5ddaf13a veth60873598 veth5adab830 veth05a04167 vethbaceff83 vethd995d93e flannel.1 cni0 vethec045ed veth95c1ff1c vethc08ae971 vethd275ef73 veth5e91879e veth321d140c veth399324b6 vetheb6c5e27 vethb141865a veth56fc65ae veth164f0728]
I1023 11:52:22.572912   86270 main.go:43] RDMA device list: [{{mlx5_0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/mlx5_0} ib0 1}]
I1023 11:52:22.572950   86270 main.go:44] Starting FS watcher.
I1023 11:52:22.572997   86270 main.go:52] Starting OS watcher.
ibvDevList: [{mlx5_0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/mlx5_0}]
netDevList: [vethf7e27bc7 veth8ceb929c veth31b4a302 enp96s0f0 vethca24852d enp96s0f1 vetha55e1b96 veth71d39aad veth492f0bf9 vethaf32d3a6 veth5f06dcff veth0deb7cf6 vethbb1ed727 veth874fceaa vethbcc0a7e6 veth2fa745a9 veth60889727 vethb7416a73 vetha4154a1b vethfc2bd58b vethc16f6b00 vethf7716b90 veth81218fb6 veth084ab25a veth9f377e8d veth4cea3686 veth2c2cff6c vetha72f5da2 vethfbb5aafd vethf6336b7b veth87f1624f veth8fdc4f8a veth3171c3c4 veth6c474d5f vethc132f493 veth605e82fe veth08aa8528 veth2f65d6b0 veth2b9b279f vethfaea8c1e veth4358a077 veth47ee05e3 vethdb1f63a9 veth699abb19 veth75d06790 veth89cc49c0 veth524565dc veth76dfa640 veth96dfd1b0 veth60a3a19f vethe36ff75e veth1b9fb905 vethff533970 veth39d46ea3 veth1505fe28 vethc85e7e03 veth3df6fbda vethfc30a2e7 veth7b8563e2 veth4f87fa9b ib0 veth661b8ccc vethace37698 veth0e581eb6 veth5ddaf13a veth60873598 veth5adab830 veth05a04167 vethbaceff83 vethd995d93e flannel.1 cni0 vethec045ed veth95c1ff1c vethc08ae971 vethd275ef73 veth5e91879e veth321d140c veth399324b6 vetheb6c5e27 vethb141865a veth56fc65ae veth164f0728]
I1023 11:52:22.597377   86270 server.go:258] Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock
I1023 11:52:22.599371   86270 server.go:266] Registered device plugin with Kubelet
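
With the binary verified, copy it next to the Dockerfile and rebuild the image used in step 3. A sketch (the directory below is illustrative; use the dockerfile directory from your own copy of rdma-device-plugin):

root@m1:/# cp bin/k8s-rdma-device-plugin /root/rdma-device-plugin/dockerfile/
root@m1:/# cd /root/rdma-device-plugin/dockerfile
root@m1:/# docker build -t carmark/k8s-rdma-device-plugin:latest .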

Over
