
I. Install the NVIDIA GPU driver on the GPU node

1 Check the GPU information

    lspci  | grep -i vga

    02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
    3d:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
    41:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
    45:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
    47:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)

    lspci -v -s 3d:00.0

    3d:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 12ae
        Physical Slot: 0
        Flags: bus master, fast devsel, latency 0, IRQ 637, NUMA node 0
        Memory at b7000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 3affe0000000 (64-bit, prefetchable) [size=256M]
        Memory at 3afff0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 8000 [size=128]
        [virtual] Expansion ROM at b8000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
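
As a quick sanity check, the lspci listing above can be reduced to a count of NVIDIA display controllers. This is a minimal sketch: the helper name count_nvidia_gpus is hypothetical, it reads lspci text from standard input, and it only matches devices reported as "VGA compatible controller", so GPUs that lspci lists under a different class would not be counted.

```shell
# Count NVIDIA VGA controllers in lspci output read from stdin.
# count_nvidia_gpus is an illustrative helper, not part of any tool.
count_nvidia_gpus() {
    # The first grep keeps VGA controllers only; grep -c then prints
    # the number of those lines that mention NVIDIA (0 if none).
    grep -i 'vga compatible controller' | grep -ci 'nvidia'
}
```

Usage: lspci | count_nvidia_gpus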

2 Disable the system's default nouveau driver

    lsmod | grep nouveau

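    Edit /etc/modprobe.d/blacklist.conf, add the following line, and save:
    blacklist nouveau
    vim /etc/modprobe.d/blacklist.conf

    Alternatively, write the entries with a single command (note that ">" overwrites any existing content of the file):
    echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf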

    Rebuild the initramfs image:
    mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
    dracut /boot/initramfs-$(uname -r).img $(uname -r)


    After a reboot, confirm that nouveau is no longer loaded (the command should print nothing):
    lsmod | grep nouveau
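
Because ">" truncates its target, the echo command above replaces anything already stored in /etc/modprobe.d/blacklist.conf. One way to avoid that is to keep the nouveau entries in a file of their own, since modprobe reads every .conf file under /etc/modprobe.d/. The sketch below is only an illustration: the helper name write_nouveau_blacklist and the file name blacklist-nouveau.conf are assumptions, not part of the original procedure.

```shell
# Write the nouveau blacklist entries to a dedicated modprobe.d file.
# The function name and the default file name are illustrative assumptions.
write_nouveau_blacklist() {
    target="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    # printf emits one directive per line; only this file is overwritten.
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}
```

Run write_nouveau_blacklist as root, then rebuild the initramfs with the dracut command shown earlier in this step.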

3 Install dependency packages

    uname -r
    yum install kernel-devel-$(uname -r) gcc dkms
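
The NVIDIA installer compiles a kernel module against the installed kernel sources, which is why the yum command above pins kernel-devel to $(uname -r). A minimal check that the matching sources are present, assuming the CentOS convention that kernel-devel unpacks to /usr/src/kernels/<version>; has_kernel_sources is a hypothetical helper, and its second argument exists only so the base directory can be overridden.

```shell
# Succeed if kernel sources for the given version are installed.
# has_kernel_sources is an illustrative helper; /usr/src/kernels is the
# assumed location used by the CentOS kernel-devel package.
has_kernel_sources() {
    version="$1"
    base="${2:-/usr/src/kernels}"
    [ -d "$base/$version" ]
}
```

Usage: has_kernel_sources "$(uname -r)" || echo "kernel-devel does not match the running kernel"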

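    For an offline installation, download the packages from http://vault.centos.org/7.6.1810/os/x86_64/Packages/ and install them manually:
    rpm -ivh kernel-headers-3.10.0-957.el7.x86_64.rpm
    rpm -ivh kernel-devel-3.10.0-957.el7.x86_64.rpm

4 Download and install the driver that matches the GPU and kernel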

    Download location: https://www.nvidia.cn/Download/index.aspx?lang=cn

    chmod u+x NVIDIA-Linux-x86_64-440.100.run

    ./NVIDIA-Linux-x86_64-440.100.run --no-opengl-files

    Verify the installation:
    nvidia-smi

    Mon Sep 28 10:46:24 2020       
    | NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |   0  GeForce RTX 208...  Off  | 00000000:3D:00.0 Off |                  N/A |
    | 22%   27C    P8    18W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
    | 22%   26C    P8     8W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   2  GeForce RTX 208...  Off  | 00000000:45:00.0 Off |                  N/A |
    | 22%   27C    P8     1W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   3  GeForce RTX 208...  Off  | 00000000:47:00.0 Off |                  N/A |
    | 22%   27C    P8    17W / 250W |      0MiB / 11019MiB |      0%      Default |
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |  No running processes found                                                 |
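
For scripted checks, the driver version can be pulled out of the banner line shown above. This is a minimal sketch: driver_version_from_smi is a hypothetical helper that reads nvidia-smi text from standard input and relies on the "Driver Version:" label in that banner.

```shell
# Extract the driver version from nvidia-smi output read from stdin.
# driver_version_from_smi is an illustrative helper name.
driver_version_from_smi() {
    # Print the number that follows "Driver Version:" on the first
    # line that contains that label.
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage: nvidia-smi | driver_version_from_smi. The same value is also available without parsing via nvidia-smi --query-gpu=driver_version --format=csv,noheader, which prints one line per GPU.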

II. Install nvidia-docker2 on the GPU node

1 Prerequisites:
    GNU/Linux x86_64 with kernel version > 3.10
    Docker >= 1.12
    NVIDIA GPU with Architecture > Fermi (2.1)
    NVIDIA drivers ~= 361.93 (untested on older versions)

2 Set up the repository
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
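
The distribution variable is built by sourcing /etc/os-release and concatenating its ID and VERSION_ID fields, which on CentOS 7 gives centos7; that string then selects the repository path in the curl URL. The sketch below wraps the same expression in a hypothetical function, distribution_id, that accepts an alternative file path purely so it can be tried on a sample file.

```shell
# Print ID followed by VERSION_ID from an os-release style file.
# distribution_id is an illustrative helper; the path argument is optional.
distribution_id() {
    release_file="${1:-/etc/os-release}"
    # Source the file in a subshell so ID and VERSION_ID do not leak
    # into the calling shell.
    ( . "$release_file" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

Usage: distribution=$(distribution_id)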

3 Install
    sudo yum install nvidia-docker2
    sudo pkill -SIGHUP dockerd

    For an offline installation, first download the RPMs on a machine with network access:
    yum install --downloadonly --downloaddir=<your_dir> <package-name>

    Copy the archive of downloaded RPMs to the GPU node and unpack it:
    unzip nvidia-docker2_rpms.zip

    Install the packages in dependency order:
    yum install libnvidia-container1-1.0.0-0.1.beta.1.x86_64.rpm
    yum install libnvidia-container-tools-1.0.0-0.1.beta.1.x86_64.rpm
    yum install nvidia-container-runtime-hook-1.3.0-1.x86_64.rpm
    yum install nvidia-container-runtime-2.0.0-1.docker18.03.0.x86_64.rpm
    yum install nvidia-docker2-2.0.3-1.docker18.03.0.ce.noarch.rpm

4 After installation, register the new Docker runtime and set nvidia as the default runtime
    vim /etc/docker/daemon.json
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

5 Restart Docker
    systemctl restart docker
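
If the restart does not produce the expected runtime, a quick look at the configuration file can help narrow things down. The sketch below is a rough text match rather than a JSON parser, so it only confirms that the default-runtime key is set to nvidia; default_runtime_is_nvidia is a hypothetical helper.

```shell
# Succeed if the given daemon.json sets "default-runtime" to "nvidia".
# default_runtime_is_nvidia is an illustrative helper; it does not
# validate the JSON syntax of the file.
default_runtime_is_nvidia() {
    config="${1:-/etc/docker/daemon.json}"
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config"
}
```

Usage: default_runtime_is_nvidia || echo "default-runtime is not set to nvidia"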

6 Check the Docker information
    docker info

    Containers: 52
     Running: 26
     Paused: 0
     Stopped: 26
    Images: 35
    Server Version: 18.03.0-ce
    Storage Driver: overlay2
     Backing Filesystem: xfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
    Swarm: inactive
    Runtimes: nvidia runc
    Default Runtime: nvidia
    Init Binary: docker-init
    containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
    runc version: 4fc53a81fb7c994640722ac585fa9ca548971871-dirty (expected: 4fc53a81fb7c994640722ac585fa9ca548971871)
    init version: 949e6fa
    Security Options:
      Profile: default
    Kernel Version: 3.10.0-957.el7.x86_64
    Operating System: CentOS Linux 7 (Core)
    OSType: linux
    Architecture: x86_64
    CPUs: 40
    Total Memory: 251.4GiB
    Name: localhost.localdomain
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: https://index.docker.io/v1/
    Experimental: false
    Insecure Registries:
    Live Restore Enabled: false
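
The two lines that matter in the output above are "Runtimes: nvidia runc" and "Default Runtime: nvidia". A minimal sketch for extracting the default runtime in a script, assuming the text layout shown above; default_runtime_from_info is a hypothetical helper that reads docker info output from standard input.

```shell
# Print the value of the "Default Runtime:" line from docker info output
# read from stdin. default_runtime_from_info is an illustrative helper.
default_runtime_from_info() {
    sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}
```

Usage: docker info 2>/dev/null | default_runtime_from_info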

7 Test with a CUDA image (in an offline environment, the image must be downloaded manually)
    docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

    Mon Sep 28 09:05:46 2020       
    | NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |   0  GeForce RTX 208...  Off  | 00000000:3D:00.0 Off |                  N/A |
    | 22%   27C    P8    19W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
    | 22%   26C    P8     7W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   2  GeForce RTX 208...  Off  | 00000000:45:00.0 Off |                  N/A |
    | 22%   27C    P8     1W / 250W |      0MiB / 11019MiB |      0%      Default |
    |   3  GeForce RTX 208...  Off  | 00000000:47:00.0 Off |                  N/A |
    | 22%   27C    P8    16W / 250W |      0MiB / 11019MiB |      0%      Default |
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |  No running processes found                                                 |

III. Install the nvidia-device-plugin

1 Prerequisites:

    NVIDIA drivers ~= 361.93
    nvidia-docker version > 2.0 (see how to install and its prerequisites)
    docker configured with nvidia as the default runtime.
    Kubernetes version = 1.11

2 Install from the master node (in an offline environment, the manifest and image must be downloaded manually)

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

    Because my k8s cluster does not support apiVersion: extensions/v1beta1 and only one node has a GPU, I modified nvidia-device-plugin.yml as follows:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      template:
        metadata:
          # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
          # reserves resources for critical add-on pods so that they can be rescheduled after
          # a failure.  This annotation works in tandem with the toleration below.
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
          # This, along with the annotation above marks this pod as a critical add-on.
          - key: CriticalAddonsOnly
            operator: Exists
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          containers:
          - image: nvidia/k8s-device-plugin:1.11
            name: nvidia-device-plugin-ctr
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
              - name: device-plugin
                mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins

    kubectl create -f nvidia-device-plugin.yml

3 Check the plugin's pod
    kubectl get pod -n kube-system | grep nvidia

    nvidia-device-plugin-daemonset-xgd2d                      1/1     Running   0          26h

4 Check the GPU node's information
    kubectl describe node
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
    Annotations:        node.alpha.kubernetes.io/ttl: 0
    CreationTimestamp:  Fri, 04 Sep 2020 11:27:08 +0800
    Taints:             <none>
    Unschedulable:      false
      AcquireTime:     <unset>
      RenewTime:       Mon, 28 Sep 2020 17:44:33 +0800
      Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----                 ------  -----------------                 ------------------                ------                       -------
      NetworkUnavailable   False   Sun, 27 Sep 2020 20:03:08 +0800   Sun, 27 Sep 2020 20:03:08 +0800   CalicoIsUp                   Calico is running on this node
      MemoryPressure       False   Mon, 28 Sep 2020 17:42:17 +0800   Fri, 25 Sep 2020 16:45:11 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure         False   Mon, 28 Sep 2020 17:42:17 +0800   Fri, 25 Sep 2020 16:45:11 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure          False   Mon, 28 Sep 2020 17:42:17 +0800   Fri, 25 Sep 2020 16:45:11 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready                True    Mon, 28 Sep 2020 17:42:17 +0800   Fri, 25 Sep 2020 16:45:21 +0800   KubeletReady                 kubelet is posting ready status
    Capacity:
      cce/eni:            10
      cpu:                40
      ephemeral-storage:  51175Mi
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             263573464Ki
      nvidia.com/gpu:     4
      pods:               110
    Allocatable:
      cce/eni:            10
      cpu:                39830m
      ephemeral-storage:  48294789041
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             251373528Ki
      nvidia.com/gpu:     4
      pods:               110
    System Info:
      Machine ID:                 b0cc0937e42441da9c493765ae349d5c
      System UUID:                0E5D05B4-4D2F-03E4-B211-D21D80451C1B
      Boot ID:                    599a1158-3afb-4362-ab5e-6dcd126f40b6
      Kernel Version:             3.10.0-957.el7.x86_64
      OS Image:                   CentOS Linux 7 (Core)
      Operating System:           linux
      Architecture:               amd64
      Container Runtime Version:  docker://18.3.0
      Kubelet Version:            v1.17.5
      Kube-Proxy Version:         v1.17.5
    Non-terminated Pods:          (13 in total)
      Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
      ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
      cce-monitor                 cop-addon-log-agent-fluent-bit-mbwdd          200m (0%)     2 (5%)      200Mi (0%)       200Mi (0%)     24d
      cce-monitor                 copaddon-prometheus-node-exporter-qcbn7       200m (0%)     400m (1%)   256Mi (0%)       400Mi (0%)     24d
      cce                         statefulset-jyr4rdwy-0                        100m (0%)     1 (2%)      300Mi (0%)       2Gi (0%)       3d
      istio-system                istio-policy-748f9ff8dc-b8f7c                 200m (0%)     1250m (3%)  256Mi (0%)       1Gi (0%)       3d
      kube-system                 calico-node-zbtcq                             250m (0%)     1 (2%)      300Mi (0%)       2Gi (0%)       21h
      kube-system                 nvidia-device-plugin-daemonset-xgd2d          100m (0%)     1 (2%)      300Mi (0%)       2Gi (0%)       26h
      projectsn1yesim             test02-0                                      100m (0%)     300m (0%)   100Mi (0%)       300Mi (0%)     3d
      projectvnajfc8x             test006-0                                     100m (0%)     100m (0%)   100Mi (0%)       100Mi (0%)     3d
      projectvnajfc8x             test0911-0                                    100m (0%)     300m (0%)   100Mi (0%)       300Mi (0%)     3d
      test01                      deployment-laj6ry6y-54696bff5f-zs2m5          200m (0%)     1500m (3%)  428Mi (0%)       2560Mi (1%)    3d
      test01                      deployment-xuqz01-5b8858cddf-gm6cp            200m (0%)     1500m (3%)  428Mi (0%)       2560Mi (1%)    3d
      test01                      deployment-xuqz02-v2-v3-6b86846699-g5lch      200m (0%)     1500m (3%)  428Mi (0%)       2560Mi (1%)    3d
      test01                      deployment-xuqz02-v2-v3-v4-7dbdb5b56-kqdx8    200m (0%)     1500m (3%)  428Mi (0%)       2560Mi (1%)    3d
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests     Limits
      --------           --------     ------
      cpu                2150m (5%)   13350m (33%)
      memory             3624Mi (1%)  18708Mi (7%)
      ephemeral-storage  0 (0%)       0 (0%)
      cce/eni            0            0
      nvidia.com/gpu     0            0
    Events:              <none>
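
The node description above reports nvidia.com/gpu: 4, which shows that the device plugin has registered the GPUs with the kubelet. A minimal sketch for reading that number in a script, assuming the text layout shown above; gpu_count_from_describe is a hypothetical helper that reads kubectl describe node output from standard input.

```shell
# Print the first nvidia.com/gpu value from kubectl describe node output
# read from stdin. gpu_count_from_describe is an illustrative helper.
gpu_count_from_describe() {
    # Only lines whose first field is "nvidia.com/gpu:" (with the colon)
    # are considered, so the "Allocated resources" table row is skipped.
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}
```

Usage: kubectl describe node | gpu_count_from_describe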

5 Test


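    Create cuda-vector-add0.1.yaml with the following content:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "tingweiwu/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU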

    kubectl create -f cuda-vector-add0.1.yaml
    pod/cuda-vector-add created

    kubectl get pod
    NAME              READY   STATUS      RESTARTS   AGE
    cuda-vector-add   0/1     Completed   0          12s
    kubectl logs cuda-vector-add


    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED

    To keep the pod's STATUS as Running, modify cuda-vector-add0.1.yaml as follows:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "tingweiwu/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
          command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]

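    If running nvidia-smi inside the container (kubectl exec -it cuda-vector-add -- nvidia-smi) fails with "Failed to initialize NVML: Unknown Error", run the container in privileged mode by setting privileged: true in its securityContext: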

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: "tingweiwu/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
          securityContext:
            privileged: true
          command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]



