0 Background

What is Triton? Triton is NVIDIA's Inference Server, built specifically for serving AI models in production. Clients can send requests to it over HTTP/REST or gRPC. Its main features include:

  • Supports multiple framework backends, including TensorFlow, TensorRT, PyTorch, ONNX, and even custom backends;
  • Runs on both GPU and CPU, making the most of the available hardware;
  • Ships as a container and integrates with Kubernetes, so it is easy to orchestrate and scale;
  • Supports concurrent model execution: several models, or several instances of the same model, can run on the same GPU (see the config sketch below);
  • Supports several batching strategies to increase inference throughput.
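
Concurrent model execution and dynamic batching are configured per model in its config.pbtxt. The snippet below is only a minimal sketch, written around the resnet50_netdef example model used later in this article; the instance count, batch sizes, and queue delay are illustrative values, not recommendations:

name: "resnet50_netdef"          # must match the model's directory name in the repository
platform: "caffe2_netdef"
max_batch_size: 8
# Run two instances of the model on GPU 0 so requests can be served concurrently.
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
# Let Triton merge individual requests into batches to raise throughput.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}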

Official documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Product overview: https://developer.nvidia.com/nvidia-triton-inference-server

1 Server Image

Triton can be installed from source or run as a container; here we use the container approach as an example.

docker pull nvcr.io/nvidia/tritonserver:20.09-py3

At the moment the image can only be pulled from NVIDIA's official registry (nvcr.io), so without a proxy the download may be slow.

Once the image is downloaded, create a container with a command of the following form:

docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/example/model/repository:/models <docker image> tritonserver --model-repository=/models

where <docker image> should be replaced with nvcr.io/nvidia/tritonserver:20.09-py3. Ports 8000, 8001, and 8002 expose the HTTP/REST, gRPC, and metrics services respectively, and -v mounts your local model repository into the container as /models. For example:

sudo docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/home/lthpc/workspace_zong/triton_server/repository:/models nvcr.io/nvidia/tritonserver:20.09-py3 tritonserver --model-repository=/models
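
The directory mounted as /models must follow Triton's model-repository layout: one sub-directory per model, holding a config.pbtxt and one numbered directory per model version. The tree below is a hypothetical example built around the resnet50_netdef model used later; the actual file name inside the version directory depends on the backend:

/models
└── resnet50_netdef            # model name, matches "name" in config.pbtxt
    ├── config.pbtxt           # model configuration
    └── 1                      # version 1 of the model
        └── model.netdef       # serialized model file (backend-specific name)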

If the container fails to start with an error like the following:

ERROR: This container was built for NVIDIA Driver Release 450.51 or later, but
       version 418.116.00 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (cuInit(0) returned 804)]]

then the GPU driver on the host is too old and needs to be upgraded; see the companion article 《NVIDIA之显卡驱动安装方法》 for the upgrade procedure.
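
Before or after upgrading, the driver version installed on the host can be checked with nvidia-smi; it must be at least the release the container was built for (450.51 in the error above):

# Print only the installed driver version
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader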

After a successful start, the output looks like this:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 20.09 (build 16016295)

Copyright (c) 2018-2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

I0927 03:36:55.184340 1 metrics.cc:184] found 1 GPUs supporting NVML metrics
I0927 03:36:55.190330 1 metrics.cc:193]   GPU 0: TITAN V
I0927 03:36:55.190594 1 server.cc:120] Initializing Triton Inference Server
I0927 03:36:55.190606 1 server.cc:121]   id: 'triton'
I0927 03:36:55.190612 1 server.cc:122]   version: '2.3.0'
I0927 03:36:55.190618 1 server.cc:128]   extensions:  classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics
I0927 03:36:55.507614 1 pinned_memory_manager.cc:195] Pinned memory pool is created at '0x7efe38000000' with size 268435456
I0927 03:36:55.509141 1 cuda_memory_manager.cc:98] CUDA memory pool is created on device 0 with size 67108864
I0927 03:36:55.515347 1 grpc_server.cc:3897] Started GRPCInferenceService at 0.0.0.0:8001
I0927 03:36:55.515670 1 http_server.cc:2705] Started HTTPService at 0.0.0.0:8000
I0927 03:36:55.556973 1 http_server.cc:2724] Started Metrics Service at 0.0.0.0:8002

Open a new terminal and verify that the server is up with the following command:

$ curl -v localhost:8000/v2/health/ready
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.61.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact
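
Besides the readiness probe, the v2 HTTP API also exposes liveness, server-metadata, per-model, and metrics endpoints. A few additional checks you can run (the model name is only an example and must exist in your repository):

$ curl localhost:8000/v2/health/live                    # server liveness
$ curl localhost:8000/v2                                # server metadata: name, version, extensions
$ curl localhost:8000/v2/models/resnet50_netdef/ready   # readiness of a single model
$ curl localhost:8002/metrics                           # Prometheus metrics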

At this point the server is installed and running.

2 Client Image

The client is also distributed as an image that ships with example code. It is set up as follows.

Pull the image (replace <xx.yy> with the version you need); here we use 20.09-py3-clientsdk as an example:

$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-clientsdk
$ docker pull nvcr.io/nvidia/tritonserver:20.09-py3-clientsdk

Start the container (--net=host lets the client reach the server's ports directly on localhost):

$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-clientsdk
$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:20.09-py3-clientsdk

Once the client container is running, you can test the server in either of the following ways.

Using the prebuilt binary:

$ /workspace/install/bin/image_client -m resnet50_netdef -s INCEPTION /workspace/images/mug.jpg
Request 0, batch size 1
Image 'images/mug.jpg':
    0.723992 (504) = COFFEE MUG

Using the Python script:

$ python /workspace/install/python/image_client.py -m resnet50_netdef -s INCEPTION /workspace/images/mug.jpg
Request 1, batch size 1
    0.777365 (504) = COFFEE MUG
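
The example clients also take a few options that are useful when testing against a remote server or over gRPC. The command below is a sketch assuming the -u (server URL), -i (protocol), and -b (batch size) flags of the upstream image_client example:

# Same request sent over gRPC (port 8001) with a batch size of 2
$ /workspace/install/bin/image_client -m resnet50_netdef -s INCEPTION -u localhost:8001 -i grpc -b 2 /workspace/images/mug.jpg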

This completes the deployment of the environment; further experiments will follow in later posts.

 