TensorRT-accelerated TensorFlow inference (Python and C++)
My environment setup is described in an earlier post:
Ubuntu配置TensorRT及验证_竹叶青lvye的博客-CSDN博客: https://blog.csdn.net/jiugeshao/article/details/123119995?spm=1001.2014.3001.5502
The TensorFlow version is 2.4.0; the CUDA, cuDNN and TensorRT versions are listed in that post. The TensorRT Python libraries are installed from the few whl files that ship with the TensorRT package.
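After installing the wheels, a quick sanity check from Python confirms they import correctly. This is only a sketch, assuming the usual tensorrt, uff and graphsurgeon wheels from the TensorRT tarball were installed:

# Quick check that the TensorRT Python wheels installed correctly.
import tensorrt as trt
import uff            # used later for the pb -> uff conversion
import graphsurgeon   # dependency pulled in by the uff converter

print(trt.__version__)  # should match the TensorRT version configured in the post above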
With the preparation done, we can move on to the experiments.
Experiment 1: Convert the TensorFlow pb model to a UFF model, then load the UFF model with TensorRT to predict an image
Take the pb model obtained in the earlier post, convert it to a UFF model, build a TensorRT engine from it, and predict the cat image again to see how long it takes (as shown earlier, the un-accelerated prediction took about 2.5 s).
The pb-to-uff conversion process is described in my earlier post:
pb模型转uff模型(tensorflow2.x)_竹叶青lvye的博客-CSDN博客: https://blog.csdn.net/jiugeshao/article/details/123609152?spm=1001.2014.3001.5502
You can refer to the UFF sample that ships with TensorRT for running inference from a UFF file (adjust the paths to your own installation).
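For reference, the conversion itself only takes a few lines with the uff package; a minimal sketch, assuming the weights.pb file and the output node name from that post:

# Minimal pb -> uff conversion sketch (file and node names follow the earlier post).
import uff

uff.from_tensorflow_frozen_model(
    frozen_file="weights.pb",
    output_nodes=["resnet50/predictions/Softmax"],
    output_filename="weights.uff")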
I modified that sample to predict a single image. The UFF model under test comes from the post above, and the image is still the cat picture used in the previous few posts.
from random import randint
from PIL import Image
import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import pycuda.driver as cuda
# This import causes pycuda to automatically manage CUDA context creation and cleanup.
import pycuda.autoinit
import tensorrt as trt
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common
import cv2
import time

# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Frozen model layers:
# Input
# resnet50/conv1_pad/Pad/paddings
# resnet50/conv1_pad/Pad

class ModelData(object):
    MODEL_FILE = "weights.uff"
    INPUT_NAME = "Input"
    INPUT_SHAPE = (224, 224, 3)
    OUTPUT_NAME = "resnet50/predictions/Softmax"

# resnet50/predictions/BiasAdd
# resnet50/predictions/Softmax
# Identity

def build_engine():
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, builder.create_builder_config() as config, trt.UffParser() as parser:
        config.max_workspace_size = common.GiB(1)
        # Parse the Uff Network
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE, trt.UffInputOrder.NHWC)
        parser.register_output(ModelData.OUTPUT_NAME)
        parser.parse(ModelData.MODEL_FILE, network)
        # Build the engine from the parsed network.
        engine = builder.build_engine(network, config)
        return engine

def main():
    engine = build_engine()
    # Allocate buffers and create a stream.
    # For more information on buffer allocation, refer to the introductory samples.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    with engine.create_execution_context() as context:
        # Load and preprocess the test image exactly as Keras ResNet50 expects.
        img = image.load_img('2008_002682.jpg', target_size=(224, 224))
        img = image.img_to_array(img)
        img = preprocess_input(img)
        print(img.shape)
        img = img[np.newaxis, :]
        # Copy the flattened image into the pinned host input buffer.
        inputs[0].host = img.ravel()
        print(inputs[0].host.shape)
        t_model = time.perf_counter()
        result = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        print(f'do inference cost:{time.perf_counter() - t_model:.8f}s')
        output = np.array(result[1])
        output = output[np.newaxis, :]
        print(output.shape)
        print('Predicted:', decode_predictions(output, top=5)[0])

if __name__ == '__main__':
    main()
The prediction result is as follows:
Compared with the earlier post that predicted directly from the pb model (about 2 s per prediction), the result is identical, while inference here takes only about 0.004 s, a huge improvement.
The key statement is:
parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE, trt.UffInputOrder.NHWC)
It tells the parser that the model input uses NHWC channel ordering, so the engine accepts image data laid out as NHWC.
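If the input were instead registered with the default NCHW order, the image array would have to be transposed before being copied into the input buffer; a hypothetical sketch:

# Hypothetical: only needed if register_input(...) used trt.UffInputOrder.NCHW instead of NHWC.
import numpy as np

img_hwc = np.zeros((224, 224, 3), dtype=np.float32)               # stand-in for the preprocessed image above
img_chw = np.ascontiguousarray(np.transpose(img_hwc, (2, 0, 1)))  # HWC -> CHW
print(img_chw.shape)  # (3, 224, 224); this flattened array would then go into inputs[0].host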
Experiment 2: Convert the TensorFlow pb model to an ONNX model, then load the ONNX model with TensorRT to predict the image
The experiment again uses the pb model generated in that post.
Install the following package:
pip install -U tf2onnx
Then the conversion can be done conveniently with one command:
python -m tf2onnx.convert --graphdef weights.pb --output model.onnx --inputs Input:0 --outputs resnet50/predictions/Softmax:0
You need the input and output node names (the post above prints the name of every layer, so they are easy to obtain).
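Before handing the file to TensorRT, it can be worth a quick structural check with the onnx package; a small sketch, assuming model.onnx from the command above:

# Optional sanity check of the exported ONNX model.
import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)               # raises an exception if the graph is malformed
print([i.name for i in model.graph.input])    # should list the Input node
print([o.name for o in model.graph.output])   # should list resnet50/predictions/Softmax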
Next, TensorRT loads this ONNX model, converts it into an engine, and predicts the image. This follows my earlier post:
TensorRT加速方法介绍(python pytorch模型)_竹叶青lvye的博客-CSDN博客: https://blog.csdn.net/jiugeshao/article/details/123141499?spm=1001.2014.3001.5502
Using the first acceleration method described there, first convert the ONNX model to a TensorRT engine:
./trtexec --onnx=/home/sxhlvye/Trial1/Tensorrt/model.onnx --saveEngine=/home/sxhlvye/Trial1/Tensorrt/model.trt
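As an alternative to trtexec, the engine can also be built and serialized from Python; a rough sketch against the TensorRT 7.x style API used in this post (the ONNX input, workspace size and output path mirror the command above):

# Build and serialize a TensorRT engine from the ONNX model in Python (TensorRT 7.x style API).
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(EXPLICIT_BATCH) as network, \
        builder.create_builder_config() as config, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    config.max_workspace_size = 1 << 30          # 1 GiB
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse model.onnx")
    engine = builder.build_engine(network, config)
    with open("model.trt", "wb") as f:
        f.write(engine.serialize())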
After that, run the following code (slightly modified compared with the earlier post):
import sys
import cv2
from PIL import Image
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import resnet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
import tensorflow as tf
import time
import numpy as np
# This import causes pycuda to automatically manage CUDA context creation and cleanup.
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import common
import matplotlib.pyplot as plt

# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

filename = "2008_002682.jpg"
engine_file_path = "model.trt"

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """Within this context, host_mem means the CPU memory and device_mem means the GPU memory."""
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    t_model = time.perf_counter()
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    print(f'only one line cost:{time.perf_counter() - t_model:.8f}s')
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def main():
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    # Create the execution context for this engine
    context = engine.create_execution_context()

    # Allocate buffers for input and output
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings

    # Read an image and preprocess it
    img = image.load_img('2008_002682.jpg', target_size=(224, 224))
    img = image.img_to_array(img)
    img = preprocess_input(img)
    print(img.shape)
    img = img[np.newaxis, :]

    # Load data to the buffer
    inputs[0].host = img.ravel()
    print(inputs[0].host.shape)

    # Do inference
    t_model = time.perf_counter()
    result = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)  # numpy data
    print(f'do inference cost:{time.perf_counter() - t_model:.8f}s')

    output = np.array(result[0])
    output = output[np.newaxis, :]
    print(output.shape)
    print('Predicted:', decode_predictions(output, top=5)[0])

if __name__ == '__main__':
    main()
The execution output is as follows:
/home/sxhlvye/anaconda3/bin/python /home/sxhlvye/Trial1/Tensorrt/test_onnx1.py
2022-03-20 18:13:45.682920: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Reading engine from file model.trt
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.0 but loaded cuDNN 8.0.5
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.0 but loaded cuDNN 8.0.5
(224, 224, 3)
(150528,)
only one line cost:0.33635211s
do inference cost:0.33758112s
(1, 1000)
Predicted: [('n02123597', 'Siamese_cat', 0.16550788), ('n02108915', 'French_bulldog', 0.14138032), ('n04409515', 'tennis_ball', 0.08570899), ('n02095314', 'wire-haired_fox_terrier', 0.052046295), ('n02123045', 'tabby', 0.05069564)]
Process finished with exit code 0
Compare this with the result of predicting the image directly from the pb model in this post:
keras模型转换为tensorflow的pb模型结构_竹叶青lvye的博客-CSDN博客
The results are identical, and after TensorRT acceleration the inference time here is about 0.3376 s.
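Note that a single timed call like this can include one-off costs (first-run allocations, stream setup), so the number fluctuates; for a steadier figure one would normally discard a few warm-up runs and average, along the lines of this hypothetical sketch (reusing the context, bindings and buffers created in main above):

# Hypothetical timing loop: warm up first, then average over repeated inferences.
import time

for _ in range(5):     # warm-up runs, not timed
    do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
print(f"average inference time: {(time.perf_counter() - t0) / runs:.6f}s")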
Experiment 3: Convert the TensorFlow pb model to an ONNX model (conversion described above) and deploy it with C++
You can refer to my earlier posts, which demonstrate exactly how to deploy an ONNX model in C++, so it will not be repeated here.
Experiment 4: Convert the TensorFlow pb model to a UFF model (conversion described above) and deploy it with C++
This is largely the same as in the post above; here I referred to the sample code that ships with TensorRT.
I only made minor modifications to get the pipeline running, and the image normalization is done only crudely. The full code is below (see the post mentioned above for the C++ environment configuration):
#include "BatchStream.h"
#include "EntropyCalibrator.h"
#include "argsParser.h"
#include "buffers.h"
#include "common.h"
#include "logger.h"
#include "NvInfer.h"
#include "NvUffParser.h"
#include <cuda_runtime_api.h>
#include "parserOnnxConfig.h"
#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
using namespace std;
using namespace cv;
//! \brief The SampleUffSSD class (reused from the TensorRT UFF SSD sample) runs ResNet50 classification here
//!
//! \details It creates the network using an UFF model
//!
class SampleUffSSD
{
template <typename T>
using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
SampleUffSSD():mEngine(nullptr)
{
}
//!
//! \brief Function builds the network engine
//!
bool build();
//!
//! \brief Runs the TensorRT inference engine for this sample
//!
bool infer();
//!
//! \brief Cleans up any state created in the sample class
//!
bool teardown();
private:
nvinfer1::Dims mInputDims; //!< The dimensions of the input to the network.
std::vector<samplesCommon::PPM<3, 224, 224>> mPPMs; //!< PPMs of test images
std::shared_ptr<nvinfer1::ICudaEngine> mEngine; //!< The TensorRT engine used to run the network
//!
//! \brief Parses an UFF model for SSD and creates a TensorRT network
//!
bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
SampleUniquePtr<nvuffparser::IUffParser>& parser);
//!
//! \brief Reads the input and mean data, preprocesses, and stores the result in a managed buffer
//!
bool processInput(const samplesCommon::BufferManager& buffers);
//!
//! \brief Filters output detections and verify results
//!
bool verifyOutput(const samplesCommon::BufferManager& buffers);
};
//!
//! \brief Creates the network, configures the builder and creates the network engine
//!
//! \details This function creates the SSD network by parsing the UFF model and builds
//! the engine that will be used to run SSD (mEngine)
//!
//! \return Returns true if the engine was created successfully and false otherwise
//!
bool SampleUffSSD::build()
{
auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
if (!builder)
{
return false;
}
auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
if (!network)
{
return false;
}
auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
if (!config)
{
return false;
}
auto parser = SampleUniquePtr<nvuffparser::IUffParser>(nvuffparser::createUffParser());
if (!parser)
{
return false;
}
auto constructed = constructNetwork(builder, network, config, parser);
if (!constructed)
{
return false;
}
ASSERT(network->getNbInputs() == 1);
mInputDims = network->getInput(0)->getDimensions();
ASSERT(mInputDims.nbDims == 3);
ASSERT(network->getNbOutputs() == 2);
return true;
}
//!
//! \brief Uses a UFF parser to create the SSD Network and marks the
//! output layers
//!
//! \param network Pointer to the network that will be populated with the SSD network
//!
//! \param builder Pointer to the engine builder
//!
bool SampleUffSSD::constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
SampleUniquePtr<nvuffparser::IUffParser>& parser)
{
parser->registerInput("Input", Dims3(224, 224, 3), nvuffparser::UffInputOrder::kNHWC);
parser->registerOutput("resnet50/predictions/Softmax");
auto parsed = parser->parse("/home/sxhlvye/Trial1/Tensorrt/weights.uff", *network, nvinfer1::DataType::kFLOAT);
if (!parsed)
{
return false;
}
builder->setMaxBatchSize(1);
config->setMaxWorkspaceSize(1_GiB);
config->setFlag(BuilderFlag::kFP16);
// Calibrator life time needs to last until after the engine is built.
mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
if (!mEngine)
{
return false;
}
return true;
}
//!
//! \brief Runs the TensorRT inference engine for this sample
//!
//! \details This function is the main execution function of the sample. It allocates the buffer,
//! sets inputs, executes the engine and verifies the detection outputs.
//!
bool SampleUffSSD::infer()
{
// Create RAII buffer manager object
samplesCommon::BufferManager buffers(mEngine, 1);
auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
if (!context)
{
return false;
}
if (!processInput(buffers))
{
return false;
}
// Memcpy from host input buffers to device input buffers
buffers.copyInputToDevice();
const bool status = context->execute(1, buffers.getDeviceBindings().data());
if (!status)
{
return false;
}
// Memcpy from device output buffers to host output buffers
buffers.copyOutputToHost();
// Post-process detections and verify results
if (!verifyOutput(buffers))
{
return false;
}
return true;
}
//!
//! \brief Cleans up any state created in the sample class
//!
bool SampleUffSSD::teardown()
{
//! Clean up the libprotobuf files as the parsing is complete
//! \note It is not safe to use any other part of the protocol buffers library after
//! ShutdownProtobufLibrary() has been called.
nvuffparser::shutdownProtobufLibrary();
return true;
}
//!
//! \brief Reads the input and mean data, preprocesses, and stores the result in a managed buffer
//!
bool SampleUffSSD::processInput(const samplesCommon::BufferManager& buffers)
{
cv::Mat image = cv::imread("/home/sxhlvye/Trial1/Tensorrt/2008_002682.jpg", cv::IMREAD_COLOR);
//cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
cout << image.channels() << "," << image.size().width << "," << image.size().height << std::endl;
cv::Mat dst = cv::Mat::zeros(341, 256, CV_32FC3);
cv::resize(image, dst, dst.size());
cout << dst.channels() << "," << dst.size().width << "," << dst.size().height << std::endl;
cv::Mat dst1 = dst(Range(58, 282), Range(16, 240)).clone();
cout << dst1.channels() << "," << dst1.size().width << "," << dst1.size().height << std::endl;
const int channel = 3;
const int inputH = 224;  // the original 244 was a typo and would overflow the 224x224x3 input buffer below
const int inputW = 224;
// Read a random digit file
std::vector<float> fileData(inputH * inputW * channel);
for (int c = 0; c < channel; ++c)
{
for (int i = 0; i < dst1.rows; ++i)
{
cv::Vec3b *p1 = dst1.ptr<cv::Vec3b>(i);
for (int j = 0; j < dst1.cols; ++j)
{
fileData[c * dst1.cols * dst1.rows + i * dst1.cols + j] = (1-p1[j][c]) / 255.0f;
}
}
}
float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("Input"));
for (int i = 0; i < inputH * inputW * channel; i++)
{
hostDataBuffer[i] = fileData[i];
}
return true;
}
//!
//! \brief Filters output detections and verify result
//!
//! \return whether the detection output matches expectations
//!
bool SampleUffSSD::verifyOutput(const samplesCommon::BufferManager& buffers)
{
const int outputSize = 1000;
std::cout << "outputSize: " << outputSize << std::endl;
float* output = static_cast<float*>(buffers.getHostBuffer("resnet50/predictions/Softmax"));
float val{0.0f};
int idx{0};
// Calculate Softmax
float sum{0.0f};
for (int i = 0; i < outputSize; i++)
{
output[i] = exp(output[i]);
sum += output[i];
}
for (int i = 0; i < outputSize; i++)
{
output[i] /= sum;
}
vector<float> voutput(1000);
for (int i = 0; i < outputSize; i++)
{
voutput[i] = output[i];
}
for(int i=0; i<1000; i++)
{
for(int j= i+1; j< 1000; j++)
{
if(output[i] < output[j])
{
float temp;  // must be float; an int temp would truncate the softmax probabilities to 0
temp = output[i];
output[i] = output[j];
output[j] = temp;
}
}
}
for(int i=0; i<5;i++)
{
cout << output[i] << std::endl;
}
vector<string> labels;
string line;
ifstream readFile("/home/sxhlvye/Trial/yolov3-9.5.0/imagenet_classes.txt");
while (getline(readFile,line))
{
//istringstream record(line);
//string label;
// record >> label;
//cout << line << std::endl;
labels.push_back(line);
}
vector<int> indexs(5);
for(int i=0; i< 1000;i++)
{
if(voutput[i] == output[0])
{
indexs[0] = i;
}
if(voutput[i] == output[1])
{
indexs[1] = i;
}
if(voutput[i] == output[2])
{
indexs[2] = i;
}
if(voutput[i] == output[3])
{
indexs[3] = i;
}
if(voutput[i] == output[4])
{
indexs[4] = i;
}
}
cout << "top 5: " << std::endl;
cout << labels[indexs[0]] << "--->" << output[0] << std::endl;
cout << labels[indexs[1]] << "--->" << output[1] << std::endl;
cout << labels[indexs[2]] << "--->" << output[2] << std::endl;
cout << labels[indexs[3]] << "--->" << output[3] << std::endl;
cout << labels[indexs[4]] << "--->" << output[4] << std::endl;
return true;
}
int32_t main()
{
SampleUffSSD sample;
if (!sample.build())
{
std::cout << "build failed" << std::endl;
return 0;
}
if (!sample.infer())
{
std::cout << "inference failed" << std::endl;
return 0;
}
if (!sample.teardown())
{
std::cout << "teardown failed" << std::endl;
return 0;
}
}
The run output is as follows:
The pipeline runs through successfully!
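One caveat on processInput: the (1 - pixel) / 255 normalization there is only a placeholder, while the Python experiments used tf.keras.applications.resnet50.preprocess_input, which in its default 'caffe' mode converts RGB to BGR and subtracts the ImageNet channel means without further scaling. So the C++ top-5 scores will not match the Python ones exactly unless the same preprocessing is reproduced; spelled out in Python for clarity (cv2.imread already returns BGR, so only the mean subtraction is needed):

# What resnet50.preprocess_input ('caffe' mode) amounts to for a BGR image loaded with cv2.imread.
import numpy as np

IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_bgr(img_bgr):
    """img_bgr: float32 array of shape (224, 224, 3), BGR order, values in 0..255."""
    return img_bgr - IMAGENET_MEAN_BGR  # per-channel mean subtraction, no extra scaling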
Closing remarks:
Of course, a Keras .h5 model can also be converted to ONNX directly; I will not experiment with that here. This series is basically drawing to a close. Once the main path is clear, the remaining branches are just the same moves rearranged. The following posts can be used for reference:
keras保存模型_onnx+tensorrt部署keras模型_weixin_39777464的博客-CSDN博客
tensorflow 小于_用TensorRT C++ API加速TensorFlow模型实例_莫祖兰的博客-CSDN博客
tensorflow实现将ckpt转pb文件_pan_jinquan的博客-CSDN博客_ckpt转pb