Hey, didn't expect this post to come out so quickly, did you?
Following on from the previous post, this section analyzes the code.
First, a brief introduction to the TensorRT workflow; if necessary I'll devote a separate post to the details later.
TensorRT acceleration can be broken into three steps:
1. Define a network with the TensorRT API, or export a PyTorch model to ONNX.
2. Create a builder used to generate the engine. The builder is where you specify things such as how much memory the engine may use.
3. Use the builder to generate the engine. The engine is what TensorRT runs in inference mode.
Note: whether the model is written directly with the TensorRT API or converted from ONNX, you always construct a builder and then produce a .engine file from it. If you look closely at the .onnx -> .engine conversion code, it really just reads the ONNX parameters, builds a builder, and generates the engine.
P.S.: there are several ways to go from .onnx to .engine, for example NVIDIA's official trtexec tool (e.g. trtexec --onnx=model.onnx --saveEngine=model.engine), or writing the steps out in code, such as noahmr's builder example at https://github.com/noahmr/yolov5-tensorrt/tree/main/examples/builder, which is well worth a look. (If anyone is interested, I can also write a post breaking down the onnx -> builder -> engine steps and what each parameter means.)
The step in which the builder produces the engine is called serialization (serialize), and loading the engine back so TensorRT can run inference with it is called deserialization (deserialize). Serialization converts the model into a form suited to the TensorRT framework, and the converted engine can be saved and reused for later inference. (Serialization often takes a while, typically a few minutes, so in practice we generate and save the engine file up front and simply load it at inference time. Different machines usually need to generate their own engine files locally, since the hardware and library setups differ and so do the optimizations TensorRT applies.)
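To make the flow concrete, below is a minimal sketch of the onnx -> builder -> engine -> file path using the TensorRT C++ API. Treat it as an outline rather than the exact code of any particular project: gLogger stands for a user-provided ILogger implementation, the file names are placeholders, and some calls (setMaxWorkspaceSize, buildEngineWithConfig) vary a bit between TensorRT versions.
#include "NvInfer.h"
#include "NvOnnxParser.h"
#include <fstream>
using namespace nvinfer1;

// Build an engine from an ONNX file and serialize it to disk (minimal sketch).
void buildAndSaveEngine()
{
    // gLogger: a user-provided ILogger implementation (assumed to exist).
    IBuilder* builder = createInferBuilder(gLogger);
    const auto flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flag);

    // Parse the ONNX model into the TensorRT network definition.
    auto* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));

    // The builder config is where workspace memory, precision flags, etc. are specified.
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 28);   // 256 MiB of scratch space

    // Serialize: build the engine and save it so inference can simply deserialize it later.
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    IHostMemory* blob = engine->serialize();
    std::ofstream out("model.engine", std::ios::binary);
    out.write(static_cast<const char*>(blob->data()), blob->size());

    // Deserialization later looks roughly like:
    //   IRuntime* runtime = createInferRuntime(gLogger);
    //   ICudaEngine* engine = runtime->deserializeCudaEngine(data, size);
}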
Let's start with the overall structure by setting up a YoloLayerPlugin class, getting the two big pieces, the constructor and the destructor, in place first.
class API YoloLayerPlugin : public IPluginV2IOExt
{
public:
YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel);
//YoloLayerPlugin(const void* data, size_t length);
~YoloLayerPlugin();
/*...*/
};
Next we override this IPluginV2IOExt class step by step; we need to provide implementations for all of its pure virtual functions (plus a few optional virtual hooks). We'll work through the hierarchy from top to bottom, IPluginV2IOExt -> IPluginV2Ext -> IPluginV2, bit by bit. Below are the functions we must override:
getPluginType: returns the plugin's name.
getPluginVersion: returns the plugin's version.
getNbOutputs: returns the number of outputs of the layer.
getOutputDimensions: returns the dimensions of the layer's outputs.
initialize: initializes the layer for execution; called when the engine is created.
terminate: releases the resources acquired in initialize; called when the engine is destroyed.
getWorkspaceSize: returns the scratch workspace size the layer needs. Unlike getSerializationSize, this is extra temporary space the layer uses during execution, beyond its own data and parameters.
enqueue: runs the layer's computation; returns 0 on success and non-zero on failure. (All of the layer's data processing happens inside this function.)
getSerializationSize: returns the size of the serialization buffer, i.e. how many bytes the plugin's parameters occupy when serialized; a detailed size calculation with an example follows later.
serialize: serializes the layer into the given buffer.
destroy: releases the plugin's resources.
clone: clones the plugin/layer. In a project the same plugin is often instantiated several times, just like ordinary modules such as conv.
setPluginNamespace: sets the plugin's namespace.
getPluginNamespace: returns the plugin's namespace.
getOutputDataType: returns the data type of the output.
isOutputBroadcastAcrossBatch: returns true if the output tensor is broadcast across the batch.
canBroadcastInputAcrossBatch: returns true if the plugin can consume an input that is broadcast across the batch without it being replicated.
attachToContext: attaches the plugin to an execution context and grants the plugin access to some context resources.
detachFromContext: detaches the plugin object from its execution context.
configurePlugin: configures the layer. It communicates the number of inputs and outputs, all of their dimensions and data types, broadcast information, the selected plugin format, and the maximum batch size. At this point the plugin sets up its internal state and chooses the most appropriate algorithms and data structures for the given configuration.
supportsFormatCombination: tells TensorRT whether the plugin supports the data type and format of the input/output at position pos.
class IPluginV2
{
public:
virtual AsciiChar const* getPluginType() const noexcept = 0;
virtual AsciiChar const* getPluginVersion() const noexcept = 0;
virtual int32_t getNbOutputs() const noexcept = 0;
virtual Dims getOutputDimensions(int32_t index, Dims const* inputs, int32_t nbInputDims) noexcept = 0;
virtual bool supportsFormat(DataType type, PluginFormat format) const noexcept = 0;
virtual int32_t initialize() noexcept = 0;
virtual void terminate() noexcept = 0;
virtual size_t getWorkspaceSize(int32_t maxBatchSize) const noexcept = 0;
virtual int32_t enqueue(int32_t batchSize, void const* const* inputs, void* const* outputs, void* workspace,
cudaStream_t stream) noexcept
= 0;
virtual size_t getSerializationSize() const noexcept = 0;
virtual void serialize(void* buffer) const noexcept = 0;
virtual void destroy() noexcept = 0;
virtual IPluginV2* clone() const noexcept = 0;
virtual void setPluginNamespace(AsciiChar const* pluginNamespace) noexcept = 0;
virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
};

class IPluginV2Ext : public IPluginV2
{
public:
virtual nvinfer1::DataType getOutputDataType(
int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept = 0;
virtual bool isOutputBroadcastAcrossBatch(
int32_t outputIndex, bool const* inputIsBroadcasted, int32_t nbInputs) const noexcept = 0;
virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
virtual void attachToContext(cudnnContext* /*cudnn*/, cublasContext* /*cublas*/, IGpuAllocator* /*allocator*/) noexcept {}
virtual void detachFromContext() noexcept {}
};

class IPluginV2IOExt : public IPluginV2Ext
{
public:
virtual void configurePlugin(
PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out, int32_t nbOutput) noexcept = 0;
virtual bool supportsFormatCombination(
int32_t pos, PluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) const noexcept = 0;
};
OK, that's not so many after all (/(ㄒoㄒ)/~~). Some of these member functions are just simple queries, and at this stage we can implement them with little more than an empty pair of braces.
Let's look at how the yolov5 layer in https://github.com/wang-xinyu/tensorrtx overrides them.
class API YoloLayerPlugin : public IPluginV2IOExt
{
public:
YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel);
YoloLayerPlugin(const void* data, size_t length);
~YoloLayerPlugin();
int getNbOutputs() const TRT_NOEXCEPT override
{
return 1;
}
Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;
int initialize() TRT_NOEXCEPT override;
virtual void terminate() TRT_NOEXCEPT override {};
virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }
virtual int enqueue(int batchSize, const void* const* inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;
virtual size_t getSerializationSize() const TRT_NOEXCEPT override;
virtual void serialize(void* buffer) const TRT_NOEXCEPT override;
bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {
return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;
}
const char* getPluginType() const TRT_NOEXCEPT override;
const char* getPluginVersion() const TRT_NOEXCEPT override;
void destroy() TRT_NOEXCEPT override;
IPluginV2IOExt* clone() const TRT_NOEXCEPT override;
void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;
const char* getPluginNamespace() const TRT_NOEXCEPT override;
DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;
bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;
bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;
void attachToContext(
cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;
void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;
void detachFromContext() TRT_NOEXCEPT override;
// ... (private members such as mClassCount, mYoloKernel, mAnchor and the forwardGpu helper are omitted)
};
That looks like a wall of code, but the only function with real work in it is enqueue. Let's walk through each function's implementation alongside the yololayer.cu file.
First, a quick look at the YoloKernel struct. It holds the anchor-related properties, i.e. the feature-map (grid) size of each detection head and its anchor sizes, for YoloLayerPlugin to use:
struct YoloKernel
{
int width;
int height;
float anchors[CHECK_COUNT * 2];
};
YoloLayerPlugin(int ...): sets the number of detection classes classCount, the network input size netWidth / netHeight, the maximum number of output detections maxOut, and vYoloKernel, the YoloKernel information for the three FPN scales, i.e. each head's feature-map size and its anchor sizes.
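For example, vYoloKernel could be filled like this. This is purely illustrative (the real values live in yololayer.h in tensorrtx): an 80x80 grid with anchors (10,13), (16,30), (33,23) is the standard stride-8 head of a 640x640 YOLOv5 model, and CHECK_COUNT is assumed to be 3.
// Illustrative only: the P3/8 head of a 640x640 YOLOv5 model.
Yolo::YoloKernel kernelP3{ 80, 80, { 10, 13, 16, 30, 33, 23 } };   // grid 80x80, three anchors
std::vector<Yolo::YoloKernel> vYoloKernel{ kernelP3 /*, plus the P4/16 and P5/32 heads */ };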
cudaMallocHost(void** ptr, size_t size): allocates page-locked (pinned, non-pageable) host memory. Here it works together with the cudaMalloc calls a couple of lines further down: cudaMallocHost allocates mAnchor as a pinned host array of pointers, and each entry is then pointed at a device buffer allocated with cudaMalloc and filled with cudaMemcpy, so the kernel can later receive each head's anchors as (float*)mAnchor[i].
YoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut,
const std::vector<Yolo::YoloKernel> &vYoloKernel)
{
mClassCount = classCount;
mYoloV5NetWidth = netWidth;
mYoloV5NetHeight = netHeight;
mMaxOutObject = maxOut;
mYoloKernel = vYoloKernel;
mKernelCount = vYoloKernel.size();
// mAnchor is a pinned host array of pointers; each entry will point to a device buffer holding one head's anchors
CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void *)));
size_t AnchorLen = sizeof(float) * CHECK_COUNT * 2;
for (int ii = 0; ii < mKernelCount; ii++)
{
CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));
const auto &yolo = mYoloKernel[ii];
CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));
}
}
YoloLayerPlugin::~YoloLayerPlugin(): frees each head's device anchor buffer and then the pinned host array.
YoloLayerPlugin::~YoloLayerPlugin()
{
for (int ii = 0; ii < mKernelCount; ii++)
{
CUDA_CHECK(cudaFree(mAnchor[ii]));
}
CUDA_CHECK(cudaFreeHost(mAnchor));
}
YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length): reads the parameters back out of the serialized blob. Same idea as before: allocate the pinned host array, then allocate the device buffers and point the entries at them. (The serialization format is analyzed below, which explains the read(d, ...) calls.)
YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)
{
using namespace Tn;
const char *d = reinterpret_cast<const char *>(data), *a = d;
read(d, mClassCount);
read(d, mThreadCount);
read(d, mKernelCount);
read(d, mYoloV5NetWidth);
read(d, mYoloV5NetHeight);
read(d, mMaxOutObject);
mYoloKernel.resize(mKernelCount);
auto kernelSize = mKernelCount * sizeof(YoloKernel);
memcpy(mYoloKernel.data(), d, kernelSize);
d += kernelSize;
CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));
size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;
for (int ii = 0; ii < mKernelCount; ii++)
{
CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));
const auto& yolo = mYoloKernel[ii];
CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));
}
assert(d == a + length);
}
void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT: serializes the plugin. This code writes the yololayer-related parameters into the buffer in a fixed layout.
In TensorRT code you will usually see a call like engine->serialize(); behind that API, serializing the engine is what ends up invoking each plugin's serialize function.
void YoloLayerPlugin::serialize(void* buffer) const
{
using namespace Tn;
char* d = static_cast<char*>(buffer), *a = d;
write(d, mClassCount);
write(d, mThreadCount);
write(d, mKernelCount);
write(d, mYoloV5NetWidth);
write(d, mYoloV5NetHeight);
write(d, mMaxOutObject);
auto kernelSize = mKernelCount*sizeof(YoloKernel);
memcpy(d,mYoloKernel.data(),kernelSize);
d += kernelSize;
assert(d == a + getSerializationSize());
}
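The read and write helpers from the Tn namespace used above are nothing more than small templates that copy a trivially-copyable value to or from the byte stream and advance the cursor. A plausible sketch (the actual tensorrtx helpers may differ slightly in detail):
#include <cstring>

namespace Tn {
// Copy a value out of the byte stream and advance the cursor.
template <typename T>
void read(const char*& buffer, T& val)
{
    std::memcpy(&val, buffer, sizeof(T));
    buffer += sizeof(T);
}

// Copy a value into the byte stream and advance the cursor.
template <typename T>
void write(char*& buffer, const T& val)
{
    std::memcpy(buffer, &val, sizeof(T));
    buffer += sizeof(T);
}
} // namespace Tn
So the serialized layout of this plugin is simply: classCount, threadCount, kernelCount, netWidth, netHeight, maxOutObject, followed by kernelCount YoloKernel structs, which is exactly what getSerializationSize adds up below.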
At this point some readers may wonder: hold on, YOLOv5's Detect module contains conv layers, so why don't we see any conv weights being handled in this serialization? Here is the original yolov5 code:
class Detect(nn.Module):
stride = None # strides computed during build
onnx_dynamic = False # ONNX export parameter
def __init__(self, nc=80, anchors=(), ch=(), inplace=True): # detection layer
super().__init__()
......
self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch) # output conv
self.inplace = inplace # use in-place ops (e.g. slice assignment)
def forward(self, x):
z = [] # inference output
for i in range(self.nl):
x[i] = self.m[i](x[i]) # conv
bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)
......
return x if self.training else (torch.cat(z, 1), x)
In tensorrtx the author actually splits Detect into two parts:
1. The feature-map convolutions, IConvolutionLayer* det0/det1/det2, which correspond to self.m -> conv at the three scales in the original yolov5 code.
2. Everything else, which is handled by YoloLayerPlugin.
The actual code looks like this:
IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.0.weight"], weightMap["model.24.m.0.bias"]);
IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.1.weight"], weightMap["model.24.m.1.bias"]);
IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.2.weight"], weightMap["model.24.m.2.bias"]);
auto yolo = addYoLoLayer(network, weightMap, "model.24", std::vector<IConvolutionLayer*>{det0, det1, det2});
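What addYoLoLayer boils down to is constructing the plugin object and attaching it to the network with addPluginV2, using the outputs of the three detection convs as its inputs. A simplified sketch of that idea (the real helper in tensorrtx builds the plugin through the plugin creator/registry and pulls its constants from yololayer.h, so names like Yolo::INPUT_W, Yolo::MAX_OUTPUT_BBOX_COUNT and the kernels vector here are assumptions for illustration):
// Simplified sketch: attach the YOLO decode plugin to the network.
std::vector<ITensor*> detTensors{ det0->getOutput(0), det1->getOutput(0), det2->getOutput(0) };
auto* plugin = new YoloLayerPlugin(Yolo::CLASS_NUM, Yolo::INPUT_W, Yolo::INPUT_H,
                                   Yolo::MAX_OUTPUT_BBOX_COUNT, kernels);
IPluginV2Layer* yolo = network->addPluginV2(detTensors.data(), static_cast<int>(detTensors.size()), *plugin);
network->markOutput(*yolo->getOutput(0));   // the plugin's output becomes the network output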
With that in mind, the remaining functions are easy to understand.
One more struct before we continue:
struct alignas(float) Detection {
//center_x center_y w h
float bbox[LOCATIONS];
float conf; // bbox_conf * cls_conf
float class_id;
};
Detection describes a single detection result. In the plugin's output buffer each image gets one float holding the detection count, followed by up to mMaxOutObject of these Detection structs packed back to back; that is exactly why getOutputDimensions below returns totalsize + 1.
size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT
{
return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount) +
sizeof(Yolo::YoloKernel) * mYoloKernel.size() + sizeof(mYoloV5NetWidth) + sizeof(mYoloV5NetHeight) +
sizeof(mMaxOutObject);
}
int YoloLayerPlugin::initialize() TRT_NOEXCEPT
{
return 0;
}
Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims *inputs, int nbInputDims) TRT_NOEXCEPT
{
//output the result to channel
int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);
return Dims3(totalsize + 1, 1, 1);
}
// Set plugin namespace
void YoloLayerPlugin::setPluginNamespace(const char *pluginNamespace) TRT_NOEXCEPT
{
mPluginNamespace = pluginNamespace;
}
const char *YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT
{
return mPluginNamespace;
}
// Return the DataType of the plugin output at the requested index
DataType
YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType *inputTypes, int nbInputs) const TRT_NOEXCEPT
{
return DataType::kFLOAT;
}
// Return true if output tensor is broadcast across a batch.
bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool *inputIsBroadcasted,
int nbInputs) const TRT_NOEXCEPT
{
return false;
}
// Return true if plugin can use input that is broadcast across batch without replication.
bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT
{
return false;
}
void YoloLayerPlugin::configurePlugin(const PluginTensorDesc *in, int nbInput, const PluginTensorDesc *out,
int nbOutput) TRT_NOEXCEPT
{
}
// Attach the plugin object to an execution context and grant the plugin the access to some context resource.
void YoloLayerPlugin::attachToContext(cudnnContext *cudnnContext, cublasContext *cublasContext,
IGpuAllocator *gpuAllocator) TRT_NOEXCEPT
{
}
// Detach the plugin object from its execution context.
void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT
{}
const char *YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT
{
return "YoloLayer_TRT";
}
const char *YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT
{
return "1";
}
void YoloLayerPlugin::destroy() TRT_NOEXCEPT
{
delete this;
}
// Clone the plugin
IPluginV2IOExt *YoloLayerPlugin::clone() const TRT_NOEXCEPT
{
YoloLayerPlugin *p = new YoloLayerPlugin(mClassCount, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject,
mYoloKernel);
p->setPluginNamespace(mPluginNamespace);
return p;
}
OK, as you can see, most of the functions above are just minimal overrides of the virtual interface. The real work lives in enqueue:
int YoloLayerPlugin::enqueue(int batchSize, const void *const *inputs, void *TRT_CONST_ENQUEUE *outputs,
void *workspace, cudaStream_t stream) TRT_NOEXCEPT
{
forwardGpu((const float *const *) inputs, (float *) outputs[0], stream, batchSize);
return 0;
}
After all that groundwork we finally reach the main part: let's see how the CUDA basics from the earlier posts are used to write the actual computation. Even the most elaborate piece of music still breaks down into the five basic notes.
forwardGpu:
void YoloLayerPlugin::forwardGpu(const float *const *inputs, float *output, cudaStream_t stream, int batchSize)
{
int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);
for (int idx = 0; idx < batchSize; ++idx)
{
CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));
}
int numElem = 0;
for (unsigned int i = 0; i < mYoloKernel.size(); ++i)
{
const auto &yolo = mYoloKernel[i];
numElem = yolo.width * yolo.height * batchSize;
if (numElem < mThreadCount) mThreadCount = numElem;
//printf("Net: %d %d \n", mYoloV5NetWidth, mYoloV5NetHeight);
CalDetection <<< (numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream >>>
(inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float *) mAnchor[i], mClassCount, outputElem);
}
}
1. First, cudaMemsetAsync zeroes the first float of each image's output slice, which is the detection counter the kernel will later bump with atomicAdd. Note that it does not allocate anything; it is issued on the same stream as the kernel launches so that, despite being asynchronous, the operations stay correctly ordered.
2. Then CalDetection is launched once per FPN scale to decode that scale's feature map.
The <<<...>>> launch syntax tells us CalDetection is a kernel. First,
numElem = yolo.width * yolo.height * batchSize;
computes the total number of grid cells to process across the batch. In the code mThreadCount = 256 (a hand-picked value; any reasonable multiple of 32 will do), and the first launch parameter, the grid size, is computed with the usual round-up formula we covered in the earlier chapters.
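A quick sanity check on the launch configuration, assuming the default mThreadCount of 256 and a 640x640 input (both are just common values, not requirements):
// 80x80 head at batch size 1: 6400 grid cells in total.
int numElem = 80 * 80 * 1;
int blocks  = (numElem + 256 - 1) / 256;   // = 25 blocks of 256 threads each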
OK, a quick refresher on the concepts: CalDetection is called from the CPU (host) and executes on the device, so it is declared with the __global__ qualifier.
__global__ void CalDetection(const float *input, float *output, int noElements,
const int netwidth, const int netheight, int maxoutobject, int yoloWidth,
int yoloHeight, const float anchors[CHECK_COUNT * 2], int classes, int outputElem)
{
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx >= noElements) return;
int total_grid = yoloWidth * yoloHeight;
int bnIdx = idx / total_grid;
idx = idx - total_grid * bnIdx;
int info_len_i = 5 + classes;
const float *curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);
for (int k = 0; k < CHECK_COUNT; ++k)
{
float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);
if (box_prob < IGNORE_THRESH) continue;
int class_id = 0;
float max_cls_prob = 0.0;
for (int i = 5; i < info_len_i; ++i)
{
float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);
if (p > max_cls_prob)
{
max_cls_prob = p;
class_id = i - 5;
}
}
float *res_count = output + bnIdx * outputElem;
int count = (int) atomicAdd(res_count, 1);
if (count >= maxoutobject) return;
char *data = (char *) res_count + sizeof(float) + count * sizeof(Detection);
Detection *det = (Detection *) (data);
int row = idx / yoloWidth;
int col = idx % yoloWidth;
//Location
// pytorch:
// y = x[i].sigmoid()
// y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i] # xy
// y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
// X: (sigmoid(tx) + cx)/FeaturemapW * netwidth
det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) *
netwidth / yoloWidth;
det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) *
netheight / yoloHeight;
// W: (Pw * e^tw) / FeaturemapW * netwidth
// v5: https://github.com/ultralytics/yolov5/issues/471
det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);
det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];
det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);
det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];
det->conf = box_prob * max_cls_prob;
det->class_id = class_id;
}
}
1. First compute the current thread ID and use it to locate the grid cell this thread is responsible for. (Here you can also see that TensorRT work is, at heart, the same as the simple matrix examples from the first few posts, just a tiny bit (fine, a lot) more involved in its operations and bookkeeping: each thread is assigned its own element to process.)
2. Decode x, y, w, h (plus the confidences); the math mirrors the Detect module in the original yolov5 code, so I won't repeat it here. The Logist helper used throughout is simply the sigmoid, implemented as a __device__ function:
__device__ float Logist(float data)
{ return 1.0f / (1.0f + expf(-data)); };
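Once the engine has run, the host side only needs to copy the result buffer back and walk the layout described earlier: for each image, output[0] is the detection count that the kernel bumped with atomicAdd, followed by up to maxOutObject packed Detection structs. A minimal sketch of that parsing step (parseYoloOutput is a hypothetical helper name, not something from tensorrtx):
#include <algorithm>
#include <cstring>
#include <vector>

// Hypothetical helper: parse the plugin output for one image (run NMS on the result afterwards).
// `output` points at this image's slice of the host-side copy of the result buffer.
std::vector<Detection> parseYoloOutput(const float* output, int maxOutObject)
{
    const int detSize = sizeof(Detection) / sizeof(float);
    const int count = std::min(static_cast<int>(output[0]), maxOutObject);
    std::vector<Detection> dets(count);
    for (int i = 0; i < count; ++i) {
        // +1 skips the leading count; each detection occupies detSize floats.
        std::memcpy(&dets[i], output + 1 + i * detSize, sizeof(Detection));
    }
    return dets;
}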
Having walked through the whole flow, writing this in CUDA doesn't feel quite so mysterious anymore, does it?