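51c~TensorRT~Collection 1

Original

qq6669490e54384 2024-08-14 22:12:27 ©Copyright

Tags: vision. Categories: computer vision, artificial intelligence

©Copyright belongs to the author: this is an original work by 51CTO blog author qq6669490e54384. Contact the author for permission to repost; otherwise legal liability will be pursued.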

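I. TensorRT-LLM~Best Deployment Practices

A detailed walkthrough of TensorRT-LLM (Large Language Model) deployment practice

A quick recap of TRT-LLM

TensorRT-LLM was already introduced in earlier posts, so I won't repeat that here.

Here is an overview of what TensorRT-LLM does and where it fits:

[Figure: trt-llm features and architecture]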

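Like vllm, lmdeploy, and sglang^[6], TRT-LLM provides inference support for large models. It covers the following parts of large-model inference:

Model structure: predefined model architectures
Runtime scheduling (inflight batching, kv cache reuse)
Kernels (MMHA, FMHA)
Quantization (FP8, INT8, INT4, kv cache)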

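Let's go through them one by one:

Model structure

The model structure is simply the predefined network architecture of llama or another large model, which you can reuse directly.

Once the model is built, TensorRT generates the kernels for you. Unlike small models, which go the onnx route, trt-llm improves on the TensorRT Python API to make it easier to use, easier to build with, and more flexible. That said, it is honestly still a bit harder than building with vllm.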

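Kernel optimization

For large models, optimizing kernels alone is not enough. With small models, the first instinct when optimizing is to optimize the kernels, but for large models the runtime and scheduling matter just as much.

Optimizing kernels directly improves model performance and lowers latency, while the runtime, or scheduling, raises overall throughput.

The kernels used most in trt-llm today are MMHA (MaskedMultiheadAttention) and FMHA (FusedMultiheadAttention). Both are fused multi-head attention kernels: FMHA serves the context phase and MMHA the generation phase. Both were adapted from FasterTransformer and have gone through many revisions since.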

Runtime scheduling

Runtime scheduling matters too. Beyond the basic inflight batching, kv cache optimization is currently the more important piece.

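Demand for long contexts keeps growing, so the memory needed to hold the kv cache keeps getting heavier. That creates the need for kv cache compression, meaning quantization and low-rank methods. With very long contexts, the kv cache can need even more GPU memory than the model itself, so kv cache compression (INT4, for example) matters a lot.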
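
A back-of-the-envelope calculation shows how quickly this grows. The sketch below assumes a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, roughly 7 billion parameters); the function name and all numbers are illustrative assumptions, not figures from the talk, and the INT4 line ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Bytes needed to cache K and V for every layer, head, and token."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 7B-class model (assumption, not from the talk).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

weights_fp16 = 7e9 * 2  # about 7B parameters at 2 bytes each
for seq_len in (4096, 32768, 131072):
    fp16 = kv_cache_bytes(seq_len=seq_len, batch_size=1, bytes_per_elem=2, **cfg)
    int4 = kv_cache_bytes(seq_len=seq_len, batch_size=1, bytes_per_elem=0.5, **cfg)
    print(f"seq_len={seq_len:>6}: fp16 {fp16 / GIB:5.1f} GiB, int4 {int4 / GIB:5.1f} GiB, "
          f"weights {weights_fp16 / GIB:5.1f} GiB")
```

Under these assumptions a single 32k-token sequence already needs more FP16 kv cache than the FP16 weights occupy, and INT4 cuts that to a quarter.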

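Changes to the runtime also affect how kernels are implemented; kernels have to be modified to fit the runtime. TP and PP are also standard and count as part of the runtime. Where TensorRT supports only a single GPU, trt-llm adds multi-GPU support, implemented through trt plugins.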
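
To see why inflight batching raises overall throughput, here is a toy simulation that counts decode steps. It is only a sketch: the function names and the request lengths are made up, each request is assumed to need at least one step, and a real scheduler also has to handle the context phase and kv cache memory.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def inflight_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next queued request right away."""
    queue = deque(output_lens)
    active = []  # remaining decode steps of the requests currently in the batch
    steps = 0
    while queue or active:
        # Fill free slots from the queue before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request advances one token; finished ones leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps

lens = [3, 9, 2, 8, 4, 1]  # decode steps each request needs (made-up numbers)
print(static_batching_steps(lens, batch_size=2))    # 21
print(inflight_batching_steps(lens, batch_size=2))  # 14
```

With static batching, short requests wait for the longest one in their batch; with inflight batching the slots stay busy, so the same work finishes in fewer steps.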

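Quantization

Finally, quantization. trt-llm supports quite a few quantization schemes, which I'll skip for now. Quantization needs its own write-up, and the open day had a separate talk dedicated to it.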

On open source

A common complaint about TRT is that it is not fully open source, and there is not much you can modify.

[Figure: trt-llm, open source and semi-open source]

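The reason is that many kernels are custom-optimized for specific hardware. If they were released, the code could be used to infer the low-level design of the hardware.

Take the FMHA code: within a single configuration you can see many implementations targeting different sm architectures:

[Figure: FMHA code with implementations for different sm architectures]

There are different implementations for different sizes and different GPU architectures, quite fine-grained and pushed to the extreme. In short, the hardware-related code is not open source; the rest is.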

Also worth mentioning is the runtime code. GptSession was at least open source, but after the switch to Executor it became closed source. To add a feature you can only file a request with NVIDIA; you cannot modify it yourself.

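End-to-end workflow

Here we walk through the end-to-end flow of developing with trt-llm.

First, the Python API. Compared with the raw TensorRT python-api, TRT-LLM adds another layer on top that stays as close to pytorch style as possible, which makes new networks easier to build.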

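Note, however, that the python-api only builds the network. The result is just a TensorRT network definition (similar to the intermediate IR in onnx2trt) and cannot run directly (unlike Pytorch); you have to build an engine first. Actual execution still happens in C++, as with TensorRT.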

TRT-LLM also provides a High Level API, in a relationship similar to that between torch and huggingface.

The flow is of course still the same: convert the weight format, build the trt-llm network structure, build the engine, and then run.

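Back to the end-to-end workflow, which consists of the steps just discussed:

First, convert the checkpoint, turning the original weights into trt-llm's format
Then load the converted weights into the predefined network structure
Finally, build the network, which now has its weights loaded and its structure defined. This differs from pytorch: pytorch uses a dynamic graph while trt-llm uses a static graph, so it is less convenient. You define the network structure first and then build to get the final engine, which is also the optimized computation graph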
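
The steps above can be sketched as a short sequence of commands. This is an assumption-laden illustration: the paths are hypothetical, and the script and flag names (convert_checkpoint.py, trtllm-build, run.py and their options) follow the llama example as I recall it around the time of the talk, so check them against the version you have installed. The snippet only prints the commands; it does not run them.

```python
import shlex

# Hypothetical paths; adjust to your own layout.
hf_model_dir = "./llama-7b-hf"
ckpt_dir = "./tllm_checkpoint"
engine_dir = "./trt_engines"

steps = [
    # 1. Convert the original weights to the trt-llm checkpoint format.
    ["python", "convert_checkpoint.py", "--model_dir", hf_model_dir,
     "--output_dir", ckpt_dir, "--dtype", "float16"],
    # 2 and 3. Load the checkpoint into the predefined network and build the engine.
    ["trtllm-build", "--checkpoint_dir", ckpt_dir,
     "--output_dir", engine_dir, "--gemm_plugin", "float16"],
    # 4. Smoke-test the engine from python before deploying it.
    ["python", "run.py", "--engine_dir", engine_dir,
     "--tokenizer_dir", hf_model_dir, "--max_output_len", "50"],
]

for cmd in steps:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```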

The compiled engine can be tested in python first. Once it checks out, deploy it in C++ and finally serve it through triton.

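Installation

The open day also briefly covered installation, probably because installing TRT-LLM really does have a lot of pitfalls...

The first option is to compile the trt-llm source yourself inside docker, which is what I usually do. The upside is that you can modify the source and don't have to worry about the environment; the downside is that it demands a somewhat better network connection (if you know, you know).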

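The second option is to install directly with pip. I recommend this when you already have an environment that runs trt-llm and want to update the version, or when you have a clean environment with trt-llm's dependencies (an image pulled from ngc, say, or the image built in the first option); just pip install inside it.

If you pip install directly in some other development environment, it may clash with local libraries (for example, a torch you compiled yourself, or a different gcc version) and some symbols may not be found, so a clean environment is best.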

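The third option is to use the images provided in NGC.

NVIDIA NGC (NVIDIA GPU Cloud) is a hub for GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). The platform provides containers, pretrained models, model scripts, and industry solutions that help data scientists, developers, and researchers build solutions faster. The images we use for rapid development usually come from here.

The NGC images come with trtllm-triton-backend and trt-llm preinstalled, so the system environment trt-llm needs is already in place. Although trt-llm is precompiled there, we can still compile other versions ourselves later, so it is fairly flexible.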

Finally, a summary of the pros and cons of each installation method:

[Figure: pros and cons of each installation method]

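Converting weights (checkpoint)

Previously every model in TRT-LLM had its own convert script, which was messy and hard to maintain, so TRT-LLM has now unified the convert interface:

[Figure: the unified convert interface]

Unifying checkpoint conversion brings many benefits:

[Figure: benefits of unified checkpoint conversion]

After conversion, the weights have to be loaded into a model, which requires a model definition. trt-llm ships definitions for a number of popular models and supports them:

[Figure: models supported by trt-llm]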

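The fourth step, once the weights are in trt-llm format, is the build. One detail: many build parameters affect performance. The official defaults generally work well, but you will still need to tune them for your own requirements, whether for speed or for accuracy.

Once the engine is built, you can run it. I suggest testing on the python side first with run.py. You can then use other .py files or gptManagerBenchmark to evaluate the model's accuracy or performance.

Public LLM test sets such as MMLU can be used to check accuracy after the trt-llm build. Typically you test the pytorch model, test the trt-llm one, and compare the two.

TRT-LLM also provides a benchmark tool. gptManagerBenchmark is a precompiled executable dedicated to performance testing, and it can also measure overall throughput with inflight batching enabled.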

How to debug

For debugging there are two loggers, which can be enabled by setting an environment variable or passing an argument.

The loggers can provide a lot of useful information to help with debugging:

Python side: controlled by --log_level in the python examples (defined in tensorrt_llm/logger.py)
C++ side: this one is less obvious and mostly used by developers. It is controlled by the TLLM_LOG_LEVEL environment variable (defined in cpp/include/tensorrt_llm/common/logger.h). It can print all function calls at the C++ level, which helps trace the code and locate where an error occurs

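Compilation

The talk also mentioned ways to speed up compilation. Sometimes recompiling after modifying a few source files takes a long time.

For example, if you change a .h header that many C++ files include, all of those C++ files will in principle be recompiled, and since trt-llm has many kernels to compile, the build takes a long time.

The official suggestions:

[Figure: official suggestions for speeding up compilation]

When testing on a particular card, there is usually no need to compile for every cuda architecture; compile only for the architecture of the card you are using.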

Finding issues

[Figure: slide on finding issues]

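Options that may affect accuracy:

Training in BF16 and running in FP16 (or vice versa) may matter little for small models, but on large models it can still cause accuracy problems;
context_fmha vs context_fmha_fp32_acc: the default is fp16 accumulation. If you hit accuracy problems, try fp32_acc, at some cost in speed;
Disable gemm_plugin: previously we kept it on by default, mainly because it speeds up the build. Since TRT 10 improved build speed and added FP32 accumulation support, you can also try TRT's internal gemm in search of better performance. Both are worth trying;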
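
The first item is easier to appreciate with numbers. BF16 keeps the 8-bit exponent of FP32 but only 7 mantissa bits, while FP16 has 5 exponent bits and 10 mantissa bits, so values that are fine in one format can overflow, underflow, or lose precision in the other. The sketch below emulates both roundings with the standard library only; the helper names are my own and NaN handling is omitted.

```python
import struct

def to_bf16(x):
    """Round a float to bfloat16 and return the result as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    """Round a float to IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # magnitude beyond the fp16 maximum of 65504
        return float("inf") if x > 0 else float("-inf")

for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>8}: bf16 -> {to_bf16(value)!r}, fp16 -> {to_fp16(value)!r}")
```

A large activation such as 70000 survives in BF16 (as 70144) but becomes inf in FP16, and 1e-8 flushes to zero in FP16; in the other direction, 1.001 keeps more precision in FP16 than in BF16, which rounds it to 1.0.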

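How to add a new model

The original motivation for TRT-LLM's new Python API was to make later changes, and adding new models, easier.

Because TRT-LLM grew out of TRT, networks are built through the trt-python-api. trt-llm improves on this api, though it may still be harder to use than pytorch. trt-llm does provide examples, however; for instance, we can adapt the llama implementation to other models.
Here is the official process for adding a new model:

[Figure: official process for adding a new model]

Concretely, this means following the llama implementation, together with the basic llm classes and the layers that implement them internally.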

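Besides building the model, you also need to implement the weight conversion, from huggingface weights to the trt-llm weight format.

If the official examples lack an implementation for some layer in your model, but the layer can be expressed with the layer interfaces already provided, you can build it with the official Python API.

Of course, if the provided layer and functional interfaces don't cover your case either, you will have to hand-write a kernel, which is more involved. As with plugins in TRT, you first define a kernel plugin, write the kernel implementation, and then reference it from outside.

That is everything from the NVIDIA AI Technology Open Day session on TRT-LLM best performance practices.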

References

https://www.bilibili.com/video/BV1aT42167mk/?spm_id_from=pageDriver&vd_source=eec038509607175d58cdfe2e824e8ba2^[7]

Reference materials

[1] NVIDIA AI Technology Open Day, Summer 2024: https://space.bilibili.com/1320140761/channel/collectiondetail?sid=3446369

[2] TRT-LLM Best Deployment Practices: https://www.bilibili.com/video/BV1MS421d7Jm/

[3] TRT-LLM Best Deployment Practices: https://www.bilibili.com/video/BV1MS421d7Jm/

[4] A First Look at TensorRT-LLM (1): running llama on the latest commit, and triton-tensorrt-llm-backend: https://ai.oldpan.me/t/topic/260

[5] A First Look at TensorRT-LLM (2): a brief look at the structure, for a clearer understanding: https://ai.oldpan.me/t/topic/203

[7] https://www.bilibili.com/video/BV1aT42167mk/?spm_id_from=pageDriver&vd_source=eec038509607175d58cdfe2e824e8ba2

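II. Hand-Built TensorRT Networks That Balance Flexibility, Performance, and Debuggability

Anyone who has used TensorRT has probably come across trtexec^[1], which quickly and conveniently converts your ONNX model into a TensorRT engine:

./trtexec --onnx=model.onnx

How does it work? This involves another library, onnx-tensorrt^[2], which parses the onnx model and converts each op in it into a TensorRT op, then builds the engine. The core of trtexec's model conversion is onnx-tensorrt.

Without onnx-tensorrt^[3], how would we use TensorRT to accelerate a model?

Fortunately, TensorRT officially provides an API^[4] for building networks, so you can hand-build a network much as you would in Pytorch. The TensorRTx^[5] library, for example, contains many TensorRT networks built directly with the API: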

nvinfer1::IHostMemory* buildEngineYolov8n(nvinfer1::IBuilder* builder,
                                          nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path) {
    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);

    /*******************************************************************************************************
    ******************************************  YOLOV8 INPUT  **********************************************
    *******************************************************************************************************/
    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});
    assert(data);

    /*******************************************************************************************************
    *****************************************  YOLOV8 BACKBONE  ********************************************
    *******************************************************************************************************/
    nvinfer1::IElementWiseLayer* conv0 = convBnSiLU(network, weightMap, *data, 16, 3, 2, 1, "model.0");
    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0), 32, 3, 2, 1, "model.1");
    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), 32, 32, 1, true, 0.5, "model.2");
    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0), 64, 3, 2, 1, "model.3");
    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), 64, 64, 2, true, 0.5, "model.4");
    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0), 128, 3, 2, 1, "model.5");
    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), 128, 128, 2, true, 0.5, "model.6");
    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0), 256, 3, 2, 1, "model.7");
    nvinfer1::IElementWiseLayer* conv8 = C2F(network, weightMap, *conv7->getOutput(0), 256, 256, 1, true, 0.5, "model.8");
    nvinfer1::IElementWiseLayer* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), 256, 256, 5, "model.9");
...
}

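Advantages of building this way over using onnx-tensorrt^[6]:

Finer control over every layer in the network, avoiding redundant structures in onnx that hurt performance. In theory, a trt network built through the API therefore performs better after building (it depends, of course: for most models, onnx2trt + TensorRT now performs almost the same as a pure-API build)
Easier to modify a given layer of the trt network later, and to add plugins

The drawback is obvious: building the network is time-consuming and requires familiarity with the TensorRT api, and you may hit countless pitfalls while getting started. In that time, onnx2trt would have converted the model with a single command, so it is nowhere near as handy as onnx2trt.

That said, you can't rely on onnx blindly either. If the network contains unsupported operators, or is unusual in some way, the conversion simply fails. Look at the issues in the onnx-tensorrt repository: even in 2023 there were still all sorts of op problems:

[Figure: op-related issues in the onnx-tensorrt repository]

Also, when the model is very large (yes, I mean llm) and has many layers, onnx becomes awkward. It is not that the model can't be exported, but with a large onnx file, inspecting the network structure and locating problems is hard. You always have to go through the onnx IR, and ONNX has many small pitfalls. You can get the job done in the end, but the process is always laborious (grunt work, if you know, you know).

So is there a better way, one that balances flexibility and performance?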

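A better way, v1

Some of you have probably used TensorRT conversion tools like torch2trt^[7]. They traverse your Pytorch network and convert each op to the corresponding TensorRT op along the way; once the network is assembled, it can be built into a TensorRT engine: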

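import torch
from torchvision.models.segmentation import deeplabv3_resnet50
# torch2trt_dynamic is assumed to be importable from the torch2trt_dynamic package

# Build the model in half precision on the GPU, with a matching dummy input
model = deeplabv3_resnet50().cuda().eval().half()
data = torch.randn((1, 3, 224, 224)).cuda().half()

print('Running torch2trt...')
model_trt = torch2trt_dynamic(
    model, [data], fp16_mode=True, max_workspace_size=1 << 25)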

Take the converter below: when traversal of your model reaches the torch.nn.functional.leaky_relu op, this conversion script runs and produces the corresponding op in the TensorRT network: ctx.network.add_activation(input_trt, trt.ActivationType.LEAKY_RELU).

@tensorrt_converter('torch.nn.functional.leaky_relu')
@tensorrt_converter('torch.nn.functional.leaky_relu_')
def convert_leaky_relu(ctx):
    input = get_arg(ctx, 'input', pos=0, default=None)
    negative_slope = get_arg(ctx, 'negative_slope', pos=1, default=0.01)
    output = ctx.method_return

    input_trt = trt_(ctx.network, input)
    layer = ctx.network.add_activation(input_trt,
                                       trt.ActivationType.LEAKY_RELU)
    layer.alpha = negative_slope

    output._trt = layer.get_output(0)
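
The registration mechanism behind such a decorator can be sketched in a few lines. This is a simplified illustration, not torch2trt's actual implementation: the names CONVERTERS, register_converter, and convert are hypothetical, and the sketch walks a plain list of ops, whereas the real tool intercepts calls during a forward pass and adds layers to a TensorRT network.

```python
CONVERTERS = {}

def register_converter(op_name):
    """Decorator that maps a framework op name to a conversion function."""
    def decorator(fn):
        CONVERTERS[op_name] = fn
        return fn
    return decorator

@register_converter("leaky_relu")
@register_converter("leaky_relu_")
def convert_leaky_relu(node):
    # A real converter would add an activation layer to the network here;
    # this sketch only returns a description of the target layer.
    return {"layer": "ACTIVATION", "type": "LEAKY_RELU",
            "alpha": node.get("negative_slope", 0.01)}

def convert(graph):
    """Walk the ops in order and dispatch each one to its registered converter."""
    layers = []
    for node in graph:
        converter = CONVERTERS.get(node["op"])
        if converter is None:
            raise NotImplementedError(f"no converter registered for {node['op']}")
        layers.append(converter(node))
    return layers

print(convert([{"op": "leaky_relu", "negative_slope": 0.2}]))
```

An op without a registered converter fails loudly, which is exactly the point where you would write a new converter or a plugin.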

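The advantage of this approach is that modifying the network is simple, because conversion goes straight from your pytorch model rather than through onnx. You can modify the network via onnx too, but you still have to pass through the onnx IR, and some ops change on the way from pytorch to onnx, which makes problems hard to locate when they appear.

Also, when debugging you can easily choose which tensors are outputs (find the spot in the network you want as an output, cut out that sub-model, and convert it on its own), which makes it easy to locate problems. With onnx you first have to find the layer correspondence between pytorch and onnx and then set the outputs in the onnx2trt script. TensorRT does officially provide debug tools such as Polygraphy^[8], but in practice they are less convenient than modifying the pytorch network directly.

The later trtorch, also known as torch-TensorRT^[9], works on much the same principle as torch2trt: it also traverses the torch network and converts it layer by layer into TensorRT ops.

A better way, v2

The v1 approach above is more direct than onnx2trt, since conversion happens right in the pytorch model. But all we get is the built TensorRT engine; the construction of the intermediate TensorRT network is hidden. When the network later runs into problems and we want to debug further, our global view of the network is still somewhat lacking. Being able to debug a network built directly with the TensorRT API would be better and more intuitive: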

class Centernet_dla34(object):
    def __init__(self, weights) -> None:
        super().__init__()
        self.weights = weights
        self.levels = [1, 1, 1, 2, 2, 1]
        self.channels = [16, 32, 64, 128, 256, 512]
        self.down_ratio = 4
        self.last_level = 5
        self.engine = self.build_engine()

    def add_batchnorm_2d(self, input_tensor, parent):
        gamma = self.weights[parent + '.weight'].numpy()
        beta = self.weights[parent + '.bias'].numpy()
        mean = self.weights[parent + '.running_mean'].numpy()
        var = self.weights[parent + '.running_var'].numpy()
        eps = 1e-5

        scale = gamma / np.sqrt(var + eps)
        shift = beta - mean * gamma / np.sqrt(var + eps)
        power = np.ones_like(scale)

        return self.network.add_scale(input=input_tensor.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=shift, scale=scale, power=power)
...
    def populate_network(self):
        # Configure the network layers based on the self.weights provided.
        input_tensor = self.network.add_input(
            name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)

        y = self.add_base(input_tensor, 'module.base')

        first_level = int(np.log2(self.down_ratio))
        last_level = self.last_level
        dla_up = self.add_dla_up(y, first_level, 'module.dla_up')
        ida_up = self.add_ida_up(dla_up[:last_level-first_level], self.channels[first_level], [
                                 2 ** i for i in range(last_level - first_level)], 0, 'module.ida_up')

        hm = self.add_head(ida_up[-1], 80, 'module.hm')
        wh = self.add_head(ida_up[-1], 2, 'module.wh')
        reg = self.add_head(ida_up[-1], 2, 'module.reg')

        hm.get_output(0).name = 'hm'
        wh.get_output(0).name = 'wh'
        reg.get_output(0).name = 'reg'
        self.network.mark_output(tensor=hm.get_output(0))
        self.network.mark_output(tensor=wh.get_output(0))
        self.network.mark_output(tensor=reg.get_output(0))
...
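
The add_batchnorm_2d method above folds batch norm into a single scale layer. The following sketch checks that arithmetic for one channel using plain Python floats; the function names and sample values are illustrative, and it relies on the scale layer computing (x * scale + shift) ** power per element.

```python
import math

def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm for a single channel."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Return (scale, shift, power), mirroring add_batchnorm_2d above."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift, 1.0

def scale_layer(x, scale, shift, power):
    """Per-element computation of a scale layer."""
    return (x * scale + shift) ** power

# Made-up statistics for one channel.
params = dict(gamma=1.5, beta=0.2, mean=0.4, var=2.0)
scale, shift, power = fold_batchnorm(**params)
for x in (-3.0, 0.0, 2.5):
    assert math.isclose(batchnorm(x, **params), scale_layer(x, scale, shift, power))
print("folded scale/shift reproduce batch norm")
```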

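But as noted above, building a network this way is laborious. Is there a somewhat more automated method?

Those who have used fx^[10] may remember that it has a to_folder method: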

model = centernet().cuda()  
dummy_input = torch.randn(1, 3, 1024, 1024).cuda()  
res_origin = model(dummy_input)  
  
from torch.fx import symbolic_trace  
m = symbolic_trace(model.fx_model.cpu())  
m.to_folder("fx_debug","centernet_res50")

This writes out the fx-traced network as source code:

class centernet_res50(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.load(r'fx_debug/backbone.pt') # Module(   (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)   (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)   (relu): ReLU(inplace=True)   (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)   (layer1): Module(...)   (layer2): Module(...)   (layer3): Module(...)   (layer4): Module(...) )
        self.upsampler = torch.load(r'fx_debug/upsampler.pt') # Module(   (deconv_layers): Module(     (0): ConvTranspose2d(2048, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (2): ReLU(inplace=True)     (3): ConvTranspose2d(256, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (5): ReLU(inplace=True)     (6): ConvTranspose2d(256, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (8): ReLU(inplace=True)   ) )
        self.head = torch.load(r'fx_debug/head.pt') # Module(   (hm): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))   )   (wh): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 2, kernel_size=(1, 1), stride=(1, 1))   )   (reg): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 2, kernel_size=(1, 1), stride=(1, 1))   ) )
        self.load_state_dict(torch.load(r'fx_debug/state_dict.pt'))

    def forward(self, input):
        input_1 = input
        backbone_conv1 = self.backbone.conv1(input_1);  input_1 = None
        backbone_bn1 = self.backbone.bn1(backbone_conv1);  backbone_conv1 = None
        backbone_relu = self.backbone.relu(backbone_bn1);  backbone_bn1 = None
        backbone_maxpool = self.backbone.maxpool(backbone_relu);  backbone_relu = None
        ...
        head_reg_1 = getattr(self.head.reg, "1")(head_reg_0);  head_reg_0 = None
        head_reg_2 = getattr(self.head.reg, "2")(head_reg_1);  head_reg_1 = None
        return (head_hm_2, head_wh_2, head_reg_2)
        
if __name__ == '__main__':

    model = centernet_res50()
    dummy_input = torch.randn(1, 3, 1024, 1024)
    output = model(dummy_input)

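This way we can simply export the traced model as a py file and then see the model's network structure directly. What we get here is a Pytorch model.

Since a Pytorch model can be generated, could we also generate a network built directly with the TensorRT API?

We first implement a Pytorch-like network interface in the style of the TensorRT API: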

class Downsample2D(Module):

    def __init__(self,
                 channels,
                 use_conv=False,
                 out_channels=None,
                 padding=1) -> None:
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.padding = padding
        stride = (2, 2)

        if use_conv:
            self.conv = Conv2d(self.channels,
                               self.out_channels, (3, 3),
                               stride=stride,
                               padding=(padding, padding))
        else:
            assert self.channels == self.out_channels
            self.conv = AvgPool2d(kernel_size=stride, stride=stride)

    def forward(self, hidden_states):
        assert not hidden_states.is_dynamic()
        batch, channels, _, _ = hidden_states.size()
        assert channels == self.channels

        hidden_states = self.conv(hidden_states)

        return hidden_states

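Looks a lot like a Pytorch network, doesn't it? But the Module inherited here is a separate implementation that mimics nn.Module. I'll skip the details for now. The Conv2d class member here looks little different from the Pytorch version: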

class Conv2d(Module):

    def __init__(
            self,
            in_channels: int,
            out_channels: int,
            kernel_size: Tuple[int, int],
            stride: Tuple[int, int] = (1, 1),
            padding: Tuple[int, int] = (0, 0),
            dilation: Tuple[int, int] = (1, 1),
            groups: int = 1,
            bias: bool = True,
            padding_mode: str = 'zeros',  # TODO: refine this type
            dtype=None) -> None:
        super().__init__()
        if groups <= 0:
            raise ValueError('groups must be a positive integer')
        if in_channels % groups != 0:
            raise ValueError('in_channels must be divisible by groups')
        if out_channels % groups != 0:
            raise ValueError('out_channels must be divisible by groups')

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.padding_mode = padding_mode

        self.weight = Parameter(shape=(out_channels, in_channels // groups,
                                       *kernel_size),
                                dtype=dtype)
        if bias:
            self.bias = Parameter(shape=(out_channels, ), dtype=dtype)
        else:
            self.register_parameter('bias', None)

    def forward(self, input):
        return conv2d(input, self.weight.value,
                      None if self.bias is None else self.bias.value,
                      self.stride, self.padding, self.dilation, self.groups)
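
For a feel of what such a standalone Module might look like, here is a minimal sketch. It is not the actual implementation behind the classes above: the Parameter and Module classes below are simplified stand-ins that only show how parameters and submodules can be registered and enumerated by dotted name, which is what lets converted weights be matched to layers.

```python
class Parameter:
    """Placeholder for a weight: only shape and dtype are known until weights are loaded."""
    def __init__(self, shape, dtype=None):
        self.shape = shape
        self.dtype = dtype
        self.value = None

class Module:
    def __init__(self):
        # Bypass our own __setattr__ while creating the registries.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Route parameters and submodules into registries, as nn.Module does.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def register_parameter(self, name, param):
        self._parameters[name] = param
        object.__setattr__(self, name, param)

    def named_parameters(self, prefix=""):
        """Yield (dotted_name, parameter) pairs, recursing into submodules."""
        for name, param in self._parameters.items():
            if param is not None:
                yield prefix + name, param
        for name, module in self._modules.items():
            yield from module.named_parameters(prefix + name + ".")

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter((out_features, in_features))
        self.register_parameter("bias", None)  # no bias in this example

class Block(Module):
    def __init__(self):
        super().__init__()
        self.proj = Linear(4, 8)

print([name for name, _ in Block().named_parameters()])  # ['proj.weight']
```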

Now let's look at the core implementation, conv2d(input, self.weight.value, ...):

def conv2d(input: Tensor,  
           weight: Tensor,  
           bias: Optional[Tensor] = None,  
           stride: Tuple[int, int] = (1, 1),  
           padding: Tuple[int, int] = (0, 0),  
           dilation: Tuple[int, int] = (1, 1),  
           groups: int = 1) -> Tensor:  
  
    assert not input.is_dynamic()  
  
    ndim = input.ndim()  
    if ndim == 3:  
        input = expand_dims(input, 0)  
  
    noutput = weight.size()[0]  
    kernel_size = (weight.size()[-2], weight.size()[-1])  
  
    is_weight_constant = (weight.producer is not None  
                          and weight.producer.type == trt.LayerType.CONSTANT)  
    weight = weight.producer.weights if is_weight_constant else trt.Weights()  
  
    if bias is not None:  
        is_bias_constant = (bias.producer is not None  
                            and bias.producer.type == trt.LayerType.CONSTANT)  
        bias = bias.producer.weights if is_bias_constant else trt.Weights()  
  
    layer = default_trtnet().add_convolution_nd(input.trt_tensor, noutput,  
                                                kernel_size, weight, bias)  
    layer.stride_nd = stride  
    layer.padding_nd = padding  
    layer.dilation = dilation  
    layer.num_groups = groups  
  
    if not is_weight_constant:  
        layer.set_input(1, weight.trt_tensor)  
    if bias is not None and not is_bias_constant:  
        layer.set_input(2, bias.trt_tensor)  
  
    output = _create_tensor(layer.get_output(0), layer)  
  
    if ndim == 3:  
        return output.view(  
            concat([output.size(1),  
                    output.size(2),  
                    output.size(3)]))  
  
    return output

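As you can see, the core of conv2d is simply building the convolution layer with the TensorRT API.

At this point, consider: if a traced network could be rebuilt directly with this PyTorch-like TensorRT API and then emitted as code, wouldn't that amount to directly generating a network built with the TensorRT API?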

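Afterword

Of course, this is only meant to get the discussion started, and many details have not been covered. I have previously used TensorRT-like tools from other companies that, after converting a model, can directly generate a network file built with that inference backend's API (either C++ or Python). The weights and parameters are included as well, and for a quantized model the quantization parameters can go in too, so there is a lot you can do with it. With this approach, the network that the inference framework is about to optimize is visible at a glance, which makes modification and debugging much more convenient.

This is only a brief discussion. As for the implementation details, Lao Pan will keep writing follow-up articles, and if you have ideas, feel free to leave a comment~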


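III. TensorRT-LLM | A Dedicated Framework for Large-Model Deployment

TensorRT-LLM is a high-performance deep learning inference optimization library from NVIDIA, focused on improving the inference speed and efficiency of large language models (LLMs) on NVIDIA GPUs. If you cannot avoid NVIDIA chips, this inference library is well worth getting to know.

Project link: https://github.com/NVIDIA/TensorRT-LLM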

51c~TensorRT~合集1_视觉_33

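1. Advantages of TensorRT-LLM

TensorRT-LLM (TensorRT for Large Language Models) aims to remove the performance bottlenecks that large language models face in real applications. By providing a set of optimization tools and techniques designed specifically for LLM inference, TensorRT-LLM can significantly speed up inference, reduce latency, and optimize memory usage.

2. Core features of TensorRT-LLM

1) Easy-to-use Python API

TensorRT-LLM provides a concise Python API that lets users define large language models and build TensorRT engines containing state-of-the-art optimizations.
The API is designed to resemble PyTorch, so developers with PyTorch experience can migrate and integrate easily.

2) Model optimization

TensorRT-LLM supports multiple precision and quantization options (such as FP16 and INT8), so users can pick a configuration that balances performance and accuracy for their needs.
Through optimizations such as layer fusion, kernel selection, and precision tuning, TensorRT-LLM can significantly speed up model inference.

3) Memory management

TensorRT-LLM optimizes memory usage and lowers the memory footprint through smart memory allocation and paged attention.

4) Multi-threaded parallelism and hardware acceleration

Supports multi-threaded parallel processing to increase processing speed.
Makes full use of the compute capability of NVIDIA GPUs to accelerate model inference.

5) Dynamic batching

TensorRT-LLM supports dynamic batching, which optimizes text generation by processing multiple requests at the same time, reducing waiting time and improving GPU utilization.

6) Multi-GPU and multi-node inference

Supports distributed inference across multiple GPUs or multiple nodes, which increases throughput and reduces overall inference time.

7) FP8 support

With TensorRT-LLM, NVIDIA H100 GPUs can easily convert model weights to the new FP8 format and automatically compile models to use optimized FP8 kernels. This is enabled by the NVIDIA Hopper architecture and requires no changes to the model code.

8) Latest GPU support

TensorRT-LLM supports GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, NVIDIA Turing, and NVIDIA Volta architectures.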

3. Models supported by TensorRT-LLM

1) LLM family

51c~TensorRT~合集1_视觉_34

51c~TensorRT~合集1_视觉_35

2) Multimodal large models

51c~TensorRT~合集1_视觉_36

4. Quantization

INT8 SmoothQuant (W8A8)

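SmoothQuant is introduced in https://arxiv.org/abs/2211.10438. It is a method for running inference with INT8 for both activations and weights while preserving the network's accuracy (on downstream tasks). As the paper explains, the model weights must be preprocessed. TensorRT-LLM includes scripts that prepare a model to run with SmoothQuant.

Examples of enabling SmoothQuant for GPT, GPT-J, and LLaMA can be found in the examples/quantization folder of the release.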
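To make the weight preprocessing a bit more concrete, here is a minimal pure-Python sketch of the smoothing step from the SmoothQuant paper: a per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha) is divided out of the activations and multiplied into the weights, so the matmul result stays the same while activation outliers shrink. The helper names (`matmul`, `smooth_scales`) and the toy tensors are assumptions for illustration only, not TensorRT-LLM's scripts or API.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    scales = []
    for j in range(len(w)):  # j indexes the input channel (column of X, row of W)
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales


# Toy activations (2 tokens x 3 channels) with an outlier in channel 1.
X = [[0.5, 40.0, -1.0],
     [-0.25, -32.0, 2.0]]
# Toy weights (3 input channels x 2 output channels).
W = [[0.2, -0.4],
     [0.1, 0.3],
     [-0.5, 0.25]]

s = smooth_scales(X, W)
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(len(W))]

Y_ref = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)
max_diff = max(abs(a - b) for ra, rb in zip(Y_ref, Y_smooth) for a, b in zip(ra, rb))
act_range_before = max(abs(v) for row in X for v in row)
act_range_after = max(abs(v) for row in X_smooth for v in row)
print(max_diff, act_range_before, act_range_after)
```

The output is mathematically unchanged, but the activation range becomes much smaller, which is what makes the activations easier to quantize to INT8.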

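INT4 and INT8 weight-only quantization (W4A16 and W8A16)

INT4 and INT8 weight-only techniques quantize the model weights and dequantize them on the fly inside linear layers (Matmuls). Activations are encoded as floating point (FP16 or BF16). To use INT4/INT8 weight-only quantization, the user must determine the scaling factors used to quantize and dequantize the model weights.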
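As a rough illustration of what "determining the scaling factors" means, the sketch below does symmetric per-output-channel INT8 quantization of a small weight matrix and then dequantizes it again, the way a weight-only matmul would before multiplying. It is a pure-Python sketch under my own assumptions (symmetric scheme, one scale per row, hypothetical function names), not the TensorRT-LLM implementation.

```python
def quantize_per_channel_int8(weight):
    """Symmetric per-output-channel INT8 quantization: one scale per row."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover floating-point weights on the fly from int codes and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


W = [[0.12, -0.50, 0.33],
     [2.00, 0.01, -1.25]]
q, scales = quantize_per_channel_int8(W)
W_hat = dequantize(q, scales)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print(q, scales, max_err)
```

The rounding error of each weight is at most half of its row's scale, so rows with a small value range keep more precision than rows with a large one.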

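GPTQ and AWQ (W4A16)

GPTQ and AWQ are introduced in https://arxiv.org/abs/2210.17323 and https://arxiv.org/abs/2306.00978 respectively. TensorRT-LLM supports per-group scaling factors and zero offsets in linear layers to implement GPTQ and AWQ. For details, see the WeightOnlyGroupwiseQuantMatmulPlugin plugin and the corresponding weight_only_groupwise_quant_matmul Python function.

The code includes examples applying GPTQ to GPT-NeoX and LLaMA-v2, as well as an example using AWQ with GPT-J. These examples are experimental implementations and may be improved in future releases.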
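The "per-group scaling factors and zero offsets" can be pictured with the following pure-Python sketch: a weight row is cut into fixed-size groups, and each group gets its own scale and offset for 4-bit codes. This only shows the storage format idea; it is not how GPTQ or AWQ choose the quantized values, and the function names and the float-offset convention are assumptions of mine rather than the plugin's actual layout.

```python
def quantize_groupwise_int4(row, group_size):
    """Quantize a weight row to 4-bit codes with one (scale, offset) per group."""
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15.0 if hi > lo else 1.0
        codes = [max(0, min(15, round((v - lo) / scale))) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild the float row: value = code * scale + offset, group by group."""
    out = []
    for codes, scale, offset in groups:
        out.extend(c * scale + offset for c in codes)
    return out


# First group has a tiny value range, second group a wide one.
row = [0.10, 0.12, 0.08, 0.11, 3.0, -2.5, 1.0, 0.0]
groups = quantize_groupwise_int4(row, group_size=4)
restored = dequantize_groupwise(groups)
errors = [abs(a - b) for a, b in zip(row, restored)]
print(groups, max(errors[:4]), max(errors[4:]))
```

Because each group has its own scale, the narrow first group is reconstructed far more accurately than it would be with a single scale for the whole row.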

FP8 (Hopper)

TensorRT-LLM includes FP8 implementations for GPT-NeMo, GPT-J, and LLaMA. The examples can be found in examples/quantization.

5. Hardware and software supported by TensorRT-LLM

51c~TensorRT~合集1_视觉_37

51c~TensorRT~合集1_视觉_38

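6. Application scenarios of TensorRT-LLM

TensorRT-LLM has shown strong capability in many areas, including but not limited to:

Online customer service systems: real-time dialogue generation for seamless AI-assisted service.
Search engines: using models to enrich queries and return more accurate results.
Automatic code completion: integrating models in the IDE to help developers complete code.
Content creation platforms: automatically generating article summaries or suggestions to improve creators' productivity.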

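IV. FX2TRT

This one is the official torch-to-TensorRT path~~ before this, people mostly used wangxinyu's~

The torch-tensorrt repository moved under the PyTorch organization and was renamed pytorch/TensorRT
PyTorch moved fx2trt out of its main repository and into the pytorch/TensorRT repository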

51c~TensorRT~合集1_视觉_39

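The official FX2TRT User Guide (https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst)

The fx2trt code from the PyTorch repository has moved into pytorch/TensorRT and become one part of it: the FX Frontend. pytorch/TensorRT is the former Torch-TensorRT library, and things are now unified: besides converting TorchScript models to TensorRT, it can also convert FX models to TensorRT.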

51c~TensorRT~合集1_视觉_40

Pytorch/TensorRT

This library is different from NVIDIA's official TensorRT repository; it is PyTorch's own TensorRT repository (https://github.com/pytorch/TensorRT). Its short description reads:

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

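Its predecessor was TRTorch, also known as Torch-TensorRT, and I wrote an answer about it before (https://www.zhihu.com/question/436143525/answer/2267845251). Its main purpose is to bring TensorRT acceleration to TorchScript models seamlessly, using the TorchScript ecosystem, which is the closest to PyTorch, to speed up models. It makes full use of excellent tools such as TensorRT and TVM without splitting the model into several parts by hand, stitching everything together with the TorchScript runtime, which suits certain models well: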

51c~TensorRT~合集1_视觉_41

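That is not the focus of this article, though. What matters here is that fx2trt has moved into this repository; it looks like PyTorch wants to consolidate all of its TensorRT-related libraries in one place, which is a good thing. I only use fx2trt here, so running the following commands is enough: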

git clone https://github.com/pytorch/TensorRT.git
cd TensorRT/py
python3 setup.py install --fx-only

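I looked at the code structure of the FX part; it is basically unchanged, just pulled out on its own.

fx2trt exists to work with FX and convert an FX-traced model to TensorRT, in roughly four steps:

First, trace the model
Then split the traced model into parts that TensorRT supports and parts it does not
Convert the TensorRT-supported parts of the model to TensorRT
Finally obtain a new nn.Module in which each supported subgraph is an embedded TensorRT engine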
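The split step in the list above can be pictured with a toy sketch. Here the "traced model" is just a flat list of op names, and contiguous runs of supported ops are grouped so each run could become one engine, with the rest staying in eager PyTorch. Real FX graphs are DAGs and the real splitter is more involved; the op names, the `SUPPORTED_OPS` set, and `split_by_support` are all hypothetical.

```python
from itertools import groupby

# Hypothetical set of op names that the accelerator backend can convert.
SUPPORTED_OPS = {"conv2d", "relu", "add", "linear"}


def split_by_support(op_names):
    """Group a linear op sequence into contiguous runs of (is_supported, ops)."""
    return [
        (supported, list(run))
        for supported, run in groupby(op_names, key=lambda op: op in SUPPORTED_OPS)
    ]


traced_ops = ["conv2d", "relu", "custom_nms", "linear", "add"]
for supported, ops in split_by_support(traced_ops):
    backend = "trt_engine" if supported else "pytorch_eager"
    print(backend, ops)
```

In this toy case one unsupported op in the middle forces two separate engines, which is why a single unsupported op can matter a lot for end-to-end speed.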

An example

Let's take a quick look at the official sample code: TensorRT/examples/fx/lower_example.py contains a resnet18 example. First get the resnet18 model, nothing special here:

model = torchvision.models.resnet18(pretrained=True)

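Then the model is compiled with the compile function. Internally, compile just calls a Lowerer class, which creates the fx2trt pipeline according to the config. Later, torch_tensorrt will unify this interface and compile FX and TS (TorchScript) models separately, but here I only cover FX: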

# Here module is the nn.Module from torchvision.models.resnet18(pretrained=True)
lowered_module = compile(
    module,
    input,  # input = [torch.rand(128, 3, 224, 224)]
    max_batch_size=conf.batch_size,
    lower_precision=LowerPrecision.FP16 if conf.fp16 else LowerPrecision.FP32,
)

# compile calls Lowerer, a helper class that builds the fx2trt pipeline
def compile(
    module: nn.Module,
    input,
    max_batch_size: int = 2048,
    max_workspace_size=1 << 25,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
    is_aten=False,
) -> nn.Module:
    lower_setting = LowerSetting(
        max_batch_size=max_batch_size,
        max_workspace_size=max_workspace_size,
        explicit_batch_dimension=explicit_batch_dimension,
        lower_precision=lower_precision,
        verbose_log=verbose_log,
        timing_cache_prefix=timing_cache_prefix,
        save_timing_cache=save_timing_cache,
        cuda_graph_batch_size=cuda_graph_batch_size,
        dynamic_batch=dynamic_batch,
        is_aten=is_aten,
    )
    lowerer = Lowerer.create(lower_setting=lower_setting)
    return lowerer(module, input)

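When Lowerer.create is called, it builds the pipeline from the lower_setting passed in. The parameters are easy to understand:

For example the conversion precision, FP16 or FP32
Sample inputs used for tracing and for later testing
And other common TensorRT parameters, such as the workspace size

The pipeline lives in the pass manager. As the previous article said, FX is an AI compiler, and compilers have the concept of a pass, which stands for an optimization applied to code. Passes in FX are the same, except that they are optimizations applied to the model. Roughly, they are:

# The passes in the pipeline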
def build_trt_lower_pipeline(
        self, input: Input, additional_input: Optional[Input] = None
    ) -> PassManager:
        self._input = input
        self._additional_input = additional_input
        passes = []

        passes.append(self._default_replace_mutable_op_pass())
        passes.append(self._const_fold_pass())
        passes.append(self.graph_optimization_pass())
        passes.append(self._split_pass())
        passes.append(self._trt_lower_pass())

        pm = PassManager.build_from_passlist(passes)
        return pm

These passes are just FX transforms, which the previous article also mentioned:

Your transform will take in an torch.nn.Module, acquire a Graph from it, do some modifications, and return a new torch.nn.Module. You should think of the torch.nn.Module that your FX transform returns as identical to a regular torch.nn.Module – you can pass it to another FX transform, you can pass it to TorchScript, or you can run it. Ensuring that the inputs and outputs of your FX transform are a torch.nn.Module will allow for composability.
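The composability described in the quote is what makes a pass manager possible: every pass takes a graph and returns a graph, so passes can be chained. Below is a small pure-Python sketch of that idea, with a "graph" reduced to a list of op names. The two toy passes and `build_pipeline` are hypothetical stand-ins, not the real PassManager.

```python
from functools import reduce


def remove_identity(graph):
    """Toy pass: drop no-op 'identity' nodes."""
    return [op for op in graph if op != "identity"]


def fuse_conv_relu(graph):
    """Toy pass: merge each 'conv' immediately followed by 'relu' into 'conv_relu'."""
    fused, i = [], 0
    while i < len(graph):
        if graph[i] == "conv" and i + 1 < len(graph) and graph[i + 1] == "relu":
            fused.append("conv_relu")
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused


def build_pipeline(passes):
    """Compose passes left to right; each pass maps a graph to a new graph."""
    return lambda graph: reduce(lambda g, p: p(g), passes, graph)


pipeline = build_pipeline([remove_identity, fuse_conv_relu])
print(pipeline(["conv", "identity", "relu", "conv", "add"]))
```

Pass order matters: if the fusion pass ran first here, the identity node would still sit between conv and relu and nothing would be fused.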

Take the replace_mutable_op function: it modifies the input torch.fx.GraphModule, calls recompile() afterwards to rebuild the GraphModule, and returns a torch.fx.GraphModule:

def replace_mutable_op(module: torch.fx.GraphModule) -> torch.fx.GraphModule:
    if not isinstance(module, torch.fx.GraphModule):
        return module

    # Before any lowering pass, replace mutable ops like torch.fill_
    # Because fx cannot deal with inplace ops
    for n in module.graph.nodes:
        # TODO: add more mutable ops
        if (n.op == "call_method" and n.target == "fill_") or (
            n.op == "call_function" and n.target == torch.fill_
        ):
            # Replace mutable op only if the modified variable
            # is used by the rest of the graph
            # only through this op
            if set(n.args[0].users.keys()) == {n}:
                with module.graph.inserting_after(n):

                    # TODO: move this outside?
                    def fill_with_mul_zero_and_add(*args):
                        return args[0].mul(0.0).add(args[1])

                    new_node = module.graph.create_node(
                        "call_function", fill_with_mul_zero_and_add, args=n.args
                    )
                    n.replace_all_uses_with(new_node)
                    module.graph.erase_node(n)
    module.recompile()
    return module

In short, the compiled model already contains the TensorRT engine and can be run and benchmarked directly:

lowered_module = compile(
    module,
    input,
    max_batch_size=conf.batch_size,
    lower_precision=LowerPrecision.FP16 if conf.fp16 else LowerPrecision.FP32,
)
time = benchmark_torch_function(conf.batch_iter, lambda: lowered_module(*input))

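The benchmark result is as expected: the TensorRT model is much faster than the original PyTorch model, especially in FP16, where a small model like resnet18 gets more than 4x the QPS: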

== Start benchmark iterations
== End benchmark iterations
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='CUDA Eager', trt=False, jit=False, fp16=False, accuracy_rtol=-1)
BS: 128, Time per iter: 31.35ms, QPS: 4082.42, Accuracy: None (rtol=-1)
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='TRT FP32 Eager', trt=True, jit=False, fp16=False, accuracy_rtol=0.001)
BS: 128, Time per iter: 21.53ms, QPS: 5944.90, Accuracy: None (rtol=0.001)
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='TRT FP16 Eager', trt=True, jit=False, fp16=True, accuracy_rtol=0.01)
BS: 128, Time per iter: 7.09ms, QPS: 18056.38, Accuracy: None (rtol=0.01)
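As a quick sanity check on the numbers in the log, QPS is just batch size divided by time per iteration. The short calculation below recomputes QPS and the speedup over eager mode from the logged times; the results differ slightly from the logged QPS because the logged times are rounded to two decimals.

```python
batch_size = 128
# Time per iteration in milliseconds, copied from the benchmark log above.
time_ms = {"CUDA Eager": 31.35, "TRT FP32 Eager": 21.53, "TRT FP16 Eager": 7.09}

# QPS = samples per iteration / seconds per iteration.
qps = {name: batch_size / (ms / 1000.0) for name, ms in time_ms.items()}
baseline = qps["CUDA Eager"]
speedup = {name: value / baseline for name, value in qps.items()}

for name in time_ms:
    print(f"{name}: {qps[name]:.0f} QPS, {speedup[name]:.2f}x")
```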

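Runtime environment

That was a brief introduction to Torch-TensorRT; now for the main part. It has been a long time since the first FX article, and quite a while since the second (yes, I procrastinate a lot), so when writing this third one (2022-10-29) I refreshed the FX test environment to keep the content accurate. Everything is the latest: torch, torchvision, and torch-tensorrt, all built from source by hand: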

torch                   1.14.0a0+gita0c2a7f /root/code/pytorch                                                        
torch-tensorrt          1.3.0a0+5a7ac8f3    
torch-tensorrt-fx2trt   0.1                 /usr/local/lib/python3.8/dist-packages/torch_tensorrt_fx2trt-0.1-py3.8.egg
torchvision             0.14.0a0+d0d7058    /root/code/vision

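FX is updated quickly, yet as of version 1.14 it is still in beta. The good news is that after moving to the latest environment, the earlier code ran with only tiny changes (no more than 2 lines). This suggests FX's backward compatibility is quite good, so you can use it with confidence.

Test model

I could no longer find the earlier model, so I needed another one to compare the accuracy of FP32 (PyTorch) against INT8 after quantization (PyTorch FX and TensorRT).

When I ran fx2trt last year, I used the resnet50 version of CenterNet and modified the upsample layers at the end of CenterNet so that their input and output channel counts are the same: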

# in_channels and out_channels must be equal
nn.ConvTranspose2d(
    in_channels=planes,
    out_channels=planes,
    kernel_size=kernel,
    stride=2,
    padding=padding,
    output_padding=output_padding,
    bias=self.deconv_with_bias)

# groups must be 1
up = nn.ConvTranspose2d(
    out_dim, out_dim, f * 2, stride=f, padding=f // 2,
    output_padding=0, groups=1, bias=False)

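Why do this? Because TensorRT has bugs when quantizing deconvolutions: a deconvolution must satisfy certain conditions to be parsed correctly (without quantization there is no problem). According to the issue feedback, most deconvolution problems should be resolved around version 8.5 (deconvolution really has a lot of problems). Related issue links:

So there was no way around training a model myself. I use CenterNet with a resnet50 backbone; apart from changing the channel counts of the deconvolutions at the end of the model, everything matches the official version. I trained a two-class model on my own dataset that detects people and hands.

FX2TRT

With the model ready, on to the main topic!

As mentioned above, the new FX interface has changed slightly. The backend config argument of prepare_fx from the previous article is now named backend_config, and the convert function is wrapped in a new function, convert_to_reference_fx, which moves the is_reference parameter inside, so convert_fx is no longer used: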

def convert_to_reference_fx(
    graph_module: GraphModule,
    convert_custom_config: Union[ConvertCustomConfig, Dict[str, Any], None] = None,
    _remove_qconfig: bool = True,
    qconfig_mapping: Union[QConfigMapping, Dict[str, Any], None] = None,
    backend_config: Union[BackendConfig, Dict[str, Any], None] = None,
) -> torch.nn.Module:
    torch._C._log_api_usage_once("quantization_api.quantize_fx.convert_to_reference_fx")
    return _convert_fx(
        graph_module,
        is_reference=True,
        convert_custom_config=convert_custom_config,
        _remove_qconfig=_remove_qconfig,
        qconfig_mapping=qconfig_mapping,
        backend_config=backend_config,
    )

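Nothing else has changed.

After passing the model through prepare_fx and convert_to_reference_fx, we get the final reference quantized model. The model produced by convert_to_reference_fx is really simulated quantization: it contains no INT8 operators, only Q and DQ operations plus regular FP32 operators, together with the calibrated scale and offset used to simulate the model's quantization error. When actually executed, the model looks like this: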

def forward(self, input):
    input_1 = input
    # First fetch the quantization parameters scale and zero-point
    backbone_conv1_input_scale_0 = self.backbone_conv1_input_scale_0
    backbone_conv1_input_zero_point_0 = self.backbone_conv1_input_zero_point_0
    # Then quantize the input
    quantize_per_tensor = torch.quantize_per_tensor(input_1, backbone_conv1_input_scale_0, backbone_conv1_input_zero_point_0, torch.qint8);  
    input_1 = backbone_conv1_input_scale_0 = backbone_conv1_input_zero_point_0 = None
    # Then dequantize the input
    dequantize = quantize_per_tensor.dequantize();  quantize_per_tensor = None
    # The input that the FP32 operator actually receives is the dequantized one
    backbone_conv1 = self.backbone.conv1(dequantize);  dequantize = None
    ...
    dequantize_80 = quantize_per_tensor_83.dequantize();  quantize_per_tensor_83 = None
    head_angle_2 = getattr(self.head.angle, "2")(dequantize_80);  dequantize_80 = None
    head_angle_2_output_scale_0 = self.head_angle_2_output_scale_0
    head_angle_2_output_zero_point_0 = self.head_angle_2_output_zero_point_0
    quantize_per_tensor_84 = torch.quantize_per_tensor(head_angle_2, head_angle_2_output_scale_0, head_angle_2_output_zero_point_0, torch.qint8);  head_angle_2 = head_angle_2_output_scale_0 = head_angle_2_output_zero_point_0 = None
    dequantize_81 = quantize_per_tensor_78.dequantize();  quantize_per_tensor_78 = None
    dequantize_82 = quantize_per_tensor_80.dequantize();  quantize_per_tensor_80 = None
    dequantize_83 = quantize_per_tensor_82.dequantize();  quantize_per_tensor_82 = None
    dequantize_84 = quantize_per_tensor_84.dequantize();  quantize_per_tensor_84 = None
    return {'hm': dequantize_81, 'wh': dequantize_82, 'reg': dequantize_83, 'angle': dequantize_84}

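This model is a GraphModule, which is similar to nn.Module and has its own forward function. We can run it directly in PyTorch to test accuracy, but note that this only tests the accuracy of the simulated quantized model, that is, whether the calibrated scale and offset are sound. After conversion to TensorRT the accuracy may differ slightly, since we do not know the operator implementation details inside the actual inference framework. A quick look at the structure diagram of the model above: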
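The quantize-then-dequantize pair in the forward function above is easy to reproduce by hand. The pure-Python sketch below applies q = clamp(round(x / scale) + zero_point) followed by (q - zero_point) * scale to a few values, using a qint8 range of [-128, 127]. The scale, zero point, and inputs are made-up numbers, and the helper functions are my own stand-ins rather than the torch operators.

```python
def quantize_per_tensor(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped int8 codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to floats: x_hat = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


scale, zero_point = 0.05, 0          # hypothetical calibrated parameters
x = [0.02, -1.234, 3.1415, 10.0]     # 10.0 lies outside the representable range
codes = quantize_per_tensor(x, scale, zero_point)
x_hat = dequantize(codes, scale, zero_point)
print(codes, x_hat)
```

Values inside the representable range come back with an error of at most half a scale step, while the out-of-range value is clipped. That clipping and rounding is exactly the quantization error the reference model simulates in FP32.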

51c~TensorRT~合集1_视觉_42

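Here backbone_conv1_input_scale_0 and backbone_conv1_input_zero_point_0 are the scale and offset learned during calibration. Weight layers need no calibration; their parameters can be computed directly (see the previous article for details), so I won't repeat that here.

I compared the accuracy of the quantized FX model (sim-INT8) against the original FX model (FP32). CenterNet has three outputs: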

51c~TensorRT~合集1_视觉_43

So I ran a simple accuracy check on all three outputs:

original_fx_model.cuda()
res_fp32 = original_fx_model(data)
res_int8 = quantized_fx(data)
for i in range(len(res_fp32)):
    print(torch.max(torch.abs(res_fp32[i] -  res_int8[i])))

Quick and dirty. The gap looks rather large: the maximum error for wh is as high as 26:

tensor(1.5916, device='cuda:0', grad_fn=<MaxBackward1>)
tensor(26.1865, device='cuda:0', grad_fn=<MaxBackward1>)
tensor(0.1195, device='cuda:0', grad_fn=<MaxBackward1>)

However, if we compute the cosine similarity of each output, every one is close to 1:

torch_cosine_similarity:  tensor(1.0000)
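A large maximum absolute error and a cosine similarity near 1 are not contradictory: the first looks at the single worst element, the second at the direction of the whole tensor. The toy sketch below uses invented values where one element is off by 26 to show both metrics side by side; it is not the author's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_error(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical outputs: large-magnitude values where one element is off by 26.
fp32_out = [500.0, 800.0, 1200.0, 950.0]
int8_out = [500.0, 800.0, 1226.0, 950.0]

print(max_abs_error(fp32_out, int8_out))      # absolute gap on the worst element
print(cosine_similarity(fp32_out, int8_out))  # direction of the whole vector
```

When the output values themselves are large, as size regressions like wh tend to be, an absolute error in the tens can still be a small relative error, so neither metric alone says whether mAP will drop.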

Take a guess: did the final mAP drop?

acc_tracer

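Next, acc_tracer is needed to convert the reference model into the acc version of the model.

Acc Tracer is inherited from FX symbolic tracer. Performs tracing and arg normalization specialized for accelerator lowering.

The main job of acc is to convert the reference-version PyTorch ops into the corresponding acc-ops; the steps it performs are listed further below, after the TRTInterpreter setup code.

At run time, the job of TRTInterpreter is to traverse the nodes in the graph and convert them one by one with the registered converters. This part is rather neat: TRTInterpreter inherits from torch.fx.Interpreter and overrides its node-handling methods such as call_module, call_function, and call_method.

The acc_op_map code mainly lives in TensorRT/py/torch_tensorrt/fx/tracer/acc_tracer/acc_ops.py. Here is a small piece of it: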

@register_acc_op_properties(AccOpProperty.pointwise, AccOpProperty.unary)
@register_acc_op_mapping(op_and_target=("call_function", nn.functional.relu))
@register_acc_op_mapping(
    op_and_target=("call_function", torch.relu),
    arg_replacement_tuples=[("input", "input")],
)
@register_acc_op_mapping(
    op_and_target=("call_method", "relu"),
    arg_replacement_tuples=[("input", "input")],
)
@register_acc_op
def relu(*, input, inplace=False):
    return nn.functional.relu(input=input, inplace=inplace)

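As you can see, the three forms nn.functional.relu, torch.relu, and the call_method relu all end up converted to acc_ops.relu.

Without this, we might need three separate converters for the three cases, which would be tedious and make the code redundant.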
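The benefit of normalizing first can be shown with a tiny registry sketch: several spellings of the same op are mapped to one canonical name, and only that name needs a converter. Everything here (the dictionaries, the decorator names, string targets instead of real function objects) is a simplified stand-in I made up, not the torch_tensorrt registration code.

```python
# Maps (node kind, original target) -> canonical op name.
ACC_OP_MAPPING = {}
# Maps canonical op name -> converter function.
CONVERTERS = {}


def register_mapping(*op_and_targets):
    """Decorator: register several (kind, target) spellings for one canonical op."""
    def decorator(fn):
        for key in op_and_targets:
            ACC_OP_MAPPING[key] = fn.__name__
        return fn
    return decorator


def register_converter(op_name):
    """Decorator: register the single converter for a canonical op."""
    def decorator(fn):
        CONVERTERS[op_name] = fn
        return fn
    return decorator


@register_mapping(
    ("call_function", "torch.relu"),
    ("call_function", "torch.nn.functional.relu"),
    ("call_method", "relu"),
)
def relu(*, input):
    return max(input, 0.0)


@register_converter("relu")
def convert_relu(node_name):
    return f"{node_name}: backend ReLU layer"


def convert(kind, target, node_name):
    """Normalize the spelling first, then look up the one shared converter."""
    canonical = ACC_OP_MAPPING[(kind, target)]
    return CONVERTERS[canonical](node_name)


print(convert("call_method", "relu", "relu_48"))
```

Three source spellings, one converter: adding a fourth spelling later only needs one more mapping entry, not another converter.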

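Once we have the acc version of the model, each acc-op has to be converted to TensorRT one by one. That concludes the trace step (the acc_trace process actually has many details that are omitted here for space; I may cover them separately later).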

TRTInterpreter

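TRTInterpreter inherits from torch.fx.Interpreter.

An Interpreter executes an FX graph Node-by-Node. This pattern can be useful for many things, including writing code transformations as well as analysis passes.

Interpreter was also introduced in the first article. An interpreter loops over the nodes of a Graph and executes them in a fairly elegant way, getting some extra work done along the way. Many features can be built on it, such as replacing an operation in a model or profiling model performance. Here we use TRTInterpreter to convert acc-ops to TensorRT ops. First initialize the interpreter object with the usual parameters; I am converting with dynamic shapes here, so I specify the min, opt, and max sizes and set explicit_batch_dimension to True: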

interp = TRTInterpreter(
    quantized_fx,
    [InputTensorSpec(torch.Size([1,3,-1,-1]), torch.float,
                    shape_ranges=[((1, 3, 128, 128), (1, 3, 768, 768), (1, 3, 1024, 1024))], has_batch_dim=True)],
    explicit_batch_dimension=True, explicit_precision=True,
    logger_level=trt.Logger.VERBOSE
    )

Then it can be run. When calling run, pass the target precision and the workspace size:

res = interp.run(lower_precision=LowerPrecision.INT8, strict_type_constraints=True, max_workspace_size=4096000000)

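Coming back to acc_tracer, these are the steps it performs:

First, convert all untraceable parts of the model being traced into traceable ones
Then remove all assertion and exception wrappers
Clean up the model and remove dead code
Normalize the args/kwargs of every node in the graph, moving eligible args into kwargs and making default values explicit

Model graph before tracing: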

graph():
    %input_1 : [#users=1] = placeholder[target=input]
    %backbone_base_base_layer_0_input_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_scale_0]
    %backbone_base_base_layer_0_input_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%input_1, %backbone_base_base_layer_0_input_scale_0, %backbone_base_base_layer_0_input_zero_point_0, torch.qint8), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%quantize_per_tensor,), kwargs = {})
    %backbone_base_base_layer_0_0_weight : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight]
    %backbone_base_base_layer_0_0_weight_scale : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_scale]
    %backbone_base_base_layer_0_0_weight_zero_point : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_zero_point]
    %quantize_per_channel : [#users=1] = call_function[target=torch.quantize_per_channel](args = (%backbone_base_base_layer_0_0_weight, %backbone_base_base_layer_0_0_weight_scale, %backbone_base_base_layer_0_0_weight_zero_point, 0, torch.qint8), kwargs = {})
    %dequantize_1 : [#users=1] = call_method[target=dequantize](args = (%quantize_per_channel,), kwargs = {})
    %backbone_base_base_layer_0_0_bias : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.bias]
    %conv2d : [#users=1] = call_function[target=torch.conv2d](args = (%dequantize, %dequantize_1, %backbone_base_base_layer_0_0_bias, (1, 1), (3, 3), (1, 1), 1), kwargs = {})
    %relu : [#users=1] = call_function[target=torch.nn.functional.relu](args = (%conv2d,), kwargs = {inplace: True})
    %backbone_base_base_layer_0_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_scale_0]
    %backbone_base_base_layer_0_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_zero_point_0]
    %quantize_per_tensor_1 : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%relu, %backbone_base_base_layer_0_scale_0, %backbone_base_base_layer_0_zero_point_0, torch.qint8), kwargs = {})
 ...

Model graph after tracing:

graph():
    %input_1 : [#users=1] = placeholder[target=input]
    %backbone_base_base_layer_0_input_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_scale_0]
    %backbone_base_base_layer_0_input_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_zero_point_0]
    %quantize_per_tensor_92 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %input_1, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_input_scale_0, zero_point: %backbone_base_base_layer_0_input_zero_point_0})})
    %dequantize_153 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize](args = (), kwargs = {input: %quantize_per_tensor_92})
    %backbone_base_base_layer_0_0_weight : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight]
    %backbone_base_base_layer_0_0_weight_scale : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_scale]
    %backbone_base_base_layer_0_0_weight_zero_point : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_zero_point]
    %quantize_per_channel_61 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_channel](args = (), kwargs = {input: %backbone_base_base_layer_0_0_weight, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_0_weight_scale, zero_point: %backbone_base_base_layer_0_0_weight_zero_point, axis: 0})})
    %dequantize_154 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize](args = (), kwargs = {input: %quantize_per_channel_61})
    %backbone_base_base_layer_0_0_bias : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.bias]
    %conv2d_55 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.conv2d](args = (), kwargs = {input: %dequantize_153, weight: %dequantize_154, bias: %backbone_base_base_layer_0_0_bias, stride: (1, 1), padding: (3, 3), dilation: (1, 1), groups: 1})
    %relu_48 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.relu](args = (), kwargs = {input: %conv2d_55, inplace: True})
    %backbone_base_base_layer_0_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_scale_0]
    %backbone_base_base_layer_0_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_zero_point_0]
    %quantize_per_tensor_93 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %relu_48, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_scale_0, zero_point: %backbone_base_base_layer_0_zero_point_0})})

As you can see, the original dequantize has been converted to torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize. Why do this? There are two reasons:

Ops with the same functionality (PyTorch ops and builtin ops), e.g. torch.add, builtin.add, and torch.Tensor.add, can all be converted to acc.add
Move args and kwargs into kwargs only for converting simplicity
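The second point, moving everything into kwargs, is what turns the positional conv2d call in the "before" graph into the fully named one in the "after" graph. Python's standard inspect module can do the same kind of normalization, as in this small sketch. The `conv2d` stand-in and `normalize_to_kwargs` are hypothetical helpers for illustration; acc_tracer has its own normalization logic.

```python
import inspect


def conv2d(input, weight, bias=None, stride=(1, 1), padding=(0, 0),
           dilation=(1, 1), groups=1):
    """Stand-in with a conv2d-like signature; the body is irrelevant here."""


def normalize_to_kwargs(fn, args, kwargs):
    """Bind positional args to parameter names and make defaults explicit."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


# Five positional args, as in a partially positional call; the rest use defaults.
normalized = normalize_to_kwargs(conv2d, ("x", "w", "b", (1, 1), (3, 3)), {})
print(normalized)
```

With every argument named and every default filled in, a converter can read kwargs["stride"] or kwargs["groups"] without caring how the original call was written.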

51c~TensorRT~合集1_视觉_44

As for the run function, the node traversal itself happens in the parent class Interpreter:

# torch/fx/interpreter.py
for node in self.module.graph.nodes:
    if node in self.env:
        # Short circuit if we have this value. This could
        # be used, for example, for partial evaluation
        # where the caller has pre-populated `env` with
        # values for a subset of the program.
        continue

    try:
        self.env[node] = self.run_node(node)
    except Exception as e:
        msg = f"While executing {node.format_node()}"
        msg = '{}\n\n{}'.format(e.args[0], msg) if e.args else str(msg)
        msg += f"\nOriginal traceback:\n{node.stack_trace}"
        e.args = (msg,) + e.args[1:]
        if isinstance(e, KeyError):
            raise RuntimeError(*e.args)
        raise

    if self.garbage_collect_values:
        for to_delete in self.user_to_last_uses.get(node, []):
            del self.env[to_delete]

    if node.op == 'output':
        output_val = self.env[node]
        return self.module.graph.process_outputs(output_val) if enable_io_processing else output_val

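However, because run_node is overridden, the method in the subclass TRTInterpreter is called (we can later write our own interpreter this way to implement custom behavior). In the end a different handler is called depending on the node type, e.g. call_module, call_function, and call_method, which correspond to three kinds of nodes in the FX IR. Each handler looks up the conversion op in CONVERTERS: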

def call_module(self, target, args, kwargs):
    assert isinstance(target, str)
    submod = self.fetch_attr(target)
    submod_type = getattr(submod, "_base_class_origin", type(submod))
    converter = CONVERTERS.get(submod_type)

    if not converter:
        raise RuntimeError(
            f"Conversion of module of type {submod_type} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, submod, args, kwargs, self._cur_node_name)

def call_function(self, target, args, kwargs):
    converter = CONVERTERS.get(target)
    if not converter:
        raise RuntimeError(
            f"Conversion of function {torch.typename(target)} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, target, args, kwargs, self._cur_node_name)

def call_method(self, target, args, kwargs):
    assert isinstance(target, str)
    converter = CONVERTERS.get(target)

    if not converter:
        raise RuntimeError(
            f"Conversion of method {target} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, target, args, kwargs, self._cur_node_name)
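The pattern in the handlers above, where a base class walks the nodes and a subclass overrides the per-node handlers, can be boiled down to a few lines of plain Python. In this sketch the base interpreter executes a tiny graph, while the subclass reuses the same traversal but emits a description instead of computing values, which is roughly the trick TRTInterpreter plays when it builds TensorRT layers. The classes, the tuple-based node format, and the helper functions are all hypothetical simplifications of torch.fx.

```python
class Interpreter:
    """Walks nodes in order and dispatches each one to a method named after its op."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.env = {}

    def run(self):
        result = None
        for name, op, target, arg_names in self.nodes:
            args = [self.env[a] for a in arg_names]
            result = self.env[name] = getattr(self, op)(target, args)
        return result

    def placeholder(self, target, args):
        return target

    def call_function(self, target, args):
        return target(*args)


class TracingInterpreter(Interpreter):
    """Same traversal, but handlers emit a description instead of executing."""

    def placeholder(self, target, args):
        return "input"

    def call_function(self, target, args):
        return f"{target.__name__}({', '.join(args)})"


def double(x):
    return 2 * x


def increment(x):
    return x + 1


# Each node: (name, op kind, target, names of input nodes).
graph = [
    ("x", "placeholder", 5, []),
    ("d", "call_function", double, ["x"]),
    ("out", "call_function", increment, ["d"]),
]
print(Interpreter(graph).run())
print(TracingInterpreter(graph).run())
```

The traversal loop is written once in the base class; changing what "running a node" means only requires overriding the handlers.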

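The converters are registered in TensorRT/py/torch_tensorrt/fx/converters/acc_ops_converters.py. Taking convolution as an example: each acc-op has a corresponding converter, and each converter function calls the TensorRT API to build the network: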

@tensorrt_converter(acc_ops.conv3d)
@tensorrt_converter(acc_ops.conv2d)
def acc_ops_convnd(
    network: TRTNetwork,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    input_val = kwargs["input"]

    if not isinstance(input_val, TRTTensor):
        raise RuntimeError(
            f"Conv received input {input_val} that is not part "
            "of the TensorRT region!"
        )

    if has_dynamic_shape(input_val.shape):
        assert input_val.shape[1] != -1, "Channel dim can't be dynamic for convolution."

    # for now we'll assume bias is constant Tensor or None,
    # and bias being ITensor is not supported in TensorRT api
    # right now
    if kwargs["bias"] is not None and not isinstance(kwargs["bias"], torch.Tensor):
        raise RuntimeError(
            f"linear {name} has bias of type {type(kwargs['bias'])}, Expect Optional[Tenosr]"
        )
    bias = to_numpy(kwargs["bias"])  # type: ignore[arg-type]

    if network.has_explicit_precision:
        weight = get_trt_tensor(network, kwargs["weight"], f"{name}_weight")
        weight_shape = tuple(kwargs["weight"].shape)  # type: ignore[union-attr]
        # will need to use uninitialized weight and set it later to support
        # ITensor weights
        dummy_weight = trt.Weights()
        layer = network.add_convolution_nd(
            input=input_val,
            num_output_maps=weight.shape[0],
            kernel_shape=weight.shape[2:],
            kernel=dummy_weight,
            bias=bias,
        )

        layer.set_input(1, weight)
    else:
        if not isinstance(kwargs["weight"], torch.Tensor):
            raise RuntimeError(
                f"linear {name} has weight of type {type(kwargs['weight'])}, Expect Optional[Tenosr]"
            )
        weight = to_numpy(kwargs["weight"])
        layer = network.add_convolution_nd(
            input=input_val,
            num_output_maps=weight.shape[0],
            kernel_shape=weight.shape[2:],
            kernel=weight,
            bias=bias,
        )

    set_layer_name(layer, target, name)
    layer.stride_nd = kwargs["stride"]
    layer.padding_nd = kwargs["padding"]
    layer.dilation_nd = kwargs["dilation"]
    if kwargs["groups"] is not None:
        layer.num_groups = kwargs["groups"]

    return layer.get_output(0)

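Once the network is built, set a few build parameters and the engine can be built.

engine = self.builder.build_engine(self.network, builder_config)

After the build, pass the engine to TRTModule, and trt_mod can be called directly to verify accuracy.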

engine, input_names, output_names = res.engine, res.input_names, res.output_names
trt_mod = TRTModule(engine, input_names, output_names)

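I verified the accuracy of this model. There are two classes and a little over 40k training images; calibration used 512 images, the score threshold for evaluation is 0.1, and the NMS threshold is 0.2.

Metrics before quantization: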

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 62.745 | 95.430 | 76.175 | 54.004 | 66.575 | 63.692 |

Metrics after quantization:

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 60.340 | 95.410 | 70.561 | 50.154 | 64.969 | 62.009 |

Metrics after quantization and conversion to TensorRT:

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 60.355 | 95.404 | 70.412 | 50.615 | 64.763 | 61.322 |

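Well, AP dropped by about 2 points, but AP50 barely dropped, so not bad. Now for speed: on a 3080 GPU one frame takes 3.8 ms, which looks somewhat faster than the 4.8 ms of FP16, but still not fast enough.

I also briefly ran TensorRT's implicit quantization (implicit mode): convert the CenterNet model to ONNX first, then force INT8 with trtexec (accuracy is ignored here and no calibration images are passed; this is only to measure INT8 speed). The result takes only 3.1 ms.

That is a sizable gap. Most likely some layers were not fully optimized when FX converted the model to TensorRT. So let's compare the layer structure of the two engines, starting with the implicit-mode engine: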

[03/07/2022-11:34:20] [I]                                      Conv_101 + Add_103 + Relu_104       16.09           0.0215      0.7
[03/07/2022-11:34:20] [I]                                                Conv_105 + Relu_106       14.89           0.0199      0.6
[03/07/2022-11:34:20] [I]                                                Conv_107 + Relu_108       20.96           0.0280      0.9
[03/07/2022-11:34:20] [I]                                      Conv_109 + Add_110 + Relu_111       15.18           0.0203      0.6
[03/07/2022-11:34:20] [I]                                                Conv_112 + Relu_113       14.31           0.0191      0.6
[03/07/2022-11:34:20] [I]                                                Conv_114 + Relu_115       20.82           0.0278      0.9
[03/07/2022-11:34:20] [I]                                      Conv_116 + Add_117 + Relu_118       15.16           0.0202      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_119       40.61           0.0542      1.7
[03/07/2022-11:34:20] [I]              ConvTranspose_120 + BatchNormalization_121 + Relu_122       31.20           0.0416      1.3
[03/07/2022-11:34:20] [I]              ConvTranspose_123 + BatchNormalization_124 + Relu_125      110.56           0.1476      4.7
[03/07/2022-11:34:20] [I]              ConvTranspose_126 + BatchNormalization_127 + Relu_128      509.55           0.6803     21.7
[03/07/2022-11:34:20] [I]  Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136      197.13           0.2632      8.4
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_131       13.22           0.0177      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_131       12.35           0.0165      0.5
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_134       13.12           0.0175      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_134       12.14           0.0162      0.5
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_137       13.07           0.0175      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_137       11.99           0.0160      0.5
[03/07/2022-11:34:20] [I]                                                              Total     2352.92           3.1414    100.0

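As you can see, everything that should be fused has been fused. In particular, Conv_116 + Add_117 + Relu_118, ConvTranspose_120 + BatchNormalization_121 + Relu_122, and Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136 are all fusions that bring a large speedup. The figure below was drawn from the log produced when the TRT engine was built: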

51c~TensorRT~合集1_视觉_45

Now look at the network structure of the TRT model that was just converted through FX:

[03/03/2022-14:46:31] [I]                                                                                                   add_29 + relu_97        8.90           0.0137      0.4
[03/03/2022-14:46:31] [I]   quantize_per_channel_110_input + (Unnamed Layer* 592) [Constant]_output_per_channel_quant + conv2d_107 + relu_98       12.88           0.0199      0.5
[03/03/2022-14:46:31] [I]   quantize_per_channel_111_input + (Unnamed Layer* 603) [Constant]_output_per_channel_quant + conv2d_108 + relu_99       19.11           0.0295      0.8
[03/03/2022-14:46:31] [I]             quantize_per_channel_112_input + (Unnamed Layer* 614) [Constant]_output_per_channel_quant + conv2d_109       12.09           0.0187      0.5
[03/03/2022-14:46:31] [I]                                                                                                  add_30 + relu_100        8.84           0.0136      0.4
[03/03/2022-14:46:31] [I]  quantize_per_channel_113_input + (Unnamed Layer* 630) [Constant]_output_per_channel_quant + conv2d_110 + relu_101       12.61           0.0195      0.5
[03/03/2022-14:46:31] [I]  quantize_per_channel_114_input + (Unnamed Layer* 641) [Constant]_output_per_channel_quant + conv2d_111 + relu_102       18.68           0.0288      0.8
[03/03/2022-14:46:31] [I]             quantize_per_channel_115_input + (Unnamed Layer* 652) [Constant]_output_per_channel_quant + conv2d_112       12.11           0.0187      0.5
[03/03/2022-14:46:31] [I]                                                                                                  add_31 + relu_103        8.84           0.0136      0.4
[03/03/2022-14:46:31] [I]             quantize_per_channel_116_input + (Unnamed Layer* 668) [Constant]_output_per_channel_quant + conv2d_113       37.40           0.0577      1.5
[03/03/2022-14:46:31] [I]     quantize_per_channel_117_input + (Unnamed Layer* 678) [Constant]_output_per_channel_quant + conv_transpose2d_3       30.68           0.0474      1.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_104)        4.73           0.0073      0.2
[03/03/2022-14:46:31] [I]     quantize_per_channel_118_input + (Unnamed Layer* 693) [Constant]_output_per_channel_quant + conv_transpose2d_4      102.36           0.1580      4.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_105)       10.18           0.0157      0.4
[03/03/2022-14:46:31] [I]     quantize_per_channel_119_input + (Unnamed Layer* 708) [Constant]_output_per_channel_quant + conv_transpose2d_5      447.84           0.6911     18.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_106)       34.68           0.0535      1.4
[03/03/2022-14:46:31] [I]  quantize_per_channel_120_input + (Unnamed Layer* 723) [Constant]_output_per_channel_quant + conv2d_114 + relu_107       65.06           0.1004      2.6
[03/03/2022-14:46:31] [I]  quantize_per_channel_122_input + (Unnamed Layer* 742) [Constant]_output_per_channel_quant + conv2d_116 + relu_108       64.46           0.0995      2.6
[03/03/2022-14:46:31] [I]  quantize_per_channel_124_input + (Unnamed Layer* 761) [Constant]_output_per_channel_quant + conv2d_118 + relu_109       64.35           0.0993      2.6
[03/03/2022-14:46:31] [I]             quantize_per_channel_121_input + (Unnamed Layer* 734) [Constant]_output_per_channel_quant + conv2d_115       11.23           0.0173      0.5
[03/03/2022-14:46:31] [I]             quantize_per_channel_123_input + (Unnamed Layer* 753) [Constant]_output_per_channel_quant + conv2d_117       11.16           0.0172      0.5
[03/03/2022-14:46:31] [I]             quantize_per_channel_125_input + (Unnamed Layer* 772) [Constant]_output_per_channel_quant + conv2d_119       11.20           0.0173      0.5
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 741) [Quantize]_output_.dequant        6.92           0.0107      0.3
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 741) [Quantize]_output_.dequant        4.45           0.0069      0.2
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 760) [Quantize]_output_.dequant        6.34           0.0098      0.3
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 760) [Quantize]_output_.dequant        4.56           0.0070      0.2
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 779) [Quantize]_output_.dequant        6.00           0.0093      0.2
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 779) [Quantize]_output_.dequant        4.35           0.0067      0.2
[03/03/2022-14:46:31] [I]                                                                                                              Total     2464.87           3.8038    100.0

Here the Conv_116 + Add_117 + Relu_118 fusion is missing, as are the later ConvTranspose_120 + BatchNormalization_121 + Relu_122 and Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136 fusions, and this costs roughly 0.6 ms of extra time:

51c~TensorRT~合集1_视觉_46
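
The two per-layer tables above can also be compared programmatically rather than by eye. The following is a minimal sketch, assuming the row layout seen in the logs quoted in this article (timestamp, `[I]`, layer name, then three numeric columns). `parse_profile` and the prefix heuristic for spotting standalone elementwise layers are hypothetical helpers, not part of trtexec.

```python
import re

# One profile row: "[timestamp] [I] <layer name> <time ms> <avg ms> <percent>".
# The layer name may contain spaces, '+', '||' and brackets, so it is matched lazily
# and the three numeric columns are anchored to the end of the line.
ROW = re.compile(r"^\[[^\]]+\]\s+\[I\]\s+(.*?)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")


def parse_profile(text):
    """Return {layer name: average ms} for every profile row found in `text`."""
    layers = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            layers[m.group(1)] = float(m.group(3))
    return layers


# A few rows copied from the two logs above.
onnx_log = """\
[03/07/2022-11:34:20] [I]    Conv_116 + Add_117 + Relu_118       15.16           0.0202      0.6
[03/07/2022-11:34:20] [I]                            Total     2352.92           3.1414    100.0
"""
fx_log = """\
[03/03/2022-14:46:31] [I]                add_31 + relu_103        8.84           0.0136      0.4
[03/03/2022-14:46:31] [I]                    PWN(relu_104)        4.73           0.0073      0.2
[03/03/2022-14:46:31] [I]                            Total     2464.87           3.8038    100.0
"""

onnx = parse_profile(onnx_log)
fx = parse_profile(fx_log)

# Difference between the two "Total" rows, in average ms per inference.
gap_ms = fx["Total"] - onnx["Total"]

# Heuristic (an assumption): a row that starts with an elementwise op was not
# fused into the preceding convolution.
standalone = [name for name in fx if name.startswith(("add_", "PWN("))]

print(f"extra average time per inference: {gap_ms:.4f} ms")
print("standalone elementwise layers:", standalone)
```

Fed with the full logs, the same function gives a per-layer view of where the extra time goes.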

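Why does this happen? A closer look at the FX model structure shows an extra Q/DQ pair here. For TensorRT, Q/DQ nodes in the wrong place prevent it from fully optimizing the graph during quantization.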

51c~TensorRT~合集1_视觉_47

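The ideal structure looks like this: the BN layer is followed directly by Add, with no Q/DQ in between, so TRT fuses conv+bn+add and the following relu into a single Conv_116 + Add_117 + Relu_118 layer: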

51c~TensorRT~合集1_视觉_48

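One more point: in older versions of FX, the way the fuse step (covered in part two) handles the BN layer that follows a transposed convolution also interferes with the later quantization and leaves the optimization incomplete. Once both issues are fixed, TRT optimizes the graph normally.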

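How do we remove the extra Q/DQ operations in bulk? The interpreter introduced earlier is enough: during propagate, rewrite the args of each add node so that they point at the correct node. There are 17 of them in total, so they can all be changed in one pass: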

from typing import Dict

import torch.fx as fx
from torch.fx import Node


def propagate(self, *args):
    args_iter = iter(args)
    env : Dict[str, Node] = {}

    def load_arg(a):
        return fx.graph.map_arg(a, lambda n: env[n.name])

    def fetch_attr(target : str):
        target_atoms = target.split('.')
        attr_itr = self.mod
        for i, atom in enumerate(target_atoms):
            if not hasattr(attr_itr, atom):
                raise RuntimeError(f"Node referenced nonexistent target {'.'.join(target_atoms[:i])}")
            attr_itr = getattr(attr_itr, atom)
        return attr_itr

    for node in self.graph.nodes:
        # Rewire here: the first input of every add node is replaced with the
        # node recorded for it in self.change_list.
        if "add" in node.name:
            node.args = (self.change_list[node.name], node.args[1])
    # After rewiring, remove the nodes that are no longer used.
    self.mod.graph.eliminate_dead_code()
    # Regenerate the module's code from the updated graph.
    self.mod.recompile()

    return

That is all it takes. After the change there is no Q/DQ between the add and the preceding conv layer (the BN has been folded into the conv here):

51c~TensorRT~合集1_视觉_49

Likewise, the transposed convolution is now merged with its BN layer:

51c~TensorRT~合集1_视觉_50
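
The propagate code above reads from `self.change_list`, but the article does not show how that mapping is built. One possible way is sketched below on a toy graph instead of a real torch.fx graph: for each add node, walk back from its first input through any quantize/dequantize nodes until the real producer is found. `ToyNode`, `build_change_list`, `rewire` and `eliminate_dead` are hypothetical names for illustration. Whether a given Q/DQ pair is safe to bypass still has to be checked against the actual model.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node: a name, an op kind, and input names."""
    name: str
    op: str
    args: tuple = ()


QDQ_OPS = {"quantize", "dequantize"}


def build_change_list(nodes):
    """Map each add node to the first non-Q/DQ producer of its first input."""
    by_name = {n.name: n for n in nodes}
    change_list = {}
    for node in nodes:
        if node.op != "add":
            continue
        src = by_name[node.args[0]]
        # Walk backwards through any chain of quantize/dequantize nodes.
        while src.op in QDQ_OPS:
            src = by_name[src.args[0]]
        if src.name != node.args[0]:
            change_list[node.name] = src.name
    return change_list


def rewire(nodes, change_list):
    """Point each listed add node directly at its producer (first input only)."""
    for node in nodes:
        if node.name in change_list:
            node.args = (change_list[node.name],) + node.args[1:]


def eliminate_dead(nodes, outputs):
    """Drop nodes that no longer feed any output, keeping the original order."""
    by_name = {n.name: n for n in nodes}
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(by_name[name].args)
    return [n for n in nodes if n.name in live]


graph = [
    ToyNode("x", "input"),
    ToyNode("conv2d_105", "conv", ("x",)),
    ToyNode("q_1", "quantize", ("conv2d_105",)),
    ToyNode("dq_1", "dequantize", ("q_1",)),
    ToyNode("add_29", "add", ("dq_1", "x")),
]

changes = build_change_list(graph)
rewire(graph, changes)
graph = eliminate_dead(graph, outputs=["add_29"])

print(changes)                    # {'add_29': 'conv2d_105'}
print([n.name for n in graph])    # ['x', 'conv2d_105', 'add_29']
```

The last step plays the role that `eliminate_dead_code()` plays in the real FX graph: once nothing consumes the Q/DQ pair, it is dropped.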

Convert the modified FX model through TensorRT once more and benchmark it again:

# Before modifying the network
=== Performance summary ===
Throughput: 260.926 qps
Latency: min = 4.91473 ms, max = 5.23787 ms, mean = 4.97783 ms, median = 4.97583 ms, percentile(99%) = 5.22012 ms
End-to-End Host Latency: min = 4.98529 ms, max = 8.08485 ms, mean = 7.56827 ms, median = 7.58014 ms, percentile(99%) = 8.06438 ms
Enqueue Time: min = 0.375031 ms, max = 0.717957 ms, mean = 0.394493 ms, median = 0.391724 ms, percentile(99%) = 0.470032 ms
H2D Latency: min = 1.03088 ms, max = 1.09827 ms, mean = 1.03257 ms, median = 1.03235 ms, percentile(99%) = 1.03613 ms
GPU Compute Time: min = 3.75397 ms, max = 4.07245 ms, mean = 3.81574 ms, median = 3.81421 ms, percentile(99%) = 4.05913 ms
D2H Latency: min = 0.125977 ms, max = 0.153076 ms, mean = 0.129512 ms, median = 0.129333 ms, percentile(99%) = 0.131836 ms
Total Host Walltime: 3.01235 s
Total GPU Compute Time: 2.99917 s
Explanations of the performance metrics are printed in the verbose logs.

# After modifying the network
=== Performance summary ===
Throughput: 305.313 qps
Latency: min = 4.35956 ms, max = 4.64665 ms, mean = 4.41392 ms, median = 4.40918 ms, percentile(99%) = 4.62846 ms
End-to-End Host Latency: min = 4.401 ms, max = 6.90311 ms, mean = 6.43806 ms, median = 6.43774 ms, percentile(99%) = 6.88329 ms
Enqueue Time: min = 0.320801 ms, max = 0.559082 ms, mean = 0.334164 ms, median = 0.330078 ms, percentile(99%) = 0.486328 ms
H2D Latency: min = 1.03186 ms, max = 1.03824 ms, mean = 1.03327 ms, median = 1.0332 ms, percentile(99%) = 1.03638 ms
GPU Compute Time: min = 3.20001 ms, max = 3.48364 ms, mean = 3.25109 ms, median = 3.24609 ms, percentile(99%) = 3.46623 ms
D2H Latency: min = 0.126404 ms, max = 0.13208 ms, mean = 0.129566 ms, median = 0.129395 ms, percentile(99%) = 0.13147 ms
Total Host Walltime: 3.01003 s
Total GPU Compute Time: 2.98775 s
Explanations of the performance metrics are printed in the verbose logs.

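GPU compute time drops from 3.8 ms to about 3.2 ms, a gain of roughly 0.6 ms, and QPS rises by about 17% (260.9 to 305.3). Accuracy is of course unchanged, and the TensorRT log now shows that every expected fusion is applied correctly.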
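
The gains can be recomputed directly from the two performance summaries above. The snippet below is only arithmetic on the reported throughput and mean GPU compute time; the variable names are my own.

```python
# Values copied from the two trtexec summaries above.
qps_before, qps_after = 260.926, 305.313
gpu_ms_before, gpu_ms_after = 3.81574, 3.25109   # mean GPU compute time

qps_gain_pct = (qps_after / qps_before - 1) * 100
saved_ms = gpu_ms_before - gpu_ms_after

print(f"QPS gain: {qps_gain_pct:.1f}%")                 # QPS gain: 17.0%
print(f"GPU compute time saved: {saved_ms:.3f} ms")     # GPU compute time saved: 0.565 ms
```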

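What puzzles me is that the current 3.2 ms is still 0.1 ms slower than the 3.1 ms of the engine quantized directly with trtexec in implicit mode above. I then tried quantizing this model with trtexec using calibration data, and the time came out at 3.2 ms again. I do not know the reason yet; if you do, please leave a comment.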

At this point we have post-training-quantized a model with FX and converted it to TensorRT, with accuracy and speed in line with expectations.

The graph must match the way TensorRT builds a network

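Problems arise if the exported model has wrong nodes, dangling nodes (a node whose output is neither an input to any layer nor a model output), or wrongly referenced nodes (a node that fetches an attribute that does not exist, for example backbone_base_fc_bias = self.backbone.base.fc.bias, where fc is a ConvRelu2D). In these cases the TRT build fails with: Error Code 4: Internal Error ([DECONVOLUTION]-[acc_ops.conv_transpose2d]-[conv_transpose2d_3]: Missing Dequantization layer - 2nd input to a weighted-layer must include exactly one DQ layer.). It may also be a TensorRT bug: the FX network with modified nodes builds fine on TensorRT 8.2 and later, but on TensorRT 8.0.1.6 it produces a baffling model (in the structure shown below, INT8 and FP32 nodes are mixed up):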

Layer(CaskConvolution): quantize_per_channel_106_input + 492output_per_channel_quant + conv2d_103 + relu_95, Tactic: 805889586762897346, 489output[Int8(1,1024,-26,-29)] -> 500output[Int8(1,512,-26,-29)]
Layer(CaskConvolution): quantize_per_channel_109_input + 520output_per_channel_quant + conv2d_106, Tactic: 7738495016763012180, 489output[Int8(1,1024,-26,-29)] -> 527output[Int8(1,2048,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_107_input + 503output_per_channel_quant + conv2d_104 + relu_96, Tactic: 6781129591847482048, 500output[Int8(1,512,-26,-29)] -> 511output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_108_input + 514output_per_channel_quant + conv2d_105 + add_29 + relu_97, Tactic: 8234775147403903473, 511output[Int8(1,512,-50,-51)], 527output[Int8(1,2048,-50,-51)] -> 533output[Int8(1,2048,-50,-51)]
Layer(CudnnConvolution): quantize_per_channel_110_input + 536output_per_channel_quant + 538output_.dequant + conv2d_107 + relu_98, Tactic: 1, 535output[Float(1,2048,-50,-51)] -> 542Activation]_output[Float(1,512,-50,-51)]
Layer(Scale): 542Activation]_output_per_tensor_quant, Tactic: 0, 542Activation]_output[Float(1,512,-50,-51)] -> 544output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_111_input + 547output_per_channel_quant + conv2d_108 + relu_99, Tactic: 7438984192263206338, 544output[Int8(1,512,-50,-51)] -> 555output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_112_input + 558output_per_channel_quant + conv2d_109 + add_30 + relu_100, Tactic: 8234775147403903473, 555output[Int8(1,512,-50,-51)], 533output[Int8(1,2048,-50,-51)] -> 567output[Int8(1,2048,-50,-51)]
Layer(CudnnConvolution): quantize_per_channel_113_input + 570output_per_channel_quant + 572output_.dequant + conv2d_110 + relu_101, Tactic: 1, 569output[Float(1,2048,-50,-51)] -> 576Activation]_output[Float(1,512,-50,-51)]
Layer(Scale): 576Activation]_output_per_tensor_quant, Tactic: 0, 576Activation]_output[Float(1,512,-50,-51)] -> 578output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_114_input + 581output_per_channel_quant + conv2d_111 + relu_102, Tactic: 7438984192263206338, 578output[Int8(1,512,-50,-51)] -> 589output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_115_input + 592output_per_channel_quant + conv2d_112 + add_31 + relu_103, Tactic: 8234775147403903473, 589output[Int8(1,512,-50,-51)], 567output[Int8(1,2048,-50,-51)] -> 601output[Int8(1,2048,-50,-51)]

The engine can be built, but it is very slow and its accuracy is completely lost, which makes debugging that much harder.

Another way to do FX2TRT

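TensorRT supports two quantization modes, explicit and implicit. What we used above is explicit quantization, where Q/DQ nodes explicitly declare which nodes to quantize. We can also go from FX to TensorRT through implicit quantization. In that case we cannot convert the reference version of the model: instead of simulated quantization, we need the model whose operators are actually INT8, quantized_fx = convert_fx(model.fx_model).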

PyTorch has INT8 operators on the CPU side. In practice the model calls the torch.nn.quantized.modules.conv.Conv2d operator, and the following converter is invoked when converting to TRT:

@tensorrt_converter(torch.nn.quantized.modules.conv.Conv2d)
def quantized_conv2d(network, submod, args, kwargs, layer_name):
    input_val = args[0]

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(
            f"Quantized Conv2d received input {input_val} that is not part "
            "of the TensorRT region!"
        )

    return common_conv(
        network,
        submod,
        dimension=2,
        input_val=input_val,
        layer_name=layer_name,
        is_quantized=True,
    )

Along the way we pass in the scale and zero_point of each layer's activations, while the weights are still calibrated inside TensorRT:

if is_quantized:
    # Assume the dtype of activation is torch.quint8
    mark_as_int8_layer(
        layer, get_dyn_range(mod.scale, mod.zero_point, torch.quint8)
    )
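
The snippet above turns a scale and zero_point into a dynamic range. As a rough illustration of what that conversion involves, here is a sketch that assumes PyTorch's affine quantization formula, real = (q - zero_point) * scale, with quint8 covering 0..255. `dyn_range` is a hypothetical helper written for this example, not the library's `get_dyn_range`.

```python
def dyn_range(scale, zero_point, qmin=0, qmax=255):
    """Real-valued range covered by an affine-quantized tensor.

    Assumes real = (q - zero_point) * scale. The default qmin/qmax are the
    quint8 limits; pass -128 and 127 for a signed 8-bit type.
    """
    return (qmin - zero_point) * scale, (qmax - zero_point) * scale


low, high = dyn_range(scale=0.5, zero_point=128)
print(low, high)  # -64.0 63.5
```

With a pair like this, an implicit-mode INT8 layer can be given its activation range up front instead of relying on a calibration pass.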

I will not demonstrate it here; I have run out of steam.

A quick word on the TRTModule class

In FX2TRT the final engine is managed by this class. It wraps the engine so that calling an instance works just like calling an ordinary nn.Module, which is very convenient.

The details of TRTModule are worth reading in the code.

# torch_tensorrt/fx/trt_module.py

class TRTModule(torch.nn.Module):
    def __init__(
        self, engine=None, input_names=None, output_names=None, cuda_graph_batch_size=-1
    ):
        super(TRTModule, self).__init__()
        self._register_state_dict_hook(TRTModule._on_state_dict)
        self.engine = engine
        self.input_names = input_names
        self.output_names = output_names
        self.cuda_graph_batch_size = cuda_graph_batch_size
        self.initialized = False

        if engine:
            self._initialize()

    def _initialize(self):
        self.initialized = True
        self.context = self.engine.create_execution_context()

        # Indices of inputs/outputs in the trt engine bindings, in the order
        # as they are in the original PyTorch model.
        self.input_binding_indices_in_order: Sequence[int] = [
            self.engine.get_binding_index(name) for name in self.input_names
        ]
        self.output_binding_indices_in_order: Sequence[int] = [
            self.engine.get_binding_index(name) for name in self.output_names
        ]
        primary_input_outputs = set()
        primary_input_outputs.update(self.input_binding_indices_in_order)
        primary_input_outputs.update(self.output_binding_indices_in_order)
        self.hidden_output_binding_indices_in_order: Sequence[int] = []
        self.hidden_output_names: Sequence[str] = []
        for i in range(
            self.engine.num_bindings // self.engine.num_optimization_profiles
        ):
            if i not in primary_input_outputs:
                self.hidden_output_binding_indices_in_order.append(i)
                self.hidden_output_names.append(self.engine.get_binding_name(i))

        assert (self.engine.num_bindings // self.engine.num_optimization_profiles) == (
            len(self.input_names)
            + len(self.output_names)
            + len(self.hidden_output_names)
        )

        self.input_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.input_binding_indices_in_order
        ]
        self.input_shapes: Sequence[Sequence[int]] = [
            tuple(self.engine.get_binding_shape(idx))
            for idx in self.input_binding_indices_in_order
        ]
        self.output_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.output_binding_indices_in_order
        ]
        self.output_shapes = [
            tuple(self.engine.get_binding_shape(idx))
            if self.engine.has_implicit_batch_dimension
            else tuple()
            for idx in self.output_binding_indices_in_order
        ]
        self.hidden_output_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.hidden_output_binding_indices_in_order
        ]
        self.hidden_output_shapes = [
            tuple(self.engine.get_binding_shape(idx))
            if self.engine.has_implicit_batch_dimension
            else tuple()
            for idx in self.hidden_output_binding_indices_in_order
        ]

    def _check_initialized(self):
        if not self.initialized:
            raise RuntimeError("TRTModule is not initialized.")

    def _on_state_dict(self, state_dict, prefix, local_metadata):
        self._check_initialized()
        state_dict[prefix + "engine"] = bytearray(self.engine.serialize())
        state_dict[prefix + "input_names"] = self.input_names
        state_dict[prefix + "output_names"] = self.output_names
        state_dict[prefix + "cuda_graph_batch_size"] = self.cuda_graph_batch_size

    def _load_from_state_dict(
        self,
        state_dict,
        prefix,
        local_metadata,
        strict,
        missing_keys,
        unexpected_keys,
        error_msgs,
    ):
        engine_bytes = state_dict[prefix + "engine"]

        logger = trt.Logger()
        runtime = trt.Runtime(logger)
        self.engine = runtime.deserialize_cuda_engine(engine_bytes)

        self.input_names = state_dict[prefix + "input_names"]
        self.output_names = state_dict[prefix + "output_names"]
        self._initialize()

    def __getstate__(self):
        state = self.__dict__.copy()
        state["engine"] = bytearray(self.engine.serialize())
        state.pop("context", None)
        return state

    def __setstate__(self, state):
        logger = trt.Logger()
        runtime = trt.Runtime(logger)
        state["engine"] = runtime.deserialize_cuda_engine(state["engine"])
        self.__dict__.update(state)
        if self.engine:
            self.context = self.engine.create_execution_context()

    def forward(self, *inputs):
        with torch.autograd.profiler.record_function("TRTModule:Forward"):
            self._check_initialized()

            with torch.autograd.profiler.record_function("TRTModule:ProcessInputs"):
                assert len(inputs) == len(
                    self.input_names
                ), f"Wrong number of inputs, expect {len(self.input_names)} get {len(inputs)}."

                # This is only used when the trt engine is using implicit batch dim.
                batch_size = inputs[0].shape[0]
                contiguous_inputs: List[torch.Tensor] = [i.contiguous() for i in inputs]
                bindings: List[Any] = [None] * (
                    len(self.input_names)
                    + len(self.output_names)
                    + len(self.hidden_output_names)
                )

                for i, input_name in enumerate(self.input_names):
                    assert inputs[
                        i
                    ].is_cuda, f"{i}th input({input_name}) is not on cuda device."
                    assert (
                        inputs[i].dtype == self.input_dtypes[i]
                    ), f"Dtype mismatch for {i}th input({input_name}). Expect {self.input_dtypes[i]}, got {inputs[i].dtype}."

                    idx = self.input_binding_indices_in_order[i]
                    bindings[idx] = contiguous_inputs[i].data_ptr()

                    if not self.engine.has_implicit_batch_dimension:
                        self.context.set_binding_shape(
                            idx, tuple(contiguous_inputs[i].shape)
                        )
                    else:
                        assert inputs[i].size()[1:] == self.input_shapes[i], (
                            f"Shape mismatch for {i}th input({input_name}). "
                            f"Expect {self.input_shapes[i]}, got {inputs[i].size()[1:]}."
                        )

            with torch.autograd.profiler.record_function("TRTModule:ProcessOutputs"):
                # create output tensors
                outputs: List[torch.Tensor] = []

                for i, idx in enumerate(self.output_binding_indices_in_order):
                    if self.engine.has_implicit_batch_dimension:
                        shape = (batch_size,) + self.output_shapes[i]
                    else:
                        shape = tuple(self.context.get_binding_shape(idx))

                    output = torch.empty(  # type: ignore[call-overload]
                        size=shape,
                        dtype=self.output_dtypes[i],
                        device=torch.cuda.current_device(),
                    )
                    outputs.append(output)
                    bindings[idx] = output.data_ptr()

                for i, idx in enumerate(self.hidden_output_binding_indices_in_order):
                    if self.engine.has_implicit_batch_dimension:
                        shape = (batch_size,) + self.hidden_output_shapes[i]
                    else:
                        shape = tuple(self.context.get_binding_shape(idx))

                    output = torch.empty(  # type: ignore[call-overload]
                        size=shape,
                        dtype=self.hidden_output_dtypes[i],
                        device=torch.cuda.current_device(),
                    )
                    bindings[idx] = output.data_ptr()

            with torch.autograd.profiler.record_function("TRTModule:TensorRTRuntime"):
                if self.engine.has_implicit_batch_dimension:
                    self.context.execute_async(
                        batch_size, bindings, torch.cuda.current_stream().cuda_stream
                    )
                else:
                    self.context.execute_async_v2(
                        bindings, torch.cuda.current_stream().cuda_stream
                    )

            if len(outputs) == 1:
                return outputs[0]

            return tuple(outputs)

    def enable_profiling(self, profiler: "trt.IProfiler" = None):
        """
        Enable TensorRT profiling. After calling this function, TensorRT will report
        time spent on each layer in stdout for each forward run.
        """
        self._check_initialized()

        if not self.context.profiler:
            self.context.profiler = trt.Profiler() if profiler is None else profiler

    def disable_profiling(self):
        """
        Disable TensorRT profiling.
        """
        self._check_initialized()

        torch.cuda.synchronize()
        del self.context
        self.context = self.engine.create_execution_context()

    def get_layer_info(self) -> str:
        """
        Get layer info of the engine. Only support for TRT > 8.2.
        """
        inspector = self.engine.create_engine_inspector()
        return inspector.get_engine_information(trt.LayerInformationFormat.JSON)

I first came across TRTModule in torch2trt, another PyTorch-to-TensorRT conversion tool that is just as handy.

V. TensorRT Basics

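1. Why can inference latency get longer when trtexec builds the engine with FP16 inference or INT8 quantization?

A: Possible reasons:

a. Quantization may introduce extra computation and some internal reshapes. For small models the latency added by the extra computation is not noticeable, whereas a reshape involves memory operations and is the main cause of the longer latency. For reshape-induced latency, the remedy is to keep TensorRT from doing these extra operations where we can, but reshapes that TensorRT generates internally are out of our hands.
b. TensorRT relies on kernel auto-tuning, so the kernel it selects is not necessarily the most efficient one.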

2. What is Myelin?

A: It is an internal TensorRT component responsible for graph compilation and the execution backend.

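3. What is the difference between constant cache and constant memory?

A: They are two separate concepts. The cache sits closer to the compute units, so it is faster. Constant cache is a concept from earlier GPU generations, for example the SM block of the early Fermi architecture (left figure). The SM of the current Ampere architecture is shown in the right figure.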

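4. With the same CUDA, cuDNN and TensorRT versions, can a TRT engine built on another machine be run directly on my own machine?

A: TRT optimizes differently for different GPU architectures, so an engine moved to another platform may be incompatible.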

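5. For a fairly large model, are there any good methods or tricks when the trtexec search takes too long on an edge device? (The search for the best tactic per layer during engine building takes too long.)

A: An engine must be built on the device that will run inference, and edge devices can be somewhat slow. However, if you build an engine for the same model repeatedly, or you changed only some layers and left most untouched (for example while debugging or testing), you can save the best tactics that TRT found for each layer during the first build, that is, the kernels and optimization choices. On later builds, loading the saved tactics lets TRT skip the search for layers it has already tuned. This is the timing cache, and it can be specified both on the trtexec command line and through the TensorRT API. See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt-builder-perf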
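
The timing-cache workflow described above can be sketched with the TensorRT Python API roughly as follows. This is a minimal sketch under assumptions: `config` is expected to be a `trt.IBuilderConfig` created elsewhere, the cache file path is up to you, and the two helper names are hypothetical. The calls used are `create_timing_cache`, `set_timing_cache`, `get_timing_cache` and `serialize`. As far as I know, the trtexec counterpart is the `--timingCacheFile` option.

```python
import os


def load_timing_cache(config, path):
    """Attach a timing cache to a builder config, seeded from `path` if it exists."""
    data = b""
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = f.read()
    # An empty buffer creates a fresh cache; otherwise earlier tactics are reused.
    cache = config.create_timing_cache(data)
    # Second argument is ignore_mismatch: False rejects caches from another device.
    config.set_timing_cache(cache, False)
    return cache


def save_timing_cache(config, path):
    """Serialize the cache after the build so the next build can reuse it."""
    cache = config.get_timing_cache()
    with open(path, "wb") as f:
        f.write(bytearray(cache.serialize()))


# Intended usage (requires TensorRT, shown as comments only):
#   config = builder.create_builder_config()
#   load_timing_cache(config, "model.timing.cache")
#   engine_bytes = builder.build_serialized_network(network, config)
#   save_timing_cache(config, "model.timing.cache")
```

Because the cache is keyed to the device and TensorRT version, it is best kept next to the engine for that specific target rather than shared across machines.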

TensorRT Model Deployment Optimization

1. After a model is deployed, what can be used to analyze inference performance?

A: Use the Nsight tools. They capture the run time of each kernel in the model, and we then optimize based on what they show.

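2. How are throughput and latency related in neural network inference?

A: Throughput describes how much computation a hardware device can complete per unit time; latency describes how long one model inference takes. Latency splits into latency from computation and latency from data transfer (including synchronization). Nsight Systems (nsys) and Nsight Compute can be used to quantify the latency of each stage.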
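
As a rough illustration that throughput is not simply the inverse of latency, the numbers below are taken from the pre-modification trtexec summary quoted earlier in this article. The interpretation in the comments is an assumption on my part: the reported throughput sits near the compute-bound estimate, which is what one would expect if data transfers for one query overlap with GPU compute for another.

```python
# Mean values from the "before modifying the network" trtexec summary.
h2d_ms, gpu_ms, d2h_ms = 1.03257, 3.81574, 0.129512
latency_ms = 4.97783
reported_qps = 260.926

# The per-query latency is approximately the sum of the three stages.
stage_sum_ms = h2d_ms + gpu_ms + d2h_ms

# Two back-of-the-envelope throughput estimates.
qps_if_serial = 1000 / latency_ms        # one query fully finishes before the next starts
qps_if_compute_bound = 1000 / gpu_ms     # transfers hidden behind GPU compute

print(f"stage sum: {stage_sum_ms:.3f} ms vs latency {latency_ms:.3f} ms")
print(f"serial estimate: {qps_if_serial:.0f} qps")
print(f"compute-bound estimate: {qps_if_compute_bound:.0f} qps")
print(f"reported: {reported_qps:.0f} qps")
```

The serial estimate comes out near 201 qps and the compute-bound one near 262 qps, against a reported 261 qps, so shaving transfer time would lower latency here without raising throughput much.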

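3. What quantization methods does TensorRT offer?

A: TRT's default and recommended calibration algorithm is entropy, but it depends on the case; sometimes minmax or percentile gives better results. The choice should take the characteristics of the op into account.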
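
To make the difference between minmax and percentile concrete, here is a small pure-Python sketch of symmetric INT8 scale selection on a tensor with one outlier. It is not TensorRT's calibrator code: the function names, the nearest-rank percentile and the sample data are assumptions for illustration, and entropy calibration is not shown.

```python
import math


def minmax_scale(values, qmax=127):
    """Symmetric INT8 scale from the largest absolute value."""
    return max(abs(v) for v in values) / qmax


def percentile_scale(values, pct=99.0, qmax=127):
    """Symmetric INT8 scale from the pct-th percentile of absolute values."""
    mags = sorted(abs(v) for v in values)
    # Nearest-rank percentile: the smallest magnitude that covers pct% of samples.
    rank = max(1, math.ceil(pct * len(mags) / 100))
    return mags[rank - 1] / qmax


def quantize(value, scale, qmax=127):
    """Round to the nearest INT8 level and clamp to [-qmax, qmax]."""
    return max(-qmax, min(qmax, round(value / scale)))


# 99 typical activations in (0, 1) plus a single large outlier.
typical = [0.01 * i for i in range(1, 100)]
activations = typical + [50.0]

levels_used = {}
for name, scale in (
    ("minmax", minmax_scale(activations)),
    ("percentile", percentile_scale(activations)),
):
    # How many distinct INT8 levels the typical values occupy under this scale.
    levels_used[name] = len({quantize(v, scale) for v in typical})
    print(f"{name}: scale={scale:.4f}, distinct levels for typical values={levels_used[name]}")
```

With minmax the outlier stretches the range so far that the typical values collapse onto a handful of levels, while percentile clips the outlier and keeps the resolution for everything else. The opposite trade-off applies to ops whose large values carry real information.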

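4. Exporting a model to an FP32 TRT engine shows no obvious accuracy loss, but exporting to FP16 shows a clear loss. What are possible causes?

A: The most prominent possibilities are: sensitive layers were quantized and lost a lot of accuracy, or the weight distribution is not concentrated, so the chosen dynamic range turns many meaningful weights into 0. In addition, accuracy can drop when the scale calculation method (minmax, entropy, percentile) is not chosen to suit each op.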

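5. The ONNX model gives correct results, but the results are wrong after TensorRT quantization. What are the likely causes?

A: When this happens, first confirm that pre-processing (for example, every input transform must match the pre-processing used when training the PyTorch model) and post-processing are identical for the two models. Once quantization is confirmed as the cause, the possible reasons are:

a. the wrong calibrator algorithm was chosen;
b. too little data was used for calibration;
c. sensitive layers of the network were quantized;
d. the scale calculation chosen for some operators does not suit the characteristics of those ops.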

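6. With TensorRT PTQ, calibrating with different batch sizes gives models with different accuracy. Is this normal?

A: This is normal, because calibration is computed per tensor. On each computation, if the maximum of the histogram needs updating, PTQ doubles the histogram range. Reference: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#enable_int8_c

Setting memory limits aside, a larger batch_size is recommended, since each batch then contains a richer set of samples and the calibrated accuracy tends to be better. The right size has to be found by experiment (start from a large batch size and reduce it step by step). Note that the larger the batch_size, the longer calibration takes.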

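7. A question about aligned memory access: if the L1 cache is used the access granularity is 128B, so the aligned start address should be an even multiple of 128B. Shouldn't that be 0B, 256B, 512B, and so on?

A: The "even multiple" here actually refers to the address being an even multiple, not to an even-numbered multiple of 128B. For a more official explanation see: https://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Fundamental_Optimizations.pdf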
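
One way to see what alignment buys is to count how many 128-byte segments a warp's contiguous access overlaps. The sketch below assumes 32 threads each reading 4 bytes and a 128-byte granularity; `segments_touched` is a hypothetical helper. It also reflects a common reading of "even multiple" in English as "exact multiple", under which any start address divisible by 128 (128B and 384B included) counts as aligned.

```python
def segments_touched(start_addr, num_threads=32, bytes_per_thread=4, segment=128):
    """Count the `segment`-byte blocks overlapped by one contiguous warp access."""
    first = start_addr // segment
    last = (start_addr + num_threads * bytes_per_thread - 1) // segment
    return last - first + 1


for addr in (0, 128, 256, 64):
    print(addr, segments_touched(addr))
# 0 1
# 128 1
# 256 1
# 64 2
```

A start address of 64 makes the same 128 bytes of payload straddle two segments, so twice as much data has to be fetched.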

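8. The same model converts successfully on a 3090 GPU but fails on an RTX4000. How can this be fixed? (See the figure below for the error message.)

A: The message points to an SM-related error, so check whether the nvcc compiler options in the makefile or CMakeLists.txt are set correctly.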

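9. How can Nsight or the CUDA runtime API be used to analyze inference performance?

A: Nsight shows the names of the kernels (from which you can infer whether a kernel runs on CUDA cores or Tensor Cores, and in FP16 or INT8) and also lets you inspect memory traffic.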

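10. How can data exchange between GPU and CPU, and memory allocation and release, be kept to a minimum?

A: During inference, slow CPU-GPU copies and frequent allocation and release of memory greatly reduce performance. We can allocate the maximum memory needed before running the model (so that memory is reused), which cuts the number of allocations and releases. For the CPU-GPU copies, we need to streamline the code flow to minimize the number of copies, or find a better way to hide the time they take.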

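11. If QAT keeps the quantization error as small as possible, can we skip sensitivity analysis and quantize the whole network to INT8?

A: This is not recommended. In our experience sensitive layers lose a lot of accuracy when quantized to INT8, so sensitivity analysis is still necessary.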

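12. After quantizing a model to INT8, inference is slower than FP16. Is this normal?

A: Yes. It may be the kernel auto-tuning in TensorRT at work: it runs every candidate optimization strategy, finds that quantization drags in a pile of extra operations and is not efficient, and simply falls back to CUDA cores instead of Tensor Cores. When the network parameters and model architecture are poorly designed, TRT adds extra processing, which makes INT8 inference slower than FP16. This can be seen by visualizing the engine with the trt-engine-explorer tool.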

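13. At engine inference time, is it reasonable that batchsize=4 takes close to 4 times as long as batchsize=1? Is there a way to bring multi-batch inference time close to single-batch time, for example with more GPU memory?

A: There are many possible causes. A single batch may already saturate the GPU, so batchsize=4 looks more parallel but may in fact run serially. Profile the inference with Nsight Systems and check hardware utilization.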

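14. What if the device is fixed? Are there parameter settings, or ways such as adding streams? I tried setting the workspace to the maximum and saw only a slight improvement.

A: Workspace size has little bearing on performance. The workspace is used when TensorRT selects tactics while building the engine; a larger workspace allows a wider choice of tactics, but unless it is very small the effect is usually limited. Try precision flags such as fp16 and int8 for quantization, CUDA graphs to hide kernel launch overhead, a higher builderOptimizationLevel, and so on. Parameter tuning alone only goes so far, so also check whether the model architecture has redundancy.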