Hey, this article came out pretty fast, didn't it? Bet you didn't see that coming.

Continuing from the previous article, this section analyzes the code.

First, a brief introduction to the TensorRT workflow; if needed, a separate section with a more detailed explanation will follow later.

TensorRT acceleration can be broken down into three steps:

1. Define a network with the TensorRT API, or export a PyTorch model to ONNX.

2. Create a builder, the container used to generate the engine. The builder is responsible for specifying the memory the engine needs, among other settings.

3. Use the builder to generate the engine. The engine can then be used for TensorRT inference.

Note: whether the model is written directly with the TensorRT API or converted from ONNX, you need to construct a builder and then generate the .engine file from it. If you look closely at the .onnx—>.engine conversion code, it is really just reading the ONNX parameters, constructing a builder, and generating the engine.

Aside: there are several ways to go from .onnx to .engine, for example NVIDIA's official trtexec tool, or writing the steps out in code yourself, such as noahmr's code at https://github.com/noahmr/yolov5-tensorrt/tree/main/examples/builder, which is worth a look if you are interested. (If anyone wants it, I can also write an article analyzing the steps in onnx—>builder—>engine and what each parameter means.)

The step in which the builder generates the engine is called serialization (serialize), and loading the engine back for TensorRT inference is called deserialization (deserialize). The main purpose of serialization is to convert the model into a form TensorRT can run, and the converted engine can then be saved for later inference. (Serialization usually takes a while, often several minutes, so in practice we generate and save the engine file once, and simply load it at inference time. Different machines generally need to generate their own engine file locally, since the hardware and library locations, and therefore the optimizations, may differ.)
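
Below is a minimal sketch of that flow, not the tensorrtx code itself. It assumes TensorRT 8.x with the ONNX parser available, uses a made-up model path, and strips out all error handling; the point is just to show where the builder, the config, and serialize() fit.

#include <cassert>
#include <fstream>
#include <iostream>
#include "NvInfer.h"
#include "NvOnnxParser.h"

// minimal logger required by the builder
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // 1. builder + explicit-batch network definition
    auto builder = nvinfer1::createInferBuilder(gLogger);
    const auto flag = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(flag);

    // 2. fill the network from an ONNX file (path is a placeholder)
    auto parser = nvonnxparser::createParser(*network, gLogger);
    assert(parser->parseFromFile("yolov5s.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)));

    // 3. builder config: scratch memory the builder may use, precision flags, etc.
    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30);  // 1 GiB

    // 4. build the engine, serialize it, and save the bytes as a .engine file
    auto engine = builder->buildEngineWithConfig(*network, *config);
    auto serialized = engine->serialize();    // this is the serialize step discussed above
    std::ofstream out("yolov5s.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
    return 0;
}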

First, let's look at the structure and define a YoloLayerPlugin class, starting with the two big pieces: the constructor and destructor.

class API YoloLayerPlugin : public IPluginV2IOExt
{
public:
    YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel);
    //YoloLayerPlugin(const void* data, size_t length);
    ~YoloLayerPlugin();
    /*...*/
};

Then we override the IPluginV2IOExt class step by step; we need to override the virtual functions in this class. From top to bottom we will work through IPluginV2IOExt–>IPluginV2Ext–>IPluginV2 piece by piece. Below are the functions we must override:

getPluginType: returns the plugin's name.

getPluginVersion: returns the plugin's version.

getNbOutputs: returns the number of outputs of the layer.

getOutputDimensions: returns the output dimensions of the layer.

initialize: initializes the layer for execution; called when the engine is created.

terminate: releases the resources acquired during initialization; called when the engine is destroyed.

getWorkspaceSize: returns the workspace size the layer needs. Unlike getSerializationSize, this is the extra scratch space allocated to the layer on top of its data and parameters.

enqueue: executes the layer's sequence of operations; returns 0 on success and non-zero on failure. (All of the layer's data processing happens inside this function.)

getSerializationSize: returns the size of the serialization buffer, i.e. the space set aside for the model's parameters; a detailed explanation and example of this size calculation appears later.

serialize: serializes the layer.

destroy: releases the plugin's resources.

clone: copies the plugin or layer. In a project we often reuse a plugin many times, e.g. modules such as conv.

setPluginNamespace: sets the plugin's namespace.

getPluginNamespace: returns the plugin's namespace.

getOutputDataType: returns the output data type.

isOutputBroadcastAcrossBatch: returns true if the output tensor is broadcast across the batch.

canBroadcastInputAcrossBatch: returns true if the plugin can use an input that is broadcast across the batch without replication.

attachToContext: attaches the plugin to an execution context and grants the plugin access to some context resources.

detachFromContext: detaches the plugin object from its execution context.

configurePlugin: configures the layer. It communicates the number of inputs and outputs, the dimensions and data types of all inputs and outputs, broadcast information for all inputs and outputs, the chosen plugin format, and the maximum batch size. At this point the plugin sets up its internal state and selects the most appropriate algorithm and data structures for the given configuration.

supportsFormatCombination: returns whether the plugin supports the given data type and format combination at the queried input/output position.

class IPluginV2

    virtual AsciiChar const* getPluginType() const noexcept = 0;
    virtual AsciiChar const* getPluginVersion() const noexcept = 0;
    virtual int32_t getNbOutputs() const noexcept = 0;
    virtual Dims getOutputDimensions(int32_t index, Dims const* inputs, int32_t nbInputDims) noexcept = 0;
    virtual bool supportsFormat(DataType type, PluginFormat format) const noexcept = 0;
    virtual int32_t initialize() noexcept = 0;
    virtual void terminate() noexcept = 0;
    virtual size_t getWorkspaceSize(int32_t maxBatchSize) const noexcept = 0;
    virtual int32_t enqueue(int32_t batchSize, void const* const* inputs, void* const* outputs, void* workspace,
        cudaStream_t stream) noexcept
        = 0;
    virtual size_t getSerializationSize() const noexcept = 0;
    virtual void serialize(void* buffer) const noexcept = 0;
    virtual void destroy() noexcept = 0;
    virtual IPluginV2* clone() const noexcept = 0;
    virtual void setPluginNamespace(AsciiChar const* pluginNamespace) noexcept = 0;
    virtual AsciiChar const* getPluginNamespace() const noexcept = 0;

class IPluginV2Ext : public IPluginV2

    virtual nvinfer1::DataType getOutputDataType(
        int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept = 0;
    virtual bool isOutputBroadcastAcrossBatch(
        int32_t outputIndex, bool const* inputIsBroadcasted, int32_t nbInputs) const noexcept = 0;
    virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
    virtual void attachToContext(cudnnContext* /*cudnn*/, cublasContext* /*cublas*/, IGpuAllocator* /*allocator*/) noexcept {}
    virtual void detachFromContext() noexcept {}

class IPluginV2IOExt : public IPluginV2Ext
    virtual void configurePlugin(
        PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out, int32_t nbOutput) noexcept = 0;
    virtual bool supportsFormatCombination(
        int32_t pos, PluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) const noexcept = 0;

OK, that's not too many (/(ㄒoㄒ)/~~). Some of these member functions only serve simple query purposes, and at this stage we can get away with the most minimal implementation: an empty pair of braces.

Let's look at how the yolov5 layer is overridden in https://github.com/wang-xinyu/tensorrtx.

class API YoloLayerPlugin : public IPluginV2IOExt

    YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel);
    YoloLayerPlugin(const void* data, size_t length);
    ~YoloLayerPlugin();

    int getNbOutputs() const TRT_NOEXCEPT override
    {
        return 1;
    }

    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;

    int initialize() TRT_NOEXCEPT override;

    virtual void terminate() TRT_NOEXCEPT override {};

    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }

    virtual int enqueue(int batchSize, const void* const* inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;

    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;

    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;

    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {
        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;
    }

    const char* getPluginType() const TRT_NOEXCEPT override;

    const char* getPluginVersion() const TRT_NOEXCEPT override;

    void destroy() TRT_NOEXCEPT override;

    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;

    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;

    const char* getPluginNamespace() const TRT_NOEXCEPT override;

    DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;

    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;

    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;

    void attachToContext(
        cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;

    void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;

    void detachFromContext() TRT_NOEXCEPT override;

That's a lot of dense code, but the only function that does real work is enqueue. Let's compare these declarations against the implementations in the yololayer.cu file.

First, a quick look at the YoloKernel struct. It holds anchor-related properties such as the corresponding feature-map size and the anchor sizes, for use by the YoloLayerPlugin class.

struct YoloKernel
{
    int width;
    int height;
    float anchors[CHECK_COUNT * 2];
};

YoloLayerPlugin(int ...): sets the number of detection classes classCount, the network input size netWidth and netHeight, the maximum number of output detections maxOut, and vYoloKernel, the YoloKernel information for the three FPN scales, i.e. the feature-map size and the corresponding anchor sizes. A hypothetical example of what those three entries might look like is sketched below.
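
For illustration only (these exact values are assumptions, not taken from the article): with a 640x640 input and the default yolov5 anchors, the three entries passed in vYoloKernel would look roughly like this, with CHECK_COUNT == 3 anchor pairs per scale:

// hypothetical example: the three FPN scales for a 640x640 input
static constexpr Yolo::YoloKernel yoloP3{ 80, 80, {10, 13, 16, 30, 33, 23} };       // stride 8
static constexpr Yolo::YoloKernel yoloP4{ 40, 40, {30, 61, 62, 45, 59, 119} };      // stride 16
static constexpr Yolo::YoloKernel yoloP5{ 20, 20, {116, 90, 156, 198, 373, 326} };  // stride 32

std::vector<Yolo::YoloKernel> vYoloKernel{ yoloP3, yoloP4, yoloP5 };
YoloLayerPlugin plugin(80, 640, 640, 1000, vYoloKernel);  // 80 classes, up to 1000 detections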

cudaMallocHost(void** ptr, size_t size): allocates page-locked (non-pageable) host memory. This call works together with the cudaMalloc calls a couple of lines further down: cudaMallocHost allocates a pinned host array of pointers, and each entry is then pointed at a buffer allocated on the device.

YoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut,
                                     const std::vector<Yolo::YoloKernel> &vYoloKernel)
    {
        mClassCount = classCount;
        mYoloV5NetWidth = netWidth;
        mYoloV5NetHeight = netHeight;
        mMaxOutObject = maxOut;
        mYoloKernel = vYoloKernel;
        mKernelCount = vYoloKernel.size();
        
        //allocate device memory for each YoloKernel's anchors
        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void *)));
        size_t AnchorLen = sizeof(float) * CHECK_COUNT * 2;
        for (int ii = 0; ii < mKernelCount; ii++)
        {
            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));
            const auto &yolo = mYoloKernel[ii];
            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));
        }
    }
YoloLayerPlugin::~YoloLayerPlugin(): frees the corresponding device memory and the pinned (page-locked) host memory for the YoloKernels.

    YoloLayerPlugin::~YoloLayerPlugin()
    {
        for (int ii = 0; ii < mKernelCount; ii++)
        {
            CUDA_CHECK(cudaFree(mAnchor[ii]));
        }
        CUDA_CHECK(cudaFreeHost(mAnchor));
    }

YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length): reads the parameters back from a serialized buffer. Same idea as before: first allocate the pinned host memory, then allocate and point to the device buffers. (The serialization format is analyzed below, which also explains the read(d, ...) calls; the helpers themselves are sketched right after the constructor.)

YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)
    {
        using namespace Tn;
        const char *d = reinterpret_cast<const char *>(data), *a = d;
        read(d, mClassCount);
        read(d, mThreadCount);
        read(d, mKernelCount);
        read(d, mYoloV5NetWidth);
        read(d, mYoloV5NetHeight);
        read(d, mMaxOutObject);
        mYoloKernel.resize(mKernelCount);
        auto kernelSize = mKernelCount * sizeof(YoloKernel);
        memcpy(mYoloKernel.data(), d, kernelSize);
        d += kernelSize;
        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));
        size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;
        for (int ii = 0; ii < mKernelCount; ii++)
        {
            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));
            const auto& yolo = mYoloKernel[ii];
            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));
        }
        assert(d == a + length);
    }
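
The read(d, x) calls above (and the write(d, x) calls in serialize below) come from namespace Tn in the tensorrtx utilities; they are essentially just typed copies that advance the buffer pointer, roughly:

namespace Tn
{
    // copy one value out of the buffer and advance the read pointer
    template <typename T>
    void read(const char*& buffer, T& val)
    {
        val = *reinterpret_cast<const T*>(buffer);
        buffer += sizeof(T);
    }

    // copy one value into the buffer and advance the write pointer
    template <typename T>
    void write(char*& buffer, const T& val)
    {
        *reinterpret_cast<T*>(buffer) = val;
        buffer += sizeof(T);
    }
}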

void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT: serializes the plugin. This code writes the yololayer-related information into the buffer in a fixed format.

In TensorRT code you will usually see a line like engine->serialize(); behind that API, the serialize function of each plugin is what gets called.

void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT
{
    using namespace Tn;
    char* d = static_cast<char*>(buffer), *a = d;
    write(d, mClassCount);
    write(d, mThreadCount);
    write(d, mKernelCount);
    write(d, mYoloV5NetWidth);
    write(d, mYoloV5NetHeight);
    write(d, mMaxOutObject);
    auto kernelSize = mKernelCount * sizeof(YoloKernel);
    memcpy(d, mYoloKernel.data(), kernelSize);
    d += kernelSize;

    assert(d == a + getSerializationSize());
}

At this point some readers may wonder: wait, Detect in yolov5 contains conv layers, so why don't we see any of those model parameters being serialized here? Below is the original yolov5 code:

class Detect(nn.Module):
    stride = None  # strides computed during build
    onnx_dynamic = False  # ONNX export parameter

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
        super().__init__()
        ......
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
        self.inplace = inplace  # use in-place ops (e.g. slice assignment)

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            ......

        return x if self.training else (torch.cat(z, 1), x)

In tensorrtx the author actually splits Detect into two parts:

1. The convolution over each feature map, IConvolutionLayer* det, corresponding to self.m—>conv at the three scales in the original yolov5 code.

2. Everything else is handled by YoloLayerPlugin.

The actual code is shown below (a simplified sketch of what addYoLoLayer itself does follows after it):

IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.0.weight"], weightMap["model.24.m.0.bias"]);
IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.1.weight"], weightMap["model.24.m.1.bias"]);
IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.2.weight"], weightMap["model.24.m.2.bias"]);
auto yolo = addYoLoLayer(network, weightMap, "model.24", std::vector<IConvolutionLayer*>{det0, det1, det2});
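
addYoLoLayer is defined elsewhere in tensorrtx; the following is only a simplified, hypothetical sketch of what such a helper has to do, not the actual implementation. Constants such as Yolo::CLASS_NUM, Yolo::INPUT_W, Yolo::INPUT_H and Yolo::MAX_OUTPUT_BBOX_COUNT are assumed to come from the project's yololayer.h. The essence: collect the conv outputs, construct the plugin, and register it with addPluginV2.

// Simplified sketch of an addYoLoLayer-style helper (not the tensorrtx code).
static nvinfer1::IPluginV2Layer* addYoloPlugin(nvinfer1::INetworkDefinition* network,
                                               std::vector<nvinfer1::IConvolutionLayer*> dets,
                                               const std::vector<Yolo::YoloKernel>& kernels)
{
    // the three conv outputs become the plugin's input tensors
    std::vector<nvinfer1::ITensor*> inputs;
    for (auto* det : dets)
        inputs.push_back(det->getOutput(0));

    // construct the plugin with the parameters described earlier
    auto* plugin = new YoloLayerPlugin(Yolo::CLASS_NUM, Yolo::INPUT_W, Yolo::INPUT_H,
                                       Yolo::MAX_OUTPUT_BBOX_COUNT, kernels);

    // add the plugin to the network as a layer
    return network->addPluginV2(inputs.data(), static_cast<int>(inputs.size()), *plugin);
}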

With that in mind, the functions that follow are easy to understand.

Let me insert one more piece of code here:

struct alignas(float) Detection {
    //center_x center_y w h
    float bbox[LOCATIONS];
    float conf;  // bbox_conf * cls_conf
    float class_id;
};

Detection describes the attributes of a single detection result. The per-batch output layout built from it is summarized in a comment after getOutputDimensions below.

size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT
{
    return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount) +
           sizeof(Yolo::YoloKernel) * mYoloKernel.size() + sizeof(mYoloV5NetWidth) + sizeof(mYoloV5NetHeight) +
           sizeof(mMaxOutObject);
}

int YoloLayerPlugin::initialize() TRT_NOEXCEPT
{
    return 0;
}

Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims *inputs, int nbInputDims) TRT_NOEXCEPT
{
    //output the result to channel
    int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);

    return Dims3(totalsize + 1, 1, 1);
}
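
// A quick worked example of the size above (assumed values, not from the article):
// with LOCATIONS == 4, Detection = bbox[4] + conf + class_id = 6 floats,
// and say mMaxOutObject == 1000. Then each batch item's output holds
//   1 + 1000 * sizeof(Detection) / sizeof(float) = 1 + 1000 * 6 = 6001 floats:
// float[0] is the detection counter bumped by atomicAdd in the kernel, followed by
// up to 1000 Detection records. forwardGpu's outputElem below is this same number.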

// Set plugin namespace
void YoloLayerPlugin::setPluginNamespace(const char *pluginNamespace) TRT_NOEXCEPT
{
    mPluginNamespace = pluginNamespace;
}

const char *YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT
{
    return mPluginNamespace;
}

// Return the DataType of the plugin output at the requested index
DataType
YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType *inputTypes, int nbInputs) const TRT_NOEXCEPT
{
    return DataType::kFLOAT;
}

// Return true if output tensor is broadcast across a batch.
bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool *inputIsBroadcasted,
                                                   int nbInputs) const TRT_NOEXCEPT
{
    return false;
}

// Return true if plugin can use input that is broadcast across batch without replication.
bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT
{
    return false;
}

void YoloLayerPlugin::configurePlugin(const PluginTensorDesc *in, int nbInput, const PluginTensorDesc *out,
                                      int nbOutput) TRT_NOEXCEPT
{
}

// Attach the plugin object to an execution context and grant the plugin the access to some context resource.
void YoloLayerPlugin::attachToContext(cudnnContext *cudnnContext, cublasContext *cublasContext,
                                      IGpuAllocator *gpuAllocator) TRT_NOEXCEPT
{
}

// Detach the plugin object from its execution context.
void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT
{}

const char *YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT
{
    return "YoloLayer_TRT";
}

const char *YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT
{
    return "1";
}

void YoloLayerPlugin::destroy() TRT_NOEXCEPT
{
    delete this;
}

// Clone the plugin
IPluginV2IOExt *YoloLayerPlugin::clone() const TRT_NOEXCEPT
{
    YoloLayerPlugin *p = new YoloLayerPlugin(mClassCount, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject,
                                             mYoloKernel);
    p->setPluginNamespace(mPluginNamespace);
    return p;
}

OK, as we can see, most of the functions above only need trivial overrides of the virtual functions. The real work is in enqueue:

int YoloLayerPlugin::enqueue(int batchSize, const void *const *inputs, void *TRT_CONST_ENQUEUE *outputs,
                             void *workspace, cudaStream_t stream) TRT_NOEXCEPT
{
    forwardGpu((const float *const *) inputs, (float *) outputs[0], stream, batchSize);
    return 0;
}

After all that setup we finally reach the core. Let's see how the CUDA basics from the previous sections are used to write the actual computation. Even the most ornate music can still be broken down into the five notes of the pentatonic scale.

forwardGpu:

    void YoloLayerPlugin::forwardGpu(const float *const *inputs, float *output, cudaStream_t stream, int batchSize)
    {
        int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);
        for (int idx = 0; idx < batchSize; ++idx)
        {
            CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));
        }
        int numElem = 0;
        for (unsigned int i = 0; i < mYoloKernel.size(); ++i)
        {
            const auto &yolo = mYoloKernel[i];
            numElem = yolo.width * yolo.height * batchSize;
            if (numElem < mThreadCount) mThreadCount = numElem;

            //printf("Net: %d  %d \n", mYoloV5NetWidth, mYoloV5NetHeight);
            CalDetection <<< (numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream >>>
            (inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float *) mAnchor[i], mClassCount, outputElem);
        }
    }

1. First, cudaMemsetAsync asynchronously zeroes the first float of each batch item's output buffer (that float is the detection counter). The call is issued on the given stream so that it stays correctly ordered with the kernel launches that follow.

2. Then CalDetection is launched to process the feature map produced by each FPN scale.

Here, the <<< >>> syntax used to launch CalDetection tells us it is a kernel function. First, using

numElem = yolo.width * yolo.height * batchSize;

it computes the total number of elements to process. In the code mThreadCount = 256 (set by hand; any multiple of 32 will do), and the size of the first launch dimension (the number of blocks) is then computed with the ceiling-division formula we covered in the first few chapters. A quick worked example follows.
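
With assumed numbers (not taken from the article), for the stride-8 scale of a 640x640 input and batchSize = 1:

int numElem = 80 * 80 * 1;                        // 6400 grid cells to process
int threads = 256;                                // mThreadCount, any multiple of 32
int blocks  = (numElem + threads - 1) / threads;  // ceil(6400 / 256) = 25 blocks
// so for this scale the launch is effectively CalDetection<<<25, 256, 0, stream>>>(...)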

OK, to review the relevant concept: because CalDetection is called from the CPU and executed on the device, it is declared with the __global__ qualifier.

__global__ void CalDetection(const float *input, float *output, int noElements,
                                 const int netwidth, const int netheight, int maxoutobject, int yoloWidth,
                                 int yoloHeight, const float anchors[CHECK_COUNT * 2], int classes, int outputElem)
    {

        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx >= noElements) return;

        int total_grid = yoloWidth * yoloHeight;
        int bnIdx = idx / total_grid;
        idx = idx - total_grid * bnIdx;
        int info_len_i = 5 + classes;
        const float *curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);

        for (int k = 0; k < CHECK_COUNT; ++k)
        {
            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);
            if (box_prob < IGNORE_THRESH) continue;
            int class_id = 0;
            float max_cls_prob = 0.0;
            for (int i = 5; i < info_len_i; ++i)
            {
                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);
                if (p > max_cls_prob)
                {
                    max_cls_prob = p;
                    class_id = i - 5;
                }
            }
            float *res_count = output + bnIdx * outputElem;
            int count = (int) atomicAdd(res_count, 1);
            if (count >= maxoutobject) return;
            char *data = (char *) res_count + sizeof(float) + count * sizeof(Detection);
            Detection *det = (Detection *) (data);

            int row = idx / yoloWidth;
            int col = idx % yoloWidth;

            //Location
            // pytorch:
            //  y = x[i].sigmoid()
            //  y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy
            //  y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
            //  X: (sigmoid(tx) + cx)/FeaturemapW *  netwidth
            det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) *
                           netwidth / yoloWidth;
            det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) *
                           netheight / yoloHeight;

            // W: (Pw * e^tw) / FeaturemapW * netwidth
            // v5: https://github.com/ultralytics/yolov5/issues/471
            det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);
            det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];
            det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);
            det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];
            det->conf = box_prob * max_cls_prob;
            det->class_id = class_id;
        }
    }

1. First obtain the current thread ID, then locate the element this thread is responsible for. (Here you can also see that, at its core, TensorRT is no different from the simple matrix examples in the first few sections; the operations and layout are just a tiny bit more involved. Each thread is still assigned its own elements to work on.)

2. Then the x, y, w, h decoding is carried out; for the details you can compare against Detect in the original yolov5 code, so I won't dwell on it here. A short note on the indexing used above follows below.
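
One note on the indexing, with assumed numbers: for each batch item the input feature map is stored contiguously as (anchor, channel, row, col), so a single access looks like this:

// curInput[idx + k * info_len_i * total_grid + c * total_grid]
//   idx : which grid cell this thread handles (row * yoloWidth + col)
//   k   : which of the CHECK_COUNT anchors
//   c   : which channel (0-3 box, 4 objectness, 5.. class scores)
// Example with an 80x80 map (total_grid = 6400), 80 classes (info_len_i = 85),
// anchor k = 1, objectness channel c = 4, cell idx = 100:
//   offset = 100 + 1 * 85 * 6400 + 4 * 6400 = 569700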

__device__ float Logist(float data)
{ return 1.0f / (1.0f + expf(-data)); };

Having walked through the whole flow, writing with CUDA doesn't feel so mysterious anymore, does it?