Initial setup

Check the configuration

cat /etc/issue
uname -a

Set up the Sogou input method

sudo apt install libqt5qml5 libqt5quick5 libqt5quickwidgets5 qml-module-qtquick2

sudo apt install libgsettings-qt1

Configure the mirror source

sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup

sudo gedit /etc/apt/sources.list
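
For reference, a minimal sketch of what the file can contain when using the Tsinghua (TUNA) mirror; the Jetson is arm64, which pulls from the ubuntu-ports tree rather than the x86 ubuntu tree, so double-check against the mirror's help page:

```bash
# Overwrite sources.list with TUNA ubuntu-ports entries for Ubuntu 18.04 (bionic) on arm64
sudo tee /etc/apt/sources.list > /dev/null <<'EOF'
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-security main restricted universe multiverse
EOF
```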

sudo apt-get update
sudo apt-get upgrade
sudo apt install net-tools

Anaconda

Installer downloads: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/ and https://github.com/Archiconda/build-tools/releases (Archiconda, for aarch64)
Change the default Python interpreter to Python 3.7 on Ubuntu 18.04:
bash Anaconda3-2021.04-Linux-aarch64.sh
gedit ~/.bashrc
export PATH="/home/nvidia/anaconda3/bin:$PATH"
source ~/.bashrc

Versions after install:

numpy 1.20.1
python 3.8.8
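
A quick sanity check that the installer and the PATH edit took effect (a sketch, assuming the default /home/nvidia/anaconda3 prefix from above):

```bash
source ~/.bashrc
which conda && conda --version        # should resolve inside anaconda3/bin
python -V                             # Anaconda's Python
python -c "import numpy; print(numpy.__version__)"
```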

Miniconda installation

[Miniconda docs](https://docs.conda.io/en/latest/miniconda.html)

  • conda command not found?
  • It is a PATH problem:

  • export PATH="/root/miniconda3/bin:$PATH"
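
To make that fix stick across shells, append it to ~/.bashrc (sketch; adjust /root/miniconda3 to the actual install prefix):

```bash
echo 'export PATH="/root/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
conda --version   # should now be found
```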

pytorch

  • Install guide: https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048
  • torchvision source: https://gitcode.net/mirrors/pytorch/vision?utm_source=github_accelerator
  • conda env list
  • conda create -n pytorch python=3.7
  • source activate pytorch

  • conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
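
After creating the environment, a minimal check that the intended interpreter and packages landed (sketch):

```bash
source activate pytorch
python -V                                    # expect Python 3.7.x
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
```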


pip mirror

pwd
mkdir .pip
cd .pip
touch pip.conf
gedit pip.conf

[global]
timeout = 6000
index-url = https://pypi.doubanio.com/simple
trusted-host=pypi.doubanio.com
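
Equivalently, a recent pip can write the same settings without hand-editing the file (sketch; pip config requires pip >= 10, so upgrade pip first if needed):

```bash
pip3 config set global.index-url https://pypi.doubanio.com/simple
pip3 config set global.trusted-host pypi.doubanio.com
pip3 install numpy   # should now pull through the mirror
```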

pycharm

Download from https://www.jetbrains.com/pycharm/download/other.html, then:

tar -xf pycharm-community-2022.3.1.tar.gz

sh ./   # run the launcher script (bin/pycharm.sh) inside the extracted folder

vscode

sudo dpkg -i code_1.74.3-1673284080_arm64.deb

  • Issue: VS Code cannot load Docker:

Display

java

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

cuda

lspci | grep -i nvidia
nvidia-smi    # the CPU reports as: ARMv8 Processor rev 0 (v8l) × 6
sudo apt-get update
sudo apt-get upgrade
sudo su
apt-get install gcc-7 g++-7
ls /usr/bin/gcc*
ls /usr/bin/g++*

CUDA Toolkit 10.2 Download
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/sbsa/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_arm64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

NVIDIA account password: GXgx661625

Running sudo apt-get -y install cuda failed with:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-5 (>= 11.5.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
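
One way past this unmet-dependency error is to ask for the repo's exact versioned meta-package instead of the floating cuda meta-package, which tracks the newest release (a hedged sketch; the cuda-11-1 name matches the local 11.1 repo installed above):

```bash
apt-cache policy cuda cuda-11-1     # see which versions apt can actually reach
sudo apt-get -y install cuda-11-1   # pin to the 11.1 meta-package from the local repo
sudo apt-get install -f             # if packages are still held/broken, try to repair them
```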

Check the versions of the acceleration packages already installed on the NX

  • CUDA version: cat /usr/local/cuda/version.txt
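
The other acceleration packages can be checked in the same spirit (a sketch; on newer JetPack releases version.txt is gone and nvcc --version is the way):

```bash
cat /usr/local/cuda/version.txt    # CUDA (older JetPack)
nvcc --version                     # CUDA (also works on newer installs)
dpkg -l | grep -i cudnn            # cuDNN
dpkg -l | grep -i tensorrt         # TensorRT
python3 -c "import cv2; print(cv2.__version__)"   # OpenCV bundled with JetPack
```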

Flashing

ssh service

Issues

  • SDK Manager flashing errors
  • sudo apt update returns errors

Flashing

Dual boot

Host system image

  • The NX comes in two versions; the SD-card version can be flashed directly from the card
  • The one we bought can only be flashed over a wired (USB) connection from a host PC
Create a bootable USB drive
  • Open the downloaded 【ubuntu-18.04】 folder
  • Right-click the 【UltraISO】 archive and choose 【Extract to UltraISO】
  • Open the extracted folder, right-click 【UltraISO_9.7.1.3519_cn】 and choose 【Run as administrator】
  • Tick 【I accept】, click 【Next】, and keep clicking 【Next】
  • Click 【Install】
  • Click 【Enter registration code】
  • Enter the name 【Guoxiao】 and the code 【A06C-83A7-701D-6CFC】, then click 【OK】
  • Insert a USB drive (4 GB or larger) and double-click the desktop 【UltraISO】 icon to start the program
  • Click 【File】 and choose 【Open】
  • Select 【ubuntu-18.04.1-desktop-amd64】 from the downloaded 【ubuntu-18.04】 folder
  • Click 【Bootable】 and choose 【Write Disk Image】
  • Click 【Write】
  • Click 【Yes】
  • When burning succeeds, click 【X】 in the top-right corner to exit
Disk partitioning
  • Open the downloaded 【ubuntu-18.04】 folder, right-click the 【DiskGenius】 archive, and choose 【Extract to DiskGenius】
  • Open the extracted folder and double-click 【DiskGeniusLoad】
  • Allocate capacity for the new partition (at least 20 GB; allocating more to the Linux system is recommended), then click 【Start】

Flashing the NX requires downloading the Jetson OS and the SDK, which take at least 20 GB.

  • Click 【Yes】
  • When partitioning completes, click 【Done】
Install the dual system on the laptop
  • Enter the PC's BIOS and set the USB drive as the first boot device (use this when the USB drive cannot be selected at boot directly). The BIOS key and menus differ by machine (search for your model's instructions); in general, set the USB drive first in the boot order, press F10 to save, and on restart the machine boots from USB
  • Choose English during install → avoids path-encoding problems

Flashing with SDK Manager

See the screenshots and the documentation.
Pitfalls

  • Flash JetPack 5.1 or later if possible; it pairs with an Ubuntu 20.04 system
  • An older system limits your PyTorch options (people building PyTorch find far more resources and prebuilt wheels for newer versions)
  • Mount the SSD right after flashing so the onboard eMMC storage does not run out (see the sketch after this list)
  • If the host cannot detect the NX:
  • Swap the USB cable
  • Unplug and replug
  • Put the board into recovery (flashing) mode by shorting the jumper pins
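
A sketch of mounting an NVMe SSD after flashing; the device name /dev/nvme0n1 and mount point /ssd are assumptions, so confirm with lsblk first (mkfs erases the drive):

```bash
lsblk                                    # confirm the SSD's device name before formatting
sudo mkfs.ext4 /dev/nvme0n1              # WARNING: wipes the drive
sudo mkdir -p /ssd
sudo mount /dev/nvme0n1 /ssd
echo '/dev/nvme0n1 /ssd ext4 defaults 0 2' | sudo tee -a /etc/fstab   # remount on every boot
```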

torch GPU installation

Pick PyTorch according to JetPack

  • Always pick the official build matching your JetPack version
  • On 18.04 the torch chosen here is 1.10.0
  • A matching torchvision is hard to find for 18.04 systems, so it usually has to be built from source
  • Mind the version you download; the build steps (from the PyTorch-for-Jetson thread) are:

    sudo apt-get install libjpeg-dev zlib1g-dev libpython3-dev libavcodec-dev libavformat-dev libswscale-dev
    git clone --branch <version> https://github.com/pytorch/vision torchvision   # see the thread for which torchvision version to download
    cd torchvision
    export BUILD_VERSION=0.x.0   # where 0.x.0 is the torchvision version
    python3 setup.py install --user
    cd ../   # attempting to load torchvision from the build dir will result in an import error
    pip install 'pillow<7'   # always needed for Python 2.7; not needed for torchvision v0.5.0+ with Python 3
  • Verify that CUDA can be called correctly:
  • /usr/bin/python3
  • import torch
  • print(torch.cuda.is_available())
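
A slightly fuller version of the same check, run non-interactively with the system interpreter (sketch):

```bash
/usr/bin/python3 - <<'EOF'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
EOF
```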

Per PyTorch's requirements, the Python version to compile against here is 3.6.

Fixing the numpy 1.19.5 issue

Usage after boot

  • Login and sudo password: 661625
  • Check versions: nvidia-smi, nvcc -V
  • If nvcc -V fails (usually CUDA is missing from the environment variables):
  • sudo vim ~/.bashrc   # edit the environment variables with Vim
  • append the following two lines at the end
  • export PATH=/usr/local/cuda/bin:$PATH
  • export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  • source ~/.bashrc   # save, exit, and re-run nvcc -V to see the CUDA version
  • After boot, verify the expected stack is installed (cuda, cudnn, anaconda, python, trt, pytorch); the jetson-stats tool (jtop) can control the fan and power modes:
# Install pip3
sudo apt install python3-pip

# Install the jtop tool
sudo -H pip3 install -U jetson-stats

# Launch jtop
sudo jtop
  • List this machine's pip packages: /usr/bin/pip3 list
  • Install and upgrade pip3:
  • sudo apt install python3-pip python3-dev
  • python3 -m pip install --upgrade pip   # upgrade pip
  • List the environments Anaconda created: conda env list
  • Activate an Anaconda environment: source activate pytorch1.9
  • Change the priority of Anaconda's python: vim ~/.bashrc (or use update-alternatives, sketched after this list)
  • Reload the config file: source ~/.bashrc
  • Edit a read-only file: sudo gedit [file]
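
If the goal of the priority change is to control which python resolves first system-wide, update-alternatives is an option besides editing ~/.bashrc (a sketch; the interpreter paths are assumptions, check them with which -a python3):

```bash
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.6 1
sudo update-alternatives --install /usr/bin/python python /home/nvidia/anaconda3/bin/python 2
sudo update-alternatives --config python   # choose interactively; highest priority wins by default
```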

TRT conversion and inference workflow

onnx2trt format conversion

/usr/src/tensorrt/bin/trtexec --onnx=./data/strawberry_v5s.onnx --saveEngine=./data/strawberry_v5s.trt --fp16
/usr/src/tensorrt/bin/trtexec --loadEngine=./data/strawberry_v5s.trt --shapes=images:1x3x768x1024

Build

- Enter the normTRT folder: cd normTRT/
- ls
- make clean
- make   # compiles and produces the yolo binary

Run inference

./yolo data/straberry_v5s.trt -i data/1.jpg

Environment notes

System environment

  • python2.7
  • python3.6
    - invoke with: /usr/bin/python3
  • pytorch:1.10.0
  • torchvision:0.11.1
  • cuda:10.2

Anaconda environments

base
  • python3.9
pytorch
  • python3.7
  • CPU-only torch 1.12.1
  • torchvision:0.13.1
paddlepaddle
  • python3.8.8

Remote access

  • ssh guoxiao@<ip-address>
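
The same credentials work for copying files to the board (sketch; <ip-address> stays a placeholder, and the file name reuses the model from the trtexec step):

```bash
ssh guoxiao@<ip-address>                               # interactive shell on the NX
scp strawberry_v5s.onnx guoxiao@<ip-address>:~/data/   # push a model over SSH
```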

Deployment

# TensorRT


## onnx to trt
```bash
/usr/src/tensorrt/bin/trtexec --onnx=./data/strawberry_v5s.onnx --saveEngine=./data/strawberry_v5s.trt --fp16
/usr/src/tensorrt/bin/trtexec --loadEngine=./data/strawberry_v5s.trt --shapes=images:1x3x768x1024

```

inference

  • ./yolo data/straberry_v5s.trt -i data/1.jpg



/usr/bin/python3 gen_wts.py -w yolov5s.pt -o yolov5s.wts
/usr/bin/python3 gen_wts.py -w straberry.pt -o straberry.wts

sudo ./yolov5 -s yolov5s.wts yolov5s.engine s
sudo ./yolov5 -s straberry.wts straberry.engine s

sudo ./yolov5 -v straberry.engine

## Reference source code

[wang-xinyu/tensorrtx](https://github.com/wang-xinyu/tensorrtx)

- Repository:

  ```bash
  $ git clone https://github.com/wang-xinyu/tensorrtx.git
  ```

In particular, different YOLO versions require matching versions of tensorrtx.
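
tensorrtx tags a snapshot per YOLO release, so matching the versions looks roughly like this (the tag name is an assumption; list the real ones with git tag):

```bash
git clone https://github.com/wang-xinyu/tensorrtx.git
cd tensorrtx
git tag                    # list the available release tags
git checkout yolov5-v6.0   # assumed tag matching the YOLOv5 v6.0 code
```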

Environment

  • TensorRT
  • CUDA
  • CUDNN
  • OpenCV

Key source configuration

  • Input size and class definitions: the yololayer.h file
  • INT8/FP16/FP32 selection: yolov5.cpp
  • NMS threshold, bbox confidence threshold, and batch size: yolov5.cpp
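
A quick way to locate those knobs before editing; the constant names below are the ones the tensorrtx sources typically use, so confirm them in your checkout:

```bash
grep -n "CLASS_NUM\|INPUT_H\|INPUT_W" tensorrtx/yolov5/yololayer.h
grep -n "USE_FP16\|NMS_THRESH\|CONF_THRESH\|BATCH_SIZE" tensorrtx/yolov5/yolov5.cpp
```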

Run

Convert the .pt file directly into a .wts file

  • The tensorrtx project builds the model layer by layer with TensorRT's layer API
  • The model weights are loaded through a custom mechanism
  • gen_wts.py saves the model's .pt weights as a .wts file (a custom weight format) that is convenient to load later
  • Copy the generated .wts file into the matching yolo folder
python gen_wts.py -w weights/yolov5s.pt -o yolov5s.wts   # convert the .pt weights directly into a .wts file

build

  • Load the .wts file
  • Generate the engine file via TensorRT

Change CLASS_NUM to the strawberry class count

  • Compile and generate the engine file:
cd tensorrtx/yolov5
mkdir build && cd build
cp ../yolov5s.wts ./
cmake .. && make
sudo ./yolov5 -s yolov5s.wts yolov5s.engine s

run

Run model inference with the engine file generated by TensorRT.

sudo ./yolov5 -d yolov5s.engine ../images/  # run inference on every file in the folder

Real-time detection from a camera

  • tensorrtx does not officially ship code for reading from a camera
  • Delete the previously generated .engine file
  • Replace yolov5-6.0.cpp with the following code
#include <iostream>
#include <chrono>
#include "cuda_utils.h"
#include "logging.h"
#include "common.hpp"
#include "utils.h"
#include "calibrator.h"

#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32
#define DEVICE 0  // GPU id
#define NMS_THRESH 0.4
#define CONF_THRESH 0.5
#define BATCH_SIZE 1

// stuff we know about the network and the input/output blobs
static const int INPUT_H = Yolo::INPUT_H;
static const int INPUT_W = Yolo::INPUT_W;
static const int CLASS_NUM = Yolo::CLASS_NUM;
static const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * sizeof(Yolo::Detection) / sizeof(float) + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1
const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";
static Logger gLogger;

static const char* my_classes[] = { "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
         "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
         "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
         "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard","surfboard",
         "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
         "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
         "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
         "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
         "hair drier", "toothbrush" };

static int get_width(int x, float gw, int divisor = 8) {
    //return math.ceil(x / divisor) * divisor
    if (int(x * gw) % divisor == 0) {
        return int(x * gw);
    }
    return (int(x * gw / divisor) + 1) * divisor;
}

static int get_depth(int x, float gd) {
    if (x == 1) {
        return 1;
    } else {
        return round(x * gd) > 1 ? round(x * gd) : 1;
    }
}

ICudaEngine* build_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {
    INetworkDefinition* network = builder->createNetworkV2(0U);

    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME
    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });
    assert(data);

    std::map<std::string, Weights> weightMap = loadWeights(wts_name);

    /* ------ yolov5 backbone------ */
    auto focus0 = focus(network, weightMap, *data, 3, get_width(64, gw), 3, "model.0");
    auto conv1 = convBlock(network, weightMap, *focus0->getOutput(0), get_width(128, gw), 3, 2, 1, "model.1");
    auto bottleneck_CSP2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, "model.2");
    auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), get_width(256, gw), 3, 2, 1, "model.3");
    auto bottleneck_csp4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(9, gd), true, 1, 0.5, "model.4");
    auto conv5 = convBlock(network, weightMap, *bottleneck_csp4->getOutput(0), get_width(512, gw), 3, 2, 1, "model.5");
    auto bottleneck_csp6 = C3(network, weightMap, *conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, "model.6");
    auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), get_width(1024, gw), 3, 2, 1, "model.7");
    auto spp8 = SPP(network, weightMap, *conv7->getOutput(0), get_width(1024, gw), get_width(1024, gw), 5, 9, 13, "model.8");

    /* ------ yolov5 head ------ */
    auto bottleneck_csp9 = C3(network, weightMap, *spp8->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, "model.9");
    auto conv10 = convBlock(network, weightMap, *bottleneck_csp9->getOutput(0), get_width(512, gw), 1, 1, 1, "model.10");

    auto upsample11 = network->addResize(*conv10->getOutput(0));
    assert(upsample11);
    upsample11->setResizeMode(ResizeMode::kNEAREST);
    upsample11->setOutputDimensions(bottleneck_csp6->getOutput(0)->getDimensions());

    ITensor* inputTensors12[] = { upsample11->getOutput(0), bottleneck_csp6->getOutput(0) };
    auto cat12 = network->addConcatenation(inputTensors12, 2);
    auto bottleneck_csp13 = C3(network, weightMap, *cat12->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.13");
    auto conv14 = convBlock(network, weightMap, *bottleneck_csp13->getOutput(0), get_width(256, gw), 1, 1, 1, "model.14");

    auto upsample15 = network->addResize(*conv14->getOutput(0));
    assert(upsample15);
    upsample15->setResizeMode(ResizeMode::kNEAREST);
    upsample15->setOutputDimensions(bottleneck_csp4->getOutput(0)->getDimensions());
	
    ITensor* inputTensors16[] = { upsample15->getOutput(0), bottleneck_csp4->getOutput(0) };
    auto cat16 = network->addConcatenation(inputTensors16, 2);

    auto bottleneck_csp17 = C3(network, weightMap, *cat16->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, "model.17");

    // yolo layer 0
    IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.0.weight"], weightMap["model.24.m.0.bias"]);
    auto conv18 = convBlock(network, weightMap, *bottleneck_csp17->getOutput(0), get_width(256, gw), 3, 2, 1, "model.18");
    ITensor* inputTensors19[] = { conv18->getOutput(0), conv14->getOutput(0) };
    auto cat19 = network->addConcatenation(inputTensors19, 2);
    auto bottleneck_csp20 = C3(network, weightMap, *cat19->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.20");
    //yolo layer 1
    IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.1.weight"], weightMap["model.24.m.1.bias"]);
    auto conv21 = convBlock(network, weightMap, *bottleneck_csp20->getOutput(0), get_width(512, gw), 3, 2, 1, "model.21");
    ITensor* inputTensors22[] = { conv21->getOutput(0), conv10->getOutput(0) };
    auto cat22 = network->addConcatenation(inputTensors22, 2);
    auto bottleneck_csp23 = C3(network, weightMap, *cat22->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, "model.23");
    IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.2.weight"], weightMap["model.24.m.2.bias"]);

    auto yolo = addYoLoLayer(network, weightMap, det0, det1, det2);
    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*yolo->getOutput(0));

    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB
#if defined(USE_FP16)
    config->setFlag(BuilderFlag::kFP16);
#elif defined(USE_INT8)
    std::cout << "Your platform support int8: " << (builder->platformHasFastInt8() ? "true" : "false") << std::endl;
    assert(builder->platformHasFastInt8());
    config->setFlag(BuilderFlag::kINT8);
    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, "./coco_calib/", "int8calib.table", INPUT_BLOB_NAME);
    config->setInt8Calibrator(calibrator);
#endif

    std::cout << "Building engine, please wait for a while..." << std::endl;
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    std::cout << "Build engine successfully!" << std::endl;

    // Don't need the network any more
    network->destroy();

    // Release host memory
    for (auto& mem : weightMap)
    {
        free((void*)(mem.second.values));
    }

    return engine;
}

ICudaEngine* build_engine_p6(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {
    INetworkDefinition* network = builder->createNetworkV2(0U);

    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME
    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });
    assert(data);

    std::map<std::string, Weights> weightMap = loadWeights(wts_name);

    /* ------ yolov5 backbone------ */
    auto focus0 = focus(network, weightMap, *data, 3, get_width(64, gw), 3, "model.0");
    auto conv1 = convBlock(network, weightMap, *focus0->getOutput(0), get_width(128, gw), 3, 2, 1, "model.1");
    auto c3_2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, "model.2");
    auto conv3 = convBlock(network, weightMap, *c3_2->getOutput(0), get_width(256, gw), 3, 2, 1, "model.3");
    auto c3_4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(9, gd), true, 1, 0.5, "model.4");
    auto conv5 = convBlock(network, weightMap, *c3_4->getOutput(0), get_width(512, gw), 3, 2, 1, "model.5");
    auto c3_6 = C3(network, weightMap, *conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, "model.6");
    auto conv7 = convBlock(network, weightMap, *c3_6->getOutput(0), get_width(768, gw), 3, 2, 1, "model.7");
    auto c3_8 = C3(network, weightMap, *conv7->getOutput(0), get_width(768, gw), get_width(768, gw), get_depth(3, gd), true, 1, 0.5, "model.8");
    auto conv9 = convBlock(network, weightMap, *c3_8->getOutput(0), get_width(1024, gw), 3, 2, 1, "model.9");
    auto spp10 = SPP(network, weightMap, *conv9->getOutput(0), get_width(1024, gw), get_width(1024, gw), 3, 5, 7, "model.10");
    auto c3_11 = C3(network, weightMap, *spp10->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, "model.11");

    /* ------ yolov5 head ------ */
    auto conv12 = convBlock(network, weightMap, *c3_11->getOutput(0), get_width(768, gw), 1, 1, 1, "model.12");
    auto upsample13 = network->addResize(*conv12->getOutput(0));
    assert(upsample13);
    upsample13->setResizeMode(ResizeMode::kNEAREST);
    upsample13->setOutputDimensions(c3_8->getOutput(0)->getDimensions());
    ITensor* inputTensors14[] = { upsample13->getOutput(0), c3_8->getOutput(0) };
    auto cat14 = network->addConcatenation(inputTensors14, 2);
    auto c3_15 = C3(network, weightMap, *cat14->getOutput(0), get_width(1536, gw), get_width(768, gw), get_depth(3, gd), false, 1, 0.5, "model.15");

    auto conv16 = convBlock(network, weightMap, *c3_15->getOutput(0), get_width(512, gw), 1, 1, 1, "model.16");
    auto upsample17 = network->addResize(*conv16->getOutput(0));
    assert(upsample17);
    upsample17->setResizeMode(ResizeMode::kNEAREST);
    upsample17->setOutputDimensions(c3_6->getOutput(0)->getDimensions());
    ITensor* inputTensors18[] = { upsample17->getOutput(0), c3_6->getOutput(0) };
    auto cat18 = network->addConcatenation(inputTensors18, 2);
    auto c3_19 = C3(network, weightMap, *cat18->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.19");

    auto conv20 = convBlock(network, weightMap, *c3_19->getOutput(0), get_width(256, gw), 1, 1, 1, "model.20");
    auto upsample21 = network->addResize(*conv20->getOutput(0));
    assert(upsample21);
    upsample21->setResizeMode(ResizeMode::kNEAREST);
    upsample21->setOutputDimensions(c3_4->getOutput(0)->getDimensions());
    ITensor* inputTensors21[] = { upsample21->getOutput(0), c3_4->getOutput(0) };
    auto cat22 = network->addConcatenation(inputTensors21, 2);
    auto c3_23 = C3(network, weightMap, *cat22->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, "model.23");

    auto conv24 = convBlock(network, weightMap, *c3_23->getOutput(0), get_width(256, gw), 3, 2, 1, "model.24");
    ITensor* inputTensors25[] = { conv24->getOutput(0), conv20->getOutput(0) };
    auto cat25 = network->addConcatenation(inputTensors25, 2);
    auto c3_26 = C3(network, weightMap, *cat25->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.26");

    auto conv27 = convBlock(network, weightMap, *c3_26->getOutput(0), get_width(512, gw), 3, 2, 1, "model.27");
    ITensor* inputTensors28[] = { conv27->getOutput(0), conv16->getOutput(0) };
    auto cat28 = network->addConcatenation(inputTensors28, 2);
    auto c3_29 = C3(network, weightMap, *cat28->getOutput(0), get_width(1536, gw), get_width(768, gw), get_depth(3, gd), false, 1, 0.5, "model.29");

    auto conv30 = convBlock(network, weightMap, *c3_29->getOutput(0), get_width(768, gw), 3, 2, 1, "model.30");
    ITensor* inputTensors31[] = { conv30->getOutput(0), conv12->getOutput(0) };
    auto cat31 = network->addConcatenation(inputTensors31, 2);
    auto c3_32 = C3(network, weightMap, *cat31->getOutput(0), get_width(2048, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, "model.32");

    /* ------ detect ------ */
    IConvolutionLayer* det0 = network->addConvolutionNd(*c3_23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.33.m.0.weight"], weightMap["model.33.m.0.bias"]);
    IConvolutionLayer* det1 = network->addConvolutionNd(*c3_26->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.33.m.1.weight"], weightMap["model.33.m.1.bias"]);
    IConvolutionLayer* det2 = network->addConvolutionNd(*c3_29->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.33.m.2.weight"], weightMap["model.33.m.2.bias"]);
    IConvolutionLayer* det3 = network->addConvolutionNd(*c3_32->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.33.m.3.weight"], weightMap["model.33.m.3.bias"]);

    auto yolo = addYoLoLayer(network, weightMap, det0, det1, det2);
    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*yolo->getOutput(0));

    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB
#if defined(USE_FP16)
    config->setFlag(BuilderFlag::kFP16);
#elif defined(USE_INT8)
    std::cout << "Your platform support int8: " << (builder->platformHasFastInt8() ? "true" : "false") << std::endl;
    assert(builder->platformHasFastInt8());
    config->setFlag(BuilderFlag::kINT8);
    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, "./coco_calib/", "int8calib.table", INPUT_BLOB_NAME);
    config->setInt8Calibrator(calibrator);
#endif

    std::cout << "Building engine, please wait for a while..." << std::endl;
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    std::cout << "Build engine successfully!" << std::endl;

    // Don't need the network any more
    network->destroy();

    // Release host memory
    for (auto& mem : weightMap)
    {
        free((void*)(mem.second.values));
    }

    return engine;
}

void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream, float& gd, float& gw, std::string& wts_name) {
    // Create builder
    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Create model to populate the network, then set the outputs and create an engine
    ICudaEngine* engine = build_engine(maxBatchSize, builder, config, DataType::kFLOAT, gd, gw, wts_name);
    assert(engine != nullptr);

    // Serialize the engine
    (*modelStream) = engine->serialize();

    // Close everything down
    engine->destroy();
    builder->destroy();
    config->destroy();
}

void doInference(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* input, float* output, int batchSize) {
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
}

bool parse_args(int argc, char** argv, std::string& engine) {
    if (argc < 3) return false;
    if (std::string(argv[1]) == "-v" && argc == 3) {
        engine = std::string(argv[2]);
    } else {
        return false;
    }
    return true;
}

int main(int argc, char** argv) {
    cudaSetDevice(DEVICE);

    //std::string wts_name = "";
    std::string engine_name = "";
    //float gd = 0.0f, gw = 0.0f;
    //std::string img_dir;
    
    if (!parse_args(argc, argv, engine_name)) {
        std::cerr << "arguments not right!" << std::endl;
        std::cerr << "./yolov5 -v [.engine] // run inference with camera" << std::endl;
        return -1;
    }

    std::ifstream file(engine_name, std::ios::binary);
    if (!file.good()) {
        std::cerr << " read " << engine_name << " error! " << std::endl;
        return -1;
    }
    char* trtModelStream{ nullptr };
    size_t size = 0;
    file.seekg(0, file.end);
    size = file.tellg();
    file.seekg(0, file.beg);
    trtModelStream = new char[size];
    assert(trtModelStream);
    file.read(trtModelStream, size);
    file.close();

    // prepare input data ---------------------------
    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];
    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)
    //    data[i] = 1.0;
    static float prob[BATCH_SIZE * OUTPUT_SIZE];
    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    assert(engine->getNbBindings() == 2);
    void* buffers[2];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);
    assert(inputIndex == 0);
    assert(outputIndex == 1);
    // Create GPU buffers on device
    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float)));
    // Create stream
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));


    cv::VideoCapture capture(0);
    //cv::VideoCapture capture("../overpass.mp4");
    //int fourcc = cv::VideoWriter::fourcc('M','J','P','G');
    //capture.set(cv::CAP_PROP_FOURCC, fourcc);
    if(!capture.isOpened()){
        std::cout << "Error opening video stream or file" << std::endl;
        return -1;
    }

    int key;
    int fcount = 0;
    while(1)
    {
        cv::Mat frame;
        capture >> frame;
        if(frame.empty())
        {
            std::cout << "Fail to read image from camera!" << std::endl;
            break;
        }
        fcount++;
        //if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;
        for (int b = 0; b < fcount; b++) {
            //cv::Mat img = cv::imread(img_dir + "/" + file_names[f - fcount + 1 + b]);
            cv::Mat img = frame;
            if (img.empty()) continue;
            cv::Mat pr_img = preprocess_img(img, INPUT_W, INPUT_H); // letterbox BGR to RGB
            int i = 0;
            for (int row = 0; row < INPUT_H; ++row) {
                uchar* uc_pixel = pr_img.data + row * pr_img.step;
                for (int col = 0; col < INPUT_W; ++col) {
                    data[b * 3 * INPUT_H * INPUT_W + i] = (float)uc_pixel[2] / 255.0;
                    data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = (float)uc_pixel[1] / 255.0;
                    data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = (float)uc_pixel[0] / 255.0;
                    uc_pixel += 3;
                    ++i;
                }
            }
        }

        // Run inference
        auto start = std::chrono::system_clock::now();
        doInference(*context, stream, buffers, data, prob, BATCH_SIZE);
        auto end = std::chrono::system_clock::now();
        //std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
        long long infer_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        int fps = infer_ms > 0 ? 1000 / infer_ms : 1000;  // guard against division by zero on sub-millisecond frames
        std::vector<std::vector<Yolo::Detection>> batch_res(fcount);
        for (int b = 0; b < fcount; b++) {
            auto& res = batch_res[b];
            nms(res, &prob[b * OUTPUT_SIZE], CONF_THRESH, NMS_THRESH);
        }
        for (int b = 0; b < fcount; b++) {
            auto& res = batch_res[b];
            //std::cout << res.size() << std::endl;
            //cv::Mat img = cv::imread(img_dir + "/" + file_names[f - fcount + 1 + b]);
            for (size_t j = 0; j < res.size(); j++) {
                cv::Rect r = get_rect(frame, res[j].bbox);
                cv::rectangle(frame, r, cv::Scalar(0x27, 0xC1, 0x36), 2);
                std::string label = my_classes[(int)res[j].class_id];
                cv::putText(frame, label, cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);
                std::string jetson_fps = "Jetson Nano FPS: " + std::to_string(fps);
                cv::putText(frame, jetson_fps, cv::Point(11, 80), cv::FONT_HERSHEY_PLAIN, 3, cv::Scalar(0, 0, 255), 2, cv::LINE_AA);
            }
            //cv::imwrite("_" + file_names[f - fcount + 1 + b], img);
        }
        cv::imshow("yolov5", frame);
        key = cv::waitKey(1);
        if (key == 'q'){
            break; 
        }
        fcount = 0;
    }

    capture.release();
    // Release stream and buffers
    cudaStreamDestroy(stream);
    CUDA_CHECK(cudaFree(buffers[inputIndex]));
    CUDA_CHECK(cudaFree(buffers[outputIndex]));
    // Destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();

    return 0;
}
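
After swapping the file in, rebuild and run against the existing engine (a sketch, assuming the build/ directory created earlier):

```bash
cd tensorrtx/yolov5/build
cmake .. && make
sudo ./yolov5 -v straberry.engine   # opens camera 0; press q in the window to quit
```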