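OpenVINO Series 8. Using the Post-training Optimization Tool (POT) on an Object Detection Model
This chapter walks through a case study of quantizing and optimizing an object detection model with the Intel OpenVINO Post-training Optimization Tool. POT is essentially a tool for accelerating inference on the CPU, and the POT API helps implement custom optimization pipelines for a single model or for cascaded/composite deep learning models. The workflow for a custom optimization/quantization pipeline is roughly as follows:
- Load the full-precision IR model. If the model is in PyTorch or TensorFlow format, convert it to IR format first;
- Customize the `DataLoader` module (the `DetectionDataLoader` class): this module loads the dataset, including data preprocessing;
- Customize the `Metric` module (the `MAPMetric` class): this module computes the model's accuracy metric;
- Customize the `Engine` module: this module runs model inference and provides statistics and accuracy metrics for the model;
- Assemble and run the pipeline (`Pipeline`).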
Environment:
- Runtime environment for this case study: Windows 10, laptop with a 10th-generation Intel Core i5
- IDE: VSCode
- OpenVINO version: 2022.1
- Code link: 6_pot_objectdetection
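Table of contents
- OpenVINO Series 8. Using the Post-training Optimization Tool (POT) on an Object Detection Model
- 1 Background
- 1.1 Post-Training Optimization Tool (POT)
- 1.2 Post-training Optimization Tool API (POT API)
- 1.3 About the imported model
- 2 Loading the IR model
- 3 `DataLoader` (the `DetectionDataLoader` class)
- 3.1 Introduction
- 3.2 The `DetectionDataLoader` class
- 4 `Metric` (the `MAPMetric` class)
- 5 `Engine`
- 6 Assembling and running the pipeline (Pipeline)
- 7 Model comparison
- 7.1 mAP
- 7.2 Model size
- 7.3 Comparing the performance of the original full-precision model and the quantized model with `benchmark_app`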
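1 Background
1.1 Post-Training Optimization Tool (POT)
The Post-training Optimization Tool (POT) is designed to accelerate the inference of deep learning models by applying special methods that require no retraining or fine-tuning (this article treats its internals as a black box). The tool therefore needs neither a training dataset nor a training pipeline. To apply POT, we need:
- A floating-point model, FP32 or FP16, that can be converted to the OpenVINO IR format and run on the CPU;
- A representative calibration dataset for the use case, for example 300 images.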
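A calibration set of roughly 300 representative images can be drawn from a larger pool of images. The following is a minimal sketch and not part of POT: `pick_calibration_subset` is a hypothetical helper name, and the fixed seed is an assumption made so that the selection is reproducible.

```python
import random
from typing import List, Sequence


def pick_calibration_subset(
    image_names: Sequence[str], subset_size: int = 300, seed: int = 0
) -> List[str]:
    """Return a reproducible random subset of image names for calibration.

    If no more than `subset_size` images are available, all of them are used.
    """
    names = sorted(image_names)  # sort first so the result does not depend on input order
    if len(names) <= subset_size:
        return names
    rng = random.Random(seed)  # local RNG: leaves the global random state untouched
    return sorted(rng.sample(names, subset_size))


if __name__ == "__main__":
    all_images = [f"image_{i:06d}.jpg" for i in range(1000)]
    subset = pick_calibration_subset(all_images)
    print(len(subset), subset[:3])
```

Random sampling is only one option; what matters is that the subset reflects the images the deployed model will actually see.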
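The tool aims to fully automate the model conversion process, with no changes to the model required on the user side. Note that POT is Intel's model optimization tool targeting the CPU.
The figure below, from the benchmarking page, shows Intel's comparison across four CPUs: with POT applied, inference speed improves across the board:
The figure below shows the same comparison across four CPUs: with POT applied, accuracy drops slightly relative to the original full-precision model:
For the official description of POT, see this link.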
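1.2 Post-training Optimization Tool API (POT API)
As the previous section showed, POT is essentially a tool for accelerating inference on the CPU. So how do we use it? There are two ways. The first is Simplified Mode: no configuration is needed; a full-precision model goes in and an INT8 model comes out, all in a single line of code. We covered an example of this mode in 5-pot-int8-simplifiedmode. This mode is simple and direct, but the intermediate process remains a complete black box. The second way is the POT API, introduced here.
The POT API helps implement custom optimization pipelines for a single model or for cascaded/composite deep learning models. See the official description link.
As the figure above shows (focus on the left half, User API), the POT API consists of three modules: `Engine`, `Metric`, and `DataLoader`.
- `Engine` runs model inference and provides statistics and accuracy metrics for the model.
- `DataLoader` loads the dataset, including data preprocessing.
- `Metric` computes the model's accuracy metric.
The logic behind this is fairly clear. First, we use the `DataLoader` module to read and parse the calibration dataset (Dataset & Annotation); next, we define a metric (`Metric`) to measure the performance of the optimized model; then we configure the `Engine`; finally, we put these modules into the optimization pipeline, run it, and obtain the optimized model.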
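The division of labor between the three modules can be illustrated with a toy, pure-Python stand-in. This is only a sketch of the idea and does not use the POT API: `ToyDataLoader`, `ToyMetric`, and `run_engine` are hypothetical names, and the "model" is a plain Python function.

```python
from typing import Callable, List, Tuple


class ToyDataLoader:
    """Serves (annotation, input) pairs, mirroring the DataLoader role."""

    def __init__(self, samples: List[Tuple[int, float]]):
        self.samples = samples  # each sample: (label, input value)

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        label, value = self.samples[index]
        return (index, label), value


class ToyMetric:
    """Accumulates accuracy over all samples, mirroring the Metric role."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, output: int, target: int) -> None:
        self.correct += int(output == target)
        self.total += 1

    @property
    def avg_value(self) -> float:
        return self.correct / self.total if self.total else 0.0


def run_engine(
    model: Callable[[float], int], loader: ToyDataLoader, metric: ToyMetric
) -> float:
    """Runs the model on every sample and feeds the metric, mirroring the Engine role."""
    for i in range(len(loader)):
        (_, label), value = loader[i]
        metric.update(model(value), label)
    return metric.avg_value


if __name__ == "__main__":
    loader = ToyDataLoader([(1, 0.9), (0, 0.2), (1, 0.4), (0, 0.1)])
    # A threshold "model": predicts class 1 when the input exceeds 0.5
    accuracy = run_engine(lambda x: int(x > 0.5), loader, ToyMetric())
    print(f"accuracy: {accuracy:.2f}")
```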
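1.3 About the imported model
The goal of this case study is to detect people, so the model we import is person-detection-retail-0013.
Intel provides the Open Model Zoo, which contains a large number of pre-trained models, each with a detailed description. About this model: it is a pedestrian detector for retail scenarios. It is based on a MobileNetV2-like backbone that includes depth-wise convolutions to reduce the amount of computation for the 3x3 convolution blocks. The single SSD head from the 1/16 scale feature map has 12 prior boxes.
- Model input: [1,3,320,544], in [B,C,H,W] format, i.e. [batch size, number of channels, image height, image width];
- Model output: [1,1,N,7], where N is the number of detected bounding boxes (N = 200 for this IR). Each detection has the format `[image_id, label, conf, x_min, y_min, x_max, y_max]`, i.e. [ID of the image in the batch, predicted class ID (1 = person), confidence for the predicted class, coordinates of the top-left bounding box corner, coordinates of the bottom-right bounding box corner].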
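To make the output layout concrete, the sketch below filters the rows of a `[1,1,N,7]` result by confidence and scales the box corners to pixels. It assumes the corner coordinates are relative values in [0, 1], which is consistent with how the metric code later multiplies them by the image width and height. `parse_detections` and the 0.5 threshold are illustrative choices, and nested Python lists stand in for the real output array.

```python
from typing import List, Tuple

# (label, conf, x_min, y_min, x_max, y_max) with pixel coordinates
Box = Tuple[int, float, int, int, int, int]


def parse_detections(
    output: List[List[List[List[float]]]],
    image_width: int,
    image_height: int,
    conf_threshold: float = 0.5,
) -> List[Box]:
    """Convert a [1, 1, N, 7] detection result into pixel-coordinate boxes."""
    boxes = []
    for image_id, label, conf, x_min, y_min, x_max, y_max in output[0][0]:
        if conf < conf_threshold:
            continue  # drop low-confidence rows
        boxes.append(
            (
                int(label),
                float(conf),
                round(x_min * image_width),
                round(y_min * image_height),
                round(x_max * image_width),
                round(y_max * image_height),
            )
        )
    return boxes


if __name__ == "__main__":
    # Two made-up rows: one confident detection and one that is filtered out
    fake_output = [[[
        [0, 1, 0.92, 0.25, 0.10, 0.50, 0.90],
        [0, 1, 0.08, 0.00, 0.00, 0.10, 0.10],
    ]]]
    print(parse_detections(fake_output, image_width=1920, image_height=1080))
```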
As described in the previous section, the POT API consists of three modules: `Engine`, `Metric`, and `DataLoader`. Next, we walk through the steps for using the POT API.
2 Loading the IR model
First, we load the IR model. If the model is in PyTorch or TensorFlow format, it must first be converted to IR format. The code:
from pathlib import Path

from openvino.runtime import Core

print("Download the model from Open Model Zoo.")
ir_path = Path("intel/person-detection-retail-0013/FP32/person-detection-retail-0013.xml")
# Download the model only if the IR file is not already on disk
if not ir_path.exists():
    ! omz_downloader --name "person-detection-retail-0013" --precisions FP32
print("Load the IR model, and get information about network inputs and outputs.")
ie = Core()
model = ie.read_model(model=ir_path)
compiled_model = ie.compile_model(model=model, device_name="CPU")
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)
print("model input info: {}".format(input_layer))
print("model output info: {}".format(output_layer))
# The input shape is [B, C, H, W]; keep height and width for resizing images later
input_size = input_layer.shape
_, _, input_height, input_width = input_size
Printed in the terminal:
Download the model from Open Model Zoo.
Load the IR model, and get information about network inputs and outputs.
model input info: <ConstOutput: names[data] shape{1,3,320,544} type: f32>
model output info: <ConstOutput: names[detection_out] shape{1,1,200,7} type: f32>
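3 `DataLoader` (the `DetectionDataLoader` class)
3.1 Introduction
To implement `Metric` and `DataLoader`, we need to know the model's output format and the annotation format. `DataLoader` loads the dataset, including data preprocessing.
The dataset in this example uses annotations in JSON format with the keys ['categories', 'annotations', 'images']. `annotations` is a list of dictionaries, one entry per annotation. Each entry contains a `bbox` key that holds the ground-truth box in [xmin, ymin, width, height] format. This dataset has only one label: "person".
`categories` example: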
"categories": [
    {
        "id": 1,
        "name": "person",
        "supercategory": ""
    }
],
`annotations` example:
"annotations": [
    {
        "category_id": 1,
        "segmentation": null,
        "bbox": [
            1008,
            199,
            185,
            458
        ],
        "iscrowd": 0,
        "area": 84730,
        "id": 0,
        "image_id": 0,
        "attributes": {},
        "is_occluded": false
    },
`images` example:
"images": [
    {
        "date_captured": null,
        "flickr_url": null,
        "width": 1920,
        "dataset": "IOTG_RSD_Team_datasets",
        "file_name": "image_000000.jpg",
        "license": null,
        "image": "train/image_000000.jpg",
        "id": 0,
        "height": 1080,
        "coco_url": null
    },
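Given this layout, finding the boxes of one image means matching `image_id` in `annotations` against `id` in `images`. The following is a minimal sketch of building that index once, using a tiny inline dataset in the same layout. The second annotation's numbers are made up for illustration, and `index_annotations` is a hypothetical helper name.

```python
import json
from collections import defaultdict
from typing import Dict, List

SAMPLE = """
{
  "categories": [{"id": 1, "name": "person", "supercategory": ""}],
  "annotations": [
    {"category_id": 1, "bbox": [1008, 199, 185, 458], "id": 0, "image_id": 0},
    {"category_id": 1, "bbox": [10, 20, 30, 40], "id": 1, "image_id": 0}
  ],
  "images": [
    {"id": 0, "file_name": "image_000000.jpg", "width": 1920, "height": 1080}
  ]
}
"""


def index_annotations(dataset: dict) -> Dict[str, List[dict]]:
    """Map each image file name to the list of its annotation entries."""
    by_image_id = defaultdict(list)
    for annotation in dataset["annotations"]:
        by_image_id[annotation["image_id"]].append(annotation)
    # Images without annotations map to an empty list
    return {
        image["file_name"]: by_image_id[image["id"]] for image in dataset["images"]
    }


if __name__ == "__main__":
    index = index_annotations(json.loads(SAMPLE))
    for file_name, annotations in index.items():
        print(file_name, [a["bbox"] for a in annotations])
```

Building the index once avoids rescanning the full annotation list for every image.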
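3.2 The `DetectionDataLoader` class
The `DetectionDataLoader` class follows POT's `compression.api.DataLoader` interface and implements `__init__`, `__getitem__`, and `__len__`. `__getitem__` returns `(annotation, image)` or `(annotation, image, metadata)`, where the annotation is `(index, label)`.
Note that when instantiating the `DetectionDataLoader` class, we need to pass:
- `basedir`: the path of the folder that contains the calibration dataset and the annotation file;
- `target_size`: of type Tuple[int, int], for example `(input_width, input_height)`; this is the size the calibration images are resized to before being fed to the IR model, i.e. the input size of the IR model.
The code for the `DetectionDataLoader` class: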
import json
from pathlib import Path
from typing import Tuple

import cv2
import numpy as np


# DataLoader is the POT base class mentioned above (compression.api.DataLoader)
class DetectionDataLoader(DataLoader):
    def __init__(self, basedir: str, target_size: Tuple[int, int]):
        """
        :param basedir: Directory that contains images and annotation as "annotation_person_train.json"
        :param target_size: Tuple of (width, height) to resize images to.
        """
        self.images = sorted(Path(basedir).glob("*.jpg"))
        self.target_size = target_size
        with open(f"{basedir}/annotation_person_train.json") as f:
            self.annotations = json.load(f)
        self.image_ids = {
            Path(item["file_name"]).name: item["id"]
            for item in self.annotations["images"]
        }
        # Check that every image on disk has at least one annotation
        for image_filename in self.images:
            annotations = [
                item
                for item in self.annotations["annotations"]
                if item["image_id"] == self.image_ids[Path(image_filename).name]
            ]
            assert (
                len(annotations) != 0
            ), f"No annotations found for image id {image_filename}"
        print(
            f"Created dataset with {len(self.images)} items. Data directory: {basedir}"
        )

    def __getitem__(self, index):
        """
        Get an item from the dataset at the specified index.
        Detection boxes are converted from absolute coordinates to relative coordinates
        between 0 and 1 by dividing xmin, xmax by image width and ymin, ymax by image height.
        :return: (annotation, input_image, metadata) where annotation is (index, target_annotation)
                 with target_annotation as a dictionary with keys category_id, image_width, image_height
                 and bbox, containing the relative bounding box coordinates [xmin, ymin, xmax, ymax]
                 (with values between 0 and 1) and metadata a dictionary: {"filename": path_to_image}
        """
        image_path = self.images[index]
        image = cv2.imread(str(image_path))
        image = cv2.resize(image, self.target_size)
        image_id = self.image_ids[Path(image_path).name]
        # image_info contains height and width of the annotated image
        image_info = [
            image for image in self.annotations["images"] if image["id"] == image_id
        ][0]
        # image_annotations contains the boxes and labels for the image
        image_annotations = [
            item
            for item in self.annotations["annotations"]
            if item["image_id"] == image_id
        ]
        # annotations are in xmin, ymin, width, height format. Convert to
        # xmin, ymin, xmax, ymax and normalize to image width and height as
        # stored in the annotation
        target_annotations = []
        for annotation in image_annotations:
            xmin, ymin, width, height = annotation["bbox"]
            xmax = xmin + width
            ymax = ymin + height
            xmin /= image_info["width"]
            ymin /= image_info["height"]
            xmax /= image_info["width"]
            ymax /= image_info["height"]
            target_annotation = {}
            target_annotation["category_id"] = annotation["category_id"]
            target_annotation["image_width"] = image_info["width"]
            target_annotation["image_height"] = image_info["height"]
            target_annotation["bbox"] = [xmin, ymin, xmax, ymax]
            target_annotations.append(target_annotation)
        item_annotation = (index, target_annotations)
        # HWC -> CHW, then add the batch dimension: [1, C, H, W]
        input_image = np.expand_dims(image.transpose(2, 0, 1), axis=0).astype(
            np.float32
        )
        return (
            item_annotation,
            input_image,
            {"filename": str(image_path), "shape": image.shape},
        )

    def __len__(self):
        return len(self.images)
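The box conversion inside `__getitem__` can be checked in isolation. The sketch below reproduces the same arithmetic on the sample annotation shown earlier (`bbox` = [1008, 199, 185, 458] on a 1920x1080 image). `to_relative_corners` is a hypothetical helper name and is not part of the class.

```python
from typing import List, Sequence


def to_relative_corners(
    bbox: Sequence[float], image_width: int, image_height: int
) -> List[float]:
    """Convert [xmin, ymin, width, height] in pixels to relative [xmin, ymin, xmax, ymax]."""
    xmin, ymin, width, height = bbox
    xmax = xmin + width
    ymax = ymin + height
    return [
        xmin / image_width,
        ymin / image_height,
        xmax / image_width,
        ymax / image_height,
    ]


if __name__ == "__main__":
    corners = to_relative_corners([1008, 199, 185, 458], 1920, 1080)
    print([round(value, 4) for value in corners])
```

All four values fall between 0 and 1, which is the range the metric code expects before it scales boxes back to pixels.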
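4 `Metric` (the `MAPMetric` class)
We define a `metric` to measure the model's performance. With the default quantization algorithm, defining a `metric` is optional, but it can be used to compare the quantized INT8 model with the original full-precision IR model.
In this tutorial we use the MAP metric from TorchMetrics. In addition, a POT metric inherits from `compression.api.Metric`.
The code: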
import torch
import torchmetrics


# Metric is the POT base class mentioned above (compression.api.Metric)
class MAPMetric(Metric):
    def __init__(self, map_value="map"):
        """
        Mean Average Precision Metric. Wraps torchmetrics implementation, see
        https://torchmetrics.readthedocs.io/en/latest/references/modules.html#map
        :map_value: specific metric to return. Default: "map"
                    Change to one of the values in the list below to return a different value
                    ['mar_1', 'mar_10', 'mar_100', 'mar_small', 'mar_medium', 'mar_large',
                     'map', 'map_50', 'map_75', 'map_small', 'map_medium', 'map_large']
                    See torchmetrics documentation for more details.
        """
        assert (
            map_value
            in torchmetrics.detection.map.MARMetricResults.__slots__
            + torchmetrics.detection.map.MAPMetricResults.__slots__
        )
        self._name = map_value
        self.metric = torchmetrics.detection.map.MAP()
        super().__init__()

    @property
    def value(self):
        """
        Returns metric value for the last model output.
        Possible format: {metric_name: [metric_values_per_image]}
        """
        return {self._name: [0]}

    @property
    def avg_value(self):
        """
        Returns average metric value for all model outputs.
        Possible format: {metric_name: metric_value}
        """
        return {self._name: self.metric.compute()[self._name].item()}

    def update(self, output, target):
        """
        Convert network output and labels to the format that torchmetrics' MAP
        implementation expects, and call `metric.update()`.
        :param output: model output
        :param target: annotations for model output
        """
        targetboxes = []
        targetlabels = []
        predboxes = []
        predlabels = []
        scores = []
        image_width = target[0][0]["image_width"]
        image_height = target[0][0]["image_height"]
        # Scale the relative ground-truth boxes back to pixel coordinates
        for single_target in target[0]:
            txmin, tymin, txmax, tymax = single_target["bbox"]
            category = single_target["category_id"]
            txmin *= image_width
            txmax *= image_width
            tymin *= image_height
            tymax *= image_height
            targetbox = [round(txmin), round(tymin), round(txmax), round(tymax)]
            targetboxes.append(targetbox)
            targetlabels.append(category)
        # Scale the relative predicted boxes to pixel coordinates as well
        for single_output in output:
            for pred in single_output[0, 0, ::]:
                image_id, label, conf, xmin, ymin, xmax, ymax = pred
                xmin *= image_width
                xmax *= image_width
                ymin *= image_height
                ymax *= image_height
                predbox = [round(xmin), round(ymin), round(xmax), round(ymax)]
                predboxes.append(predbox)
                predlabels.append(label)
                scores.append(conf)
        preds = [
            dict(
                boxes=torch.Tensor(predboxes).float(),
                labels=torch.Tensor(predlabels).short(),
                scores=torch.Tensor(scores),
            )
        ]
        targets = [
            dict(
                boxes=torch.Tensor(targetboxes).float(),
                labels=torch.Tensor(targetlabels).short(),
            )
        ]
        self.metric.update(preds, targets)

    def reset(self):
        """
        Resets metric
        """
        self.metric.reset()

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-better", "type": "mAP"}}
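mAP builds on the intersection over union (IoU) between a predicted box and a ground-truth box: a prediction counts as a match when its IoU with a ground-truth box is high enough. The sketch below shows that building block only, not the torchmetrics implementation; `box_iou` is a hypothetical helper name and the example boxes are made up.

```python
from typing import Sequence


def box_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


if __name__ == "__main__":
    target = [100, 100, 200, 300]
    prediction = [110, 120, 210, 300]
    print(f"IoU: {box_iou(prediction, target):.3f}")
```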
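5 `Engine`
Of the three POT API modules, only `Engine` remains. The `Engine` module runs model inference and provides statistics and accuracy metrics for the model. We start with some configuration.
- `model_config` contains the name of the IR model, its path (the variable `ir_path` points to the IR model's xml file), and the weights file;
- `engine_config` holds the configuration for running inference during optimization; here it only specifies that the CPU is used for inference;
- `default_algorithms` selects the optimization algorithm; here we use the `DefaultQuantization` algorithm.
See the Post-Training Optimization Best Practices page and the official POT documentation for more information on settings and best practices.
The code: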
import addict

# Model config specifies the model name and paths to model .xml and .bin file
model_config = addict.Dict(
    {
        "model_name": ir_path.stem,
        "model": ir_path,
        "weights": ir_path.with_suffix(".bin"),
    }
)
# Engine config
engine_config = addict.Dict({"device": "CPU"})
# Standard DefaultQuantization config. For this tutorial stat_subset_size is ignored
# because there are fewer than 300 images. For production use 300 is recommended.
default_algorithms = [
    {
        "name": "DefaultQuantization",
        "stat_subset_size": 300,
        "params": {
            "target_device": "ANY",
            "preset": "mixed",  # choose between "mixed" and "performance"
        },
    }
]
print(f"model_config: {model_config}")
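6 Assembling and running the pipeline (Pipeline)
With the three POT API modules (`Engine`, `Metric`, and `DataLoader`) configured, we can assemble the pipeline. This takes the following steps:
- Instantiate `DetectionDataLoader`. We wrote the `DetectionDataLoader` class in the `DataLoader` section; here we instantiate it as `data_loader`.
- Load the model. `load_model()` loads the IR model specified in `model_config`.
- Instantiate `MAPMetric`. We wrote the `MAPMetric` class in the `Metric` section; here we instantiate it as `metric`.
- Initialize the `Engine`. `IEEngine` is POT's implementation of the inference engine, and it is passed to the POT pipeline created by `create_pipeline()`. `IEEngine` takes three inputs: `engine_config`, `data_loader`, and `metric`. This brings us back to the figure at the beginning of the article: once the `DataLoader` and `Metric` are instantiated, we pass their instances to `IEEngine`.
- Initialize and run the pipeline. Creating and running a POT pipeline takes just two lines of code: we create the pipeline with the `create_pipeline` function and run it with `pipeline.run()`.
- Save the optimized model. To reuse the quantized model later, we compress the model weights and save the compressed model to disk.
The code: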
import time

from yaspin import yaspin

# Step 1: create data loader
data_loader = DetectionDataLoader(
    basedir="data", target_size=(input_width, input_height)
)
# Step 2: load model
ir_model = load_model(model_config=model_config)
# Step 3: initialize the metric
# For DefaultQuantization, specifying a metric is optional: metric can be set to None
metric = MAPMetric(map_value="map")
# Step 4: Initialize the engine for metric calculation and statistics collection.
engine = IEEngine(config=engine_config, data_loader=data_loader, metric=metric)
# Step 5: Create a pipeline of compression algorithms.
# algorithms is defined in the Config cell above this cell
pipeline = create_pipeline(default_algorithms, engine)
# Step 6: Execute the pipeline to quantize the model
algorithm_name = pipeline.algo_seq[0].name
with yaspin(
    text=f"Executing POT pipeline on {model_config['model']} with {algorithm_name}"
) as sp:
    start_time = time.perf_counter()
    compressed_model = pipeline.run(ir_model)
    end_time = time.perf_counter()
    sp.ok("✔")
print(f"Quantization finished in {end_time - start_time:.2f} seconds")
# Step 7 (Optional): Compress model weights to quantized precision
#                    in order to reduce the size of the final .bin file
compress_model_weights(compressed_model)
# Step 8: Save the compressed model to the desired path.
# Set save_path to the directory where the compressed model should be stored
preset = pipeline._algo_seq[0].config["preset"]
algorithm_name = pipeline.algo_seq[0].name
compressed_model_paths = save_model(
    model=compressed_model,
    save_path="optimized_model",
    model_name=f"{ir_model.name}_{preset}_{algorithm_name}",
)
compressed_model_path = compressed_model_paths[0]["model"]
print("The quantized model is stored at", compressed_model_path)
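7 Model comparison
Finally, we compare the full-precision model with the optimized/quantized model from the following angles:
| Model | mAP | Model size | Throughput |
| --- | --- | --- | --- |
| FP32 full-precision model | 0.67329 | 2823.60 KB | 38.51 FPS |
| INT8 quantized model | 0.66534 | 806.62 KB | 90.03 FPS |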
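The comparison can be condensed into three ratios. The sketch below only does arithmetic on the numbers reported in this article; `compare` is an illustrative function name.

```python
def compare(fp32: dict, int8: dict) -> dict:
    """Summarize the effect of quantization from two result rows."""
    return {
        "map_drop": fp32["map"] - int8["map"],
        "size_ratio": fp32["size_kb"] / int8["size_kb"],
        "speedup": int8["fps"] / fp32["fps"],
    }


if __name__ == "__main__":
    fp32 = {"map": 0.67329, "size_kb": 2823.60, "fps": 38.51}
    int8 = {"map": 0.66534, "size_kb": 806.62, "fps": 90.03}
    summary = compare(fp32, int8)
    print(f"mAP drop:   {summary['map_drop']:.5f}")
    print(f"size ratio: {summary['size_ratio']:.2f}x smaller")
    print(f"speedup:    {summary['speedup']:.2f}x")
```

With these figures, the INT8 model is about 3.5 times smaller and about 2.3 times faster, at a cost of roughly 0.008 mAP.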
7.1 mAP
# Compute the mAP on the quantized model and compare with the mAP on the FP32 IR model.
ir_model = load_model(model_config=model_config)
evaluation_pipeline = create_pipeline(algo_config=dict(), engine=engine)
with yaspin(text="Evaluating original IR model") as sp:
    original_metric = evaluation_pipeline.evaluate(ir_model)
with yaspin(text="Evaluating quantized IR model") as sp:
    quantized_metric = pipeline.evaluate(compressed_model)
if original_metric:
    for key, value in original_metric.items():
        print(f"The {key} score of the original FP32 model is {value:.5f}")
if quantized_metric:
    for key, value in quantized_metric.items():
        print(f"The {key} score of the quantized INT8 model is {value:.5f}")
The results are as follows:
The map score of the original FP32 model is 0.67329
The map score of the quantized INT8 model is 0.66534
7.2 Model size
# Size of the weights (.bin) files in KB
original_model_size = Path(ir_path).with_suffix(".bin").stat().st_size / 1024
quantized_model_size = (
    Path(compressed_model_path).with_suffix(".bin").stat().st_size / 1024
)
print(f"FP32 model size: {original_model_size:.2f} KB")
print(f"INT8 model size: {quantized_model_size:.2f} KB")
The results are as follows:
FP32 model size: 2823.60 KB
INT8 model size: 806.62 KB
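7.3 Comparing the performance of the original full-precision model and the quantized model with `benchmark_app`
To measure the inference performance of the FP32 and INT8 models, we use Benchmark Tool, OpenVINO's benchmarking solution. It can be run in a notebook with `!benchmark_app` or `%sx benchmark_app`.
Benchmarking the FP32 model:
!benchmark_app -m $ir_path -d CPU -api async -t 15 -b 1 -cdir model_cache
The output: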
[Step 1/11] Parsing and validating input arguments
[ WARNING ] -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
CPU
openvino_intel_cpu_plugin version 2022.1
Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1
[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 55.00 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'data' precision u8, dimensions ([N,C,H,W]): 1 3 320 544
[ INFO ] Model output 'detection_out' precision f32, dimensions ([...]): 1 1 200 7
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 333.00 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ] AVAILABLE_DEVICES , ['']
...
AVG: 103.64 ms
MIN: 30.67 ms
MAX: 555.79 ms
Throughput: 38.51 FPS
Benchmarking the quantized INT8 model:
!benchmark_app -m $compressed_model_path -d CPU -api async -t 15 -b 1 -cdir model_cache
The output:
[Step 1/11] Parsing and validating input arguments
[ WARNING ] -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
CPU
openvino_intel_cpu_plugin version 2022.1
Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1
[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 93.97 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'data' precision u8, dimensions ([N,C,H,W]): 1 3 320 544
[ INFO ] Model output 'detection_out' precision f32, dimensions ([...]): 1 1 200 7
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 681.53 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ] AVAILABLE_DEVICES , ['']
...
AVG: 44.31 ms
MIN: 32.49 ms
MAX: 153.90 ms
Throughput: 90.03 FPS
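To collect the throughput figures programmatically rather than reading them off the log, the final `Throughput:` line can be extracted with a regular expression. This is a minimal sketch that assumes the log format shown above; `parse_throughput` is a hypothetical helper and not part of `benchmark_app`.

```python
import re
from typing import Optional

# Matches lines such as "Throughput: 38.51 FPS"
THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*FPS")


def parse_throughput(log_text: str) -> Optional[float]:
    """Return the last reported throughput in FPS, or None if the log has none."""
    matches = THROUGHPUT_RE.findall(log_text)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    fp32_log = "AVG: 103.64 ms\nMIN: 30.67 ms\nMAX: 555.79 ms\nThroughput: 38.51 FPS\n"
    int8_log = "AVG: 44.31 ms\nMIN: 32.49 ms\nMAX: 153.90 ms\nThroughput: 90.03 FPS\n"
    fp32_fps = parse_throughput(fp32_log)
    int8_fps = parse_throughput(int8_log)
    print(f"FP32: {fp32_fps} FPS, INT8: {int8_fps} FPS")
    print(f"speedup: {int8_fps / fp32_fps:.2f}x")
```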