Light-Head R-CNN: In Defense of Two-Stage Object Detector
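Code will be made publicly available.
This paper improves the two-stage object detector, focusing mainly on simplifying the network structure and increasing speed; accuracy also improves slightly.
First, what is a two-stage object detector? It splits object detection into two steps: region proposal extraction followed by region classification. Representative methods are Faster R-CNN [28] and R-FCN [17].
By contrast, a one-stage object detector has no proposal step and detects and classifies directly. Representative methods are YOLO [26, 27] and SSD [22].
Can a two-stage object detector surpass one-stage detectors in both speed and accuracy?
We find that two-stage object detectors share some common traits. One is a heavy head attached to the backbone network: Faster R-CNN, for example, uses a fairly complex subnetwork to classify and regress each proposal. The other is that the feature maps after RoI pooling have a large number of channels, which leads to high memory consumption and computation.
We therefore propose a lightweight classification and regression head, giving an efficient yet accurate two-stage detector. We do two main things: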
1) Apply a large-kernel separable convolution to produce "thin" feature maps with a small channel number.
2) Attach a cheap single fully-connected layer to the pooling layer.
3 Our Approach
3.1. Light-Head R-CNN
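The classifier in Faster R-CNN uses two large fully connected layers or the whole ResNet stage 5; accuracy is high, but so is the computation. To speed up the RoI-wise subnet, R-FCN first produces a set of score maps for each region, then pools along each RoI and average-votes the final prediction. With a computation-free R-CNN subnet, R-FCN reaches comparable detection results by moving the computation forward into the generation of RoI-shared score maps. Faster R-CNN and R-FCN therefore both have a heavy head, only at different positions in the network.
From the accuracy perspective: although Faster R-CNN is good at region classification, it applies global average pooling to reduce the computation of the first fully connected layer, which harms spatial localization. R-FCN directly pools the prediction results after position-sensitive pooling, and its performance is somewhat worse.
From the speed perspective: Faster R-CNN applies a costly R-CNN subnet to every proposal, so overall speed drops when there are many proposals. R-FCN uses a cost-free R-CNN subnet, but it has to generate a large number of score maps for RoI pooling, so the whole network consumes considerable memory and time.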
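To make the speed argument concrete, here is a rough sketch of how the RoI-wise head cost scales with the number of proposals. All layer sizes are illustrative assumptions that are not given in this text: a VGG-style heavy head (7x7x512 pooled feature, two 4096-wide fully connected layers) and a light head with one 2048-wide fully connected layer on a 7x7x10 pooled feature. The function names are hypothetical.

```python
def fc_macs(n_in, n_out):
    """Multiply-accumulates of one fully connected layer (bias ignored)."""
    return n_in * n_out


def heavy_head_macs(pooled=7, channels=512, hidden=4096):
    """Two large FC layers per RoI (assumed VGG-style sizes)."""
    return fc_macs(pooled * pooled * channels, hidden) + fc_macs(hidden, hidden)


def light_head_macs(pooled=7, alpha=10, hidden=2048):
    """One FC layer per RoI on a thin pooled feature (assumed sizes)."""
    return fc_macs(pooled * pooled * alpha, hidden)


def total_head_macs(per_roi_macs, num_rois):
    """The RoI-wise cost grows linearly with the number of proposals."""
    return per_roi_macs * num_rois


# Compare the two heads for a few assumed proposal counts.
for num_rois in (300, 1000, 2000):
    heavy = total_head_macs(heavy_head_macs(), num_rois)
    light = total_head_macs(light_head_macs(), num_rois)
    print(num_rois, heavy, light, round(heavy / light, 1))
```

The count ignores the sibling classification and regression layers and everything shared across RoIs, so it only illustrates the linear dependence on the number of proposals, not the real end-to-end cost.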
3.1.2 Thin feature maps for RoI warping
To reduce computation, we propose using feature maps with a small number of channels (thin feature maps).
RoI warping on thin feature maps not only improves accuracy but also saves memory and computation during training and inference.
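As a rough illustration of why thinner maps save memory, the sketch below compares channel counts and raw storage of an R-FCN-style score map and a thin feature map. The values used (81 classes, a 7x7 pooling grid, a channel multiplier of 10, a 50x75 feature map, 4-byte floats) are assumptions for illustration and are not stated in this text; the function names are hypothetical.

```python
def rfcn_score_map_channels(num_classes, p):
    """Position-sensitive score maps: one map per class per grid cell."""
    return num_classes * p * p


def thin_feature_channels(alpha, p):
    """Thin feature map: a small multiplier alpha replaces the class count."""
    return alpha * p * p


def feature_map_bytes(channels, height, width, bytes_per_value=4):
    """Raw storage for one feature map of shape (channels, height, width)."""
    return channels * height * width * bytes_per_value


# Assumed values: 80 object classes plus background, 7x7 grid, alpha = 10.
heavy_c = rfcn_score_map_channels(num_classes=81, p=7)
thin_c = thin_feature_channels(alpha=10, p=7)
print(heavy_c, thin_c)
print(feature_map_bytes(heavy_c, 50, 75), feature_map_bytes(thin_c, 50, 75))
```

Because the thin map no longer depends on the number of classes, its size stays fixed as the label set grows.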
3.2. Light-Head R-CNN for Object Detection
We design two settings here: 1) setting "L", to validate the performance of our algorithm when integrated with a large backbone network;
2) setting "S", to validate the effectiveness and efficiency of our algorithm when it uses a small backbone network.
Basic feature extractor: setting L uses ResNet-101; setting S uses an Xception-like small base model.
Xception like architecture:
Thin feature maps: large separable convolution layers [35, 25] are applied on C5.
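A minimal sketch of why a large-kernel separable convolution is cheap compared with a dense k x k convolution. It assumes the separable layer is built from two parallel branches (k x 1 followed by 1 x k, and 1 x k followed by k x 1) whose outputs are summed; that structure and the example sizes (k = 15, 2048 input channels, 256 middle channels, 490 output channels) are assumptions not spelled out in this text, and the function names are hypothetical.

```python
def separable_conv_params(c_in, c_mid, c_out, k):
    """Weights of two summed branches, each a k x 1 and a 1 x k convolution.

    One branch maps c_in -> c_mid with a k-tap 1D kernel, then
    c_mid -> c_out with another k-tap 1D kernel. Biases are ignored.
    """
    one_branch = k * c_in * c_mid + k * c_mid * c_out
    return 2 * one_branch


def dense_conv_params(c_in, c_out, k):
    """Weights of a single dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


# Assumed example sizes, for illustration only.
sep = separable_conv_params(c_in=2048, c_mid=256, c_out=490, k=15)
dense = dense_conv_params(c_in=2048, c_out=490, k=15)
print(sep, dense, round(dense / sep, 1))
```

The 1D kernels make the cost grow linearly in k instead of quadratically, which is what allows a large receptive field on C5 at modest cost.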
RPN (Region Proposal Network) is a sliding-window class-agnostic object detector that uses features from C4.
Non-maximum suppression (NMS) is used to reduce the number of proposals.
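The proposal-pruning step can be sketched as plain greedy NMS. This is a generic textbook version, not necessarily the exact variant used by the authors; the box format (x1, y1, x2, y2) and the default IoU threshold of 0.7 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop heavy overlaps, repeat.

    Returns the indices of the kept boxes in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In this example the second box overlaps the first with an IoU of 0.81, so it is suppressed at the assumed 0.7 threshold, while the distant third box survives.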
4 Experiments
We investigate the impact of reducing the channels of the feature maps used for RoI warping.
COCO test-dev