Overall Architecture

The RPN takes the feature maps output by the FPN and selects 1000 proposals, along with each proposal's foreground score. Let's first look at the overall architecture¹:

(Figure: overall architecture of the RPN)


The RPN consists of the following main parts:

  1. RPN Head: convolution layers that perform foreground-score classification and Bounding-Box delta regression
  2. Anchor generator: produces anchors based on each level's feature map
  3. Reshape: concatenates the per-level outputs of the RPN Head
  4. BoxCoder decoder: combines the delta values output by the convolution layers with the anchors to produce boxes given by top-left and bottom-right coordinates
  5. Proposal Filter: selects 1000 proposals (box coordinates plus foreground scores)

The 1000 proposal boxes it produces look roughly like this:

(Figure: the 1000 proposal boxes produced by the RPN)
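Before diving into each part, here is a condensed sketch of how the five parts above are chained together. It loosely follows torchvision's RegionProposalNetwork.forward (inference path only, loss computation omitted), with attribute names matching the components discussed in the rest of this article:

# Condensed sketch of the RPN inference path; loosely follows
# torchvision's RegionProposalNetwork.forward (training losses omitted).
def rpn_forward(self, images, features):
    features = list(features.values())                    # 5 FPN feature maps
    objectness, pred_bbox_deltas = self.head(features)    # 1. RPN Head
    anchors = self.anchor_generator(images, features)     # 2. Anchor generator
    num_images = len(anchors)
    num_anchors_per_level = [o[0].numel() for o in objectness]
    objectness, pred_bbox_deltas = concat_box_prediction_layers(
        objectness, pred_bbox_deltas)                      # 3. Reshape
    proposals = self.box_coder.decode(
        pred_bbox_deltas.detach(), anchors)                # 4. BoxCoder decode
    proposals = proposals.view(num_images, -1, 4)
    boxes, scores = self.filter_proposals(
        proposals, objectness, images.image_sizes,
        num_anchors_per_level)                             # 5. Proposal Filter
    return boxes, scores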

RPNHead

The corresponding code:

import torch
import torch.nn.functional as F
from torch import nn


class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads

    Arguments:
        in_channels (int): number of channels of the input feature
        num_anchors (int): number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

        for layer in self.children():
            torch.nn.init.normal_(layer.weight, std=0.01)
            torch.nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        # type: (List[Tensor]) -> Tuple[List[Tensor], List[Tensor]]
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
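A quick shape check (a hypothetical usage example, assuming 256 FPN channels, 3 anchors per location, and the feature-map sizes used later in this article):

# Hypothetical usage of the RPNHead defined above.
head = RPNHead(in_channels=256, num_anchors=3)
sizes = [(200, 304), (100, 152), (50, 76), (25, 38), (13, 19)]
features = [torch.randn(1, 256, h, w) for h, w in sizes]
logits, bbox_reg = head(features)
print([tuple(t.shape) for t in logits])    # (1, 3, H, W) per level: 3 objectness scores
print([tuple(t.shape) for t in bbox_reg])  # (1, 12, H, W) per level: 3 anchors x 4 deltas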

For the underlying principles, see the detailed explanation of how the RPN in Faster R-CNN is trained and of its training parameter set.

Anchor Generator

Base Anchors

First, the AnchorGenerator produces a set of base anchors, 15 in total (3 per level across 5 levels; within one level all anchors have the same area, e.g. 32² at level 0, 64² at level 1, and so on):

# Feature Level#0 (size: 32, aspect-ratio [2:1, 1:1, 1:2])
[-23., -11.,  23.,  11.],
[-16., -16.,  16.,  16.],
[-11., -23.,  11.,  23.],
# Feature Level#1 (size: 64, aspect-ratio [2:1, 1:1, 1:2])
[-45., -23.,  45.,  23.],
[-32., -32.,  32.,  32.],
[-23., -45.,  23.,  45.],
# Feature Level#2 (size: 128, aspect-ratio [2:1, 1:1, 1:2])
[-91., -45.,  91.,  45.],
[-64., -64.,  64.,  64.],
[-45., -91.,  45.,  91.],
# Feature Level#3 (size: 256, aspect-ratio [2:1, 1:1, 1:2])
[-181.,  -91.,  181.,   91.],
[-128., -128.,  128.,  128.],
[ -91., -181.,   91.,  181.],
# Feature Level#4 (size: 512, aspect-ratio [2:1, 1:1, 1:2])
[-362., -181.,  362.,  181.],
[-256., -256.,  256.,  256.],
[-181., -362.,  181.,  362.]

| Layer | anchor_area | aspect_ratios | Anchors per location | Feature-map size | Number of Anchors |
| --- | --- | --- | --- | --- | --- |
| Feature-Map#0 | 32² | {2:1, 1:1, 1:2} | 3 | 200x304 | 182400 |
| Feature-Map#1 | 64² | {2:1, 1:1, 1:2} | 3 | 100x152 | 45600 |
| Feature-Map#2 | 128² | {2:1, 1:1, 1:2} | 3 | 50x76 | 11400 |
| Feature-Map#3 | 256² | {2:1, 1:1, 1:2} | 3 | 25x38 | 2850 |
| Feature-Map#4 | 512² | {2:1, 1:1, 1:2} | 3 | 13x19 | 741 |
| Total | | | | | 242991 |

How are these values generated from the size and the aspect ratios?

def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device="cpu"):
        # type: (List[int], List[float], int, Device) -> Tensor  # noqa: F821
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios

        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()

For example, for the first group with aspect_ratios = [0.5, 1.0, 2.0] and scale 32, the first anchor is [-23, -11, 23, 11] (w = 32·√2 ≈ 45.25, h = 32/√2 ≈ 22.63, halved and rounded), and its area is w·h ≈ 45.25 × 22.63 ≈ 1024 = 32².
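The same arithmetic can be checked with a standalone snippet (not part of torchvision, just the logic of generate_anchors for scale 32):

import torch

scales = torch.tensor([32.0])
aspect_ratios = torch.tensor([0.5, 1.0, 2.0])
h_ratios = torch.sqrt(aspect_ratios)                   # [0.7071, 1.0, 1.4142]
w_ratios = 1 / h_ratios                                # [1.4142, 1.0, 0.7071]
ws = (w_ratios[:, None] * scales[None, :]).view(-1)    # [45.25, 32.00, 22.63]
hs = (h_ratios[:, None] * scales[None, :]).view(-1)    # [22.63, 32.00, 45.25]
base_anchors = (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()
print(base_anchors)   # [[-23., -11., 23., 11.], [-16., -16., 16., 16.], [-11., -23., 11., 23.]]
print(ws * hs)        # every area is 32*32 = 1024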

Grid Anchors

The 15 base anchors above are replicated at every location of every level's feature map, and the generated anchor coordinates are all expressed in the common input-image coordinate system, which is where the strides come in (stride = input size // feature-map size, e.g. 800 // 13 = 61 for the last level, which is why it is not exactly 64):

[[tensor(4), tensor(4)], 
 [tensor(8), tensor(8)], 
 [tensor(16), tensor(16)], 
 [tensor(32), tensor(32)], 
 [tensor(61), tensor(64)]]

On each level's feature map, the grid coordinates are multiplied by that level's strides, and the base_anchors are then added to obtain the final anchors. The corresponding code:

    # For every combination of (a, (g, s), i) in (self.cell_anchors, zip(grid_sizes, strides), 0:2),
    # output g[i] anchors that are s[i] distance apart in direction i, with the same dimensions as a.
    def grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        anchors = []
        cell_anchors = self.cell_anchors
        assert cell_anchors is not None

        for size, stride, base_anchors in zip(
            grid_sizes, strides, cell_anchors
        ):
            grid_height, grid_width = size
            stride_height, stride_width = stride
            device = base_anchors.device

            # For output anchor, compute [x_center, y_center, x_center, y_center]
            shifts_x = torch.arange(
                0, grid_width, dtype=torch.float32, device=device
            ) * stride_width
            shifts_y = torch.arange(
                0, grid_height, dtype=torch.float32, device=device
            ) * stride_height
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)

            # For every (base anchor, output anchor) pair,
            # offset each zero-centered base anchor by the center of the output anchor.
            anchors.append(
                (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
            )

        return anchors
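To make the broadcast at the end concrete, here is a toy example (a made-up 2x2 grid with stride 4, using the three level-0 base anchors):

import torch

base_anchors = torch.tensor([[-23., -11., 23., 11.],
                             [-16., -16., 16., 16.],
                             [-11., -23., 11., 23.]])
grid_height, grid_width, stride = 2, 2, 4.0
shifts_x = torch.arange(0, grid_width) * stride        # [0., 4.]
shifts_y = torch.arange(0, grid_height) * stride       # [0., 4.]
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
shifts = torch.stack((shift_x.reshape(-1), shift_y.reshape(-1),
                      shift_x.reshape(-1), shift_y.reshape(-1)), dim=1)
anchors = (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
print(anchors.shape)   # torch.Size([12, 4]) = 2*2 grid points * 3 base anchors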

Anchors

The above produces, for each image, a list of 5 tensors (one per level); these need to be concatenated so they can be matched up with the RPN Head outputs:

class AnchorGenerator(nn.Module):
	......
    def forward(self, image_list, feature_maps):
        # type: (ImageList, List[Tensor]) -> List[Tensor]
        grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])
        image_size = image_list.tensors.shape[-2:]
        dtype, device = feature_maps[0].dtype, feature_maps[0].device
        strides = [[torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
                    torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device)] for g in grid_sizes]
        self.set_cell_anchors(dtype, device)
        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
        anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                anchors_in_image.append(anchors_per_feature_map)
            anchors.append(anchors_in_image)
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        # Clear the cache in case that memory leaks.
        self._cache.clear()
        return anchors

Reshape

The objectness and pred_bbox_deltas produced by the RPN Head above are each a list of 5 tensors; they need to be concatenated and converted into tensors with the same format and layout as the anchors.

def concat_box_prediction_layers(box_cls, box_regression):
    # type: (List[Tensor], List[Tensor]) -> Tuple[Tensor, Tensor]
    box_cls_flattened = []
    box_regression_flattened = []
    # for each feature level, permute the outputs to make them be in the
    # same format as the labels. Note that the labels are computed for
    # all feature levels concatenated, so we keep the same representation
    # for the objectness and the box_regression
    for box_cls_per_level, box_regression_per_level in zip(
        box_cls, box_regression
    ):
        N, AxC, H, W = box_cls_per_level.shape
        Ax4 = box_regression_per_level.shape[1]
        A = Ax4 // 4
        C = AxC // A
        box_cls_per_level = permute_and_flatten(
            box_cls_per_level, N, A, C, H, W
        )
        box_cls_flattened.append(box_cls_per_level)

        box_regression_per_level = permute_and_flatten(
            box_regression_per_level, N, A, 4, H, W
        )
        box_regression_flattened.append(box_regression_per_level)
    # concatenate on the first dimension (representing the feature levels), to
    # take into account the way the labels were generated (with all feature maps
    # being concatenated as well)
    box_cls = torch.cat(box_cls_flattened, dim=1).flatten(0, -2)
    box_regression = torch.cat(box_regression_flattened, dim=1).reshape(-1, 4)
    return box_cls, box_regression
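The helper permute_and_flatten is not listed in this article. Roughly, it turns a per-level output of shape [N, A*C, H, W] into [N, H*W*A, C], so that after concatenating all levels the ordering (level → location → anchor) matches the anchor layout. A sketch of that behavior (paraphrasing torchvision's helper):

def permute_and_flatten(layer, N, A, C, H, W):
    # [N, A*C, H, W] -> [N, A, C, H, W] -> [N, H, W, A, C] -> [N, H*W*A, C]
    layer = layer.view(N, A, C, H, W)
    layer = layer.permute(0, 3, 4, 1, 2)
    return layer.reshape(N, -1, C)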

The resulting Anchors, Objectness (foreground classification), and box_delta_regression (box delta regression) are all laid out as follows:

(Figure: layout of the concatenated anchors / objectness / box deltas)

Box-Coder.Decode

What the RPN Head outputs are the predicted delta values [dx, dy, dw, dh]. In the figure below, the boxes with black borders are anchors and the boxes with red borders are the predicted proposals; the proposals are obtained by applying the deltas to the anchors (center = delta·anchor_size + anchor_center, size = anchor_size·e^delta), which is exactly what decode_single below computes.

(Figure: anchors with black borders and predicted proposals with red borders)

class BoxCoder(object):
	......
    def decode(self, rel_codes, boxes):
        # type: (Tensor, List[Tensor]) -> Tensor
        assert isinstance(boxes, (list, tuple))
        assert isinstance(rel_codes, torch.Tensor)
        boxes_per_image = [b.size(0) for b in boxes]
        concat_boxes = torch.cat(boxes, dim=0)
        box_sum = 0
        for val in boxes_per_image:
            box_sum += val
        pred_boxes = self.decode_single(
            rel_codes.reshape(box_sum, -1), concat_boxes
        )
        return pred_boxes.reshape(box_sum, -1, 4)

    def decode_single(self, rel_codes, boxes):
        """
        From a set of original boxes and encoded relative box offsets,
        get the decoded boxes.

        Arguments:
            rel_codes (Tensor): encoded boxes
            boxes (Tensor): reference boxes.
        """

        boxes = boxes.to(rel_codes.dtype)

        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        wx, wy, ww, wh = self.weights
        dx = rel_codes[:, 0::4] / wx
        dy = rel_codes[:, 1::4] / wy
        dw = rel_codes[:, 2::4] / ww
        dh = rel_codes[:, 3::4] / wh

        # Prevent sending too large values into torch.exp()
        dw = torch.clamp(dw, max=self.bbox_xform_clip)
        dh = torch.clamp(dh, max=self.bbox_xform_clip)

        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]

        pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
        return pred_boxes
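A small numeric check of the decode formulas above (assuming weights wx = wy = ww = wh = 1.0), for an anchor [0, 0, 100, 100] (w = h = 100, center (50, 50)) and deltas [0.1, 0.2, ln 2, 0]:

import math
import torch

anchor = torch.tensor([[0., 0., 100., 100.]])       # w = h = 100, center (50, 50)
deltas = torch.tensor([[0.1, 0.2, math.log(2.0), 0.0]])

w  = anchor[:, 2] - anchor[:, 0]
h  = anchor[:, 3] - anchor[:, 1]
cx = anchor[:, 0] + 0.5 * w
cy = anchor[:, 1] + 0.5 * h

pred_cx = deltas[:, 0] * w + cx        # 0.1*100 + 50 = 60
pred_cy = deltas[:, 1] * h + cy        # 0.2*100 + 50 = 70
pred_w  = torch.exp(deltas[:, 2]) * w  # 2 * 100 = 200
pred_h  = torch.exp(deltas[:, 3]) * h  # 1 * 100 = 100

box = torch.stack([pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
                   pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h], dim=1)
print(box)   # tensor([[-40.,  20., 160., 120.]])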

Proposal Filter

_get_top_n_idx

For each of the 5 feature-map levels, the highest-scoring boxes are taken, at most 1000 per level:

| Layer | Anchors per location | Feature-map size | Number of bboxes | min(1000, num_of_bboxes) |
| --- | --- | --- | --- | --- |
| Feature-Map#0 | 3 | 200x304 | 182400 | 1000 |
| Feature-Map#1 | 3 | 100x152 | 45600 | 1000 |
| Feature-Map#2 | 3 | 50x76 | 11400 | 1000 |
| Feature-Map#3 | 3 | 25x38 | 2850 | 1000 |
| Feature-Map#4 | 3 | 13x19 | 741 | 741 |
| Total | | | 242991 | 4741 |

So through this step, a total of 4741 bboxes are selected.
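The method _get_top_n_idx itself is not listed in this article. Roughly, it splits the objectness tensor by level, takes at most pre_nms_top_n (1000 here) indices per level with topk, and shifts them into the concatenated index space. A sketch along the lines of torchvision's implementation:

import torch

def _get_top_n_idx(self, objectness, num_anchors_per_level):
    # objectness: [num_images, total_anchors]; select per-level top-k indices.
    r = []
    offset = 0
    for ob in objectness.split(num_anchors_per_level, 1):
        num_anchors = ob.shape[1]
        pre_nms_top_n = min(self.pre_nms_top_n(), num_anchors)  # e.g. min(1000, 741)
        _, top_n_idx = ob.topk(pre_nms_top_n, dim=1)
        r.append(top_n_idx + offset)   # shift into the concatenated index space
        offset += num_anchors
    return torch.cat(r, dim=1)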

clip_boxes_to_image/remove_small_boxes

This step clips the proposal bboxes from the 800x1216 padded region down to the 800x1202 valid image area, so that the padded border is not included, and then discards boxes that are too small²:

def remove_small_boxes(boxes, min_size):
    # type: (Tensor, float) -> Tensor
    """
    Remove boxes which contains at least one side smaller than min_size.

    Arguments:
        boxes (Tensor[N, 4]): boxes in (x1, y1, x2, y2) format
        min_size (float): minimum size

    Returns:
        keep (Tensor[K]): indices of the boxes that have both sides
            larger than min_size
    """
    ws, hs = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    keep = (ws >= min_size) & (hs >= min_size)
    keep = keep.nonzero().squeeze(1)
    return keep
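clip_boxes_to_image is not shown in this article; a simplified sketch of what it does (clamping x coordinates to [0, width] and y coordinates to [0, height] of the valid image area):

import torch

def clip_boxes_to_image(boxes, size):
    # boxes: Tensor[N, 4] in (x1, y1, x2, y2); size: (height, width) of the image
    height, width = size
    boxes_x = boxes[..., 0::2].clamp(min=0, max=width)   # x1, x2
    boxes_y = boxes[..., 1::2].clamp(min=0, max=height)  # y1, y2
    clipped = torch.stack((boxes_x, boxes_y), dim=boxes.dim())
    return clipped.reshape(boxes.shape)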

batched_nms

This step applies non-maximum suppression (NMS) to the proposal bboxes and scores of each feature level independently, further filtering the proposals; if more than 1000 proposals remain after filtering, only the top 1000 are kept. The default nms_thresh is 0.7. The detailed steps can be followed in the code below:

def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):
        # type: (Tensor, Tensor, List[Tuple[int, int]], List[int]) -> Tuple[List[Tensor], List[Tensor]]
        num_images = proposals.shape[0]
        device = proposals.device
        # do not backprop through objectness
        objectness = objectness.detach()
        objectness = objectness.reshape(num_images, -1)

        levels = [
            torch.full((n,), idx, dtype=torch.int64, device=device)
            for idx, n in enumerate(num_anchors_per_level)
        ]
        levels = torch.cat(levels, 0)
        levels = levels.reshape(1, -1).expand_as(objectness)

        # select top_n boxes independently per level before applying nms
        top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)

        image_range = torch.arange(num_images, device=device)
        batch_idx = image_range[:, None]

        objectness = objectness[batch_idx, top_n_idx]
        levels = levels[batch_idx, top_n_idx]
        proposals = proposals[batch_idx, top_n_idx]

        final_boxes = []
        final_scores = []
        for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):
            boxes = box_ops.clip_boxes_to_image(boxes, img_shape)
            keep = box_ops.remove_small_boxes(boxes, self.min_size)
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
            # non-maximum suppression, independently done per level
            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
            # keep only topk scoring predictions
            keep = keep[:self.post_nms_top_n()]
            boxes, scores = boxes[keep], scores[keep]
            final_boxes.append(boxes)
            final_scores.append(scores)
        return final_boxes, final_scores
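box_ops.batched_nms runs NMS on all levels at once without a Python loop: each box is shifted by an offset proportional to its level index so that boxes from different levels can never overlap, and a single NMS call is then made. A sketch of this trick (roughly how torchvision implements it):

import torch
from torchvision.ops import nms

def batched_nms(boxes, scores, idxs, iou_threshold):
    # idxs holds the level index of each box; boxes from different levels are
    # pushed to disjoint coordinate ranges so they never suppress each other.
    if boxes.numel() == 0:
        return torch.empty((0,), dtype=torch.int64, device=boxes.device)
    max_coordinate = boxes.max()
    offsets = idxs.to(boxes) * (max_coordinate + 1)
    boxes_for_nms = boxes + offsets[:, None]
    keep = nms(boxes_for_nms, scores, iou_threshold)  # indices sorted by score
    return keep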

Writing this took real effort; if it helped, please leave a like! 😃


  1. Assuming an input image 599 pixels high and 900 pixels wide; after the resize transform, the feature maps are based on an (800x1216) input. ↩︎
  2. By default, boxes whose width or height is below the threshold (0.01) are discarded. ↩︎