This AAAI 2019 paper proposes a new feature pyramid network, MLFPN, which achieves excellent results on the COCO dataset. This post walks through the architecture following the flow of detecting a single image, with selected code included.
Full code: M2Det


M2Det

  • Why a new feature pyramid architecture
  • BackBone Network
  • MLFPN
  • TUM
  • FFM
  • SFAM
  • Detection stage
  • Summary


Why a new feature pyramid architecture

The performance of a detection framework depends heavily on how well it extracts features. To extract rich features while handling scale variation (the same class of object is detected with different quality depending on its distance from the camera), researchers developed two kinds of pyramids. The first is the image pyramid: the input image is rescaled and detection is run at multiple scales, which works but is computationally expensive and slow. People therefore prefer the feature pyramid network (FPN), which has spawned many variants; M2Det is itself an FPN variant, as shown below.

[Figure: comparison of feature pyramid designs (a)-(d); (d) is the MLFPN used by M2Det]


Architecture (d) in the figure is the one proposed in this paper. It looks complex at first glance, but most of it is repeated structure.

BackBone Network

The paper provides two backbone families, VGG and ResNet; they are standard and not covered here.

[Figure: backbone networks (VGG / ResNet)]

MLFPN

MLFPN is the feature pyramid proposed in the paper. It consists of three main components: TUM (Thinned U-shape Module), FFM (Feature Fusion Module), and SFAM (Scale-wise Feature Aggregation Module). The structure is shown below.

[Figure: MLFPN structure (FFMv1, stacked TUMs, FFMv2, SFAM)]

TUM

MLFPN stacks 8 TUMs in an FPN-like fashion; a single TUM is structured as follows:

[Figure: structure of a single TUM]


An input tensor of shape (256, 40, 40) passes through a series of downsampling convolutions and is then progressively upsampled, with each fused scale going through a 1×1 convolution (which the paper describes as a smoothing step). Each TUM ultimately produces feature maps at 6 different scales: the smaller the map, the stronger its deep semantic information; the larger the map, the stronger its shallow detail information. The largest output, of shape (128, 40, 40), is also passed through FFMv2 to form part of the next TUM's input.
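The TUM code below is taken from the linked repo. It relies on a BasicConv helper defined elsewhere in that repo; the following stand-in (a plain conv + batch norm + ReLU, my assumption of its behavior) makes the snippet self-contained:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicConv(nn.Module):
    # Minimal stand-in for the repo's BasicConv: conv + BN + ReLU.
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0):
        super(BasicConv, self).__init__()
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))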

class TUM(nn.Module):
    def __init__(self, first_level=True, input_planes=128, is_smooth=True, side_channel=512, scales=6):
        super(TUM, self).__init__()
        self.is_smooth = is_smooth
        self.side_channel = side_channel
        self.input_planes = input_planes
        self.planes = 2 * self.input_planes
        self.first_level = first_level
        self.scales = scales
        self.in1 = input_planes + side_channel if not first_level else input_planes

        # Downsampling path: a chain of stride-2 convs (the final level uses
        # stride 1, padding 0 to shrink 3x3 down to 1x1).
        self.layers = nn.Sequential()
        self.layers.add_module('{}'.format(len(self.layers)), BasicConv(self.in1, self.planes, 3, 2, 1))
        for i in range(self.scales-2):
            if not i == self.scales - 3:
                self.layers.add_module(
                        '{}'.format(len(self.layers)),
                        BasicConv(self.planes, self.planes, 3, 2, 1)
                        )
            else:
                self.layers.add_module(
                        '{}'.format(len(self.layers)),
                        BasicConv(self.planes, self.planes, 3, 1, 0)
                        )
        self.toplayer = nn.Sequential(BasicConv(self.planes, self.planes, 1, 1, 0))

        # Lateral 3x3 convs applied to encoder features before top-down fusion.
        self.latlayer = nn.Sequential()
        for i in range(self.scales-2):
            self.latlayer.add_module(
                    '{}'.format(len(self.latlayer)),
                    BasicConv(self.planes, self.planes, 3, 1, 1)
                    )
        self.latlayer.add_module('{}'.format(len(self.latlayer)),BasicConv(self.in1, self.planes, 3, 1, 1))

        # 1x1 convs that smooth each fused scale (the smoothing step noted above).
        if self.is_smooth:
            smooth = list()
            for i in range(self.scales-1):
                smooth.append(
                        BasicConv(self.planes, self.planes, 1, 1, 0)
                        )
            self.smooth = nn.Sequential(*smooth)

    def _upsample_add(self, x, y, fuse_type='interp'):
        # Upsample x to y's spatial size and add them (FPN-style top-down fusion).
        _,_,H,W = y.size()
        if fuse_type=='interp':
            return F.interpolate(x, size=(H,W), mode='nearest') + y
        else:
            raise NotImplementedError
            #return nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1)

    def forward(self, x, y):
        # For non-first TUMs, concatenate the base feature x with the previous
        # TUM's largest-scale output y along the channel dimension.
        if not self.first_level:
            x = torch.cat([x,y],1)
        conved_feat = [x]
        for i in range(len(self.layers)):
            x = self.layers[i](x)
            conved_feat.append(x)
        
        deconved_feat = [self.toplayer[0](conved_feat[-1])]
        for i in range(len(self.latlayer)):
            deconved_feat.append(
                    self._upsample_add(
                        deconved_feat[i], self.latlayer[i](conved_feat[len(self.layers)-1-i])
                        )
                    )
        if self.is_smooth:
            smoothed_feat = [deconved_feat[0]]
            for i in range(len(self.smooth)):
                smoothed_feat.append(
                        self.smooth[i](deconved_feat[i+1])
                        )
            return smoothed_feat
        return deconved_feat
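
A quick shape check (a sketch using the stand-in BasicConv above; the input follows the (256, 40, 40) tensor from the text, and the second argument is unused for a first-level TUM):

tum = TUM(first_level=True, input_planes=256, scales=6).eval()  # eval: BN on 1x1 maps
with torch.no_grad():
    outs = tum(torch.randn(1, 256, 40, 40), None)
print([tuple(o.shape) for o in outs])
# -> six scales: 1, 3, 5, 10, 20, 40 (smallest first), each 2*256 = 512 channels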

FFM

[Figure: (a) FFMv1, (b) FFMv2]


FFM comes in two variants: FFMv1 (figure a) and FFMv2 (figure b). FFMv1 concatenates the last two backbone feature maps; note that the deeper one must be upsampled first so the spatial sizes match. FFMv2 then concatenates the FFMv1 output with the largest-scale output of the previous TUM.
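A minimal sketch of the two modules (my sketch based on the description above; channel counts are illustrative for a VGG backbone, and BasicConv is the stand-in defined earlier):

class FFMv1(nn.Module):
    # Fuse two backbone stages into the base feature: compress each with a
    # conv, upsample the deeper one to the shallower one's size, then concat.
    def __init__(self, c_shallow=512, c_deep=1024):
        super(FFMv1, self).__init__()
        self.conv_shallow = BasicConv(c_shallow, 256, 3, 1, 1)
        self.conv_deep = BasicConv(c_deep, 512, 3, 1, 1)

    def forward(self, shallow, deep):
        deep = self.conv_deep(deep)
        deep = F.interpolate(deep, size=shallow.shape[2:], mode='nearest')
        return torch.cat([self.conv_shallow(shallow), deep], 1)

class FFMv2(nn.Module):
    # Compress the base feature with a 1x1 conv and concat it with the
    # previous TUM's largest-scale output to feed the next TUM.
    def __init__(self, c_base=768, c_out=128):
        super(FFMv2, self).__init__()
        self.conv = BasicConv(c_base, c_out, 1, 1, 0)

    def forward(self, base, tum_out):
        return torch.cat([self.conv(base), tum_out], 1)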

SFAM

At this point we have 8 groups of 128-channel features at six scales (1, 3, 5, 10, 20, 40). SFAM's goal is to aggregate the multi-level, multi-scale features produced by the TUMs into a multi-level feature pyramid. First, features of equal spatial size are concatenated along the channel dimension, giving six n×n×1024 features (128 × 8 = 1024), each containing features from every depth. Then each feature is globally average-pooled down to 1×1×1024, and each of the six pooled vectors passes through two fully convolutional layers to learn channel-attention weights, which let the network emphasize the channels best suited to each detection scale.

[Figure: SFAM scale-wise feature aggregation with channel attention]

class SFAM(nn.Module):
    def __init__(self, planes, num_levels, num_scales, compress_ratio=16):
        super(SFAM, self).__init__()
        self.planes = planes
        self.num_levels = num_levels
        self.num_scales = num_scales
        self.compress_ratio = compress_ratio

        # SE-style channel attention, one pair of 1x1 convs per scale. Note:
        # the original repo writes [conv] * num_scales, which repeats the
        # *same* module object and silently shares weights across scales;
        # independent modules are created here instead, and compress_ratio
        # replaces the hard-coded 16.
        self.fc1 = nn.ModuleList([nn.Conv2d(self.planes*self.num_levels,
                                                 self.planes*self.num_levels // self.compress_ratio,
                                                 1, 1, 0) for _ in range(self.num_scales)])
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.ModuleList([nn.Conv2d(self.planes*self.num_levels // self.compress_ratio,
                                                 self.planes*self.num_levels,
                                                 1, 1, 0) for _ in range(self.num_scales)])
        self.sigmoid = nn.Sigmoid()
        self.avgpool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        attention_feat = []
        for i, _mf in enumerate(x):           # one concatenated feature per scale
            _tmp_f = self.avgpool(_mf)        # squeeze: global average pooling
            _tmp_f = self.fc1[i](_tmp_f)
            _tmp_f = self.relu(_tmp_f)
            _tmp_f = self.fc2[i](_tmp_f)
            _tmp_f = self.sigmoid(_tmp_f)     # excitation: per-channel weights in (0, 1)
            attention_feat.append(_mf*_tmp_f) # rescale the concatenated channels
        return attention_feat
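
A quick sanity check with the shapes from the text (8 TUMs × 128 channels = 1024 per scale):

feats = [torch.randn(1, 1024, s, s) for s in (40, 20, 10, 5, 3, 1)]
sfam = SFAM(planes=128, num_levels=8, num_scales=6)
outs = sfam(feats)
print([tuple(o.shape) for o in outs])  # shapes unchanged; channels reweighted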

Detection stage

In the detection stage, two convolution layers are attached to each of the six aggregated features, one for box regression and one for classification. Six anchors with three aspect ratios are placed at each location, and the box detection ranges follow SSD. A score threshold of 0.05 is then used to filter out most low-confidence anchors, and soft-NMS is applied as post-processing to keep the more accurate boxes. Lowering the threshold to 0.01 gives better detection results, but slows things down.
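A minimal sketch of such a head (my illustration, not the repo's exact code; in_channels=1024 matches the aggregated SFAM features above, and num_classes=81 covers COCO plus background):

class DetectionHead(nn.Module):
    # One 3x3 conv predicts box offsets, another predicts class scores,
    # for num_anchors anchors at every spatial location.
    def __init__(self, in_channels=1024, num_anchors=6, num_classes=81):
        super(DetectionHead, self).__init__()
        self.loc = nn.Conv2d(in_channels, num_anchors * 4, 3, 1, 1)
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, 1, 1)

    def forward(self, feat):
        return self.loc(feat), self.cls(feat)  # (B, A*4, H, W), (B, A*C, H, W)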
A focal loss implementation based on TensorFlow (1.x):

import tensorflow as tf

def calc_focal_loss(cls_outputs, cls_targets, alpha=0.25, gamma=2.0):
    """
    Args:
        cls_outputs: [batch_size, num_anchors, num_classes]
        cls_targets: [batch_size, num_anchors, num_classes]
    Returns:
        cls_loss: [batch_size]
    Compute focal loss:
        FL = -(1 - pt)^gamma * log(pt), where pt = p if y == 1 else 1 - p
        cf. https://arxiv.org/pdf/1708.02002.pdf
    """
    positive_mask = tf.equal(cls_targets, 1.0)
    pos = tf.where(positive_mask, 1.0 - cls_outputs, tf.zeros_like(cls_outputs))
    neg = tf.where(positive_mask, tf.zeros_like(cls_outputs), cls_outputs)
    pos_loss = - alpha * tf.pow(pos, gamma) * tf.log(tf.clip_by_value(cls_outputs, 1e-15, 1.0))
    neg_loss = - (1 - alpha) * tf.pow(neg, gamma) * tf.log(tf.clip_by_value(1.0 - cls_outputs, 1e-15, 1.0))
    loss = tf.reduce_sum(pos_loss + neg_loss, axis=[1, 2])
    return loss
    
def calc_cls_loss(cls_outputs, cls_targets, positive_flag):
    batch_size = tf.shape(cls_outputs)[0]
    num_anchors = tf.to_float(tf.shape(cls_outputs)[1])
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]
    num_negatives = tf.minimum(3 * num_positives, num_anchors - num_positives) # neg_pos_ratio is 3
    negative_mask = tf.greater(num_negatives, 0)

    cls_outputs = tf.clip_by_value(cls_outputs, 1e-15, 1 - 1e-15)
    conf_loss = -tf.reduce_sum(cls_targets * tf.log(cls_outputs), axis=-1)
    pos_conf_loss = tf.reduce_sum(conf_loss * positive_flag, axis=1) 
    
    has_min = tf.to_float(tf.reduce_any(negative_mask)) # would be 0.0 if ALL num_neg are 0
    # append a fallback value so the masked reduce_min below is never empty
    num_neg = tf.concat(axis=0, values=[num_negatives, [(1 - has_min) * 100]])
    # minimum value under the condition the value > 0
    num_neg_batch = tf.reduce_min(tf.boolean_mask(num_neg, tf.greater(num_neg, 0)))
    num_neg_batch = tf.to_int32(num_neg_batch)
    max_confs = tf.reduce_max(cls_outputs[:, :, 1:], axis=2) # except background class
    _, indices = tf.nn.top_k(max_confs * (1 - positive_flag), k=num_neg_batch)
    batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
    batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
    full_indices = (tf.reshape(batch_idx, [-1]) * tf.to_int32(num_anchors) + tf.reshape(indices, [-1]))
    neg_conf_loss = tf.gather(tf.reshape(conf_loss, [-1]), full_indices)
    neg_conf_loss = tf.reshape(neg_conf_loss, [batch_size, num_neg_batch])
    neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

    cls_loss = pos_conf_loss + neg_conf_loss
    cls_loss /= (num_positives + tf.to_float(num_neg_batch))
    return cls_loss
    
def calc_box_loss(box_outputs, box_targets, positive_flag, delta=0.1):
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]
    normalizer = num_positives * 4
    normalizer = tf.where(tf.not_equal(normalizer, 0), normalizer, tf.ones_like(normalizer)) # to avoid division by 0

    loss_scale = 2.0 - box_targets[:, :, 2:3] * box_targets[:, :, 3:4]

    sq_loss = 0.5 * (box_targets - box_outputs) ** 2
    abs_loss = 0.5 * delta ** 2 + delta * (tf.abs(box_outputs - box_targets) - delta)
    l1_loss = tf.where(tf.less(tf.abs(box_outputs - box_targets), delta), sq_loss, abs_loss)

    box_loss = tf.reduce_sum(l1_loss, axis=-1, keepdims=True)
    box_loss = box_loss * loss_scale
    box_loss = tf.reduce_sum(box_loss, axis=-1)
    box_loss = tf.reduce_sum(box_loss * positive_flag, axis=-1)
    box_loss = box_loss / normalizer

    return box_loss

def calc_loss(y_true, y_pred, box_loss_weight):
    """
    Args:
        y_true: [batch_size, num_anchors, 4 + num_classes + 1]
        y_pred: [batch_size, num_anchors, 4 + num_classes]
            num_classes is including the back-ground class
            last element of y_true denotes if the box is positive or negative:
    Returns:
        total_loss:
    cf. https://github.com/tensorflow/tpu/blob/master/models/official/retinanet/retinanet_model.py
    """
    
    box_outputs = y_pred[:, :, :4]
    box_targets = y_true[:, :, :4]
    cls_outputs = y_pred[:, :, 4:]
    cls_targets = y_true[:, :, 4:-1]
    positive_flag = y_true[:, :, -1]
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]

    box_loss = calc_box_loss(box_outputs, box_targets, positive_flag)
    ##cls_loss = calc_cls_loss(cls_outputs, cls_targets, positive_flag)
    cls_loss = calc_focal_loss(cls_outputs, cls_targets)

    total_loss = cls_loss + box_loss_weight * box_loss

    return tf.reduce_mean(total_loss)
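
A hypothetical smoke test (TF 1.x graph mode; shapes follow the calc_loss docstring, with 20 object classes + background as an example):

import numpy as np

y_true = tf.placeholder(tf.float32, [None, 100, 26])  # 4 box + 21 classes + positive flag
y_pred = tf.placeholder(tf.float32, [None, 100, 25])  # 4 box + 21 classes
loss = calc_loss(y_true, y_pred, box_loss_weight=1.0)
with tf.Session() as sess:
    print(sess.run(loss, feed_dict={y_true: np.random.rand(2, 100, 26),
                                    y_pred: np.random.rand(2, 100, 25)}))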

Summary

This paper focuses mainly on improving the network structure, but eight TUMs strike me as computationally heavy. Reducing the number of TUMs while giving each one a learnable weight should achieve even better results, as sketched below.
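
A sketch of that idea (purely my illustration, not from the paper): learn one scalar weight per TUM and apply it before the scale-wise concatenation that SFAM consumes.

class WeightedConcat(nn.Module):
    # Learnable per-TUM weights, normalized with softmax, applied to the
    # same-scale outputs before they are concatenated.
    def __init__(self, num_levels=8):
        super(WeightedConcat, self).__init__()
        self.weights = nn.Parameter(torch.ones(num_levels))

    def forward(self, level_feats):  # list of (B, C, H, W) maps at one scale
        w = torch.softmax(self.weights, 0)
        return torch.cat([w[i] * f for i, f in enumerate(level_feats)], 1)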