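This AAAI 2019 paper proposes a new feature pyramid network, MLFPN; built on this new feature network, the detector achieves excellent results on the COCO dataset. This post walks through the paper by following the pipeline for detecting a single image, with some code attached.
Full code: M2Det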
M2Det
- Why propose this new feature pyramid architecture
- BackBone Network
- MLFPN
- TUM
- FFM
- SFAM
- Detection stage
- Summary
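Why propose this new feature pyramid architecture
A detector's performance depends heavily on how well it extracts features. To extract features thoroughly while coping with scale variation (objects of the same class appear at different sizes depending on their distance from the camera, and detection quality varies accordingly), researchers have developed two kinds of pyramid. The first is the image pyramid: the input image is rescaled and detection is run at several scales, but this is computationally expensive and slow. The second, more widely used, is the feature pyramid network (FPN), which has spawned many variants. M2Det is itself an FPN variant, as shown in the figure below.
Panel (d) is the main architecture of this paper. It looks complicated at first glance, but it mostly consists of repeated blocks.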
BackBone Network
The paper uses two families of backbone, VGG and ResNet; they are not discussed further here.
MLFPN
MLFPN is the pyramid network proposed in the paper. It consists of three parts: TUM (Thinned U-shape Module), FFM (Feature Fusion Module), and SFAM (Scale-wise Feature Aggregation Module). The detailed structure is shown below.
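As a rough, shape-only sketch of how the three parts fit together (not the authors' implementation): each of the eight TUMs emits six feature maps, and SFAM groups maps of equal size across TUMs and concatenates them along the channel axis. The 128-channel width and the six sizes are the figures quoted in this post; the function name `mlfpn_shapes` is made up for illustration.

```python
def mlfpn_shapes(num_levels=8, channels=128, sizes=(40, 20, 10, 5, 3, 1)):
    """Return {spatial size: channel count} after SFAM-style concatenation.

    Shape bookkeeping only: no convolution is performed.
    """
    # Every TUM (one per level) emits one (channels, s, s) map for each size s.
    per_level = [[(channels, s, s) for s in sizes] for _ in range(num_levels)]

    # Group maps of equal spatial size across levels and add up their channels,
    # which is what concatenation along the channel axis does to the shape.
    pyramid = {}
    for level_outputs in per_level:
        for c, h, _w in level_outputs:
            pyramid[h] = pyramid.get(h, 0) + c
    return pyramid


for size, ch in mlfpn_shapes().items():
    print(f"{size}x{size}: {ch} channels")
```

With the default arguments every scale ends up with 8 × 128 channels, which is where the 1024 in the SFAM section comes from.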
TUM
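Each TUM has an FPN-like structure, and eight TUMs are used in total. The structure of a single TUM is as follows:
The input is a tensor of shape (256, 40, 40). It goes through a series of downsampling steps, then upsampling, followed by 1×1 convolutions (which, according to the paper, keep the features smooth). Each TUM finally produces six feature maps at different scales: the smaller the map, the stronger its deep information; the larger the map, the stronger its shallow information. The (128, 40, 40) output is also passed through FFM to form part of the next TUM's input.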
import torch
import torch.nn as nn
import torch.nn.functional as F

# BasicConv is a convolution block defined elsewhere in the M2Det repository.


class TUM(nn.Module):
    def __init__(self, first_level=True, input_planes=128, is_smooth=True, side_channel=512, scales=6):
        super(TUM, self).__init__()
        self.is_smooth = is_smooth
        self.side_channel = side_channel
        self.input_planes = input_planes
        self.planes = 2 * self.input_planes
        self.first_level = first_level
        self.scales = scales
        # Every TUM after the first also receives the side feature, concatenated in forward().
        self.in1 = input_planes + side_channel if not first_level else input_planes

        # Encoder: stride-2 3x3 convs, followed by one unpadded 3x3 conv for the smallest scale.
        self.layers = nn.Sequential()
        self.layers.add_module('{}'.format(len(self.layers)), BasicConv(self.in1, self.planes, 3, 2, 1))
        for i in range(self.scales-2):
            if not i == self.scales - 3:
                self.layers.add_module(
                    '{}'.format(len(self.layers)),
                    BasicConv(self.planes, self.planes, 3, 2, 1)
                )
            else:
                self.layers.add_module(
                    '{}'.format(len(self.layers)),
                    BasicConv(self.planes, self.planes, 3, 1, 0)
                )
        self.toplayer = nn.Sequential(BasicConv(self.planes, self.planes, 1, 1, 0))

        # Lateral 3x3 convs applied to encoder features before they are added in the decoder.
        self.latlayer = nn.Sequential()
        for i in range(self.scales-2):
            self.latlayer.add_module(
                '{}'.format(len(self.latlayer)),
                BasicConv(self.planes, self.planes, 3, 1, 1)
            )
        self.latlayer.add_module('{}'.format(len(self.latlayer)),BasicConv(self.in1, self.planes, 3, 1, 1))

        # Optional 1x1 smoothing convs on the decoder outputs.
        if self.is_smooth:
            smooth = list()
            for i in range(self.scales-1):
                smooth.append(
                    BasicConv(self.planes, self.planes, 1, 1, 0)
                )
            self.smooth = nn.Sequential(*smooth)

    def _upsample_add(self, x, y, fuse_type='interp'):
        # Resize x to the spatial size of y, then add element-wise.
        _,_,H,W = y.size()
        if fuse_type=='interp':
            return F.interpolate(x, size=(H,W), mode='nearest') + y
        else:
            raise NotImplementedError
            #return nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1)

    def forward(self, x, y):
        if not self.first_level:
            x = torch.cat([x,y],1)
        # Encoder pass: keep the input and every intermediate feature map.
        conved_feat = [x]
        for i in range(len(self.layers)):
            x = self.layers[i](x)
            conved_feat.append(x)
        # Decoder pass: start from the smallest map and upsample step by step.
        deconved_feat = [self.toplayer[0](conved_feat[-1])]
        for i in range(len(self.latlayer)):
            deconved_feat.append(
                self._upsample_add(
                    deconved_feat[i], self.latlayer[i](conved_feat[len(self.layers)-1-i])
                )
            )
        if self.is_smooth:
            smoothed_feat = [deconved_feat[0]]
            for i in range(len(self.smooth)):
                smoothed_feat.append(
                    self.smooth[i](deconved_feat[i+1])
                )
            return smoothed_feat
        return deconved_feat
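The six spatial sizes can be checked by hand from the convolution settings in the TUM listing above. The sketch below is only arithmetic, under the assumption that the three trailing numbers passed to `BasicConv` are (kernel size, stride, padding); `conv_out` and `tum_encoder_sizes` are helper names invented for this example.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a square convolution (floor mode, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def tum_encoder_sizes(input_size=40, scales=6):
    """Spatial sizes seen by the TUM encoder, input included.

    Assumes BasicConv(..., 3, 2, 1) means kernel 3, stride 2, padding 1.
    """
    sizes = [input_size]
    # scales - 2 stride-2 convolutions roughly halve the map each time ...
    for _ in range(scales - 2):
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    # ... and a final unpadded 3x3 convolution shrinks the side by 2.
    sizes.append(conv_out(sizes[-1], kernel=3, stride=1, padding=0))
    return sizes


print(tum_encoder_sizes())
```

For a 40×40 input this gives the sizes 40, 20, 10, 5, 3 and 1 that the SFAM section refers to.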
FFM
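FFM comes in two versions, FFMv1 (panel a) and FFMv2 (panel b). FFMv1 concatenates the last two feature maps from the backbone; note that the deeper one must be upsampled first so that both have the same spatial size. FFMv2 concatenates the output of FFMv1 with the output of the previous TUM.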
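The upsample-then-concatenate step of FFMv1 can be illustrated on tiny nested lists standing in for tensors. This is a simplified sketch: it leaves out any convolutions the real module applies, and `upsample_nearest` and `ffm_v1_fuse` are names chosen for this example rather than functions from the M2Det code.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a [C][H][W] nested list by an integer factor."""
    return [
        [
            [row[w // factor] for w in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat each widened row `factor` times
        ]
        for channel in fmap
    ]


def ffm_v1_fuse(shallow, deep):
    """Upsample `deep` to the spatial size of `shallow`, then concatenate channels."""
    factor = len(shallow[0]) // len(deep[0])  # ratio of feature-map heights
    return shallow + upsample_nearest(deep, factor)


shallow = [[[0] * 4 for _ in range(4)]]       # 1 channel, 4x4 (shallower backbone map)
deep = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # 2 channels, 2x2 (deeper backbone map)
fused = ffm_v1_fuse(shallow, deep)
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

FFMv2 is the same concatenation without the upsampling, since the FFMv1 output and the previous TUM's largest output already share one spatial size.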
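SFAM
At this point we have eight sets of 128-channel features, each set covering six spatial sizes (1, 3, 5, 10, 20, 40). SFAM aims to aggregate the multi-level, multi-scale features produced by the TUMs into a multi-level feature pyramid. First, features of equal spatial size are concatenated, so each concatenated feature is n×n×1024 (128 × 8 = 1024) and contains features from every depth. Each of the six features is then squeezed to 1×1×1024 by global average pooling, and two fully connected layers (1×1 convolutions in the code) learn per-channel weights from it. These weights rescale the feature so that the channels most useful for detection at that scale are emphasized.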
import torch.nn as nn


class SFAM(nn.Module):
    def __init__(self, planes, num_levels, num_scales, compress_ratio=16):
        super(SFAM, self).__init__()
        self.planes = planes
        self.num_levels = num_levels
        self.num_scales = num_scales
        # compress_ratio is stored, but the reduction factor below is hard-coded to 16.
        self.compress_ratio = compress_ratio

        # Note: `[module] * n` repeats one module object, so every scale shares the
        # same weights. Use a list comprehension if each scale should get its own.
        self.fc1 = nn.ModuleList([nn.Conv2d(self.planes*self.num_levels,
                                            self.planes*self.num_levels // 16,
                                            1, 1, 0)] * self.num_scales)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.ModuleList([nn.Conv2d(self.planes*self.num_levels // 16,
                                            self.planes*self.num_levels,
                                            1, 1, 0)] * self.num_scales)
        self.sigmoid = nn.Sigmoid()
        self.avgpool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # x: one already-concatenated feature map per scale.
        attention_feat = []
        for i, _mf in enumerate(x):
            _tmp_f = self.avgpool(_mf)       # squeeze to 1x1 per channel
            _tmp_f = self.fc1[i](_tmp_f)     # reduce channels
            _tmp_f = self.relu(_tmp_f)
            _tmp_f = self.fc2[i](_tmp_f)     # restore channels
            _tmp_f = self.sigmoid(_tmp_f)    # per-channel weight in (0, 1)
            attention_feat.append(_mf*_tmp_f)
        return attention_feat
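The squeeze-and-reweight step in the SFAM listing above can also be followed on plain Python lists. The sketch below is a toy version with two channels and hand-picked weights, not the trained module; `global_avg_pool`, `dense` and `se_reweight` are illustrative names.

```python
import math


def global_avg_pool(fmap):
    """Mean of every channel of a [C][H][W] nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def dense(vec, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def se_reweight(fmap, w1, b1, w2, b2):
    """Squeeze (average pool), excite (FC, ReLU, FC, sigmoid), then scale channels."""
    squeezed = global_avg_pool(fmap)
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, gates)]
    return scaled, gates


fmap = [[[2.0, 4.0]], [[6.0, 8.0]]]  # 2 channels, 1x2 each
scaled, gates = se_reweight(
    fmap,
    w1=[[1.0, -1.0]], b1=[0.0],          # 2 channels -> 1 hidden unit
    w2=[[1.0], [2.0]], b2=[0.0, 0.0],    # 1 hidden unit -> 2 channel gates
)
print(gates, scaled)
```

Because each gate lies strictly between 0 and 1, the module can only attenuate channels relative to one another; it never changes the spatial layout of a feature map.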
Detection stage
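In the detection stage, two convolution layers are attached to each feature of the pyramid, one for bounding-box regression and one for classification. Six anchors with three aspect ratios are set at each pixel, and the detection scale ranges of the default boxes follow SSD. A score threshold of 0.05 filters out most low-scoring anchors, and soft-NMS is then applied as post-processing to keep the more accurate boxes. Lowering the threshold to 0.01 gives better detection results but slows inference.
The loss functions, including focal loss, implemented in TensorFlow: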
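To make the post-processing concrete, here is a minimal pure-Python sketch of score filtering followed by soft-NMS. It assumes the linear variant of soft-NMS and an IoU threshold of 0.3, neither of which is specified in this post; only the 0.05 score threshold comes from the text. Boxes are `(x1, y1, x2, y2)` tuples, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_thresh=0.3, score_thresh=0.05):
    """Return [(box, score), ...] in the order the boxes are selected."""
    # Drop low-scoring candidates before any overlap test.
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][0])
        best_score, best_box = candidates.pop(best)
        kept.append((best_box, best_score))
        rescored = []
        for s, b in candidates:
            overlap = iou(best_box, b)
            if overlap > iou_thresh:
                s *= 1.0 - overlap  # decay the score instead of deleting the box
            if s >= score_thresh:
                rescored.append((s, b))
        candidates = rescored
    return kept


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30), (0, 0, 1, 1)]
scores = [0.9, 0.8, 0.7, 0.01]
for box, score in soft_nms_linear(boxes, scores):
    print(box, round(score, 3))
```

Unlike hard NMS, the heavily overlapping second box survives with a reduced score rather than being removed outright, which is also why a lower score threshold keeps more boxes and costs more time.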
import tensorflow as tf

# Note: this listing uses the TensorFlow 1.x API (tf.log, tf.to_float, tf.to_int32).


def calc_focal_loss(cls_outputs, cls_targets, alpha=0.25, gamma=2.0):
    """
    Args:
        cls_outputs: [batch_size, num_anchors, num_classes]
        cls_targets: [batch_size, num_anchors, num_classes]
    Returns:
        cls_loss: [batch_size]

    Compute focal loss:
        FL = -(1 - pt)^gamma * log(pt), where pt = p if y == 1 else 1 - p
        cf. https://arxiv.org/pdf/1708.02002.pdf
    """
    positive_mask = tf.equal(cls_targets, 1.0)
    # (1 - p) for positive entries and p for negative entries, zero elsewhere.
    pos = tf.where(positive_mask, 1.0 - cls_outputs, tf.zeros_like(cls_outputs))
    neg = tf.where(positive_mask, tf.zeros_like(cls_outputs), cls_outputs)
    pos_loss = - alpha * tf.pow(pos, gamma) * tf.log(tf.clip_by_value(cls_outputs, 1e-15, 1.0))
    neg_loss = - (1 - alpha) * tf.pow(neg, gamma) * tf.log(tf.clip_by_value(1.0 - cls_outputs, 1e-15, 1.0))
    # Sum over anchors and classes to get one value per image.
    loss = tf.reduce_sum(pos_loss + neg_loss, axis=[1, 2])
    return loss
def calc_cls_loss(cls_outputs, cls_targets, positive_flag):
    # Cross-entropy with hard negative mining (SSD style).
    batch_size = tf.shape(cls_outputs)[0]
    num_anchors = tf.to_float(tf.shape(cls_outputs)[1])
    num_positives = tf.reduce_sum(positive_flag, axis=-1)  # shape: [batch_size,]
    num_negatives = tf.minimum(3 * num_positives, num_anchors - num_positives)  # neg_pos_ratio is 3
    negative_mask = tf.greater(num_negatives, 0)

    cls_outputs = tf.clip_by_value(cls_outputs, 1e-15, 1 - 1e-15)
    conf_loss = -tf.reduce_sum(cls_targets * tf.log(cls_outputs), axis=-1)
    pos_conf_loss = tf.reduce_sum(conf_loss * positive_flag, axis=1)

    has_min = tf.to_float(tf.reduce_any(negative_mask))  # would be 0.0 if ALL num_neg are 0
    # Append a fallback of 100 negatives, active only when no image has any negatives.
    num_neg = tf.concat(axis=0, values=[num_negatives, [(1 - has_min) * 100]])

    # minimum value under the condition the value > 0
    # (taken over num_neg so the fallback applies; the mask would otherwise be empty)
    num_neg_batch = tf.reduce_min(tf.boolean_mask(num_neg, tf.greater(num_neg, 0)))
    num_neg_batch = tf.to_int32(num_neg_batch)

    max_confs = tf.reduce_max(cls_outputs[:, :, 1:], axis=2)  # except background class
    # Pick the negatives with the highest non-background confidence.
    _, indices = tf.nn.top_k(max_confs * (1 - positive_flag), k=num_neg_batch)

    batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
    batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
    full_indices = (tf.reshape(batch_idx, [-1]) * tf.to_int32(num_anchors) + tf.reshape(indices, [-1]))

    neg_conf_loss = tf.gather(tf.reshape(conf_loss, [-1]), full_indices)
    neg_conf_loss = tf.reshape(neg_conf_loss, [batch_size, num_neg_batch])
    neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

    cls_loss = pos_conf_loss + neg_conf_loss
    cls_loss /= (num_positives + tf.to_float(num_neg_batch))
    return cls_loss
def calc_box_loss(box_outputs, box_targets, positive_flag, delta=0.1):
    num_positives = tf.reduce_sum(positive_flag, axis=-1)  # shape: [batch_size,]
    normalizer = num_positives * 4
    normalizer = tf.where(tf.not_equal(normalizer, 0), normalizer, tf.ones_like(normalizer))  # to avoid division by 0

    # Scale factor computed from the last two box-target coordinates.
    loss_scale = 2.0 - box_targets[:, :, 2:3] * box_targets[:, :, 3:4]

    # Huber-style loss: quadratic below delta, linear above it.
    sq_loss = 0.5 * (box_targets - box_outputs) ** 2
    abs_loss = 0.5 * delta ** 2 + delta * (tf.abs(box_outputs - box_targets) - delta)
    l1_loss = tf.where(tf.less(tf.abs(box_outputs - box_targets), delta), sq_loss, abs_loss)

    box_loss = tf.reduce_sum(l1_loss, axis=-1, keepdims=True)
    box_loss = box_loss * loss_scale
    box_loss = tf.reduce_sum(box_loss, axis=-1)
    # Only positive anchors contribute to the box loss.
    box_loss = tf.reduce_sum(box_loss * positive_flag, axis=-1)
    box_loss = box_loss / normalizer
    return box_loss
def calc_loss(y_true, y_pred, box_loss_weight):
    """
    Args:
        y_true: [batch_size, num_anchors, 4 + num_classes + 1]
        y_pred: [batch_size, num_anchors, 4 + num_classes]
            num_classes includes the background class
            the last element of y_true denotes whether the box is positive or negative
    Returns:
        total_loss: scalar, mean over the batch

    cf. https://github.com/tensorflow/tpu/blob/master/models/official/retinanet/retinanet_model.py
    """
    box_outputs = y_pred[:, :, :4]
    box_targets = y_true[:, :, :4]
    cls_outputs = y_pred[:, :, 4:]
    cls_targets = y_true[:, :, 4:-1]
    positive_flag = y_true[:, :, -1]

    box_loss = calc_box_loss(box_outputs, box_targets, positive_flag)
    # Alternative: cross-entropy with hard negative mining.
    # cls_loss = calc_cls_loss(cls_outputs, cls_targets, positive_flag)
    cls_loss = calc_focal_loss(cls_outputs, cls_targets)

    total_loss = cls_loss + box_loss_weight * box_loss
    return tf.reduce_mean(total_loss)
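To see what the focal term in the listing above does to a single prediction, the scalar sketch below compares it with plain cross-entropy. It mirrors the alpha and gamma defaults of `calc_focal_loss` but is a stand-alone illustration without TensorFlow; `binary_focal_loss` and `cross_entropy` are names made up for this example.

```python
import math


def cross_entropy(p, y, eps=1e-15):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-15):
    """Focal loss for one prediction: cross-entropy scaled by (1 - pt)^gamma."""
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


for p in (0.9, 0.5, 0.1):
    ce = cross_entropy(p, 1)
    fl = binary_focal_loss(p, 1)
    print(f"p={p}: CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

The FL/CE ratio shrinks as the prediction gets more confident and correct, so the many easy background anchors contribute far less to the total than the few hard examples.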
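Summary
This paper focuses mainly on improving the network structure, but in my view eight TUMs are too computationally expensive. Reducing the number of TUMs appropriately and giving each TUM a learnable weight might achieve better results.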