If you are just getting started with object detection, YOLOv1 is usually the first algorithm you learn, since it is where the YOLO family began. I'm writing this post partly as my own study notes, and I hope others can learn from it too. Below I walk through the main pieces of knowledge needed to implement YOLOv1.
First, a high-level picture of YOLOv1: as shown in the figure below, you feed an image (or a sequence of frames, i.e. a video) into a trained model, and it directly performs both classification and localization, e.g. it finds the dog and the car in the image below along with their positions. For now, just treat the middle part as a black-box model; we will unpack it shortly. Much of this post draws on the Zhihu series YOLO-从零开始入门目标检测; if you're interested, that author's articles on object detection are well worth reading.
1. Dataset
The dataset is the key ingredient of any detection project; without data, nothing else matters. For object detection I recommend the VOC2007/2012 and COCO datasets. Below is the copy I downloaded myself, shared here to save you the search.
链接:https://pan.baidu.com/s/1niIAmSmoHa84aNywk-MZQw
提取码:2222
2. Model Architecture
Let's start with the overall YOLOv1 model, shown in the figure below:
From the diagram we can see that an image of shape [w, h, 3] is resized to [448, 448, 3] and passed through the convolutional network. In the original paper the output is [7, 7, 30], where 7 is the spatial size of the feature grid and 30 = 2×(4+1)+20: two boxes per cell, each with 4 coordinates and 1 confidence, plus 20 class scores. The simplified implementation below predicts a single box per cell, so each cell outputs 1 + 4 + 20 = 25 values: 1 for the probability that an object is present; 4 for the box, namely the center-point offsets (tx, ty) and the width/height (tw, th) (note that unlike the center coordinates these are not offsets relative to a grid cell; as the label code later shows, they encode the box size itself, in log space); and 20 class scores. The code for this model is given below:
import torch
import torch.nn as nn
import torch.utils.model_zoo as model_zoo
import torch.nn.functional as F
import numpy as np
__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
'resnet152']
model_urls = {
'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth',
'resnet34': 'https://download.pytorch.org/models/resnet34-333f7ec4.pth',
'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth',
'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth',
'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth',
}
def conv3x3(in_planes, out_planes, stride=1):
"""3x3 convolution with padding"""
return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
padding=1, bias=False)
def conv1x1(in_planes, out_planes, stride=1):
"""1x1 convolution"""
return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
class BasicBlock(nn.Module):
expansion = 1
def __init__(self, inplanes, planes, stride=1, downsample=None):
super(BasicBlock, self).__init__()
self.conv1 = conv3x3(inplanes, planes, stride)
self.bn1 = nn.BatchNorm2d(planes)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(planes, planes)
self.bn2 = nn.BatchNorm2d(planes)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class Bottleneck(nn.Module):
expansion = 4
def __init__(self, inplanes, planes, stride=1, downsample=None):
super(Bottleneck, self).__init__()
self.conv1 = conv1x1(inplanes, planes)
self.bn1 = nn.BatchNorm2d(planes)
self.conv2 = conv3x3(planes, planes, stride)
self.bn2 = nn.BatchNorm2d(planes)
self.conv3 = conv1x1(planes, planes * self.expansion)
self.bn3 = nn.BatchNorm2d(planes * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
out = self.conv3(out)
out = self.bn3(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Module):
def __init__(self, block, layers, zero_init_residual=False):
super(ResNet, self).__init__()
self.inplanes = 64
self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
bias=False)
self.bn1 = nn.BatchNorm2d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
for m in self.modules():
if isinstance(m, Bottleneck):
nn.init.constant_(m.bn3.weight, 0)
elif isinstance(m, BasicBlock):
nn.init.constant_(m.bn2.weight, 0)
def _make_layer(self, block, planes, blocks, stride=1):
downsample = None
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
conv1x1(self.inplanes, planes * block.expansion, stride),
nn.BatchNorm2d(planes * block.expansion),
)
layers = []
layers.append(block(self.inplanes, planes, stride, downsample))
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(block(self.inplanes, planes))
return nn.Sequential(*layers)
def forward(self, x):
C_1 = self.conv1(x)
C_1 = self.bn1(C_1)
C_1 = self.relu(C_1)
C_1 = self.maxpool(C_1)
C_2 = self.layer1(C_1)
C_3 = self.layer2(C_2)
C_4 = self.layer3(C_3)
C_5 = self.layer4(C_4)
return C_5
def resnet18(pretrained=False, **kwargs):
"""Constructs a ResNet-18 model.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
"""
model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
if pretrained:
# strict = False as we don't need fc layer params.
model.load_state_dict(model_zoo.load_url(model_urls['resnet18']), strict=False)
return model
def resnet34(pretrained=False, **kwargs):
"""Constructs a ResNet-34 model.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
"""
model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)
if pretrained:
model.load_state_dict(model_zoo.load_url(model_urls['resnet34']), strict=False)
return model
def resnet50(pretrained=False, **kwargs):
"""Constructs a ResNet-50 model.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
"""
model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
if pretrained:
model.load_state_dict(model_zoo.load_url(model_urls['resnet50']), strict=False)
return model
def resnet101(pretrained=False, **kwargs):
"""Constructs a ResNet-101 model.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
"""
model = ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
if pretrained:
model.load_state_dict(model_zoo.load_url(model_urls['resnet101']), strict=False)
return model
def resnet152(pretrained=False, **kwargs):
"""Constructs a ResNet-152 model.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
"""
model = ResNet(Bottleneck, [3, 8, 36, 3], **kwargs)
if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152']), strict=False)
return model
# Build the SPP module
class Conv(nn.Module):
def __init__(self, c1, c2, k, s=1, p=0, d=1, g=1, act=True):
super(Conv, self).__init__()
self.convs = nn.Sequential(
nn.Conv2d(c1, c2, k, stride=s, padding=p, dilation=d, groups=g),
nn.BatchNorm2d(c2),
nn.LeakyReLU(0.1, inplace=True) if act else nn.Identity()
)
def forward(self, x):
return self.convs(x)
class SPP(nn.Module):
"""
Spatial Pyramid Pooling
"""
def __init__(self):
super(SPP, self).__init__()
def forward(self, x):
x_1 = torch.nn.functional.max_pool2d(x, 5, stride=1, padding=2)
x_2 = torch.nn.functional.max_pool2d(x, 9, stride=1, padding=4)
x_3 = torch.nn.functional.max_pool2d(x, 13, stride=1, padding=6)
x = torch.cat([x, x_1, x_2, x_3], dim=1)
return x
# Build the full network
class Yolov1(nn.Module):
    def __init__(self, num_class=20):
        super(Yolov1, self).__init__()
        self.num_class = num_class
        self.backbone = resnet18(pretrained=False)
        c5 = 512  # number of channels output by the ResNet-18 backbone
self.neck = nn.Sequential(
SPP(),
Conv(c5 * 4, c5, k=1),
)
# detection head
self.convsets = nn.Sequential(
Conv(c5, 256, k=1),
Conv(256, 512, k=3, p=1),
Conv(512, 256, k=1),
Conv(256, 512, k=3, p=1)
)
# pred
self.pred = nn.Conv2d(512, 1 + self.num_class + 4, 1)
    def forward(self, x):
        B, C, H, W = x.shape  # PyTorch layout is [batch, channel, height, width]
        # backbone
        c5 = self.backbone(x)
        # neck
        p5 = self.neck(c5)
        # detection head
        p5 = self.convsets(p5)
        # prediction layer
        pred = self.pred(p5)
        # reshape [B, C, H, W] -> [B, H*W, C] so each row holds one grid cell's prediction
        pred = pred.view(B, pred.size(1), -1).permute(0, 2, 1)
        # objectness prediction: [B, H*W, 1]
        conf_pred = pred[..., 0:1]
        # class prediction: [B, H*W, num_class]
        cls_pred = pred[..., 1:1 + self.num_class]
        # bbox prediction: [B, H*W, 4]
        txtytwth_pred = pred[..., 1 + self.num_class:]
        return pred, conf_pred, cls_pred, txtytwth_pred
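Before moving on, here is a quick shape check, a minimal sketch with a random input (note: ResNet-18 has a total stride of 32, so a 448×448 input produces a 14×14 = 196-cell grid here rather than the paper's 7×7):

model = Yolov1(num_class=20)
model.eval()
x = torch.randn(1, 3, 448, 448)  # dummy batch with one image
with torch.no_grad():
    pred, conf_pred, cls_pred, txtytwth_pred = model(x)
print(pred.shape)           # torch.Size([1, 196, 25])
print(conf_pred.shape)      # torch.Size([1, 196, 1])
print(cls_pred.shape)       # torch.Size([1, 196, 20])
print(txtytwth_pred.shape)  # torch.Size([1, 196, 4])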
Note: you have probably read in other blog posts that the image is divided into S×S grid cells, which can sound confusing at first, so let me explain. It refers exactly to the [7, 7, 30] output we just obtained: think of each 7×7 slice as a sheet of paper, so there are 30 such sheets. One 7×7 sheet can be viewed as 49 cells, and each cell corresponds to a (448/7)×(448/7) = 64×64 patch of the input image. The input image is therefore effectively covered by 49 cells of size 64×64, and S is simply 7. A tiny worked example of this mapping is given right below.
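To make the mapping concrete, here is a minimal sketch (the center coordinates are arbitrary values chosen for illustration):

input_size = 448
S = 7
cell_size = input_size // S  # 64 pixels per cell

# an object whose center lies at pixel (210, 310) on the 448x448 image...
cx, cy = 210, 310
grid_x, grid_y = cx // cell_size, cy // cell_size
print(grid_x, grid_y)  # -> 3 4, i.e. the object is assigned to cell (3, 4)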
3. Assigning Positive and Negative Samples
The dataset gives you, for every image, the annotations of all objects to be detected, i.e. the ground-truth boxes. What you need to do is process these annotations into a form that can be fed into the network and used in the loss computation. Concretely, the center coordinates are converted into offsets relative to their grid cell, which makes training converge more easily.
def generate_dxdywh(gt_label, w, h, s):
    xmin, ymin, xmax, ymax = gt_label[:-1]
    # compute the box center, width and height in pixels
    # (the raw labels are normalized to [0, 1], so scale by the image size)
    c_x = (xmax + xmin) / 2 * w
    c_y = (ymax + ymin) / 2 * h
    box_w = (xmax - xmin) * w
    box_h = (ymax - ymin) * h
    if box_w < 1e-4 or box_h < 1e-4:
        # reject degenerate boxes
        return False
    # grid cell containing the center point
    c_x_s = c_x / s
    c_y_s = c_y / s
    grid_x = int(c_x_s)
    grid_y = int(c_y_s)
    # regression targets: center offsets within the cell, log-space width/height
    tx = c_x_s - grid_x
    ty = c_y_s - grid_y
    tw = np.log(box_w)
    th = np.log(box_h)
    # loss weight for the box term: smaller boxes get a larger weight
    weight = 2.0 - (box_w / w) * (box_h / h)
    return grid_x, grid_y, tx, ty, tw, th, weight
def gt_creator(input_size, stride, label_lists=[]):
    # basic parameters
    batch_size = len(label_lists)
    w = input_size
    h = input_size
    ws = w // stride
    hs = h // stride
    s = stride
    # per cell: [objectness, class index, tx, ty, tw, th, box weight]
    gt_tensor = np.zeros([batch_size, hs, ws, 1 + 1 + 4 + 1])
    # build the training targets
    for batch_index in range(batch_size):
        for gt_label in label_lists[batch_index]:
            gt_class = int(gt_label[-1])
            result = generate_dxdywh(gt_label, w, h, s)
            if result:
                grid_x, grid_y, tx, ty, tw, th, weight = result
                # only keep the label if its center cell lies inside the grid
                if grid_x < gt_tensor.shape[2] and grid_y < gt_tensor.shape[1]:
                    gt_tensor[batch_index, grid_y, grid_x, 0] = 1.0
                    gt_tensor[batch_index, grid_y, grid_x, 1] = gt_class
                    gt_tensor[batch_index, grid_y, grid_x, 2:6] = np.array([tx, ty, tw, th])
                    gt_tensor[batch_index, grid_y, grid_x, 6] = weight
    gt_tensor = gt_tensor.reshape(batch_size, -1, 1 + 1 + 4 + 1)
    return torch.from_numpy(gt_tensor).float()
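To see what these two functions produce, here is a minimal sketch with one hand-made label (the box values are arbitrary; a label is [xmin, ymin, xmax, ymax, class_index] with coordinates normalized to [0, 1]; stride=64 is used here just to get the 7×7 grid from the text, whereas the ResNet-18 model above actually has a total stride of 32):

labels = [[[0.25, 0.40, 0.75, 0.90, 11]]]  # one image containing one object of class 11
gt = gt_creator(input_size=448, stride=64, label_lists=labels)
print(gt.shape)  # torch.Size([1, 49, 7]): 7x7 = 49 cells, 7 values per cell
# exactly one cell (the one containing the box center) is marked positive
print(int(gt[0, :, 0].sum()))  # -> 1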
Note: in other blog posts you will often see the sentence "the grid cell that an object's center point falls into is the one responsible for regressing that object's box." It is actually easy to interpret. Since we have the ground-truth box, we know which grid cell its center lies in (grid cells are the concept explained in the note above, and the grid itself can be generated in code, shown below). The network produces a prediction at every grid cell, and those predictions can be mapped back to input-image coordinates with the appropriate formulas; but for the loss we only take the cell containing a ground-truth center as a positive sample, and compare the prediction at that same cell position against the ground truth. That is all the sentence means. The grid-building code follows.
# build the grid
def create_grid(input_size, stride):
    input_w, input_h = input_size
    grid_w, grid_h = input_w // stride, input_h // stride  # e.g. a 224x224 input with stride=32 gives 7, 7
    # cell indices; the ordering matches the row-major flattening of the feature map,
    # so entry k corresponds to cell (x, y) = (k % grid_w, k // grid_w)
    # (indexing='ij' requires PyTorch >= 1.10)
    grid_y, grid_x = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing='ij')
    grid_xy = torch.stack([grid_x, grid_y], dim=-1).float()
    grid_xy = grid_xy.view(1, grid_h * grid_w, 2)
    return grid_xy
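A quick check of the layout on a tiny 2×2 grid (a sketch; the input size is arbitrary):

grid = create_grid((128, 128), stride=64)  # 2x2 grid
print(grid)  # -> tensor([[[0., 0.], [1., 0.], [0., 1.], [1., 1.]]])
# entry k holds (x, y) = (k % 2, k // 2), matching the flattened predictions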
4. Loss Computation
The loss computation is the heart of the algorithm; without it, the algorithm goes nowhere. Looking at the loss diagram, the coefficients in front of the terms are there purely for balancing, so don't read too much into them. The first part is the coordinate loss, computed from the box's center coordinates together with its width and height. The indicator symbol (the "1" in front of each term) marks whether a grid cell contains an object: 1 if it does, 0 otherwise. The remaining parts are the confidence loss and the classification loss, which anyone who has studied this kind of algorithm will recognize, so I won't dwell on them. The paper's loss is written out below for reference.
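For reference, this is the loss from the original YOLOv1 paper (note that the implementation in this post simplifies it: one box per cell, and log-space width/height instead of square roots), with λ_coord = 5 and λ_noobj = 0.5:

\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i-\hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}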
Here is the decoding step for reference. You don't need to understand every line right now; get the rough idea, then find a codebase on GitHub, run it, and revisit this.
def decode_pred(pred, grid_cell, stride):
    output = torch.zeros_like(pred)
    # center: sigmoid offset plus the cell index, in grid units
    pred[:, :, :2] = torch.sigmoid(pred[:, :, :2]) + grid_cell
    # width/height: undo the log encoding used in the labels
    pred[:, :, 2:] = torch.exp(pred[:, :, 2:])
    # convert every bbox from center/width/height to x1y1x2y2 on the input image
    output[:, :, 0] = pred[:, :, 0] * stride - pred[:, :, 2] / 2
    output[:, :, 1] = pred[:, :, 1] * stride - pred[:, :, 3] / 2
    output[:, :, 2] = pred[:, :, 0] * stride + pred[:, :, 2] / 2
    output[:, :, 3] = pred[:, :, 1] * stride + pred[:, :, 3] / 2
    return output
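Putting the pieces together, here is a minimal sketch continuing the shape check from section 2 (stride 32 matches the ResNet-18 backbone, hence the 14×14 = 196-cell grid):

grid = create_grid((448, 448), stride=32)  # [1, 196, 2]
with torch.no_grad():
    boxes = decode_pred(txtytwth_pred, grid, stride=32)
print(boxes.shape)  # torch.Size([1, 196, 4]): x1y1x2y2 in input-image pixels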
5. Inference
No loss computation is needed at this stage, so here is the big picture: feed a test image into the trained network (loaded with the best weights), and it outputs the grid of predictions described above. Decode the predicted offsets back into coordinates relative to the input image, then run non-maximum suppression, since many overlapping boxes will be predicted. After that you get results like the ones shown below. As for the rectangles, they are drawn with OpenCV from the predicted top-left and bottom-right corners, not by hand, haha. A simple NMS sketch is given below. Good luck with your studies!
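For completeness, a minimal NMS sketch (a standard NumPy implementation, not the tutorial's exact code; boxes is an [N, 4] array in x1y1x2y2 form, and scores would be the objectness multiplied by the best class probability):

def nms(boxes, scores, iou_threshold=0.5):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the current best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-14)
        # drop boxes that overlap the best box too much, keep the rest
        order = order[1:][iou <= iou_threshold]
    return keep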