This post only covers the difference in output format between YOLOv5 and YOLOv8.

YOLO detection in a nutshell: YOLO casts object detection as a regression problem, so a single forward pass produces both the localization and the classification of every object.

YOLOv5 output format

YOLOv5: for each anchor, the head outputs xywh (4 channels), a box/objectness confidence (1 channel), and per-class confidences (one channel per class). With COCO's 80 classes that is 4 + 1 + 80 = 85 values per anchor, and with 3 anchors per grid cell you get the 255 channels seen in the code comment below.

A note on units: the raw xy/wh values coming out of the head are not pixels but normalized activations; the decode step below converts them into pixel coordinates of the letterboxed network input (via the grid, the anchor sizes, and the stride). Also note that xy is the center of the box, not its top-left corner.

This is implemented in models/yolo.py, around line 38:

def forward(self, x):
    z = []  # inference output
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

        if not self.training:  # inference
            if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
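            # grid holds per-cell offsets (the -0.5 centering shift is already
            # folded in, in this version); anchor_grid holds per-anchor wh in
            # input-image pixels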

            if isinstance(self, Segment):  # (boxes + masks)
                xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
            else:  # Detect (boxes only)
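                # conf bundles objectness (1 channel) + class scores (nc channels);
                # everything is sigmoided in one go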
                xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, conf), 4)
            z.append(y.view(bs, self.na * nx * ny, self.no))

    return x if self.training else (torch.cat(z, 1), ) if self.export else (torch.cat(z, 1), x)
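
To make the layout above concrete, here is a minimal decoding sketch. It is not code from the repository: the random tensor stands in for a real (1, 25200, 85) inference output at a 640x640 input, and the 0.25/0.45 thresholds are illustrative defaults, not values from this post.

import torch
import torchvision

# Stand-in for a real YOLOv5 inference output at a 640x640 input (assumption):
# 25200 = 3 anchors x (80*80 + 40*40 + 20*20) cells, 85 = 4 box + 1 obj + 80 classes.
pred = torch.rand(1, 25200, 85)

xywh = pred[..., :4]  # box center x, y and width, height, in input-image pixels
obj = pred[..., 4]    # box (objectness) confidence
cls = pred[..., 5:]   # per-class confidence

# final score per box = objectness * best class score
scores, cls_id = (obj.unsqueeze(-1) * cls).max(-1)

# center-based xywh -> corner-based xyxy, which NMS expects
x, y, w, h = xywh.unbind(-1)
xyxy = torch.stack((x - w / 2, y - h / 2, x + w / 2, y + h / 2), dim=-1)

keep = scores[0] > 0.25  # illustrative confidence threshold
boxes, s, c = xyxy[0][keep], scores[0][keep], cls_id[0][keep]
i = torchvision.ops.nms(boxes, s, 0.45)  # illustrative IoU threshold
print(boxes[i].shape, s[i].shape, c[i].shape)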

YOLOv8 output format

YOLOv8: for each anchor (YOLOv8 is anchor-free, so "anchor" here means each prediction point on the grid), the head outputs xywh (4 channels) and per-class confidences (one channel per class). There is no separate objectness channel. With COCO's 80 classes that is 4 + 80 = 84 values per prediction, matching the shape(1,84,6300) comment in the code below.

Here again xy is the center of the box, not its top-left corner, which is why the code below converts with xywh2xyxy before NMS. By the time the tensor reaches ops.py, the head has already decoded the boxes into pixel coordinates of the letterboxed network input, so after NMS they still need to be rescaled to the original image.

This is visible in ultralytics/utils/ops.py, around line 162 (inside non_max_suppression):

bs = prediction.shape[0]  # batch size
nc = nc or (prediction.shape[1] - 4)  # number of classes
nm = prediction.shape[1] - nc - 4
mi = 4 + nc  # mask start index
xc = prediction[:, 4:mi].amax(1) > conf_thres  # candidates
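# Unlike YOLOv5 there is no objectness channel: a candidate is any prediction
# whose best class score already exceeds conf_thres.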

# Settings
# min_wh = 2  # (pixels) minimum box width and height
time_limit = 2.0 + max_time_img * bs  # seconds to quit after
multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)

prediction = prediction.transpose(-1, -2)  # shape(1,84,6300) to shape(1,6300,84)
if not rotated:
    if in_place:
        prediction[..., :4] = xywh2xyxy(prediction[..., :4])  # xywh to xyxy
    else:
        prediction = torch.cat((xywh2xyxy(prediction[..., :4]), prediction[..., 4:]), dim=-1)  # xywh to xyxy

t = time.time()
output = [torch.zeros((0, 6 + nm), device=prediction.device)] * bs
for xi, x in enumerate(prediction):  # image index, image inference
    # Apply constraints
    # x[((x[:, 2:4] < min_wh) | (x[:, 2:4] > max_wh)).any(1), 4] = 0  # width-height
    x = x[xc[xi]]  # confidence

    # Cat apriori labels if autolabelling
    if labels and len(labels[xi]) and not rotated:
        lb = labels[xi]
        v = torch.zeros((len(lb), nc + nm + 4), device=x.device)
        v[:, :4] = xywh2xyxy(lb[:, 1:5])  # box
        v[range(len(lb)), lb[:, 0].long() + 4] = 1.0  # cls
        x = torch.cat((x, v), 0)

    # If none remain process next image
    if not x.shape[0]:
        continue

    # Detections matrix nx6 (xyxy, conf, cls)
    box, cls, mask = x.split((4, nc, nm), 1)

    if multi_label:
        i, j = torch.where(cls > conf_thres)
        x = torch.cat((box[i], x[i, 4 + j, None], j[:, None].float(), mask[i]), 1)
    else:  # best class only
        conf, j = cls.max(1, keepdim=True)
        x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]

    # Filter by class
    if classes is not None:
        x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

    # Check shape
    n = x.shape[0]  # number of boxes
    if not n:  # no boxes
        continue
    if n > max_nms:  # excess boxes
        x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence and remove excess boxes

    # Batched NMS
    c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
    scores = x[:, 4]  # scores
    if rotated:
        boxes = torch.cat((x[:, :2] + c, x[:, 2:4], x[:, -1:]), dim=-1)  # xywhr
        i = nms_rotated(boxes, scores, iou_thres)
    else:
        boxes = x[:, :4] + c  # boxes (offset by class)
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
    i = i[:max_det]  # limit detections
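
For comparison, the same minimal sketch for YOLOv8. Again this is an illustration, not repository code: the random tensor stands in for a real (1, 84, 8400) head output at a 640x640 input, and a class-agnostic NMS is used for brevity.

import torch
import torchvision

# Stand-in for a real YOLOv8 head output at a 640x640 input (assumption):
# 8400 = 80*80 + 40*40 + 20*20 anchor points, 84 = 4 box + 80 classes.
# Unlike YOLOv5 there is no objectness channel.
pred = torch.rand(1, 84, 8400)

pred = pred.transpose(-1, -2)  # (1, 84, 8400) -> (1, 8400, 84), as in ops.py
xywh = pred[..., :4]           # box center x, y and width, height, input pixels
cls = pred[..., 4:]            # per-class confidence

scores, cls_id = cls.max(-1)   # final score is simply the best class score

x, y, w, h = xywh.unbind(-1)
xyxy = torch.stack((x - w / 2, y - h / 2, x + w / 2, y + h / 2), dim=-1)

keep = scores[0] > 0.25        # illustrative confidence threshold
boxes, s, c = xyxy[0][keep], scores[0][keep], cls_id[0][keep]
i = torchvision.ops.nms(boxes, s, 0.45)  # class-agnostic NMS, illustrative IoU
print(boxes[i].shape, s[i].shape, c[i].shape)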