TorchVision Object Detection Finetuning
This tutorial fine-tunes a pre-trained Mask R-CNN model on the Penn-Fudan Database for Pedestrian Detection and Segmentation, which contains 170 images with 345 pedestrian instances.
Defining the Dataset
The reference scripts for training object detection, instance segmentation and person keypoint detection make it easy to add support for new custom datasets. The dataset should inherit from the standard torch.utils.data.Dataset class and implement __len__ and __getitem__.
__getitem__ should return:
- image: a PIL image of size (H, W)
- target: a dict containing the following fields (an illustrative example follows this list)
  - boxes (FloatTensor[N, 4]): the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, with values ranging from 0 to W and 0 to H
  - labels (Int64Tensor[N]): the class label for each bounding box; 0 represents the background class
  - image_id (Int64Tensor[1]): an image identifier, unique across the dataset
  - area (Tensor[N]): the area of each bounding box; used by the COCO metric to report results separately for small, medium and large objects
  - iscrowd (UInt8Tensor[N]): instances with iscrowd=True are ignored during evaluation
  - (optionally) masks (UInt8Tensor[N, H, W]): the segmentation mask for each object
  - (optionally) keypoints (FloatTensor[N, K, 3]): for each of the N objects, its K keypoints in [x, y, visibility] format, where visibility=0 means the keypoint is not visible
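For illustration, a minimal target for an image containing two objects could look like this (a sketch with made-up values; the optional masks and keypoints fields are omitted):

import torch

target = {
    "boxes": torch.tensor([[ 10.,  20., 100., 200.],
                           [150.,  30., 300., 240.]]),    # [x0, y0, x1, y1]
    "labels": torch.tensor([1, 2], dtype=torch.int64),    # 0 is reserved for background
    "image_id": torch.tensor([0]),
    "area": torch.tensor([16200., 31500.]),               # (x1 - x0) * (y1 - y0)
    "iscrowd": torch.zeros((2,), dtype=torch.uint8),
}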
If your dataset returns targets in this format, it can be used for both training and evaluation, and the evaluation scripts from pycocotools can be reused. Install it with:
pip install pycocotools
One note on labels: the model treats class 0 as background. If your dataset does not contain a background class, you should not have 0 in your labels. For example, suppose you have two classes, cat and dog, with 1 denoting cat and 2 denoting dog; if an image contains both classes, its labels tensor should be [1, 2].
Additionally, if you want to use aspect-ratio grouping during training (so that each batch only contains images with similar aspect ratios), it is recommended to also implement a get_height_and_width method that returns the height and width of an image. If this method is not provided, we query all elements of the dataset via __getitem__, which loads each image into memory and is slower than a custom method.
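A minimal sketch of such a method for the PennFudanDataset written below; it assumes the self.root and self.images attributes set up in that class's constructor, and relies on PIL reading the size from the file header so the image is not fully decoded:

import os
from PIL import Image

def get_height_and_width(self, idx):
    # read only the image header to get its size, without decoding the pixels
    img_path = os.path.join(self.root, 'PNGImages', self.images[idx])
    with Image.open(img_path) as img:
        width, height = img.size
    return height, width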
Writing a Custom Dataset for PennFudan
Download and extract the PennFudan dataset; the folder structure is as follows:
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
The annotations are the mask images: each image has a corresponding segmentation mask in which each color (pixel value) represents a different pedestrian instance, and the bounding boxes are derived from these masks.
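To get a feel for the annotations, you can open one image together with its mask; a quick check, assuming the extracted PennFudanPed folder sits in the current working directory:

from PIL import Image

image = Image.open('PennFudanPed/PNGImages/FudanPed00001.png')
mask = Image.open('PennFudanPed/PedMasks/FudanPed00001_mask.png')
# pixel value 0 is background; 1, 2, ... identify the individual pedestrians
print(image.size, mask.mode, sorted(set(mask.getdata())))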
import os
import numpy as np
import torch
from PIL import Image
class PennFudanDataset(torch.utils.data.Dataset):
def __init__(self, root, transforms):
self.root = root
self.transforms = transforms
# load all image files, sorting them to make sure images and masks are aligned
self.images = list(sorted(os.listdir(os.path.join(self.root, 'PNGImages'))))
self.masks = list(sorted(os.listdir(os.path.join(self.root, 'PedMasks'))))
def __getitem__(self, idx):
img_path = os.path.join(self.root, 'PNGImages', self.images[idx])
mask_path = os.path.join(self.root, 'PedMasks', self.masks[idx])
image = Image.open(img_path).convert('RGB')
# note that the mask is not converted to RGB; it stores instance labels, with 0 being background
mask = Image.open(mask_path)
# convert the PIL mask into a numpy array
mask = np.array(mask)
# instances are encoded as different values
obj_ids = np.unique(mask)
# the first id is the background, so remove it
obj_ids = obj_ids[1:]
masks = mask == obj_ids[:, None, None]
# get bounding box coordinates for each mask
num_objs = len(obj_ids)
boxes = []
for i in range(num_objs):
pos = np.where(masks[i])
xmin = np.min(pos[1])
xmax = np.max(pos[1])
ymin = np.min(pos[0])
ymax = np.max(pos[0])
boxes.append([xmin, ymin, xmax, ymax])
# convert everything into torch tensors
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.ones((num_objs,), dtype=torch.int64)
masks = torch.as_tensor(masks, dtype=torch.uint8)
image_id = torch.tensor([idx])
area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
target = {}
target["boxes"] = boxes
target["labels"] = labels
target["masks"] = masks
target["image_id"] = image_id
target["area"] = area
target["iscrowd"] = iscrowd
if self.transforms is not None:
image, target = self.transforms(image, target)
return image, target
def __len__(self):
return len(self.images)
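A quick way to try the class out (a sketch; transforms=None so the raw PIL image is returned, and 'PennFudanPed' is assumed to be the path to the extracted dataset):

dataset = PennFudanDataset('PennFudanPed', transforms=None)
image, target = dataset[0]
# 170 samples; boxes is [N, 4], masks is [N, H, W], labels is all ones (person)
print(len(dataset), target["boxes"].shape, target["masks"].shape, target["labels"])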
Defining the Model
Mask R-CNN adds an extra branch on top of Faster R-CNN that predicts segmentation masks for each instance.
There are two common situations in which you might want to modify a model from the torchvision model zoo: (1) start from a pre-trained model and fine-tune only the last layer, or (2) replace the model's backbone with a different one (for example, for faster inference).
Finetuning a pretrained model
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# coco pre-trained model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
# person + background
num_classes = 2
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
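As a quick sanity check (not part of the original code), the modified model can be run in inference mode on a random image; it should return a prediction dict with boxes, labels and scores:

import torch

model.eval()
with torch.no_grad():
    # in eval mode the model takes a list of image tensors and returns one prediction dict per image
    predictions = model([torch.rand(3, 300, 400)])
print(predictions[0].keys())   # dict_keys(['boxes', 'labels', 'scores'])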
Modifying the model to use a different backbone
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
# load a pre-trained model for classification and return only the features
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
# FasterRCNN needs to know the number of output channels in a backbone.
# For mobilenet_v2, it's 1280, so we need to add it here
backbone.out_channels = 1280
# let's make the RPN generate 5x3 anchors per spatial location, with 5 sizes and
# 3 aspect ratios.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
aspect_ratios=((0.5, 1.0, 2.0),))
# let's define what are the feature maps that we will use to perform the region
# of interest cropping, as well as the size of crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to be [0]. More
# generally, the backbone should return an OrderedDict[Tensor], and in featmap_names
# you can choose which feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
output_size=7,
sampling_ratio=2)
# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
num_classes=2,
rpn_anchor_generator=anchor_generator,
box_roi_pool=roi_pooler)
An Instance Segmentation Model for PennFudan
In our case we want to fine-tune a pre-trained model, and the dataset is very small, so we follow approach 1. Since we also want to compute instance segmentation masks, we use Mask R-CNN.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
def get_model_instance_segmentation(num_classes):
# load an instance segmentation model pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# get the number of input features for the mask classifier
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
# and replace the mask predictor with a new one
model.roi_heads.mask_predictor = MaskRCNNPredictor(
in_features_mask,
hidden_layer,
num_classes
)
return model
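Before wiring up the full training script, it can be worth verifying the forward pass. The sketch below (my own addition, not from the tutorial) feeds two random images with dummy targets through the model in training mode, where Mask R-CNN returns a dict of losses; building the model downloads the COCO weights on first use:

import torch

model = get_model_instance_segmentation(num_classes=2)
model.train()
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
targets = []
for img in images:
    # one dummy object per image, with a box and a matching binary mask
    boxes = torch.tensor([[50.0, 60.0, 200.0, 250.0]])
    labels = torch.ones((1,), dtype=torch.int64)
    masks = torch.zeros((1, img.shape[1], img.shape[2]), dtype=torch.uint8)
    masks[0, 60:250, 50:200] = 1
    targets.append({"boxes": boxes, "labels": labels, "masks": masks})
loss_dict = model(images, targets)   # dict of losses in training mode
print({k: round(v.item(), 3) for k, v in loss_dict.items()})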
Putting Everything Together
Directory structure:
.
├── coco_eval.py
├── coco_utils.py
├── engine.py
├── group_by_aspect_ratio.py
├── presets.py
├── train_maskrcnn_PennFudan.py
├── transforms.py
└── utils.py
The other files can be found under the vision/references/detection directory of the torchvision repository and copied next to the training script.
The vision repository: https://github.com/pytorch/vision.git
train_maskrcnn_PennFudan.py
import os
import numpy as np
import torch
from PIL import Image
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from engine import train_one_epoch, evaluate
import utils
import transforms as T
class PennFudanDataset(torch.utils.data.Dataset):
def __init__(self, root, transforms):
self.root = root
self.transforms = transforms
# load all image files, sorting them to make sure images and masks are aligned
self.images = list(
sorted(os.listdir(os.path.join(self.root, 'PNGImages'))))
self.masks = list(
sorted(os.listdir(os.path.join(self.root, 'PedMasks'))))
print("hello")
print(self.images)
def __getitem__(self, idx):
img_path = os.path.join(self.root, 'PNGImages', self.images[idx])
mask_path = os.path.join(self.root, 'PedMasks', self.masks[idx])
image = Image.open(img_path).convert('RGB')
# note that the mask is not converted to RGB; it stores instance labels, with 0 being background
mask = Image.open(mask_path)
# convert the PIL mask into a numpy array
mask = np.array(mask)
# instances are encoded as different values
obj_ids = np.unique(mask)
# the first id is the background, so remove it
obj_ids = obj_ids[1:]
masks = mask == obj_ids[:, None, None]
# get bounding box coordinates for each mask
num_objs = len(obj_ids)
boxes = []
for i in range(num_objs):
pos = np.where(masks[i])
xmin = np.min(pos[1])
xmax = np.max(pos[1])
ymin = np.min(pos[0])
ymax = np.max(pos[0])
boxes.append([xmin, ymin, xmax, ymax])
# convert everything into torch tensors
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.ones((num_objs,), dtype=torch.int64)
masks = torch.as_tensor(masks, dtype=torch.uint8)
image_id = torch.tensor([idx])
area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
target = {}
target["boxes"] = boxes
target["labels"] = labels
target["masks"] = masks
target["image_id"] = image_id
target["area"] = area
target["iscrowd"] = iscrowd
if self.transforms is not None:
image, target = self.transforms(image, target)
return image, target
def __len__(self):
return len(self.images)
def get_model_instance_segmentation(num_classes):
# load an instance segmentation model pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# get the number of input features for the mask classifier
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
# and replace the mask predictor with a new one
model.roi_heads.mask_predictor = MaskRCNNPredictor(
in_features_mask,
hidden_layer,
num_classes
)
return model
def get_transform(train):
transforms = []
transforms.append(T.PILToTensor())
transforms.append(T.ConvertImageDtype(torch.float))
if train:
transforms.append(T.RandomHorizontalFlip(0.5))
return T.Compose(transforms)
def main():
# train on the GPU, or on the CPU if no GPU is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# our dataset has two classes only - background and person
num_classes = 2
# path to the dataset
data_path = "../../../../datasets/PennFudanPed"
dataset = PennFudanDataset(root=data_path, transforms=get_transform(train=True))
dataset_test = PennFudanDataset(root=data_path, transforms=get_transform(train=False))
# split dataset to train and test
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
# define training and validation data loaders
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True,
num_workers=4, collate_fn=utils.collate_fn)
dataloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=1,
shuffle=False, num_workers=4, collate_fn=utils.collate_fn)
# get the model
model = get_model_instance_segmentation(num_classes)
# move the model to device
model.to(device)
# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
momentum=0.9, weight_decay=0.0005)
# learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(
optimizer=optimizer,
step_size=3,
gamma=0.1
)
# train 10 epochs
num_epochs = 10
for epoch in range(num_epochs):
# train one epoch
train_one_epoch(model=model,
optimizer=optimizer,
data_loader=dataloader,
device=device,
epoch=epoch,
print_freq=10)
# update the learning rate
lr_scheduler.step()
# evaluate on the test dataset
evaluate(model=model,
data_loader=dataloader_test,
device=device)
print("It's OK")
#
if __name__ == '__main__':
main()
...
Test: [ 0/50] eta: 0:00:10 model_time: 0.0790 (0.0790) evaluator_time: 0.0034 (0.0034) time: 0.2154 data: 0.1323 max mem: 3872
Test: [49/50] eta: 0:00:00 model_time: 0.0798 (0.0809) evaluator_time: 0.0032 (0.0046) time: 0.0904 data: 0.0036 max mem: 3872
Test: Total time: 0:00:04 (0.0934 s / it)
Averaged stats: model_time: 0.0798 (0.0809) evaluator_time: 0.0032 (0.0046)
Accumulating evaluation results...
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.01s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.817
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.985
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.945
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.496
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.641
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.834
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.371
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.854
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.854
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.800
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.865
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.753
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.975
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.906
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.434
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.386
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.774
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.341
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.790
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.790
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.713
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.801
It's OK
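Once training is done, the model can be used for inference. A minimal sketch (for example appended at the end of main()), reusing the model, dataset_test and device defined in the script:

model.eval()
image, _ = dataset_test[0]              # already a float tensor in [0, 1]
with torch.no_grad():
    prediction = model([image.to(device)])[0]
# keep only confident detections
keep = prediction["scores"] > 0.7
print(prediction["boxes"][keep])
print(prediction["labels"][keep])
# masks are returned as [N, 1, H, W] probabilities; threshold them to get binary masks
binary_masks = (prediction["masks"][keep] > 0.5).squeeze(1)
print(binary_masks.shape)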
If the RandomHorizontalFlip transform in transforms.py raises an error on get_dimensions, replace
_, _, width = F.get_dimensions(image)
with
width, _ = F.get_image_size(image)
Reference
TorchVision Object Detection Finetuning Tutorial