GitHub - lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

An implementation of the Vision Transformer, which achieves SOTA in vision classification with only a single transformer encoder.

There is not much code involved, so it is easy to build experiments on top of it and help speed up the attention revolution. (It feels a bit like an integrated toolkit?)

For experiments based on pretrained models, see here!

1.Install vit-pytorch


pip install vit-pytorch


2.Usage


import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,     
    patch_size = 32,      
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)


3.Parameters

  • image_size: int. If the image is rectangular, make sure this is the maximum of the width and height.
  • patch_size: int. Size of each patch; the number of patches is n = (image_size // patch_size) ** 2, and the number of patches must be greater than 16 (see the small check after this list).
  • num_classes: int. Number of classes to classify. (note to self: pay attention to this parameter)
  • dim: int. Last dimension of the output tensor after the linear transformation nn.Linear(..., dim).
  • depth: int. Number of Transformer blocks. (Q: what exactly is a Transformer block?)
  • heads: int. Number of heads in the multi-head attention layers.
  • mlp_dim: int. Dimension of the MLP (feed-forward) layer.
  • channels: int, default 3 (RGB). Number of image channels.
  • dropout: float in [0, 1], default 0. Dropout rate.
  • emb_dropout: float in [0, 1], default 0. Embedding dropout rate.
  • pool: string, either cls token pooling or mean pooling.
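
A quick check of the patch-count constraint above (a small illustrative snippet, not part of the library):


image_size, patch_size = 256, 32

assert image_size % patch_size == 0, 'image dimensions must be divisible by the patch size'

num_patches = (image_size // patch_size) ** 2   # 8 * 8 = 64 patches here
assert num_patches > 16, 'the number of patches must be greater than 16'
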

4.Simple ViT

Simple ViT uses 2-D sinusoidal (sin/cos) position embeddings, global average pooling (no cls token), no dropout, a batch size of 1024 instead of 4096, and RandAugment and MixUp augmentations. They also show that a simple linear layer at the end gives results not noticeably different from the original MLP head.

Paper


import torch
from vit_pytorch import SimpleViT

v = SimpleViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
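

As a rough illustration of the 2-D sin/cos position embedding mentioned above (a hand-written sketch, not necessarily the repo's exact function):


import torch

def posemb_sincos_2d(h, w, dim, temperature = 10000):
    # fixed (non-learned) 2-D sin/cos position embedding; dim is assumed divisible by 4
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing = 'ij')
    omega = torch.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / (temperature ** omega)
    y = y.flatten()[:, None] * omega[None, :]
    x = x.flatten()[:, None] * omega[None, :]
    return torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim = 1)

pe = posemb_sincos_2d(8, 8, 1024)   # (64, 1024), one embedding per patch of an 8 x 8 grid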


5.Distillation

Using a distillation token to distill knowledge from a convolutional network into a vision transformer can yield small and efficient vision transformers. This repository provides an easy way to do distillation.

e.g. distilling from Resnet50 (or any teacher) to a vision transformer


import torch
from torchvision.models import resnet50

from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained = True)

v = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

distiller = DistillWrapper(
    student = v,
    teacher = teacher,
    temperature = 3,           # temperature of distillation
    alpha = 0.5,               # trade between main loss and distillation loss
    hard = False               # whether to use soft or hard distillation
)

img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))

loss = distiller(img, labels)
loss.backward()

# after lots of training above ...

pred = v(img) # (2, 1000)


Apart from how the forward pass is handled, the DistillableViT class is identical to ViT, so you should be able to load the parameters back into a ViT after distillation training is complete.
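
For example, a minimal sketch (assuming the same constructor arguments as in the example above, with v being the trained DistillableViT):


from vit_pytorch import ViT

vit = ViT(
    image_size = 256, patch_size = 32, num_classes = 1000,
    dim = 1024, depth = 6, heads = 8, mlp_dim = 2048,
    dropout = 0.1, emb_dropout = 0.1
)

vit.load_state_dict(v.state_dict())   # copy the distilled weights into a plain ViT
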

You can also use the convenient .to_vit method on a DistillableViT instance to get back a ViT instance.


v = v.to_vit()
type(v) # <class 'vit_pytorch.vit_pytorch.ViT'>


6.DeepViT

This paper studies the difficulty of increasing the number of ViT layers, i.e. the network depth (beyond 12 layers), and proposes mixing the post-softmax attention of each head as a solution, called re-attention. The results are in line with the Talking Heads paper from NLP.


import torch
from vit_pytorch.deepvit import DeepViT

v = DeepViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
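

As a rough illustration of the re-attention idea described above (a hand-written sketch, not the repo's exact implementation): the per-head attention maps are mixed with a learned head-to-head matrix after the softmax, before being applied to the values.


import torch
from torch import nn

heads, seq_len, dim_head = 16, 64, 64

attn = torch.randn(1, heads, seq_len, seq_len).softmax(dim = -1)   # post-softmax attention maps
theta = nn.Parameter(torch.randn(heads, heads))                    # learned head-mixing matrix

reattn = torch.einsum('b h i j, h g -> b g i j', attn, theta)      # mix attention across heads
values = torch.randn(1, heads, seq_len, dim_head)
out = torch.einsum('b h i j, b h j d -> b h i d', reattn, values)  # (1, 16, 64, 64)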


7.CaiT

This paper points out the difficulty of training vision transformers at greater depths and proposes two solutions. First, it proposes multiplying the output of each residual block channel-wise. Second, it proposes letting the patches attend to one another, and only allowing the CLS token to attend to the patches in the last few layers. They also add Talking Heads and note improvements.


import torch
from vit_pytorch.cait import CaiT

v = CaiT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 12,             # depth of transformer for patch to patch attention only
    cls_depth = 2,          # depth of cross attention of CLS tokens to patch
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    layer_dropout = 0.05    # randomly dropout 5% of the layers
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
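

As a rough sketch of the first idea above, the per-channel multiplication of each residual block's output (LayerScale in the paper); this is an illustration only, not the repo's exact code:


import torch
from torch import nn

dim = 1024
gamma = nn.Parameter(torch.full((dim,), 1e-4))   # learned per-channel scale, initialised small

def residual(x, fn):
    return x + gamma * fn(x)                     # channel-wise multiplication of the block output

x = torch.randn(1, 64, dim)
out = residual(x, nn.LayerNorm(dim))             # any sub-layer (attention / MLP) could stand in here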


8.Token-to-Token ViT

This paper proposes that the first couple of layers downsample the image sequence by unfolding, so that each token contains overlapping image data.


import torch
from vit_pytorch.t2t import T2TViT

v = T2TViT(
    dim = 512,
    image_size = 224,
    depth = 5,
    heads = 8,
    mlp_dim = 512,
    num_classes = 1000,
    t2t_layers = ((7, 4), (3, 2), (3, 2)) # tuples of the kernel size and stride of each consecutive layers of the initial token to token module
)

img = torch.randn(1, 3, 224, 224)

preds = v(img) # (1, 1000)
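

As a rough illustration of the unfolding step described above (a hand-written sketch; the padding value is my own choice):


import torch

img = torch.randn(1, 3, 224, 224)

# nn.Unfold extracts overlapping windows (kernel 7, stride 4, matching the first t2t layer above),
# so neighbouring tokens share image data and the sequence length shrinks at every stage
unfold = torch.nn.Unfold(kernel_size = 7, stride = 4, padding = 3)
tokens = unfold(img).transpose(1, 2)   # (1, 3136, 147): a 56 x 56 grid of tokens, each 3 * 7 * 7 values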


9.CCT

CCT proposes compact transformers that use convolutions instead of patching and perform sequence pooling. This allows CCT to reach high accuracy with a low number of parameters.


import torch
from vit_pytorch.cct import CCT

cct = CCT(
    img_size = (224, 448),
    embedding_dim = 384,
    n_conv_layers = 2,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_layers = 14,
    num_heads = 6,
    mlp_ratio = 3.,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)

img = torch.randn(1, 3, 224, 448)
pred = cct(img) # (1, 1000)
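

As a rough illustration of the sequence pooling mentioned above (a hand-written sketch, not the repo's exact code): instead of a CLS token, a score is learned per token and the sequence is collapsed into a weighted average before the classifier.


import torch
from torch import nn

embedding_dim, num_classes = 384, 1000
tokens = torch.randn(1, 196, embedding_dim)        # output of the transformer encoder

attn_pool = nn.Linear(embedding_dim, 1)            # one learned score per token
head = nn.Linear(embedding_dim, num_classes)

weights = attn_pool(tokens).softmax(dim = 1)       # (1, 196, 1)
pooled = (weights * tokens).sum(dim = 1)           # (1, 384)
logits = head(pooled)                              # (1, 1000)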


Alternatively, you can use one of several predefined models [2, 4, 6, 7, 8, 14, 16], which fix the number of layers, the number of attention heads, the MLP ratio, and the embedding dimension.


import torch
from vit_pytorch.cct import cct_14

cct = cct_14(
    img_size = 224,
    n_conv_layers = 1,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)


10.Cross ViT

This paper proposes processing the image with two vision transformers at different scales, with the two branches cross-attending to each other every so often. They show improvements over the base vision transformer.


import torch
from vit_pytorch.cross_vit import CrossViT

v = CrossViT(
    image_size = 256,
    num_classes = 1000,
    depth = 4,               # number of multi-scale encoding blocks
    sm_dim = 192,            # high res dimension
    sm_patch_size = 16,      # high res patch size (should be smaller than lg_patch_size)
    sm_enc_depth = 2,        # high res depth
    sm_enc_heads = 8,        # high res heads
    sm_enc_mlp_dim = 2048,   # high res feedforward dimension
    lg_dim = 384,            # low res dimension
    lg_patch_size = 64,      # low res patch size
    lg_enc_depth = 3,        # low res depth
    lg_enc_heads = 8,        # low res heads
    lg_enc_mlp_dim = 2048,   # low res feedforward dimensions
    cross_attn_depth = 2,    # cross attention rounds
    cross_attn_heads = 8,    # cross attention heads
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

pred = v(img) # (1, 1000)
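

As a rough illustration of the cross-attention between the two branches (a hand-written sketch using torch's built-in attention, not the repo's implementation): the CLS token of one branch is projected into the other branch's dimension and attends over that branch's patch tokens.


import torch
from torch import nn

sm_dim, lg_dim, heads = 192, 384, 8

to_lg = nn.Linear(sm_dim, lg_dim)                         # project the small-branch CLS token
cross_attn = nn.MultiheadAttention(lg_dim, heads, batch_first = True)
back_to_sm = nn.Linear(lg_dim, sm_dim)

sm_cls = torch.randn(1, 1, sm_dim)                        # CLS token from the small-patch branch
lg_patches = torch.randn(1, 16, lg_dim)                   # patch tokens from the large-patch branch

q = to_lg(sm_cls)
fused, _ = cross_attn(q, lg_patches, lg_patches)          # CLS queries the other branch's patches
sm_cls = sm_cls + back_to_sm(fused)                       # residual back into the small branch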


11.PiT

This paper proposes downsampling the tokens through a pooling procedure that uses depth-wise convolutions.


import torch
from vit_pytorch.pit import PiT

v = PiT(
    image_size = 224,
    patch_size = 14,
    dim = 256,
    num_classes = 1000,
    depth = (3, 3, 3),     # list of depths, indicating the number of rounds of each stage before a downsample
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 224, 224)

preds = v(img) # (1, 1000)
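

As a rough illustration of the pooling described above (a hand-written sketch; the actual model also changes the channel dimension at each stage): tokens are reshaped back onto a 2-D grid and downsampled with a depth-wise convolution.


import torch
from torch import nn

dim, h, w = 256, 16, 16
tokens = torch.randn(1, h * w, dim)

pool = nn.Conv2d(dim, dim, kernel_size = 3, stride = 2, padding = 1, groups = dim)   # depth-wise conv

grid = tokens.transpose(1, 2).reshape(1, dim, h, w)   # (1, 256, 16, 16)
pooled = pool(grid)                                   # (1, 256, 8, 8)
tokens = pooled.flatten(2).transpose(1, 2)            # (1, 64, 256): a quarter as many tokens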


To be continued...