2010.11929

paper

Motivation: apply the Transformer architecture directly to computer vision and show that reliance on CNNs is not necessary.

2010.11929v2, p.1

Transformers lack some of the inductive biases inherent to CNNs

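When trained on mid-sized datasets without strong regularization, Transformers perform somewhat worse than conventional CNNs of comparable size (e.g., ResNet), because CNNs have translation equivariance and locality built in.

However, when the Transformer is trained on much larger datasets, it performs better: large-scale training outweighs the missing inductive bias.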

pipeline:

2010.11929v2, p.3

To handle 2D images, we reshape the image $x \in R^{H \times W \times C}$ into a sequence of flattened 2D patches $x_{p} \in R^{N \times (P^{2} \cdot C)}$ , where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = H W / P^{2}$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.

The image is simply split into patches. Each patch is then flattened and mapped by a linear projection to the embedding dimension $D$:
$z_{0} = [x_{\text{class}};\, x_{p}^{1} E;\, x_{p}^{2} E;\, \dots;\, x_{p}^{N} E] + E_{pos}$
where $E \in R^{(P^{2} \cdot C) \times D}$ and $E_{pos} \in R^{(N + 1) \times D}$
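The splitting and projection steps can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's code: `patchify` and `PatchEmbedding` are hypothetical names, the sizes (224, 16, 768) are example values, and `nn.Linear` carries a bias term that the formula for $E$ does not show.

```python
import torch
import torch.nn as nn


def patchify(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Reshape (B, C, H, W) images into (B, N, P*P*C) flattened patches."""
    b, c, h, w = x.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    # (B, C, H/P, P, W/P, P): split each spatial axis into (grid, within-patch)
    x = x.reshape(b, c, h // p, p, w // p, p)
    # (B, H/P, W/P, P, P, C): grid axes first, channels last
    x = x.permute(0, 2, 4, 3, 5, 1)
    # (B, N, P*P*C) with N = H*W / P^2
    return x.reshape(b, (h // p) * (w // p), p * p * c)


class PatchEmbedding(nn.Module):
    """Flatten patches and project them to the model dimension D."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # Plays the role of E; the bias is an addition of nn.Linear
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(patchify(x, self.patch_size))  # (B, N, D)


images = torch.randn(2, 3, 224, 224)
patches = patchify(images, 16)     # N = 224 * 224 / 16^2 = 196 patches
tokens = PatchEmbedding()(images)  # one D-dimensional token per patch
```

Many implementations get the same result with a single `nn.Conv2d` whose kernel size and stride both equal the patch size; the explicit reshape is used here only because it mirrors the formula.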

2010.11929v2, p.3

a learnable embedding

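A new learnable vector (a marker token) is introduced with `nn.Parameter` (an `nn.Embedding` with a single entry would also work) and prepended to the sequence of image patch embeddings.

Under self-attention every token interacts with every other token, so the main job of this token is to aggregate information from all the other tokens. For the final classification, only this token's embedding vector is extracted and passed through the MLP classification head to obtain the class probabilities.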
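A minimal sketch of the class token, assuming it is stored as an `nn.Parameter`. `ClassTokenHead` is a hypothetical name, the dimensions are example values, the encoder is omitted, and the head is reduced to a single linear layer for brevity.

```python
import torch
import torch.nn as nn


class ClassTokenHead(nn.Module):
    """Prepend a learnable [class] token and classify from its final state."""

    def __init__(self, dim: int = 64, num_classes: int = 10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # one shared vector
        self.head = nn.Linear(dim, num_classes)  # simplified classification head

    def prepend(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        batch = patch_tokens.shape[0]
        cls = self.cls_token.expand(batch, -1, -1)  # same token for every image
        return torch.cat([cls, patch_tokens], dim=1)  # (B, N + 1, D)

    def classify(self, encoded: torch.Tensor) -> torch.Tensor:
        # Only position 0 (the class token) is read out
        return self.head(encoded[:, 0])  # (B, num_classes)


model = ClassTokenHead()
patch_tokens = torch.randn(2, 196, 64)
z0 = model.prepend(patch_tokens)
logits = model.classify(z0)  # in a real model, z0 would pass through the encoder first
probs = logits.softmax(dim=-1)
```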

2010.11929v2, p.3

Position embeddings

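An `nn.Embedding` serves as the positional embedding. It is learned, and each entry represents the position of one patch in the original image. Because the table has a fixed number of entries, it only fits one image size and one patch size and cannot be used directly at a different resolution.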
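The fixed-size limitation can be made concrete with a small sketch. `LearnedPositionEmbedding` is a hypothetical module, the table is stored as an `nn.Parameter` of shape $(1, N + 1, D)$ rather than an `nn.Embedding`, and the 0.02 initialization scale is an assumption.

```python
import torch
import torch.nn as nn


class LearnedPositionEmbedding(nn.Module):
    """One learned vector per position (class token + N patches)."""

    def __init__(self, image_size: int = 224, patch_size: int = 16, dim: int = 64):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Initialization scale is an illustrative choice
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        if tokens.shape[1] != self.pos_embed.shape[1]:
            raise ValueError(
                f"expected {self.pos_embed.shape[1]} tokens, got {tokens.shape[1]}"
            )
        return tokens + self.pos_embed  # broadcast over the batch


pos = LearnedPositionEmbedding()       # built for 224x224 images, 196 + 1 positions
out = pos(torch.zeros(2, 197, 64))     # works: sequence length matches

try:
    pos(torch.zeros(2, 577, 64))       # 384x384 input gives 24 * 24 + 1 = 577 tokens
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```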

Evidence that this embedding really is a positional embedding:

2010.11929v2, p.3

Transformer encoder

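The Transformer encoder is the core module that extracts information from the image. It consists of alternating multi-head self-attention (MSA) and MLP blocks. [[PyTorch Normalization#Core Internal Normalizations#Layer Normalization|layernorm]] is applied before every block, and a residual connection is added after every block.
$z_{l}' = \text{MSA}(\text{LN}(z_{l - 1})) + z_{l - 1}$
$z_{l} = \text{MLP}(\text{LN}(z_{l}')) + z_{l}'$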
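The two update equations map almost line by line onto a pre-norm block. The sketch below assumes PyTorch's built-in `nn.MultiheadAttention` for MSA and a two-layer GELU MLP. `EncoderBlock`, `mlp_ratio`, and the sizes are illustrative choices.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Pre-norm Transformer block: LN -> MSA -> residual, LN -> MLP -> residual."""

    def __init__(self, dim: int = 64, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        z = z + attn_out                 # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))  # z  = MLP(LN(z')) + z'
        return z


block = EncoderBlock()
z_in = torch.randn(2, 197, 64)
z_out = block(z_in)  # shape is preserved, so blocks can be stacked L times
```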

2010.11929v2, p.4

Inductive bias.

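ViT has much less image-specific inductive bias than a CNN: only the MLP layers are local and translation-equivariant, while the self-attention layers are global, and the positional embeddings carry no 2D position information at initialization. As a result, ViT does better on large datasets and is more general, while CNNs do better on small datasets.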

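Translation equivariance

CNNs are strongly translation-equivariant: each kernel looks only at local information and is applied identically at every location, with no notion of global position. Shifting an object in the image shifts its response in the feature map by the same amount, so the object is detected wherever it appears, but the detection itself says nothing about absolute spatial position. Other components have to be added to obtain spatial position information.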
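A small numerical check of this property. It is a sketch under one simplifying assumption: circular padding is used so that the equality holds exactly at the image borders (with zero padding it only holds away from the edges). Global average pooling is added to show how position is then discarded altogether.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding makes the convolution exactly equivariant to cyclic shifts
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)
shift = (2, 3)  # shift by 2 rows and 3 columns

with torch.no_grad():
    x_shifted = torch.roll(x, shifts=shift, dims=(2, 3))

    shift_then_conv = conv(x_shifted)
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=(2, 3))

    # Equivariance: shifting the input shifts the feature map by the same amount
    equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-5)

    # After global average pooling, the result no longer depends on position
    pooled = conv(x).mean(dim=(2, 3))
    pooled_shifted = conv(x_shifted).mean(dim=(2, 3))
    invariant_after_pooling = torch.allclose(pooled, pooled_shifted, atol=1e-5)
```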

2010.11929v2, p.4

Hybrid Architecture.

ViT can also take the feature maps extracted by a CNN as its input sequence; the input does not have to be the raw image.
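A sketch of the hybrid input path, assuming each spatial location of the CNN feature map becomes one token (patches of spatial size 1x1). The two-layer `backbone` is a toy stand-in, not the ResNet used in the paper, and `HybridEmbedding` is a hypothetical name.

```python
import torch
import torch.nn as nn


class HybridEmbedding(nn.Module):
    """Turn a CNN feature map into a token sequence for the Transformer."""

    def __init__(self, backbone: nn.Module, feature_channels: int, dim: int = 64):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(feature_channels, dim)  # projection applied per location

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)              # (B, C', H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H' * W', C')
        return self.proj(tokens)                   # (B, H' * W', D)


# Toy stand-in backbone: two stride-2 convolutions, 64x64 -> 16x16
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 48, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

embed = HybridEmbedding(backbone, feature_channels=48)
hybrid_tokens = embed(torch.randn(2, 3, 64, 64))  # 16 * 16 = 256 tokens per image
```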

2010.11929v2, p.4

FINE-TUNING AND HIGHER RESOLUTION

For fine-tuning, the model is first pre-trained on a large dataset; the pre-trained prediction head is then removed and replaced with a new one (for example `nn.Linear(hidden_dim, num_classes)`).
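Swapping the head might look like the sketch below. `TinyViTClassifier` is a hypothetical stand-in for a pre-trained model whose head lives in a `head` attribute, and `replace_head` is an illustrative helper. The new layer is zero-initialized, as the paper describes for its fine-tuning head.

```python
import torch
import torch.nn as nn


class TinyViTClassifier(nn.Module):
    """Hypothetical stand-in for a pre-trained model: a body plus a head."""

    def __init__(self, hidden_dim: int = 64, num_classes: int = 1000):
        super().__init__()
        self.body = nn.Sequential(nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, hidden_dim))
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, cls_state: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(cls_state))


def replace_head(model: nn.Module, num_classes: int) -> nn.Module:
    """Drop the pre-trained head and attach a fresh zero-initialized one."""
    hidden_dim = model.head.in_features
    new_head = nn.Linear(hidden_dim, num_classes)
    nn.init.zeros_(new_head.weight)
    nn.init.zeros_(new_head.bias)
    model.head = new_head
    return model


pretrained = TinyViTClassifier(num_classes=1000)  # pretend this was pre-trained
finetune_model = replace_head(pretrained, num_classes=10)
finetune_logits = finetune_model(torch.randn(2, 64))  # all zeros before any training
```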

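For higher-resolution images, the patch size is kept the same, so the sequence simply gets longer, but the pre-trained positional embeddings then no longer make sense. The workaround is to interpolate them in 2D (a rather crude trick) to obtain embeddings for the additional positions.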
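A sketch of the interpolation step, assuming a square patch grid, a class token at index 0 that is left untouched, and bilinear interpolation (the interpolation mode is an assumption). `resize_pos_embed` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize (1, 1 + old_grid**2, D) position embeddings to (1, 1 + new_grid**2, D)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    # Lay the patch embeddings back out on their 2D grid: (1, D, g, g)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(
        patch_pos, size=(new_grid, new_grid), mode="bilinear", align_corners=False
    )
    # Back to a sequence: (1, new_grid**2, D)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)


# Pre-trained at 224x224 with 16x16 patches (14x14 grid), fine-tuned at 384x384 (24x24 grid)
old_pos = torch.randn(1, 1 + 14 * 14, 64)
new_pos = resize_pos_embed(old_pos, old_grid=14, new_grid=24)
same_pos = resize_pos_embed(old_pos, old_grid=14, new_grid=14)  # no-op sanity check
```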

One of the most important follow-up improvements concerns the positional embedding:

Relative position encoding (RPE)
Rotary position embedding (RoPE)
