Github Repo

In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A).

Motivation:

  • Efficiently align vision-language representations with actions
  • Reduce VLA models' reliance on large VLMs and large-scale pretraining

Trainable parameters:

Pipeline:

ActionQuery (AQ)

This is an Embedding, a set of learnable parameters:

self.action_queries = nn.Embedding(NUM_TOKENS, self.llm_dim)

Inside the VLM, the hidden state output by each Transformer layer goes through one cross attention with the Action Query, which yields action query features one layer deeper.
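To make the per-layer interaction concrete, here is a minimal sketch of cross attention between action queries and one layer's hidden states. It is an illustration under simplifying assumptions, not the repository's code: it uses a single head, no learned projections, plain Python lists, and the function name `cross_attention` is hypothetical.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: list of query vectors (here: action queries)
    keys, values: lists of vectors (here: one layer's hidden states)
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: one action query attends over two identical hidden-state tokens.
action_queries = [[1.0, 1.0]]
layer_hidden = [[1.0, 0.0], [1.0, 0.0]]
updated = cross_attention(action_queries, layer_hidden, layer_hidden)
```

Repeating this once per Transformer layer, each time feeding in that layer's hidden states, gives one set of action query features per layer.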

The backbones select the Prismatic VLM trained on Qwen2.5-0.5B

The VLM backbone is Qwen2.5-0.5B, the text-only version, which is then trained with the Prismatic VLM method.

SigLipDINOv2 is used as the ViT to extract image features and is trained jointly with Qwen.

Key Finding 1. Regarding $C_t^{R}$, the middle-layer latent performs better than the deep-layer latent. Deep-layer $C_t^{R}$ is biased towards semantic information and less effective in action generation. The middle-layer $C_t^{R}$ effectively integrates image and text information, retains richer multimodal details, and facilitates action generation.

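Result: the middle-layer hidden states (embeds) contain richer features, while the deep-layer hidden states carry more semantic information but fewer features. The middle layers are therefore better suited to action generation.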

Key Finding 2. Regarding $C_t^{AQ}$, deep-layer latent performs better than other-layer latent. Since ActionQuery is trained from scratch, and deep-layer $C_t^{AQ}$ aggregates richer multimodal details and is more effectively promoting action generation than the shallow layers.

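Result: the deep-layer Action Query hidden states are more effective. During training, the Action Query interacts layer by layer with the transformer hidden states (embeds), so the deeper the layer, the richer the representation the action query has learned.

However, this conflicts somewhat with the conclusion above: deeper layers should carry more semantic information but fewer features, so why?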

Key Finding 3. Multi-layer features perform better. We observed that using all-layer features generally outperforms a single layer. Not only does it improve performance, but it also saves time on best layer selection during design. This design can be more universal.

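Conclusion: features from multiple layers together work better.

This is quite logical: the hidden states of multiple layers carry richer information, ranging from feature-rich to semantics-rich, so more information is fused.

Using multi-layer features also removes the pain of layer selection: there is no need to test layer by layer to find which one works best. There are fewer hyperparameters.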


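The inputs referred to here are: the language+image hidden state (after attention), the action query hidden state (after attention), the initial action, and the proprioceptive state.

In practice, however, the proprioceptive state is never used in the code, and the initial action does not even exist in the code; there are only the output embeddings (hidden state) left by the previous layer. The two are also not combined through cross attention; they are concatenated and passed through self attention.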

Bridge Attention.

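Bridge Attention: generate the Action using the VLM hidden state and the Action Query hidden state as conditions. In the code, however, Bridge Attention is not implemented as described, with Cross Attention plus attention over the Action. The code simply concatenates the image features, language features, and action query hidden state, runs them once through the self attention of self.language_model, and then obtains the normalized action through action_head (two variants: continuous, via L1Regressive, and discrete, computed from the VLM logits).

The overall data flow is: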
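The concatenation step can be sketched as follows. This is a toy illustration, not the repository's code: the helper names `build_sequence` and `select_action_states` are hypothetical, the order image, language, action query is an assumption, and an identity function stands in for the language model.

```python
def build_sequence(image_feats, text_feats, action_queries):
    """Concatenate the three token groups along the sequence axis.

    Returns the joint sequence and the slice covering the action query tokens.
    The ordering (image, language, action query) is an assumption.
    """
    sequence = image_feats + text_feats + action_queries
    start = len(image_feats) + len(text_feats)
    return sequence, slice(start, start + len(action_queries))


def select_action_states(layer_output, aq_slice):
    """Pick the hidden states at the action query positions."""
    return layer_output[aq_slice]


# Toy tokens: every token is a 2-dimensional vector.
image_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
text_feats = [[1.0, 1.0], [2.0, 2.0]]
action_queries = [[9.0, 9.0], [8.0, 8.0]]

sequence, aq_slice = build_sequence(image_feats, text_feats, action_queries)

# Stand-in for the language model's self attention: identity mapping.
layer_output = list(sequence)

# These states are what an action head would consume.
action_states = select_action_states(layer_output, aq_slice)
```

Because all tokens share one sequence, self attention lets the action query positions read from the image and language tokens without a separate cross attention module.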

graph TD
d[ViT]
e[tokenizer+embeddings]
f[language model]
g((concat))
h[[if action continuous]]
i[L1 Regressive]
j[get logits]
subgraph Inputs:
    a(image)
	b(language instruction)
	c(action query: embedding vectors)
end
a-->d
b-->e
c-->g
d-->g
e-->g
g-->f
f-->h
h-->|yes|i
h-->|no|j
subgraph Outputs:
	k(normalized actions)
end
i-->k
j-->k

Experimental results: B1: Qwen2.5-0.5B (listed as Qwen2.5VL, but actually the text-only version with SigLip added), B2: LLaMA2-7B, B3: OpenVLA-7B

When the backbone VLM is not trained:

Inference speed:

On the Libero dataset:

On CALVIN (generalization ability):

Ablation experiments:

Action Query dim = 64:

Injection into all layers (?):

To prevent overflow: raw hidden state * tanh(g), action query * 1:
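A minimal sketch of this scaling, assuming `g` is a single learnable scalar (the notes do not say how it is initialized or how the two scaled branches are combined afterwards, so the function only returns both). The name `scale_branches` is hypothetical.

```python
import math


def scale_branches(raw_state, action_query_state, g):
    """Scale the raw hidden state by tanh(g) and the action query state by 1.

    tanh(g) always lies in [-1, 1], so the raw branch can never be amplified
    beyond its original magnitude, whatever value g takes during training.
    """
    gate = math.tanh(g)
    scaled_raw = [gate * x for x in raw_state]
    scaled_aq = [1.0 * x for x in action_query_state]
    return scaled_raw, scaled_aq


# With g = 0 the raw branch is switched off entirely.
raw_off, aq_off = scale_branches([2.0, -4.0], [1.0, 3.0], g=0.0)

# With a very large g the gate saturates at 1 instead of growing without bound.
raw_sat, aq_sat = scale_branches([2.0, -4.0], [1.0, 3.0], g=100.0)
```

The action query branch passes through unchanged, so only the raw hidden state contribution is gated.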