paper
Parameters required for training:
Pipeline:
ActionQuery AQ
This is an Embedding, i.e. a set of learnable parameters:
self.action_queries = nn.Embedding(NUM_TOKENS, self.llm_dim)
Inside the VLM, the hidden state output by each Transformer layer goes through one cross attention with the Action Query, producing the action query features for the next, deeper layer:
The backbones select the Prismatic VLM trained on Qwen2.5-0.5B
The VLM backbone is Qwen2.5-0.5B, the text-only version, which is then trained with the Prismatic VLM method.
Key Finding 1. Regarding $C^{R}_t$, the middle-layer latent performs better than the deep-layer latent. Deep-layer $C^{R}_t$ is biased towards semantic information and less effective in action generation. The middle-layer $C^{R}_t$ effectively integrates image and text information, retains richer multimodal details, and facilitates action generation.
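Result: the middle-layer hidden states (embeds) contain richer features, while the deep-layer hidden states carry more semantic information but fewer features. The middle layers are therefore better suited to action generation.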
Key Finding 2. Regarding $C^{AQ}_t$, deep-layer latent performs better than other-layer latent. Since ActionQuery is trained from scratch, and deep-layer $C^{AQ}_t$ aggregates richer multimodal details and is more effectively promoting action generation than the shallow layers.
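Result: the deep-layer Action Query hidden states work better. During training, the Action Query interacts layer by layer with the transformer hidden states (embeds), so the deeper the action query, the richer the representation it has learned.
This somewhat conflicts with the finding above, though: deeper layers should carry more semantic information but fewer features, so why?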
Key Finding 3. Multi-layer features perform better. We observed that using all-layer features generally outperforms a single layer. Not only does it improve performance, but it also saves time on best layer selection during design. This design can be more universal.
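Conclusion: features from multiple layers used together work better.
This is quite logical: hidden states from multiple layers carry richer information, ranging from feature-heavy to semantics-heavy, so more information gets fused.
Using multi-layer features also removes the pain of layer selection: there is no need to test layer by layer to find which one works best, and there are fewer hyperparameters.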
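The contrast between all-layer and single-layer features can be sketched as follows. This is a minimal illustration, not the repository's code: plain nested lists stand in for tensors so it runs without PyTorch, the sizes are made up, and it assumes the action-query tokens sit at the end of the sequence. `split_all_layers` and `pick_single_layer` are hypothetical helper names.

```python
# Plain nested lists stand in for tensors: hidden_states[layer][token] is one vector.
# All sizes are illustrative values, not the paper's.
NUM_LAYERS, NUM_CONTEXT, NUM_QUERIES, DIM = 4, 5, 2, 3


def fake_hidden_states():
    """Build per-layer hidden states laid out as [context tokens | action-query tokens]."""
    seq_len = NUM_CONTEXT + NUM_QUERIES
    return [
        [[float(layer * 100 + tok)] * DIM for tok in range(seq_len)]
        for layer in range(NUM_LAYERS)
    ]


def split_all_layers(hidden_states, num_queries):
    """Return (raw, aq): per-layer context latents and per-layer action-query latents.

    Assumes the action queries are the last `num_queries` tokens of every layer.
    """
    raw = [layer[:-num_queries] for layer in hidden_states]
    aq = [layer[-num_queries:] for layer in hidden_states]
    return raw, aq


def pick_single_layer(features, layer_index):
    """Single-layer alternative: layer_index becomes a hyperparameter to tune."""
    return features[layer_index]


hidden = fake_hidden_states()
raw_latents, aq_latents = split_all_layers(hidden, NUM_QUERIES)
# layers, context tokens per layer, action-query tokens per layer
print(len(raw_latents), len(raw_latents[0]), len(aq_latents[0]))
```

With the all-layer variant nothing has to be chosen; with `pick_single_layer` every candidate `layer_index` would need its own experiment.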
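The inputs described here are: the language+image hidden state (after attention), the action query hidden state (after attention), the initial action, and the proprioceptive state.
In practice, however, the proprioceptive state is never used in the code, and the initial action does not even exist in the code. Only the language+image hidden state and the action query hidden state remain, both being output embeddings (hidden states) left over from the previous layer, and the two are concatenated for self attention rather than put through cross attention.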
Bridge Attention.
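Bridge Attention: generates the Action conditioned on the VLM hidden state and the Action Query hidden state. In the code, however, the Bridge Attention design is not followed directly: there is no cross attention and no attention over the action. The code merely concatenates image feature + language feature + action query hidden state, runs them once through the self attention of self.language_model, and then obtains the normalized action through action_head (two variants: continuous, via L1 regression, and discrete, computed from the VLM logits).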
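The "concatenate, then self attention" behaviour described above can be sketched in plain Python. This is a toy under stated assumptions, not the repository's implementation: a single attention head, identity Q/K/V projections, no causal mask, and the action queries placed last. `forward_concat` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(sequence):
    """Single-head self attention with identity Q/K/V projections (a simplification)."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in sequence])
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, sequence))
            for d in range(dim)
        ])
    return outputs


def forward_concat(image_feats, text_embeds, action_queries):
    """Concatenate the three token groups, attend once, return the action-query outputs."""
    sequence = image_feats + text_embeds + action_queries  # list concat = token concat
    attended = self_attention(sequence)
    return attended[-len(action_queries):]  # these would feed the action head


image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[0.5, 0.5]]
action_queries = [[0.1, 0.2], [0.3, 0.4]]
aq_out = forward_concat(image_feats, text_embeds, action_queries)
print(len(aq_out), len(aq_out[0]))  # one output vector per action query
```

The point of the sketch is that the action queries see the image and language tokens only because all of them share one sequence; no separate cross-attention module is involved.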
The overall data flow is:
graph TD
    d[ViT]
    e[tokenizer+embeddings]
    f[language model]
    g((concat))
    h[[if action continuous]]
    i[L1 Regressive]
    j[get logits]
    subgraph Inputs:
        a(image)
        b(language instruction)
        c(action query: embedding vectors)
    end
    a-->d
    b-->e
    c-->g
    d-->g
    e-->g
    g-->f
    f-->h
    h-->|yes|i
    h-->|no|j
    subgraph Outputs:
        k(normalized actions)
    end
    i-->k
    j-->k
Experimental results: B1: Qwen2.5-0.5B (Qwen2.5VL, but actually the text-only version with SigLip added), B2: [2509.09372v2.pdf#page=6&selection=200,31,200,41|LLaMA2-7B], B3: [2509.09372v2.pdf#page=6&selection=206,6,206,16|OpenVLA-7B]
When the backbone VLM is not trained:
Inference speed:
On the LIBERO dataset:
On CALVIN (generalization ability):
Ablation experiments:
Action Query dim = 64:
Injection into all layers (?):
To prevent overflow: raw hidden state * tanh(g), action query * 1:
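A minimal sketch of this gating, assuming the two branches are concatenated after scaling (the note only gives the two multipliers, so the concatenation and the function names `gated_scale` and `combine` are assumptions made for illustration):

```python
import math


def gated_scale(g):
    """tanh bounds the multiplier on the raw hidden state to [-1, 1] for any g."""
    return math.tanh(g)


def combine(raw_branch, aq_branch, g):
    """Scale the raw-hidden-state branch by tanh(g); pass the action-query branch through (* 1).

    Concatenating the two scaled branches is an assumption of this sketch.
    """
    scale = gated_scale(g)
    return [scale * r for r in raw_branch] + [1.0 * a for a in aq_branch]


# However large the learned g grows, the raw branch is never amplified.
for g in (0.0, 0.5, 5.0, 500.0):
    print(g, combine([2.0, -4.0], [3.0], g))
```

If g were initialized to 0 (also an assumption here), the raw branch would start out fully muted and be mixed in gradually as g is learned, while the action-query branch always contributes at full strength.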