Motivation:

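  1. Scarcity of embodiment data: use a pretrained VLM to obtain prior knowledge, and train on cross-embodiment datasets.
  2. Generalization of the model architecture: use cross-embodiment training, merging data from multiple different robots into a single model.
  3. A more efficient training recipe: pretrain on a robot dataset of 10,000+ hours, then fine-tune for specific tasks.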

pipeline:

SigLIP serves as the ViT and the PaliGemma Transformer as the VLM backbone; a flow-matching denoising process then generates continuous actions.
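
The flow-matching part of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the convention used in the training diagram of these notes (t = 0 is the clean expert action, t = 1 is pure noise), samples t uniformly for simplicity, and uses hypothetical names (`flow_matching_pair`, `flow_matching_loss`) and a hypothetical batch shape.

```python
import numpy as np

def flow_matching_pair(action, noise, t):
    """Return the noisy sample x_t and the regression target u_t.

    Assumed convention: t = 0 is the clean expert action,
    t = 1 is pure Gaussian noise.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1, 1)  # broadcast over (horizon, action_dim)
    x_t = (1.0 - t) * action + t * noise              # linear interpolation between action and noise
    u_t = noise - action                              # velocity d(x_t)/dt, constant along the path
    return x_t, u_t

def flow_matching_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

rng = np.random.default_rng(0)
action = rng.normal(size=(4, 50, 7))   # hypothetical batch: 4 chunks, 50 steps, 7 action dims
noise = rng.normal(size=action.shape)
t = rng.uniform(0.0, 1.0, size=4)      # uniform sampling is a simplification
x_t, u_t = flow_matching_pair(action, noise, t)

v_pred = np.zeros_like(u_t)            # stand-in for the model's predicted velocity
loss = flow_matching_loss(v_pred, u_t)
```

In a real model, `v_pred` would come from the network conditioned on images, language, robot state, `x_t`, and `t`; the zero array here only makes the sketch runnable.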

Architecture of the training code:

graph TD

dataset[(Dataset)]

cat1(("concat"))
cat2(("concat"))
cat3(("concat"))

image(["**images**<br>wrist (1 or 2) and third-person camera"])
instruct(["**language instruction**<br>tokenized by PaliGemma Tokenizer"])
state(["**robot state**<br>e.g. joint angle"])
action(["**original action**<br>continuous action from expert"])

noise(["**noise**<br>sampled from normal distribution"])

dataset-->|sample batch|image
dataset-->|sample batch|instruct
dataset-->|sample batch|state
dataset-->|sample batch|action
state-->state_proj["state projection<br>nn.Linear"]-->cat2
action-->add1("(1-t)\*action+t\*noise")
noise-->add1
add1-->x_t

vlm["**VLM Backbone**<br>PaliGemma 2B"]
vlm_weight["Q/K/V projection from VLM Backbone"]
vit["**ViT**<br>SigLIP, ViT for VLM Backbone"]
vlm_embed["Embedding for VLM"]
ae["**Action Expert**<br>Gemma 300M"]
ae_weight["Q/K/V projection from Action Expert"]

subgraph VLM
	vlm-->vit
	vlm-->vlm_embed
	cat1
	vlm-->vlm_weight
end

subgraph AE
	ae-->ae_weight
end


image-->vit
instruct-->vlm_embed
vit-->cat1(("concat"))
vlm_embed-->cat1
cat1-->vlm_weight

x_t-->act_proj["action_in_proj & action time mlp in/out<br>nn.Linear"]-->cat2-->ae_weight

vlm_weight-->cat3
ae_weight-->cat3
cat3-->sa["self attention"]
sa-->|suffix output embed|v_t
noise-->minus(("\-"))
action-->minus-->u_t
u_t-->loss["MSE Loss"]
v_t-->loss
loss-->b[[Backward]]
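
The "Q/K/V projection, concat, self attention" part of the diagram can be illustrated with a single-head NumPy sketch. It is a simplified assumption rather than the actual implementation: each expert keeps its own projection weights, both project into a shared head dimension, and a simple mask lets prefix tokens (images and language) attend only to the prefix while suffix tokens (state and noisy actions) attend to everything. The real masking scheme may be finer-grained, and all names and sizes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(prefix, suffix, w_vlm, w_ae):
    """Single-head attention over [prefix; suffix] with per-expert projections.

    prefix: (P, d_vlm) image + language embeddings (VLM backbone side)
    suffix: (S, d_ae)  state + noisy-action embeddings (action expert side)
    w_vlm, w_ae: dicts with 'q', 'k', 'v' matrices mapping into a shared head dim
    """
    # Each expert projects its own tokens; the results are concatenated along the sequence.
    q = np.concatenate([prefix @ w_vlm["q"], suffix @ w_ae["q"]], axis=0)
    k = np.concatenate([prefix @ w_vlm["k"], suffix @ w_ae["k"]], axis=0)
    v = np.concatenate([prefix @ w_vlm["v"], suffix @ w_ae["v"]], axis=0)

    P, S = prefix.shape[0], suffix.shape[0]
    # Assumed mask: prefix rows cannot see suffix columns; suffix rows see all tokens.
    mask = np.ones((P + S, P + S), dtype=bool)
    mask[:P, P:] = False

    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e30)   # masked positions get ~zero attention weight
    return softmax(scores) @ v               # (P + S, head_dim)

rng = np.random.default_rng(0)
P, S = 6, 3                                  # hypothetical token counts
prefix = rng.normal(size=(P, 16))            # hypothetical VLM width 16
suffix = rng.normal(size=(S, 8))             # hypothetical action-expert width 8
w_vlm = {n: rng.normal(size=(16, 4)) for n in "qkv"}
w_ae = {n: rng.normal(size=(8, 4)) for n in "qkv"}

out = joint_attention(prefix, suffix, w_vlm, w_ae)
suffix_out = out[P:]                         # the "suffix output embed" that is mapped to v_t
```

Because the prefix rows never attend to the suffix under this mask, their outputs do not change when the noisy actions change, which is what would allow the image and language part to be computed once and reused.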

Architecture at inference time:

graph TD


dataset[(Observation)]

cat1(("concat"))
cat2(("concat"))
cat3(("concat"))

image(["**images**<br>wrist (1 or 2) and third-person camera"])
instruct(["**language instruction**<br>tokenized by PaliGemma Tokenizer"])
state(["**robot state**<br>e.g. joint angle"])


noise(["**noisy action**<br>sampled from normal distribution"])

dataset-->|sample batch|image
dataset-->|sample batch|instruct
dataset-->|sample batch|state
state-->state_proj["state projection<br>nn.Linear"]-->cat2

vlm["**VLM Backbone**<br>PaliGemma 2B"]
vlm_weight["Q/K/V projection from VLM Backbone"]
vit["**ViT**<br>SigLIP, ViT for VLM Backbone"]
vlm_embed["Embedding for VLM"]
ae["**Action Expert**<br>Gemma 300M"]
ae_weight["Q/K/V projection from Action Expert"]

subgraph VLM
	vlm-->vit
	vlm-->vlm_embed
	cat1
	vlm-->vlm_weight
end

subgraph AE
	ae-->ae_weight
end


image-->vit
instruct-->vlm_embed
vit-->cat1(("concat"))
vlm_embed-->cat1
cat1-->vlm_weight

noise-->act_proj["action_in_proj & action time mlp in/out<br>nn.Linear"]-->cat2-->ae_weight

vlm_weight-->cat3
ae_weight-->cat3
cat3-->sa["self attention"]
sa-->|suffix output embed|v_t
v_t-->minus(("\-"))
noise-->minus-->da(denoised actions)-->l["repeat in a loop to denoise (feed back into action_in_proj)"]
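
The denoising loop at the end of the diagram can be sketched as a fixed-step Euler integration. This is a minimal sketch under the same assumed convention as the training diagram (t = 1 is noise, t = 0 is the clean action); `sample_actions`, `velocity_fn`, and the default of 10 steps are illustrative choices, and the oracle velocity field below only stands in for the action expert's prediction.

```python
import numpy as np

def sample_actions(velocity_fn, noise, num_steps=10):
    """Integrate from t = 1 (noise) down to t = 0 (action) with Euler steps."""
    dt = -1.0 / num_steps                # negative step: move from noise toward the action
    x_t = noise
    for i in range(num_steps):
        t = 1.0 + i * dt                 # current time, from 1.0 down to 1/num_steps
        v_t = velocity_fn(x_t, t)        # stand-in for one forward pass of the action expert
        x_t = x_t + dt * v_t             # Euler update; result is fed back in the next step
    return x_t

rng = np.random.default_rng(0)
target = rng.normal(size=(50, 7))        # hypothetical ground-truth action chunk
noise = rng.normal(size=target.shape)

def oracle_velocity(x_t, t):
    # For x_t = (1 - t) * action + t * noise, the exact velocity is
    # noise - action, which equals (x_t - action) / t.
    return (x_t - target) / t

actions = sample_actions(oracle_velocity, noise, num_steps=10)
```

With the oracle field the trajectory is a straight line, so the Euler steps recover `target` up to floating-point error; a learned velocity field would only approximate this.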

metrics:

The paper reports no metrics on simulation benchmarks, only results from real-world tests:

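This experiment shows that the paper's flow-matching architecture performs very well and also outperforms the other VLA architectures it is compared against.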