Survey on VLA

Paper

The Evolution of Language and Vision Foundation Models

Language Foundation Models

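Starting from the Transformer, the field has introduced many new ideas.

Examples include models and training methods such as BERT, next-token prediction, GRPO, Mamba, and MoE, along with engineering techniques such as tensor parallelism and model quantization.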

Vision Foundation Models

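CLIP aligns language and image information, SigLIP replaces the softmax with a sigmoid loss for better efficiency, and DINO learns visual features through self-supervised learning.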
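To make the softmax-versus-sigmoid distinction concrete, here is a minimal pure-Python sketch. It assumes the similarity matrix already includes any temperature scaling, shows only the image-to-text direction of the softmax loss, and uses hypothetical function names; it is an illustration of the idea, not the reference implementation of either model.

```python
import math

def softmax_contrastive_loss(sim):
    """CLIP-style loss (image-to-text direction only) over an n x n similarity matrix."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        # The log-sum-exp couples every pair in row i through one normalizer.
        log_z = math.log(sum(math.exp(s) for s in sim[i]))
        total += -(sim[i][i] - log_z)
    return total / n

def sigmoid_pairwise_loss(sim, bias=0.0):
    """SigLIP-style loss: each (image, text) pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            z = label * (sim[i][j] + bias)
            # -log(sigmoid(z)) written as log(1 + exp(-z))
            total += math.log1p(math.exp(-z))
    return total / n
```

Because the sigmoid version needs no normalization across the batch, each pair's term can be computed independently, which is the property usually credited for its efficiency at large batch sizes.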

Depth Anything performs monocular depth estimation, SAM2 extends image segmentation to video, and CoTracker brings the Transformer architecture to point tracking.

GLIP extends CLIP to the region level, Grounding DINO 1.5 achieves SOTA results with a DETR-style architecture, and Grounded SAM combines Grounding DINO with SAM2 for zero-shot segmentation.

Image generation: models such as DALL-E and Stable Diffusion; ControlNet learns to condition generation on spatial structure.

Vision-Language Models

BLIP: an encoder-decoder architecture built on ViT and BERT; BLIP-2 uses a Q-Former.

Flamingo performs cross-modal alignment with a Perceiver Resampler and gated cross-attention layers. LLaVA aligns CLIP with the LLM through a single linear layer, and LLaVA-1.5 replaces that linear layer with an MLP.
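The difference between a linear projector and an MLP projector can be sketched in a few lines. This is a toy pure-Python version with made-up dimensions and random weights; the function names, the hidden size, and the choice of a two-layer MLP with GELU are assumptions for illustration, not the released model code.

```python
import math
import random

def linear(x, weight, bias):
    """Compute y = W x + b for a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(v):
    """tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def init(rows, cols, rng):
    """Small random weight matrix of shape rows x cols."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear_projector(patch, params):
    """One linear map from vision feature space to LLM embedding space."""
    return linear(patch, params["w"], params["b"])

def mlp_projector(patch, params):
    """Two linear maps with a nonlinearity in between."""
    hidden = [gelu(h) for h in linear(patch, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])

rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 4, 8, 6  # toy sizes, far smaller than real models

lin_params = {"w": init(LLM_DIM, VISION_DIM, rng), "b": [0.0] * LLM_DIM}
mlp_params = {
    "w1": init(HIDDEN_DIM, VISION_DIM, rng), "b1": [0.0] * HIDDEN_DIM,
    "w2": init(LLM_DIM, HIDDEN_DIM, rng), "b2": [0.0] * LLM_DIM,
}

# Three fake patch features standing in for vision encoder outputs.
patches = [[rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)] for _ in range(3)]
tokens_linear = [linear_projector(p, lin_params) for p in patches]
tokens_mlp = [mlp_projector(p, mlp_params) for p in patches]
```

Both projectors turn each patch feature into a vector of the LLM embedding size, so the resulting visual tokens can be placed alongside text token embeddings; the MLP only adds a nonlinearity between two linear maps.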

Qwen2-VL aligns the ViT with the Qwen LLM through a position-aware cross-attention adapter. Qwen2.5-VL extends this to the temporal domain, using M-RoPE rotary position embeddings to align time.
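A rough sketch of how M-RoPE-style position IDs can be laid out, assuming the commonly described decomposition of each position into (temporal, height, width) components. The function name and the segment format are hypothetical, and alignment of temporal IDs to absolute timestamps is deliberately left out.

```python
def mrope_position_ids(segments):
    """Assign a (temporal, height, width) ID triple to every token.

    segments: list of ("text", n_tokens) or ("video", frames, rows, cols).
    Text tokens share one running index across all three components.
    Visual tokens get a per-frame temporal ID and per-patch height/width IDs.
    """
    ids = []
    next_pos = 0
    for seg in segments:
        if seg[0] == "text":
            _, n = seg
            for k in range(n):
                p = next_pos + k
                ids.append((p, p, p))
            next_pos += n
        else:
            _, frames, rows, cols = seg
            for t in range(frames):
                for h in range(rows):
                    for w in range(cols):
                        ids.append((next_pos + t, next_pos + h, next_pos + w))
            # The next segment starts one past the largest ID used so far.
            next_pos += max(frames, rows, cols)
    return ids
```

For example, `mrope_position_ids([("text", 2), ("video", 2, 2, 2), ("text", 1)])` yields 11 triples: two text tokens, eight video patch tokens whose temporal component increases with the frame index, and one trailing text token.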

PaliGemma is a VLM that combines Gemma 2B with the SigLIP-So400m vision encoder; it is later used in the pi0 and pi0.5 series of models.

Embodied VLA Models as the Next Frontier

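Embodied intelligence faces a core problem: it involves a large amount of OOD (out-of-distribution) data. Deep learning learns from the training data distribution, but the real world contains many OOD scenarios, so a model trained with deep learning alone generalizes poorly.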

Overview of Action Tokens

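Action Tokens	Subtype	Advantages	Limitations	Notable Empirical Achievements
Language Description	Language Plan	1. Well supported by LLMs and VLMs; 2. Abundant co-training data; 3. Necessary for long-horizon planning	1. Limited expressiveness (ambiguous, hard to describe dexterous manipulation); 2. High inference latency	Make bed (pi0.5); Make a sandwich (Hi Robot)
	Language Motion	Data sharing across tasks	1. Limited expressiveness (ambiguous, hard to describe dexterous manipulation); 2. High inference latency	Pull napkin from dispenser (RT-H)
Code	API	1. Well supported by LLMs; 2. Clear control and planning logic; 3. Rich third-party library support	1. Heavy reliance on predefined APIs; 2. Brittle runtime execution	Rearrange restore (Instruct2Act)
Affordance	Keypoint	1. Precise interaction targets	1. Needs better capture of 3D spatial information; 2. Lacks temporal modeling; 3. Sensitive to visual noise	Pour tea (ReKep)
	Bounding Box	1. Well supported by VLMs; 2. Efficient instance-level localization	Same as Keypoint	Dexterous grasping in cluttered scenes (DexGraspVLA)
	Segmentation Mask	1. Captures fine-grained detail	Same as Keypoint	Decision-making in open world (ROCKET-1)
	Affordance Map	1. Dense; 2. Interaction-centric; 3. Covers the whole scene	Same as Keypoint	Deformable object manipulation (ManiFoundation)
Trajectory		1. Can be learned from off-domain video data; 2. Good cross-task generalization	1. Limited 3D expressiveness; 2. Limited VLM support; 3. Insufficient semantic grounding	Clean the table with a duster (RT-Trajectory)
Goal State		1. Well supported by foundation models; 2.		Transfer liquid using a pipette (VPP)
Latent Representation		1. Scales well with data, learning from human video and cross-embodiment data; 2. Greater expressive potential (compact structure, implicit semantics, multimodal integration)	1. Not interpretable; 2. Needs improvement in future work	Fold shorts (GO-1); Mine diamond in Minecraft (OmniJARVIS)
Raw Action		1. Minimal human knowledge required; 2. Minimal action-token annotation; 3. Training strategy similar to VLMs, so it extends to VLAs; 4. Efficient fine-tuning	1. Data scarcity; 2. High latency; 3. Poor cross-embodiment ability	Laundry fold (pi0); Light a match and light a candle (Real Time Chunking)
Reasoning		1. Improves generation of target actions; 2. Ability to solve complex problems	1. High latency; 2. A flexible reasoning paradigm is still needed	Autonomous driving (DriveVLM)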
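The two limitations listed for code as action tokens (reliance on predefined APIs and brittle runtime execution) can be illustrated with a toy executor. Everything here is hypothetical: the primitives `pick` and `place`, the plan format, and `run_plan` are invented for illustration and do not come from any of the systems in the table.

```python
def pick(obj, log):
    """Hypothetical primitive: record a pick action."""
    log.append(f"pick {obj}")

def place(obj, target, log):
    """Hypothetical primitive: record a place action."""
    log.append(f"place {obj} on {target}")

# The only skills the robot exposes; a generated plan cannot go beyond them.
API = {"pick": pick, "place": place}

def run_plan(plan, api=API):
    """Execute a plan given as a list of (primitive_name, args) pairs.

    Raises KeyError as soon as the plan calls a primitive outside the API,
    which is the kind of runtime failure a generated program can hit.
    """
    log = []
    for name, args in plan:
        if name not in api:
            raise KeyError(f"primitive '{name}' is not in the predefined API")
        api[name](*args, log)
    return log
```

A plan that stays inside the API, such as `[("pick", ("block",)), ("place", ("block", "bowl"))]`, runs to completion, while a plan that asks for an undefined skill such as `("pour", ("tea",))` fails at execution time rather than at planning time.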
