$π_{0.5}$

Paper

An improved version of $π_{0}$.

The work designs a training recipe that supplies broad knowledge, so that robots can generalize at different levels of abstraction.

graph TD
    WebData["Multimodal Web Data <br> (images, text, question answering, detection)"]
    RobotActionData["Robot Action Data <br> (from many kinds of robots)"]
    WebData -- "Co-training" --> Pi05("π₀.₅ Vision-Language-Action Policy")
    RobotActionData -- "Co-training" --> Pi05
    UserPrompt["High-level user instruction <br> 'clean the kitchen'"]
    UserPrompt -- "Input" --> Pi05
    Pi05 -- "1. Predict high-level subtask" --> Subtask["Semantic subtask <br> 'pick up the plate'"]
    Subtask -- "2. Used as low-level instruction" --> Pi05
    Pi05 -- "3. Generate low-level actions" --> ActionExpert(Action Expert)
    ActionExpert --> RobotAction["Robot action sequence <br> (continuous, high-frequency)"]
    RobotAction -- "Control" --> Robot(Robot execution)

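Training has two stages:

Mix heterogeneous data (data from different robots, high-level semantic annotations such as subtask decompositions, web data, and so on) and train the VLA so that it can produce high-level guidance (subtasks).
Fine-tune on both low-level actions and high-level semantic actions, specializing the model for mobile manipulation.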

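Inference steps:

First predict the semantic subtask: infer the next behavior to perform from the scene information and the task structure.
Then predict the robot's low-level actions conditioned on that subtask.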
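The two inference steps can be sketched as a single function that calls the model twice. This is a minimal illustration, not the actual π0.5 interface: `hierarchical_step`, `toy_subtask`, `toy_actions`, the observation dictionary, and the chunk length are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Observation = Dict[str, object]  # hypothetical container for images, robot state, ...
Action = List[float]             # hypothetical low-level command vector


def hierarchical_step(
    predict_subtask: Callable[[Observation, str], str],
    predict_actions: Callable[[Observation, str], Sequence[Action]],
    observation: Observation,
    instruction: str,
) -> Tuple[str, List[Action]]:
    """Run one inference step: subtask first, then an action chunk."""
    # Step 1: infer the next semantic subtask from the scene and the overall task.
    subtask = predict_subtask(observation, instruction)
    # Step 2: predict low-level actions conditioned on that subtask.
    actions = predict_actions(observation, subtask)
    return subtask, list(actions)


# Toy stand-ins so the sketch runs; a real system would query the policy here.
def toy_subtask(obs: Observation, instruction: str) -> str:
    return "pick up the plate" if obs.get("plate_visible") else "look for dishes"


def toy_actions(obs: Observation, subtask: str) -> Sequence[Action]:
    horizon = 3  # hypothetical chunk length H
    return [[0.0, 0.0] for _ in range(horizon)]


subtask, chunk = hierarchical_step(
    toy_subtask, toy_actions, {"plate_visible": True}, "clean the kitchen"
)
```

Passing the two predictors as arguments keeps the sketch agnostic about whether they are two heads of one model or two separate calls.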

Preliminaries

The VLA objective: $$\max_{θ} \mathbb{E}_{(a_{t:t+H}, o_{t}, l) \sim D} \left[ \log π_{θ}(a_{t:t+H} ∣ o_{t}, l) \right]$$ where:

$a_{t:t+H}$ : the action chunk (or a single action)
$o_{t}$ : the observation
$l$ : the language instruction
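To make the objective concrete, the sketch below estimates the expected log-likelihood over a tiny dataset of (action chunk, observation, instruction) triples. It assumes an isotropic Gaussian policy purely for illustration; that is not the parameterization used by the model, and every name here (`gaussian_log_prob`, `imitation_objective`, the toy policies and data) is hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str, str]  # (action chunk, observation id, instruction)


def gaussian_log_prob(actions: Sequence[float], mean: Sequence[float], std: float = 1.0) -> float:
    """Log-density of an action chunk under an isotropic Gaussian (illustrative assumption)."""
    return sum(
        -0.5 * ((a - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(actions, mean)
    )


def imitation_objective(
    dataset: List[Sample],
    policy_mean: Callable[[str, str], Sequence[float]],
) -> float:
    """Monte Carlo estimate of E_D[log pi(a | o, l)]; training would maximize this."""
    total = sum(gaussian_log_prob(a, policy_mean(o, l)) for a, o, l in dataset)
    return total / len(dataset)


toy_dataset: List[Sample] = [
    ([0.5, -0.5], "obs_a", "pick up the plate"),
    ([1.0, 0.0], "obs_b", "open the drawer"),
]


def matching_policy(obs: str, instruction: str) -> Sequence[float]:
    # Predicts exactly the demonstrated actions.
    return {"obs_a": [0.5, -0.5], "obs_b": [1.0, 0.0]}[obs]


def zero_policy(obs: str, instruction: str) -> Sequence[float]:
    # Ignores its inputs and always predicts zeros.
    return [0.0, 0.0]


good_score = imitation_objective(toy_dataset, matching_policy)
bad_score = imitation_objective(toy_dataset, zero_policy)
```

A policy that reproduces the demonstrations scores higher than one that ignores them, which is the property the maximization exploits.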

The $π_{0.5}$ Model and Training Recipe

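The recipe has two broad stages, starting from a VLM pretrained on web data:

pre-train: adapt the VLM to a diverse range of tasks
post-train: specialize it for mobile manipulation and equip it with an efficient test-time inference mechanism

During pre-training, all robot actions are represented as discrete tokens, which keeps training simple, scalable, and efficient.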
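One simple way to turn continuous actions into discrete tokens is uniform binning, sketched below. These notes do not say which tokenizer the model uses, so this is a stand-in technique chosen for clarity; the function names, the value range, and the bin count are all assumptions.

```python
from typing import List, Sequence


def action_to_tokens(
    action: Sequence[float], low: float = -1.0, high: float = 1.0, num_bins: int = 256
) -> List[int]:
    """Map each action dimension to an integer bin index in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep values inside the assumed range
        index = int((clipped - low) / (high - low) * num_bins)
        tokens.append(min(index, num_bins - 1))  # the upper edge falls into the last bin
    return tokens


def tokens_to_action(
    tokens: Sequence[int], low: float = -1.0, high: float = 1.0, num_bins: int = 256
) -> List[float]:
    """Decode bin indices back to bin centers (lossy inverse of action_to_tokens)."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


tokens = action_to_tokens([-1.0, 0.0, 0.3, 1.0])
recovered = tokens_to_action(tokens)
```

The round trip is lossy by at most half a bin width per dimension, which hints at why a finer-grained representation can be useful for precise control.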

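During post-training, an Action Expert (as in $π_{0}$) is added to the model. It uses a finer-grained action representation and makes real-time control more compute-efficient.

At inference time, the model first produces a high-level subtask, then uses the action expert to generate low-level actions conditioned on that subtask.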

The $π_{0.5}$ architecture

policy: $π_{θ}(a_{t:t+H}, \hat{l} ∣ o_{t}, l) = π_{θ}(a_{t:t+H} ∣ o_{t}, \hat{l}) \, π_{θ}(\hat{l} ∣ o_{t}, l)$ , where $\hat{l}$ denotes the predicted semantic subtask.
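Taking logarithms of the factorization above makes the two-stage structure explicit. This is only the chain rule of probability applied to the stated policy, not a claim about the exact training loss:

$$\log π_{θ}(a_{t:t+H}, \hat{l} ∣ o_{t}, l) = \log π_{θ}(a_{t:t+H} ∣ o_{t}, \hat{l}) + \log π_{θ}(\hat{l} ∣ o_{t}, l)$$

The second term scores the subtask prediction and the first scores the action chunk given that subtask, matching the two inference steps. Note that the exact chain rule would condition the action term on $l$ as well. Writing it as $π_{θ}(a_{t:t+H} ∣ o_{t}, \hat{l})$ assumes the actions depend on the overall instruction $l$ only through the subtask $\hat{l}$.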
