Paper Reading


OpenVLA

::: block VLM:

  • bridge features from a pretrained visual encoder (e.g. DINOv2, SigLIP) with a pretrained LLM (e.g. Llama)

Generalist Robot Policies:

  • Octo: policy learning that composes pretrained components and learns to “stitch” them together
  • OpenVLA: end-to-end
    • more generalist
    • trained on a large Internet-scale dataset
    • generic architecture :::

VLM

::: block

  • visual encoder: maps image inputs to image patch embeddings
  • projector: aligns the image patch embeddings with the LLM's word-embedding space
  • LLM backbone (the three stages are sketched below) :::
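A minimal sketch of this three-stage forward pass, assuming an HF-style decoder that accepts `inputs_embeds`; `TinyVLM`, the dimensions, and the module names are illustrative, not the actual OpenVLA code.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative VLM skeleton: vision encoder -> projector -> LLM backbone."""
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a frozen SigLIP/DINOv2 wrapper
        self.projector = nn.Sequential(             # aligns patch embeddings with word embeddings
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                              # e.g. a Llama-style decoder

    def forward(self, image, text_embeds):
        patches = self.vision_encoder(image)        # (B, n_patches, vision_dim)
        visual_tokens = self.projector(patches)     # (B, n_patches, llm_dim)
        # prepend the projected visual tokens to the word embeddings and run the LLM
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)       # assumes a HuggingFace-style interface
```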

OpenVLA

::: block

  • concatenate SigLIP + DINOv2 features (DINOv2 helps spatial reasoning)
  • projector: 2-layer MLP
  • Llama 2 as the LLM backbone
  • map continuous actions to discrete action tokens (see the discretization sketch below)
    • discretize each dimension of the robot action separately into one of 256 bins
    • the bins uniformly divide the interval between the 1st and 99th quantile of the actions in the training data (robust to outliers)
  • Training Data: Open X-Embodiment dataset :::
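A sketch of the per-dimension discretization described above (256 uniform bins between the 1st and 99th quantile); the helper names are mine, and OpenVLA's exact bin convention may differ in details.

```python
import numpy as np

def build_bins(actions, n_bins=256, low_q=0.01, high_q=0.99):
    """Per-dimension bin edges uniformly covering the 1st-99th quantile range."""
    lo = np.quantile(actions, low_q, axis=0)            # (action_dim,)
    hi = np.quantile(actions, high_q, axis=0)
    return np.linspace(lo, hi, n_bins + 1, axis=0)      # (n_bins + 1, action_dim)

def action_to_tokens(action, edges):
    """Map one continuous action vector to discrete bin indices (action tokens)."""
    n_bins = edges.shape[0] - 1
    tokens = np.empty(action.shape[0], dtype=np.int64)
    for d in range(action.shape[0]):
        # digitize against the interior edges; clip handles out-of-range values
        tokens[d] = np.clip(np.digitize(action[d], edges[1:-1, d]), 0, n_bins - 1)
    return tokens

def tokens_to_action(tokens, edges):
    """Decode action tokens back to continuous values via bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])            # (n_bins, action_dim)
    return centers[tokens, np.arange(tokens.shape[0])]
```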

RT-1

Preliminaries

::: block Robot learning:

  • aim to learn a robot policy $\pi(a_t \mid i, \{x_j\}_{j=0}^{t})$ conditioned on the language instruction $i$ and the image history $\{x_j\}$
  • at each step, sample the action $a_t$ from the learned distribution
  • target: maximize the average reward (a binary signal indicating whether the task was completed)

Transformer:

  • sequence model
  • maps image & text tokens to an action sequence

Imitation Learning:

  • minimize the gap between the learned policy $\pi$ and the expert (demonstration) policy $\pi^{*}$
  • trained with the negative log-likelihood of the demonstrated actions (see the behavior-cloning sketch after this list) :::
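A minimal behavior-cloning step for the negative log-likelihood objective above, assuming a policy that outputs per-dimension logits over discretized action bins; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(policy, images, instruction, expert_action_tokens):
    """Negative log-likelihood of the expert's discretized actions under the policy."""
    logits = policy(images, instruction)       # (B, action_dim, n_bins)
    return F.cross_entropy(                    # cross-entropy == NLL of the demonstrated bins
        logits.flatten(0, 1),                  # (B * action_dim, n_bins)
        expert_action_tokens.flatten(),        # (B * action_dim,)
    )
```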

System Overview

```mermaid
graph TB
a[Textual Instruction]-->|Universal Sentence Encoder|b[instruction embedding]
c[images]-->|ImageNet-pretrained EfficientNet|d[features]
b-->|FiLM|e(affine transform)
d-->e
e-->|TokenLearner|f[compressed tokens]
f-->|Transformer|g[output tokens]
g-->|de-tokenize|h[action]
```
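The FiLM step in the diagram applies a language-conditioned, per-channel affine transform to the image features; a sketch assuming 2D convolutional feature maps (the layer sizes and the identity-friendly `1 + gamma` variant are my choices, not necessarily RT-1's exact parameterization).

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift image features using the
    instruction embedding, so vision features become language-conditioned."""
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, n_channels)   # per-channel scale
        self.to_beta = nn.Linear(text_dim, n_channels)    # per-channel shift

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) image features, text_emb: (B, text_dim) sentence embedding
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return (1 + gamma) * feat + beta
```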

RT-2

RT-2

::: block Model:

  • tokenize images with the VLM's pretrained ViT encoder; image tokens share the embedding space with text tokens
  • use PaLI-X and PaLM-E as the VLM backbones
  • decode the output tokens into robot actions; actions are emitted as strings of discretized integer tokens (see the sketch below) :::
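A sketch of turning such a generated action string back into a continuous command. The eight-number layout (terminate flag, arm deltas, gripper) and the uniform de-discretization range are assumptions for illustration.

```python
import numpy as np

def detokenize_action(token_str, n_bins=256, low=-1.0, high=1.0):
    """Parse a generated string such as '1 128 91 241 5 101 127 217' into a
    terminate flag plus continuous values taken at the bin centers."""
    bins = np.array([int(t) for t in token_str.split()], dtype=np.int64)
    centers = low + (bins + 0.5) * (high - low) / n_bins   # uniform de-discretization
    terminate = bins[0]                                    # first number: episode-terminate flag
    arm_and_gripper = centers[1:]                          # remaining numbers: continuous deltas
    return terminate, arm_and_gripper
```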

::: block Co-Fine-tuning:

  • train on a mixture of web-scale vision-language data and robot trajectory data (rather than robot data alone), which yields more generalizable policies

Output Constraint:

  • sample robot-action tokens only when prompted with a robot-action task (see the constrained-decoding sketch below)
  • otherwise, answer in natural language

Chain of thought:

  • an additional Plan step: first describe, in natural language, the purpose of the action the robot is about to take
  • then output the actual action tokens :::
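A minimal sketch of the output constraint: mask the vocabulary down to the action tokens whenever the prompt is a robot-action task. The function and its arguments are illustrative, not RT-2's actual decoding code.

```python
import torch

def constrained_next_token(logits, action_token_ids, is_robot_task):
    """Sample the next token; for robot-action prompts, restrict sampling to the
    action-token vocabulary, otherwise sample freely (natural-language answer)."""
    if is_robot_task:
        mask = torch.full_like(logits, float("-inf"))
        mask[..., action_token_ids] = 0.0       # keep only the valid action tokens
        logits = logits + mask
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```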

RDT-1B

::: block DiT:

  • Diffusion Transformer: combines a diffusion objective with a Transformer backbone

VLA:

  • Vision-Language-Action model :::

Problem formulation

::: block

  • $X_t$: RGB image history
  • $z_t$: low-dimensional proprioception of the robot
  • $c$: control frequency
  • $a_t$: action, usually a subset of $z_{t+1}$ :::

Diffusion Model

::: block

  1. predict an action chunk (a short sequence of future actions) rather than a single action, to encourage temporal consistency and alleviate error accumulation over time (see the sketch below) :::
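A sketch of receding-horizon execution with action chunks, assuming a hypothetical `env` with a `step(action) -> (obs, done)` interface; the chunk size and loop structure are illustrative.

```python
def run_episode(policy, env, obs, chunk_size=16, max_steps=400):
    """Predict a chunk of future actions at once, execute it open-loop, then re-plan.
    Chunking keeps consecutive actions temporally consistent, and periodic
    re-planning limits how far prediction errors can accumulate."""
    for _ in range(max_steps // chunk_size):
        action_chunk = policy(obs)              # (chunk_size, action_dim)
        for action in action_chunk:
            obs, done = env.step(action)        # hypothetical environment interface
            if done:
                return obs
    return obs
```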

Encoding

::: block

  • low-dimensional vectors representing physical quantities (proprioception, action chunk, control frequency)
    • encoded with an MLP on Fourier features, to capture high-frequency changes (see the sketch below)
  • image input: high-dimensional
    • encoded with an image-text-aligned pretrained vision encoder: SigLIP
  • language input:
    • encoded with pretrained T5-XXL :::
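A sketch of a Fourier-feature MLP for the low-dimensional inputs; the frequency schedule, layer widths, and class name are my assumptions, not RDT's exact encoder.

```python
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Encode a low-dimensional physical quantity (proprioception, control frequency, ...)
    with sinusoidal Fourier features before an MLP, so high-frequency variation survives."""
    def __init__(self, in_dim, hidden_dim=256, n_freqs=8):
        super().__init__()
        # fixed log-spaced frequencies; 2 * n_freqs features per input dimension
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim * 2 * n_freqs, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):                      # x: (B, in_dim)
        ang = x[..., None] * self.freqs        # (B, in_dim, n_freqs)
        feat = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)  # (B, in_dim * 2 * n_freqs)
        return self.mlp(feat)
```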

Network Structure

::: block

  • QKNorm: normalize queries and keys before the attention dot product (see the sketch below)
  • RMSNorm instead of LayerNorm
  • MLP decoder instead of a linear decoder
  • alternating condition injection: text and image conditions are injected in alternating layers via cross-attention :::
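Minimal sketches of RMSNorm and the QKNorm idea mentioned above, in a generic formulation; the layer placement and details in RDT may differ.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features; unlike LayerNorm,
    there is no mean subtraction and no bias term."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

def qk_norm_scores(q, k, norm_q, norm_k, scale):
    """QKNorm: normalize queries and keys before the dot product so the
    attention logits stay in a numerically stable range."""
    return (norm_q(q) @ norm_k(k).transpose(-2, -1)) * scale
```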

Data

::: block Physically Interpretable Unified Action Space:

  • map each robot's proprioception $z_t$ and action $a_t$ into a shared 128-dimensional space in which every dimension has a fixed physical meaning
  • the unified space lets heterogeneous multi-robot data be trained on jointly (see the sketch below) :::
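A toy sketch of embedding a robot-specific action into such a unified space. The slot map, the slot assignments, and the NaN padding are illustrative choices, not RDT's actual specification.

```python
import numpy as np

UNIFIED_DIM = 128
# Hypothetical slot map: each physical quantity owns fixed indices in the unified vector.
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),
    "right_gripper_width": slice(7, 8),
}

def to_unified(robot_action: dict) -> np.ndarray:
    """Embed a robot-specific action dict into the shared, physically interpretable
    space; dimensions this robot does not have are left as padding (NaN here)."""
    unified = np.full(UNIFIED_DIM, np.nan, dtype=np.float32)
    for name, value in robot_action.items():
        unified[SLOTS[name]] = value
    return unified

# Example: a single-arm robot with 7 joints and a parallel gripper.
unified = to_unified({"right_arm_joint_pos": np.zeros(7), "right_gripper_width": 0.04})
```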


pi0

::: block Flow Matching:

  • denoise along a conditional probability path from noise to data
  • loss: $\mathcal{L}_{\mathrm{CFM}}(\theta)=\mathbb{E}_{\tau,\epsilon}\big\|v_\theta(x^{\tau},\tau)-u(x^{\tau}\mid x)\big\|^{2}$, regressing the predicted velocity onto the conditional target velocity (see the sketch after this block)

Transfusion:

  • train a single transformer with multiple objectives (language modeling on discrete tokens, diffusion on continuous tokens)
  • loss: $\mathcal{L}=\mathcal{L}_{\mathrm{LM}}+\lambda\,\mathcal{L}_{\mathrm{diffusion}}$ :::
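A generic conditional flow-matching training step matching the loss above, using a linear noise-to-data path; the network signature, tensor shapes, and the particular time/sign convention are assumptions (papers differ on these).

```python
import torch
import torch.nn.functional as F

def cfm_loss(velocity_net, actions, obs_embedding):
    """Conditional flow-matching loss on an action chunk.
    Linear path x_tau = tau * x1 + (1 - tau) * eps; target velocity is x1 - eps."""
    eps = torch.randn_like(actions)                          # noise sample
    tau = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_tau = tau * actions + (1 - tau) * eps                  # point on the probability path
    target = actions - eps                                   # conditional velocity field
    pred = velocity_net(x_tau, tau.squeeze(-1).squeeze(-1), obs_embedding)
    return F.mse_loss(pred, target)
```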

Model

::: block

  • model the data distribution $p(\mathbf{A}_t \mid \mathbf{o}_t)$, where $\mathbf{A}_t=[a_t,\dots,a_{t+H-1}]$ is an action chunk and $\mathbf{o}_t$ gathers the images, language prompt, and proprioceptive state
  • actions are handled by a dedicated action expert trained with the CFM loss $\mathbb{E}\big\|v_\theta(\mathbf{A}_t^{\tau},\mathbf{o}_t)-u(\mathbf{A}_t^{\tau}\mid\mathbf{A}_t)\big\|^{2}$ (a sampling sketch follows this block)

:::
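A sketch of drawing an action chunk from the learned vector field by forward-Euler integration from Gaussian noise; the step count, function signature, and shapes are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, obs_embedding, chunk_shape, n_steps=10):
    """Generate an action chunk by integrating the learned velocity field with
    forward Euler steps, moving from pure noise (tau = 0) toward data (tau = 1)."""
    x = torch.randn(chunk_shape)                     # noise at tau = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        tau = torch.full((chunk_shape[0],), i * dt)  # current time for each batch element
        x = x + dt * velocity_net(x, tau, obs_embedding)
    return x                                         # approximate sample from p(A_t | o_t)
```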

Train Recipe

::: block

  • first pretrain on a large, diverse dataset
  • then fine-tune on data for the specific target task :::