Knowledge Base

Pre

Jun 15, 2025

  • pre
  • RL
  • DL
  • LLM
  • VLA
  • EmbodiedAI
  • algorithm

2025-03-18

Paper Reading


OpenVLA

—

Related work

::: block VLM:

  • bridge features from a pretrained visual encoder (e.g. DINOv2, SigLIP) with a pretrained LLM (e.g. Llama)

Generalist Robot Policies:

  • Octo: policy learning; composes pretrained components and learns to “stitch” them together
  • OpenVLA: end-to-end
    • more generalist
    • large Internet-scale dataset
    • generic architecture :::

—

VLM

::: block

  • visual encoder: map image inputs to image patch embeddings
  • projector: align image embeddings with word embeddings
  • LLM backbone :::

—

OpenVLA

::: block

  • concat SigLIP + DINOv2 features (helpful for improving spatial reasoning)
  • projector: 2-layer MLP
  • use Llama 2 as backbone
  • map continuous actions into discrete action tokens (see the sketch below)
    • discretize each dimension of the robot action separately into one of 256 bins
    • bins uniformly divide the range between the 1st and 99th quantile of the training actions
  • Training Data: Open X-Embodiment dataset :::
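
A minimal NumPy sketch of the binning scheme above (function name and array layout are my own; the mapping of bin indices into the LLM vocabulary is omitted):

import numpy as np

def discretize_actions(actions: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Discretize each action dimension into one of `n_bins` tokens.

    `actions` is an (N, D) array of continuous robot actions.
    Bin edges uniformly divide the range between the 1st and 99th
    quantile of each dimension, which bounds the effect of outliers.
    """
    low = np.quantile(actions, 0.01, axis=0)    # (D,) 1st quantile
    high = np.quantile(actions, 0.99, axis=0)   # (D,) 99th quantile
    clipped = np.clip(actions, low, high)
    tokens = (clipped - low) / (high - low + 1e-8) * (n_bins - 1)
    return np.round(tokens).astype(np.int64)    # integers in [0, n_bins - 1]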

RT-1

—

Preliminaries

::: block Robot learning:

  • aim to learn a robot policy $\pi(\cdot \mid i, x_t)$
  • sample the action $a_t$ from the learned distribution $\pi(\cdot \mid i, \{x_j\}_{j=0}^{t})$
  • target: maximize the average reward (indicating whether the task is completed)

Transformer:

  • sequence model
  • maps image & text tokens to an action sequence

Imitation Learning:

  • minimize the gap between $\hat{a}_t$ and $a_t^{\text{expert}}$
  • refine $\pi$ by negative log-likelihood :::
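
A minimal sketch of the imitation objective above, assuming the policy outputs logits over discretized action tokens (names and shapes are illustrative, not RT-1's actual code):

import torch
import torch.nn.functional as F

def imitation_nll(action_logits: torch.Tensor, expert_actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of expert action tokens under the policy.

    action_logits: (B, T, num_bins) logits from the transformer policy.
    expert_actions: (B, T) integer action tokens from demonstrations.
    """
    return F.cross_entropy(
        action_logits.flatten(0, 1),   # (B*T, num_bins)
        expert_actions.flatten(),      # (B*T,)
    )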

—

System Overview

graph TB
a[Textual Instruction]-->|Universal Sentence Encoder|b[word embedding vector]
c[images]-->|ImageNet-pretrained EfficientNet|d[features]
b-->|FiLM|e(affine transform)
d-->e
e-->|Tokenizer|f[Token]
f-->|Transformer|g[output Tokens]
g-->|Tokenizer Decode|h[action]

RT-2

—

RT-2

::: block Model:

  • use CLIP to tokenize images and share embeddings with text
  • use PaLI-X and PaLM-E as the VLM backbone
  • decode output action tokens :::

—

::: block Co-Fine-tuning:

  • combine robot data with Internet-scale vision-language data to learn more generalizable policies

Output Constraint:

  • only sample robot-action tokens when prompted with a robot-action task
  • otherwise, answer in natural language

Chain of Thought:

  • an additional Plan step: first describe in natural language the purpose of the action the robot is about to take
  • then output the actual action tokens :::

RDT-1B

—

Related work

::: block DiT:

  • combine diffusion and transformer

VLA:

  • Vision-Language-Action Model :::

—

Problem formulation

::: block $o_t := (X_{t-T_{img}+1:t+1}, z_t, c)$

  • $X_{t-T_{img}+1:t+1} = (X_{t-T_{img}+1}, \cdots, X_t)$: RGB image history
  • $z_t$: low-dimensional proprioception of the robot
  • $c$: control frequency
  • $a_t$: action, usually a subset of $z_{t+1}$ :::

—

Diffusion Model

::: block

  1. $a_t^K \sim \mathcal{N}(0, I)$
  2. $a_t^{k-1} = \frac{\sqrt{\bar{\alpha}_{k-1}}\,\beta_k}{1-\bar{\alpha}_k}\, a_t^0 + \frac{\sqrt{\alpha_k}\,(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_k}\, a_t^k + \sigma_k z$
    • $a_t^0 = f_\theta(l, o_t, a_t^k, k)$
    • $L(\theta) = \mathrm{MSE}\big(a_t, f_\theta(l, o_t, \sqrt{\bar{\alpha}_k}\, a_t + \sqrt{1-\bar{\alpha}_k}\,\epsilon, k)\big)$
    • i.e. the noised input is $a_t^k = \sqrt{\bar{\alpha}_k}\, a_t + \sqrt{1-\bar{\alpha}_k}\,\epsilon$
  3. use action chunks to encourage temporal consistency and alleviate error accumulation over time :::
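
A compact sketch of the sampling loop above, assuming `f_theta(l, o_t, a_k, k)` predicts the clean action chunk and `alpha`, `alpha_bar`, `beta`, `sigma` are precomputed DDPM schedules indexed 1..K (all names are placeholders, not RDT's actual API):

import torch

@torch.no_grad()
def sample_action_chunk(f_theta, l, o_t, chunk_shape, alpha, alpha_bar, beta, sigma, K):
    """DDPM-style sampling: start from Gaussian noise and iteratively denoise."""
    a_k = torch.randn(chunk_shape)                       # step 1: a_t^K ~ N(0, I)
    for k in range(K, 0, -1):
        a_0 = f_theta(l, o_t, a_k, k)                    # predicted clean chunk
        coef_0 = torch.sqrt(alpha_bar[k - 1]) * beta[k] / (1 - alpha_bar[k])
        coef_k = torch.sqrt(alpha[k]) * (1 - alpha_bar[k - 1]) / (1 - alpha_bar[k])
        z = torch.randn_like(a_k) if k > 1 else torch.zeros_like(a_k)
        a_k = coef_0 * a_0 + coef_k * a_k + sigma[k] * z  # step 2: posterior sample
    return a_k                                           # approximately a_t^0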

—

Encoding

::: block

  • low-dimensional vectors representing physical quantities (proprioception, action chunk, control frequency)
    • use an MLP with Fourier features to capture high-frequency changes
  • image input: high-dimensional
    • use an image-text-aligned pretrained vision encoder: SigLIP
  • language input:
    • pretrained T5-XXL :::
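
A minimal sketch of a Fourier-feature MLP for the low-dimensional inputs; the frequency scale and hidden size here are arbitrary choices, not RDT's:

import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Encode low-dimensional physical quantities (proprioception, action
    chunk, control frequency) with random Fourier features + an MLP so the
    network can represent high-frequency changes in these inputs."""

    def __init__(self, in_dim: int, hidden_dim: int = 256, n_freqs: int = 32):
        super().__init__()
        self.register_buffer("B", torch.randn(in_dim, n_freqs) * 10.0)  # fixed random frequencies
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * x @ self.B                  # (batch, n_freqs)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)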

—

Network Structure

::: block

  • QKNorm
  • RMSNorm instead of LayerNorm
  • MLP Decoder instead of linear decoder
  • Alternating Condition Injection :::
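
For reference, a minimal RMSNorm in the standard formulation (not RDT's exact code):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering: rescale by the root-mean-square."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight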

—

Data

::: block Physically Interpretable Unified Action Space:

  • $z_t$ and $a_t$
  • mapped into a unified action space across robots :::


pi0

—

Related Work

::: block Flow Matching:

  • denoise along a conditional probability path $p_t(x \mid x_1)$
  • loss: $L_{CFM}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x \mid x_1)} \| v_t(x) - u_t(x \mid x_1) \|_2^2$

Transfusion:

  • train a single transformer with multiple objectives
  • loss: $L = L_{LM} + \lambda L_{DDPM}$ :::

—

π0 Model

::: block

  • model the data distribution $p(A_t \mid o_t)$
    • $A_t = [a_t, a_{t+1}, \cdots, a_{t+H-1}]$
    • $o_t = [I_t^1, \cdots, I_t^n, l_t, q_t]$
  • handle actions with an action expert, trained with the CFM loss: $L^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)} \| v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \|^2$

:::
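
A rough sketch of the CFM objective above, assuming a linear noise-data path with target velocity $A_t - \epsilon$; π0's exact parameterization may differ in sign or convention, and all names are placeholders:

import torch

def cfm_loss(v_theta, actions, obs):
    """Conditional flow-matching loss for the action expert.

    actions: (B, H, D) clean action chunk A_t; obs: observation features o_t.
    A linear path between Gaussian noise and the data is assumed here.
    """
    eps = torch.randn_like(actions)               # noise sample
    tau = torch.rand(actions.shape[0], 1, 1)      # flow time in (0, 1)
    a_tau = tau * actions + (1 - tau) * eps       # noised chunk A_t^tau
    target_u = actions - eps                      # assumed target velocity u(A_t^tau | A_t)
    pred_v = v_theta(a_tau, obs, tau)             # predicted vector field
    return ((pred_v - target_u) ** 2).mean()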

—

Train Recipe

::: block

  • first pretrain on a large, diverse dataset
  • then fine-tune on the specific task :::

2025-04-09

2025-04-17

ReplanVLM

—

graph TB
a(User Input)
b(Observe Image)
c(Decision Bot)
d(Task plan)
e(Code Generation)
f(Inner Bot)
g(Environment)
h(Extra Bot)

a-->c
b-->c
c-->d
d-->e
e-->f
f-->|No|c
f-->|Yes|g
g-->h
h-->|No|c
h-->|Yes|i(end)

—

  • Decision Bot:
    • generate task plan based on user input and observed images
    • generate code based on task plan
  • Inner Bot:
    • check code correctness
    • check with environment and codebase information
  • Extra Bot:
    • compare images before and after taking action
    • return feedback if the action did not succeed

Chain of Verification

—

::: block

  1. generate a baseline response with the LLM

  2. based on the user input and the baseline response, generate verification questions for the baseline response

  3. independently answer the verification questions (without seeing the baseline response),

    then check the answers against the baseline response

  4. generate the final response based on the baseline response and the feedback from step 3

:::
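
The four steps read naturally as a pipeline; a hedged sketch with a generic `llm(prompt)` callable (not any specific API):

def chain_of_verification(llm, user_query: str) -> str:
    """Minimal Chain-of-Verification loop around a generic `llm(prompt)` callable."""
    # 1. baseline response
    baseline = llm(f"Answer the question:\n{user_query}")
    # 2. plan verification questions from the query and the baseline answer
    questions = llm(
        f"Question: {user_query}\nDraft answer: {baseline}\n"
        "List verification questions that would check this draft, one per line."
    ).splitlines()
    # 3. answer each verification question independently (baseline hidden),
    #    then the answers are checked against the baseline in the final prompt
    checks = [(q, llm(q)) for q in questions if q.strip()]
    # 4. final response conditioned on the baseline and the verification results
    feedback = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        f"Question: {user_query}\nDraft answer: {baseline}\n"
        f"Verification results:\n{feedback}\nWrite a corrected final answer."
    )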


Adaptive Interactive Navigation

—

  • Task planning:
    • generate a skill tree
    • evaluate nodes and find a high-level skeleton
  • Advisor:
    • interpret the environment: Failure, New Object, Re-evaluation
  • Arborist:
    • add nodes for new information
    • prune failed nodes

2025-04-22

RL & LLM


L2R

Language to Reward for Robotic Skill Synthesis

—

Background

::: block

  • MDP problem: $\langle S, A, R, P, p_0 \rangle$
  • reward assumption: $R(s, a) = -\sum_{i=0}^{M} w_i \cdot n_i(r_i(s, a, \psi_i))$ :::

—

Method

::: block Motion Description

  • use an LLM to interpret and expand the user input into a natural-language description of the robot motion
  • using a prompt template

Reward Coding

  • use an LLM to generate the reward function (toy example below) :::
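
For intuition, a toy example of the kind of reward code the LLM might emit, matching the weighted-sum form $R(s,a) = -\sum_i w_i \cdot n_i(r_i)$ above; the state keys, weights, and task are invented for illustration and are not the paper's reward primitives:

import numpy as np

def reward(state: dict, action: np.ndarray) -> float:
    """Hypothetical LLM-generated reward for 'lift the apple above the table':
    a weighted sum of normalized terms, i.e. R(s, a) = -sum_i w_i * n_i(r_i)."""
    dist_to_apple = np.linalg.norm(state["gripper_pos"] - state["apple_pos"])
    lift_height = state["apple_pos"][2] - state["table_height"]
    terms = {
        "reach": (1.0, np.tanh(dist_to_apple)),                # approach the apple
        "lift": (2.0, np.tanh(max(0.2 - lift_height, 0.0))),   # lift it ~20 cm
        "effort": (0.1, np.tanh(np.linalg.norm(action))),      # penalize large actions
    }
    return -sum(w * n for w, n in terms.values())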

LaRe

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

—

Motivation

::: block

  • make the reward include various implicit factors

$$p(R \mid s_{1:T}, a_{1:T}) = \int \Big[ \prod_{t=1}^{T} \underbrace{p(r_t \mid z_{r,t})}_{\text{decoder } f}\, \underbrace{p(z_{r,t} \mid s_t, a_t)}_{\text{encoder } \phi} \Big] p(R \mid r_{1:T})\, dz\, dr$$

  • obtain interpretable and multifaceted task-performance metrics from redundant environment information :::

—

Framework

::: block

  1. generate responses with the LLM
  2. summarize the responses, then generate code (the latent reward encoder) based on the summary
  3. verify the correctness of the encoder function
  4. train the reward decoder with the loss (sketched below): $L_{RD}^{\phi}(\psi) = \mathbb{E}_{\tau \sim D}\big[\big(R(\tau) - \sum_{t=1}^{T} f_\psi(\phi(s_t, a_t))\big)^2\big]$
  5. optimize the policy with the latent reward and its decoder :::
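
A sketch of step 4, assuming `phi` is the (frozen) LLM-generated latent-reward encoder and `f_psi` a small decoder network; names and shapes are illustrative:

import torch

def reward_decoder_loss(f_psi, phi, states, actions, episode_return):
    """Regress the episodic return onto the sum of per-step decoded latent rewards.

    states: (T, ds), actions: (T, da), episode_return: scalar R(tau).
    """
    latent = phi(states, actions)              # (T, dz) latent rewards from the LLM encoder
    step_rewards = f_psi(latent).squeeze(-1)   # (T,) decoded per-step rewards
    return (episode_return - step_rewards.sum()) ** 2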

MAYE

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

—

Data

—

Framework

—

Algorithm

::: block

$$L^{CLIP}(\theta) = \mathbb{E}_{q \sim P(q),\, o_q \sim \pi_{\theta_{old}}(o \mid q)}\ \frac{1}{|o_q|} \sum_{t=1}^{|o_q|} \Big\{ \min\big[\mathrm{prob}_t \hat{A}_t,\ \mathrm{clip}(\mathrm{prob}_t, 1-\epsilon, 1+\epsilon)\hat{A}_t\big] - \beta_{loss} D_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big] \Big\}$$

  • use the reward function as a rule-based signal to guide RL training
    • correctness
    • language :::
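
A token-level sketch of the clipped objective above with a KL penalty toward a reference policy; this mirrors the formula written in the note, not MAYE's actual implementation:

import torch

def clipped_rl_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Token-level clipped surrogate with a KL penalty toward a reference policy.

    All inputs are (T,) tensors over the tokens of one sampled response o_q.
    """
    ratio = torch.exp(logp_new - logp_old)                            # prob_t
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1   # k3 KL estimator (assumed)
    return -(surrogate - beta * kl).mean()                            # maximize => minimize negative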

—

Metrics

::: block Accuracy curve: correctness and effectiveness during training

Response length: length of the model output

Word count: effectiveness of RL training, reflected by the frequency of certain words

Ratio curves: frequency of reflective words during training :::


ToRL

ToRL: Scaling Tool-integrated RL

—

TIR

::: block Tool-Integrated Reasoning

  1. $(r_k, c_k) = \mathrm{LLM}(q \oplus s_{k-1})$
  2. $o_k = I(c_k)$
  3. $s_k = s_{k-1} \oplus r_k \oplus c_k \oplus o_k$ :::
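
A minimal sketch of the TIR loop, with hypothetical `llm` and `run_code` (the interpreter $I$) callables:

def tool_integrated_reasoning(llm, run_code, question: str, max_rounds: int = 4) -> str:
    """Alternate LLM reasoning/code generation with interpreter execution."""
    state = question                                   # s_0 = q
    for _ in range(max_rounds):
        reasoning, code = llm(state)                   # (r_k, c_k) = LLM(q ⊕ s_{k-1})
        if code is None:                               # model produced a final answer
            return reasoning
        output = run_code(code)                        # o_k = I(c_k)
        state = state + reasoning + code + output      # s_k = s_{k-1} ⊕ r_k ⊕ c_k ⊕ o_k
    return state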

—

ToRL

::: block

  • use Qwen2.5-Math
  • use an external code interpreter to execute the generated code
    • concatenate the interpreter output with the natural-language response
  • Design
    • Tool Call Frequency Control: reduce GPU idle time
    • Execution Environment Selection: Sandbox Fusion
    • Error Message Processing: keep only the last line of the error message
    • Sandbox Output Masking: don't compute the loss on code output (see the sketch below) :::
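
A sketch of how Sandbox Output Masking can be implemented: tokens that came from the interpreter are excluded from the cross-entropy loss (masking convention assumed, not ToRL's exact code):

import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, is_tool_output):
    """Cross-entropy over response tokens, skipping interpreter-output tokens.

    logits: (T, V), targets: (T,), is_tool_output: (T,) bool mask.
    """
    labels = targets.clone()
    labels[is_tool_output] = -100        # ignored by F.cross_entropy
    return F.cross_entropy(logits, labels, ignore_index=-100)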

Some others

—

Think before you act

::: block Combine actions with “captions”

  • use the “caption” to indicate what to do next
  • use the action to indicate how to do it

Treat the RL process as an auto-regressive Transformer process :::

—

In-Context Reinforcement Learning with Algorithm Distillation

::: block Treat offline RL as a sequential prediction problem: distill the RL policy into a causal sequence model, i.e. model the RL policy with a neural network :::


2025-04-27

Environment:

The data is stored with h5py and needs to be downloaded. After downloading it is located at datasets/v0.1/single_stage/kitchen_pnp/pnpcountertocab/2024-04-24/demo.hdf5

import json
import os

import h5py
import robosuite

# load env metadata and the initial state of demo_1 from the dataset
f = h5py.File(os.path.join(os.getcwd(), "datasets/v0.1/single_stage/kitchen_pnp/pnpcountertocab/2024-04-24/demo.hdf5"))
env_meta = json.loads(f["data"].attrs["env_args"])
states = dict(states=f["data/demo_1/states"][()][0])
states["model"] = f["data/demo_1"].attrs["model_file"]
ep_meta = f["data/demo_1"].attrs.get("ep_meta", None)
states["ep_meta"] = ep_meta
f.close()
env_kwargs = env_meta["env_kwargs"]
env_kwargs["env_name"] = env_meta["env_name"]
env_kwargs["has_renderer"] = False
env_kwargs["renderer"] = "mjviewer"
env_kwargs["has_offscreen_renderer"] = True
env_kwargs["use_camera_obs"] = False
 
# initialize env and reset it to the recorded state (reset_to is defined later below)
env = robosuite.make(**env_kwargs)
reset_to(env, states)

In practice this invokes the PnPCounterToCab environment in RoboCasa's kitchen_pnp.py, created via MujocoEnv from robosuite.environments.base.

Invocation:

A single step is executed via env.step(action).

For the Panda Mobile robot, the action space is 12-dimensional: joint rotations, horizontal base movement, vertical torso movement, and gripper open/close.

steps = [[0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...]  # 12-dim actions
 
for step in steps:
	env.step(step)
 
	# render the camera view(s) and append the frame to the video
	video_image = []
	for cam in ["robot0_agentview_center"]:
		im = env.sim.render(camera_name=cam, width=512, height=768)[::-1]
		video_image.append(im)
	video_image = np.concatenate(video_image, axis=1)
	video_writer.append_data(video_image)

2025-04-29

Environment Generation

Use an LLM to generate RL environments

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

2403.12014v2_EnvGen.pdf

  • published at COLM 2024

Motivation

  • use an LLM to generate and adapt a variety of training environments, to improve performance in the original environment (not the LLM-generated ones)
  • only a small number of LLM calls, reducing computational overhead

Environment generation

Input:

  1. a brief description of the environment and of what the LLM is asked to do
  2. the objectives, the environment parameters the LLM is allowed to control, and the rule constraints on generation
  3. a JSON template for the LLM to fill the environment parameters into
  4. feedback from the agent

Output: a JSON with the environment parameters, used to instantiate the full environment (a toy example is sketched below)

Transitions and rewards are provided by the original environment; the LLM does not generate them
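
A toy example of a filled template (every field below is invented for illustration; the real schema depends on the target game or environment):

import json

# Hypothetical filled template for a Crafter-style environment; the field
# names are illustrative, not the paper's actual schema.
env_config = {
    "objective": "collect_wood_and_craft_table",
    "terrain": {"tree_density": 0.35, "water_ratio": 0.05},
    "initial_inventory": {"sapling": 1},
    "spawn_rules": {"zombie": "off", "skeleton": "off"},
    "episode_length": 500,
}
print(json.dumps(env_config, indent=2))  # this JSON is what the env builder consumes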

Training

Train as usual.

Evaluate performance in the original environment, compute the success rate for each objective, and feed this back to the LLM to adjust the next round of generation

To avoid overfitting to the LLM-generated environments, train on the original environment once every fixed interval

RoboVerse

  • installation: many configuration options, large dataset, hard to download
  • some of the code is broken and fails to start; dependency conflicts

Robocasa

The robot environments live in RoboSuite; RoboCasa wraps its own Environment on top of it

Code

import json
import os

import h5py
import imageio
import numpy as np
import robosuite


def reset_to(env, state):
    """Reset the environment to a recorded dataset state (episode meta + MuJoCo state)."""
    env.set_ep_meta(json.loads(state["ep_meta"]))
    env.reset()
    xml = env.edit_model_xml(state["model"])
    env.reset_from_xml_string(xml)
    env.sim.reset()
 
    env.sim.set_state_from_flattened(state["states"])
    env.sim.forward()
 
    env.update_state()
 
 
def main():
    # load env metadata and the initial state of demo_1 from the dataset
    f = h5py.File(os.path.join(os.getcwd(), "datasets/v0.1/single_stage/kitchen_pnp/PnPCounterToCab/2024-04-24/demo.hdf5"))
    env_meta = json.loads(f["data"].attrs["env_args"])
    states = dict(states=f["data/demo_1/states"][()][0])
    states["model"] = f["data/demo_1"].attrs["model_file"]
    ep_meta = f["data/demo_1"].attrs.get("ep_meta", None)
    states["ep_meta"] = ep_meta
    f.close()
    env_kwargs = env_meta["env_kwargs"]
    env_kwargs["env_name"] = env_meta["env_name"]
    env_kwargs["has_renderer"] = False
    env_kwargs["renderer"] = "mjviewer"
    env_kwargs["has_offscreen_renderer"] = True
    env_kwargs["use_camera_obs"] = False
 
    print(json.dumps(env_kwargs, indent=4))
 
    path = f"./video_result/video_{len(os.listdir('./video_result'))}.mp4"
    video_writer = imageio.get_writer(path, fps=50)
 
    # initialize env and reset it to the recorded state
    env = robosuite.make(**env_kwargs)
    reset_to(env, states)
 
    steps = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...]  # 12-dim actions for Panda Mobile
 
    for step in steps:
        _ret = env.step(step)
 
        # render the camera view(s) and write a video frame
        video_image = []
        for cam in ["robot0_agentview_center"]:
            im = env.sim.render(camera_name=cam, width=512, height=768)[::-1]
            video_image.append(im)
        video_image = np.concatenate(video_image, axis=1)
        video_writer.append_data(video_image)
 
    video_writer.close()
    env.close()


if __name__ == "__main__":
    main()

What actually happens: robosuite's make method creates the environment, with configuration parameters defined by the dataset

  • env_name: name of the environment to create
  • robots: the robots to include; robosuite supports multiple robots, but RoboCasa's kitchen only supports a single robot
  • controller_configs: the MuJoCo controller configuration (kp, limits, etc.)

Current progress:

  • finished controlling the Panda Mobile robot
  • integration of the G1 robot is almost done

Future plans:

  • add a controller config for the G1 robot and finish its integration
  • test the G1 robot
  • try RL training

2025-05-13

  • grape
  • Hamster
  • AnyGrasp
  • LESR

2025-05-21

  • LLoVi
  • VideoTree

2025-05-29

  • LEAP
  • f-policy
  • What Makes Pre-trained Visual Representation Successful for Robust Manipulation

2025-06-11

  • OpenVLA-OFT
  • ConRFT
