LLM generate reward for RL

LaRe
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Paper

LaRe integrates task-related priors from an LLM to obtain semantically interpretable latent rewards, which strengthens reward decomposition and leads to better RL.

Preliminary

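An MDP can be defined as $M = \langle S, A, \gamma, P, r \rangle$, where $S$ is the state space, $A$ is the action space, $\gamma$ is the discount factor (it decays rewards over time steps), $P(s' \mid s, a)$ is the environment's state transition distribution, and $r$ is the reward function. The goal is to find a policy $\pi : S \mapsto A$ that maximizes the expected return $J(\pi) = E[\sum_{t=1}^{T} \gamma^{t} r(s_{t}, \pi(s_{t})) \mid s_{0} \sim \eta, s_{t+1} \sim P(\cdot \mid s_{t}, \pi(s_{t}))]$, where $\eta$ is the initial state distribution.

For episodic RL, the expected episodic reward is $J_{ep}(\pi) = E[R(\tau) \mid s_{0} \sim \eta, a_{t} \sim \pi(\cdot \mid s_{t}), \tau = \langle s_{0}, a_{0}, \dots, s_{T} \rangle]$

A common assumption is that the episodic reward decomposes additively over time steps: $R(\tau) = \sum_{t=1}^{T} r(s_{t}, a_{t})$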

Latent Reward

Motivation

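The reward should also capture the contribution of otherwise implicit factors. Conceptually, the different dimensions of the latent reward represent different aspects of task performance.

The final reward is a projection of the latent reward from the latent space $D$ to $\mathbb{R}$. This gives a new probabilistic model of episodic RL:
$p(R \mid s_{1:T}, a_{1:T}) = \int \Big[ \prod_{t=1}^{T} \underbrace{p(r_{t} \mid z_{r,t})}_{\text{decoder } f} \, \underbrace{p(z_{r,t} \mid s_{t}, a_{t})}_{\text{encoder } \phi} \Big] \, p(R \mid r_{1:T}) \, dz \, dr$
where $\phi : S \times A \mapsto D$ is the function that extracts the latent reward from the environment information.

An LLM can extract interpretable and multifaceted task performance metrics, i.e., the latent reward, from redundant environment information.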

Framework
Algorithm 1

Input:

1. LLM $M$
2. task information $task$
3. number of candidate responses $n$
4. pre-collected random state-action pairs $\bar{s}$
5. maximum number of episodes $N^{max}$

Output:

1. policy network $\pi_{\theta}$
2. reward decoder model $f_{\psi}$

Steps:

1: Initialize the policy network parameters $\theta$, the reward decoder parameters $\psi$, and the replay buffer $B$
2: Obtain candidate responses $\xi_{1}, \dots, \xi_{n} \leftarrow M(task, role)$
3: Summarize them into an improved response $\xi \leftarrow M(task, role, \xi_{1}, \dots, \xi_{n})$
4: Verify the latent reward encoding function $\phi$: $err \leftarrow verify(\phi, \bar{s})$; $\xi \leftarrow M(task, role, \xi_{1:n}, err)$. This acts as error feedback
5: For $episode = 1$ To $N^{max}$
6: Sample a trajectory $\tau$ with the current policy
7: $B \leftarrow B \cup \{\tau\}$
8: Sample a batch $B = \{\tau_{i}\}_{i=1}^{|B|}$ from the replay buffer $B$
9: Evaluate the latent reward, using the loss $L_{RD}^{\phi}(\psi) = E_{\tau \sim D} \left[ \left( R(\tau) - \sum_{t=1}^{T} f_{\psi}(\phi(s_{t}, a_{t})) \right)^{2} \right]$
10: Optimize the policy with any RL algorithm, using the predicted proxy reward $\hat{r}^{\psi, \phi} = f_{\psi}(\phi(s, a))$
11: EndFor

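Use the LLM to generate responses, in a manner similar to CoT.

Summarize the generated responses. Code is then generated from the summary; this code is a function that computes the latent reward. Calling it with the observation and action yields eval_factors, a list that holds all the reward factors.

Verify that the latent reward function is reasonable and that it actually runs.

Train a decoder. The decoder is effectively a linear layer that takes a weighted sum of the latent reward.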
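
The decoder training step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the decoder is a bias-free weighted sum, that the loss is the squared gap between the episodic return and the summed per-step proxy rewards, and it uses plain gradient descent on synthetic data. The names `decode`, `return_decomposition_loss`, `train_decoder`, and the toy latent factors are all hypothetical.

```python
import random


def decode(weights, latent):
    """Proxy reward f_psi(phi(s, a)): a weighted sum of the latent reward factors."""
    return sum(w * z for w, z in zip(weights, latent))


def return_decomposition_loss(weights, episodes):
    """Mean squared gap between R(tau) and the sum of per-step proxy rewards."""
    total = 0.0
    for latents, episodic_return in episodes:
        predicted = sum(decode(weights, z) for z in latents)
        total += (episodic_return - predicted) ** 2
    return total / len(episodes)


def train_decoder(episodes, dim, lr=0.05, steps=500):
    """Plain gradient descent on the return-decomposition loss (an assumption)."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for latents, episodic_return in episodes:
            err = sum(decode(weights, z) for z in latents) - episodic_return
            # d/dw_j (R - sum_t w . z_t)^2 = 2 * err * sum_t z_t[j]
            for j in range(dim):
                grad[j] += 2.0 * err * sum(z[j] for z in latents)
        weights = [w - lr * g / len(episodes) for w, g in zip(weights, grad)]
    return weights


# Synthetic episodes: each is (list of latent rewards per step, episodic return).
rng = random.Random(0)
true_weights = [1.0, -0.5]  # hypothetical ground-truth weights for the toy data
episodes = []
for _ in range(50):
    latents = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(5)]
    episodes.append((latents, sum(decode(true_weights, z) for z in latents)))

learned = train_decoder(episodes, dim=2)
```

Only the episodic return is observed during training, yet the learned weights assign a proxy reward to every individual step, which is what any downstream RL algorithm consumes.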

L2R
Language to Reward for Robotic Skill Synthesis

Paper

Use an LLM to define reward parameters in order to enhance RL.

Method

Background and Reward Interface

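Define the MDP as $(S, A, R, P, p_{0})$, where $S$ is the state space, $A$ is the action space, $R : S \times A \mapsto \mathbb{R}$ is the reward function, $P : S \times A \mapsto S$ is the dynamics equation (it gives the state reached after taking an action), and $p_{0}$ is the initial state distribution.

Given a reward function $R$, an optimal controller finds the action sequence $a_{1:H} = \{a_{1}, \dots, a_{H}\}$ that maximizes the expected return $J(a_{1:H}) = E_{\tau = (s_{0}, a_{0}, \dots, s_{H-1}, a_{H-1}, s_{H})} \left[ \sum_{t=0}^{H} R(s_{t}, a_{t}) \right]$, where $H$ is the roll-out horizon.

Assume the reward takes a particular form that is compatible with MJPC:
$R(s, a) = - \sum_{i=0}^{M} w_{i} \cdot n_{i}(r_{i}(s, a, \psi_{i}))$
where

$w_{i} \in \mathbb{R}$ is the weight

$n_{i}(\cdot) : \mathbb{R} \mapsto \mathbb{R}_{+}$ is a twice-differentiable norm whose minimum is $0$

$r_{i} \in \mathbb{R}$ is the residual; the term is optimal when $r_{i} = 0$

$\psi_{i}$ is the parameter of the $i$-th term

The LLM adjusts $w_{i}$ and $\psi_{i}$, automatically producing a reward for each task.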
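
To make the reward form concrete, here is a small sketch of a reward assembled from weighted residual terms. It is illustrative only and is not the MJPC API: the residual functions, the dictionary layout, the state and action fields, and the quadratic norm are all assumptions chosen for the example.

```python
def quadratic_norm(x):
    """A smooth norm n_i: non-negative, with its minimum 0 at x = 0."""
    return x * x


def height_residual(state, action, params):
    """Hypothetical residual r_i: gap between torso height and a target height."""
    return state["torso_height"] - params["target_height"]


def effort_residual(state, action, params):
    """Hypothetical residual r_i: the applied torque, which should stay near 0."""
    return action["torque"]


def mjpc_style_reward(state, action, terms):
    """R(s, a) = -sum_i w_i * n_i(r_i(s, a, psi_i))."""
    return -sum(
        term["weight"] * term["norm"](term["residual"](state, action, term["params"]))
        for term in terms
    )


# The weights w_i and parameters psi_i are what the LLM would fill in per task.
terms = [
    {"weight": 1.0, "norm": quadratic_norm, "residual": height_residual,
     "params": {"target_height": 0.3}},
    {"weight": 0.1, "norm": quadratic_norm, "residual": effort_residual,
     "params": {}},
]

reward = mjpc_style_reward({"torso_height": 0.5}, {"torque": 2.0}, terms)
best = mjpc_style_reward({"torso_height": 0.3}, {"torque": 0.0}, terms)
```

Because every norm is non-negative and vanishes at zero residual, the reward is at most 0 and reaches 0 exactly when all residuals are zero, so changing a task only requires changing the weights and parameters.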

Reward Translator

Motion Description

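A Motion Descriptor LLM interprets the user input and expands it into a natural-language description of the desired robot motion.

An LLM can generate the reward directly for relatively simple tasks, but it often fails on complex ones.

It can, however, generate a description of the motion for complex tasks.

A template is therefore used so that the LLM directly produces a natural-language description of the motion.

Reward Coding

An LLM generates the API calls for the reward function.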

Motion Controller

Model Predictive Control (MPC) is used.

At each step, MPC plans a sequence of optimized actions $a_{1:H}$ and sends it to the robot. After executing, the robot returns its state to the MJPC planner, which then generates the next plan.
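
The plan, execute, replan loop can be illustrated on a toy problem. The sketch below is not MJPC: it assumes a 1-D system, a small discrete action set searched exhaustively, and a receding-horizon scheme in which only the first planned action is executed before replanning. All names and constants are hypothetical.

```python
import itertools

ACTIONS = (-0.5, -0.25, 0.0, 0.25, 0.5)  # hypothetical discrete action set


def dynamics(state, action):
    """Toy 1-D dynamics (an assumption): the action shifts the position."""
    return state + action


def reward(state, action, target=1.0):
    """Negative squared distance to the target plus a small action cost."""
    return -((state - target) ** 2) - 0.01 * action ** 2


def plan(state, horizon):
    """Score every action sequence a_{1:H} by rollout and return the best one."""
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)
            total += reward(s, a)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq


def run_mpc(initial_state, steps=10, horizon=3):
    """Plan, execute the first planned action, observe the new state, replan."""
    state = initial_state
    for _ in range(steps):
        actions = plan(state, horizon)
        state = dynamics(state, actions[0])  # the "robot" reports its new state
    return state


final_state = run_mpc(0.0)
```

Replanning from the freshly observed state at every step is what lets the controller correct for errors that a single open-loop plan would accumulate.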

RL enhance LLM

MAYE

Paper

RL is used to strengthen the reasoning ability of the LLM.

Preparation

Data

The focus is on mathematical reasoning problems.

These fall into two subtypes: text-dominant (using the mm_math5k dataset) and vision-dominant (using the geometry3k dataset).

Algorithm

The loss is:
$L^{CLIP}(\theta) = E_{q \sim P(q),\, o_{q} \sim \pi_{\theta_{old}}(o \mid q)} \frac{1}{|o_{q}|} \sum_{t=1}^{|o_{q}|} \left\{ \min\left[ prob_{t} \hat{A}_{t},\ \text{clip}(prob_{t}, 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right] - \beta_{loss} D_{KL}[\pi_{\theta} \| \pi_{ref}] \right\}$
$prob_{t} = \frac{\pi_{\theta}(o_{q,t} \mid q, o_{q,<t})}{\pi_{\theta_{old}}(o_{q,t} \mid q, o_{q,<t})}$
$\hat{A}_{t} = \sum_{k=t}^{|o_{q}|} \gamma^{k-t} \left\{ \underbrace{I(o_{q,k} = [EOS]) \, r(q, o_{q})}_{\text{Rule-based reward}} - \underbrace{\beta_{rew} D_{KL}[\pi_{\theta}(o_{q,k} \mid q, o_{q,<k}) \| \pi_{ref}(o_{q,k} \mid q, o_{q,<k})]}_{\text{Token-level KL reward}} \right\}$
where

$P(q)$ is the distribution of input queries

$o_{q}$ is the sequence of response tokens

$\epsilon$ is the clipping range: $\text{clip}$ restricts $prob_{t}$ to $[1 - \epsilon, 1 + \epsilon]$

$\hat{A}_{t}$ is the estimated advantage of the token at step $t$, indicating whether it is a good token

$\gamma$ is the discount factor; setting $\gamma = 1$ disables discounting

$D_{KL}$ uses the k3 formulation, which provides an unbiased estimate

Setting $\beta_{rew} = 0$ removes the KL-divergence constraint from the reward, so that only the KL penalty on the policy distribution is applied
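
The per-token objective can be sketched for a single response as follows. This is a hedged illustration rather than the MAYE code: it works on plain lists of log-probabilities, assumes the standard PPO form in which both the raw and the clipped ratio are multiplied by the advantage, assumes $\beta_{rew} = 0$ so that only the final token carries reward, and picks `eps` and `beta_loss` values arbitrarily. All function names are hypothetical.

```python
import math


def k3_kl(logp_policy, logp_ref):
    """k3 estimator of KL(pi_theta || pi_ref) from one sampled token."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def advantages_from_final_reward(final_reward, length, gamma=1.0):
    """Reward-to-go when beta_rew = 0: only the last ([EOS]) token is rewarded."""
    return [final_reward * gamma ** (length - 1 - t) for t in range(length)]


def clipped_objective(logp_new, logp_old, logp_ref, advantages,
                      eps=0.2, beta_loss=0.01):
    """Token-averaged clipped surrogate minus the KL penalty for one response."""
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)               # prob_t
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(prob_t, 1-eps, 1+eps)
        surrogate = min(ratio * adv, clipped * adv)
        total += surrogate - beta_loss * k3_kl(lp_new, lp_ref)
    return total / len(advantages)


adv = advantages_from_final_reward(1.0, length=3)
on_policy = clipped_objective([-1.0] * 3, [-1.0] * 3, [-1.0] * 3, adv)
```

With $\gamma = 1$ every token in a response receives the same advantage, equal to the rule-based reward of the whole response. The objective is maximized, so a training loop would minimize its negative.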

Reward Function

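It serves as the rule-based signal that guides RL training.

A correct answer receives +1; an incorrect answer receives 0.

Secondary language reward: answer the question in English

This prevents multi-lingual drift

Format rewards are dropped; the output format is not constrained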
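
A rule-based reward of this kind might look like the sketch below. Several details are assumptions, since the notes do not specify them: the exact-match answer check, the ASCII-letter heuristic for detecting English, the size of the language bonus, and the choice to add the two rewards together.

```python
def accuracy_reward(predicted_answer, reference_answer):
    """+1 for a correct answer, 0 otherwise (exact match is an assumption)."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def language_reward(response_text, bonus):
    """Hypothetical English check: every alphabetic character must be ASCII."""
    is_english = all(ord(ch) < 128 for ch in response_text if ch.isalpha())
    return bonus if is_english else 0.0


def rule_based_reward(predicted_answer, reference_answer, response_text,
                      language_bonus=0.1):
    """Accuracy reward plus the secondary language reward; no format reward."""
    return (accuracy_reward(predicted_answer, reference_answer)
            + language_reward(response_text, language_bonus))


english_correct = rule_based_reward("4", "4", "The answer is 4.")
chinese_correct = rule_based_reward("4", "4", "答案是4")
english_wrong = rule_based_reward("5", "4", "The answer is 5.")
```

Nothing in the function inspects the layout of the response, which mirrors the decision to drop format rewards.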

Model

Qwen-2/2.5-VL-Instruct is used.

MAYE Framework

Setup

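The connector (projector) and the ViT are frozen; only the LLM backend (Transformer) is trained.

Hydra manages the configuration.

FSDP2 is used for distributed training.

vLLM is used for multimodal collection.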

Data Flow

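The text data and vision data are tokenized.

Response Collection

Generate the responses. With distributed training, this involves reducing data across GPUs.

Trajectory Collection

Collect the required token ids, concatenate the query token ids with the response token ids, and recompute the attention mask and position encoding.

To avoid running out of memory, only the logprobs of the response are kept, because RL does not need the logprobs of the query.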
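
The bookkeeping in this step can be sketched with plain lists. This is an illustration under assumptions, not the MAYE implementation: the padding id, the convention that padded positions get position id 0, and the dictionary layout are all hypothetical.

```python
PAD_ID = 0  # hypothetical padding token id


def build_trajectory(query_ids, response_ids, all_logprobs):
    """Concatenate query and response, rebuild mask and positions, trim logprobs."""
    token_ids = query_ids + response_ids
    attention_mask = [0 if tok == PAD_ID else 1 for tok in token_ids]

    # Position ids count only real tokens; padded slots are set to 0 (assumption).
    position_ids, next_pos = [], 0
    for m in attention_mask:
        position_ids.append(next_pos if m else 0)
        next_pos += m

    # Keep only the response part of the logprobs to save memory.
    response_logprobs = all_logprobs[len(query_ids):]
    return {
        "token_ids": token_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "response_logprobs": response_logprobs,
    }


trajectory = build_trajectory(
    query_ids=[0, 0, 11, 12],          # left-padded query
    response_ids=[21, 22, 0],          # right-padded response
    all_logprobs=[-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7],
)
```

Dropping the query logprobs shrinks the stored tensor from the full sequence length to the response length, which matters when queries include long image-token spans.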

Policy Update

The policy model is updated with RL on the saved trajectories. The loss is computed with the formula from the Algorithm section.

MAYE Scheme

Training Set Metrics

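Accuracy curves: reflect the correctness and effectiveness of the algorithm and the data preparation

Response length: the length of the output, which reflects the model's output pattern, including the level of detail and the depth of reasoning

Validation & Test Set Metrics

Accuracy curves: accuracy of the outputs as the number of training episodes increases

pass@8: temperature=1.0, top_p=1.0; estimates the upper bound

pass@1: temperature=0.6, top_p=1.0; estimates real performance while preventing repetitive or incoherent outputs

pass@1: temperature=0.01, top_p=0.001; estimates real performance and is the benchmark setup for VLMs

Accuracy tabs: a table of the final model's accuracy

Reflection Metrics

Words count: tracks "aha moments" as evidence that RL training is effective, measured by the frequency of "reflective words" in each generation step: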

["re-check", "re-evaluate", "re-examine", "re-think", "recheck", "reevaluate", "reexamine", "reevaluation", "rethink", "check again", "think again", "try again", "verify", "wait", "yet"]

Ratio curves: show the frequency of reflective words as training progresses:

reflection ratio: $\frac{N_{ref}}{N}$

reflection ratio in correct answers: $\frac{N_{ref+}}{N_{+}}$

reflection ratio in incorrect answers: $\frac{N_{ref} - N_{ref+}}{N - N_{+}}$

correct ratio in reflection texts: $\frac{N_{ref+}}{N_{ref}}$

correct ratio in no reflection texts: $\frac{N_{+} - N_{ref+}}{N - N_{ref}}$
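
The five ratios can be computed from a batch of graded responses as sketched below. The reading of the symbols is an assumption: $N$ is taken as the number of responses, $N_{+}$ as the number of correct ones, $N_{ref}$ as the number containing at least one reflective word, and $N_{ref+}$ as the number that are both. Case-insensitive substring matching is a simplification, and the function names are hypothetical.

```python
REFLECTIVE_WORDS = [
    "re-check", "re-evaluate", "re-examine", "re-think", "recheck",
    "reevaluate", "reexamine", "reevaluation", "rethink", "check again",
    "think again", "try again", "verify", "wait", "yet",
]


def has_reflection(text):
    """True if the response contains any reflective word (substring match)."""
    lowered = text.lower()
    return any(word in lowered for word in REFLECTIVE_WORDS)


def safe_ratio(numerator, denominator):
    """Return 0.0 instead of failing when a group is empty."""
    return numerator / denominator if denominator else 0.0


def reflection_metrics(responses):
    """responses: list of (text, is_correct) pairs."""
    n = len(responses)
    n_pos = sum(1 for _, ok in responses if ok)
    n_ref = sum(1 for text, _ in responses if has_reflection(text))
    n_ref_pos = sum(1 for text, ok in responses if ok and has_reflection(text))
    return {
        "reflection_ratio": safe_ratio(n_ref, n),
        "reflection_ratio_correct": safe_ratio(n_ref_pos, n_pos),
        "reflection_ratio_incorrect": safe_ratio(n_ref - n_ref_pos, n - n_pos),
        "correct_ratio_reflection": safe_ratio(n_ref_pos, n_ref),
        "correct_ratio_no_reflection": safe_ratio(n_pos - n_ref_pos, n - n_ref),
    }


metrics = reflection_metrics([
    ("Wait, let me re-check the sum.", True),
    ("The answer is 4.", True),
    ("I should verify: 12.", True),
    ("Let me think again about this.", False),
    ("It is 7.", False),
])
```

Substring matching overcounts in some cases (for example, "wait" also matches "waiting"), so a stricter word-boundary match may be preferable in practice.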

ToRL

Paper

RL training from scratch that lets the model discover the best tool-use strategy through extensive exploration.

Dataset

Olympiad-level math problems

NuminaMATH

MATH

DeepScaleR

Tool Integrated Reasoning(TIR)

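TIR replaces CoT to add precise computation ability.

Tool integrated reasoning can call external programs. A TIR trajectory is:
$s_{k} = (r_{1}, c_{1}, o_{1}, \dots, r_{k}, c_{k}, o_{k})$
where $r_{i}$ is the natural-language reasoning, $c_{i}$ is the generated code, and $o_{i}$ is the result of executing $c_{i}$ externally. The generation process is: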

$(r_{k}, c_{k}) = LLM (q \oplus s_{k - 1})$

$o_{k} = I (c_{k})$

$s_{k} = s_{k - 1} \oplus r_{k} \oplus c_{k} \oplus o_{k}$

where $q$ is the query and $I$ is the external code interpreter
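
The three update rules translate almost directly into a loop. The sketch below is a toy: `stub_llm` is a hypothetical stand-in for the model, `python_interpreter` plays the role of $I$ by running code with `exec` and capturing stdout (unsafe outside a sandbox), and stopping when the model emits no code is an assumption.

```python
import contextlib
import io


def stub_llm(context):
    """Hypothetical stand-in for the LLM: returns (reasoning, code or None)."""
    if "OUTPUT:" not in context:
        return "Compute 12 * 34 with code.\n", "print(12 * 34)\n"
    return "So the answer is 408.\n", None


def python_interpreter(code):
    """Stand-in for the interpreter I: run the code and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # only acceptable for a toy; real use needs a sandbox
    return "OUTPUT: " + buffer.getvalue()


def run_tir(query, llm, interpreter, max_rounds=4):
    """Grow the trajectory s_k = s_{k-1} + r_k + c_k + o_k round by round."""
    trajectory = ""  # s_0 is empty
    for _ in range(max_rounds):
        reasoning, code = llm(query + trajectory)  # (r_k, c_k) = LLM(q + s_{k-1})
        trajectory += reasoning
        if code is None:  # assumption: no code means the final answer was given
            break
        output = interpreter(code)                 # o_k = I(c_k)
        trajectory += code + output
    return trajectory


trajectory = run_tir("What is 12 * 34?\n", stub_llm, python_interpreter)
```

The interpreter output becomes part of the context for the next round, so later reasoning can rely on exact computed values rather than on mental arithmetic.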

ToRL

TIR is combined with the LLM directly through RL, without prior fine-tuning.

TIR Rollout Framework

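Qwen2.5-Math is used as the Transformer LLM backend.

When the code-output marker (three backticks followed by the word output) is detected, generation stops and an external program executes the code. The result, $Observation$, is returned to the LLM and spliced into the context as an output block (the marker, a newline, $Observation$, a newline, and three closing backticks), after which the LLM continues generating natural language.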

Design Choices of ToRL

Tool Call Frequency Control

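To keep the GPU from idling too long while code executes on the CPU, a hyperparameter $C$ is set: once the number of code calls exceeds $C$, the model is forced to reason in plain text.

Execution Environment Selection

Sandbox Fusion is used; it provides an isolated environment.

Error Message Processing

Sandbox Fusion error messages are stripped of file information and only the last line is extracted (to reduce context length).

Sandbox Output Masking

When computing the loss, the sandbox output (i.e., $Observation$) is masked out.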
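
Output masking amounts to a 0/1 mask over tokens. The sketch below is an illustration with hypothetical names: it assumes the rollout is stored as segments tagged by whether they came from the sandbox, and that the loss is a mean over the unmasked tokens.

```python
def build_loss_mask(segments):
    """segments: list of (token_ids, is_sandbox_output). 1 = token is trained on."""
    mask = []
    for token_ids, is_sandbox_output in segments:
        mask.extend([0 if is_sandbox_output else 1] * len(token_ids))
    return mask


def masked_mean(per_token_loss, mask):
    """Average the loss over model-generated tokens only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


segments = [
    ([101, 102, 103], False),  # reasoning and code written by the model
    ([201, 202], True),        # Observation returned by the sandbox
    ([104], False),            # reasoning that continues afterwards
]
mask = build_loss_mask(segments)
loss = masked_mean([1.0, 1.0, 1.0, 9.0, 9.0, 1.0], mask)
```

Without the mask, the large loss values on the sandbox tokens would dominate the average even though the model never chose those tokens.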

Reward Design

A correct answer receives a reward of $+1$; otherwise the reward is $-1$.

A code-based penalty is introduced: if the code is not executable, a reward of $-0.5$ is applied.
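
One way to read this reward design is sketched below. It assumes that the $-0.5$ is added on top of the correctness reward rather than replacing it, which the notes do not state explicitly; the function name and arguments are hypothetical.

```python
def torl_style_reward(is_correct, code_executable):
    """+1 or -1 for correctness, minus 0.5 if any generated code failed to run."""
    reward = 1.0 if is_correct else -1.0
    if not code_executable:
        reward -= 0.5  # assumption: the penalty stacks with the correctness reward
    return reward


correct_clean = torl_style_reward(is_correct=True, code_executable=True)
correct_broken = torl_style_reward(is_correct=True, code_executable=False)
wrong_broken = torl_style_reward(is_correct=False, code_executable=False)
```

Under this reading, a correct answer reached despite broken code still scores higher than any wrong answer, so the penalty shapes tool use without overriding correctness.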

Others RL with DL

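Imitating RL with a Transformer

Think Before You Act

Combines language reasoning with executing actions in the environment.

Problem Formulation

Consider a partially observable Markov decision process (POMDP), in which $o_{t}$ contains only partially observable information.

Given the history $h_{t} = (o_{0}, a_{0}, \dots, o_{t-1}, a_{t-1})$ and a text instruction $m$, the goal is to find the optimal policy $\pi(a_{t} \mid m, h_{t}, o_{t})$.

Training is done offline on a pre-collected dataset. The training input is a trajectory $(m, (o_{t}, a_{t}, c_{t})_{0 \leq t \leq T})$, where $m$ is the text instruction and $c_{t}$ is the "caption" at time $t$. Note that $c_{t}$ is available only in the training stage; it is absent in the inference/evaluation stage.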

Method

Unifying actions and language reasoning

During recording, actions and captions are unified.

Auto-regressive transformer for generating both language and actions

The result is fed directly to the transformer for auto-regressive inference.

Experiment

Training details

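GPT-2 and its corresponding tokenizer are used.

Distilling RL into a DL neural network

In-Context Reinforcement Learning with Algorithm Distillation

Offline RL is treated as a sequential prediction problem: the RL policy is distilled into a causal sequence model, so that DL is used to model RL.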
