RL & LLM


L2R

Language to Reward for Robotic Skill Synthesis

Background

::: block

  • MDP problem: the task is formulated as a Markov decision process $(S, A, P, R)$ with states, actions, transition dynamics, and a reward function
  • reward assumption: the desired skill can be fully specified by a reward function, so skill synthesis reduces to maximizing expected cumulative reward (objective written out below) :::
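
A compact way to write the objective under this reward assumption (standard RL notation, not taken from the paper):

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t\ge 0}\gamma^{t}\,R(s_t,a_t)\Big]$$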

Method

::: block Motion Description

  • use an LLM to interpret and expand the user input into a natural-language description of the robot motion
  • using a prompt template

Reward Coding

  • use an LLM to generate reward-function code from the motion description (sketch after this block) :::
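
A minimal sketch of the two-stage pipeline, assuming a generic `llm(prompt)` callable; the prompt texts and the reward-API names (`set_torso_height`, etc.) are illustrative placeholders, not the paper's actual templates.

```python
# Hypothetical two-stage L2R pipeline; prompts and API names are placeholders.

def describe_motion(llm, user_instruction: str) -> str:
    """Motion Description: expand the user input into a natural-language
    description of the desired robot motion."""
    prompt = (
        "Describe the robot motion that accomplishes the task below in terms "
        "of body pose, foot contacts, and timing.\n"
        f"Task: {user_instruction}"
    )
    return llm(prompt)

def generate_reward_code(llm, motion_description: str) -> str:
    """Reward Coding: translate the motion description into reward-setting
    code written against a fixed library of reward terms."""
    prompt = (
        "Using the reward API (set_torso_height, set_foot_contact, "
        "set_velocity), write code that rewards the following motion:\n"
        f"{motion_description}"
    )
    return llm(prompt)

# Usage: reward_code = generate_reward_code(llm, describe_motion(llm, "make the robot hop"))
```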

LaRe

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Motivation

::: block

  • make reward include various implicit factors

$$p(R|s_{1:T},a_{1:T})=\int\left[\prod_{t=1}^{T}\underbrace{p(r_t|z_{r,t})}_{\text{decoder }f}\,\underbrace{p(z_{r,t}|s_t,a_t)}_{\text{encoder }\phi}\right]p(R|r_{1:T})\,dz\,dr$$

  • obtain interpretable and multifaceted task performance metrics from redundant environment information :::
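
Read as a deterministic special case (my shorthand; $\hat r_t$ is the latent-derived proxy reward):

$$z_{r,t}=\phi(s_t,a_t),\qquad \hat r_t=f(z_{r,t}),\qquad R\approx\sum_{t=1}^{T}\hat r_t$$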

Framework

::: block

  1. generate responses with the LLM
  2. summarize the responses and generate code (the latent reward encoder) based on the summary
  3. verify the correctness of the encoder function
  4. train the reward decoder $f$ with a return-decomposition loss $\big(R-\sum_{t} f(z_{r,t})\big)^2$
  5. optimize the policy with the latent reward and its decoder (sketch after this list) :::
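
A minimal sketch of steps 4-5, assuming the LLM-generated latent reward encoder is available as a function `phi(state, action)` returning a latent vector; the module names and MLP shape are illustrative, and PyTorch is assumed.

```python
import torch
import torch.nn as nn

class RewardDecoder(nn.Module):
    """Maps per-step latent rewards z_{r,t} to scalar proxy rewards."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z):                  # z: (T, latent_dim)
        return self.net(z).squeeze(-1)     # per-step proxy rewards, shape (T,)

def decoder_loss(decoder: RewardDecoder, latents: torch.Tensor, episodic_return: float):
    """Return-decomposition loss: the decoded per-step rewards should sum
    to the episodic return R observed at the end of the trajectory."""
    r_hat = decoder(latents)
    return (r_hat.sum() - episodic_return) ** 2

# Step 5: the policy is then trained with decoder(phi(s_t, a_t)) as a dense
# per-step reward instead of the sparse episodic return.
```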

MAYE

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Data

Framework

Algorithm

::: block

  • use a reward function as a rule-based signal to guide RL training (sketch after this list)
    • correctness
    • language :::
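
A rough sketch of such a rule-based reward; the answer-extraction regex, the format check, and the weights are assumptions, not the paper's exact rules.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    reward = 0.0

    # Correctness: compare the extracted final answer with the label.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    predicted = match.group(1).strip() if match else ""
    if predicted == ground_truth.strip():
        reward += 1.0

    # Language/format: small bonus if the response follows the expected
    # reasoning-then-boxed-answer format (illustrative check only).
    if match is not None:
        reward += 0.1

    return reward
```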

Metrics

::: block Accuracy curve: correctness and effectiveness of the policy during training

Response length: length of the model output

Word count: effectiveness of RL training, reflected by the frequency of certain words

Ratio curves: frequency of reflective words during training (sketch after this block) :::
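
A sketch of how these curves could be computed per training step; the reflective-word list and data layout are assumptions, not taken from the paper.

```python
REFLECTIVE_WORDS = ("wait", "verify", "check", "however")

def step_metrics(responses, correctness_flags):
    """responses: list[str] of model outputs for one training step;
    correctness_flags: list[bool] from the rule-based reward."""
    n = len(responses)
    accuracy = sum(correctness_flags) / n                      # accuracy curve
    avg_length = sum(len(r.split()) for r in responses) / n    # response length
    word_count = sum(
        r.lower().count(w) for r in responses for w in REFLECTIVE_WORDS
    )                                                          # word count
    ratio = sum(
        any(w in r.lower() for w in REFLECTIVE_WORDS) for r in responses
    ) / n                                                      # ratio curve
    return {"accuracy": accuracy, "avg_length": avg_length,
            "reflective_count": word_count, "reflective_ratio": ratio}
```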


ToRL

ToRL: Scaling Tool-Integrated RL

TIR

::: block tool-integrated reasoning

  1. interleave natural-language reasoning with code execution: the model writes a code snippet, an external interpreter runs it, the output is appended to the context, and generation continues until a final answer is produced (sketch after this block) :::
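
A minimal sketch of one TIR rollout; `generate`, `run_in_sandbox`, and the `<code>`/`<output>` delimiters are placeholders for whatever interface the model actually uses.

```python
def tir_rollout(generate, run_in_sandbox, question: str, max_calls: int = 4) -> str:
    """Alternate between model generation and code execution until the model
    stops emitting code blocks or the call budget is exhausted."""
    context = question
    for _ in range(max_calls):
        completion = generate(context, stop=["</code>"])
        context += completion
        if "<code>" not in completion:
            break                                   # no tool call: final answer reached
        code = completion.split("<code>", 1)[1]     # extract the code snippet
        result = run_in_sandbox(code)               # execute in the external interpreter
        context += "</code>\n<output>" + str(result) + "</output>\n"
    return context
```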

ToRL

::: block

  • use Qwen2.5-Math models
  • utilize an external code interpreter to execute the generated code
    • the execution output is concatenated with the natural-language response and reasoning continues
  • Design (sketch after this list)
    • Tool Call Frequency Control: cap the number of tool calls per response to reduce GPU idle time
    • Execution Environment Selection: Sandbox Fusion
    • Error Message Processing: keep only the last line of the error message
    • Sandbox Output Masking: do not compute the loss on interpreter output tokens :::
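
A sketch of two of the design choices above (error truncation and loss masking of sandbox output), under assumed data structures rather than ToRL's actual implementation.

```python
def truncate_error(stderr: str) -> str:
    """Error Message Processing: keep only the last non-empty line of the traceback."""
    lines = [ln for ln in stderr.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def build_loss_mask(token_spans):
    """Sandbox Output Masking: tokens that came from the interpreter output
    get mask 0 so no loss is computed on them.
    token_spans: list of (tokens, is_sandbox_output) pairs."""
    mask = []
    for tokens, is_sandbox_output in token_spans:
        mask.extend([0 if is_sandbox_output else 1] * len(tokens))
    return mask
```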

Some others

Think before you act

::: block Combine the action with a “caption”

  • use the “caption” to indicate what to do next
  • use the action to indicate how to do it

Treat the RL process as an auto-regressive Transformer (sequence modeling) process (sketch after this block) :::
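
An illustrative way to lay out the "caption then action" idea as one auto-regressive sequence; the token layout is my assumption, not the paper's exact interface.

```python
def build_sequence(steps):
    """steps: list of (observation_tokens, caption_tokens, action_tokens).
    The model is trained to predict the caption ("what to do next") before
    the action ("how to do it"), conditioned on everything before them."""
    sequence = []
    for obs, caption, action in steps:
        sequence += list(obs) + list(caption) + list(action)
    return sequence   # train a causal Transformer with next-token prediction on this
```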

In-Context Reinforcement Learning with Algorithm Distillation

::: block Treat offline RL as a sequential prediction problem: distill the RL policy into a causal sequence model, modeling the policy with a neural network (sketch after this block) :::
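
A sketch of Algorithm Distillation-style data construction, with an assumed data layout (not the paper's code): concatenate episodes ordered by learning progress and train a causal sequence model to predict the next action.

```python
def build_history(episodes):
    """episodes: list of episodes ordered by learning progress, each a list of
    (observation, action, reward) tuples. Returns one long sequence so the
    sequence model can infer policy improvement in-context."""
    history = []
    for episode in episodes:
        for obs, action, reward in episode:
            history.append(("obs", obs))
            history.append(("act", action))
            history.append(("rew", reward))
    return history

# Training target: at every ("act", a) position, predict a from the preceding
# history with a causal Transformer (next-token prediction).
```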