Learning Goal-Conditioned Representations for Language Reward Models

Paper

Motivation: improve the representations learned by the reward model in order to better align language models.

The method improves the reward model's ability to distinguish correct from incorrect solutions in mathematical reasoning.

It follows the RLHF + RL paradigm.

Method

Preliminaries

Preference ranking reward modeling for LMs

The reward model is parameterized as $r(x, y) \to \mathbb{R}$: given a prompt $x$ and a completion $y = [y_{0}, \dots, y_{T}]$ (a sequence of tokens), it returns a scalar reward. Given a dataset $D$ of preference triples $(x, y^{w}, y^{l})$, the loss is:

$$L^{R} = -\frac{1}{|D|} \sum_{(x, y^{w}, y^{l}) \in D} \log \sigma\left(r(x, y^{w}) - r(x, y^{l})\right)$$

Here the reward model $r(x, y)$ provides a single scalar score for the entire completion $y$, not per-token feedback.
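
As a rough illustration of the preference ranking loss, the sketch below averages $-\log \sigma(r(x, y^{w}) - r(x, y^{l}))$ over a batch of pairs. It is a minimal sketch under assumptions, not the paper's implementation: the function names `log_sigmoid` and `preference_loss` are hypothetical, and the rewards are passed in as precomputed floats instead of coming from an actual reward model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # For z >= 0: log sigma(z) = -log(1 + exp(-z))
    # For z <  0: log sigma(z) = z - log(1 + exp(z))
    # Branching keeps the argument of exp() non-positive, so it cannot overflow.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(rewards_w, rewards_l):
    """Mean of -log sigma(r_w - r_l) over paired scalar rewards.

    rewards_w[i] stands in for r(x, y^w) and rewards_l[i] for r(x, y^l)
    of the i-th preference triple (assumed to be precomputed).
    """
    if len(rewards_w) != len(rewards_l) or len(rewards_w) == 0:
        raise ValueError("need two non-empty reward lists of equal length")
    total = sum(log_sigmoid(rw - rl) for rw, rl in zip(rewards_w, rewards_l))
    return -total / len(rewards_w)


if __name__ == "__main__":
    # Preferred completions scored higher than rejected ones: small loss.
    print(preference_loss([2.0, 1.5], [0.0, -0.5]))
    # Ranking inverted: larger loss.
    print(preference_loss([0.0, -0.5], [2.0, 1.5]))
```

The loss depends only on the difference between the two rewards, so adding a constant to every reward leaves it unchanged; when the two rewards are equal, each pair contributes $\log 2$.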
