MAYE

Paper

Using RL to enhance LLM reasoning ability

Preparation

The focus is on mathematical reasoning problems.

The problems fall into two subtypes: text-dominant (using the mm_math5k dataset) and vision-dominant (using the geometry3k dataset).

The loss is:

$$ L^{CLIP}(θ) = \mathbb{E}_{q \sim P(q),\; o_{q} \sim π_{θ_{old}}(o \mid q)} \left[ \frac{1}{|o_{q}|} \sum_{t=1}^{|o_{q}|} \left\{ \min\left[ prob_{t} \hat{A}_{t},\; \mathrm{clip}(prob_{t}, 1 - ϵ, 1 + ϵ)\, \hat{A}_{t} \right] - β_{loss} D_{KL}[π_{θ} \,\|\, π_{ref}] \right\} \right] $$

$$ prob_{t} = \frac{π_{θ}(o_{q,t} \mid q, o_{q,<t})}{π_{θ_{old}}(o_{q,t} \mid q, o_{q,<t})} $$
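The clipped objective for a single sampled response can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation. The function name `clipped_objective`, its arguments, and the default values of `eps` and `beta_loss` are assumptions made for the example. It also assumes the standard PPO form, in which the clipped ratio is multiplied by the advantage, and it takes the per-token KL estimate as a precomputed input.

```python
import math

def clipped_objective(logp_new, logp_old, advantages, kl_to_ref,
                      eps=0.2, beta_loss=0.01):
    """Token-averaged clipped surrogate for one response o_q.

    logp_new / logp_old: per-token log-probs under pi_theta / pi_theta_old.
    advantages: per-token advantage estimates (A_hat_t).
    kl_to_ref: per-token estimates of D_KL[pi_theta || pi_ref] (assumed given).
    eps and beta_loss defaults are illustrative, not taken from the paper.
    """
    n = len(logp_new)
    assert n == len(logp_old) == len(advantages) == len(kl_to_ref) and n > 0
    total = 0.0
    for lp_new, lp_old, adv, kl in zip(logp_new, logp_old, advantages, kl_to_ref):
        ratio = math.exp(lp_new - lp_old)                # prob_t
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(prob_t, 1-eps, 1+eps)
        surrogate = min(ratio * adv, clipped * adv)      # pessimistic bound
        total += surrogate - beta_loss * kl              # subtract KL penalty
    return total / n                                     # 1/|o_q| average
```

The value returned is the quantity to be maximized; a training loop would typically minimize its negative. Working with log-probabilities and exponentiating their difference avoids dividing two very small probabilities directly.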

$$ \hat{A}_{t} = \sum_{k=t}^{|o_{q}|} γ^{k-t} \left\{ \underbrace{\mathbb{I}(o_{q,k} = [EOS])\, r(q, o_{q})}_{\text{Rule-based reward}} - \underbrace{β_{rew} D_{KL}[π_{θ}(o_{q,k} \mid q, o_{q,<k}) \,\|\, π_{ref}(o_{q,k} \mid q, o_{q,<k})]}_{\text{Token-level KL reward}} \right\} $$

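Here, $β_{rew}$ is set to 0, which removes the KL-divergence term from the reward; only the KL-divergence penalty on the policy distribution (the $β_{loss}$ term in the loss) is applied.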
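To make the effect of $β_{rew} = 0$ concrete, the advantage can be computed as a discounted return-to-go over per-token rewards. The sketch below is a hedged illustration rather than the paper's code. The name `token_advantages` and its parameters are hypothetical, the last token of the response is assumed to be [EOS], and the per-token KL values are assumed to be precomputed.

```python
def token_advantages(final_reward, kl_per_token, gamma=1.0, beta_rew=0.0):
    """Discounted return-to-go for one response.

    The reward at step k is final_reward on the last token (assumed to be
    [EOS]) and 0 elsewhere, minus beta_rew * KL_k. With beta_rew = 0 only
    the rule-based reward remains.
    """
    n = len(kl_per_token)
    advantages = [0.0] * n
    running = 0.0
    # Walk backwards so that A_t = r_t + gamma * A_{t+1}.
    for k in reversed(range(n)):
        rule_reward = final_reward if k == n - 1 else 0.0
        reward_k = rule_reward - beta_rew * kl_per_token[k]
        running = reward_k + gamma * running
        advantages[k] = running
    return advantages
```

With `beta_rew=0.0` and `gamma=1.0`, every token receives the same value as the final rule-based reward. With `gamma < 1`, earlier tokens receive a geometrically discounted share of it.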

The reward $r(q, o_{q})$ serves as the rule-based signal that guides RL training.

The models used are Qwen-2/2.5-VL-Instruct.