# Soft Adaptive Policy Optimization

**Version Requirement**: ms-swift>=3.11

[Soft Adaptive Policy Optimization (SAPO)](https://arxiv.org/abs/2511.20347) addresses the issues caused by hard clipping in GRPO by proposing a temperature-controlled soft gate mechanism that smoothly attenuates off-policy updates while preserving useful learning signals.

## Background and Motivation

When training LLMs with reinforcement learning, GRPO handles off-policy training by computing token-level importance sampling ratios:

$$
r_t = \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t|x, y_{<t})}
$$

However, token-level importance sampling ratios often exhibit high variance, which is exacerbated in the following cases:

- **Long text generation**: small per-token mismatches accumulate over long sequences
- **MoE model routing heterogeneity**: the old policy used for sampling and the model being trained may select different expert routes, significantly amplifying log-probability differences
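
To make the sensitivity concrete, the ratio can be computed from per-token log-probabilities as $r_t = \exp(\log p_{\mathrm{new}} - \log p_{\mathrm{old}})$. A minimal sketch (illustrative names, not the ms-swift API):

```python
import math

def importance_ratios(new_logps, old_logps):
    """Per-token importance sampling ratios r_t = exp(logp_new - logp_old)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

# Identical log-probs give r_t = 1 (on-policy); even a modest log-prob gap,
# e.g. from MoE routing differences, pushes r_t far from 1.
print(importance_ratios([-1.0, -2.0, -0.5], [-1.0, -1.2, -2.0]))
```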

To address this, GRPO uses hard clipping to limit the magnitude of policy updates:

$$
L^{\mathrm{GRPO}} = -\min\left( r_t \cdot A, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon) \cdot A \right)
$$

**The Dilemma of Hard Clipping**: hard clipping struggles to balance stability and learning efficiency. Clipping too tightly discards effective samples, while clipping too loosely admits noisy gradients from off-policy samples, destabilizing training.
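
The clipped objective can be sketched per token as follows (an illustrative sketch of the formula above, not ms-swift's implementation):

```python
def grpo_token_loss(r, adv, eps=0.2):
    """Hard-clipped GRPO loss for one token:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return -min(r * adv, r_clipped * adv)

# With A > 0, any ratio above 1 + eps is flattened to the clip boundary,
# so that token contributes no gradient at all.
print(grpo_token_loss(1.5, 1.0))   # clipped branch: -1.2
print(grpo_token_loss(0.9, 1.0))   # unclipped branch: -0.9
```

The flat region outside $[1-\epsilon, 1+\epsilon]$ is exactly the dilemma: those tokens are silenced entirely rather than attenuated.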

## SAPO Method

SAPO replaces hard clipping with a temperature-controlled sigmoid soft gate function, achieving smooth gradient attenuation.

### Soft Gate Function

The core of SAPO is applying a sigmoid soft gate to the importance sampling ratio.

For positive advantages ($A > 0$), use the positive gate:

$$
g^{+}_t = \sigma\left( \tau_{\mathrm{pos}} \cdot (r_t - 1) \right)
$$

For negative advantages ($A < 0$), use the negative gate:

$$
g^{-}_t = \sigma\left( \tau_{\mathrm{neg}} \cdot (r_t - 1) \right)
$$

where:

- $\sigma(\cdot)$ is the sigmoid function
- $\tau_{\mathrm{pos}}$ and $\tau_{\mathrm{neg}}$ are temperature parameters that control the slope of the gate
- $r_t$ is the importance sampling ratio
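
The two gates can be sketched directly from the formulas above (a minimal illustration, not the ms-swift implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_gate(r, adv, tau_pos=1.0, tau_neg=1.05):
    """g_t = sigma(tau * (r_t - 1)), with tau chosen by the sign of A."""
    tau = tau_pos if adv > 0 else tau_neg
    return sigmoid(tau * (r - 1.0))

# The gate varies smoothly and monotonically in r: there is no hard
# cutoff at 1 +/- eps, only gradual saturation toward 0 or 1.
print(soft_gate(1.0, 1.0))   # 0.5 at the on-policy point r = 1
```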

### SAPO Loss Function

$$
L^{\mathrm{SAPO}} = -g_t \cdot A
$$

where $g_t = g^{+}_t$ when $A > 0$ and $g_t = g^{-}_t$ when $A < 0$.
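
Putting the gate and the loss together (again an illustrative, self-contained sketch of the formulas above, not the library code):

```python
import math

def sapo_token_loss(r, adv, tau_pos=1.0, tau_neg=1.05):
    """L = -g_t * A, with g_t = sigma(tau_pos * (r - 1)) for A > 0
    and g_t = sigma(tau_neg * (r - 1)) for A < 0."""
    tau = tau_pos if adv > 0 else tau_neg
    gate = 1.0 / (1.0 + math.exp(-tau * (r - 1.0)))
    return -gate * adv

# Unlike hard clipping, a far-off-policy token is attenuated rather than
# zeroed: its contribution shrinks smoothly as r drifts away from 1.
print(sapo_token_loss(1.0, 1.0))   # -0.5
```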

### Temperature Parameters

The temperature parameter $\tau$ controls the decay rate of the soft gate function: larger values make the gate saturate faster as $r_t$ moves away from 1.

The paper points out that positive advantages increase the logit of the sampled token while decreasing the logits of all unsampled tokens; negative advantages do the opposite, increasing the logits of many unsampled tokens, which can spread to a large number of irrelevant tokens and introduce instability. The paper therefore recommends setting $\tau_{\mathrm{neg}} > \tau_{\mathrm{pos}}$ so that gradients decay faster for tokens with negative advantages, improving training stability and performance.

The paper recommends default values of $\tau_{\mathrm{pos}} = 1.0$ and $\tau_{\mathrm{neg}} = 1.05$.
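
A quick numeric check of the temperature's effect on decay (the value 4.0 is deliberately exaggerated for illustration and is not a recommended setting):

```python
import math

def gate(r, tau):
    """Soft gate g = sigma(tau * (r - 1))."""
    return 1.0 / (1.0 + math.exp(-tau * (r - 1.0)))

# For a downweighted token at r = 0.5, a larger temperature drives the
# gate value toward zero faster, so that off-policy token's contribution
# to the loss is suppressed more strongly.
for tau in (1.0, 1.05, 4.0):
    print(tau, gate(0.5, tau))
```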

## Parameter Settings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--loss_type` | `str` | - | Set to `sapo` |
| `--tau_pos` | `float` | `1.0` | Temperature for positive advantages; controls the gate slope |
| `--tau_neg` | `float` | `1.05` | Temperature for negative advantages; controls the gate slope |

```bash
swift rlhf \
    --rlhf_type grpo \
    --loss_type sapo \
    --tau_pos 1.0 \
    --tau_neg 1.05 \
    # ... other parameters
```

Example training scripts:

- [swift](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/sapo.sh)
- [megatron swift](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo/sapo.sh)

> The soft gate mechanism of SAPO only takes effect during off-policy training.
> The importance sampling granularity in SAPO is at the token level (i.e., `importance_sampling_level` defaults to `token`), which conflicts with GSPO.