Week 08 · April 2026

GDPO: Why GRPO Breaks Under Multiple Rewards - and How to Fix It

April 18, 2026 · by Satish K C · 7 min read
Deep Learning · LLMs · RLHF

The Paper

"GDPO: Group reward-Decoupled Normalization Policy Optimization" was released in January 2026 by Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov - a team of researchers at NVIDIA. The paper's central claim is that applying Group Relative Policy Optimization (GRPO) directly to multi-reward settings is fundamentally flawed: normalizing a combined reward signal causes distinct rollout advantages to collapse into identical values, eroding the training signal and causing instability. The authors introduce GDPO, which decouples normalization to operate per reward before aggregation, preserving advantage resolution and enabling stable, consistent improvement across tool calling, math reasoning, and coding tasks.

Read the Paper on arXiv →

The Problem Before This Paper

GRPO, introduced as part of DeepSeek-R1's training pipeline, computes advantages by normalizing rewards across a group of rollouts from the same prompt. It was designed for single-reward settings, where the only signal is, say, answer correctness. Modern RLHF pipelines increasingly combine multiple rewards - correctness, format adherence, output length constraints - to shape more nuanced behavior. The straightforward approach is to aggregate these rewards into a weighted sum and apply GRPO as-is. The NVIDIA team shows this is not a safe default: when rewards of varying scale and difficulty are summed before normalization, rollouts that differ meaningfully on individual reward components end up with nearly identical normalized advantages. The gradient signal becomes nearly uniform across the group, reducing effective batch diversity and, in the case of math reasoning with a format reward, causing training collapse around step 400.
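The homogenization effect is easy to reproduce with toy numbers. The sketch below (my own illustration, not from the paper; the reward values are invented) compares sum-then-normalize against normalize-then-sum for a group of four rollouts where a large-magnitude length penalty sits alongside a binary correctness reward:

```python
import math

def group_normalize(xs, eps=1e-8):
    """Normalize rewards across the rollout group: (x - mean) / (std + eps)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]

# Toy group of 4 rollouts (assumed numbers): binary answer correctness
# plus a length penalty on a much larger scale.
correctness = [1.0, 0.0, 1.0, 0.0]
length_pen  = [-0.1, -0.2, -20.0, -0.3]  # rollout 2 blew the length budget

# GRPO-style: sum first, normalize once. The outlier length penalty
# inflates the group std, so the correct rollout 0 and the incorrect
# rollout 1 receive nearly identical advantages.
adv_grpo = group_normalize([c + l for c, l in zip(correctness, length_pen)])

# GDPO-style (decoupled): normalize each reward independently, then sum.
# Correctness keeps its full ±1 resolution regardless of the length scale.
adv_gdpo = [a + b for a, b in
            zip(group_normalize(correctness), group_normalize(length_pen))]

print([round(a, 2) for a in adv_grpo])
print([round(a, 2) for a in adv_gdpo])
```

In the GRPO column the correct and incorrect rollouts land within a few hundredths of each other, while the decoupled version separates them by roughly two standard deviations.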

What They Built

GDPO's core change is to move normalization inside the reward loop rather than applying it after aggregation. For each reward r_i in the multi-reward set, a group-wise mean and standard deviation are computed independently across the rollout group, producing a per-reward normalized advantage A_i. These per-reward advantages are then summed and passed through a final batch-level normalization to produce the final advantage A used in the policy gradient update. This two-level normalization - per-reward group normalization followed by batch normalization - ensures that each reward retains its relative ordering across rollouts regardless of the absolute scale differences between reward types. The paper also introduces reward conditioning as a complementary technique: when one reward is significantly harder to satisfy than another (e.g., format correctness vs. answer correctness in math), the easier reward's gradient is gated on whether the harder reward was already satisfied, preventing the model from gaming the easier signal at the cost of the harder one.

// GRPO (broken in multi-reward):
A = normalize( sum_i( w_i * r_i ) )

// GDPO (decoupled):
A_i = group_normalize( r_i )   // per-reward, independent
A = batch_normalize( sum_i( w_i * A_i ) )

// Reward conditioning (difficulty-aware gating):
A_easy = A_easy * I[ r_hard > threshold ]
// Easy reward gradient fires only when hard reward is satisfied
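The pseudocode above can be turned into a small runnable sketch. Function names, the dict-based reward interface, and the gating threshold are my own; and for simplicity the final normalization here runs over the same group, whereas the paper applies it at batch level:

```python
import math

def _normalize(xs, eps=1e-8):
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]

def gdpo_advantages(reward_groups, weights, gate=None, threshold=0.5):
    """Compute GDPO-style advantages for one rollout group.

    reward_groups: dict name -> per-rollout rewards.
    weights:       dict name -> scalar weight w_i.
    gate:          optional (easy_name, hard_name) pair for reward
                   conditioning: the easy reward's advantage fires only
                   where the hard reward exceeds `threshold`.
    """
    n = len(next(iter(reward_groups.values())))
    # Step 1: per-reward group normalization (the decoupling step).
    per_reward = {k: _normalize(v) for k, v in reward_groups.items()}
    # Optional reward conditioning: zero the easy advantage wherever the
    # hard reward is not yet satisfied.
    if gate is not None:
        easy, hard = gate
        per_reward[easy] = [
            a if reward_groups[hard][i] > threshold else 0.0
            for i, a in enumerate(per_reward[easy])
        ]
    # Step 2: weighted sum of per-reward advantages, then a final
    # normalization pass.
    summed = [sum(weights[k] * per_reward[k][i] for k in per_reward)
              for i in range(n)]
    return _normalize(summed)

adv = gdpo_advantages(
    {"answer": [1.0, 0.0, 1.0, 0.0],   # hard reward (correctness)
     "format": [1.0, 1.0, 0.0, 1.0]},  # easy reward (format adherence)
    weights={"answer": 1.0, "format": 0.2},
    gate=("format", "answer"),
)
print([round(a, 2) for a in adv])
```

With the gate in place, rollout 1 (well-formatted but wrong) earns no credit from the format reward, so the ranking ends up correct-and-formatted first, correct-but-unformatted second, and both wrong rollouts last.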

Key Findings

On tool calling with Qwen2.5-Instruct models evaluated on BFCL-v3, GDPO raises average accuracy from 30.18% to 32.81% at 1.5B parameters, and from 39.20% to 40.87% at 3B - while also improving format correctness from 76.33% to 80.66% (1.5B) and 81.64% to 82.23% (3B). On math reasoning with DeepSeek-R1-1.5B, GDPO achieves 29.4% on AIME-24 versus GRPO's 23.1%, and 86.2% on MATH500 versus 83.6%, while also reducing the length-exceed rate from 10.8% to 6.5% - a direct indicator of improved length reward adherence. On coding with DeepSeek-R1-7B evaluated on Codeforces, pass rate improves from 68.1% to 71.2% and bug ratio drops from 7.0% to 5.6%. Across all tasks, no single setting shows a regression under GDPO relative to GRPO.

My Take

This is an important paper precisely because it is narrow and empirically honest. The contribution is not a new architecture or a new training paradigm - it is a documented failure mode in a widely adopted method, with a principled fix and rigorous ablations. The collapse GRPO exhibits at step 400 in math training is the kind of issue that would surface in an internal experiment as an "unstable training run" and get blamed on hyperparameters or data quality. The team here does the work to isolate the cause: advantage homogenization from pre-normalization aggregation. The solution, decoupled group normalization, is elegant, and the ablation showing that "GRPO w/o std" still fails on format rewards is particularly well designed - it rules out the obvious partial fix before proposing the full one. The open question is whether GDPO's preservation of per-reward resolution holds when the reward count grows to five or more signals, as some enterprise alignment pipelines require, or when rewards are correlated rather than independent. The paper tests up to two rewards, and the interaction dynamics in higher-dimensional reward spaces remain unexplored.

Discussion question: If GRPO's normalization collapses advantages when rewards are aggregated before normalization, what does this imply about reward model design choices - specifically, should practitioners design reward models that output independent scalar signals rather than composite scores, even when the behaviors being evaluated are inherently correlated?
