GRPO Judge Experiments: Findings & Empirical Observations
Introduction
Many GRPO reproductions for LLM reinforcement learning available online lack useful intuitions or recommendations regarding hyperparameters and reward shaping.
Most examples focus on GSM8K or similar math problems - tasks with clear right/wrong answers. Additionally, the demonstrations I have seen primarily use models smaller than ~7B, and they rarely surface the realistic failure modes that show up in the learning dynamics of smaller models.
My goal was to tackle a more generalizable and difficult problem: improving generalized semantic judgements where the "correct" answer requires subjective evaluation, not just calculation. This is a more diverse task compared to math-based problem solving, and helps illustrate how GRPO develops problem-solving strategies in other domains.
Task Description
The task involves training a model to evaluate two pieces of text. One of the texts has been subtly augmented by an LLM (specifically, the larger 14B variant of my corruption models). The model must provide notes followed by a judgement in consecutive XML tags.
Example Format
The base model is given a system prompt that establishes the expected template, along with two randomly ordered A/B samples, one "real" and one "synthetic", as its input:
REQUEST: You are to judge the better of the two samples and determine which of the following samples is better using a short judgement that is no longer than (and no shorter than) exactly 128 tokens.
Respond with an exactly 128 tokens tag labeled <notes> that contains your notes, and then <judgement> which is just the letter that you are picking.
For example:
JUDGE: <notes>
Sample A is superior to Sample B... (example notes)
</notes>
<judgement>A</judgement>
Now, it is your turn.
[Sample A]:
Included is a pre-test, post-test, and vocabulary quiz on the 8th grade math standard functions (8.F). 1.) Determine if a graph represents a function 2.) State the domain and range of a relation 3.) Plot points on a graph to determine if the table represents a function 4.) State if a function is decreasing, increasing, or constant 5.) Determine the output of a function machine 6.) Determine the recursive and explicit equation 7.) Determine the minimum, maximum, increasing interval, and decreasing interval of a graph 8.) Determine the rate of change, initial value, independent value, and dependent variable given a graph 9.) Sketch a graph given a situation The vocabulary included is dependent, output, function, domain, range, decreasing function, input, range, non-linear function, relation, increasing function, and function notation. Total Pages: 9 (18 including answer key) Answer Key: Included Document File: PDF
[Sample B]:
Included is a pre-test, post-test, and vocabulary quiz on the 8th grade math standard functions (8.F). 1.) Determine if a graph represents a function 2.) State the domain and range of a relation 3.) Plot points on a graph to determine if the table represents a function 4.) State if a function is increasing, decreasing, or constant 5.) Determine the output of a function given 6.) Determine the input of a function given 7.) Determine a function rule given ordered pairs or a table of values. 8.) Graph functions using a table of values and determine a trend line in a graph 9.) Write a data table situation The vocabulary included is dependent, output, function, domain, range, decreasing function, input, range, non-linear function, relation, increasing function, and function notation. Total Pages: 9 (18 including answer key) Answer Key: Included Document File: PDF
JUDGE:
A correct output follows this structure:
<notes>
Sample A provides more specific and thoroughly defined tasks. It mentions "function machine," "recursive and explicit equation," and detailed graph analysis with "minimum, maximum" and intervals. Sample B contains incomplete phrases like "output of a function given" without completing the thought, making it less coherent and precise than Sample A.
</notes>
<judgement>A</judgement>
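For reference, a response in this format can be extracted with a simple parser. Below is a minimal sketch in Python (not the exact parsing logic used in the training code; the function name is an assumption), matching the tag names in the template above:

```python
import re

def parse_judge_response(text: str):
    """Extract the notes and the A/B judgement from a model response."""
    notes_match = re.search(r"<notes>(.*?)</notes>", text, re.DOTALL)
    judgement_match = re.search(r"<judgement>\s*([AB])\s*</judgement>", text)
    if not notes_match or not judgement_match:
        return None  # malformed response: missing or broken tags
    return {
        "notes": notes_match.group(1).strip(),
        "judgement": judgement_match.group(1),
    }
```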
Resources
- Model checkpoint: quest-grpo-judge-14b-v1-205
- Training code: Available in the Files tab at this link
- Dataset: quest-corruption-14brestorations-2.6k-filter-v1
Hyperparameter Observations
Learning Rate Schedulers
While tuning hyperparameters, I observed that cosine or linearly decaying learning rates (particularly with a fast warmup) do not work well with GRPO.
What worked best was a constant schedule with a warmup phase and a lower learning rate: lr_scheduler_type="constant_with_warmup"
Gradient Clipping
For consistency, setting max_grad_norm to 0.2 as a default seems most effective (this is also the default clipping ratio in PPO, and I have done SFT runs in the past where 0.2 had the lowest loss). This value can be reduced if necessary for stability.
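Putting both observations together, here is a minimal sketch of the relevant trainer arguments, assuming a TRL-style GRPOConfig (the exact class and argument names depend on your training stack, and the learning rate / warmup values below are illustrative, not the exact values from these runs):

```python
from trl import GRPOConfig  # assumes TRL's GRPO trainer; adapt to your setup

training_args = GRPOConfig(
    output_dir="grpo-judge",
    learning_rate=1e-6,                        # illustrative "lower" learning rate
    lr_scheduler_type="constant_with_warmup",  # constant schedule after a warmup phase
    warmup_steps=50,                           # illustrative warmup length
    max_grad_norm=0.2,                         # tighter gradient clipping
)
```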
Reward Construction Decisions
Power Scaling the Judgement Reward
For this task, there are two possible valid choices: A or B. Base models (Qwen2.5 series, mostly 7B/14B runs) begin with slightly worse than chance-level accuracy; the 14B started closer to 50/50 (equivalent to random chance).
I wanted to try something different for reward scaling rather than naively using the percentage of accurate judgements. If we simply took the raw accuracy and used it as reward (e.g., 0.4 for 40% accuracy), I thought it might not create enough pressure for improvement.
My idea was to apply a power function (specifically x⁴) to the reward scaling: rewards for correct answers are scaled down when the model's overall batch accuracy is low, and increase substantially as accuracy improves.
The implementation I tried rescales individual correct judgements based on the overall batch performance: with batch accuracy p, each correct answer receives roughly p³, so the total judgement reward over the batch scales as p⁴. This means that the actual reward value assigned to each correct answer changes as the model improves on average, but I haven't rigorously tested whether this scaling is actually better than alternatives.
Example:
- At 50% accuracy, correct answers get 0.125 reward
- At 80% accuracy, correct answers get 0.512 reward
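A minimal sketch of this batch-dependent scaling (illustrative rather than a copy of my actual reward code; the function name and boolean-list interface are assumptions):

```python
def judgement_rewards(correct_flags: list[bool]) -> list[float]:
    """Scale the reward for each correct judgement by batch accuracy cubed,
    so the total judgement reward across the batch scales as accuracy**4."""
    accuracy = sum(correct_flags) / len(correct_flags)
    per_correct = accuracy ** 3
    return [per_correct if correct else 0.0 for correct in correct_flags]

# Example: at 50% batch accuracy a correct answer earns 0.125,
# at 80% it earns 0.512, matching the numbers above.
```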
XML Formatting Bonus
Adhering to the XML schema (notes followed by a judgement) is something we want to incentivize. However, I wanted to avoid giving partial reward to responses with wrong judgements.
What I ended up doing is implementing the formatting reward as a flat 0.2 bonus that only applies when the judgement is also correct.
My reasoning is that, done this way, the "chain of desirability" (according to our reward specification) looks something like:
correct judgements & formatting > correct judgements, wrong formatting > wrong judgements (formatting irrelevant)
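A sketch of this conditional bonus, reusing the hypothetical parse_judge_response helper from earlier (the function names and signature are assumptions, not my exact implementation):

```python
def formatting_bonus(response: str, correct_label: str) -> float:
    """Flat 0.2 bonus for well-formed XML, awarded only when the judgement is correct."""
    parsed = parse_judge_response(response)
    if parsed is None:
        return 0.0  # malformed XML: no bonus, regardless of the underlying answer
    if parsed["judgement"] != correct_label:
        return 0.0  # well-formed but wrong judgement: no partial credit
    return 0.2
```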
Length as a Bonus Reward
The third reward was a bit of an afterthought: a non-linear scaling bonus that favors responses close to the 128-token notes length requested in the prompt.
Much like the formatting reward, this reward only applies if the judgement is correct.
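As a rough illustration only (the target length, exponent, cap, and whitespace-based token counting below are all assumptions rather than my actual scaling), a length bonus along these lines might look like:

```python
def length_bonus(response: str, correct_label: str, target_tokens: int = 128) -> float:
    """Non-linear bonus that decays as the notes drift from the target length,
    awarded only when the judgement is also correct."""
    parsed = parse_judge_response(response)
    if parsed is None or parsed["judgement"] != correct_label:
        return 0.0
    # crude token count by whitespace; a real implementation would use the tokenizer
    n_tokens = len(parsed["notes"].split())
    closeness = max(0.0, 1.0 - abs(n_tokens - target_tokens) / target_tokens)
    return 0.1 * closeness ** 2  # non-linear: small deviations cost little, large ones a lot
```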
Scaling and the Impact on Judgement
While the Qwen2.5 3B can climb the hill, it does so significantly more slowly than the 7B or 14B base models.
It's worth emphasizing how much faster the 7B base adapts here: the 1.5B was only slightly worse than the 3B, while the 7B saw a substantial improvement.
Observed Reasoning Patterns
Analysis of the highest reward samples from the GRPO judge model revealed "templated" reasoning patterns:
Template Structure
The highest reward samples seem to follow a specific reasoning template:
- Model starts with "Both of these..." followed by an explanation of what the samples share
- Then follows with "However," to contrast differences between examples
Ambiguity Recognition
There is more acknowledgement of ambiguity leading up to the judgement in some of the highest reward samples.
Reinforcement Phrases
"Additionally," is another anchor phrase that the model uses to reinforce the judgement in cases where it seems more confident than usual.
Acknowledgements
Special thanks to Deepshard for temporarily providing 8xH200s for the purposes of research & open source experimentation.
My reward implementation builds upon willccbb's verifiers repository, which was modified to handle the custom task described here.