GRPO Judge Experiments: Findings & Empirical Observations

03 Mar, 2025

Introduction

Many GRPO reproductions for LLM reinforcement learning available online lack useful intuitions or recommendations regarding hyperparameters and reward shaping.

Most examples focus on GSM8k or similar math problems - tasks with clear right/wrong answers. Additionally, demonstrations seen so far primarily use models smaller than ~7B, and don't demonstrate realistic failure modes when it comes to the learning dynamics of smaller models.

My goal was to tackle a more generalizable and difficult problem: improving generalized semantic judgements where the "correct" answer requires subjective evaluation, not just calculation. This is a more diverse task compared to math-based problem solving, and helps illustrate how GRPO develops problem-solving strategies in other domains.

Task Description

The task involves training a model to evaluate two pieces of text. One of the texts has been subtly augmented by a LLM (specifically, the larger 14B variant of my corruption models). The model must provide notes and a subsequent judgment afterwards in consecutive XML tags.

GRPO Task Overview

Example Format

The base model is provided a system prompt that establishes the expected template, as well as two randomly ordered A/B samples containing "real" vs "synthetic" samples for the input:

REQUEST: You are to judge the better of the two samples and determine which of the following samples is better using a short judgement that is no longer than (and no shorter than) exactly 128 tokens.

Respond with an exactly 128 tokens tag labeled <notes> that contains your notes, and then <judgement> which is just the letter that you are picking.

For example:

JUDGE: <notes>
Sample A is superior to Sample B... (example notes)
</notes>
<judgement>A</judgement>

Now, it is your turn.

[Sample A]:
Included is a pre-test, post-test, and vocabulary quiz on the 8th grade math standard functions (8.F). 1.) Determine if a graph represents a function 2.) State the domain and range of a relation 3.) Plot points on a graph to determine if the table represents a function 4.) State if a function is decreasing, increasing, or constant 5.) Determine the output of a function machine 6.) Determine the recursive and explicit equation 7.) Determine the minimum, maximum, increasing interval, and decreasing interval of a graph 8.) Determine the rate of change, initial value, independent value, and dependent variable given a graph 9.) Sketch a graph given a situation The vocabulary included is dependent, output, function, domain, range, decreasing function, input, range, non-linear function, relation, increasing function, and function notation. Total Pages: 9 (18 including answer key) Answer Key: Included Document File: PDF

[Sample B]:
Included is a pre-test, post-test, and vocabulary quiz on the 8th grade math standard functions (8.F). 1.) Determine if a graph represents a function 2.) State the domain and range of a relation 3.) Plot points on a graph to determine if the table represents a function 4.) State if a function is increasing, decreasing, or constant 5.) Determine the output of a function given 6.) Determine the input of a function given 7.) Determine a function rule given ordered pairs or a table of values. 8.) Graph functions using a table of values and determine a trend line in a graph 9.) Write a data table situation The vocabulary included is dependent, output, function, domain, range, decreasing function, input, range, non-linear function, relation, increasing function, and function notation. Total Pages: 9 (18 including answer key) Answer Key: Included Document File: PDF

JUDGE:

A correct output follows this structure:

<notes>
Sample A provides more specific and thoroughly defined tasks. It mentions "function machine," "recursive and explicit equation," and detailed graph analysis with "minimum, maximum" and intervals. Sample B contains incomplete phrases like "output of a function given" without completing the thought, making it less coherent and precise than Sample A.
</notes>
<judgement>A</judgement>

Resources

Model checkpoint: quest-grpo-judge-14b-v1-205
Training code: Available in the Files tab at this link
Dataset: quest-corruption-14brestorations-2.6k-filter-v1

Hyperparameter Observations

Comparison of decaying rewards

Learning Rate Schedulers

When hyperparameter tuning, I observed that cosine or linear decaying learning rates (particularly with fast warmup) do not work well with GRPO.

Learning Rate Comparison

What worked best was constant with a warm up phase with lower learning rates: lr_scheduler_type="constant_with_warmup"

Constant with Warmup Learning Rate

Gradient Clipping

For consistency, setting max_grad_norm to 0.2 as a default seems most effective (this is also the default clipping ratio in PPO, & I have done SFT runs in the past where 0.2 had the lowest loss).

This value can be reduced if necessary for stability.

Reward Construction Decisions

Power Scaling the Judgement Reward

For this task, there are two possible valid choices: A/B. Base models (Qwen2.5 series, mostly 7b/14b runs) begin with slightly worse than average reward; 14b was closer to 50/50 (equivalent to random chance).

I wanted to try something different for reward scaling rather than naively using the percentage of accurate judgements. If we simply took the raw accuracy and used it as reward (e.g., 0.4 for 40% accuracy), I thought it might not create enough pressure for improvement.

My idea was to apply a power function (specifically x⁴) to the reward scaling. Specifically, the approach scales down rewards for correct answers when the model's overall batch accuracy is low, while substantially increasing the reward as accuracy improves.

The implementation I tried rescales individual correct judgments based on the overall batch performance. This means that the actual reward value assigned to each correct answer changes as the model improves on average, but I haven't rigorously tested if this scaling is actually better than alternatives.

Example:

At 50% accuracy, correct answers get 0.125 reward
At 80% accuracy, correct answers get 0.512 reward

Power Scaling Function

XML Formatting Bonus

Enforcing the XML schema (notes followed by a judgement) is something we want to incentivize. However, I wanted to avoid giving partial reward to responses with wrong judgements.

What I ended up doing is enforcing the formatting reward as a flat 0.2 bonus that only applies when the judgement is also correct.

My reasoning for this is because, if done this way, the "chain of desirability" (according to our reward specifications) looks something along the lines of:

correct judgements & formatting > correct judgements, wrong formatting > wrong judgements (formatting irrelevant)

Length as a Bonus Reward

The third reward was a bit of an afterthought, but it's a non-linear scaling that favors tags with a length closely clustered to the range of ~128 tokens.

Much like the formatting reward, this reward only applies if the judgement is correct.

Token Length Example

Scaling and the Impact on Judgement

While Qwen2.5 3b can climb the hill, it does it significantly more slowly compared to the 7b or 14b base models.

Scaling Comparison Between Model Sizes

I think it's a little important to harp on the fact that the 7b base adapts so much faster here; it's to the degree where the Qwen 1.5b was only slightly worse compared to the 3b, while the 7b saw a substantial improvement.

Observed Reasoning Patterns

Analysis of the highest reward samples from the GRPO judge model revealed "templated" reasoning patterns:

Template Structure

The highest reward samples seem to follow a specific reasoning template:

Model starts with "Both of these..." followed by an explanation of what the samples share
Then follows with "However," to contrast differences between examples

Example of template structure

Ambiguity Recognition

There is more acknowledgement of ambiguity leading up to the judgement in some of the highest reward samples.

Example showing ambiguity acknowledgment

Reinforcement Phrases

"Additionally," is another anchor phrase that the model uses to reinforce the judgement in cases where it seems more confident than usual.

Example of reinforcement phrase

Acknowledgements

Special thanks to Deepshard for temporarily providing 8xH200s for the purposes of research & open source experimentation.

My reward implementation builds upon willccbb's verifiers repository, which was modified to handle the custom task described here.