Reinforcement Learning for Knowledge Awareness
Intro
If you're active at all when it comes to language model research these days, you almost certainly have an implicit mental model for what kinds of training have been useful for capability improvement in the post-o1 era.
The paradigmatic shortlist is essentially as follows:
- Pretraining, which targets diverse conditional prediction on natural webtext (or its synthetic derivatives).
- Mid-training, which targets concentrated, high quality data mixes, typically followed by a learning rate annealing stage.
- Supervised finetuning, for getting a model to land in a reasonable starting basin in terms of general problem solving and conversational assistance.
- RL on verifiable outcomes, for teaching goal-oriented problem solving for things like agentic software engineering where task performance can be automatically measured with strong guarantees.
- RL on soft outcome rewards, as mediated by an external judge model, for teaching goals with weaker guarantees.
What is weird about this is the 2023-shaped hole in most modern RL works: RLHF. More broadly, Bradley-Terry Reward Modeling (BTRM) as a general primitive for defining outcome rewards over pairwise comparisons.
I think this is mostly downstream of a misconception as to what BT reward models are actually "doing" under the hood. The folk wisdom seems to be that naive reward modeling over noisy data (especially dubiously curated human preference datasets, which the field was pigeonholed into during the pre-reasoner era) leads to inherently brittle proxies that don't represent the actual thing that you care about. Geoffrey Hinton literally called RLHF "doing a paint-job on a rusty car", after all.
If we reframe this without the imported InstructGPT -> early era RLHF baggage, it becomes easier to understand that "preference modeling" is simply the most legible way to describe what a specific implementation of the thing is optimizing for. Under the hood, a BTRM is doing something much closer to what a GAN discriminator does; you are learning to distinguish the "preferred" distribution of samples from a "dispreferred" one, with the score the reward model outputs being (proportional to) the log ratio of those two densities. This is generalized density ratio estimation over outcome distributions, and it's a meaningfully more powerful primitive than "noisy proxy for human approval" in the abstract.
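To make the discriminator framing concrete, here is a minimal sketch of the pairwise loss a BTRM is trained with (illustrative, not the exact training code). Nothing in it references an absolute notion of quality; the scorer is only ever asked to separate the two sample distributions.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Pairwise logistic (Bradley-Terry) loss: -log sigmoid(s_chosen - s_rejected).
    # Minimizing it only requires the score to separate the preferred and
    # dispreferred distributions, which is why the learned scalar behaves like
    # the log density ratio described above rather than an absolute grade.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```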
To make what I am saying more concrete: Elo in chess relies on the same principle of transitive pairwise comparisons in order to make chess leaderboards meaningful. Chess ability is a real, measurable quality, but it's continuous and not obviously bounded in the way that "make the tests pass" is.
The GAN analogy is not perfect, mainly because GANs as traditionally formulated don't think of this in relative terms and do absolute classification to determine "realness". But thinking of this in relative terms is important, if only because it gives you a continuous optimization path towards outcome level behaviors that have a reason to work even when your model has a natural hit rate of 0 for a particular disposition.
Anyways, the point of this blogpost is to show that a single linear projection from a single token of a single layer in Qwen3-8B is enough to learn a discriminative BTRM function that:
- generalizes in some sense from the contrastive scoring setup
- in principle, doesn't require a "non-zero hit rate" for a desired macro level behavior (which discrete rubrics + LLM-as-a-judge style environments need to have in order for policy gradients to exploit variance properly)
- can be used to teach models that reliably confabulate to start performing knowledge checks implicitly rather than through top down specification, without misgeneralizing.
RL Outcomes

Here is an example of a successful RL training run that has all of these properties across ~300 steps, with periodic evaluations on held out unseen data.
There are two subsets here: a "fake" subset, where the user presents a detailed question about a completely made up person, which causes reliable confabulation a majority of the time, as well as a "real" subset, where the user asks about a well known individual.
- `refuses_fake` tracks the % of fake-subset responses where the model is confident enough to flag the person as being "made up" on some level.
- `fake_hedge` tracks the % of fake-subset responses that express uncertainty while still engaging with the user's invented famous researcher as a hypothetical.
- `fake_substantive` tracks the % of fake-subset responses that uncritically accept the user's premise entirely (this is the "worst case" we are trying to eliminate).
- `answers_real` tracks the % of questions about real researchers that the model correctly answers with some degree of substantial content.
Data
The setup for this is fairly naive. Two data subsets:
- The rejected subset is synthetically constructed with detailed questions about nonexistent researchers.
- The chosen subset is synthetically constructed with wiki grounding in context for known influential researchers. Individuals are chosen based on a daily Wikipedia view count heuristic filter.
The thing I wanted to demonstrate in the data setup was the use of prefix conditioning (via "assistant turn prefilling") to elicit behaviors that are locally desirable in certain contexts but are NOT globally desirable.
> Assistant: [Note: I'm not actually familiar with this name. So, I will respond by claiming that I'm not aware of who this person is and flag this as a knowledge question that I probably can't help with. I should also elaborate that it possibly might not even be real. Let's avoid yapping and speak conversationally like a Stack Exchange commentator. Avoid being dry but keep things professional.]
If you treated this prefill intervention as a universal good, the model would have a strong reason to misgeneralize and "sandbag" by denying knowledge it has about the world. However, the structural trick being used here is symmetric inversion of the chosen/rejected label based on the prompt class. The prefill itself is stripped from the pairs before the RM trains on them, so the scorer only reads the downstream completion text rather than the intervention that produced it. Combined with the polarity flip, this forces the learned projection to separate robustly known knowledge from the dubious kind through a continuous scalar rather than breaking on surface steering artifacts.
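As a rough sketch (the names and structure here are purely illustrative, not the exact pipeline), the polarity flip plus prefill stripping looks something like this:

```python
def build_pair(prompt: str, steered_completion: str, standard_completion: str,
               prompt_is_fake: bool) -> dict:
    # steered_completion was sampled under the assistant-turn prefill, but the
    # prefill note itself is never stored, so the reward model only ever reads
    # the downstream completion text.
    if prompt_is_fake:
        # On fabricated people, the hedging / knowledge-check behavior wins.
        chosen, rejected = steered_completion, standard_completion
    else:
        # Polarity flip: on real, well-known people the confident grounded
        # answer wins and the steered denial becomes the rejected sample.
        chosen, rejected = standard_completion, steered_completion
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```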
n=64 completions were sampled for both the prefill and standard variants (via normal Qwen generation with CoT thinking turned off).
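For completeness, here is a rough sketch of how the non-thinking completions can be sampled with HF transformers; enable_thinking is Qwen3's chat-template switch for suppressing the CoT block, and the sampling parameters below are illustrative rather than the exact ones used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def sample_completions(prompt: str, prefill: str | None = None, n: int = 64) -> list[str]:
    text = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False,
    )
    if prefill:
        text += prefill  # assistant-turn prefilling: the steering note starts the turn
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.7,
                         max_new_tokens=512, num_return_sequences=n)
    # Slice off the prompt (and prefill) tokens so only the continuation is kept.
    return tok.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
```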
RM Head Training
Qwen3-8B has a 4096-dimensional hidden state. We are mainly trying to demonstrate that the features are already rich enough to learn general discriminative functions, and as such, all training runs involved learning a single [4096 -> 4096] linear head on the frozen instruct model (~16M params). A direct trainable [4096 -> 1] head doesn't converge reliably over the frozen backbone, while this formulation does. Both are function-class equivalent, so the gap is purely in optimization dynamics.
Concretely: a single trainable W ∈ R^(4096×4096) maps the Qwen hidden state x to an intermediate vector, and the Bradley-Terry logit is 0.01 · Σ(Wx), i.e. a scaled sum aggregator with no learned direction. All the discriminative capacity lives in W, while the 0.01 just keeps the BT loss magnitudes well-behaved.
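A minimal PyTorch module matching that description might look like this (illustrative; which hidden state gets fed in is the "single token of a single layer" choice mentioned above):

```python
import torch
import torch.nn as nn

class SumProjectionRMHead(nn.Module):
    """Single [4096 -> 4096] projection over a frozen backbone's hidden state."""

    def __init__(self, hidden_size: int = 4096, scale: float = 0.01):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # all capacity lives here
        self.scale = scale  # fixed; just keeps BT loss magnitudes well-behaved

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: [batch, 4096]; returns the Bradley-Terry logit 0.01 * sum(Wx),
        # a scaled sum aggregator with no learned output direction.
        return self.scale * self.W(hidden_state).sum(dim=-1)
```

Training then pairs this head with the pairwise loss sketched earlier, with the backbone kept frozen.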
We see strong degrees of separation on the reward head for our contrastive dataset:
| Subset | Chosen score | Rejected score | Margin | Accuracy |
|---|---|---|---|---|
| Real (confident vs false-hedge) | +2.08 | +0.36 | +1.73 | 100% |
| Fake (hedge vs confab) | +1.88 | +0.50 | +1.38 | 100% |
Environment Design
In traditional RLHF setups, there was an emphasis on an explicit KL divergence term being used to prevent the policy from reward hacking.
There is a sense in which this can be thought of as an indirect way to prevent policy optimization from chasing an unbounded scalar that has no natural "stopping point".
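For reference, that classic shaping is usually written as something like the following (beta here is illustrative):

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor, policy_logprob: torch.Tensor,
                     ref_logprob: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    # Standard RLHF-style shaping: the RM scalar is offset by a KL estimate
    # against the frozen reference policy, so pushing the score ever higher
    # eventually costs more than it pays.
    return rm_score - beta * (policy_logprob - ref_logprob)
```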

My initial attempts used an unbalanced sample set (class imbalance between fake/real prompts), with the policy simply maximizing the raw RM scalar with no bound. As you can see, the model becomes more likely to deny knowledge of real people overall on the eval set, which is a degenerate behavior. So how would you fix that?
We can frame this as a calibration issue; in a naive RLHF setup, the policy is not attempting to match the distribution of chosen responses, but rather, is learning to concentrate sampling density in a region that can score outside the distribution of chosen samples.
So what happens when we explicitly optimize toward a different target: the empirical spread of the chosen response scores, treated as a distribution?

Well, if you do this in combination with a class balanced RL training datamix, you get convergence towards the global distribution of preferred outcomes, and entropy collapse is avoided. In fact, we see a slight increase in entropy in this setting:
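One way to realize this kind of target (the functional form below is illustrative, not the exact one used) is to reward closeness to the chosen-score distribution rather than raw magnitude, which gives the scalar a built-in stopping point:

```python
import torch

def distribution_matching_reward(policy_score: torch.Tensor,
                                 chosen_scores: torch.Tensor) -> torch.Tensor:
    # chosen_scores: RM scores over the chosen half of the preference data,
    # treated as the target outcome distribution.
    mu, sigma = chosen_scores.mean(), chosen_scores.std()
    # Reward peaks when the policy's score lands inside the chosen-score
    # spread and falls off on both sides, so there is a natural stopping
    # point instead of an unbounded scalar to chase.
    return -((policy_score - mu) / sigma) ** 2
```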

Qualitative example
Before/after on a held-out OOD prompt. This is not a "researcher knowledge" question, which is what the RL training set targets, but the behavior transfers as you would expect.
Base policy confidently agrees with the user's false assertion:

Post-RL, the model correctly pushes back (and mangles an unrelated detail, which is... somewhat expected for an 8b).

Conclusion
What we can take away from this is that a lot of information already lives inside existing representations, and that learned ranking over constructed data can be an extremely powerful, targeted primitive for shaping model behavior in the abstract.
The class of behaviors this approach is well-suited for shares a few properties.
- The target is an aggregate distribution shift rather than a sharply verifiable checklist,
- The naive hit rate of the desired behavior is near zero,
- The preferred/dispreferred distinction is context-dependent in a way that resists clean specification,
- The relevant information is plausibly already linearly decodable from the model's representations.
Epistemic humility is one example, but a slew of "you know it when you see it" axes can more precisely be targeted through learned discrimination over constructed data.
While the work here was deliberately limited to a simple single-turn case, I hope to have demonstrated a general principle: constructed synthetic data can be used to directly build learned scalar rewards in a way that targets precise distributional differences at the outcome level.