- Reinforcement Learning for Knowledge Awareness
- Don't Exclude Rollouts From Your RL Training Runs
- RL Learning with LoRA: A Diverse Deep Dive
- Understanding Transformers... (beyond the Math)
- GRPO Judge Experiments: Findings & Empirical Observations
- Why does GRPO work?
- Synthetic rejected preference data creation [via Qwen7b finetune]