Synthetic rejected preference data creation [via Qwen7b finetune]
Note:
This is a repost of an older rentry page: https://rentry.org/corruption_model_writeup
I've been building a custom model specifically tailored to take input text that has been "corrupted" in specific sections.
The actual "corruption" looks like UTF8 character substitution, which my finetuned Qwen7b has learned to reconstruct the original from (as best as it can, given the surrounding context).
The finetuned model attempts to produce a version of the text without the corruption; you can also select a particular part for it to "replace", which yields an augmented, "base model"-style output for just that section.
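For concreteness, here is a minimal sketch of what the reconstruction call might look like, assuming the finetune is a plain causal LM served through Hugging Face transformers; the model path, prompt convention, and sampling settings are placeholders rather than the actual setup.

```python
# Sketch of the reconstruction step: feed corrupted text to the finetune and
# take its continuation as the cleaned-up / "base model"-style replacement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/corruption-qwen7b-finetune"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def reconstruct(corrupted_text, max_new_tokens=512):
    """Ask the finetune for a clean version of the corrupted input."""
    inputs = tokenizer(corrupted_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
        )
    # Drop the prompt tokens; keep only the newly generated continuation.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```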
The end goal of this project is to build a strong baseline for a reward model; classical RLHF (as done with the PPO algorithm, not DPO!) is useful not just for aligning the model to known preferences, but also for pushing the model towards coherent multi-token completions.
Regular SFT training does a good job of aligning individual token predictions, but it doesn't make the model robust at "long form" completions; this leads to issues (like repetition biases and looping) that aren't present in the natural distribution of text data.
While conventional RLHF preference models are used in RL to steer generations towards more coherent outcomes, the preference classifier can overfit to the scarce amount of available preference data and learn biased judgements. My idea is to use the naturally available pre-training data (filtered for quality!) and augment it with lower-quality base model predictions: this approach is naturally scalable, and real human data will (generally) be superior to base model outputs.
Intuitively, these reward models should be stronger and more adaptable baselines to finetune from than reward models that have only ever been trained on instruction-style outputs, because this idea scales to billions of tokens.
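As a rough illustration of how the pairs could be assembled (this is my reading of the idea, not a format prescribed anywhere above): the original human-written passage becomes the "chosen" side and the finetune's reconstruction of a corrupted copy becomes the "rejected" side, using the `reconstruct` helper above and the `corrupt` helper sketched at the end of this post.

```python
# Sketch: turn quality-filtered pre-training passages into chosen/rejected
# pairs. "chosen" = the real human text, "rejected" = what the finetune
# produces from a corrupted copy of it. The JSONL field names follow the
# common chosen/rejected convention used by reward-model trainers; adjust
# them to whatever your trainer expects.
import json

def build_pairs(passages, strength=0.5, out_path="preference_pairs.jsonl"):
    with open(out_path, "w", encoding="utf-8") as f:
        for text in passages:
            corrupted = corrupt(text, strength=strength)  # see sketch further down
            rejected = reconstruct(corrupted)             # the finetune's attempt
            f.write(json.dumps({"chosen": text, "rejected": rejected},
                               ensure_ascii=False) + "\n")
```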
- Why UTF8?
Character-level substitution was chosen over token-level substitution (at least in my attempts thus far) because it gives me the ability to do it at different "strengths", e.g.:
- 100% corruption
- 50% corruption
Preference models can overfit strongly if the signal is "too clear", so I think it will be important to have subtle errors and awkward-wording examples on top of the obviously "base model" augmented outputs in order to get generalizable preference judgements.
Being able to control the degree of corruption (in the way I have done here) makes it possible to augment existing data "more subtly" in addition to "more selectively" (as demonstrated earlier with the ability to corrupt only the text you have highlighted).
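To make the "strength" and highlighted-span ideas concrete, here is a minimal sketch of one way such a corruption pass could work; the actual substitution scheme and codepoint ranges used by the real pipeline aren't specified here, so treat these as placeholders.

```python
import random

def corrupt(text, strength=1.0, span=None):
    """Replace characters inside `span` with random codepoints.

    strength: probability that any given character in the span is replaced
              (1.0 reproduces the "100%" example, 0.5 the "50%" one).
    span:     (start, end) indices to corrupt; defaults to the whole string.
    """
    start, end = span if span is not None else (0, len(text))
    chars = list(text)
    for i in range(start, end):
        if random.random() < strength:
            # Arbitrary printable range (Latin-1 Supplement / Latin Extended);
            # the real substitution scheme may differ.
            chars[i] = chr(random.randint(0x00A1, 0x024F))
    return "".join(chars)

# Corrupt only a highlighted slice at 50% strength.
sample = "The quick brown fox jumps over the lazy dog."
print(corrupt(sample, strength=0.5, span=(4, 19)))
```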