Researchers at Google Deepmind Introduce BOND: A Novel RLHF Method that Fine-Tunes the Policy via Online Distillation of the Best-of-N Sampling Distribution

Researchers at Google Deepmind Introduce BOND: A Novel RLHF Method that Fine-Tunes the Policy via Online Distillation of the Best-of-N Sampling Distribution