Reinforcement learning in general didn't come out of AI alignment work, but RLHF in particular did. The initial idea and paper [1] came from AI alignment folks, as did most of the later development [2][3][4][5]. Overview: https://www.alignmentforum.org/posts/vwu4kegAEZTBtpT6p/thoug...

[1] Deep Reinforcement Learning from Human Preferences https://arxiv.org/abs/1706.03741

[2] Fine-Tuning Language Models from Human Preferences https://arxiv.org/abs/1909.08593

[3] Learning to summarize from human feedback https://arxiv.org/abs/2009.01325

[4] Recursively Summarizing Books with Human Feedback https://arxiv.org/abs/2109.10862

[5] Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155



Maybe I am missing something, but I don't entirely understand what exactly the novelty of RLHF is here. RL is a long-standing field, and agents have been trained from human input for decades. Even the first paper you cite as the "initial idea" simply claims: "Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014)", that "A long line of work studies reinforcement learning from human ratings or rankings", and finally that "our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors".
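For what it's worth, the mechanism those papers scale up is fitting a reward model to pairwise human preferences (a Bradley-Terry style logistic model over reward differences), then running ordinary RL against the learned reward. A toy sketch of just the reward-fitting step, with simulated preferences standing in for human labels (everything here is illustrative, not the papers' actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Trajectory segments as feature vectors; a hidden "true" linear
# reward generates the simulated human comparisons below.
true_w = np.array([2.0, -1.0, 0.5])
segments = rng.normal(size=(200, 3))

def reward(w, x):
    return x @ w

# Simulated labeler: prefers whichever segment has higher true reward.
pairs = rng.integers(0, len(segments), size=(500, 2))
prefs = (reward(true_w, segments[pairs[:, 0]]) >
         reward(true_w, segments[pairs[:, 1]])).astype(float)

# Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)).
# Fit w by gradient ascent on the preference log-likelihood.
w = np.zeros(3)
lr = 0.1
for _ in range(300):
    ra = reward(w, segments[pairs[:, 0]])
    rb = reward(w, segments[pairs[:, 1]])
    p = 1.0 / (1.0 + np.exp(-(ra - rb)))
    grad = ((prefs - p)[:, None] *
            (segments[pairs[:, 0]] - segments[pairs[:, 1]])).mean(axis=0)
    w += lr * grad

# The learned reward should rank segments the way the labeler did.
agreement = np.mean(
    ((reward(w, segments[pairs[:, 0]]) >
      reward(w, segments[pairs[:, 1]])).astype(float) == prefs))
print(agreement)
```

The claimed novelty is less this model (which, as quoted, goes back at least to Akrour et al.) than making it work with deep networks and, later, large language models.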


If you want more details, the last paper [5], on InstructGPT, is probably the most interesting.



