RLHF
RLHF (Reinforcement Learning from Human Feedback) is a training method that fine-tunes LLMs after pre-training, using human raters who compare and rank model outputs.
RLHF (popularised by InstructGPT and ChatGPT) bridges the gap between 'can predict text' and 'is actually helpful'. Step 1: human evaluators rank multiple model outputs for the same prompt. Step 2: a reward model is trained to predict which outputs humans prefer. Step 3: the LLM is optimised against that reward model with reinforcement learning, typically PPO (DPO is a related alternative that skips the reward model entirely; see the FAQ below). Result: better instruction following, safer behaviour and improved truthfulness. Downsides: reward hacking and sycophancy (over-agreeing with the user).
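A minimal sketch of Step 2 in PyTorch, assuming a toy network in place of a real LLM backbone: the reward model is trained with a pairwise (Bradley-Terry) loss so that preferred outputs score higher than rejected ones. All names, shapes and hyperparameters here are illustrative.

```python
# Sketch of Step 2: training a reward model on human preference pairs.
# A real reward model would be an LLM with a scalar value head, and
# `chosen`/`rejected` would be tokenised responses, not random vectors.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        # Toy stand-in for "LLM backbone + scalar value head".
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins for embeddings of the preferred and rejected outputs.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```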
Example
OpenAI trained InstructGPT with ~10,000 human comparisons. Despite being over 100x smaller, the 1.3B-parameter instruction-tuned model was preferred by users over the original 175B-parameter GPT-3 — RLHF in action.
Frequently asked questions
RLHF, RLAIF or DPO?
RLHF: human raters. RLAIF (Reinforcement Learning from AI Feedback): AI raters instead (cheaper, more scalable). DPO (Direct Preference Optimization): a simpler alternative that optimises the LLM directly on preference pairs, with no separate reward model or RL loop.
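For contrast, a hedged sketch of the DPO loss in PyTorch: it needs only per-response log-probabilities from the policy being trained and from a frozen reference model, so there is no reward model and no PPO loop. The function name and β value are illustrative.

```python
# Sketch of Direct Preference Optimization (DPO): the preference signal
# is applied directly to the policy's log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Each argument: summed log-probs for a batch of responses."""
    # Log-ratios against the frozen reference act as implicit rewards.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the preferred response's implicit reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```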
Can I RLHF my model?
Yes, but it is costly: thousands of preference labels plus RL training infrastructure are required. For most use cases, supervised fine-tuning on high-quality examples is more practical, as sketched below.
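A hedged sketch of that more practical route using Hugging Face Transformers: the model name ("gpt2") and data file ("examples.jsonl") are placeholders, not recommendations.

```python
# Plain supervised fine-tuning on high-quality prompt/response examples.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assumes a JSONL file where each line has a "text" field: prompt + ideal answer.
dataset = load_dataset("json", data_files="examples.jsonl")["train"]
dataset = dataset.map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False -> causal LM objective; the collator pads and builds the labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```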