RLHF
RLHF (Reinforcement Learning from Human Feedback) is a training method that fine-tunes LLMs after pre-training, using human raters who compare and rank model outputs.
RLHF (popularised by InstructGPT and ChatGPT) bridges the gap between 'can predict text' and 'is actually helpful'. Step 1: human evaluators rank multiple model outputs for the same prompt. Step 2: a reward model is trained to predict which outputs humans prefer. Step 3: the LLM is optimised against that reward model with reinforcement learning, typically PPO (DPO is a related alternative that skips the reward model entirely; see the FAQ below). Result: better instruction following, safer behaviour and improved truthfulness. Downsides: reward hacking and sycophancy (over-agreeing with the user).
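A minimal sketch of Step 2 in PyTorch, assuming a toy network in place of a real LLM backbone: the reward model is trained with a pairwise (Bradley-Terry) loss so that preferred outputs score higher than rejected ones. All names, shapes and hyperparameters here are illustrative.

```python
# Sketch of Step 2: training a reward model on human preference pairs.
# A real reward model would be an LLM with a scalar value head, and
# `chosen`/`rejected` would be tokenised responses, not random vectors.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        # Toy stand-in for "LLM backbone + scalar value head".
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins for embeddings of the preferred and rejected outputs.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```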
Example
OpenAI trained InstructGPT with ~10,000 human comparisons. Despite being over 100x smaller, the 1.3B-parameter instruction-tuned model was preferred by users over the original 175B-parameter GPT-3 — RLHF in action.
Frequently asked questions
RLHF, RLAIF or DPO?
RLHF: human raters. RLAIF (Reinforcement Learning from AI Feedback): AI raters instead (cheaper, more scalable). DPO (Direct Preference Optimization): a simpler alternative that optimises the LLM directly on preference pairs, with no separate reward model or RL loop.
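For contrast, a hedged sketch of the DPO loss in PyTorch: it needs only per-response log-probabilities from the policy being trained and from a frozen reference model, so there is no reward model and no PPO loop. The function name and β value are illustrative.

```python
# Sketch of Direct Preference Optimization (DPO): the preference signal
# is applied directly to the policy's log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Each argument: summed log-probs for a batch of responses."""
    # Log-ratios against the frozen reference act as implicit rewards.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the preferred response's implicit reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```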
Can I RLHF my model?
Yes, but it is costly: thousands of preference labels plus RL training infrastructure are required. For most use cases, supervised fine-tuning on high-quality examples is more practical, as sketched below.
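A hedged sketch of that more practical route using Hugging Face Transformers: the model name ("gpt2") and data file ("examples.jsonl") are placeholders, not recommendations.

```python
# Plain supervised fine-tuning on high-quality prompt/response examples.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assumes a JSONL file where each line has a "text" field: prompt + ideal answer.
dataset = load_dataset("json", data_files="examples.jsonl")["train"]
dataset = dataset.map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False -> causal LM objective; the collator pads and builds the labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```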