Three Models of RLHF Annotation: Extension, Evidence, and Authority

Reinforcement Learning from Human Feedback (RLHF) relies on human annotators to guide large language model behavior, but the role their judgments play is often left unclear. Three conceptual models emerge: extension, where annotators augment the system designers' own judgments; evidence, which treats annotations as empirical data about human preferences; and authority, where annotators' decisions are themselves normative. These models carry distinct implications for RLHF development and deployment: the extension model assumes annotators share the designers' values, for example, while the authority model grants annotators significant influence over model behavior [1]. As large language models become more pervasive, understanding which of these models a pipeline implicitly adopts is crucial for managing their risks and benefits. The security implications are particularly significant, since RLHF can reshape both the capability and risk surfaces of these models. For practitioners, the takeaway is that clarifying the role of human annotators in RLHF helps mitigate security risks and supports more effective model development.
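Whichever conceptual model one adopts, the mechanical entry point for annotator judgments is the same. As a concrete anchor, here is a minimal sketch of the standard pairwise-preference setup used to train an RLHF reward model with a Bradley-Terry loss; the `RewardModel` class, embedding dimensions, and data below are hypothetical illustrations, not anything from the paper.

```python
# Illustrative sketch (assumed, not from the paper): how annotator
# preference labels typically enter RLHF, via a reward model trained
# with a Bradley-Terry pairwise loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the response the annotator
    preferred should score higher than the one they rejected."""
    margin = model(chosen) - model(rejected)
    return -F.logsigmoid(margin).mean()

# Toy batch: 4 annotator preference pairs over 16-dim response embeddings.
model = RewardModel(dim=16)
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(model, chosen, rejected)
loss.backward()  # gradients now carry the annotators' judgments into training
```

The three conceptual models read this same gradient step differently: as an extension of the designers' own labels, as noisy empirical evidence about human preferences, or as a directly authoritative signal about how the model ought to behave.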
⚡ High Priority
Why This Matters
Reinforcement-learning-driven LLM development reshapes both capability and risk surfaces, and the security implications tend to trail the hype cycle.
References
- [1] [Author]. (2026, April 28). Three Models of RLHF Annotation: Extension, Evidence, and Authority. *arXiv*. https://arxiv.org/abs/2604.25895v1
Original Source: arXiv AI