Three Models of RLHF Annotation: Extension, Evidence, and Authority

Reinforcement Learning from Human Feedback (RLHF) relies on human annotators to guide large language model behavior, but the role their judgments play is often left unclear. Three conceptual models emerge: extension, where annotators augment the system designers' own judgments; evidence, which treats annotations as empirical data about human preferences; and authority, where annotators' decisions are themselves normative. These models carry distinct implications for RLHF development and deployment: the extension model assumes annotators share the designers' values, for example, while the authority model grants annotators significant influence over model behavior [1]. As large language models become more pervasive, understanding which of these models a pipeline implicitly adopts is crucial for managing their risks and benefits. The security implications are particularly significant, since RLHF can reshape both the capability and risk surfaces of these models. For practitioners, the takeaway is that clarifying the role of human annotators in RLHF helps mitigate security risks and supports more effective model development.
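Whichever conceptual model one adopts, the mechanical entry point for annotator judgments is the same. As a concrete anchor, here is a minimal sketch of the standard pairwise-preference setup used to train an RLHF reward model with a Bradley-Terry loss; the `RewardModel` class, embedding dimensions, and data below are hypothetical illustrations, not anything from the paper.

```python
# Illustrative sketch (assumed, not from the paper): how annotator
# preference labels typically enter RLHF, via a reward model trained
# with a Bradley-Terry pairwise loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the response the annotator
    preferred should score higher than the one they rejected."""
    margin = model(chosen) - model(rejected)
    return -F.logsigmoid(margin).mean()

# Toy batch: 4 annotator preference pairs over 16-dim response embeddings.
model = RewardModel(dim=16)
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(model, chosen, rejected)
loss.backward()  # gradients now carry the annotators' judgments into training
```

The three conceptual models read this same gradient step differently: as an extension of the designers' own labels, as noisy empirical evidence about human preferences, or as a directly authoritative signal about how the model ought to behave.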
⚡ High Priority
Why This Matters
Reinforcement-learning-driven LLM development reshapes both capability and risk surfaces, and the security implications tend to trail the hype cycle.
References
- [1] [Author]. (2026, April 28). Three Models of RLHF Annotation: Extension, Evidence, and Authority. *arXiv*. https://arxiv.org/abs/2604.25895v1
Original Source: arXiv AI