The reliability of evaluations for generative AI models, such as large language models, is undermined by inconsistent human rater judgments. To address this, researchers have proposed a multi-level annotator modeling approach that improves the reproducibility of evaluations by explicitly accounting for variability among raters, variability that can significantly alter the perceived performance and safety of an AI system. By incorporating annotator modeling, evaluations become more robust, and the systems they vet can be trusted with greater confidence.

This lack of reproducibility has real consequences: irreproducible evaluations can yield inaccurate assessments of model performance and safety. Improving evaluation methodology is therefore essential for the safe and effective deployment of AI systems, and this approach offers practitioners a concrete path toward more reliable, trustworthy evaluation results.
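The article does not specify the exact model, but a common instantiation of multi-level annotator modeling is a mixed-effects regression with a random intercept per annotator, which separates systematic rater leniency or strictness from the genuine quality signal. The sketch below is a minimal illustration under that assumption: the data are simulated, and the column names, effect sizes, and two-system comparison are hypothetical.

```python
# A minimal sketch (not necessarily the researchers' exact method) of
# multilevel annotator modeling: a mixed-effects model with a random
# intercept per annotator. All data below is simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_annotators, ratings_each = 20, 30

# Hypothetical ground truth: system "B" is 0.4 points better than "A",
# and each annotator has a personal bias (leniency or strictness).
system_effect = {"A": 0.0, "B": 0.4}
annotator_bias = rng.normal(0.0, 0.5, n_annotators)

rows = []
for a in range(n_annotators):
    for _ in range(ratings_each):
        system = rng.choice(["A", "B"])
        rating = (system_effect[system] + annotator_bias[a]
                  + rng.normal(0.0, 0.3))
        rows.append({"annotator": a, "system": system, "rating": rating})
data = pd.DataFrame(rows)

# Naive pooling ignores who produced each rating; the multilevel model
# attributes shared variance to annotators via the random intercept.
fit = smf.mixedlm("rating ~ system", data, groups=data["annotator"]).fit()
print(fit.summary())  # the system[T.B] fixed effect recovers ~0.4
```

In this toy setup, the fixed-effect estimate for the system comparison remains stable even though individual annotators differ substantially, which is the reproducibility benefit the approach aims for.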