Large Language Models (LLMs) are increasingly used to evaluate the quality of other LLMs: a judge model analyzes a candidate model's outputs through a specially engineered judge prompt that spells out the assessment criteria. This automation makes it practical to evaluate complex free-form text outputs from victim models at scale. Researchers have been probing the reliability and fidelity of these automated judgment systems, which pair an LLM with a tailored judge prompt to assess the performance of other LLMs [1]. The implications of this research extend beyond the technical realm into policy, security, and workforce dynamics. As LLMs become increasingly prevalent, understanding the limitations and potential biases of their judges is crucial for preserving the integrity of automated decision-making systems. Because the ability to reliably evaluate LLM performance underpins the development of trustworthy AI, the fidelity of these automated judgment systems matters directly to practitioners deploying secure and reliable AI solutions.
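
To make the judge-prompt pattern concrete, here is a minimal sketch in Python of an LLM-as-a-judge loop. The prompt wording, rubric, score range, and the `call_llm` function are all illustrative assumptions, not a prescribed standard; `call_llm` is a stub to be replaced with a real inference backend.

```python
# Minimal LLM-as-a-judge sketch. The judge prompt encodes the assessment
# criteria; the judge model returns a score that we parse and validate.

JUDGE_PROMPT = """You are an impartial evaluator. Rate the response below
against these criteria: factual accuracy, relevance to the question, and
clarity of writing. Reply with a single integer score from 1 (poor) to 5
(excellent) and nothing else.

Question: {question}
Response: {response}

Score:"""


def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (hosted API or local model).
    # Returns a canned answer here so the sketch runs end to end.
    return "4"


def judge(question: str, response: str) -> int:
    """Ask the judge model to score one response; return the parsed score."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    score = int(raw.strip().split()[0])  # tolerate trailing text
    if not 1 <= score <= 5:
        raise ValueError(f"Score out of range: {score}")
    return score


if __name__ == "__main__":
    print(judge("What is 2 + 2?", "2 + 2 equals 4."))  # -> 4 with the stub
```

Everything interesting about judge fidelity lives in the details this sketch glosses over: how the criteria are worded, whether the score is parsed robustly, and whether the judge model itself exhibits the biases that the research discussed here investigates.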