The evaluation of autonomous agents, such as those built on large language models, is hindered by significant limitations in existing benchmarks. These benchmarks assess only final outputs, neglecting the sequence of actions that produced them, and lack clear criteria for evaluating safety and robustness [1]. Furthermore, current evaluation methods cover only a narrow range of interaction modes and data types, which does not reflect real-world usage. To address these shortcomings, researchers have introduced a new evaluation framework that aims to provide a more comprehensive and trustworthy assessment of autonomous agents. The framework considers the entire sequence of actions an agent takes and provides clearer guidelines for evaluating safety and robustness. Such a framework is important for ensuring that autonomous agents can be reliably deployed in complex, real-world environments and that their performance can be accurately measured and improved, which matters to practitioners building more reliable and trustworthy autonomous systems.
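
To make the contrast between outcome-only scoring and trajectory-level evaluation concrete, the sketch below shows one possible way to record an agent's full run and apply per-step safety checks alongside a final-answer check. This is a minimal illustration under assumed names (`Step`, `Trajectory`, `evaluate_trajectory`); it is not the API of the framework described above.

```python
# Illustrative sketch only: class and function names are assumptions,
# not the evaluated framework's actual interface.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One action the agent took and the observation it received."""
    action: str
    observation: str


@dataclass
class Trajectory:
    """Full record of an agent run: every intermediate step plus the final answer."""
    steps: List[Step] = field(default_factory=list)
    final_output: str = ""


def evaluate_trajectory(
    traj: Trajectory,
    outcome_check: Callable[[str], bool],
    safety_checks: List[Callable[[Step], bool]],
) -> dict:
    """Score the whole trajectory, not just the final output.

    outcome_check  -- did the final answer solve the task?
    safety_checks  -- predicates every intermediate step must satisfy
                      (e.g. no destructive shell commands, no leaked secrets).
    """
    violations = [
        (i, check.__name__)
        for i, step in enumerate(traj.steps)
        for check in safety_checks
        if not check(step)
    ]
    return {
        "task_success": outcome_check(traj.final_output),
        "safety_violations": violations,
        "safe": not violations,
    }


if __name__ == "__main__":
    # Toy run: the agent issues a risky command before producing a correct answer.
    def no_rm_rf(step: Step) -> bool:
        return "rm -rf" not in step.action

    run = Trajectory(
        steps=[Step("rm -rf /tmp/cache", "cache cleared"),
               Step("read report.txt", "...contents...")],
        final_output="42",
    )
    print(evaluate_trajectory(run, lambda out: out == "42", [no_rm_rf]))
    # {'task_success': True, 'safety_violations': [(0, 'no_rm_rf')], 'safe': False}
```

The toy run illustrates the core point: an outcome-only benchmark would mark this agent as fully successful, while a trajectory-level evaluation surfaces the unsafe intermediate step even though the final answer is correct.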