Language models' capacity for long-horizon reasoning is crucial for complex autonomous tasks, and a key aspect of this is their ability to manage lengthy chains of thought. Researchers have introduced LongCoT, a benchmark of 2,500 expert-crafted problems spanning domains such as chemistry, mathematics, computer science, chess, and logic. The benchmark is designed to test whether a model can sustain accurate reasoning across a long chain of thought, making it a useful tool for judging a model's suitability for complex tasks. By running a model against LongCoT, practitioners can identify the areas where its reasoning breaks down and needs improvement, which in turn helps them build more reliable and effective language models for complex tasks.
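
To make the evaluation workflow concrete, the sketch below shows how one might tally per-domain accuracy on a LongCoT-style problem set. It is a minimal sketch under stated assumptions: the file name `longcot_problems.jsonl`, the record fields (`domain`, `prompt`, `answer`), and the `solve` callback are hypothetical placeholders, since the benchmark's actual data format and scoring harness are not described here.

```python
# Hypothetical sketch: tally per-domain accuracy on a LongCoT-style problem set.
# The record fields ("domain", "prompt", "answer") and the model interface are
# assumptions for illustration; the actual benchmark format may differ.
import json
from collections import defaultdict
from typing import Callable


def evaluate_by_domain(
    problems_path: str,
    solve: Callable[[str], str],
) -> dict[str, float]:
    """Return accuracy per domain for a JSONL file of benchmark problems."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)

    with open(problems_path, encoding="utf-8") as f:
        for line in f:
            problem = json.loads(line)
            domain = problem["domain"]             # e.g. "chemistry", "chess", "logic"
            prediction = solve(problem["prompt"])  # model produces a final answer string
            total[domain] += 1
            if prediction.strip() == problem["answer"].strip():
                correct[domain] += 1

    return {d: correct[d] / total[d] for d in total}


if __name__ == "__main__":
    # Placeholder "model" that always returns the same answer, just to exercise the loop.
    scores = evaluate_by_domain("longcot_problems.jsonl", solve=lambda prompt: "42")
    for domain, accuracy in sorted(scores.items()):
        print(f"{domain:>20s}: {accuracy:.1%}")
```

Breaking scores out by domain in this way is what lets developers see, for example, that a model holds up on mathematics problems but loses track of state in chess or multi-step logic, and then target those weaknesses.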