Researchers have introduced MADQA, a benchmark for assessing the strategic reasoning of multimodal agents in document-intensive workflows. It comprises 2,250 human-authored questions grounded in 800 heterogeneous PDF documents, enabling a comprehensive evaluation of an agent's ability to navigate and search complex document collections. Using Classical Test Theory as a guiding framework, the benchmark aims to distinguish genuine strategic reasoning from stochastic trial-and-error search. MADQA is a step toward understanding the limitations and capabilities of current multimodal agents and their potential to automate complex workflows. For practitioners, it underscores the need for evaluation methods that verify agents are truly reasoning strategically rather than succeeding through random search, a distinction with significant implications for building reliable and efficient document-intensive pipelines.
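To make the Classical Test Theory angle concrete, here is a minimal illustrative sketch (not MADQA's actual analysis pipeline; the score matrix and function names are assumptions) of two standard CTT item statistics. Item difficulty is the proportion of agents answering an item correctly, and item discrimination (a point-biserial-style correlation of each item against the rest-of-test score) separates items that track consistent skill from items whose outcomes look like chance:

```python
import statistics

# Illustrative only: `scores` is an assumed (num_agents x num_items)
# matrix of 0/1 question outcomes, not real MADQA data.

def item_difficulty(scores):
    """Proportion of agents answering each item correctly (the CTT p-value)."""
    n = len(scores)
    return [sum(row[j] for row in scores) / n for j in range(len(scores[0]))]

def item_discrimination(scores):
    """Correlate each item with the rest-of-test total score.
    High positive values: the item separates stronger agents from weaker ones.
    Values near zero or negative: outcomes look chance-driven, consistent
    with stochastic trial-and-error search rather than strategic reasoning."""
    n_items = len(scores[0])
    out = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # exclude item j itself
        sd_i, sd_r = statistics.pstdev(item), statistics.pstdev(rest)
        if sd_i == 0 or sd_r == 0:
            out.append(0.0)  # degenerate item: everyone same, undefined corr
            continue
        mi, mr = statistics.mean(item), statistics.mean(rest)
        cov = sum((a - mi) * (b - mr) for a, b in zip(item, rest)) / len(item)
        out.append(cov / (sd_i * sd_r))
    return out

scores = [
    [1, 1, 1, 0],  # strong agent
    [1, 1, 0, 0],
    [0, 1, 0, 1],  # weaker agent, correct on item 4 by luck
    [0, 0, 0, 1],  # weakest agent, also correct on item 4
]
print(item_difficulty(scores))
print(item_discrimination(scores))
```

In this toy data, item 4 is answered correctly only by the weaker agents, so its discrimination comes out negative; a benchmark analysis in this spirit would flag such items as failing to measure the intended skill.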