Researchers have identified a critical blind spot in video multimodal large language models (MLLMs): they struggle to capture brief, momentary visual events that can be crucial for understanding a video. Such events often last only a few frames, so models that rely on sparse frame sampling can miss them entirely, leading to incorrect or incomplete interpretations. The study introduces Moment-Video, a diagnostic tool designed to assess the temporal fidelity of MLLMs on momentary visual events [1]. By evaluating MLLMs on brief, action-critical visual evidence, the researchers aim to improve the accuracy and reliability of video understanding models. This matters to practitioners because in applications such as surveillance, autonomous vehicles, and video analysis, a single missed momentary event can have serious consequences.
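To make the sparse-sampling problem concrete, the following is a minimal sketch of how often a short event falls entirely between sampled frames. It is not taken from the study. The function names are hypothetical, and the sampling scheme (one frame from the center of each equal-width segment) is an assumed, commonly used strategy rather than the method of any particular MLLM.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick one frame index from the center of each equal-width segment.

    Assumption: this mimics a typical sparse sampler; real models may differ.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], event_start: int, event_length: int) -> bool:
    """Return True if at least one sampled index lands inside the event."""
    event_frames = range(event_start, event_start + event_length)
    return any(i in event_frames for i in indices)


def miss_rate(total_frames: int, num_samples: int, event_length: int) -> float:
    """Fraction of possible event start positions that the sampler misses."""
    indices = uniform_sample_indices(total_frames, num_samples)
    starts = range(total_frames - event_length + 1)
    missed = sum(
        not event_is_sampled(indices, start, event_length) for start in starts
    )
    return missed / len(starts)


if __name__ == "__main__":
    # Illustrative numbers only: a 30 s clip at 30 fps, 8 sampled frames,
    # and an event that lasts 5 frames.
    print(f"{miss_rate(900, 8, 5):.3f}")
```

Under these illustrative numbers, a 5-frame event is missed for about 95.5% of its possible positions in the clip, which is the kind of gap a benchmark focused on momentary events is meant to expose. Sampling every frame drives the miss rate to zero, at a much higher compute cost.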