Multimodal Large Language Models (MLLMs) struggle with fine-grained visual understanding and often fail to answer questions that depend on small but crucial details in an image. This limitation is attributed to a regional-to-global perception gap: MLLMs perform better when given a specific, evidence-centered region of an image than when given the full image. To address this, researchers have proposed Vision-OPD, a method that uses on-policy self-distillation to improve MLLMs' perception of fine details. The approach lets MLLMs learn from their own generated outputs and adapt to complex visual scenarios [1]. By strengthening fine-grained visual understanding, Vision-OPD is relevant to applications that require precise image analysis. For practitioners, the work highlights the need to prioritize visual understanding in MLLM development to achieve more accurate and reliable performance.
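The summary does not spell out Vision-OPD's training objective, so the following is only a rough illustration of what an on-policy self-distillation signal can look like in general, not the authors' method. It assumes that the same model acts as "teacher" when it sees an evidence-centered crop and as "student" when it sees the full image, and that the student is pulled toward the teacher with a per-token reverse KL divergence computed on an answer sampled by the student. The function names, the choice of reverse KL, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def self_distillation_loss(student_logits, teacher_logits):
    """Average per-token reverse KL over one student-sampled answer.

    student_logits: per-token logits of the model given the FULL image.
    teacher_logits: per-token logits of the SAME model given a cropped,
        evidence-centered region (an assumption of this sketch), scored
        on the same tokens the student sampled.
    """
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy data: 2 token positions, vocabulary of 3 (made-up numbers).
full_image_logits = [[1.0, 1.0, 1.0], [2.0, 0.5, 0.1]]
cropped_logits = [[3.0, 0.2, 0.1], [2.0, 0.5, 0.1]]

loss = self_distillation_loss(full_image_logits, cropped_logits)
print(f"distillation loss: {loss:.4f}")
```

In this toy setup the loss is positive because the full-image distribution is uncertain at the first token while the cropped-view distribution is confident, and it falls to zero when the two views agree at every position. A real training loop would backpropagate such a loss through the student pass only, using a deep learning framework in place of these plain-Python helpers.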