Researchers have identified a significant limitation in existing visual token pruning methods: they fail to preserve crucial image features under dense instructions and fine-grained queries. The researchers attribute the failure to two causes: the dispersion of textual noise, which compromises dense cross-modal scoring, and redundant feature representations. To address both, they propose entropy-aware dense visual token pruning, which selectively prunes redundant image patches while preserving critical visual cues, with the aim of improving the efficiency and accuracy of vision-language models (VLMs). The approach is relevant to applications that rely on VLMs, such as image-text retrieval and visual question answering. For practitioners in computer vision and natural language processing, more robust and efficient VLMs make model deployment more effective and reliable [1].
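
To make the idea of entropy-aware scoring concrete, here is a minimal sketch in Python. It is not the authors' algorithm, whose details are not given here. It assumes a hypothetical matrix `sim[t][p]` of similarities between text token `t` and image patch `p`, and it assumes that a text token whose attention over patches is nearly uniform (high entropy) is noise and should contribute little to a patch's score. The function names and the weighting rule are illustrative only, and the sketch does not attempt the redundancy-removal part of the method.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy (in nats) of a probability distribution.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def prune_patches(sim, keep):
    """Return the sorted indices of the `keep` highest-scoring patches.

    sim[t][p] is a hypothetical text-token-to-patch similarity matrix.
    """
    num_patches = len(sim[0])
    max_entropy = math.log(num_patches)
    scores = [0.0] * num_patches
    for row in sim:
        attn = softmax(row)
        # Assumed weighting: close to 1 for a sharply focused text token,
        # close to 0 for a token that attends to all patches equally.
        if max_entropy > 0.0:
            weight = max(0.0, 1.0 - entropy(attn) / max_entropy)
        else:
            weight = 1.0
        for p, a in enumerate(attn):
            scores[p] += weight * a
    ranked = sorted(range(num_patches), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:keep])

# Toy input: token 0 focuses on patch 0, token 1 is uniform (treated as
# noise), and token 2 focuses on patch 2.
toy_sim = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]
kept = prune_patches(toy_sim, keep=2)
```

In this toy input the uniform token receives a weight of roughly zero, so the ranking is driven by the two focused tokens. A real system would operate on model attention or embedding tensors rather than nested lists.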