Vision-language models rely on two concurrent mechanisms to associate objects with their properties and spatial relations, a capability that underpins tasks such as image captioning and visual question answering. Recent work tracing these associations through the language model backbone and its intermediate representations has clarified how the two mechanisms operate [1]. Recognizing this interplay between the visual and language components gives developers a concrete handle for refining how their models encode object properties and spatial relationships, which in turn can improve performance on downstream multimodal tasks. For practitioners, the open question is how to translate this understanding of relational representations into better multimodal model design.
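
One common way to examine what intermediate representations encode is to train a lightweight probe on hidden states from the language backbone. The sketch below is a minimal, illustrative version of that idea: a linear classifier that predicts a spatial relation from per-token activations. It is not the method from the cited work; the hidden states are random placeholders, and names such as `HIDDEN_DIM` and the two-relation label set are assumptions for illustration.

```python
# Illustrative linear probe: given hidden-state vectors taken from a VLM's
# language backbone at an object-token position, predict which spatial
# relation (e.g., "left of" vs. "right of") is encoded. The features here are
# random placeholders; in practice they would come from a forward pass over
# real image-question pairs with hidden-state outputs enabled.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN_DIM = 768      # hypothetical backbone width
NUM_RELATIONS = 2     # hypothetical label set, e.g. {"left of", "right of"}
NUM_EXAMPLES = 512

# Placeholder activations standing in for an intermediate layer's hidden states
# (one vector per example) and the relation label for each example.
features = torch.randn(NUM_EXAMPLES, HIDDEN_DIM)
labels = torch.randint(0, NUM_RELATIONS, (NUM_EXAMPLES,))

# A linear probe keeps the analysis simple: if it reaches high accuracy on real
# activations, the relation is linearly decodable at that layer.
probe = nn.Linear(HIDDEN_DIM, NUM_RELATIONS)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()

accuracy = (probe(features).argmax(dim=-1) == labels).float().mean()
print(f"probe accuracy on placeholder data: {accuracy:.2f}")
```

On real activations, comparing probe accuracy across layers would indicate where in the backbone the object-relation association becomes decodable, which is the kind of evidence this line of work builds on.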