Researchers have introduced CLAY, a novel method for adaptive similarity computation in vision-language embedding spaces that enables conditional visual similarity modulation. The approach reframes the embedding space to incorporate multiple conditions simultaneously, so that similarity judgments can adapt to the condition at hand rather than relying on a single fixed notion of likeness. Unlike traditional image retrieval systems built on fixed metrics, CLAY can adapt to varying user interests and focuses, addressing a significant limitation of current systems: they often fail to capture the subjective, context-dependent nuances of human perception. The implications extend beyond computer vision, as the work can inform the development of more sophisticated, context-aware information retrieval systems in general. This matters to practitioners because it has the potential to significantly improve the accuracy and relevance of image retrieval results, particularly in applications where user context and intent play a crucial role.
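The source does not detail CLAY's actual mechanism, but the general idea of condition-dependent similarity can be sketched as follows. This is a minimal, hypothetical illustration (not CLAY's implementation): a condition reweights embedding dimensions before a cosine similarity is computed, so the same pair of images can score differently depending on whether, say, color or shape is the user's focus. The weight vectors and dimension assignments here are invented for the example.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def conditional_similarity(x: np.ndarray, y: np.ndarray,
                           condition_weights: np.ndarray) -> float:
    """Cosine similarity after the condition reweights each embedding
    dimension; sqrt so the weight applies once to the inner product."""
    w = np.sqrt(condition_weights)
    return cosine(x * w, y * w)

# Toy 4-d embeddings standing in for image features.
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)

# Hypothetical conditions: "color" emphasizes the first two dimensions,
# "shape" the last two.
color_w = np.array([1.0, 1.0, 0.1, 0.1])
shape_w = np.array([0.1, 0.1, 1.0, 1.0])

print(conditional_similarity(x, y, color_w))
print(conditional_similarity(x, y, shape_w))
```

Under this sketch, a fixed-metric retriever corresponds to a single constant weight vector, while a conditional one selects or predicts the weights from the user's stated interest.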