Evaluating large language models (LLMs) is difficult because traditional benchmark scores often fail to reflect real-world utility. Users therefore rely on informal, experience-based assessments, known as "vibe-testing," in which they compare models on tasks drawn from their own workflows. This approach, however, is typically unstructured and hard to reproduce at scale. Researchers have begun to study vibe-testing methods with the aim of formalizing and quantifying these evaluations [1]. Understanding how users vibe-test LLMs can help developers build models that better meet real-world needs. The lack of standardized evaluation methods has significant implications for LLM development and deployment: it can yield models that excel on benchmarks yet fail in practical applications. For practitioners, this underscores the need for more nuanced, user-centered evaluation approaches that test whether LLMs are genuinely effective in real-world scenarios.
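One way to make such vibe-testing more reproducible is to fix the prompt set and record model outputs side by side. The sketch below is a minimal illustration under stated assumptions: `query_model` is a hypothetical stand-in for whatever API or local runtime serves the models being compared, and the model names and prompts are placeholders that would come from the user's own workflow.

```python
import json
from datetime import datetime, timezone


def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to whatever API or
    local runtime serves the models being compared."""
    raise NotImplementedError


def run_vibe_test(models: list[str], prompts: list[str], out_path: str) -> None:
    """Send the same workflow-derived prompts to each model and log the
    outputs side by side, so the comparison can be re-run later."""
    records = []
    for prompt in prompts:
        record = {"prompt": prompt, "responses": {}}
        for model in models:
            record["responses"][model] = query_model(model, prompt)
        records.append(record)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(
            {"timestamp": datetime.now(timezone.utc).isoformat(), "results": records},
            f,
            indent=2,
        )


# Example usage with placeholder model names and prompts:
# run_vibe_test(
#     models=["model-a", "model-b"],
#     prompts=["Summarize this meeting transcript ...", "Refactor this SQL query ..."],
#     out_path="vibe_test_results.json",
# )
```

Even this small amount of structure (a fixed prompt set, logged outputs, a timestamp) turns a one-off impression into a comparison that can be repeated when a new model version ships.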