Multimodal large language models can give different answers when the same input is presented in a different order, which is a serious reliability concern. Researchers have developed Facet-Probe, an audit that measures order sensitivity across five facets: option, evidence-chunk, document-rank, image-set, and mixed-modality ordering [1]. The audit covers 18 state-of-the-art models and finds that they vary widely in how consistently they answer once the input order changes. Traditional benchmarks often present each item in a single, canonical ordering, so they cannot surface this failure mode. By exposing models to multiple permutations of the same input, Facet-Probe gives a more detailed picture of their strengths and limitations. For practitioners, the takeaway is that testing protocols should include reordered inputs before a multimodal model's outputs are treated as reliable.
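To make the idea of an order-sensitivity check concrete, here is a minimal sketch for the option-ordering facet only. It is not Facet-Probe's actual metric or code, which this summary does not describe. The `Model` interface, the function `option_order_consistency`, and both toy models are hypothetical names introduced for illustration. The score is simply the fraction of option orderings for which the answer matches the answer given under the original order.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# takes a question and an ordered list of options, returns the chosen option text.
Model = Callable[[str, Sequence[str]], str]


def option_order_consistency(model: Model, question: str, options: Sequence[str]) -> float:
    """Fraction of option orderings whose answer matches the original-order answer."""
    baseline = model(question, list(options))
    orderings = list(permutations(options))  # includes the original order itself
    agreeing = sum(model(question, list(order)) == baseline for order in orderings)
    return agreeing / len(orderings)


def first_option_model(question: str, options: Sequence[str]) -> str:
    # Toy stand-in with extreme position bias: always picks the first listed option.
    return options[0]


def content_only_model(question: str, options: Sequence[str]) -> str:
    # Toy stand-in that ignores position: picks the alphabetically smallest option.
    return min(options)


if __name__ == "__main__":
    opts = ["Paris", "London", "Rome"]
    question = "Which city is the capital of France?"
    print(option_order_consistency(first_option_model, question, opts))
    print(option_order_consistency(content_only_model, question, opts))
```

A position-biased model scores well below 1.0 here, while a model that answers from content alone scores exactly 1.0. Enumerating every permutation is only practical for short option lists. With longer inputs, such as evidence chunks or ranked documents, sampling a fixed number of random orderings would be the more realistic choice.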
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
Why This Matters
We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open
References
- [Author/Org]. (2026, June 24). Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models. *arXiv*. https://arxiv.org/abs/2606.26079v1
Original Source
arXiv ML