Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Multimodal large language models can give different answers when the same input is presented in a different order, which is a critical reliability concern. Researchers have developed Facet-Probe, an auditing tool that assesses the order sensitivity of these models across five facets: option, evidence-chunk, document-rank, image-set, and mixed-modality ordering¹. The audit evaluates 18 state-of-the-art models and reveals significant variability in their ability to keep outputs consistent when input order changes. The findings underscore the importance of evaluating multimodal models beyond traditional benchmarks, which often rely on a single, canonical ordering. By exposing models to diverse input permutations, Facet-Probe provides a more nuanced picture of their strengths and limitations. For practitioners, the research points to the need for more rigorous testing protocols to ensure that multimodal large language models are reliable and trustworthy.
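To make the idea of probing a model with reordered input concrete, here is a minimal Python sketch of one facet, option ordering. It is not Facet-Probe's actual implementation or metric: the function `order_consistency`, the `answer_fn` callable, and the two toy "models" are all hypothetical names introduced for illustration. The sketch assumes the model returns the text of the chosen option rather than a position label, so answers can be compared across orderings.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def order_consistency(
    answer_fn: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
    max_perms: int = 24,
) -> float:
    """Fraction of tested orderings that agree with the most common answer.

    answer_fn is a hypothetical stand-in for a model call; it must return
    the text of the chosen option, not its position or letter.
    """
    # Cap the number of permutations, since n! grows quickly.
    perms = itertools.islice(itertools.permutations(options), max_perms)
    answers = [answer_fn(question, list(p)) for p in perms]
    # Count how often the modal answer occurs across orderings.
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)


# Toy stand-ins for a real model (assumptions, for illustration only).
def position_biased(question: str, options: Sequence[str]) -> str:
    return options[0]  # always picks whatever is listed first


def order_invariant(question: str, options: Sequence[str]) -> str:
    return min(options)  # choice depends only on content, not position


if __name__ == "__main__":
    opts = ["Paris", "Rome", "Oslo"]
    print(order_consistency(order_invariant, "Capital of France?", opts))
    print(order_consistency(position_biased, "Capital of France?", opts))
```

A score of 1.0 means the answer never changed across the tested orderings; a lower score means the answer depended on position. The same loop could in principle be applied to the other facets by permuting evidence chunks, ranked documents, or images instead of options.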
