Calibration in large language models is typically assessed with metrics like Expected Calibration Error, which conflates two crucial aspects of confidence behavior: sensitivity and bias. Signal Detection Theory (SDT) provides a framework for decomposing these components, allowing a more nuanced evaluation of LLMs. By applying SDT-derived metrics, such as the area under the receiver operating characteristic curve, researchers can separately examine a model's ability to discriminate correct from incorrect answers (sensitivity) and its propensity toward confident or cautious responses (bias). This distinction is critical because it makes it possible to identify models that are overly confident or prone to false positives even when their discrimination ability is intact. The temperature-criterion analogy further refines this analysis, offering insights into the interplay between sensitivity and bias. This matters to practitioners: a deeper understanding of where an LLM's confidence fails, in discrimination or in response tendency, is essential for developing reliable and trustworthy AI systems.
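As a minimal sketch of this decomposition, the following illustrates how sensitivity and bias can be measured separately from a model's confidence scores and answer correctness. The function names, the 0.5 confidence threshold, and the toy data are illustrative assumptions, not part of the original text; the AUROC here is computed directly from its rank-statistic definition.

```python
def auroc(confidences, correct):
    """Sensitivity: probability that a randomly chosen correct answer
    receives higher confidence than a randomly chosen incorrect one
    (ties count as half)."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bias(confidences, threshold=0.5):
    """Bias: fraction of answers asserted confidently, regardless of
    correctness. Higher values indicate a more liberal criterion."""
    return sum(c >= threshold for c in confidences) / len(confidences)

# Hypothetical data: 1 = correct answer, 0 = incorrect answer.
correct = [1, 1, 1, 0, 0, 0]
model_a = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20]  # moderate confidence
model_b = [0.99, 0.98, 0.97, 0.94, 0.93, 0.92]  # uniformly confident

# Both models rank correct answers above incorrect ones perfectly,
# so their sensitivity (AUROC) is identical...
print(auroc(model_a, correct), auroc(model_b, correct))  # 1.0 1.0
# ...but model_b asserts every answer confidently, a pure bias
# difference that a single calibration score would blur together.
print(bias(model_a), bias(model_b))  # 0.5 1.0
```

The point of the toy data is that an aggregate metric like ECE would penalize model_b heavily while obscuring that its discrimination is perfect; the SDT view attributes the failure entirely to an overly liberal response criterion.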