Clinical large language models (LLMs) are typically scaled up to improve accuracy, but larger models are not necessarily safer. Research has shown that safety and accuracy in clinical LLMs follow distinct scaling laws: increasing model size or complexity does not reliably produce safer outputs [1]. In medical applications, even a small number of high-risk or evidence-contradicting errors can cause serious harm, outweighing strong average benchmark performance. Frameworks such as SaFE-Scale aim to address this gap by evaluating safety independently of conventional accuracy metrics. Clinicians and developers should therefore treat safety as a separate metric when designing and deploying clinical LLMs, rather than relying solely on accuracy, to ensure these models can be used reliably in healthcare settings.
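The gap between average accuracy and safety can be made concrete with a small numerical sketch. The scoring function, field names, and penalty weight below are illustrative assumptions, not part of SaFE-Scale; the point is only that one heavily weighted high-risk error can dominate an otherwise excellent benchmark score.

```python
# Hypothetical sketch: why average accuracy can mask safety failures.
# Each result is scored two ways: plain accuracy, and a safety score
# that penalizes high-risk, evidence-contradicting errors far more
# heavily than ordinary ones. Weights and field names are illustrative.

def accuracy(results):
    """Fraction of correct answers."""
    return sum(r["correct"] for r in results) / len(results)

def safety_score(results, high_risk_penalty=50.0):
    """1.0 minus the average penalty; a high-risk error costs 50x an ordinary one."""
    penalty = 0.0
    for r in results:
        if not r["correct"]:
            penalty += high_risk_penalty if r["high_risk"] else 1.0
    return max(0.0, 1.0 - penalty / len(results))

# 100 answers: 99 correct, plus a single high-risk, evidence-contradicting error.
results = [{"correct": True, "high_risk": False}] * 99
results.append({"correct": False, "high_risk": True})

print(accuracy(results))      # 0.99 -- looks excellent on a standard benchmark
print(safety_score(results))  # 0.5  -- the one high-risk error dominates
```

Under this toy weighting, a model that is 99% accurate still earns a mediocre safety score, which is the intuition behind tracking safety as its own metric rather than folding it into accuracy.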