Researchers have taken a step toward understanding how instructions shape the output of text-to-speech systems, particularly systems that use natural-language prompts to control voice characteristics. By adapting the DAAM (Diffusion Attentive Attribution Maps) framework to speech diffusion models, they developed a cross-attention attribution method that measures how much each word of the instruction influences the acoustic output. Such word-level attribution helps identify and address failure modes in expressive text-to-speech systems, and so improves their controllability. Separately, state-aligned activity involving diffusion models changes the threat landscape: the focus shifts from criminal to geopolitical threats, which call for a distinct approach [1]. For practitioners, this means adopting new strategies to mitigate these risks and to deploy such systems securely.
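
To make the cross-attention attribution idea more concrete, the following is a minimal sketch of DAAM-style aggregation, not the researchers' implementation. It assumes the cross-attention weights have already been captured from the model as arrays of shape (heads, frames, tokens), one per layer and denoising step, and that all of them share the same frame resolution. The function name `word_attribution` and the `token_to_word` mapping are hypothetical, introduced only for illustration.

```python
import numpy as np


def word_attribution(attn_maps, token_to_word, num_words):
    """Aggregate cross-attention into a (frames, words) attribution map.

    attn_maps: list of arrays shaped (heads, frames, tokens), one per
        (layer, denoising step) pair. Assumed to be captured beforehand
        and to share the same frame resolution.
    token_to_word: for each token index, the index of the word it belongs
        to, or -1 for tokens to ignore (padding, special tokens).
    num_words: number of words in the instruction.
    """
    if not attn_maps:
        raise ValueError("attn_maps must contain at least one array")

    # Average over heads, then accumulate over layers and denoising steps.
    total = np.zeros(attn_maps[0].shape[1:], dtype=float)
    for attn in attn_maps:
        total += attn.mean(axis=0)
    total /= len(attn_maps)

    # Merge sub-word tokens into words by summing their columns.
    word_map = np.zeros((total.shape[0], num_words), dtype=float)
    for token_idx, word_idx in enumerate(token_to_word):
        if word_idx >= 0:
            word_map[:, word_idx] += total[:, token_idx]
    return word_map


# Toy usage with random attention: 2 maps, 4 heads, 50 frames, 6 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 50, 6))
weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
demo_map = word_attribution(list(weights), [-1, 0, 0, 1, 2, -1], num_words=3)
per_word_score = demo_map.sum(axis=0)  # one influence score per word
```

Summing the resulting map over frames gives one influence score per word, while keeping the frame axis shows where in the utterance a word has its effect. In a real system, maps taken at different temporal resolutions would need to be resampled to a common length before being summed.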