Fine-tuning large language models (LLMs) for text-to-speech (TTS) systems can improve voice consistency and signal-to-noise ratio, but its success depends on data diversity and mixed training methods. Experiments have shown that fine-tuning the language-model backbone of a TTS system can enhance speaker-specific acoustic and perceptual characteristics, whereas frozen LLM representations are insufficient on their own. Data diversity is critical during fine-tuning because it lets the model generalize to different speaking styles and acoustic conditions. Mixed training methods, which combine multiple datasets and training approaches, further improve the robustness of the fine-tuned model. These findings have significant implications for building more realistic and consistent voice-cloning technologies. For practitioners, understanding both the limitations and the potential of fine-tuning LLMs for TTS can inform the design of more effective and secure voice synthesis systems.
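
The mixed-training idea above can be sketched as weighted sampling over several datasets, so that every batch blends speaker-specific clips with more diverse multi-speaker data. This is a minimal illustration, not the method from the experiments described; the function name, dataset layout, and mixing weights are all hypothetical.

```python
import random

def mixed_batches(datasets, weights, batch_size, num_batches, seed=0):
    """Yield training batches drawn from several datasets with fixed mixing weights.

    `datasets` maps a dataset name to a list of examples; `weights` gives each
    dataset's sampling probability. Blending a target-speaker corpus with a
    diverse corpus in every batch is one simple way to keep fine-tuning from
    overfitting to a single voice or recording condition.
    """
    rng = random.Random(seed)  # fixed seed for reproducible batch composition
    names = list(datasets)
    probs = [weights[n] for n in names]
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # Pick a dataset according to the mixing weights, then an example from it.
            name = rng.choices(names, weights=probs)[0]
            batch.append((name, rng.choice(datasets[name])))
        yield batch

# Example: roughly 70% target-speaker clips, 30% multi-speaker clips per batch.
data = {"target": ["t1", "t2"], "multi": ["m1", "m2", "m3"]}
batches = list(mixed_batches(data, {"target": 0.7, "multi": 0.3}, batch_size=8, num_batches=10))
```

In a real fine-tuning run the examples would be audio-text pairs and the weights a tuning knob traded off against how much speaker-specific data is available.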