Evaluation of Text-to-Audio-Video (T2AV) generation has been hindered by fragmented benchmarks, which often assess audio and video separately or rely on coarse metrics. To address this, researchers have introduced AVGen-Bench, a task-driven benchmark for multi-granular evaluation of T2AV generation [1]. The benchmark aims to capture the fine-grained joint correctness that realistic prompts require, rather than scoring each modality in isolation. Because it is organized around tasks, it can assess model performance on specific aspects, such as audio-video alignment, alongside overall media quality. For practitioners, the value is diagnostic: a benchmark that shows where generated audio and video fail to match the prompt or each other can guide improvements in accuracy and realism, which benefits applications from entertainment to education.