Researchers have made significant strides in developing multimodal large language models that can understand and generate text and 2D images, but extending this capability to 3D remains challenging because high-quality 3D data is scarce. The scarcity of 3D assets relative to 2D imagery leaves 3D synthesis under-constrained, which has pushed practitioners toward indirect pipelines that edit in 2D and then lift the results to 3D. A new approach, Omni123, aims to address this issue by unifying text-to-2D and text-to-3D generation, enabling 3D-native foundation models to be trained with limited 3D data [1]. Because the two modalities share one model, abundant 2D supervision can help compensate for scarce 3D supervision, with the potential to improve quality on 3D synthesis and generation tasks. These capabilities matter for applications such as computer-aided design, robotics, and virtual reality; for practitioners, the central question is whether Omni123 can overcome the limitations of today's indirect 3D modeling pipelines.
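To make the contrast between the two routes concrete, here is a minimal Python sketch. It is illustrative only: this text does not describe Omni123's actual interfaces, so every name below (`edit_in_2d`, `lift_to_3d`, `unified_generate`, and the data types) is a hypothetical placeholder.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative sketch only: all names and types here are hypothetical
# placeholders, not Omni123's real API.

@dataclass
class Image2D:
    pixels: list = field(default_factory=list)  # stand-in for raster data

@dataclass
class Asset3D:
    vertices: list = field(default_factory=list)  # stand-in for geometry

def edit_in_2d(prompt: str, image: Image2D) -> Image2D:
    """Hypothetical 2D editing step (e.g. a text-guided image editor)."""
    return image  # stub: returns the input unchanged

def lift_to_3d(image: Image2D) -> Asset3D:
    """Hypothetical lifting step (e.g. single-view 3D reconstruction)."""
    return Asset3D()  # stub

def indirect_pipeline(prompt: str, image: Image2D) -> Asset3D:
    """The indirect route described above: edit in 2D, then lift to 3D.
    Any error introduced by the 2D stage propagates into the 3D result."""
    return lift_to_3d(edit_in_2d(prompt, image))

def unified_generate(prompt: str) -> tuple[Image2D, Asset3D]:
    """A unified text-to-2D-and-3D model in the spirit of Omni123: one
    model emits both modalities, so abundant 2D supervision can help
    compensate for scarce 3D training data."""
    return Image2D(), Asset3D()  # stub

if __name__ == "__main__":
    # Same prompt, two routes: the two-stage pipeline vs. the unified model.
    prompt = "a wooden chair with curved armrests"
    mesh_a = indirect_pipeline(prompt, Image2D())
    image_b, mesh_b = unified_generate(prompt)
```

The point of the sketch is structural rather than algorithmic: the indirect route composes two stages that can each lose information, while the unified route produces 2D and 3D outputs from a single model.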