Researchers have made a breakthrough in optimizing recommendation models by leveraging low-precision kernels, specifically FP8 arithmetic, which recent GPU generations support in hardware. This innovation can significantly boost performance in large recommendation models (LRMs) by raising the achievable FLOP throughput. Unlike large language models, LRMs are more numerically sensitive and rely heavily on small matrix multiplications and normalization, which makes them harder to optimize with low-precision arithmetic. The proposed approach, LoKA, addresses these limitations by adapting low-precision arithmetic to the specific requirements of LRMs. In doing so, LoKA enables efficient training of LRMs on modern GPUs, which is critical for scaling these models to the demands of real-world applications. For practitioners, this can translate into substantial performance gains and shorter training times for recommendation models, improving the overall efficiency of AI-powered systems.
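To make the low-precision pattern concrete, the sketch below simulates the widely used per-tensor-scaled FP8 (E4M3) matrix-multiplication recipe in NumPy. It illustrates the general technique that FP8 kernels build on, not LoKA's specific kernels; the function names, constants, and error check are illustrative assumptions.

```python
# Minimal sketch (assumptions, not LoKA itself) of per-tensor-scaled FP8 matmul:
# scale each operand into the E4M3 dynamic range, round to a simulated E4M3 grid,
# multiply, then undo the scales. Subnormals and stochastic rounding are ignored.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest normal magnitude representable in E4M3

def quantize_e4m3(x: np.ndarray):
    """Scale a tensor into FP8 range and round it to a simulated E4M3 grid."""
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX + 1e-12   # per-tensor scale
    y = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Simulate the 3-bit mantissa: round to the nearest multiple of 2**(e - 3).
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-30)))
    step = 2.0 ** (exp - 3)
    y_q = np.round(y / step) * step
    return y_q, scale

def fp8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Quantize both operands, multiply in full precision, and undo the scales."""
    a_q, sa = quantize_e4m3(a)
    b_q, sb = quantize_e4m3(b)
    return (a_q @ b_q) * (sa * sb)

# Small matrices, typical of recommendation-model layers, expose the accuracy cost.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 32)), rng.normal(size=(32, 16))
err = np.abs(fp8_matmul(a, b) - a @ b).max()
print(f"max abs error vs full-precision matmul: {err:.4f}")
```

Running the example highlights why LRMs are a harder target than LLMs: with small matrices there is less arithmetic over which quantization error can average out, so the scaling and rounding strategy matters more.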