Policy gradient methods are prone to converging to suboptimal critical points when applied to restricted policy classes, largely because they are myopic: each update optimizes only a one-step improvement over the current policy. To address this limitation, researchers have proposed a generalized $k$-step policy gradient method [1]. Instead of evaluating actions with the usual one-step $Q$-function, the method evaluates them with a $k$-step $Q$-function, so the policy is optimized against longer-term outcomes rather than immediate rewards alone; this lookahead lets the update escape local optima that trap the one-step method and converge to better solutions. Restricted policy classes are the key setting here, since real-world applications frequently impose structural constraints that the policy must satisfy. For practitioners, the upshot is more robust and efficient policy optimization, and hence better-performing agents in complex, constrained environments.
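To make the idea concrete, here is a minimal tabular sketch of one such update. It assumes a small MDP given as explicit transition and reward arrays, defines a $k$-step $Q$-function by bootstrapping the current policy's value function after $k-1$ greedy Bellman backups (one common way multi-step lookahead is formalized; the paper's exact construction may differ), and takes a softmax policy gradient step against it. All names here (`k_step_q`, `policy_gradient_step`, and so on) are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative tabular sketch of a k-step policy gradient update.
# P has shape (S, A, S), R has shape (S, A); all names are hypothetical.

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: solve V = r_pi + gamma * P_pi V."""
    S = R.shape[0]
    r_pi = (pi * R).sum(axis=1)                      # expected reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state kernel under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def k_step_q(P, R, v_pi, gamma, k):
    """Q_k(s, a): take (s, a), apply k-1 greedy (max) Bellman backups,
    then bootstrap with V^pi. k = 1 recovers the ordinary Q^pi."""
    v = v_pi.copy()
    for _ in range(k - 1):                           # Bellman optimality backups
        v = (R + gamma * P @ v).max(axis=1)
    return R + gamma * P @ v                         # final Q-shaped backup

def softmax(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def policy_gradient_step(theta, P, R, gamma, k, lr=0.5):
    """One semi-gradient ascent step on E_s[ sum_a pi(a|s) Q_k(s, a) ],
    treating Q_k as fixed (uniform state weighting for simplicity)."""
    pi = softmax(theta)
    v_pi = evaluate_policy(P, R, pi, gamma)
    q_k = k_step_q(P, R, v_pi, gamma, k)
    adv = q_k - (pi * q_k).sum(axis=1, keepdims=True)  # k-step advantage
    return theta + lr * pi * adv                       # softmax policy gradient

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 4, 2, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition kernel
    R = rng.uniform(size=(S, A))                     # random rewards
    theta = np.zeros((S, A))
    for _ in range(200):
        theta = policy_gradient_step(theta, P, R, gamma, k=3)
    print("V^pi:", evaluate_policy(P, R, softmax(theta), gamma))
```

With `k=1` the sketch reduces to the ordinary one-step policy gradient, so $k$ acts as a direct knob trading extra planning computation per update for a better chance of escaping local optima under a restricted (here, softmax-tabular) policy class.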