The European Union's AI Act demands "appropriate accuracy" in automated legal reasoning, but current benchmarks fail to assess a crucial aspect: doctrinal legal reasoning. The gap exists because existing evaluations focus on ancillary tasks rather than the core interpretive work of legal professionals. Large language models can now produce legal text of median quality, but it remains unclear whether they understand the underlying legal principles. Without a suitable benchmark, it is difficult to develop reliable automated legal reasoning systems, and this has significant security implications. As large language models become more prevalent, the risk of inaccurate or misleading legal interpretations grows, underscoring the need for more comprehensive evaluation methods.¹ For practitioners, the oversight can lead to flawed decision-making and undermine trust in AI-driven legal tools, ultimately compromising the integrity of the legal system.