[State of Code Evals] After SWE-bench: CodeClash & SOTA Coding Benchmarks - John Yang (Latent Space Recap)

Podcast: Latent Space

Published: 2025-12-31

Duration: 18 minutes

Guests: John Yang

Summary

John Yang discusses how SWE-bench evolved into an industry-standard benchmark for AI coding agents and introduces CodeClash, a new benchmark focused on long-horizon software development.

What Happened

John Yang traces how SWE-bench went from obscurity to the de facto standard for AI coding evaluations after Cognition launched Devin, a moment that kicked off an arms race among major AI labs to build better coding agents. He also describes how the benchmark grew beyond its Django-heavy origins to cover nine languages across 40 repositories, an evolution reflected in the SWE-bench Multimodal and Multilingual variants.

A significant stretch of the conversation covers the limits of unit tests as verification in AI coding evaluations: tests only confirm the behaviors they encode, so a patch can pass without genuinely solving the underlying problem. John suggests that long-running agent tournaments, like those in CodeClash, may be the way forward. In these tournaments, AI agents maintain and improve their own codebases over multiple rounds, competing in arenas that range from programming games to economic tasks.
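
To make the tournament format concrete, here is a minimal sketch of a round-based loop in that spirit. It is an illustration only: the class names, methods, and scoring below are assumptions made for this recap, not CodeClash's actual harness or API.

    from dataclasses import dataclass, field

    @dataclass
    class Agent:
        # Hypothetical stand-in for an LLM coding agent; not CodeClash's API.
        name: str
        codebase: dict[str, str] = field(default_factory=dict)  # path -> source

        def improve(self, feedback: list[str]) -> None:
            # In the real setting an LLM agent would edit its own codebase
            # here, using match logs from earlier rounds as feedback.
            pass

    class Arena:
        """A competition environment, e.g. a programming game or market sim."""

        def run_match(self, a: Agent, b: Agent) -> tuple[float, float, str]:
            # Placeholder: a real arena would execute both codebases and
            # score the outcome. Here every match is a draw.
            return 0.5, 0.5, f"{a.name} vs {b.name}: draw"

    def tournament(agents: list[Agent], arena: Arena, rounds: int) -> dict[str, float]:
        scores = {a.name: 0.0 for a in agents}
        feedback: dict[str, list[str]] = {a.name: [] for a in agents}
        for _ in range(rounds):
            # Each round, agents first revise their codebases...
            for agent in agents:
                agent.improve(feedback[agent.name])
            # ...then every pair competes and accumulates score.
            for i, a in enumerate(agents):
                for b in agents[i + 1:]:
                    sa, sb, log = arena.run_match(a, b)
                    scores[a.name] += sa
                    scores[b.name] += sb
                    feedback[a.name].append(log)
                    feedback[b.name].append(log)
        return scores

The key property, compared with a one-shot unit-test check, is that each agent's codebase persists and compounds across rounds, so sloppy engineering eventually costs it matches.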

John also addresses the proliferation of SWE-bench variants such as SWE-bench Pro and SWE-bench Live. Although many were created independently of his team, he supports their contributions to the field, while noting the controversy around benchmarks that adopted the SWE-bench name without prior approval.

John weighs in on the Tau-bench controversy, in which some tasks turned out to be impossible to solve. Rather than a flaw, he sees deliberately impossible tasks as a feature: any model that claims to solve them has likely gamed the evaluation, so including them helps benchmarks flag cheating, maintain integrity, and keep challenging AI models.
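
The mechanics of that integrity check are easy to sketch: seed the benchmark with tasks known to have no valid solution, then treat any claimed success on them as a red flag. The task IDs and result format below are hypothetical, not Tau-bench's actual schema.

    # Seeded tasks known to be unsolvable (illustrative IDs).
    IMPOSSIBLE_TASKS = {"task_017", "task_042"}

    def audit(results: dict[str, bool]) -> list[str]:
        """Return the impossible tasks a model claims to have solved.

        A non-empty list suggests the harness leaked answers or the
        model gamed the checker, since these tasks have no valid fix.
        """
        return sorted(t for t in IMPOSSIBLE_TASKS if results.get(t, False))

    flags = audit({"task_001": True, "task_017": True})
    assert flags == ["task_017"]  # worth investigating before reporting scores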

The episode delves into Cognition's research focus on codebase understanding and automatic context engineering for LLMs, emphasizing the importance of human-AI collaboration. John envisions CodeClash as a testbed for exploring various human-AI collaboration setups, from solo agents to multi-agent teams.
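
For a feel of what "automatic context engineering" can mean in practice, here is a deliberately naive sketch that ranks repository files by lexical overlap with a task description and packs them into a fixed budget. It is a strawman baseline for illustration only, not Cognition's method, which presumably goes well beyond keyword matching.

    from pathlib import Path

    def score(query: str, text: str) -> int:
        # Crude relevance signal: count query words present in the file.
        words = set(query.lower().split())
        return sum(1 for w in words if w in text.lower())

    def build_context(repo: Path, query: str, budget_chars: int = 20_000) -> str:
        # Rank every Python file by relevance, then greedily pack the
        # best-scoring ones into the character budget.
        files = [(p, p.read_text(errors="ignore")) for p in repo.rglob("*.py")]
        ranked = sorted(files, key=lambda ft: score(query, ft[1]), reverse=True)
        picked, used = [], 0
        for path, text in ranked:
            if used + len(text) > budget_chars:
                continue  # skip files that would blow the budget
            picked.append(f"# --- {path} ---\n{text}")
            used += len(text)
        return "\n\n".join(picked)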

Finally, John shares his vision for the future of code evaluations, advocating a balance between long-horizon autonomy and interactivity. He sees promise in letting developers work at different levels of abstraction, freeing them to focus on more creative and complex tasks.

Key Insights