The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI - Latent Space Recap

Podcast: Latent Space

Published: 2026-02-06

Duration: 1 hr 8 min

Guests: Myra Deng, Mark Bissell

Summary

Goodfire AI is pioneering the use of mechanistic interpretability to address fundamental flaws in the AI development lifecycle. By building bi-directional interfaces between humans and models, the team aims to make AI behavior both understandable and directly controllable.

What Happened

Goodfire AI, led by Mark Bissell and Myra Deng, is at the forefront of making mechanistic interpretability a practical tool in AI development. The company recently raised a $150 million Series B at a $1.25 billion valuation, underscoring its growing role in the industry. Goodfire treats interpretability not just as a post-training diagnostic but as a foundational part of model development: the team builds lightweight probes and token-level safety filters that run with near-zero latency, enabling real-time monitoring and adjustment of trillion-parameter models like Kimi K2.
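
To make the probe idea concrete, here is a minimal sketch of a token-level linear probe that reads a host model's hidden states. The hidden size, layer choice, and label set are assumptions for illustration, not Goodfire's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen for illustration only.
HIDDEN_SIZE = 4096   # hidden dimension of the host model
NUM_LABELS = 2       # e.g. {safe, unsafe} per token

class TokenProbe(nn.Module):
    """A linear classifier read off intermediate activations.

    Because it is a single matrix multiply per token, it adds almost
    no latency on top of the forward pass the model already performs.
    """
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size), captured from an
        # intermediate layer of the host model, e.g. via a forward hook.
        return self.classifier(hidden_states)  # (batch, seq_len, num_labels)

# Score a batch of activations (random stand-ins for real captures).
probe = TokenProbe(HIDDEN_SIZE, NUM_LABELS)
activations = torch.randn(1, 16, HIDDEN_SIZE)
token_scores = probe(activations).softmax(dim=-1)
flagged = token_scores[..., 1] > 0.9  # boolean mask of high-risk tokens
```

Because the probe is just one linear layer evaluated on activations the model computes anyway, it can run on every token of every request without a separate guardrail model in the loop.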

Mark and Myra argue that the current AI lifecycle is broken: developers pour in training data with no reliable way to verify that models have actually learned the intended behaviors. Goodfire's answer is a bi-directional interface between humans and models, one that supports precise, surgical edits to remove unwanted behaviors and biases. This contrasts with traditional black-box methods, yielding models that are more transparent and controllable.
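
One common mechanism for this kind of surgical edit in the interpretability literature is projecting a learned feature direction out of a layer's activations. The sketch below illustrates that general technique with a PyTorch forward hook; the feature direction here is random, and this is an assumed illustration rather than Goodfire's exact method.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Build a forward hook that removes one feature direction.

    In practice the direction would come from a trained probe or a
    sparse autoencoder feature, not random initialization.
    """
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden_size) residual-stream activations.
        coeff = output @ unit                      # projection strength per token
        return output - coeff.unsqueeze(-1) * unit # activations minus the feature
    return hook

# Standalone check with random data (no model needed):
feature_dir = torch.randn(4096)
acts = torch.randn(2, 8, 4096)
edited = make_ablation_hook(feature_dir)(None, None, acts)
# Projection onto the ablated direction is now ~0 for every token.
print((edited @ (feature_dir / feature_dir.norm())).abs().max())

# Usage sketch on a real model (layer index is hypothetical):
# handle = model.layers[20].register_forward_hook(make_ablation_hook(feature_dir))
# ... generate with the behavior suppressed ...
# handle.remove()
```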

One of the most notable deployments of Goodfire's technology is with Rakuten, where its interpretability tools perform real-time PII detection across multiple languages without ever being trained on actual customer data, a practical demonstration of interpretability in a high-stakes production environment.
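
As a rough illustration of how token-level probe scores could drive redaction at inference time, consider the sketch below. The tokenization granularity, scores, and threshold are all hypothetical.

```python
# Map per-token PII scores (e.g. from a probe like TokenProbe above)
# to a redacted output string. Threshold is an illustrative choice.
def redact(tokens: list[str], pii_scores: list[float],
           threshold: float = 0.8) -> str:
    return " ".join(
        "[PII]" if score > threshold else tok
        for tok, score in zip(tokens, pii_scores)
    )

tokens = ["Contact", "me", "at", "555-0199", "after", "5pm"]
scores = [0.01, 0.02, 0.03, 0.97, 0.02, 0.01]  # hypothetical probe outputs
print(redact(tokens, scores))  # -> Contact me at [PII] after 5pm
```

Because a probe classifies in activation space rather than over raw text, it can plausibly be trained on synthetic examples and still generalize across languages, which is consistent with the claim that no actual customer data is needed for training.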

The episode also covers the operational economics of interpretability: because probes read the model's internal activations directly, classifying a token costs a tiny fraction of an extra LLM call. This lets Goodfire's approach replace far more expensive LLM-based oversight, sharply reducing the compute that other guardrail methods typically require.
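
A back-of-envelope comparison shows why. A dense transformer forward pass costs roughly 2 × n_params FLOPs per token, while a linear probe costs roughly 2 × hidden_size FLOPs per token; the model sizes below are hypothetical.

```python
# All numbers are illustrative assumptions, not figures from the episode.
judge_params = 7e9   # a separate 7B-parameter guardrail/judge model
hidden_size = 4096   # probe input dimension on the host model

llm_judge_flops_per_token = 2 * judge_params
probe_flops_per_token = 2 * hidden_size

print(f"LLM judge: {llm_judge_flops_per_token:.1e} FLOPs/token")
print(f"Probe:     {probe_flops_per_token:.1e} FLOPs/token")
print(f"Ratio:     {llm_judge_flops_per_token / probe_flops_per_token:,.0f}x")
# -> roughly a million-fold difference under these assumptions
```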

Goodfire's work extends beyond language models, as they explore applications in genomics and medical imaging. Their interpretability techniques are being used to debug AI models and extract valuable scientific insights, accelerating discoveries in fields like healthcare.

The conversation also touches on the philosophical and theoretical side of interpretability, referencing sci-fi author Ted Chiang's work to illustrate the idea of AI models that can analyze themselves. This connects to Goodfire's vision of intentional model design, in which experts directly impart goals and constraints rather than relying solely on what emerges from the training data.

Key Insights