Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah Hill-Smith - Latent Space Recap

Podcast: Latent Space

Published: 2026-01-08

Duration: 1 hr 18 min

Guests: George Cameron, Micah Hill-Smith

Summary

Artificial Analysis has emerged as an independent gold standard for AI benchmarking, providing comprehensive evaluations of both open and closed AI models. The episode explores how they maintain objectivity in benchmarking and covers their metrics such as the Omniscience Index and the Openness Index.

What Happened

Artificial Analysis, co-founded by George Cameron and Micah Hill-Smith, started as a side project in a Sydney basement and has grown into a leading independent AI benchmark platform. Initially gaining attention through a retweet by Swyx, the company is now a trusted resource for developers, enterprises, and major labs navigating the AI model landscape. They offer free data to aid decision-making in AI model selection and have built a business providing enterprise and private benchmarking insights. Their mystery-shopper policy ensures unbiased model evaluations by preventing labs from manipulating results through private endpoints.

To address the challenges of AI model benchmarking, Artificial Analysis developed the Intelligence Index, which synthesizes results from multiple evaluation datasets into a single score. This index, now in its third version, helps users understand which models are best suited for specific use cases. They also introduced the Omniscience Index, which measures a model's tendency to hallucinate: scores range from -100 to +100, penalizing hallucinated answers while crediting correct answers and abstentions. The Claude models from Anthropic currently lead with the lowest hallucination rates.
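The core idea of the hallucination index described above is that declining to answer is scored better than answering wrongly. A minimal sketch of such a score follows; the exact weighting (+1 correct, 0 abstain, -1 hallucinated) is an illustrative assumption, not the published formula:

```python
def hallucination_index(outcomes: list[str]) -> float:
    """Toy per-question scoring, scaled to the [-100, +100] range.

    Each outcome is one of:
      "correct"      - model answered correctly      -> +1
      "abstain"      - model declined when unsure    ->  0 (no penalty)
      "hallucinated" - model answered incorrectly    -> -1

    These weights are illustrative; the episode only states the range
    and that correct abstentions are rewarded relative to guessing.
    """
    points = {"correct": 1, "abstain": 0, "hallucinated": -1}
    total = sum(points[o] for o in outcomes)
    return 100 * total / len(outcomes)

# A model that abstains when unsure outscores one that guesses wrong,
# even though both answer the same 6 of 10 questions correctly:
cautious = ["correct"] * 6 + ["abstain"] * 4       # -> 60.0
reckless = ["correct"] * 6 + ["hallucinated"] * 4  # -> 20.0
```

Under this scheme a model can score negatively overall if it hallucinates more often than it answers correctly, which matches the stated -100 floor.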

The company runs GDPval-AA, an agentic benchmark of real work tasks, to further assess AI models. This benchmark uses the Stirrup agent harness to carry out tasks drawn from 44 white-collar occupations, employing Gemini 3 Pro as an LLM judge. Their evaluations have revealed that models are improving in token and turn efficiency, with significant cost reductions in achieving GPT-4-level intelligence.
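The pipeline described (an agent harness executes each work task, then a separate LLM grades the output) can be sketched roughly as follows. The function names, the `Task` shape, and the keyword-matching judge stub are illustrative stand-ins, not the Stirrup or GDPval-AA code; in the described setup, the judge role is played by Gemini 3 Pro rather than a keyword check:

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str         # the work task given to the agent
    rubric: list[str]   # criteria the judge checks the output against

def run_agent(task: Task) -> str:
    """Stand-in for an agent harness run (tool calls, multiple turns)."""
    return f"Draft deliverable for: {task.prompt}"

def judge(output: str, rubric: list[str]) -> float:
    """Stand-in for an LLM judge; here, a trivial keyword check
    returning the fraction of rubric criteria found in the output."""
    hits = sum(1 for criterion in rubric if criterion.lower() in output.lower())
    return hits / len(rubric)

def evaluate(tasks: list[Task]) -> float:
    """Mean judge score across all tasks, as a fraction in [0, 1]."""
    return sum(judge(run_agent(t), t.rubric) for t in tasks) / len(tasks)
```

Separating the agent that does the work from the model that grades it is what makes this an "LLM as judge" design: the judge sees only the finished deliverable, not the agent's internal steps.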

Artificial Analysis is also known for its Openness Index, which scores models on transparency concerning pre-training data, methodology, and licensing. Ai2's OLMo 2 currently leads in this area, reflecting the industry's push towards open-source contributions. The company anticipates further developments in AI model sparsity and efficiency, with rumors suggesting future frontier models could reach trillions of parameters.
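An openness score of this kind can be thought of as a weighted checklist across the dimensions mentioned (pre-training data, methodology, licensing). The criteria names and weights below are illustrative assumptions, not the published rubric:

```python
# Illustrative openness checklist; the real Openness Index criteria
# and weights are not spelled out in the episode.
CRITERIA = {
    "weights_released": 30,
    "training_data_disclosed": 25,
    "training_code_released": 25,
    "permissive_license": 20,
}

def openness_score(model: dict) -> int:
    """Sum the weights of the criteria a model satisfies (0-100)."""
    return sum(w for name, w in CRITERIA.items() if model.get(name))

# A fully documented release (OLMo-style) tops a weights-only release:
fully_open = {name: True for name in CRITERIA}
weights_only = {"weights_released": True}
```

The interesting property of such a rubric is that "open weights" alone earns only partial credit, which is why a fully documented release like OLMo can outscore models that merely publish weights.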

The episode also highlights the cost dynamics in AI, noting that while a given level of intelligence has become significantly cheaper, the demand for reasoning models in agentic workflows, which consume far more tokens per task, has pushed total costs up. Despite these challenges, Artificial Analysis remains committed to providing independent and comprehensive evaluations that help steer the AI industry's development.

Looking ahead, Artificial Analysis plans to release version 4 of the Intelligence Index, which will incorporate new metrics like hallucination rate and agentic performance, reflecting the evolving landscape of AI capabilities.

Key Insights