[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI - Latent Space Recap

Podcast: Latent Space

Published: 2025-12-31

Duration: 28 minutes

Guests: Josh McGrath

Summary

Josh McGrath from OpenAI discusses the evolution of AI post-training, focusing on RLVR, token efficiency, and the significance of data quality over mere optimization techniques. He highlights the complexities of reinforcement learning compared to pre-training and the future of AI in handling long contexts and personality customization.

What Happened

Josh McGrath details his journey from pre-training data curation to becoming a post-training researcher at OpenAI, where he has contributed to the development of models like GPT-4o and GPT-5. He discusses the shift from PPO to RLVR, emphasizing that both are policy gradient methods; what differs is the quality of the input data, which he considers crucial to how far the resulting model's performance can be trusted.
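To make the "same method, different input" point concrete, here is a minimal sketch of a policy gradient update (plain REINFORCE on a toy softmax policy) where the reward comes from a programmatic check rather than a learned scorer. This is an illustration only, not OpenAI's training setup: the function names `verifiable_reward`, `reinforce_step`, and `train`, the candidate answers, and the learning rate are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifiable_reward(answer, expected):
    """Hypothetical verifier: 1.0 if the answer passes an exact check, else 0.0."""
    return 1.0 if answer == expected else 0.0


def reinforce_step(logits, candidates, reward_fn, lr=0.5, rng=random):
    """One REINFORCE update on a softmax policy over a fixed candidate list."""
    probs = softmax(logits)
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    reward = reward_fn(candidates[idx])
    # For a softmax policy, d log pi(idx) / d logit_i = one_hot(idx)_i - probs_i.
    return [
        logit + lr * reward * ((1.0 if i == idx else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(candidates, expected, steps=200, seed=0):
    """Run repeated updates; only the reward function knows what is correct."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        logits = reinforce_step(
            logits,
            candidates,
            lambda answer: verifiable_reward(answer, expected),
            rng=rng,
        )
    return softmax(logits)
```

Calling `train(["3", "4", "5"], expected="4")` shifts probability mass toward the checkable answer. Swapping `verifiable_reward` for a noisier scoring function would leave `reinforce_step` unchanged, which is the sense in which the update rule stays the same while the input signal differs.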

McGrath underscores the importance of token efficiency over wall-clock time, noting that the upgrade from GPT-5 to GPT-5.1 improved evaluation results while reducing token usage. Fewer tokens per task leave more room for tool calls and make agent workflows more efficient, which in turn shapes how models are developed and deployed.
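One way to read "better evaluations with fewer tokens" is as a comparison on two axes at once. The sketch below assumes a hypothetical `EvalRun` record and two helper functions; none of these names or fields come from the episode, and any numbers passed in would be placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Hypothetical summary of one model's run over an eval suite."""
    name: str
    score: float       # fraction of eval tasks passed, between 0 and 1
    total_tokens: int  # tokens generated across the whole suite


def score_per_kilotoken(run: EvalRun) -> float:
    """Eval score earned per 1,000 generated tokens."""
    return run.score / (run.total_tokens / 1000)


def dominates(new: EvalRun, old: EvalRun) -> bool:
    """True if `new` is no worse on both axes and strictly better on at least one."""
    no_worse = new.score >= old.score and new.total_tokens <= old.total_tokens
    strictly_better = new.score > old.score or new.total_tokens < old.total_tokens
    return no_worse and strictly_better
```

A run that scores higher while using fewer tokens satisfies `dominates`, which matches the kind of improvement described above. Wall-clock time does not appear in the comparison, since it also depends on serving hardware and load.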

The episode explores the infrastructure challenges of reinforcement learning, which involves more complex moving parts than pre-training. McGrath describes the chaotic nature of scaling RL, where tasks, grading setups, and external partners add layers of complexity, often requiring late-night interventions and unfamiliar code adjustments.

Codex has notably transformed McGrath's workflow, compressing 40-minute design sessions into 15-minute agent sprints. He shares the peculiar feeling of waiting for agents to complete tasks, highlighting the efficiencies and dependencies introduced by such AI advancements.

OpenAI's shopping model serves as a test bed for interruptibility and chain-of-thought transparency, offering users a way to refine search results interactively. McGrath also mentions the significance of personality toggles, like Anton versus Clippy, which provide users with the ability to customize the AI's persona according to their preferences.

McGrath expresses concern over the educational system's failure to produce individuals skilled in both distributed systems and machine learning. He believes this skill set is critical for advancing AI technologies, as the bottleneck in development shifts frequently, requiring adaptable and broadly skilled professionals.

Key Insights

The transition from Proximal Policy Optimization (PPO) to Reinforcement Learning with Verifiable Rewards (RLVR) at OpenAI focuses on improving data input quality, which is essential for developing trustworthy AI models.
Upgrades from GPT-5 to 5.1 have improved evaluation metrics while reducing token usage, enhancing tool-calling capabilities and agent workflow efficiency in AI model development.
Reinforcement learning infrastructure presents complex challenges due to the need for managing multiple tasks, grading setups, and collaborations with external partners, often requiring late-night interventions and code adjustments.
OpenAI's shopping model incorporates interruptibility and chain-of-thought transparency, allowing users to refine search results interactively and customize the AI's persona with personality toggles like Anton versus Clippy.