[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al., Princeton - Latent Space Recap

Podcast: Latent Space

Published: 2026-01-02

Duration: 28 minutes

Guests: Kevin Wang

Summary

A Princeton team won the NeurIPS 2025 Best Paper award for scaling reinforcement learning networks to 1,000 layers, using self-supervised objectives to unlock gains from depth that traditional RL methods cannot reach. Their approach reframes RL as a classification problem and borrows architectural techniques from language and vision models (residual connections, layer normalization) to scale efficiently.

What Happened

Kevin Wang and his team from Princeton University won the NeurIPS 2025 Best Paper award for their groundbreaking work on scaling reinforcement learning (RL) networks to 1,000 layers. Their project began as an undergraduate research seminar and defies the conventional wisdom that deep RL models are not scalable beyond a few layers. They found that self-supervised learning, which focuses on learning representations of states and actions without human-crafted rewards, allows deeper networks to thrive where traditional RL methods fail.

The team identified that the failure of deeper networks in RL was not merely a function of network depth, but rather the lack of appropriate architectural components. By integrating residual connections and layer normalization, they overcame the vanishing gradient problem and unlocked significant performance improvements. This approach aligns with successful deep learning techniques used in language and vision, where scaling depth is more efficient than scaling width.
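The architectural fix can be illustrated with a minimal sketch (not the authors' code): a pre-norm residual MLP block, where each layer normalizes its input, applies a small MLP, and adds the result back through a skip connection so gradients can flow along the identity path even at extreme depth. All names, dimensions, and initializations here are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class ResidualBlock:
    """One pre-norm residual MLP block: x + relu(norm(x) @ W1) @ W2."""
    def __init__(self, dim, rng):
        # He-style initialization keeps activation scale roughly constant.
        self.w1 = rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim)
        self.w2 = rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim)

    def __call__(self, x):
        h = np.maximum(layer_norm(x) @ self.w1, 0.0)  # normalize, project, ReLU
        return x + h @ self.w2                        # skip connection

rng = np.random.default_rng(0)
blocks = [ResidualBlock(64, rng) for _ in range(1000)]  # a very deep stack
x = rng.standard_normal((1, 64))
for block in blocks:
    x = block(x)  # activations stay finite even after 1,000 layers
```

Because the skip path is an identity, the gradient of the output with respect to any earlier layer always contains an identity term, which is what keeps very deep stacks trainable.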

Their work shifts the RL objective from Q-learning to a classification problem: the network is trained to classify whether a candidate future state belongs to the same trajectory as the current state and action. This redefinition of the objective was crucial to the 'critical depth' phenomenon the team observed, where performance improves sharply once training crosses a threshold of roughly 15 million transitions.
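A minimal sketch of such a trajectory-classification objective is an InfoNCE-style contrastive loss: within a batch, the (state, action) embedding at row i should score its own trajectory's future state (column i) above all other futures. The exact objective and encoders in the paper may differ; everything below is illustrative.

```python
import numpy as np

def contrastive_rl_loss(sa_embed, goal_embed):
    """InfoNCE-style loss: classify which future state in the batch
    came from the same trajectory as each (state, action) pair.

    sa_embed:   (B, D) embeddings of (state, action) pairs
    goal_embed: (B, D) embeddings of future states; row i is drawn
                from the same trajectory as sa_embed row i.
    """
    logits = sa_embed @ goal_embed.T                  # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "positive" class for row i is column i (same trajectory).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
sa = rng.standard_normal((32, 8))
# Futures correlated with their own (state, action) rows.
loss = contrastive_rl_loss(sa, sa + 0.1 * rng.standard_normal((32, 8)))
```

No reward signal appears anywhere in the loss: the trajectory structure itself provides the labels, which is what makes the objective self-supervised.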

The use of JAX and GPU-accelerated environments enabled the team to collect hundreds of millions of transitions rapidly, providing the data abundance necessary for scaling. They also found that larger batch sizes become beneficial as network depth increases, further improving training efficiency.
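The data-collection pattern can be sketched as stepping thousands of environments in lockstep and recording every transition; on a GPU each batched step is a single vectorized operation. The toy dynamics, policy, and batch sizes below are illustrative, not the team's actual setup.

```python
import numpy as np

def batched_rollout(step_fn, init_states, num_steps, rng):
    """Collect transitions from many environments stepped in lockstep.

    step_fn(states, actions) -> next_states, vectorized over the batch.
    Returns num_steps batches of (states, actions, next_states) tuples,
    i.e. num_steps * num_envs transitions in total.
    """
    num_envs, obs_dim = init_states.shape
    states = init_states
    transitions = []
    for _ in range(num_steps):
        actions = rng.standard_normal((num_envs, obs_dim))  # placeholder random policy
        next_states = step_fn(states, actions)              # one vectorized step
        transitions.append((states, actions, next_states))
        states = next_states
    return transitions

# Toy vectorized dynamics: every environment drifts a little toward its action.
toy_step = lambda s, a: s + 0.1 * a
rng = np.random.default_rng(0)
batch = batched_rollout(toy_step, np.zeros((4096, 3)), 100, rng)
# 4096 envs x 100 steps = 409,600 transitions from a single call
```

Because every environment advances in the same vectorized operation, throughput scales with the batch dimension rather than with Python-loop overhead, which is how hundreds of millions of transitions become feasible.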

The potential applications of this work are vast, particularly in robotics, where it allows for goal-conditioned RL without human supervision. By scaling architecture rather than manual data collection, the approach demonstrates a new paradigm for deploying RL models efficiently.

The episode concludes with a discussion on future directions, such as distilling deep networks into shallow ones for efficient deployment and integrating pre-trained Vision-Language Models for hierarchical planning. The team believes RL is poised to scale similarly to language and vision models by focusing on self-supervised objectives rather than traditional reward maximization strategies.
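Distilling a deep network into a shallow one, as discussed, can be sketched as supervised regression of a small student onto a frozen teacher's outputs. The linear student, stand-in teacher, and hyperparameters below are illustrative assumptions, not the method from the episode.

```python
import numpy as np

def distill_step(student_w, states, teacher_fn, lr=0.01):
    """One gradient step fitting a linear student to a frozen teacher.

    Loss: mean squared error between student and teacher outputs.
    """
    targets = teacher_fn(states)          # teacher outputs, treated as labels
    preds = states @ student_w
    grad = 2.0 * states.T @ (preds - targets) / len(states)
    return student_w - lr * grad

rng = np.random.default_rng(0)
teacher_weights = rng.standard_normal((8, 2))
teacher = lambda s: s @ teacher_weights   # stand-in for a frozen 1,000-layer net
w = np.zeros((8, 2))
for _ in range(500):
    w = distill_step(w, rng.standard_normal((64, 8)), teacher)
# The shallow student converges toward the teacher's input-output map.
```

The appeal for deployment is that only the student's forward pass runs at inference time, so the cost of the 1,000-layer network is paid once, during training.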

Key Insights