Nathan Lambert on why the open-closed model performance gap is harder to measure than a single number suggests

Nathan Lambert published an analysis on Interconnects examining how the open-to-closed model performance gap is typically measured and why he thinks that measurement is misleading. The analysis and characterisations below are Lambert’s.

The composite benchmark problem

Lambert’s central argument is that the Artificial Analysis Intelligence Index, a composite of roughly ten sub-evaluations widely cited as a measure of the open-closed gap, “covers up a nuanced and crucial dynamic at what capabilities the models are covering.” He identifies three failure modes in how this number is interpreted: how well benchmarks correlate with actual use-cases changes over time; different models’ benchmark rankings diverge from their real-world performance; and training regimes shift specifically to move the benchmarks.

He cites Gemini 3 as an example of this divergence: “Gemini 3’s incredible benchmarks and remarkable irrelevance in where AI tools currently are being tested and deployed (agents).” Lambert states he is “at a relative minimum in my personal confidence in benchmarks” given the current pace of post-training change.
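
The way a composite score can hide capability-level differences can be shown with a toy calculation. The sketch below is only an illustration: the task names, the scores, and the unweighted-mean aggregation are all invented assumptions, not data from Artificial Analysis or from Lambert’s post.

```python
from statistics import mean

# Hypothetical sub-evaluation scores on a 0-100 scale. These numbers are
# invented for illustration and are NOT real benchmark results.
closed_model = {"chat": 90, "math": 88, "coding_agents": 80, "long_context": 78}
open_model = {"chat": 92, "math": 90, "coding_agents": 72, "long_context": 70}


def composite(scores):
    """Unweighted mean of the sub-evaluation scores (an assumed aggregation)."""
    return mean(scores.values())


def per_task_gap(closed, open_):
    """Closed-minus-open score difference for each sub-evaluation."""
    return {task: closed[task] - open_[task] for task in closed}


headline_gap = composite(closed_model) - composite(open_model)
gaps = per_task_gap(closed_model, open_model)

print("headline gap:", headline_gap)
for task, gap in gaps.items():
    print(f"{task}: {gap:+d}")
```

With these made-up numbers the headline gap is 3 points, while the per-task view shows the open model slightly ahead on chat and math and 8 points behind on the agentic and long-context tasks. A single averaged figure cannot distinguish that pattern from a uniform 3-point deficit.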

The shifting domain problem

Lambert describes a sequence of focus areas since ChatGPT: chat and simple code saturated first, then mathematics, and the field has since moved to complex coding and terminal tasks via reinforcement learning with verifiable rewards. He says frontier labs are now beginning to push toward specialised knowledge domains — accounting, law, healthcare — where the relevant training data is less public than code repositories and evaluation is harder.

He notes that non-frontier labs, including leading Chinese open-weight labs, have historically caught up by purchasing training environments and datasets that frontier labs developed first, often at reduced cost. He describes this as “economically similar to building chip fabs” and says it depends on RL environments being buildable as evaluations — a condition he expects to become less reliable as domains shift.

The frontier labs’ precarious position

Lambert argues that the current revenue base for frontier closed labs is heavily concentrated in coding and terminal tasks, which are the same domains where open models are closest to parity. If capability in those areas saturates and the field moves elsewhere, he writes, “a large amount of the enterprise revenue could be reliant on well-formed customer relationships, inertia, and better product development, rather than the models being leaps and bounds better.”

He frames this as a structural reinvention problem: labs must continuously find new domains where their models are demonstrably better in order to justify continued investment in expensive AI infrastructure. He says he still expects frontier labs to be “astronomically profitable businesses,” but attributes this to a belief they will continue unlocking new valuable use-cases rather than to the benchmarks being a complete signal.

Open models’ residual strength

Lambert does not characterise open models as simply weaker. He notes that on some out-of-distribution benchmarks (he cites WeirdML and ARC AGI 2) open-weight models are “very far behind,” but that what users encounter in practice is weaker robustness on long-context tasks and a need to reset agent context more often, rather than a fundamental quality gap. He writes that open models are “not a category error in the sense that they’re fundamentally different classes of models”; they are, in his view, “far closer than many would’ve expected.”