As AIE Miami wrapped and Google Cloud Next opened, Latent Space’s AINews reported on what it says was the dominant theme in hallway conversations among AI leaders: not model selection or benchmark chasing, but how to direct token usage toward productive work rather than waste. The framing the newsletter settled on is “tasteful tokenmaxxing” — a phrase coined by Shopify CTO Mikhail Parakhin in his Latent Space guest appearance.

The problem with undirected AI adoption

The underlying tension is between two organizational pressures. On one side, infrastructure and model progress keep driving down the cost of running more AI. On the other, raw token volume does not map cleanly to engineering output, and some patterns of heavy usage correspond to low-quality work. Gergely Orosz apparently described the pathological end of this in his AIE Miami keynote, though the post reports on the conversation rather than quoting the talk directly.

The term “tokenmaxxing” itself seems to have evolved quickly. At its most naive, it means treating raw token consumption as a proxy for AI adoption and pushing teams to use more. The problem AINews surfaces is that this creates perverse incentives: teams can hit usage metrics by running LLMs on work that doesn’t benefit from them, or by kicking off parallel agent calls that produce redundant results.

Dex Horthy, described in the post as the coiner of “Context Engineering” and “the Dumb Zone,” apparently walked back a strongly pro-vibe-coding position he had taken six months earlier. According to AINews, he “publicly retracted his extremely vibe-coding-pilled call 6 months ago and encouraged people to please read the code.” The framing he now cites, Alex Volkov’s Z/L continuum from AIE Europe, arranges positions along a spectrum between two poles of opinion; the post does not spell out which pole is which.

Depth over breadth

Parakhin’s formulation in his Latent Space interview is concrete. According to the newsletter, he argued for depth — “do more serial autoresearch loops” — over breadth, which he described as “solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine.” The slot machine metaphor is explicit: parallel runs without feedback loops produce random variation, not systematic improvement.

The practical distinction matters for agent design. Breadth-first approaches use parallelism to try multiple paths and pick the best output. Depth-first approaches use the same compute budget on sequential refinement: one run, then critique, then improvement, then repeat. The breadth approach is easier to implement and looks like high throughput. The depth approach requires more careful orchestration but, in Parakhin’s framing, produces better results.
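To make the distinction concrete, here is a minimal sketch of the two patterns under a roughly equal call budget. The `llm` and `score` helpers are hypothetical placeholders for a model call and an automatic quality check; nothing below is any vendor’s actual API.

```python
# Minimal sketch of breadth-first vs. depth-first token allocation.
# llm() and score() are hypothetical placeholders, not a real API.

def llm(prompt: str) -> str:
    """Placeholder for a single model call."""
    raise NotImplementedError("wire up a provider here")

def score(candidate: str) -> float:
    """Placeholder for an automatic quality check (tests, rubric, ...)."""
    raise NotImplementedError

def breadth_first(task: str, n: int = 10) -> str:
    # The "slot machine": n independent pulls on the same prompt.
    # No run learns from any other; we simply keep the best-scoring one.
    candidates = [llm(task) for _ in range(n)]
    return max(candidates, key=score)

def depth_first(task: str, rounds: int = 5) -> str:
    # Serial refinement: each round spends two calls (critique, revise),
    # and each revision is conditioned on what was wrong last time.
    draft = llm(task)
    for _ in range(rounds):
        critique = llm(f"Task: {task}\nAttempt: {draft}\n"
                       "List the concrete weaknesses of this attempt.")
        draft = llm(f"Task: {task}\nPrevious attempt: {draft}\n"
                    f"Critique: {critique}\nWrite an improved attempt.")
    return draft
```

With n = 10 versus rounds = 5, the two spend a comparable number of calls (10 vs. 11); the difference is purely whether later tokens are conditioned on earlier failures, which is the feedback loop the slot machine metaphor says is missing.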

AINews notes that the Z/L debate referenced in Horthy’s retraction is explicitly not resolved in favor of either side, and that engineering leaders in particular may be biased toward overweighting code quality concerns that “sheer quantity of cheap code generation and code review might overcome.” This is a fair caveat: the argument for code review and human verification is strongest for senior engineers, and it does not extend uniformly to every use case.

Google TPU v8 and enterprise agent platform

At Google Cloud Next, Google announced its 8th-generation TPUs in a split design: TPU 8t for training and TPU 8i for inference. The training variant delivers “nearly 3x compute per pod vs Ironwood,” according to Google. The inference variant connects 1,152 TPUs per pod for multi-agent workloads. @scaling01 flagged an additional claim: Google says it can now scale to a million TPUs in a single cluster with TPU 8t. AINews notes this reinforces “the sheer hardware advantage that a decade of investment has given to GDM.”

Google also launched Gemini Enterprise Agent Platform, described in the post as the evolution of Vertex AI into a platform for building, governing, and optimizing agents at scale. It includes Agent Studio, access to 200+ models via Model Garden, and support for the current Gemini stack including Gemini 3.1 Pro, Gemini 3.1 Flash Image, Lyria 3, and Gemma 4. Related launches included Workspace Intelligence (described as a semantic layer over docs, sheets, meetings, and mail), Gemini Embedding 2, and security agents with Wiz integration.

Open model releases

The same two-day period saw several notable open model releases. Alibaba released Qwen3.6-27B, a dense Apache 2.0 model with thinking and non-thinking modes and native vision-language reasoning. Alibaba claims it outperforms the larger Qwen3.5-397B-A17B on multiple coding evaluations: SWE-bench Verified 77.2 vs 76.2, SWE-bench Pro 53.5 vs 50.9, Terminal-Bench 2.0 59.3 vs 52.5, and SkillsBench 48.2 vs 30.0. Early community reports from @KyleHessling1 and @simonw were positive, particularly for local frontend and image tasks.

OpenAI separately open-sourced a Privacy Filter: a 1.5B total / 50M active MoE model under Apache 2.0 for PII detection and masking, with a 128k context window. AINews describes this as operationally interesting beyond the generic “small open model” framing, because it targets a concrete enterprise problem — cheap, on-device PII redaction over large corpora and logs.
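The operational shape of that problem is easy to sketch. Assuming a locally hosted copy of the model behind some `redact()` call (the interface below is an assumption for illustration, not OpenAI’s published API), redacting a log stream reduces to batched local inference:

```python
# Hedged sketch of on-device log redaction with a small local PII model.
# The redact() interface is an assumption; check the actual model card
# for the real input/output format.
from typing import Iterator

def redact(text: str) -> str:
    """Placeholder: run the local PII model and return the same text
    with detected PII replaced by mask tokens."""
    raise NotImplementedError("wire up a local inference server here")

def redact_log_stream(lines: Iterator[str], batch_size: int = 64) -> Iterator[str]:
    # Batch lines so each model call amortizes overhead while staying
    # comfortably inside the 128k context window. Note: this assumes the
    # model preserves line structure; a real deployment would likely use
    # a structured output format instead.
    batch: list[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield from redact("\n".join(batch)).splitlines()
            batch.clear()
    if batch:
        yield from redact("\n".join(batch)).splitlines()
```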

Xiaomi announced MiMo-V2.5-Pro and MiMo-V2.5. The Pro model claims SWE-bench Pro 57.2, Claw-Eval 63.8, and tau3-Bench 72.9 with 1,000+ autonomous tool calls. The non-Pro variant adds native omnimodality and a 1M-token context window.

Agent infrastructure converging on shared patterns

The broader infrastructure theme AINews identifies across the period: agent harnesses are hardening at multiple vendors simultaneously. OpenAI introduced workspace agents in ChatGPT with Slack-based workflows and scheduled background tasks. Google made a parallel enterprise move with its agent platform. Cursor added Slack invocation for task kick-off and streaming updates. The pattern, as AINews frames it, is “cloud-hosted agents, shared team context, approvals, and long-running execution rather than single-user chat.”
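Structurally, that pattern amounts to a server-side loop that owns shared context, runs until done, and gates privileged steps behind a human approval. Here is a hedged sketch of the shape; every name below is illustrative, not any vendor’s API.

```python
# Illustrative shape of a cloud-hosted agent harness: long-running
# execution, shared team context, and approval gates. All names here
# are hypothetical, not a specific vendor's API.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    privileged: bool  # e.g. writes to prod, messages customers

@dataclass
class TeamContext:
    log: list[str] = field(default_factory=list)

def plan_next_action(ctx: TeamContext) -> Action | None:
    """Placeholder: ask the model for the next step; None when done."""
    raise NotImplementedError

def request_approval(action: Action) -> bool:
    """Placeholder: post the proposed action to a shared channel
    (e.g. Slack) and block until a teammate approves or rejects."""
    raise NotImplementedError

def execute(action: Action, ctx: TeamContext) -> None:
    """Placeholder for actually performing the action."""
    raise NotImplementedError

def run_agent(ctx: TeamContext) -> None:
    # Lives server-side against shared context, so it can outlast any
    # single user's chat session.
    while (action := plan_next_action(ctx)) is not None:
        if action.privileged and not request_approval(action):
            ctx.log.append(f"rejected: {action.name}")
            continue
        execute(action, ctx)
        ctx.log.append(f"executed: {action.name}")
```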

The tokenmaxxing conversation fits into this: as agentic infrastructure scales and parallel invocations become cheap, the question of what those invocations should actually be doing becomes more urgent, not less.