Seohong Park, a researcher at the Berkeley AI Research lab, has published a blog post proposing a third paradigm for value learning in reinforcement learning, divide and conquer, as an alternative to temporal difference (TD) learning and Monte Carlo estimation. According to the post, the approach makes the number of Bellman recursions grow logarithmically with the horizon rather than linearly, and it is framed as a response to the difficulty of scaling off-policy RL to long-horizon tasks.
Off-policy RL allows a model to learn from previously collected data rather than requiring fresh experience from the current policy, which matters in domains such as robotics and healthcare where data collection is expensive. The dominant off-policy algorithm is Q-learning, built on TD learning.
The problem with TD learning
TD learning updates value estimates using the Bellman equation: the current Q-value is updated toward the immediate reward plus the discounted maximum Q-value of the next state. The post identifies bootstrapping as the core problem: error in the next state’s estimate propagates into the current state’s estimate, and over a long horizon those errors accumulate through every step of the recursion.
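The update described above can be sketched in tabular form. This is a minimal illustration, not code from the post: the function name `q_learning_update`, the learning rate `alpha`, and the toy table are assumptions made for the example.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step on a (num_states, num_actions) array.

    Hypothetical helper for illustration only.
    """
    # Bootstrapped part of the target: discounted max Q-value of the next state.
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    td_target = r + bootstrap
    # Move the current estimate toward the target. Any error already present
    # in Q[s_next] is copied into Q[s, a] through the bootstrap term.
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]  # existing (possibly inaccurate) estimate for the next state
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Because the target reads from `Q[s_next]`, an inaccurate entry there shifts `Q[s, a]` as well, which is the propagation mechanism the post points to.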
The standard mitigation is $n$-step TD learning, which uses actual observed returns for the first $n$ steps before switching to bootstrapped values. The post acknowledges this “often works well” but characterizes it as “highly unsatisfactory”: it reduces Bellman recursions only by a constant factor $n$, not fundamentally, and as $n$ increases, variance and suboptimality both grow.
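A rough sketch of the $n$-step target, assuming the usual form (discounted sum of the first $n$ observed rewards plus a discounted bootstrapped value at step $n$). The function name `n_step_target` and its arguments are hypothetical; in Q-learning the `bootstrap_value` would typically be the maximum Q-value at the state reached after $n$ steps.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Return sum_{i<n} gamma^i * r_i + gamma^n * bootstrap_value.

    rewards: the n rewards actually observed along the trajectory.
    bootstrap_value: the value estimate at the state reached after n steps.
    Hypothetical helper for illustration only.
    """
    n = len(rewards)
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r  # observed return, no bootstrapping
    return target + (gamma ** n) * bootstrap_value  # single bootstrap at step n

example = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

Larger `n` leans more on observed rewards and less on the bootstrapped estimate, which is where the variance and suboptimality concerns mentioned above come from: the observed rewards come from whatever policy collected the data.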
The post states that “as of 2025, we have reasonably good recipes for scaling up on-policy RL,” but that “we still haven’t found a ‘scalable’ off-policy RL algorithm that scales well to complex, long-horizon tasks.”
The divide-and-conquer approach
The divide-and-conquer approach, which the post calls Transitive RL (TRL), updates value estimates by splitting a trajectory into two equal-length segments and combining their values to estimate the value of the full trajectory. The post claims this makes the number of Bellman recursions grow logarithmically with trajectory length rather than linearly.
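To make the scaling claim concrete, the following back-of-the-envelope sketch counts how many chained value updates separate a state from a goal `horizon` steps away. It assumes an idealized setting in which 1-step TD needs one recursion per step, $n$-step TD needs one per block of $n$ steps, and each halving counts as one recursion level. The function name is hypothetical, and the numbers are counts only, not results from the post.

```python
import math

def bootstrap_chain_lengths(horizon, n=1):
    """Idealized recursion counts for a trajectory of `horizon` steps."""
    td = horizon                         # one recursion per step
    n_step = math.ceil(horizon / n)      # one recursion per n-step block
    # ceil(log2(horizon)) computed with integers: number of halvings needed
    divide_and_conquer = (horizon - 1).bit_length()
    return td, n_step, divide_and_conquer

counts = bootstrap_chain_lengths(1024, n=8)
```

Under these assumptions, a 1024-step horizon gives 1024 recursions for 1-step TD, 128 for 8-step TD, and 10 for repeated halving.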
The mathematical structure that enables this in goal-conditioned RL is the triangle inequality. If $d^*(s, g)$ denotes the temporal distance between state $s$ and goal state $g$, then for any intermediate state $w$:
$$d^*(s, g) \leq d^*(s, w) + d^*(w, g)$$
Translated into value function terms, this gives a transitive update rule: $V(s, g)$ can be updated using $V(s, w)$ and $V(w, g)$, where $w$ is a subgoal on the path. The post describes this as “exactly the divide-and-conquer value update rule that we were looking for.”
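One way to see the translation, under the assumption (not spelled out in this summary) that the goal-conditioned value is parameterized as $V(s, g) = \gamma^{d^*(s, g)}$ with discount factor $0 < \gamma < 1$: since $\gamma^x$ is decreasing in $x$, the triangle inequality flips direction and sums become products.

$$V(s, g) = \gamma^{d^*(s, g)} \geq \gamma^{d^*(s, w) + d^*(w, g)} = V(s, w)\, V(w, g)$$

Taking the best intermediate state then suggests an update of the form

$$V(s, g) \leftarrow \max_{w} V(s, w)\, V(w, g)$$

This is a sketch under that parameterization. The exact rule and its base cases are those given in the post and the accompanying paper.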
The post also reports that TRL “matches the best TD-n on all tasks, without needing to set n,” based on experiments on OGBench, a benchmark for offline goal-conditioned RL.
The subgoal selection problem
Finding the optimal midpoint $w$ is a practical obstacle. In tabular settings, the post notes this reduces to the Floyd-Warshall shortest path algorithm. In continuous environments with large state spaces, exhaustive search is not feasible. The post describes this as the reason “previous works have struggled to scale up divide-and-conquer value learning, even though this idea has been around for decades,” tracing the approach to Kaelbling (1993).
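For the tabular case, the connection to Floyd-Warshall can be illustrated with a minimal sketch of the standard algorithm over a distance matrix. The toy four-state chain graph below is an assumption made for the example and does not come from the post.

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest distances; dist[s][g] is the direct edge cost or inf."""
    n = len(dist)
    d = [row[:] for row in dist]  # copy so the input is not modified
    for w in range(n):            # candidate intermediate state (subgoal)
        for s in range(n):
            for g in range(n):
                via_w = d[s][w] + d[w][g]
                if via_w < d[s][g]:
                    # Going through w is shorter: d(s, g) <- d(s, w) + d(w, g)
                    d[s][g] = via_w
    return d

INF = math.inf
# Directed chain 0 -> 1 -> 2 -> 3, each edge taking one step.
chain = [
    [0, 1, INF, INF],
    [INF, 0, 1, INF],
    [INF, INF, 0, 1],
    [INF, INF, INF, 0],
]
shortest = floyd_warshall(chain)
```

The triple loop tries every state as a midpoint, which costs time cubic in the number of states. That exhaustive enumeration is what has no direct analogue in large or continuous state spaces.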
The paper described in the post (arXiv:2510.22512), co-led with a collaborator the post refers to as Aditya, is described as making “meaningful progress toward realizing and scaling up this idea” in continuous settings. The post describes it as the first work to scale divide-and-conquer value learning to what it calls “highly complex tasks” in goal-conditioned RL, while acknowledging the demonstration is specific to that class of problems rather than arbitrary RL settings.
The post’s stated properties
The post argues that the approach has “all the nice properties we want in value learning” simultaneously: error that accumulates logarithmically rather than linearly, no hyperparameter analogous to the $n$ in $n$-step TD, and no inherent bias toward suboptimality as the horizon grows. These properties are contrasted with both TD learning, which the post says accumulates error linearly, and Monte Carlo estimation, which the post characterizes as having low bias but high variance that scales poorly to long horizons.
Whether the approach generalizes beyond goal-conditioned RL settings remains, by the post’s own account, an open question.