Off-policy reinforcement learning is important for any domain where data collection is expensive — robotics, healthcare, dialogue systems — because it allows a model to learn from previously collected data rather than requiring fresh experience from the current policy. The dominant off-policy algorithm is Q-learning, which is based on temporal difference (TD) learning. But as a Berkeley AI Research blog post explains, TD learning has a fundamental scaling problem: error accumulates across the entire horizon through bootstrapping, making it fragile for long-horizon tasks.
The post proposes a third paradigm for value learning, divide and conquer, as an alternative to TD and Monte Carlo: it reduces the number of Bellman recursions from linear in the horizon to logarithmic, without the variance and hyperparameter sensitivity that make Monte Carlo approaches difficult to tune.
The core problem with TD learning
TD learning updates value estimates using the Bellman equation: the current Q-value is updated toward the immediate reward plus the discounted maximum Q-value of the next state. The problem, as the post describes, is bootstrapping: the error in the estimate of the next state’s value propagates into the current state’s estimate. Over a long horizon, these errors accumulate through every step of the recursion.
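For concreteness, here is a minimal tabular sketch of that one-step update; the names `Q`, `alpha`, and `gamma` are illustrative choices, not taken from the post.

```python
import numpy as np

# Minimal sketch of the tabular Q-learning (one-step TD) update. `Q` is an
# S x A table; `alpha` and `gamma` are illustrative defaults, not the post's.

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])  # bootstrap on the next state's own estimate
    Q[s, a] += alpha * (target - Q[s, a])   # so any error in Q[s_next] leaks into Q[s, a]
    return Q
```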
The standard mitigation is $n$-step TD learning, which uses actual observed returns for the first $n$ steps and then switches to bootstrapped values. The post acknowledges this "often works well" but characterizes it as "highly unsatisfactory" for two reasons. First, it reduces the number of Bellman recursions only by a constant factor of $n$; the dependence on horizon length remains linear. Second, as $n$ grows, variance and suboptimality both increase, so there is no free lunch: $n$ must be tuned per task.
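A corresponding sketch of the $n$-step target, under the same illustrative names, shows why the reduction is only a constant factor: the bootstrapped tail still anchors the recursion, just $n$ steps later.

```python
import numpy as np

# Hedged sketch of the n-step TD target: n observed rewards, then one
# bootstrapped tail value. `rewards` holds r_t .. r_{t+n-1}; `s_n` is the
# state reached after n steps. Names are illustrative, not from the post.

def n_step_target(rewards, Q, s_n, gamma=0.99):
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards))  # Monte Carlo portion
    g += gamma**n * np.max(Q[s_n])                        # bootstrapped tail
    return g
```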
The post distinguishes two RL classes: on-policy RL (PPO, GRPO, policy gradient methods), which requires discarding old data on each policy update, and off-policy RL, which can use any available data. The post states that “as of 2025, we have reasonably good recipes for scaling up on-policy RL,” but that “we still haven’t found a ‘scalable’ off-policy RL algorithm that scales well to complex, long-horizon tasks.”
Divide and conquer: the key idea
The divide-and-conquer approach updates value estimates by splitting a trajectory into two equal-length segments and combining their values to estimate the value of the full trajectory. The post claims this reduces the number of Bellman recursions from linear in the horizon to logarithmic: a horizon of 1024 steps requires only about 10 levels of recursion ($\log_2 1024 = 10$) rather than 1024.
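Back-of-envelope arithmetic (mine, not the post's) makes the contrast concrete:

```python
import math

# TD bootstraps once per step, so recursion depth grows linearly with the
# horizon H; halving the trajectory at each level needs only log2(H) levels.
H = 1024
td_depth = H                         # one Bellman recursion per step
dc_depth = math.ceil(math.log2(H))   # halve the segment at each level
print(td_depth, dc_depth)            # 1024 10
```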
The mathematical structure that makes this tractable in goal-conditioned RL is the triangle inequality. If $d^*(s, g)$ denotes the shortest path distance (temporal distance) between state $s$ and goal state $g$, then for any intermediate state $w$:
$$d^*(s, g) \leq d^*(s, w) + d^*(w, g)$$
Translated into value function terms, this gives a “transitive” Bellman update rule: $V(s, g)$ can be updated using two smaller values $V(s, w)$ and $V(w, g)$, where $w$ is the optimal midpoint (subgoal) on the shortest path. The post describes this as “exactly the divide-and-conquer value update rule that we were looking for.”
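A minimal sketch of that transitive update in temporal-distance form, where a hypothetical table `D[s, g]` estimates $d^*(s, g)$ and `candidates` is whatever set of midpoints is available (both names are mine, not the paper's):

```python
# Tighten the distance estimate D[s, g] using the best available midpoint w,
# per the triangle inequality d*(s, g) <= d*(s, w) + d*(w, g).

def transitive_update(D, s, g, candidates):
    best = min(D[s, w] + D[w, g] for w in candidates)
    D[s, g] = min(D[s, g], best)
    return D
```

The entire difficulty, as the next section discusses, is choosing `candidates` well when the state space is too large to enumerate.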
The subgoal selection problem
There is one practical obstacle: finding the optimal midpoint $w$. In tabular settings with small state spaces, the post notes this reduces to the Floyd-Warshall shortest-path algorithm: simply enumerate all states as candidate midpoints. In continuous environments with large state spaces, exhaustive search is impossible. The post describes this as the reason "previous works have struggled to scale up divide-and-conquer value learning, even though this idea has been around for decades," tracing the approach back to Kaelbling (1993) as "the very first work in goal-conditioned RL."
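In the tabular case, enumerating every state as the midpoint is exactly Floyd-Warshall; a standard vectorized version (illustrative code, not the paper's) looks like this:

```python
import numpy as np

def floyd_warshall(D):
    """All-pairs shortest paths: try every state w as a midpoint and keep
    whichever of d(s, g) and d(s, w) + d(w, g) is smaller."""
    S = D.shape[0]
    for w in range(S):
        D = np.minimum(D, D[:, [w]] + D[[w], :])
    return D
```

The $O(S^3)$ cost of this enumeration is precisely what becomes intractable in continuous state spaces.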
The recent work described in the post (arXiv:2510.22512), co-led with a collaborator referred to as Aditya, makes “meaningful progress toward realizing and scaling up this idea” in continuous settings. The post describes this as “the first such work” to scale divide-and-conquer value learning to “highly complex tasks” in goal-conditioned RL, though it acknowledges this demonstration is limited to that specific class of problems rather than arbitrary RL settings.
Why this matters for off-policy RL at scale
The post argues that divide and conquer has "all the nice properties we want in value learning" simultaneously: logarithmic rather than linear error accumulation, no hyperparameter analogous to the $n$ in $n$-step TD, and no inherent bias toward suboptimality as the horizon grows. These properties contrast with both TD learning (error accumulates linearly with horizon) and Monte Carlo estimation (low bias but high variance, scaling poorly to long horizons).
For robotics and other domains where off-policy learning is mandatory due to data collection constraints, the inability to scale TD learning has been a practical limitation. The post frames divide and conquer as a principled alternative rather than a heuristic patch — the logarithmic recursion reduction is structural, following from the mathematical properties of goal-conditioned value functions, not from empirical tuning.
The paper backing the post is arXiv:2510.22512. The approach is presented as a step toward the goal of scalable off-policy RL, with the current results demonstrating feasibility in goal-conditioned settings as a foundation for broader application.