ParaRNN: Apple researchers train a 7B-parameter nonlinear RNN, competitive with transformers

Recurrent neural networks have a well-known efficiency advantage at inference: unlike transformers, where attention cost scales quadratically with sequence length, an RNN generates each token at a fixed cost regardless of context length, because the entire history is compressed into a fixed-size hidden state. That advantage has been difficult to exploit at scale because RNN training cannot be parallelized along the sequence dimension: each step depends on the previous hidden state. Apple ML Research’s ParaRNN work, accepted as an Oral at ICLR 2026, addresses that bottleneck.

The framework achieves a 665x speedup over sequential RNN training and, for the first time, enables training classical nonlinear RNNs at 7 billion parameters, where they reach language modeling performance competitive with transformers.

Why the training bottleneck existed

Modern state space models like Mamba resolved the RNN training problem by making the recurrence linear in the hidden state. Composing the steps of a linear recurrence is an associative operation, which enables parallel scan algorithms: it is the same property that allows a cumulative sum to be computed in a tree-parallel structure rather than sequentially. Given enough parallel processors, this turns n sequential steps into O(log n) parallel ones.
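The associativity argument can be made concrete with a minimal sketch, assuming a scalar recurrence h_t = a_t * h_{t-1} + b_t. The function names and coefficient values below are illustrative and do not come from Mamba or ParaRNN. Each step is an affine map, composing two affine maps yields another affine map, and that composition is the associative operation a scan needs.

```python
# Illustrative sketch: a scalar linear recurrence evaluated sequentially
# and with a Hillis-Steele scan. Names and values are assumptions.

def sequential(a, b, h0=0.0):
    """Reference: evaluate h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def combine(first, second):
    """Compose two affine maps h -> a*h + b: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def scan(a, b, h0=0.0):
    """Inclusive scan over affine maps.

    Within one pass every position is independent of the others, so a pass
    could run fully in parallel; only the passes themselves are sequential.
    """
    elems = list(zip(a, b))
    n = len(elems)
    offset, passes = 1, 0
    while offset < n:
        elems = [
            elems[i] if i < offset else combine(elems[i - offset], elems[i])
            for i in range(n)
        ]
        offset *= 2
        passes += 1
    # elems[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems], passes

a = [0.5, 0.9, -0.3, 0.8, 0.1, 0.7, -0.6, 0.4]
b = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0, 0.75, 0.3]

seq = sequential(a, b)
par, passes = scan(a, b)
print(passes, max(abs(p - s) for p, s in zip(par, seq)))
```

For eight steps the scan needs three passes instead of eight sequential updates. This simple variant does more total work than the sequential loop; work-efficient scans exist, but the pass count is what matters for the argument here.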

The cost of linearity is expressivity. A linear hidden state evolution covers a narrower range of dynamics than a nonlinear one, which constrains the model’s ability to track state and retrieve information on tasks that require it.

Classical RNNs such as GRU and LSTM apply nonlinearities inside the recurrence. Those nonlinearities break the associativity needed for parallel scan. The question ParaRNN addresses is whether nonlinear RNNs can be trained in parallel without discarding the nonlinearity.

Newton’s method as the key

Rather than treating the RNN as a sequential chain of steps, the approach, in the words of Apple ML Research’s post, “reframe[s] the entire sequence as a single system of equations, where the hidden states across all steps are unknowns to solve for simultaneously.” Newton’s method then solves this system iteratively, replacing the nonlinear equations with linear approximations built from their Jacobians at each iteration.

The linearized system has the same form as a linear state space model, with Jacobians playing the role of state matrices. This means each Newton iteration can be solved using parallel scan. The full nonlinear RNN behavior is recovered by iterating: each iteration refines the approximation, converging to the true nonlinear solution.

The researchers apply this to GRU and LSTM cells and observe convergence in three iterations. In other words, three parallel passes recover the same hidden state evolution as the sequential nonlinear RNN computation.
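The Newton reformulation can be illustrated with a minimal sketch, assuming a toy scalar cell h_t = tanh(W * h_{t-1} + U * x_t). The cell, its weights, and all function names are assumptions made for illustration; this is not the ParaRNN implementation or its GRU/LSTM cells. Each iteration linearizes every step around the current guess, which produces a linear recurrence of exactly the kind a parallel scan can evaluate. For brevity the linear recurrence is solved with a plain loop here.

```python
import math

W, U = 0.9, 0.5  # toy scalar weights (assumed values)

def step(h_prev, x):
    """Nonlinear recurrence step of the toy cell."""
    return math.tanh(W * h_prev + U * x)

def step_jac(h_prev, x):
    """Derivative of step() with respect to h_prev (a 1x1 Jacobian)."""
    return W * (1.0 - math.tanh(W * h_prev + U * x) ** 2)

def sequential_rnn(xs, h0=0.0):
    """Reference: ordinary step-by-step evaluation."""
    h, out = h0, []
    for x in xs:
        h = step(h, x)
        out.append(h)
    return out

def linear_recurrence(a, b, h0):
    """Solve h_t = a_t * h_{t-1} + b_t.

    Written as a loop for brevity; because the recurrence is linear,
    this is the part a parallel scan would replace.
    """
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def newton_rnn(xs, h0=0.0, tol=1e-12, max_iters=50):
    """Solve for all hidden states at once with Newton's method."""
    h = [0.0] * len(xs)  # initial guess for every time step
    for it in range(1, max_iters + 1):
        prev = [h0] + h[:-1]  # h_{t-1} under the current guess
        # Both lists are independent across t, so they parallelize trivially.
        f = [step(p, x) for p, x in zip(prev, xs)]
        jac = [step_jac(p, x) for p, x in zip(prev, xs)]
        # Linearization: h_new[t] = jac[t] * h_new[t-1] + (f[t] - jac[t] * prev[t])
        b = [f_t - j_t * p for f_t, j_t, p in zip(f, jac, prev)]
        h_new = linear_recurrence(jac, b, h0)
        delta = max(abs(n - o) for n, o in zip(h_new, h))
        h = h_new
        if delta < tol:
            return h, it
    return h, max_iters

xs = [math.sin(0.7 * t) for t in range(16)]
h_seq = sequential_rnn(xs)
h_newton, iters = newton_rnn(xs)
print(iters, max(abs(a - b) for a, b in zip(h_newton, h_seq)))
```

The iteration count printed by this toy depends on the assumed weights, inputs, and tolerance, and should not be read as reproducing the three-iteration figure reported for GRU and LSTM cells. What the sketch does show is that the converged result matches sequential evaluation.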

Engineering for scale

Newton iterations introduce Jacobian matrices into the parallel reduction. For generic RNNs, these Jacobians are dense: storage grows quadratically and multiplication cubically with hidden state size, which makes the approach intractable for large models.

The solution is to constrain the cells so their Jacobians have structured sparsity. The ParaGRU cell produces diagonal Jacobians; ParaLSTM produces block-diagonal Jacobians. Custom CUDA kernels implement the parallel reduction of these structured Jacobians, fusing Newton iterations, system assembly, and parallel reduction into a single kernel.
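To see why structure matters, consider the combine step of the scan, which composes two affine maps h -> A h + b. The sketch below is a rough illustration with hypothetical helper names, not the ParaRNN CUDA kernels. It compares a dense combine with a diagonal one, where each Jacobian is stored as a vector of its diagonal entries, and counts only scalar multiplications.

```python
# Illustrative sketch: cost of one scan combine with dense vs. diagonal
# Jacobians. All names and values are assumptions.

def combine_dense(first, second):
    """Compose h -> A1 h + b1 followed by h -> A2 h + b2 (dense d x d)."""
    A1, b1 = first
    A2, b2 = second
    d = len(b1)
    A = [[sum(A2[i][k] * A1[k][j] for k in range(d)) for j in range(d)]
         for i in range(d)]
    b = [sum(A2[i][k] * b1[k] for k in range(d)) + b2[i] for i in range(d)]
    return A, b

def combine_diag(first, second):
    """Same composition when both Jacobians are diagonal (stored as vectors)."""
    a1, b1 = first
    a2, b2 = second
    a = [y * x for x, y in zip(a1, a2)]
    b = [y * v + w for y, v, w in zip(a2, b1, b2)]
    return a, b

def to_dense(diag):
    """Expand a diagonal stored as a vector into a full matrix."""
    d = len(diag)
    return [[diag[i] if i == j else 0.0 for j in range(d)] for i in range(d)]

def dense_multiplies(d):
    """Scalar multiplications in one dense combine: d^3 for A2 A1, d^2 for A2 b1."""
    return d ** 3 + d ** 2

def diag_multiplies(d):
    """Scalar multiplications in one diagonal combine: d for a2 * a1, d for a2 * b1."""
    return 2 * d

a1, b1 = [0.5, -0.2, 0.9, 0.1], [1.0, 2.0, -1.0, 0.5]
a2, b2 = [0.3, 0.7, -0.4, 0.8], [0.0, -1.5, 0.25, 2.0]

diag_A, diag_b = combine_diag((a1, b1), (a2, b2))
dense_A, dense_b = combine_dense((to_dense(a1), b1), (to_dense(a2), b2))

print(dense_multiplies(len(b1)), diag_multiplies(len(b1)))
```

Both paths produce the same composed map, but the dense path's multiplication count grows with the cube of the hidden size while the diagonal path grows linearly. A block-diagonal Jacobian sits between the two, with cost determined by the block size.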

Three performance tiers are offered: pure PyTorch for prototyping, CUDA-accelerated reduction for generic cells with diagonal or block-diagonal Jacobians, and a fully-fused single-kernel implementation for production use.

Results at 7B parameters

The researchers trained models from 400M to 7B parameters on language modeling tasks. Both ParaGRU and ParaLSTM at 7B parameters reach perplexity and downstream task scores comparable to transformers and state-of-the-art state space models, according to Apple ML Research.

The inference advantage holds: constant-time token generation means throughput does not degrade with context length. On state-tracking tasks, nonlinear RNNs showed a measurable benefit: the paper’s Table 2 shows ParaGRU and ParaLSTM achieving 100 percent accuracy on the k-hop and Parity benchmarks, compared to Mamba2’s 98 percent and 51 percent respectively.

Open-source framework

The ParaRNN codebase has been released as an open-source framework. Defining a custom cell requires implementing a single recurrence step method inheriting from a base class; the framework applies Newton’s method, assembles the Jacobian system, and runs the parallel reduction. The paper’s first author is delivering an Expo Talk at ICLR 2026 alongside the oral presentation.