COSPLAY co-evolves an LLM decision agent and a skill bank to improve long-horizon game performance

A paper posted to arXiv presents COSPLAY, a framework designed to address a persistent weakness in LLM-based agents: the inability to consistently discover, retain, and reuse structured skills across episodes in long-horizon tasks. The authors are Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, and Dinesh Manocha.

The approach is co-evolutionary: two components improve together rather than one being fixed while the other adapts. A decision agent retrieves skills from a skill bank to guide its actions. Separately, a skill pipeline — managed by its own agent — discovers reusable skills from the decision agent’s unlabeled rollouts and continuously extracts, refines, and updates them.

This article relies on a single tier-4 source (the arXiv preprint). The paper has not yet undergone peer review.

How the skill bank works

The skill bank is not pre-specified. Skills are discovered from actual rollout data — sequences of actions the decision agent has already taken — rather than defined by humans in advance. The skill pipeline agent processes these rollouts to identify patterns worth retaining, formalizes them with what the paper calls “contracts,” and updates the bank as new rollouts arrive.
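The discover-formalize-update cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: the `Skill` fields, the `discover_skills` and `update_bank` helpers, and the idea of treating a recurring action pair as a skill are all hypothetical stand-ins for what the skill pipeline agent does. The two contract fields hold placeholder strings only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill record; field names are illustrative, not the paper's."""
    name: str
    actions: tuple          # reusable action sequence
    precondition: str       # contract part: when the skill applies
    expected_effect: str    # contract part: what the skill should achieve
    support: int = 0        # how many rollout occurrences back this skill


def discover_skills(rollouts, length=2, min_support=2):
    """Stand-in for the skill pipeline: keep action n-grams that recur."""
    counts = Counter()
    for rollout in rollouts:
        for i in range(len(rollout) - length + 1):
            counts[tuple(rollout[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_support}


def update_bank(bank, rollouts):
    """Add newly discovered skills and refresh support for known ones."""
    for seq, n in discover_skills(rollouts).items():
        name = "+".join(seq)
        if name in bank:
            bank[name].support += n
        else:
            bank[name] = Skill(
                name=name,
                actions=seq,
                precondition=f"'{seq[0]}' is available",     # placeholder contract
                expected_effect=f"state after '{seq[-1]}'",  # placeholder contract
                support=n,
            )
    return bank


# The bank starts empty and is filled only from rollout data.
bank = {}
update_bank(bank, [["scout", "gather", "build"], ["gather", "build", "defend"]])
update_bank(bank, [["gather", "build"], ["scout", "gather", "build"]])
```

After both batches the bank holds a single skill, `gather+build`, because it is the only action pair that recurs often enough in either batch. The second call updates its support count rather than creating a duplicate.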

The decision agent, in turn, retrieves from this bank when choosing actions. The retrieval step means the agent is not relying solely on in-context reasoning to decide what to do next; it can draw on structured prior experience in the form of retrieved skills. Both components improve together: better decisions produce better rollouts, which allow the skill pipeline to extract better skills, which improve future decisions.
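The feedback loop in this paragraph can be made concrete with a toy simulation. Everything here is an assumption for illustration: the four-state environment, the `BEST` table of rewarded actions, the random fallback policy, and the rule that a skill is simply a rewarded state-action pair. The paper's agents are LLM-based, and none of these names come from it.

```python
import random

ACTIONS = ["left", "right", "jump"]
# Toy environment (an assumption): each state has exactly one rewarded action.
BEST = {0: "right", 1: "jump", 2: "right", 3: "left"}


def retrieve(bank, state):
    """Decision agent's retrieval step: look up a skill keyed by state."""
    return bank.get(state)


def run_episode(bank, rng):
    """Roll out one episode, falling back to a random action when no skill matches."""
    rollout = []
    for state in BEST:
        action = retrieve(bank, state) or rng.choice(ACTIONS)
        reward = 1 if action == BEST[state] else 0
        rollout.append((state, action, reward))
    return rollout


def extract_skills(bank, rollout):
    """Skill pipeline stand-in: keep state-action pairs that earned reward."""
    for state, action, reward in rollout:
        if reward:
            bank[state] = action
    return bank


def co_evolve(rounds=30, seed=0):
    """Alternate between acting with the current bank and updating the bank."""
    rng = random.Random(seed)
    bank, returns = {}, []
    for _ in range(rounds):
        rollout = run_episode(bank, rng)
        returns.append(sum(reward for _, _, reward in rollout))
        extract_skills(bank, rollout)
    return bank, returns


bank, returns = co_evolve()
```

In this toy setting the per-episode return never decreases from one round to the next, because every skill added to the bank is a known-good action that replaces a random guess in later episodes. Real rollouts are noisier, so the same guarantee should not be expected of the actual framework.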

Results across six game environments

The paper reports experiments across six game environments, covering both single-player and multi-player settings. Using an 8B-parameter base model, COSPLAY achieves an average reward improvement of over 25.1 percent against four frontier LLM baselines on single-player game benchmarks. On multi-player social reasoning games, the paper reports that COSPLAY remains competitive.
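For readers unfamiliar with the metric, an average reward improvement can be computed in more than one way, and the source summarized here does not spell out which definition is used. The sketch below shows one common choice, the mean of per-game relative gains. The function name and the reward values are made up for illustration and are not taken from the paper.

```python
def average_relative_improvement(agent_rewards, baseline_rewards):
    """Mean of per-game (agent - baseline) / |baseline|.

    One possible definition, assumed here; baselines must be non-zero.
    """
    if len(agent_rewards) != len(baseline_rewards):
        raise ValueError("need one agent and one baseline reward per game")
    gains = [
        (agent - base) / abs(base)
        for agent, base in zip(agent_rewards, baseline_rewards)
    ]
    return sum(gains) / len(gains)


# Placeholder rewards for three hypothetical games, not results from the paper.
baseline = [10.0, 20.0, 40.0]
agent = [12.0, 25.0, 52.0]
improvement = average_relative_improvement(agent, baseline)
```

With these placeholder values the per-game gains are 20, 25, and 30 percent, so the average is about 25 percent. A different definition, such as comparing summed rewards, would generally give a different number from the same data.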

The distinction between single-player and multi-player results is meaningful. Single-player games allow the framework to optimize against a fixed environment, where skill reuse is straightforwardly beneficial: a skill that worked before is likely to work again in similar situations. Multi-player social reasoning games introduce other agents whose behavior changes, which limits how well previously discovered skills generalize. The paper does not claim strong performance in multi-player settings, only that the framework remains competitive.

The long-horizon problem

Large language models are capable game-playing agents in many settings, but their performance degrades over extended interaction sequences where skills need to chain across many timesteps under delayed rewards and partial observability. Standard prompting approaches give LLMs access to a context window but no mechanism to accumulate structured experience across episodes.

COSPLAY addresses this through the skill bank rather than through extended context. Skills extracted from prior rollouts persist across episodes and are retrievable in structured form, rather than relying on the model to re-derive useful strategies from raw context each time.
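The difference from extended context can be illustrated with a small retrieval sketch: rather than replaying the full history, the prompt carries only the few stored skills that match the current observation. The skill entries, the word-overlap scoring, and the prompt layout below are all hypothetical. The paper's actual retrieval mechanism is not described in the source summarized here.

```python
def score(observation, skill):
    """Jaccard overlap between observation words and the skill's precondition words."""
    obs_words = set(observation.lower().split())
    pre_words = set(skill["precondition"].lower().split())
    union = obs_words | pre_words
    return len(obs_words & pre_words) / len(union) if union else 0.0


def retrieve_top_k(bank, observation, k=2):
    """Return up to k skills with non-zero overlap, best match first."""
    ranked = sorted(bank, key=lambda skill: score(observation, skill), reverse=True)
    return [skill for skill in ranked[:k] if score(observation, skill) > 0]


def build_prompt(observation, skills):
    """Compose a compact prompt from the observation and the retrieved skills."""
    lines = [f"Observation: {observation}", "Relevant skills:"]
    lines += [f"- {s['name']}: {' -> '.join(s['actions'])}" for s in skills]
    return "\n".join(lines)


# Hypothetical bank carried over from earlier episodes.
bank = [
    {"name": "open_door", "precondition": "locked door key in inventory",
     "actions": ["use key", "push door"]},
    {"name": "cross_river", "precondition": "river blocks path raft nearby",
     "actions": ["board raft", "paddle"]},
    {"name": "rest", "precondition": "low health campfire nearby",
     "actions": ["sit", "sleep"]},
]

observation = "locked door near the river"
skills = retrieve_top_k(bank, observation, k=2)
prompt = build_prompt(observation, skills)
```

Here `open_door` ranks first and `cross_river` second, while `rest` is left out because it shares no words with the observation. The prompt stays the same size no matter how many episodes produced the bank, which is the practical appeal of structured retrieval over an ever-growing context.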

The 25.1 percent figure is measured against frontier LLM baselines on single-player game benchmarks. The abstract does not address whether the approach scales beyond game environments, or how the skill extraction pipeline performs when rollout data is noisy or sparse.