Microsoft Research has released AsgardBench, a benchmark designed to isolate whether AI agents use visual observations to revise their plans as tasks unfold. According to the Microsoft Research blog post, the benchmark spans 108 controlled task instances across 12 task types and is available as open source on GitHub.

The paper is titled “AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback.”

The problem AsgardBench targets

The post describes a gap in existing embodied AI evaluation: many benchmarks test perception, navigation, and physical control simultaneously, making it difficult to determine whether an agent is actually using what it perceives or exploiting the predictability of its environment. AsgardBench is designed to separate the planning-from-observation problem from navigation and object manipulation.

The benchmark is built on AI2-THOR, a 3D household simulation environment. Agents are positioned near objects so navigation and viewpoint selection are not factors. A find action brings objects into view; the environment handles container sizing and placement, so the agent does not need to reason about physical locations within cabinets or countertops. The only inputs available to the agent are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of its planned next steps.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step executes. The agent then receives new images and a binary signal indicating whether the action succeeded or failed. The post says this “prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step.” Built-in limits on total steps and repeated actions prevent looping.
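The turn structure described above can be sketched as a short loop. This is a minimal illustration, not AsgardBench's or AI2-THOR's actual API: every name here (Observation, run_episode, ToyKitchen, toy_planner, the action strings, and the max_steps and max_repeats defaults) is hypothetical, and a text string stands in for the color image.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Hypothetical container for the three inputs the agent receives."""
    image: str  # stand-in for a color image
    history: List[Tuple[str, bool]] = field(default_factory=list)  # (action, succeeded)
    previous_plan: List[str] = field(default_factory=list)  # agent's own remaining plan


def run_episode(
    plan_fn: Callable[[Observation], List[str]],
    step_fn: Callable[[str], Tuple[str, bool]],
    is_done: Callable[[], bool],
    first_image: str,
    max_steps: int = 10,   # assumed limit on total steps
    max_repeats: int = 3,  # assumed limit on repeating one action
) -> Tuple[bool, List[Tuple[str, bool]]]:
    obs = Observation(image=first_image)
    for _ in range(max_steps):
        if is_done():
            break
        plan = plan_fn(obs)  # the agent proposes a full plan every turn
        if not plan:
            break
        action = plan[0]  # only the first step is executed
        if sum(a == action for a, _ in obs.history) >= max_repeats:
            break  # stop instead of looping on the same action
        image, ok = step_fn(action)  # new image plus a binary success signal
        obs = Observation(image, obs.history + [(action, ok)], plan[1:])
    return is_done(), obs.history


class ToyKitchen:
    """Toy environment used only to exercise the loop."""

    def __init__(self, mug_dirty: bool) -> None:
        self.mug_dirty = mug_dirty
        self.mug_in_sink = False
        self.mug_on_shelf = False

    def render(self) -> str:
        return f"mug_dirty={self.mug_dirty} mug_in_sink={self.mug_in_sink}"

    def step(self, action: str) -> Tuple[str, bool]:
        ok = False
        if action == "put mug in sink" and not self.mug_in_sink:
            self.mug_in_sink, ok = True, True
        elif action == "clean mug" and self.mug_in_sink:
            self.mug_dirty, ok = False, True
        elif action == "put mug on shelf" and not self.mug_dirty:
            self.mug_in_sink, self.mug_on_shelf, ok = False, True, True
        return self.render(), ok

    def done(self) -> bool:
        return self.mug_on_shelf


def toy_planner(obs: Observation) -> List[str]:
    # Rebuilds the whole plan from the latest "image" on every turn.
    plan: List[str] = []
    if "mug_dirty=True" in obs.image:
        if "mug_in_sink=False" in obs.image:
            plan.append("put mug in sink")
        plan.append("clean mug")
    plan.append("put mug on shelf")
    return plan


env = ToyKitchen(mug_dirty=True)
solved, history = run_episode(toy_planner, env.step, env.done, env.render())
```

Because the planner is called again after every executed step, a plan written before the first observation cannot simply be replayed; the remaining steps are carried along only as the agent's own record.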

Because objects can be in different states (a mug may be clean, dirty, or filled with coffee; a sink may contain other items), the same instruction can require different action sequences, even within the same environment. The post uses a kitchen-cleaning scenario as a running illustration: the agent must notice whether the mug is already clean or whether the sink is occupied before deciding on its next action.
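To make the state dependence concrete, here is a small sketch of how one instruction ("clean the mug") could expand into different step lists. The state names follow the examples in the post, but the function, the action strings, and their ordering are assumptions for illustration and are not taken from the benchmark's action set.

```python
from enum import Enum
from typing import List


class MugState(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    FILLED_WITH_COFFEE = "filled with coffee"


def steps_to_clean_mug(mug: MugState, sink_occupied: bool) -> List[str]:
    """Return a hypothetical action sequence for the same instruction."""
    if mug is MugState.CLEAN:
        return []  # already clean, so nothing needs to happen
    steps: List[str] = []
    if sink_occupied:
        steps.append("clear sink")  # make room before using the sink
    if mug is MugState.FILLED_WITH_COFFEE:
        steps.append("empty mug")  # pour out the coffee first
    steps += ["put mug in sink", "turn on faucet", "turn off faucet"]
    return steps
```

In the benchmark the agent is not handed these state values; it has to infer them from the images, which is where the perception demand comes from.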

Results across models

The post reports that the research team tested several vision-capable models and found that visual input substantially improved performance. According to the post, most models more than doubled their success rates when given images compared to text-only descriptions of the scene. The post contrasts this with prior benchmarks “where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.”

Providing detailed failure information, which the post calls a form of additional feedback, raised performance for all models. However, the post says “the strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate.”

The blog post does not report numerical scores by model name.

Failure patterns

Testing across models surfaced consistent failure modes. The post lists four: agents attempted actions that were impossible given the current state of the environment (such as trying to clean a mug not in the sink); agents got stuck in repeated action loops; agents misinterpreted subtle visual cues such as on/off states or clean/dirty distinctions; and agents lost track of where they were in the task from one step to the next.

The post summarizes these as three underlying weaknesses: inability to distinguish subtle visual details in cluttered scenes, inability to maintain an accurate picture of task progress across multiple steps, and inability to consistently translate visual observations into timely updates to the plan.

Research uses and next steps

The post describes AsgardBench as both a diagnostic and development tool. By varying the feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning.
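One way to picture the three feedback conditions is as a switch on what the agent is told after each action. The sketch below is an assumption about how such a switch might look: FeedbackLevel and render_feedback are hypothetical names, and the exact content of each level in AsgardBench may differ.

```python
from enum import Enum
from typing import Optional


class FeedbackLevel(Enum):
    NONE = "none"
    MINIMAL = "minimal"
    DETAILED = "detailed"


def render_feedback(
    level: FeedbackLevel, succeeded: bool, reason: str = ""
) -> Optional[str]:
    """Return the text shown to the agent after an action, if any."""
    if level is FeedbackLevel.NONE:
        return None  # the agent only gets the new images
    status = "success" if succeeded else "failure"
    if level is FeedbackLevel.DETAILED and not succeeded and reason:
        return f"{status}: {reason}"  # explain what went wrong
    return status  # binary signal only
```

Running the same agent under each level and comparing outcomes is the kind of ablation the post describes, since a gain that appears only under detailed feedback does not come from the images.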

Directions the post identifies as promising include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure how well an agent adapted rather than only whether it succeeded.

AsgardBench is open source and available on GitHub. The post acknowledges the AI2-THOR community for the simulation platform.