The Allen Institute for AI (Ai2) published a post describing two benchmarks it developed to evaluate whether AI agents can conduct scientific investigations. The post covers ScienceWorld, released in 2022, and DiscoveryWorld, released in 2024, and argues that these evaluations have become more relevant as model capabilities have grown.
The gap between knowing and doing
The post distinguishes “book smarts” (answering exam questions about science) from “street smarts” (using the scientific method to make new discoveries). It describes the 2022 starting point: at ScienceWorld’s release, the best AI models scored highly on multiple-choice grade-school science exams but scored below 10 percent when asked to perform experiments in a virtual environment.
Peter Jansen, an Ai2 researcher who led development of both benchmarks, is quoted in the post: “So many folks are jumping on the science agent bandwagon and releasing agents. But if the best systems a year ago couldn’t even solve most of the easy problems in DiscoveryWorld, how likely is it that they’re much better today?”
ScienceWorld
ScienceWorld places agents inside a text-based simulated environment spanning ten interconnected locations — a kitchen, a workshop, a greenhouse, and others — with around 200 types of objects that behave as they would in a physical setting. Ice melts when heated; circuits conduct based on the materials used; plants grow under the right conditions.
Rather than selecting a boiling point from a multiple-choice list, an agent given an unknown substance, a thermometer, and a stove must determine the boiling point through experimentation. Agents issue text commands and receive descriptions of what happens, working through 30 task types across categories including changing states of matter, mixing chemicals, and running Mendelian genetics crosses. Each task type has hundreds of randomized configurations, so agents cannot succeed by memorizing solutions.
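The command-and-observation loop described above can be illustrated with a toy sketch. Everything in it is hypothetical: `ToyBoilingEnv`, its three commands, the heating rate, and `find_boiling_point` are invented for illustration and are not the ScienceWorld API. The sketch assumes a liquid's temperature stops rising once it boils, so the agent heats the substance and watches for the thermometer reading to plateau.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyBoilingEnv:
    """Hypothetical stand-in for a text environment; not the ScienceWorld API."""
    boiling_point: float       # hidden from the agent
    temperature: float = 20.0  # degrees Celsius
    stove_on: bool = False

    def step(self, command: str) -> str:
        """Apply one text command and return a text observation."""
        if command == "activate stove":
            self.stove_on = True
            return "The stove is now on."
        if command == "wait":
            if self.stove_on:
                # Assumed physics: the liquid warms until it reaches its
                # boiling point, then stays at that temperature.
                self.temperature = min(self.temperature + 7.0, self.boiling_point)
            return "You wait."
        if command == "read thermometer":
            return f"The thermometer reads {self.temperature:.1f} degrees Celsius."
        return "Unknown command."


def find_boiling_point(env: ToyBoilingEnv, max_steps: int = 200):
    """Heat the substance and return the temperature at which readings plateau."""
    env.step("activate stove")
    previous = None
    for _ in range(max_steps):
        env.step("wait")
        observation = env.step("read thermometer")
        reading = float(observation.split()[3])  # parse the number from the text
        if previous is not None and reading == previous:
            return reading  # two identical readings in a row: a plateau
        previous = reading
    return None  # no plateau found within the step budget


# Randomized configurations: each seed hides a different boiling point,
# so a memorized answer would not transfer from one run to the next.
for seed in range(3):
    hidden = round(random.Random(seed).uniform(40.0, 150.0), 1)
    estimate = find_boiling_point(ToyBoilingEnv(boiling_point=hidden))
    print(f"seed={seed} estimated boiling point: {estimate}")
```

The agent here never sees `boiling_point` directly. It has to recover the value from text observations, which is the distinction the benchmark draws between recalling a fact and measuring it.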
When ScienceWorld launched, models that scored an “A” on the ARC science exam — covering the same conceptual material — failed more than 90 percent of ScienceWorld tasks. The post reports that a 2025 benchmark suite from Microsoft Research called TALES, which includes ScienceWorld, found leading models now score in the low 80s — a large improvement from sub-10 percent, but still short of fully solving the tasks.
DiscoveryWorld
DiscoveryWorld is the more demanding benchmark. It tests whether an agent can design and execute end-to-end scientific investigations from scratch, not just perform predetermined steps.
The benchmark takes place in a fictional setting on Planet X, a hypothetical space colony, where the agent plays the role of a scientist. It contains 120 challenge tasks spanning eight topics — proteomics, rocket science, radioisotope dating, and epidemiology, among others — across three difficulty levels, with parametric variations that change the data, solution, and environment layout each run. The fictional framing is deliberate: by placing tasks in invented scientific contexts, the benchmark prevents agents from drawing on prior training knowledge of established facts.
Tasks require forming hypotheses, designing experiments, running them, and analyzing results, often over hundreds of in-game actions. DiscoveryWorld scores not just whether the agent solved the task but whether it followed a scientific process and whether it actually understood the discovery it made, distinguishing genuine insight from lucky guessing.
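One way to picture the three-part scoring described above is as a record that keeps completion, process, and understanding separate. The sketch below is illustrative only: the `RunScore` class, its field names, and the 0.5 threshold are assumptions made for this example, not DiscoveryWorld's actual scoring code or metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    """Hypothetical per-run record; field names are not DiscoveryWorld's."""
    task_completed: bool       # did the agent solve the task?
    process_steps_done: int    # scientific-process steps the agent carried out
    process_steps_total: int   # process steps the scenario checks for
    explained_discovery: bool  # could the agent state what it discovered?

    @property
    def process_fraction(self) -> float:
        """Share of the checked process steps that the agent performed."""
        if self.process_steps_total == 0:
            return 0.0
        return self.process_steps_done / self.process_steps_total

    def looks_like_lucky_guess(self, min_process: float = 0.5) -> bool:
        """Solved, but with little process and no stated understanding.

        The 0.5 default is an arbitrary illustrative threshold.
        """
        return (
            self.task_completed
            and self.process_fraction < min_process
            and not self.explained_discovery
        )


guess = RunScore(task_completed=True, process_steps_done=1,
                 process_steps_total=10, explained_discovery=False)
careful = RunScore(task_completed=True, process_steps_done=9,
                   process_steps_total=10, explained_discovery=True)
print(guess.looks_like_lucky_guess(), careful.looks_like_lucky_guess())
```

Keeping the three measures as separate fields, rather than folding them into one number, is what lets a run that reached the right answer by chance be told apart from one that reached it through investigation.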
The post reports that some of the best current systems complete only approximately 20 percent of DiscoveryWorld tasks at higher difficulty levels — problems that average human scientists with advanced degrees solve approximately 70 percent of the time.
Jansen is quoted on the benchmark’s trajectory: “We tend to release benchmarks that start out being very challenging, but they become much more popular a year or two later as models and methods catch up. ScienceWorld was very much like that, and DiscoveryWorld seems like it’s getting like that now.”
Context and availability
The post states the DiscoveryWorld paper has been cited nearly 80 times and covered by New Scientist. Both benchmarks are described as open and freely available.
Jansen is quoted on the purpose behind the evaluations: “We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries. DiscoveryWorld and ScienceWorld help measure whether agents can begin that process by testing their end-to-end scientific capabilities in simplified virtual worlds. If an agent flunks basic science, what hope does it have of curing cancer?”
Ai2 describes both benchmarks as developed alongside systems that push the boundaries of capability, treating measurement and progress as “two sides of the same effort.”