Microsoft Research, in collaboration with Princeton University and Universitat Politècnica de València, has published ADeLe — AI Evaluation with Demand Levels — a method that scores both tasks and language models across 18 core abilities, then uses those shared scores to predict how a model will perform on tasks it has never seen. The work appears in Nature under the title “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” and was supported by Microsoft’s Accelerating Foundation Models Research grant program.
The central problem ADeLe addresses is that current AI benchmarks report aggregate scores without explaining why a model succeeds or fails, and they do not reliably forecast behavior on new tasks. ADeLe’s approach is to represent benchmarks and models in the same vocabulary — a vector of 18 ability scores covering attention, reasoning, domain knowledge, and similar dimensions — so that task demands and model capabilities can be directly compared.
How the scoring works
Each task receives a score from 0 to 5 on each of the 18 abilities based on how much of that ability it requires. A basic arithmetic problem scores low on quantitative reasoning; an Olympiad-level proof scores near the top. Evaluating a model across many such tasks produces what the team calls an ability profile: a structured map of where the model reliably succeeds and where it breaks down.
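To make the idea concrete, here is a minimal sketch of how such demand annotations and evaluation results might be represented. The ability names, tasks, and scores below are illustrative inventions, not the paper's actual 18-dimension rubric:

```python
# Hypothetical subset of the 18 ADeLe abilities (names illustrative).
ABILITIES = ["attention", "logical_reasoning",
             "quantitative_reasoning", "domain_knowledge"]

# A task's demand profile: how much of each ability it requires, scored 0-5.
arithmetic_task = {
    "prompt": "What is 17 + 25?",
    "demands": {"attention": 1, "logical_reasoning": 1,
                "quantitative_reasoning": 1, "domain_knowledge": 0},
}
olympiad_proof = {
    "prompt": "Prove that the square root of 2 is irrational.",
    "demands": {"attention": 4, "logical_reasoning": 5,
                "quantitative_reasoning": 5, "domain_knowledge": 4},
}

# Evaluating a model across many annotated tasks yields (demand, success)
# pairs per ability; these are the raw material for its ability profile
# (see the curve-fitting sketch below).
results = [
    (arithmetic_task["demands"], True),   # model solved the easy task
    (olympiad_proof["demands"], False),   # model failed the hard one
]
```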
The team applied this framework to 15 LLMs. For each model and each ability, they measured how performance changes with increasing task difficulty, then used the difficulty level at which the model achieves a 50% success rate as its ability score. The results are visualized as radial plots, one per model, that make cross-model comparison direct rather than forcing readers to reconcile results scattered across separate benchmark reports.
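One natural way to extract that 50% threshold is to fit a logistic curve to success rate as a function of demand level and read off its midpoint. The sketch below takes that approach under a simple logistic assumption; the paper's actual curve-fitting procedure may differ, and the success rates here are fabricated for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(d, midpoint, slope):
    """P(success) at demand level d; midpoint is the level at which
    the model succeeds 50% of the time (its ability score)."""
    return 1.0 / (1.0 + np.exp(slope * (d - midpoint)))

# Hypothetical per-level success rates for one model on one ability
# (demand levels 0-5, fraction of tasks solved at each level).
levels = np.array([0, 1, 2, 3, 4, 5], dtype=float)
success = np.array([0.98, 0.95, 0.85, 0.55, 0.20, 0.05])

(midpoint, slope), _ = curve_fit(logistic, levels, success, p0=[3.0, 1.0])
print(f"ability score (50% threshold): {midpoint:.2f}")  # roughly 3.1 here
```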
One finding is that newer models generally outperform older ones, but not uniformly across all 18 abilities. Knowledge-heavy task performance correlates strongly with model size and training data composition, while models explicitly optimized for reasoning show measurable gains specifically on tasks requiring logic, abstraction, and social inference — gains that would be difficult to isolate using conventional benchmark suites.
Prediction accuracy and the reasoning debate
ADeLe’s most concrete claim is predictive: by comparing a model’s ability profile against a new task’s demand profile, the framework can forecast success or failure on that task before any inference is run. In the team’s experiments, this achieved approximately 88% accuracy for models including GPT-4o and LLaMA-3.1-405B, which the paper describes as outperforming traditional methods.
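The paper trains a predictor on demand profiles; as a rough stand-in for intuition, one can simply check whether a new task demands more of any ability than the model's measured score. The threshold rule below is a deliberate simplification of that idea, not ADeLe's actual assessor, and all profiles are invented:

```python
def predict_success(ability_profile: dict, task_demands: dict,
                    margin: float = 0.0) -> bool:
    """Forecast pass/fail before running any inference: predict failure
    if the task demands more of any ability than the model's
    50%-threshold score. (Simplified stand-in; ADeLe's predictor is
    learned from demand features, not a hard cutoff.)"""
    return all(task_demands[a] <= ability_profile[a] + margin
               for a in task_demands)

# Hypothetical profiles on a subset of abilities (0-5 scale).
model_profile = {"logical_reasoning": 3.1, "quantitative_reasoning": 2.8,
                 "domain_knowledge": 3.5}
new_task = {"logical_reasoning": 2, "quantitative_reasoning": 4,
            "domain_knowledge": 1}

print(predict_success(model_profile, new_task))
# False: quantitative demand of 4 exceeds the model's 2.8
```

The appeal of this framing is that a failure prediction comes with a reason attached: the specific ability where demand exceeds capability.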
The framework also offers a way into one of the field’s persistent disputes: whether large language models actually reason or are pattern-matching at scale. According to the Microsoft Research blog post, ADeLe shows the question is poorly framed. Benchmarks labeled as measuring “reasoning” vary substantially in what they actually require — from basic problem-solving to tasks that combine advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones. The difference reflects task requirements, not a shift in model capability between tests.
The post specifically notes that reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models — not only on logic and mathematics but also on interpreting user intent — but that performance declines as task demands increase. The conclusion the team draws is that AI systems can reason, but only up to a point, and ADeLe identifies where that threshold sits for each individual model.
Benchmark diagnosis and design
A secondary use of the framework is diagnostic: ADeLe can assess whether an existing benchmark actually measures what it claims to measure. According to the post, many widely used benchmarks conflate abilities. A test designed to evaluate logical reasoning may in practice depend heavily on specialized knowledge or metacognition. Others cover only a narrow range of difficulty, missing both the simple cases and the hardest ones.
By scoring tasks across all 18 abilities, ADeLe makes these mismatches visible. A benchmark that presents itself as a reasoning test but consistently requires domain knowledge at high difficulty levels is measuring something different from what its label implies. The team describes this as a tool for diagnosing existing benchmarks and informing the design of future ones.
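A minimal sketch of that diagnostic idea: aggregate a benchmark's demand annotations per ability and flag any ability that dominates despite not being the benchmark's stated target. The function name, threshold, and task annotations below are illustrative, not taken from the paper:

```python
from statistics import mean

def diagnose_benchmark(task_demands: list[dict], claimed_ability: str,
                       threshold: float = 2.5) -> list[str]:
    """Flag abilities whose average demand rivals or exceeds the claimed
    target ability (illustrative threshold, not from the paper)."""
    abilities = task_demands[0].keys()
    avg = {a: mean(t[a] for t in task_demands) for a in abilities}
    return [a for a in abilities
            if a != claimed_ability
            and avg[a] >= max(threshold, avg[claimed_ability])]

# A "reasoning" benchmark whose items actually lean on domain knowledge.
tasks = [
    {"logical_reasoning": 2, "domain_knowledge": 4, "attention": 1},
    {"logical_reasoning": 3, "domain_knowledge": 5, "attention": 2},
    {"logical_reasoning": 2, "domain_knowledge": 4, "attention": 1},
]
print(diagnose_benchmark(tasks, "logical_reasoning"))  # ['domain_knowledge']
```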
The framework is designed to extend beyond text-only LLMs. The team describes plans to apply it to multimodal and embodied AI systems, and positions ADeLe as a potential standardized framework for AI research, policymaking, and security auditing. Benchmark annotations, experiment code, and additional resources are available on GitHub.
The value of the approach is straightforward: if you can score a new task’s ability demands before deploying a model on it, you can predict which models will handle it and which will fail, and you can explain the prediction in terms of specific capability gaps rather than aggregate scores. That is a more actionable output than a leaderboard number.