Agreement metrics mischaracterize AI moderation quality in rule-governed environments, new framework shows

A paper posted on arXiv identifies a systematic failure in how content moderation AI systems are evaluated and proposes a replacement framework. The core problem, which the authors call the Agreement Trap, is that standard evaluation methods measure whether an AI’s decision matches a human label. That approach breaks down when moderation is governed by explicit rules, under which several different decisions can each be logically valid.

In rule-governed environments, disagreement between an AI decision and a historical human label does not necessarily mean the AI was wrong. A moderator might have declined to remove a post for reasons that had nothing to do with the governing rule. A different moderator might have removed it, also correctly. When an evaluation system penalizes the AI for the first decision and rewards it for the second, it is measuring conformity to a single historical outcome, not quality of reasoning under the governing policy.

What the framework measures instead

The authors formalize an alternative grounded in policy-correctness rather than label-agreement. The key metrics are the Defensibility Index (DI), which measures whether a decision is logically derivable from the governing rule hierarchy, and the Ambiguity Index (abbreviated AI in the paper, not to be confused with artificial intelligence), which captures how much of the apparent disagreement reflects genuine policy ambiguity rather than decision error.
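To make the contrast between label-agreement and policy-grounded scoring concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `AuditedDecision` record, the three `audit_verdict` values, and the reading of both indices as simple fractions over a set of decisions are assumptions made for illustration, and the four sample records are invented.

```python
from dataclasses import dataclass

@dataclass
class AuditedDecision:
    model_action: str   # what the AI decided, e.g. "remove" or "keep"
    human_label: str    # the historical moderator outcome
    audit_verdict: str  # hypothetical audit output: "defensible", "ambiguous" or "indefensible"

def agreement_rate(decisions):
    """Fraction of decisions that match the historical human label."""
    return sum(d.model_action == d.human_label for d in decisions) / len(decisions)

def defensibility_index(decisions):
    """Assumed reading of DI: fraction of decisions the audit judged derivable from the rules."""
    return sum(d.audit_verdict == "defensible" for d in decisions) / len(decisions)

def ambiguity_index(decisions):
    """Assumed reading of the Ambiguity Index: fraction the audit judged ambiguous under the rules."""
    return sum(d.audit_verdict == "ambiguous" for d in decisions) / len(decisions)

# Invented toy records, not data from the paper.
sample = [
    AuditedDecision("keep", "remove", "defensible"),
    AuditedDecision("remove", "remove", "defensible"),
    AuditedDecision("keep", "remove", "ambiguous"),
    AuditedDecision("keep", "keep", "defensible"),
]

gap_pp = (defensibility_index(sample) - agreement_rate(sample)) * 100
print(f"agreement={agreement_rate(sample):.2f} "
      f"DI={defensibility_index(sample):.2f} "
      f"ambiguity={ambiguity_index(sample):.2f} gap={gap_pp:.1f} pp")
# prints: agreement=0.50 DI=0.75 ambiguity=0.25 gap=25.0 pp
```

In this toy set the first record counts against the AI under agreement scoring even though the audit judged it defensible, which is the pattern the Agreement Trap describes.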

To estimate reasoning stability without additional audit passes, the paper introduces the Probabilistic Defensibility Signal (PDS), derived from audit-model token log-probabilities. Rather than using an LLM to decide whether content violates policy, the framework deploys the audit model to verify whether a proposed decision is logically derivable from the governing rules. The distinction matters: the model is used as a governance verifier, not a classifier.
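The article does not give the formula for PDS, so the following is only a sketch of one plausible way to turn token log-probabilities into a single signal. It assumes the audit model emits a one-token verdict and exposes log-probabilities for a "defensible" token and a "not defensible" token. The function name and both parameters are hypothetical.

```python
import math

def probabilistic_defensibility_signal(logprob_yes: float, logprob_no: float) -> float:
    """Renormalise two verdict-token log-probabilities into P(defensible).

    Assumption: the paper's actual PDS definition may differ. This is a
    two-way softmax over the audit model's verdict tokens.
    """
    # Subtract the larger log-probability first so exp() cannot underflow to 0/0.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)

# A verdict the audit model is fairly sure about scores near 1,
# while a near coin-flip scores near 0.5.
confident = probabilistic_defensibility_signal(math.log(0.9), math.log(0.1))
uncertain = probabilistic_defensibility_signal(math.log(0.4), math.log(0.35))
print(round(confident, 3), round(uncertain, 3))
```

Because the signal comes from a single audit call's token probabilities, it needs no repeated sampling, which is consistent with the stated goal of avoiding additional audit passes.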

Validation at scale

The framework was validated on more than 193,000 Reddit moderation decisions across multiple communities and evaluation cohorts. The paper reports a gap of 33–46.6 percentage points between agreement-based metrics and policy-grounded metrics. In other words, evaluations that measure label agreement substantially understate how often the AI made valid decisions.

The mechanism for that gap is visible in the false negative data. According to the abstract, 79.8–80.6 percent of what agreement-based evaluation classified as the model’s false negatives were actually policy-grounded decisions: cases where the AI’s decision was defensible under the rules but differed from the historical human label.
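A short sketch shows how such a share could be computed. It assumes a false negative means the historical label was "remove" while the AI chose "keep", and that each record carries a boolean `defensible` flag from the audit. The field names and that reading of "false negative" are assumptions, not details taken from the paper.

```python
def defensible_share_of_false_negatives(decisions):
    """Share of agreement-based false negatives that the audit found defensible.

    Each decision is a dict with hypothetical keys:
    'human_label', 'model_action' and 'defensible' (bool).
    Returns None when there are no false negatives to examine.
    """
    false_negatives = [
        d for d in decisions
        if d["human_label"] == "remove" and d["model_action"] == "keep"
    ]
    if not false_negatives:
        return None
    return sum(d["defensible"] for d in false_negatives) / len(false_negatives)

# Invented records for illustration only.
records = [
    {"human_label": "remove", "model_action": "keep", "defensible": True},
    {"human_label": "remove", "model_action": "keep", "defensible": True},
    {"human_label": "remove", "model_action": "keep", "defensible": False},
    {"human_label": "remove", "model_action": "remove", "defensible": True},
    {"human_label": "keep", "model_action": "keep", "defensible": True},
]
share = defensible_share_of_false_negatives(records)
print(f"{share:.1%} of agreement-based false negatives were defensible")
```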

Rule specificity also matters. The paper describes an experiment in which the same 37,286 decisions were audited under three tiers of the same community rules. Adding specificity to the rules reduced the Ambiguity Index by 10.8 percentage points while the Defensibility Index remained stable. The practical implication is that vague rules, not the model’s reasoning quality, are the source of the measured ambiguity.

Automation and risk reduction

The paper describes a Governance Gate built on these signals that achieves 78.6 percent automation coverage with 64.9 percent risk reduction. The gate uses the defensibility and ambiguity signals to determine which decisions can be handled automatically and which require human review: high-ambiguity cases are routed to humans, while decisions that are clearly defensible under policy are automated.
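The routing logic described above can be sketched as a simple threshold rule. The article does not state how the gate combines its signals, so the function below, its parameter names, and the default thresholds of 0.9 and 0.2 are all hypothetical placeholders rather than the paper's configuration.

```python
def governance_gate(defensibility_signal: float,
                    ambiguity_signal: float,
                    min_defensibility: float = 0.9,   # hypothetical threshold
                    max_ambiguity: float = 0.2) -> str:  # hypothetical threshold
    """Route one decision to automation or to human review."""
    # High-ambiguity cases always go to a human, whatever the defensibility score.
    if ambiguity_signal > max_ambiguity:
        return "human_review"
    # Only clearly defensible decisions are automated.
    if defensibility_signal >= min_defensibility:
        return "automate"
    return "human_review"

def automation_coverage(routes):
    """Fraction of decisions the gate handled without human review."""
    return sum(r == "automate" for r in routes) / len(routes)

# Invented (defensibility, ambiguity) pairs for illustration only.
signals = [(0.97, 0.05), (0.95, 0.40), (0.60, 0.10), (0.92, 0.15)]
routes = [governance_gate(d, a) for d, a in signals]
print(routes, automation_coverage(routes))
```

Checking ambiguity first is a design choice in this sketch: it keeps a confident-looking verdict from being automated when the rule itself is unclear.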

The paper’s core argument is that evaluation methodology in rule-governed environments needs to change, because measuring agreement with historical labels is not the same as measuring correctness under explicit policy. The authors describe the 33–46.6 percentage-point measurement gap, sustained across more than 193,000 decisions, as a consistent property of the evaluation approach rather than an artifact of specific data.