A team from the Technology Innovation Institute UAE has released QIMMA — Arabic for “summit” — a leaderboard that validates benchmark quality before running any model evaluations. The motivation is specific: the Arabic NLP evaluation landscape has been expanding rapidly in the number of benchmarks and leaderboards, but the quality of those benchmarks has not kept pace. QIMMA’s approach is to apply a systematic validation pipeline to every sample in every benchmark first, then run models on what survives.
The result is a unified suite of 109 subsets drawn from 14 source benchmarks, totaling over 52,000 samples across seven domains, with 99% native Arabic content. The only exception is code evaluation, which is inherently language-agnostic. QIMMA is described as the only platform combining all five properties the team considers essential: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs.
Why existing Arabic benchmarks have problems
The post identifies four structural weaknesses in the current Arabic evaluation landscape. Many Arabic benchmarks are translations from English, which introduces distributional shifts — questions that feel natural in English become awkward or culturally misaligned in Arabic. Even native Arabic benchmarks are often released without rigorous quality checks, and annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have been documented in established resources. Evaluation scripts and per-sample outputs are rarely released publicly, making auditing and replication difficult. And existing leaderboards cover isolated tasks and narrow domains, making it hard to assess models holistically.
What QIMMA found by actually running validation is that these are not isolated edge cases but systematic patterns, reflecting gaps in how the benchmarks were originally constructed.
The two-stage validation pipeline
Every sample in the suite goes through the two-stage quality pipeline before any model sees it. Stage 1 uses two LLM judges, Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B, chosen for their strong Arabic capability and different training data compositions. Each model independently scores a sample against a 10-point binary rubric. A sample is eliminated when both models score it below 7/10, i.e. both agree on elimination; when only one model flags it, the sample proceeds to human review in Stage 2.
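A minimal sketch of that routing rule, assuming each judge returns an integer rubric score out of 10 (the class, field names, and threshold handling below are illustrative, not QIMMA's actual code):

```python
from dataclasses import dataclass

# Illustrative Stage 1 routing rule; the threshold and field names are assumptions.
PASS_THRESHOLD = 7  # a sample "fails" a judge if it scores below 7/10


@dataclass
class Stage1Scores:
    qwen_score: int      # rubric score from Qwen3-235B-A22B-Instruct
    deepseek_score: int  # rubric score from DeepSeek-V3-671B


def route_sample(scores: Stage1Scores) -> str:
    """Return 'keep', 'eliminate', or 'human_review' for one sample."""
    flags = [scores.qwen_score < PASS_THRESHOLD,
             scores.deepseek_score < PASS_THRESHOLD]
    if all(flags):   # both judges agree the sample fails, so drop it
        return "eliminate"
    if any(flags):   # only one judge flags it, so send it to Stage 2 human review
        return "human_review"
    return "keep"    # both judges score it 7/10 or higher


# One judge flags the sample, so it is routed to native-speaker review rather than dropped.
assert route_sample(Stage1Scores(qwen_score=6, deepseek_score=9)) == "human_review"
```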
Stage 2 routes flagged samples to native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on cultural context and regional variation, dialectal nuance, subjective interpretation, and subtle quality issues that automated assessment may miss. For culturally sensitive content, multiple perspectives are considered, since correctness can genuinely vary across Arab regions.
The taxonomy of issues found spans four categories: answer quality problems including false or mismatched gold indices and factually wrong answers; text and formatting quality issues including corrupt text, spelling errors, and duplicate samples; cultural sensitivity failures including stereotype reinforcement and monolithic generalizations about diverse communities; and gold answer compliance issues where answers are misaligned with evaluation protocols.
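As a rough data-model sketch, the taxonomy could be encoded as a flat label set like the following (a hypothetical encoding for illustration only; QIMMA's actual annotation schema may name or group the categories differently):

```python
from enum import Enum


# Hypothetical labels mirroring the four issue categories described above;
# the project's real annotation schema may differ.
class IssueCategory(Enum):
    ANSWER_QUALITY = "answer_quality"              # wrong or mismatched gold index, factually wrong answer
    TEXT_FORMATTING = "text_formatting"            # corrupt text, spelling errors, duplicate samples
    CULTURAL_SENSITIVITY = "cultural_sensitivity"  # stereotype reinforcement, monolithic generalizations
    GOLD_COMPLIANCE = "gold_compliance"            # gold answer misaligned with the evaluation protocol
```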
Code benchmarks required different handling. Rather than discarding samples, the team refined Arabic problem statements in 3LM’s Arabic adaptations of HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged. Modifications fell into five categories: linguistic refinement toward natural Modern Standard Arabic, clarity improvements for ambiguous instructions, consistency normalization of mathematical terminology and punctuation, structural corrections for broken code formatting, and semantic refinements clarifying edge cases like inclusive versus exclusive ranges.
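The constraint reads naturally as an invariant check: for each refined sample, everything except the Arabic problem statement must match the 3LM original. A minimal sketch, assuming EvalPlus-style sample fields (task_id, prompt, canonical_solution, test); the exact field layout of the 3LM adaptations is an assumption here:

```python
# Hedged sketch: refinement may touch only the Arabic problem statement ("prompt").
# Field names assume an EvalPlus-style sample layout; the 3LM adaptation may differ.
def check_refinement(original: dict, refined: dict) -> None:
    for field in ("task_id", "canonical_solution", "test"):
        assert refined[field] == original[field], f"{field} must stay unchanged"


# Tiny illustrative pair (real samples come from 3LM's HumanEval+ / MBPP+ adaptations).
original = {
    "task_id": "HumanEval/0",
    "prompt": "...",              # original Arabic problem statement
    "canonical_solution": "...",
    "test": "...",
}
refined = dict(original, prompt="...")  # only the statement text is refined
check_refinement(original, refined)
```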
Leaderboard results
Results as of April 2026 cover the top 10 of 46 evaluated models. The post reports that scale does not guarantee best performance — the top 10 spans models from 32B to 397B parameters, with several mid-size models outperforming larger ones on specific domains.
Arabic-specialized models lead on cultural and linguistic tasks: Jais-2-70B-Chat ranks highest on ArabicMMLU and ArabCulture, while Karnak leads on 3LM STEM and ArabLegalQA. Coding is the hardest domain for Arabic-specialized models — the top HumanEval+ and MBPP+ scores belong to multilingual models, with Qwen3.5-397B leading both. Across the full 46-model set, a clear but imperfect size-performance correlation emerges, with notable exceptions: Arabic-specialized models often outperform size-matched multilingual models, instruction-tuned models consistently outperform their base counterparts (except Qwen3), and some smaller Arabic-specialized models such as Fanar-1-9B and ALLaM-7B outperform much larger multilingual models on specific domains.
Evaluation uses LightEval, EvalPlus, and FannOrFlop, with six standardized prompt templates by question format. All prompts are in Arabic. The leaderboard, code, and the paper “Are Arabic Benchmarks Reliable? QIMMA’s Quality-First Approach to LLM Evaluation” are publicly available. The approach — validate the measuring instrument before reporting measurements — is more rigorous than the standard practice of running models on benchmarks as-published, and the systematic quality failures QIMMA found suggest the approach is worth the investment.