frontier

31 stories

Zvi Mowshowitz reviews Claude Opus 4.7 benchmark results and user reactions

In the second of three posts on Opus 4.7, Mowshowitz reviews official benchmark scores, community reactions, and usage recommendations, characterising the model as a substantial improvement over Opus 4.6 for coding while noting issues with adaptive thinking and instruction following.

Apr 27, 2026

1 source · independent

frontier Official

Stanford CRFM integrates Rasch-model adaptive testing into HELM to cut evaluation costs

Stanford CRFM's adaptive testing, based on item response theory (IRT), dynamically selects the most informative questions for each model, achieving an AUC-ROC of 0.85 on training sets across 22 HELM datasets while matching average-score rankings.
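To make the idea of "most informative question" concrete, here is a minimal sketch of Rasch-model item selection. It is not Stanford CRFM's implementation. The function names, the difficulty values, and the standard-normal prior on ability are all assumptions chosen for illustration. Under the Rasch model, a model with ability theta answers a question of difficulty b correctly with probability sigmoid(theta - b), and a question's Fisher information p(1 - p) is highest when its difficulty is close to the current ability estimate.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct answer under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of an item at ability theta: p * (1 - p)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

def pick_next_item(theta, difficulties, asked):
    """Return the index of the unasked item with the highest information."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

def update_ability(theta, responses, difficulties, steps=20):
    """Newton steps toward the MAP ability estimate.

    responses maps item index -> 1 (correct) or 0 (incorrect).
    Assumption: a standard-normal prior on theta, which keeps the
    estimate finite when every answer so far is right or wrong.
    """
    for _ in range(steps):
        grad = sum(y - rasch_prob(theta, difficulties[i])
                   for i, y in responses.items()) - theta
        curvature = sum(item_information(theta, difficulties[i])
                        for i in responses) + 1.0
        theta += grad / curvature
        theta = max(-6.0, min(6.0, theta))  # guard against overshoot
    return theta

# Hypothetical item bank: difficulties would normally be fitted beforehand.
difficulties = [-2.0, 0.1, 3.0]
theta = 0.0
first = pick_next_item(theta, difficulties, asked=set())
theta = update_ability(theta, {first: 1}, difficulties)
```

An adaptive evaluation loop would alternate these two steps, asking the selected question, recording the result, and re-estimating ability, until a question budget is spent. That is how this kind of scheme can use fewer questions than a full benchmark pass.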

Stanford CRFM releases HELM Long Context leaderboard, with GPT-4.1 leading at mean score 0.588

Stanford CRFM launches HELM Arabic, a leaderboard for evaluating LLMs on Arabic benchmarks

ChatGPT Images added a road sign reading 'WHY ARE YOU LIKE THIS' without being asked

Databricks' own OfficeQA benchmark shows GPT-5.5 scoring 52.63% on full-agent enterprise eval, up from GPT-5.4's 36.10%

Databricks adds native GPT-5.5 support with Unity AI Gateway governance

Adept releases Fuyu-Heavy, a multimodal model the company says ranks third behind GPT-4V and Gemini Ultra

Google will invest as much as $40 billion in Anthropic at a $350 billion valuation

OpenAI releases GPT-5.5 system card covering predeployment safety evaluations

Top logits can leak task-irrelevant image information as readily as full residual stream projections

The Verge on GPT-5.5: Musk trial timing, Anthropic rivalry, and token efficiency

OpenAI releases GPT-5.5 with faster agentic coding and fewer tokens per task

Apple research generates realistic long-term motion with a 64x temporally compressed embedding

Google commits up to $40B to Anthropic, adds 5 gigawatts of cloud capacity over five years

GPT-5.5 launches alongside Codex's expansion into a broader agent workspace

Simula treats synthetic data generation as mechanism design, not sample-by-sample prompting

Three LLM agents wrote 600,000 lines of code and ran 850 experiments to win a Kaggle competition

Apple at ICLR 2026: RNN parallelization, tool-augmented SSMs, unified image models, and more

ReasoningBank gives agents a memory that learns from failures, not just successes

Google DeepMind partners with five global consultancies to deploy frontier AI at enterprise scale

Microsoft Research releases AutoAdapt, an open-source framework for LLM domain adaptation

DeepSeek V4 cuts inference costs sharply and validates on Huawei's Ascend accelerators

ParaRNN: Apple researchers train a 7B-parameter nonlinear RNN, competitive with transformers

Apple researchers build a context understanding benchmark and find quantized models degrade unevenly

Google Photos can now reframe your shots from a new camera angle after the fact

Google DeepMind releases Gemini 3.1 Flash TTS with audio tags, 70+ languages, and SynthID watermarking

Google's Vantage uses AI avatars to assess skills like critical thinking in adaptive conversations

Google DeepMind's Decoupled DiLoCo trains LLMs across data centers on standard internet bandwidth

ADeLe scores models and tasks on the same 18-ability scale to predict performance before deployment

Thinking Machines Lab is raiding Meta as fast as Meta is raiding it