Adept introduced Fuyu-Heavy on January 24, 2024, describing it as a multimodal model designed specifically for digital agents. According to the company’s announcement post, Fuyu-Heavy is “the world’s third-most-capable multimodal model, behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger.”
The company says the model will shortly be powering its enterprise product.
Benchmark results
The post includes benchmark tables comparing Fuyu-Heavy to Gemini Pro, Gemini Ultra, GPT-4 Turbo, Mistral Medium, Claude 2.0, Inflection-2, and Grok-1 across a range of evaluations.
On standard text-only benchmarks, Fuyu-Heavy scores 72.1 on MMLU, 82.9 on GSM8K, 29.5 on MATH, and 58.0 on HumanEval. The post reports Gemini Pro at 71.8, 86.5, 32.6, and 67.7 on the same tasks, noting the Gemini GSM8K result used a majority vote across 32 samples (Maj1@32). The post says Fuyu-Heavy “performs roughly on par with Gemini Pro on standard text-only evaluations, outperforming it on the commonly used MMLU benchmark.” Inflection-2 scores higher on several text evaluations, which the post attributes to it being “a much larger model.”
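The Maj1@32 notation means the Gemini figure is not a single-sample score: 32 answers are sampled per question and the most common one is graded. A minimal sketch of that scoring scheme follows. The function names and the exact-match comparison are illustrative assumptions, since the post does not describe how answers were extracted or compared.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer among the samples, or None if empty.

    Ties are resolved however Counter.most_common orders them.
    """
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def maj1_at_k(samples_per_question: Sequence[Sequence[str]],
              references: Sequence[str]) -> float:
    """Percentage of questions whose majority-voted answer equals the reference.

    Each inner sequence holds the k sampled final answers for one question
    (k = 32 for Maj1@32). Exact string match is an assumption of this sketch.
    """
    if len(samples_per_question) != len(references):
        raise ValueError("need one reference per question")
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_question, references)
    )
    return 100.0 * correct / len(references)
```

Because voting over many samples usually helps on arithmetic tasks, a Maj1@32 score is not directly comparable to a single-sample score on the same benchmark.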
On MMMU, a multimodal benchmark, Fuyu-Heavy scores 48.3 against Gemini Pro’s 47.9 and Gemini Ultra’s 59.4. The post says Fuyu-Heavy also outperforms Gemini Pro on VQAv2 (76.2 vs 71.2), AI2D (81.2 vs 73.9), and ChartQA (75.4 vs 74.1). AI2D is the one benchmark on which the post says Fuyu-Heavy also outperforms Gemini Ultra (81.2 vs 79.5).
For chat evaluation, the post describes a variant called Fuyu-Heavy Chat, produced by running the base model through supervised fine-tuning followed by direct preference optimization on publicly available chat data. On MT-Bench, Fuyu-Heavy Chat scores 8.01 versus Claude 2.0’s 8.06 and Mistral Medium’s 8.61. On AlpacaEval 1.0, it scores 92.20% versus Claude 2.0’s 91.60% and GPT-4 Turbo’s 97.70%.
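For readers unfamiliar with direct preference optimization, the sketch below shows the standard per-example DPO objective from the research literature. It is not Adept's implementation: the post gives no training details, and the function name, argument names, and the default beta of 0.1 are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each *_logp is the summed log-probability of a full response under the
    policy being trained or under the frozen reference (here, the SFT model).
    beta = 0.1 is an assumed value, not one reported in the post.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is how preference data shapes the chat variant without a separate reward model.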
The post notes that VQAv2 is “quite flawed,” linking to an external post for context, but includes the numbers for completeness.
Training context
The post explains that scaling the Fuyu architecture, first described in the earlier Fuyu-8B release, required addressing several technical challenges. Image data stresses training infrastructure: the post says memory usage increases, cloud storage ingress and egress become limiting factors, and handling image formats consistently between training and inference is difficult. Image models are also “famously unstable,” the post says, citing an external paper, which led the team to make “substantial” tweaks to the architecture and training procedure. The post also says high-quality image pre-training data is scarce and required significant collection and curation work, and that the team had to develop “recipes for striking this balance at scale” between text and image tasks.
Adept says it spent roughly four months on these problems before training Fuyu-Heavy. The post credits the Microsoft Azure team with providing and servicing the training cluster, and NVIDIA with collaborations on model efficiency and networking.
The post states the team has “already applied lessons learned from Fuyu-Heavy to train its successor.”
Sample model outputs
The post includes several extended examples of Fuyu-Heavy responses to MMMU questions. In one, the model is given a table of foods and illness rates and asked which food is most likely the cause of a food poisoning outbreak. The model computes the percentage of people who ate each food and became ill and identifies potato salad at 70.4% as the highest. In another, it applies the capital asset pricing model to calculate a required rate of return, arriving at 16%. In a third, it performs a chi-square test of independence and concludes that ages and net worth are independent, matching answer B (1.76).
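The three calculations in these examples are standard textbook procedures, sketched below in plain Python. The post summary above gives only the final answers, so the inputs used in the tests and in any trial run are made up for illustration and are not the values from the MMMU questions. The function names are also hypothetical.

```python
from typing import Dict, List, Tuple


def attack_rates(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each food to the percentage of its eaters who became ill.

    counts[food] = (number who ate it and got ill, number who ate it).
    """
    return {food: 100.0 * ill / ate for food, (ill, ate) in counts.items()}


def capm_required_return(risk_free: float, beta: float,
                         market_return: float) -> float:
    """CAPM: required return = risk-free rate + beta * market risk premium."""
    return risk_free + beta * (market_return - risk_free)


def chi_square_statistic(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of observed counts.

    Expected counts assume the row and column variables are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat
```

In the outbreak question, the suspect food is the one with the highest attack rate, for example `max(rates, key=rates.get)`. In the chi-square question, the statistic is compared with a critical value for the table's degrees of freedom; a small statistic is consistent with independence.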
Next steps
The post lists three areas the company says it is building on: fundamental scaling research to improve base model capabilities; converting base models into agents through reward modeling, self-play, and inference-time search; and connecting agents to the world to build products. The post invites engineers to apply through the company’s careers page.