AI & LLMs · June 22, 2026 · 11 min read

Fugu Ultra Benchmarks Explained: SWE-Bench, GPQA & More

Sakana Fugu Ultra reports SWE-Bench Pro 73.7, TerminalBench 2.1 82.1, and GPQA-Diamond 95.5, putting it shoulder to shoulder with Fable 5 and Mythos Preview. But these are vendor-published numbers for an orchestration model, not a single LLM, which makes them easy to misread. This guide explains each benchmark, what the scores do and do not prove, and how to run your own evaluation before trusting the headline.

Lushbinary Team

AI & Cloud Solutions

When Sakana AI launched Fugu Ultra on June 22, 2026, the headline was hard to miss: SWE-Bench Pro 73.7, TerminalBench 2.1 82.1, GPQA-Diamond 95.5, framed as standing with Anthropic's Fable 5 and Mythos Preview and ahead of GPT-5.5 and Claude Opus 4.8. For an orchestration model that trains no frontier model of its own, that is a bold claim.

It is also easy to misread. These are vendor-published numbers, and they describe a system that routes across a pool of models, not a single LLM. A high score can reflect good routing as much as raw model power. That is exactly what Fugu is built to do, but it means the numbers answer a slightly different question than a normal model card does.

This guide explains each benchmark in plain terms, what the scores do and do not prove for an orchestration model, the caveats a careful reviewer would raise, and how to run an evaluation that actually reflects your workload before you trust the headline.

Read this first

Every Fugu Ultra figure in this article is vendor-reported by Sakana AI as of June 2026 and has not been independently reproduced. Treat the numbers as a reason to test, not as settled fact.

What This Guide Covers

  1. The Headline Numbers
  2. SWE-Bench Pro: Agentic Coding
  3. TerminalBench 2.1: Command-Line Tasks
  4. GPQA-Diamond: Hard Science Reasoning
  5. Why a System Score Is Not a Model Score
  6. The Caveats a Reviewer Would Raise
  7. How to Run Your Own Evaluation
  8. Why Lushbinary for Model Evaluation

1. The Headline Numbers

Here are the three figures Sakana put front and center for Fugu Ultra, the flagship tier with model id fugu-ultra-20260615.

| Benchmark | Fugu Ultra (reported) | Domain |
| --- | --- | --- |
| SWE-Bench Pro | 73.7 | Software engineering |
| TerminalBench 2.1 | 82.1 | Agentic terminal tasks |
| GPQA-Diamond | 95.5 | Graduate-level science |

Sakana's framing is that Fugu Ultra reaches this level by orchestrating a pool of strong models rather than by training one giant model. That is the interesting part, and also the reason the numbers deserve a careful read rather than a screenshot.

2. SWE-Bench Pro: Agentic Coding

SWE-Bench Pro is a harder version of SWE-Bench. It asks a model to resolve real software issues drawn from production repositories: read the codebase, make a multi-file change, and produce a patch that passes the project's tests. It is one of the better proxies for agentic coding because the tasks cannot be solved by recall alone.

A 73.7 here is a strong result if it holds, and it is also a place where orchestration could genuinely help: a plan-execute-verify loop across specialists tends to beat a single pass on multi-step code changes. That is the optimistic reading. The cautious reading is that coding benchmarks are sensitive to harness setup and retries, so the number is most meaningful when you can reproduce it under conditions like your own.
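To make the retry sensitivity concrete, here is a minimal sketch of how a retry budget alone can lift a pass rate. It assumes every attempt succeeds independently with the same probability, which is a simplification (attempts on the same task are usually correlated). The function name and the 50% rate are illustrative and are not taken from Sakana's harness.

```python
def pass_within_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    # Hypothetical single-attempt solve rate of 50% under growing retry budgets.
    for k in (1, 2, 3, 5):
        print(f"attempts={k}: pass rate = {pass_within_k(0.5, k):.4f}")
```

Under this assumption, allowing three attempts turns a 50% single-attempt rate into 87.5% with no change to the model. That is why a reported score is hard to compare unless the retry budget is stated.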

3. TerminalBench 2.1: Command-Line Tasks

TerminalBench evaluates how well a model operates in a real terminal: running commands, reading output, and completing multi-step tasks in a shell environment. It is a good signal for agentic competence, since the model has to act, observe, and adjust rather than emit one answer.

Fugu Ultra's reported 82.1 is the kind of number that, if reproducible, points to solid tool-use and recovery behavior. For buyers, this is the benchmark closest to what an autonomous coding or ops agent actually does, so it is worth weighting heavily in your own tests.

4. GPQA-Diamond: Hard Science Reasoning

GPQA-Diamond is a set of graduate-level, Google-proof science questions designed so that non-experts cannot answer them even with web access. High scores indicate strong multi-step reasoning in physics, chemistry, and biology.

Why very high GPQA scores need a second look

As models approach the mid-90s on GPQA-Diamond, the remaining gap is small and sensitive to prompting, sampling, and possible contamination. A 95.5 is impressive, but at that altitude the difference between models is often within noise. Treat it as a strong signal, not a precise ranking.
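A rough way to see the "within noise" point is to put a confidence interval around the score. The sketch below uses a Wilson score interval and assumes 198 questions (the commonly cited size of the Diamond subset) and a single scored run. Sakana's actual sampling setup is not described in this article, so treat the inputs as assumptions.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half


if __name__ == "__main__":
    # Assumptions: 198 questions and one scored run; the reported 95.5
    # is treated as an accuracy of 0.955.
    low, high = wilson_interval(0.955, 198)
    print(f"approx. 95% interval: {low * 100:.1f} to {high * 100:.1f}")
```

Under these assumptions the interval is roughly six points wide, so two systems a point or two apart on this benchmark cannot be cleanly separated by the headline number alone.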

5. Why a System Score Is Not a Model Score

This is the crux. Fugu Ultra is an orchestration model: it routes, delegates, verifies, and synthesizes across a pool of models, as we explain in the Sakana Fugu orchestration model guide. Its benchmark score is a property of the whole system, not of one model's weights.

That changes how you should interpret a win. It does not mean Sakana built a model smarter than Fable 5. It means the combination of routing plus the underlying pool produced that result on that test. For many buyers that distinction does not matter: you care about the answer you get, not how it was produced. But it has two practical consequences:

  • The result depends on the pool. If the underlying models in the pool change, the system score can move without any change you control.
  • Cost and latency are part of the score's context. Reaching a number by fanning out across several models and verifying can use more tokens and time than a single-model result at a similar score.
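The second consequence can be turned into a number by comparing cost per solved task instead of pass rate alone. The sketch below is illustrative only: the class name, token counts, and prices are made up and do not describe Fugu Ultra's real pricing or token usage.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one system over the same task set."""
    name: str
    tasks: int
    solved: int
    total_tokens: int       # all tokens billed, including internal fan-out
    usd_per_million: float  # assumed blended price per million tokens

    @property
    def pass_rate(self) -> float:
        return self.solved / self.tasks

    @property
    def cost_per_solved(self) -> float:
        """Total spend divided by solved tasks; infinite if nothing was solved."""
        total_cost = self.total_tokens / 1_000_000 * self.usd_per_million
        return total_cost / self.solved if self.solved else float("inf")


if __name__ == "__main__":
    # Every figure below is hypothetical.
    single = RunStats("single-model", 100, 70, 7_000_000, 10.0)
    fanout = RunStats("orchestrated", 100, 74, 22_200_000, 10.0)
    for run in (single, fanout):
        print(f"{run.name}: pass={run.pass_rate:.2f}, "
              f"usd_per_solved={run.cost_per_solved:.2f}")
```

In this made-up case the orchestrated run solves four more tasks out of 100 but costs about three times as much per solved task. Whether that trade is worth it depends on your workload, which is the reason to log both figures.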

6. The Caveats a Reviewer Would Raise

  • Vendor-published. No independent lab had reproduced these at launch. That is normal for a launch, but it caps how much weight they deserve.
  • Comparisons to restricted models. Fable 5 and Mythos were inaccessible to much of the world at launch, so head-to-head re-testing is hard for outside parties.
  • Harness sensitivity. Coding and terminal benchmarks shift with retries, time limits, and scaffolding. The same model can post different numbers under different harnesses.
  • Cost not in the headline. A score says nothing about the tokens spent to reach it. For an orchestration model that may fan out across several models per task, that hidden cost is likely to be larger than usual.

7. How to Run Your Own Evaluation

The fix for vendor-published uncertainty is your own eval. A workmanlike approach:

  • Assemble 30 to 100 tasks from your real backlog, with clear pass/fail criteria, instead of leaning on public sets.
  • Run Fugu Ultra and your current model on the same tasks, same harness, same retry budget.
  • Record quality, latency, and tokens per task. For Fugu Ultra, log the full token use so internal fan-out shows up in the cost.
  • Score blind where you can, and have a human review a sample so you are not trusting an automated judge alone.
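The steps above can be sketched as a very small harness. This is a minimal outline, not a production tool: a "model" here is any Python callable returning an answer and a token count, and the names `Task`, `Result`, `run_eval`, and `summarize` are hypothetical. Wiring the callables to a real API, and adding the blind human review, is left out on purpose.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Assumed interface: a model takes a prompt and returns (answer, tokens_used).
ModelFn = Callable[[str], tuple[str, int]]


@dataclass
class Task:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # pass/fail criterion for this task


@dataclass
class Result:
    task_id: str
    model: str
    passed: bool
    latency_s: float
    tokens: int


def run_eval(tasks: list[Task], models: dict[str, ModelFn],
             max_attempts: int = 1) -> list[Result]:
    """Run every model on every task with the same retry budget."""
    results = []
    for task in tasks:
        for name, model in models.items():
            passed, tokens = False, 0
            start = time.perf_counter()
            for _ in range(max_attempts):
                answer, used = model(task.prompt)
                tokens += used  # failed attempts still cost tokens
                if task.check(answer):
                    passed = True
                    break
            elapsed = time.perf_counter() - start
            results.append(Result(task.task_id, name, passed, elapsed, tokens))
    return results


def summarize(results: list[Result]) -> dict[str, dict[str, float]]:
    """Per-model pass rate, mean latency, and mean tokens per task."""
    summary = {}
    for name in sorted({r.model for r in results}):
        rows = [r for r in results if r.model == name]
        summary[name] = {
            "pass_rate": sum(r.passed for r in rows) / len(rows),
            "mean_latency_s": sum(r.latency_s for r in rows) / len(rows),
            "mean_tokens": sum(r.tokens for r in rows) / len(rows),
        }
    return summary


if __name__ == "__main__":
    # Stub models standing in for real API clients.
    tasks = [Task("t1", "2+2", lambda a: a.strip() == "4"),
             Task("t2", "capital of France", lambda a: "Paris" in a)]
    models = {"candidate": lambda p: ("4" if p == "2+2" else "Paris", 120),
              "incumbent": lambda p: ("4", 40)}
    for name, stats in summarize(run_eval(tasks, models)).items():
        print(name, stats)
```

The design choice that matters is that both systems share one task list, one `check` per task, and one `max_attempts`, so any difference in the summary comes from the models and not from the harness.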

Our eval-driven development guide walks through building this harness. The payoff is that you replace a launch headline with a number that reflects your workload, which is the only number that should drive a production decision.

8. Why Lushbinary for Model Evaluation

Benchmarks sell models. Evals ship products. The teams that avoid expensive model mistakes are the ones that test candidates on their own tasks before committing, and that measure cost and latency alongside quality.

Lushbinary builds evaluation harnesses and model-selection processes that cut through launch hype. We will assemble an eval set from your real work, run Fugu Ultra against your incumbents, and give you a ranking grounded in your workload rather than a vendor's slide.

🚀 Free Consultation

Want to know whether Fugu Ultra is actually better for your tasks, not just on a benchmark? Lushbinary will design and run the evaluation and give you the numbers with no obligation.

❓ Frequently Asked Questions

What are Fugu Ultra's benchmark scores?

As reported by Sakana AI at the June 2026 launch, Fugu Ultra scores 73.7 on SWE-Bench Pro, 82.1 on TerminalBench 2.1, and 95.5 on GPQA-Diamond, positioned with Fable 5 and Mythos Preview and ahead of GPT-5.5 and Claude Opus 4.8 on these tests.

Are Fugu Ultra's benchmarks independently verified?

Not yet. The scores are vendor-published as of launch. They are evidence to validate, not independent proof. Run your own evaluation on tasks that match your workload before relying on them.

Why is a Fugu Ultra benchmark different from a single-model benchmark?

Fugu Ultra is an orchestration model that routes across a pool, so its score reflects the whole system: selection, delegation, verification, and synthesis. A high number can come from excellent routing as much as any one model's raw ability.

What is SWE-Bench Pro?

A harder variant of SWE-Bench that tests whether a model can resolve real software issues from production repositories. It is a strong signal for agentic coding because tasks require multi-step changes that must pass tests.

Should I trust the GPQA-Diamond score of 95.5?

Treat it as a strong signal, not a precise ranking. In the mid-90s, scores are sensitive to prompting, sampling, and possible contamination, so the honest read is that Fugu Ultra appears strong on hard reasoning, pending independent reproduction.

Sources

Content was rephrased for compliance with licensing restrictions. Benchmark figures are vendor-reported by Sakana AI as of June 2026 and have not been independently reproduced. Scores and comparisons may change. Always verify on Sakana's website and run your own evaluation.
