When Sakana AI launched Fugu Ultra on June 22, 2026, the headline was hard to miss: 73.7 on SWE-Bench Pro, 82.1 on TerminalBench 2.1, and 95.5 on GPQA-Diamond, framed as standing with Anthropic's Fable 5 and Mythos Preview and ahead of GPT-5.5 and Claude Opus 4.8. For an orchestration model from a vendor that trains no frontier model of its own, that is a bold claim.

It is also easy to misread. These are vendor-published numbers, and they describe a system that routes across a pool of models, not a single LLM. A high score can reflect good routing as much as raw model power. Routing is exactly what Fugu is built to do, but it means the numbers answer a slightly different question than a normal model card does.

This guide explains each benchmark in plain terms, what the scores do and do not prove for an orchestration model, the caveats a careful reviewer would raise, and how to run an evaluation that actually reflects your workload before you trust the headline.

Read this first

Every Fugu Ultra figure in this article is vendor-reported by Sakana AI as of June 2026 and has not been independently reproduced. Treat the numbers as a reason to test, not as settled fact.

What This Guide Covers

The Headline Numbers
SWE-Bench Pro: Agentic Coding
TerminalBench 2.1: Command-Line Tasks
GPQA-Diamond: Hard Science Reasoning
Why a System Score Is Not a Model Score
The Caveats a Reviewer Would Raise
How to Run Your Own Evaluation
Why Lushbinary for Model Evaluation

1. The Headline Numbers

Here are the three figures Sakana put front and center for Fugu Ultra, the flagship tier with model id fugu-ultra-20260615.

Benchmark	Fugu Ultra (reported)	Domain
SWE-Bench Pro	73.7	Software engineering
TerminalBench 2.1	82.1	Agentic terminal tasks
GPQA-Diamond	95.5	Graduate-level science

Sakana's framing is that Fugu Ultra reaches this level by orchestrating a pool of strong models rather than by training one giant model. That is the interesting part, and also the reason the numbers deserve a careful read rather than a screenshot.

2. SWE-Bench Pro: Agentic Coding

SWE-Bench Pro is a harder version of SWE-Bench. It asks a model to resolve real software issues drawn from production repositories: read the codebase, make a multi-file change, and produce a patch that passes the project's tests. It is one of the better proxies for agentic coding because the tasks cannot be solved by recall alone.

A 73.7 here is a strong result if it holds, and it is also a place where orchestration could genuinely help: a plan-execute-verify loop across specialists tends to beat a single pass on multi-step code changes. That is the optimistic reading. The cautious reading is that coding benchmarks are sensitive to harness setup and retries, so the number is most meaningful when you can reproduce it under conditions like your own.

3. TerminalBench 2.1: Command-Line Tasks

TerminalBench evaluates how well a model operates in a real terminal: running commands, reading output, and completing multi-step tasks in a shell environment. It is a good signal for agentic competence, since the model has to act, observe, and adjust rather than emit one answer.

Fugu Ultra's reported 82.1 is the kind of number that, if reproducible, points to solid tool-use and recovery behavior. For buyers, this is the benchmark closest to what an autonomous coding or ops agent actually does, so it is worth weighting heavily in your own tests.

4. GPQA-Diamond: Hard Science Reasoning

GPQA-Diamond is a set of graduate-level, Google-proof science questions designed so that non-experts cannot answer them even with web access. High scores indicate strong multi-step reasoning in physics, chemistry, and biology.

Why very high GPQA scores need a second look

As models approach the mid-90s on GPQA-Diamond, the remaining gap is small and sensitive to prompting, sampling, and possible contamination. A 95.5 is impressive, but at that altitude the difference between models is often within noise. Treat it as a strong signal, not a precise ranking.
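To make "within noise" concrete, here is a rough sketch of the sampling error on a score like this. It assumes the commonly cited 198-question size of the Diamond subset and a single scored pass per question; Sakana's actual protocol (number of runs, averaging, prompting) is not described in this article, so treat the output as an order-of-magnitude illustration only.

```python
import math

def score_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Normal-approximation interval for a benchmark accuracy.

    Returns (low, high, standard_error), all as fractions between 0 and 1.
    The approximation is crude when accuracy is close to 1.
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - z * se, accuracy + z * se, se

# Assumption: 198 questions, one scored answer per question.
low, high, se = score_interval(0.955, 198)
print(f"standard error: {se * 100:.1f} points")
print(f"approx. 95% interval: {low * 100:.1f} to {high * 100:.1f}")
```

Under these assumptions the standard error is about 1.5 points and the interval runs from roughly 92.6 to 98.4, so two systems a point or two apart in the mid-90s cannot be separated by this test alone.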

5. Why a System Score Is Not a Model Score

This is the crux. Fugu Ultra is an orchestration model: it routes, delegates, verifies, and synthesizes across a pool of models, as we explain in the Sakana Fugu orchestration model guide. Its benchmark score is a property of the whole system, not of one model's weights.

That changes how you should interpret a win. It does not mean Sakana built a model smarter than Fable 5. It means the combination of routing plus the underlying pool produced that result on that test. For many buyers that distinction does not matter: you care about the answer you get, not how it was produced. But it has two practical consequences:

The result depends on the pool. If the underlying models in the pool change, the system score can move without any change you control.
Cost and latency are part of the score's context. Reaching a number by fanning out across several models and verifying can use more tokens and time than a single-model result at a similar score.
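One way to put cost next to the score is to compare cost per solved task rather than pass rate alone. The sketch below is illustrative: the `RunSummary` type, the token counts, and the flat price per million tokens are all invented for the example and are not Sakana pricing or measured Fugu Ultra usage.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    name: str
    tasks: int
    solved: int
    total_tokens: int              # every token billed, including internal fan-out
    usd_per_million_tokens: float  # hypothetical flat price

    def pass_rate(self) -> float:
        return self.solved / self.tasks

    def cost_per_solved(self) -> float:
        """Total spend divided by the number of tasks actually solved."""
        if self.solved == 0:
            return float("inf")
        total_cost = self.total_tokens / 1_000_000 * self.usd_per_million_tokens
        return total_cost / self.solved

# Invented figures for illustration only.
single = RunSummary("single-model", tasks=100, solved=70,
                    total_tokens=4_000_000, usd_per_million_tokens=10.0)
orchestrated = RunSummary("orchestrated", tasks=100, solved=74,
                          total_tokens=12_000_000, usd_per_million_tokens=10.0)

for run in (single, orchestrated):
    print(f"{run.name}: pass rate {run.pass_rate():.0%}, "
          f"${run.cost_per_solved():.2f} per solved task")
```

With these made-up inputs the orchestrated run solves four more tasks but costs about $1.62 per solved task against about $0.57, which is the kind of trade-off a headline score hides.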

6. The Caveats a Reviewer Would Raise

Vendor-published. No independent lab had reproduced these at launch. That is normal for a launch, but it caps how much weight they deserve.
Comparisons to restricted models. Fable 5 and Mythos Preview were inaccessible to much of the world at launch, so head-to-head re-testing is hard for outside parties.
Harness sensitivity. Coding and terminal benchmarks shift with retries, time limits, and scaffolding. The same model can post different numbers under different harnesses.
Cost not in the headline. A score says nothing about the tokens spent to reach it. For an orchestration model, that gap is larger than usual.
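The harness-sensitivity caveat is easy to see with a little arithmetic. The sketch below assumes each attempt at a task succeeds independently with the same probability, which is an idealization (real retries are correlated), and the function name `pass_with_retries` is ours, not part of any benchmark's tooling.

```python
def pass_with_retries(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** attempts

for k in (1, 2, 3, 5):
    print(f"{k} attempt(s): {pass_with_retries(0.5, k):.1%}")
```

Under that assumption, a task solved half the time in one attempt is solved 87.5% of the time with three. Two reports of the same benchmark are only comparable if the retry budget is the same.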

7. How to Run Your Own Evaluation

The fix for vendor-published uncertainty is your own eval. A workmanlike approach:

Assemble 30 to 100 tasks from your real backlog, with clear pass/fail criteria, instead of leaning on public sets.
Run Fugu Ultra and your current model on the same tasks, same harness, same retry budget.
Record quality, latency, and tokens per task. For Fugu Ultra, log the full token use so internal fan-out shows up in the cost.
Score blind where you can, and have a human review a sample so you are not trusting an automated judge alone.
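The steps above can be sketched as a small harness. This is a minimal outline under stated assumptions, not a finished tool: `Task`, `Result`, `run_eval`, and `summarize` are hypothetical names, and each model is represented by a plain function that takes a prompt and returns the output text plus total tokens used. Wiring those functions to a real API is left out, and the stub runners at the bottom stand in for real models.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A runner takes a prompt and returns (output_text, total_tokens_used).
ModelRunner = Callable[[str], Tuple[str, int]]

@dataclass
class Task:
    task_id: str
    prompt: str
    passes: Callable[[str], bool]  # your own pass/fail criterion

@dataclass
class Result:
    task_id: str
    model: str
    passed: bool
    latency_s: float
    tokens: int

def run_eval(tasks: List[Task], runners: Dict[str, ModelRunner],
             max_attempts: int = 1) -> List[Result]:
    """Run every model on every task with the same retry budget."""
    results: List[Result] = []
    for task in tasks:
        for name, run in runners.items():
            passed, tokens = False, 0
            start = time.perf_counter()
            for _ in range(max_attempts):
                output, used = run(task.prompt)
                tokens += used  # count tokens from failed attempts too
                if task.passes(output):
                    passed = True
                    break
            latency = time.perf_counter() - start
            results.append(Result(task.task_id, name, passed, latency, tokens))
    return results

def summarize(results: List[Result]) -> Dict[str, Dict[str, float]]:
    """Per-model pass rate, mean latency, and mean tokens per task."""
    grouped: Dict[str, List[Result]] = {}
    for r in results:
        grouped.setdefault(r.model, []).append(r)
    return {
        model: {
            "pass_rate": sum(r.passed for r in rs) / len(rs),
            "mean_latency_s": sum(r.latency_s for r in rs) / len(rs),
            "tokens_per_task": sum(r.tokens for r in rs) / len(rs),
        }
        for model, rs in grouped.items()
    }

# Stub runners standing in for real model calls.
def stub_candidate(prompt: str) -> Tuple[str, int]:
    return ("patched", 120)

def stub_incumbent(prompt: str) -> Tuple[str, int]:
    return ("no change", 300)

tasks = [
    Task("t1", "fix bug 1", lambda out: out == "patched"),
    Task("t2", "fix bug 2", lambda out: out == "patched"),
]
results = run_eval(tasks, {"candidate": stub_candidate,
                           "incumbent": stub_incumbent})
summary = summarize(results)
print(summary)
```

Keeping the per-task `Result` rows, rather than only the averages, is what makes blind scoring and human spot checks possible later.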

Our eval-driven development guide walks through building this harness. The payoff is that you replace a launch headline with a number that reflects your workload, which is the only number that should drive a production decision.
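With only 30 to 100 tasks, a gap of a few tasks can be chance. Because both systems run on the same tasks, one simple check is an exact sign test on the tasks where they disagree (often called an exact McNemar test). The sketch below is a suggestion of ours rather than something from Sakana's launch material, and the counts in the example are invented.

```python
from math import comb

def exact_paired_p(only_a: int, only_b: int) -> float:
    """Two-sided exact p-value from discordant pairs.

    only_a: tasks system A passed and system B failed.
    only_b: tasks system B passed and system A failed.
    Tasks where both pass or both fail carry no information here.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented example: A wins 9 disputed tasks, B wins 3.
p = exact_paired_p(9, 3)
print(f"p-value: {p:.3f}")
```

In this invented case the p-value is about 0.146, so a 9 to 3 split on disputed tasks is suggestive but not strong evidence that one system is better on your workload.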

8. Why Lushbinary for Model Evaluation

Benchmarks sell models. Evals ship products. The teams that avoid expensive model mistakes are the ones that test candidates on their own tasks before committing, and that measure cost and latency alongside quality.

Lushbinary builds evaluation harnesses and model-selection processes that cut through launch hype. We will assemble an eval set from your real work, run Fugu Ultra against your incumbents, and give you a ranking grounded in your workload rather than a vendor's slide.

🚀 Free Consultation

Want to know whether Fugu Ultra is actually better for your tasks, not just on a benchmark? Lushbinary will design and run the evaluation and give you the numbers with no obligation.

❓ Frequently Asked Questions

What are Fugu Ultra's benchmark scores?

As reported by Sakana AI at the June 2026 launch, Fugu Ultra scores 73.7 on SWE-Bench Pro, 82.1 on TerminalBench 2.1, and 95.5 on GPQA-Diamond, positioned with Fable 5 and Mythos Preview and ahead of GPT-5.5 and Claude Opus 4.8 on these tests.

Are Fugu Ultra's benchmarks independently verified?

Not yet. The scores are vendor-published as of launch. They are evidence to validate, not independent proof. Run your own evaluation on tasks that match your workload before relying on them.

Why is a Fugu Ultra benchmark different from a single-model benchmark?

Fugu Ultra is an orchestration model that routes across a pool, so its score reflects the whole system: selection, delegation, verification, and synthesis. A high number can come from excellent routing as much as any one model's raw ability.

What is SWE-Bench Pro?

A harder variant of SWE-Bench that tests whether a model can resolve real software issues from production repositories. It is a strong signal for agentic coding because tasks require multi-step changes that must pass tests.

Should I trust the GPQA-Diamond score of 95.5?

Treat it as a strong signal, not a precise ranking. In the mid-90s, scores are sensitive to setup and possible contamination, so the honest read is that Fugu Ultra appears strong on hard reasoning, pending independent reproduction.

Sources

Content was rephrased for compliance with licensing restrictions. Benchmark figures are vendor-reported by Sakana AI as of June 2026 and have not been independently reproduced. Scores and comparisons may change. Always verify on Sakana's website and run your own evaluation.

Evaluate Fugu Ultra on Your Real Tasks

We will build the eval set, run Fugu Ultra against your incumbents, and hand you quality, latency, and cost numbers you can trust.

Fugu Ultra Benchmarks Explained: SWE-Bench, GPQA & More
