AI & LLMs · June 27, 2026 · 11 min read

GPT-5.6 Sol Benchmarks Deep Dive: TerminalBench & Biology

GPT-5.6 Sol posts 88.8% on TerminalBench 2.1 (91.9% in Sol Ultra mode), edging Claude Mythos 5 at 88.0%, and jumps about 9 points on SecureBio biology evals. This deep dive explains what the numbers measure, why sub-point gaps sit within noise, the Sol Ultra compute tradeoff, methodology caveats, and why you should run your own evals.

Lushbinary Team

AI & Cloud Solutions

When OpenAI announced GPT-5.6 and its flagship Sol mode on June 26, 2026, the number that traveled fastest was 88.8% on TerminalBench 2.1. It landed just ahead of Claude Mythos 5 at 88.0% and well clear of the publicly available Claude Opus 4.8 at 78.9%. The launch also surfaced a quieter but arguably more consequential set of results: a jump in biology reasoning on the SecureBio evaluations. This piece reads the numbers carefully and separates signal from noise.

The goal here is not to crown a winner. It is to explain what each benchmark measures, which gaps are real, which sit inside measurement noise, and what the biology results imply for both capability and safety. We use only the figures OpenAI and reputable tech press published, and we flag every claim that is reported rather than independently verified.

One honesty note up front: GPT-5.6 shipped as a limited preview, so most teams cannot run their own tests against it yet. That access caveat shapes how much weight any of these numbers should carry in a real model-selection decision.

What This Deep Dive Covers

  1. What GPT-5.6 Was Tested On
  2. TerminalBench 2.1 Results In Full
  3. What TerminalBench 2.1 Measures, And Why 0.8 Points Is Noise
  4. The SecureBio Biology Results
  5. What The Biology Numbers Imply For Safety And Capability
  6. The Sol Ultra Mode Tradeoff
  7. Token Efficiency, As Reported
  8. Methodology Caveats And Running Your Own Evals
  9. How Sol Compares To The Public Field
  10. Why Lushbinary For Eval Harnesses And Model Selection

1. What GPT-5.6 Was Tested On

OpenAI framed GPT-5.6 around two themes at launch: agentic engineering and scientific reasoning. The agentic story is anchored by TerminalBench 2.1, a benchmark that runs models through real terminal-driven engineering tasks. The scientific story is anchored by the SecureBio biology evaluations, a set of capability tests that double as a safety signal. The model ships in tiers, with Sol as the flagship, Terra and Luna below it, and Sol Ultra as a compute-intensive mode that pushes scores higher at higher cost.

GPT-5.6 follows GPT-5.5, which shipped April 23, 2026. Sol is priced at $5 input and $30 output per million tokens, Terra at $2.50 and $15, and Luna at $1 and $6. OpenAI also reported that the mid-tier Terra matches Claude Fable 5, both at 84.3% on TerminalBench 2.1, while edging out the prior-generation GPT-5.5 at 83.4%. For pricing strategy across the line, our GPT-5.6 Sol vs Mythos 5 vs Gemini comparison covers access and cost in more depth.

2. TerminalBench 2.1 Results In Full

Here is the full field as published at the GPT-5.6 launch, ordered by score. The top of the table is tight, and the spread between the best GPT-5.6 mode and the public reference model is about 13 points.

| Model | TerminalBench 2.1 | Notes |
| --- | --- | --- |
| GPT-5.6 Sol Ultra | 91.9% | Compute-intensive mode, top published figure |
| GPT-5.6 Sol | 88.8% | Flagship default mode |
| Claude Mythos 5 | 88.0% | 0.8 pt behind Sol, within noise |
| GPT-5.6 Terra | 84.3% | Mid-tier, tied with Claude Fable 5 |
| Claude Fable 5 | 84.3% | Tied with Terra at the mid tier |
| GPT-5.5 | 83.4% | Prior generation flagship |
| GPT-5.6 Luna | 82.5% | Budget tier in the GPT-5.6 line |
| Claude Opus 4.8 | 78.9% | Generally available public reference |
| Gemini 3.1 Pro Preview | 70.7% | Lowest of the published field |

Reading The Table

The headline rivalry is Sol at 88.8% versus Mythos 5 at 88.0%. The genuinely large gaps are between the flagship modes and the budget and public reference rows: Luna at 82.5% and Opus 4.8 at 78.9%. If you only remember one thing, remember that Sol and Mythos 5 are effectively tied, while the gaps to the rows below them are large enough to be real.

[Figure: TerminalBench 2.1 scores for the nine models listed in the table. Source: OpenAI, GPT-5.6 announcement.]

3. What TerminalBench 2.1 Measures, And Why 0.8 Points Is Noise

TerminalBench 2.1 evaluates agentic, terminal-driven engineering: running shell commands, editing files, executing multi-step tasks, and recovering from errors across a session. It is closer to how a coding agent actually works than a single-shot code completion test, which is why OpenAI led with it for Sol.

The catch is that agentic benchmarks are noisy. Scores move with random seeds, harness configuration, tool timeouts, retry policy, and which subset of tasks a run happens to sample. A 0.8 point gap between Sol and Mythos 5 sits comfortably inside the band where two runs of the same model can disagree. Treat it as a statistical tie, not a ranking. The honest read is that both models are at the current frontier for this kind of work.

Noise, In Plain Terms

A sub-1-point gap on an agentic benchmark is not a reliable signal of which model is better. The 3 point lift from Sol to Sol Ultra is more likely to be real, because it comes from a deliberate change in compute rather than run-to-run variance. Size your conclusions to the size of the gap.
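To make the noise argument concrete, here is a minimal sketch of the sampling error on a pass-rate benchmark. It treats every task as an independent pass/fail trial, which is a simplification, and the task count `N_TASKS` is a hypothetical placeholder because our sources do not state how many tasks a TerminalBench 2.1 run covers. The function names are ours, for illustration only.

```python
import math

def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks independent tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

def gap_in_std_errors(rate_a: float, rate_b: float, n_tasks: int) -> float:
    """Gap between two pass rates, in units of the standard error of their difference.

    Assumes two independent runs of n_tasks each; a paired comparison on the
    same tasks would need per-task results, which are not published.
    """
    se_diff = math.hypot(
        pass_rate_std_error(rate_a, n_tasks),
        pass_rate_std_error(rate_b, n_tasks),
    )
    return (rate_a - rate_b) / se_diff

N_TASKS = 100  # hypothetical task count, NOT taken from the benchmark

sol, mythos = 0.888, 0.880  # scores as published
z = gap_in_std_errors(sol, mythos, N_TASKS)
print(f"Sol vs Mythos 5 gap: {z:.2f} standard errors at n={N_TASKS}")
```

Under this placeholder, the 0.8 point gap is a small fraction of one standard error. Seeds, timeouts, and retry policy add variance on top of pure sampling error, so the real band is probably wider than this sketch suggests.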

Alongside the coding results, OpenAI published exploit-finding benchmarks, ExploitBench and ExploitGym, that probe how well a model discovers and exercises software vulnerabilities. Across this family, GPT-5.6 Sol leads the GPT-5.6 lineup and approaches Mythos-class results, with the smaller Terra and Luna modes trailing the flagship. As with the biology evaluations, these are tracked partly because the same skills carry dual-use risk, so we keep the read qualitative rather than quoting precise figures off the charts.

[Figure: ExploitBench, cap percent against output tokens. GPT-5.6 Sol leads the GPT-5.6 family and approaches Claude Mythos 5, ahead of GPT-5.5, GPT-5.4, and Claude Opus 4.8. Source: OpenAI, GPT-5.6 announcement.]
[Figure: ExploitGym, intended exploits against output tokens under 2-hour and 6-hour time limits. GPT-5.6 Sol achieves the highest share, followed by GPT-5.6 Terra and GPT-5.6 Luna. Source: OpenAI, GPT-5.6 announcement.]

4. The SecureBio Biology Results

The more striking story sits in biology. On the SecureBio evaluations, OpenAI reported measurable gains over GPT-5.5, roughly 9 points higher overall. These are capability scores on biology-knowledge tests, rendered here exactly as published.

| SecureBio Evaluation | GPT-5.6 Sol |
| --- | --- |
| Virology Capabilities Test | 53.5% |
| Molecular Biology | 60.0% |
| Human Pathogen Capabilities | 68.4% |
| World-Class Biology | 68.3% |

The roughly 9 point overall lift over GPT-5.5 is the part OpenAI emphasized. The Virology Capabilities Test at 53.5% is the lowest of the four, while Human Pathogen Capabilities at 68.4% and World-Class Biology at 68.3% sit at the top. We report these as published and do not extrapolate beyond them.

[Figure: GeneBench v1, score against output tokens. GPT-5.6 Sol scores highest, ahead of GPT-5.6 Terra, GPT-5.5, and GPT-5.6 Luna. Source: OpenAI, GPT-5.6 announcement.]

5. What The Biology Numbers Imply For Safety And Capability

A 9 point jump in biology capability is a double-edged result. On the capability side, it points to a model that is more useful for legitimate scientific work: literature synthesis, hypothesis framing, and helping researchers reason through molecular biology. On the safety side, the same gains are exactly why these evaluations exist. Higher scores on pathogen and virology tests are tracked because they map to dual-use risk, and rising numbers are a signal that warrants more guardrails, not fewer.

This is the most plausible reason the rollout was constrained. The US government requested a limited rollout for GPT-5.6, and OpenAI complied while warning that such restrictions should not become the norm. When a model posts meaningful gains on dual-use biology evaluations, a cautious release path is a reasonable response, even if it frustrates developers waiting for access.

6. The Sol Ultra Mode Tradeoff

Sol Ultra is the clearest example of a deliberate compute-for-quality trade in the lineup. It lifts TerminalBench 2.1 from 88.8% to 91.9%, about 3 points, by spending more compute per task. Unlike the sub-1-point gap between Sol and Mythos 5, this gain is large enough and mechanistic enough to take seriously. The question is economic: 3 points of benchmark accuracy is worth a lot on a high-value autonomous task and very little on a routine one. Reserve Ultra for the hardest jobs and let standard Sol carry the rest.
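The economic question can be framed as a break-even calculation. The sketch below is illustrative only: it uses the published TerminalBench 2.1 scores as a stand-in for per-task success probability, which may not hold on your workload, and the per-task costs are hypothetical because our sources give no Sol Ultra pricing.

```python
def break_even_task_value(p_base: float, p_upgraded: float,
                          cost_base: float, cost_upgraded: float) -> float:
    """Task value above which the upgraded mode has higher expected net value.

    Expected net value of a mode = p_success * task_value - cost_per_task,
    so the upgrade wins when (p_upgraded - p_base) * task_value exceeds the extra cost.
    """
    delta_p = p_upgraded - p_base
    if delta_p <= 0:
        raise ValueError("upgraded mode must have a higher success rate")
    return (cost_upgraded - cost_base) / delta_p

# Published TerminalBench 2.1 scores, used here as a stand-in for success probability.
P_SOL, P_ULTRA = 0.888, 0.919
# Hypothetical per-task costs in dollars; the source publishes no Sol Ultra pricing.
COST_SOL, COST_ULTRA = 0.50, 1.50

threshold = break_even_task_value(P_SOL, P_ULTRA, COST_SOL, COST_ULTRA)
print(f"Ultra pays off when a completed task is worth more than ${threshold:.2f}")
```

With these made-up costs, a one dollar premium per task only pays off when a completed task is worth roughly $32 or more. Swap in your own measured costs and success rates before drawing conclusions.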

7. Token Efficiency, As Reported

OpenAI reported that GPT-5.6 uses roughly 10 to 15% fewer tokens than GPT-5.5 on comparable work. We frame this as a reported improvement rather than an independently verified one, because token counts depend heavily on prompt style, task mix, and how the agent is configured. If it holds on your workload, it partly offsets Sol's premium per-token pricing, since fewer tokens at $5 and $30 per million can land close to a cheaper model that talks more. The only way to know is to measure it on your own traffic.
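The arithmetic behind that offset is simple enough to sketch. Several things here are assumptions rather than facts from the announcement: the token volumes are hypothetical, the reported 10 to 15% reduction is applied to output tokens only, and the reference model is assumed to emit the full, unreduced volume. The reference pricing is the approximate Opus 4.8 figure quoted in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # dollars per 1M input tokens
    output_per_million: float  # dollars per 1M output tokens

def workload_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                  output_reduction: float = 0.0) -> float:
    """Dollar cost of a workload, optionally with a fractional cut in output tokens."""
    if not 0.0 <= output_reduction < 1.0:
        raise ValueError("output_reduction must be in [0, 1)")
    effective_output = output_tokens * (1.0 - output_reduction)
    return (input_tokens * pricing.input_per_million
            + effective_output * pricing.output_per_million) / 1_000_000

SOL = Pricing(5.00, 30.00)        # as quoted in this article
REFERENCE = Pricing(5.00, 25.00)  # approximate Opus 4.8 pricing quoted in this article

# Hypothetical volume; substitute your own measured traffic.
IN_TOKENS, OUT_TOKENS = 1_000_000, 1_000_000

for reduction in (0.0, 0.10, 0.15):
    cost = workload_cost(SOL, IN_TOKENS, OUT_TOKENS, reduction)
    print(f"Sol, {reduction:.0%} fewer output tokens: ${cost:.2f}")
print(f"Reference at full volume: ${workload_cost(REFERENCE, IN_TOKENS, OUT_TOKENS):.2f}")
```

The result is sensitive to the input-to-output ratio of your traffic, which is one more reason to plug in measured numbers rather than the placeholders.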

8. Methodology Caveats And Running Your Own Evals

Every number above came from the vendor, run under the vendor's harness on the vendor's task distribution. That is normal and not a criticism, but it is a reason to be careful. Benchmark leaderboards tell you how a model performs on someone else's tasks, not yours. The gap between a leaderboard cell and your production workload is often larger than the gap between two models on that leaderboard.

Run Your Own Evals

Build a small eval harness on 30 to 50 tasks drawn from your real workload, with clear pass-fail criteria. Run each candidate model several times to estimate variance, then compare distributions, not single scores. A sub-1-point benchmark gap should never outrank what your own harness tells you. When a model is gated, as Sol is, lean even harder on a fallback you can actually test.
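A harness along those lines does not need to be elaborate. The following is a minimal sketch, not a production design: `make_fake_model` is a hypothetical stand-in for a real API client, the 40 tasks are placeholders, and the candidate names and success probabilities are invented so the example runs without network access.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Task = Tuple[str, Callable[[str], bool]]  # (prompt, pass/fail check on the output)

def run_eval(model: Callable[[str], str], tasks: Sequence[Task], n_runs: int = 5) -> List[float]:
    """Run the full task set n_runs times and return one pass rate per run."""
    rates = []
    for _ in range(n_runs):
        passed = sum(1 for prompt, check in tasks if check(model(prompt)))
        rates.append(passed / len(tasks))
    return rates

def summarize(rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run pass rates."""
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return statistics.mean(rates), spread

def make_fake_model(p_success: float, seed: int) -> Callable[[str], str]:
    """Stand-in for a real API client: answers 'ok' with probability p_success."""
    rng = random.Random(seed)
    return lambda prompt: "ok" if rng.random() < p_success else "fail"

# 40 placeholder tasks; real ones would come from your own workload.
tasks: List[Task] = [(f"task-{i}", lambda out: out == "ok") for i in range(40)]

for name, model in [("candidate-a", make_fake_model(0.85, seed=1)),
                    ("candidate-b", make_fake_model(0.80, seed=2))]:
    mean, spread = summarize(run_eval(model, tasks))
    print(f"{name}: mean pass rate {mean:.3f}, run-to-run std dev {spread:.3f}")
```

Reporting the spread next to the mean is the point of the exercise: if two candidates differ by less than their run-to-run standard deviation, the harness has not separated them.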

On the broader industry trend, vendors have reported steady progress on tests like SWE-bench Verified across recent model generations. We do not attach a specific GPT-5.6 SWE-bench number here because our sources did not publish one, and we will not invent a figure to round out a table.

9. How Sol Compares To The Public Field

For most teams, the relevant comparison is not Sol versus Mythos 5, since both are gated. It is Sol versus what you can actually call today. The cleanest public reference in the TerminalBench 2.1 table is Claude Opus 4.8 at 78.9%, a generally available model at roughly $5 input and $25 output per million tokens. Sol is about 10 points ahead on this benchmark, but you cannot deploy it broadly yet, which changes the calculus.

That makes Opus 4.8 the practical anchor for production planning right now. For a full breakdown of the generally available options, our Claude Opus 4.8 vs GPT-5.5 benchmarks and pricing comparison digs into the numbers you can act on today.

10. Why Lushbinary For Eval Harnesses And Model Selection

Reading a benchmark table is the easy part. Turning it into a model choice that survives contact with production is the work that actually matters. Lushbinary builds eval harnesses on your real tasks, so you can compare candidate models on your workload instead of someone else's leaderboard, and we wire up the routing and fallback logic that keeps you working when a frontier model is gated behind a limited preview.
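As a rough illustration of what fallback logic can look like, here is a minimal sketch of priority-ordered routing. Everything in it is hypothetical: `ModelUnavailable`, `gated_preview`, and `public_model` are placeholder names, not part of any vendor SDK, and a real client would map its own error types onto the exception.

```python
from typing import Callable, Sequence, Tuple

class ModelUnavailable(Exception):
    """Raised by a client when its model is gated, rate-limited, or unreachable."""

Client = Callable[[str], str]

def call_with_fallback(prompt: str, clients: Sequence[Tuple[str, Client]]) -> Tuple[str, str]:
    """Try each (name, client) in priority order; return the first (name, response)."""
    errors = []
    for name, client in clients:
        try:
            return name, client(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no model available: " + "; ".join(errors))

def gated_preview(prompt: str) -> str:
    # Hypothetical client for a preview model this account cannot reach.
    raise ModelUnavailable("access not granted")

def public_model(prompt: str) -> str:
    # Hypothetical client for a generally available model.
    return f"echo: {prompt}"

used, reply = call_with_fallback(
    "list files", [("preview", gated_preview), ("public", public_model)]
)
print(used, reply)
```

Returning the name of the model that actually answered keeps the eval data honest, since results from the fallback should not be attributed to the preview model.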

Whether you are weighing a gated preview like Sol or standardizing on a generally available model like Opus 4.8, we help you measure what counts, control cost, and avoid lock-in as the frontier keeps moving.

Frequently Asked Questions

What did OpenAI actually benchmark GPT-5.6 Sol on?

At the June 26, 2026 announcement, OpenAI led with TerminalBench 2.1 for agentic coding and a set of SecureBio biology evaluations. The headline coding number was 88.8% for Sol and 91.9% for the Sol Ultra mode. On biology, OpenAI reported the Virology Capabilities Test at 53.5%, Molecular Biology at 60.0%, Human Pathogen Capabilities at 68.4%, and World-Class Biology at 68.3%, roughly 9 points above GPT-5.5 overall. We use only these published figures, which are vendor-reported rather than independently verified, and keep everything else qualitative.

Is Sol really better than Claude Mythos 5 at coding?

On TerminalBench 2.1, Sol scores 88.8% and Mythos 5 scores 88.0%. That 0.8 point gap is small enough to treat as a tie for practical purposes. A single benchmark run carries variance from seeds, harness configuration, and task sampling, so a sub-1-point difference should not decide your model choice. Run your own evaluation on your workload before concluding either model is meaningfully ahead.

What does Sol Ultra mode actually buy you?

Sol Ultra is a compute-intensive mode. It lifts the TerminalBench 2.1 score from 88.8% to 91.9%, about 3 points, by spending more compute per task. Whether that trade is worth it depends on the value of each completed task and your latency and cost tolerance. For most workloads the standard Sol mode is the sensible default, with Ultra reserved for the hardest jobs.

Can I use GPT-5.6 today?

Not generally. GPT-5.6 launched as a limited preview, and the US government requested a limited rollout. OpenAI complied while publicly warning that such restrictions should not become the norm. Plan for access that can change, and keep a generally available fallback such as Claude Opus 4.8 wired into your stack.

What is the context window of GPT-5.6?

OpenAI has not officially confirmed a context window for GPT-5.6. GPT-5.5 offered up to 1M tokens, and GPT-5.6 is widely expected to match that, but treat the figure as expected and unconfirmed until OpenAI states it.

Why should I run my own evals if OpenAI already published numbers?

Vendor benchmarks are run under the vendor's harness on the vendor's task distribution. Your codebase, prompts, tools, and acceptance criteria differ. The reported token efficiency gain of roughly 10 to 15% over GPT-5.5 is also a reported figure, not an independently verified one. A small in-house eval harness on your real tasks tells you far more than a leaderboard cell.

Sources

Content was rephrased for compliance with licensing restrictions. Pricing and benchmark data sourced from official OpenAI announcements and reputable tech press as of June 27, 2026. Figures may change; always verify with the vendor.
