When OpenAI announced GPT-5.6 and its flagship Sol mode on June 26, 2026, the number that traveled fastest was 88.8% on TerminalBench 2.1. It landed just ahead of Claude Mythos 5 at 88.0% and well clear of the publicly available Claude Opus 4.8 at 78.9%. The launch also surfaced a quieter but arguably more consequential set of results: a jump in biology reasoning on the SecureBio evaluations. This piece reads the numbers carefully and separates signal from noise.
The goal here is not to crown a winner. It is to explain what each benchmark measures, which gaps are real, which sit inside measurement noise, and what the biology results imply for both capability and safety. We use only the figures OpenAI and reputable tech press published, and we flag every claim that is reported rather than independently verified.
One honesty note up front: GPT-5.6 shipped as a limited preview, so most teams cannot run their own tests against it yet. That access caveat shapes how much weight any of these numbers should carry in a real model-selection decision.
What This Deep Dive Covers
- What GPT-5.6 Was Tested On
- TerminalBench 2.1 Results In Full
- What TerminalBench 2.1 Measures, And Why 0.8 Points Is Noise
- The SecureBio Biology Results
- What The Biology Numbers Imply For Safety And Capability
- The Sol Ultra Mode Tradeoff
- Token Efficiency, As Reported
- Methodology Caveats And Running Your Own Evals
- How Sol Compares To The Public Field
- Why Lushbinary For Eval Harnesses And Model Selection
1. What GPT-5.6 Was Tested On
OpenAI framed GPT-5.6 around two themes at launch: agentic engineering and scientific reasoning. The agentic story is anchored by TerminalBench 2.1, a benchmark that runs models through real terminal-driven engineering tasks. The scientific story is anchored by the SecureBio biology evaluations, a set of capability tests that double as a safety signal. The model ships in tiers, with Sol as the flagship, Terra and Luna below it, and Sol Ultra as a compute-intensive mode that pushes scores higher at higher cost.
GPT-5.6 follows GPT-5.5, which shipped April 23, 2026. Sol is priced at $5 input and $30 output per million tokens, Terra at $2.50 and $15, and Luna at $1 and $6. OpenAI also reported that the mid-tier Terra matches Claude Fable 5, both at 84.3% on TerminalBench 2.1, and sits 0.9 points above the prior-generation GPT-5.5 at 83.4%, a gap small enough to fall inside the noise band discussed later in this piece. For pricing strategy across the line, our GPT-5.6 Sol vs Mythos 5 vs Gemini comparison covers access and cost in more depth.
2. TerminalBench 2.1 Results In Full
Here is the full field as published at the GPT-5.6 launch, ordered by score. The top of the table is tight, and the spread between the best GPT-5.6 mode and the public reference model is about 13 points.
| Model | TerminalBench 2.1 | Notes |
|---|---|---|
| GPT-5.6 Sol Ultra | 91.9% | Compute-intensive mode, top published figure |
| GPT-5.6 Sol | 88.8% | Flagship default mode |
| Claude Mythos 5 | 88.0% | 0.8 pt behind Sol, within noise |
| GPT-5.6 Terra | 84.3% | Mid-tier, tied with Claude Fable 5 |
| Claude Fable 5 | 84.3% | Tied with Terra at the mid tier |
| GPT-5.5 | 83.4% | Prior generation flagship |
| GPT-5.6 Luna | 82.5% | Budget tier in the GPT-5.6 line |
| Claude Opus 4.8 | 78.9% | Generally available public reference |
| Gemini 3.1 Pro Preview | 70.7% | Lowest of the published field |
Reading The Table
The headline rivalry is Sol at 88.8% versus Mythos 5 at 88.0%, a gap too small to rank. The genuinely large gaps are between the flagship modes and the budget and public reference rows: Luna at 82.5% and Opus 4.8 at 78.9%. If you only remember one thing, remember that Sol and Mythos 5 are effectively tied and the gaps to the field below them are real.
3. What TerminalBench 2.1 Measures, And Why 0.8 Points Is Noise
TerminalBench 2.1 evaluates agentic, terminal-driven engineering: running shell commands, editing files, executing multi-step tasks, and recovering from errors across a session. It is closer to how a coding agent actually works than a single-shot code completion test, which is why OpenAI led with it for Sol.
The catch is that agentic benchmarks are noisy. Scores move with random seeds, harness configuration, tool timeouts, retry policy, and which subset of tasks a run happens to sample. A 0.8 point gap between Sol and Mythos 5 sits comfortably inside the band where two runs of the same model can disagree. Treat it as a statistical tie, not a ranking. The honest read is that both models are at the current frontier for this kind of work.
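To make the noise argument concrete, here is a minimal sketch that treats each published score as an independent binomial pass rate and asks how many standard errors the 0.8 point gap spans. The task counts are hypothetical, since we do not have the number of tasks behind each published run, and the independence assumption ignores seed and harness effects, which would only widen the band.

```python
import math


def gap_in_standard_errors(p_a: float, p_b: float, n_tasks: int) -> float:
    """Return |p_a - p_b| divided by the standard error of the difference,
    treating each score as an independent binomial pass rate over n_tasks."""
    var_a = p_a * (1 - p_a) / n_tasks
    var_b = p_b * (1 - p_b) / n_tasks
    return abs(p_a - p_b) / math.sqrt(var_a + var_b)


# Published TerminalBench 2.1 scores from the table above.
SOL, MYTHOS_5 = 0.888, 0.880

# HYPOTHETICAL task counts: the real number behind each run is not known here.
for n in (100, 500, 1000):
    z = gap_in_standard_errors(SOL, MYTHOS_5, n)
    print(f"n={n:>4}: gap spans {z:.2f} standard errors")
```

At each of these assumed sizes the gap stays below one standard error, which is consistent with reading Sol and Mythos 5 as tied.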
Noise, In Plain Terms
A sub-1-point gap on an agentic benchmark is not a reliable signal of which model is better. The 3 point lift from Sol to Sol Ultra is more likely to be real, because it comes from a deliberate change in compute rather than run-to-run variance. Size your conclusions to the size of the gap.
Alongside the coding results, OpenAI published exploit-finding benchmarks, ExploitBench and ExploitGym, that probe how well a model discovers and exercises software vulnerabilities. Across this family, GPT-5.6 Sol leads the GPT-5.6 lineup and approaches Mythos-class results, with the smaller Terra and Luna modes trailing the flagship. As with the biology evaluations, these are tracked partly because the same skills carry dual-use risk, so we keep the read qualitative rather than quoting precise figures off the charts.
4. The SecureBio Biology Results
The more striking story sits in biology. On the SecureBio evaluations, OpenAI reported measurable gains over GPT-5.5, roughly 9 points higher overall. These are capability scores on biology-knowledge tests, rendered here exactly as published.
| SecureBio Evaluation | GPT-5.6 Sol |
|---|---|
| Virology Capabilities Test | 53.5% |
| Molecular Biology | 60.0% |
| Human Pathogen Capabilities | 68.4% |
| World-Class Biology | 68.3% |
The roughly 9 point overall lift over GPT-5.5 is the part OpenAI emphasized. The Virology Capabilities Test at 53.5% is the lowest of the four, while Human Pathogen Capabilities at 68.4% and World-Class Biology at 68.3% sit at the top. We report these as published and do not extrapolate beyond them.
5. What The Biology Numbers Imply For Safety And Capability
A 9 point jump in biology capability is a double-edged result. On the capability side, it points to a model that is more useful for legitimate scientific work: literature synthesis, hypothesis framing, and helping researchers reason through molecular biology. On the safety side, the same gains are exactly why these evaluations exist. Higher scores on pathogen and virology tests are tracked because they map to dual-use risk, and rising numbers are a signal that warrants more guardrails, not fewer.
These dual-use gains are the most plausible reason the rollout was constrained. The US government requested a limited rollout for GPT-5.6, and OpenAI complied while warning that such restrictions should not become the norm. When a model posts meaningful gains on dual-use biology evaluations, a cautious release path is a reasonable response, even if it frustrates developers waiting for access.
6. The Sol Ultra Mode Tradeoff
Sol Ultra is the clearest example of a deliberate compute-for-quality trade in the lineup. It lifts TerminalBench 2.1 from 88.8% to 91.9%, about 3 points, by spending more compute per task. Unlike the sub-1-point gap between Sol and Mythos 5, this gain is large enough and mechanistic enough to take seriously. The question is economic: 3 points of benchmark accuracy is worth a lot on a high-value autonomous task and very little on a routine one. Reserve Ultra for the hardest jobs and let standard Sol carry the rest.
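One way to frame that economic question is a break-even calculation: the most you should pay extra per task for Ultra is the lift in success rate times what a successful task is worth. The sketch below assumes the benchmark lift transfers one-to-one to your workload, which is a strong assumption, and the dollar values are hypothetical. Ultra pricing is not among the figures we have, so the function returns a spending ceiling rather than a price comparison.

```python
def breakeven_extra_cost(p_base: float, p_ultra: float, value_per_success: float) -> float:
    """Largest extra spend per task at which the higher success rate still pays for itself."""
    return (p_ultra - p_base) * value_per_success


# Published TerminalBench 2.1 scores for the two modes.
SOL, SOL_ULTRA = 0.888, 0.919

# HYPOTHETICAL dollar values of one successfully completed task.
for label, value in (("high-value autonomous task", 500.0), ("routine task", 2.0)):
    ceiling = breakeven_extra_cost(SOL, SOL_ULTRA, value)
    print(f"{label}: pay at most ${ceiling:.2f} extra per task")
```

Under these made-up values the ceiling is roughly $15.50 for the high-value task and about 6 cents for the routine one, which is the asymmetry the paragraph above describes.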
7. Token Efficiency, As Reported
OpenAI reported that GPT-5.6 uses roughly 10 to 15% fewer tokens than GPT-5.5 on comparable work. We frame this as a reported improvement rather than an independently verified one, because token counts depend heavily on prompt style, task mix, and how the agent is configured. If it holds on your workload, it partly offsets Sol's premium per-token pricing, since fewer tokens at $5 and $30 per million can land close to a cheaper model that talks more. The only way to know is to measure it on your own traffic.
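The arithmetic is easy to check on your own numbers. The sketch below prices one task at Sol's list rates with and without a token reduction and sets it beside a comparator priced at the roughly $5 and $25 per million quoted for Claude Opus 4.8 elsewhere in this piece. The token counts are hypothetical, and applying the reported reduction against a different vendor's model is an illustrative assumption, since OpenAI reported it relative to GPT-5.5 only.

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float,
              token_reduction: float = 0.0) -> float:
    """Dollar cost of one task. Prices are per million tokens;
    token_reduction scales both input and output counts down."""
    scale = 1.0 - token_reduction
    return scale * (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# HYPOTHETICAL token usage for a single agentic task.
IN_TOKENS, OUT_TOKENS = 20_000, 4_000

comparator = task_cost(IN_TOKENS, OUT_TOKENS, in_price=5.0, out_price=25.0)
print(f"comparator at $5/$25: ${comparator:.3f}")

for reduction in (0.0, 0.10, 0.15):
    sol = task_cost(IN_TOKENS, OUT_TOKENS, in_price=5.0, out_price=30.0,
                    token_reduction=reduction)
    print(f"Sol at $5/$30 with {reduction:.0%} fewer tokens: ${sol:.3f}")
```

With this particular input-to-output mix, a reduction near the low end of the reported range is enough to bring Sol's per-task cost in line with the comparator. A workload that is more output-heavy would need a larger reduction, which is why the measurement has to come from your own traffic.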
8. Methodology Caveats And Running Your Own Evals
Every number above came from the vendor, run under the vendor's harness on the vendor's task distribution. That is normal and not a criticism, but it is a reason to be careful. Benchmark leaderboards tell you how a model performs on someone else's tasks, not yours. The gap between a leaderboard cell and your production workload is often larger than the gap between two models on that leaderboard.
Run Your Own Evals
Build a small eval harness on 30 to 50 tasks drawn from your real workload, with clear pass-fail criteria. Run each candidate model several times to estimate variance, then compare distributions, not single scores. A sub-1-point benchmark gap should never outrank what your own harness tells you. When a model is gated, as Sol is, lean even harder on a fallback you can actually test.
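A minimal sketch of such a harness might look like the following. Everything here is hypothetical scaffolding: the task shape, the stub models that stand in for real API calls, and the two-standard-deviation rule in `clearly_better`, which is one simple convention among many and not a formal significance test.

```python
import random
import statistics
from typing import Callable, Sequence

# HYPOTHETICAL task shape: {"prompt": str, "check": Callable[[str], bool]}
Task = dict
Model = Callable[[str], str]


def pass_rate(model: Model, tasks: Sequence[Task]) -> float:
    """Fraction of tasks whose output satisfies the task's pass-fail check."""
    passed = sum(1 for task in tasks if task["check"](model(task["prompt"])))
    return passed / len(tasks)


def evaluate(model: Model, tasks: Sequence[Task], n_runs: int = 5) -> dict:
    """Run the whole task set n_runs times and summarize run-to-run spread."""
    rates = [pass_rate(model, tasks) for _ in range(n_runs)]
    return {"rates": rates,
            "mean": statistics.mean(rates),
            "stdev": statistics.stdev(rates)}


def clearly_better(a: dict, b: dict, k: float = 2.0) -> bool:
    """True only if a's mean beats b's by more than k times the larger stdev."""
    return a["mean"] - b["mean"] > k * max(a["stdev"], b["stdev"])


def make_stub_model(p_success: float, seed: int) -> Model:
    """Stand-in for a real model call: succeeds with probability p_success."""
    rng = random.Random(seed)
    return lambda prompt: "ok" if rng.random() < p_success else "fail"


# 40 placeholder tasks; a real harness would draw these from production work.
tasks = [{"prompt": f"task {i}", "check": lambda out: out == "ok"} for i in range(40)]

candidate = evaluate(make_stub_model(0.85, seed=1), tasks)
fallback = evaluate(make_stub_model(0.80, seed=2), tasks)
print(candidate["rates"], fallback["rates"])
print("candidate clearly better:", clearly_better(candidate, fallback))
```

Swapping the stubs for real client calls and the placeholder checks for your acceptance criteria is the real work. The structure, repeated runs followed by a comparison of spreads, is the part worth keeping.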
On the broader industry trend, vendors have reported steady progress on tests like SWE-bench Verified across recent model generations. We do not attach a specific GPT-5.6 SWE-bench number here because our sources did not publish one, and inventing a figure to round out a table is exactly the kind of error a hostile reviewer will catch.
9. How Sol Compares To The Public Field
For most teams, the relevant comparison is not Sol versus Mythos 5, since both are gated. It is Sol versus what you can actually call today. The cleanest public reference in the TerminalBench 2.1 table is Claude Opus 4.8 at 78.9%, a generally available model at roughly $5 input and $25 output per million tokens. Sol is about 10 points ahead on this benchmark, but you cannot deploy it broadly yet, which changes the calculus.
That makes Opus 4.8 the practical anchor for production planning right now. For a full breakdown of the generally available options, our Claude Opus 4.8 vs GPT-5.5 benchmarks and pricing comparison digs into the numbers you can act on today.
10. Why Lushbinary For Eval Harnesses And Model Selection
Reading a benchmark table is the easy part. Turning it into a model choice that survives contact with production is the work that actually matters. Lushbinary builds eval harnesses on your real tasks, so you can compare candidate models on your workload instead of someone else's leaderboard, and we wire up the routing and fallback logic that keeps you working when a frontier model is gated behind a limited preview.
Whether you are weighing a gated preview like Sol or standardizing on a generally available model like Opus 4.8, we help you measure what counts, control cost, and avoid lock-in as the frontier keeps moving.
Frequently Asked Questions
What did OpenAI actually benchmark GPT-5.6 Sol on?
At the June 26, 2026 announcement, OpenAI led with TerminalBench 2.1 for agentic coding and a set of SecureBio biology evaluations. The headline coding number was 88.8% for Sol and 91.9% for the Sol Ultra mode. On biology, OpenAI reported the Virology Capabilities Test at 53.5%, Molecular Biology at 60.0%, Human Pathogen Capabilities at 68.4%, and World-Class Biology at 68.3%, roughly 9 points above GPT-5.5 overall. We use only these vendor-reported figures, which have not been independently verified, and keep everything else qualitative.
Is Sol really better than Claude Mythos 5 at coding?
On TerminalBench 2.1, Sol scores 88.8% and Mythos 5 scores 88.0%. That 0.8 point gap is small enough to treat as a tie for practical purposes. A single benchmark run carries variance from seeds, harness configuration, and task sampling, so a sub-1-point difference should not decide your model choice. Run your own evaluation on your workload before concluding either model is meaningfully ahead.
What does Sol Ultra mode actually buy you?
Sol Ultra is a compute-intensive mode. It lifts the TerminalBench 2.1 score from 88.8% to 91.9%, about 3 points, by spending more compute per task. Whether that trade is worth it depends on the value of each completed task and your latency and cost tolerance. For most workloads the standard Sol mode is the sensible default, with Ultra reserved for the hardest jobs.
Can I use GPT-5.6 today?
Not generally. GPT-5.6 launched as a limited preview, and the US government requested a limited rollout. OpenAI complied while publicly warning that such restrictions should not become the norm. Plan for access that can change, and keep a generally available fallback such as Claude Opus 4.8 wired into your stack.
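As a rough sketch of what "wired in" can mean, the function below walks an ordered chain of model clients and returns the first answer it gets. The `ModelUnavailable` exception and the client callables are hypothetical placeholders for whatever your own wrappers raise and expose; no real vendor SDK is assumed.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """HYPOTHETICAL error a client wrapper raises when access is denied or revoked."""


def complete_with_fallback(
    prompt: str,
    clients: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return the first (name, response)."""
    last_error = None
    for name, call in clients:
        try:
            return name, call(prompt)
        except ModelUnavailable as err:
            last_error = err  # remember why this model was skipped, then move on
    raise RuntimeError("no model in the chain was available") from last_error


# Stub clients standing in for real API wrappers.
def gated_preview(prompt: str) -> str:
    raise ModelUnavailable("preview access not granted")


def generally_available(prompt: str) -> str:
    return f"echo: {prompt}"


name, text = complete_with_fallback(
    "summarize the failing test",
    [("gated-preview", gated_preview), ("ga-fallback", generally_available)],
)
print(name, text)
```

Returning the name of the model that actually answered makes it easy to log how often the fallback path is taken.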
What is the context window of GPT-5.6?
OpenAI has not officially confirmed a context window for GPT-5.6. GPT-5.5 offered up to 1M tokens, and GPT-5.6 is widely expected to match that, but treat the figure as expected and unconfirmed until OpenAI states it.
Why should I run my own evals if OpenAI already published numbers?
Vendor benchmarks are run under the vendor's harness on the vendor's task distribution. Your codebase, prompts, tools, and acceptance criteria differ. The reported token efficiency gain of roughly 10 to 15% over GPT-5.5 is also a reported figure, not an independently verified one. A small in-house eval harness on your real tasks tells you far more than a leaderboard cell.
Sources
- OpenAI: GPT-5.6 announcement and benchmark figures
- Wikipedia: GPT-5.6
- The Verge: OpenAI GPT-5.6 preview and the administration request
- MacRumors: OpenAI GPT-5.6 Sol
Content was rephrased for compliance with licensing restrictions. Pricing and benchmark data sourced from official OpenAI announcements and reputable tech press as of June 27, 2026. Figures may change; always verify with the vendor.