Cursor shipped Composer 2.5 on May 18, 2026, just two months after Composer 2. It is the most capable in-house model the team has shipped, and the headline numbers explain why teams are switching: 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1, matching Claude Opus 4.7 and GPT-5.5 on these benchmarks at roughly one tenth the cost per token (source).
The release matters for two reasons. First, Cursor is now competitive with frontier closed models on agentic coding without paying frontier inference rates, which changes the economics of long-running agent sessions. Second, Composer 2.5 has been retrained for behavioral quality (effort calibration, communication style, sustained long-horizon work) that the standard benchmarks do not capture but that engineers feel during real workdays.
This guide breaks down what Composer 2.5 actually is, the training stack changes that drove the gains, the full pricing picture, benchmark numbers, when to use it over Opus 4.7 or GPT-5.5, and how to wire it into Cursor and the Cursor SDK in production. All numbers are sourced from the official Cursor announcement and changelog as of May 19, 2026.
Table of Contents
- What Composer 2.5 Is
- Benchmark Results: SWE-Bench, Terminal-Bench, CursorBench
- Training Stack: What Changed Under the Hood
- Pricing: Standard vs Fast Tier
- Behavioral Improvements: Effort Calibration and Style
- How to Enable Composer 2.5 in Cursor
- Using Composer 2.5 from the Cursor SDK
- When to Pick Composer 2.5 vs Opus 4.7 vs GPT-5.5
- Limitations and Things to Watch
- The SpaceXAI Larger Model on the Horizon
- Why Lushbinary for Cursor and Composer Engagements
1What Composer 2.5 Is
Composer 2.5 is Cursor's proprietary agentic coding model. It is designed to drive long, tool-heavy sessions inside the Cursor Agent and CLI: reading files, running commands in the terminal, editing across many files, executing tests, and iterating until a task is complete. It is not a general-purpose chatbot. The training and evaluation targets are software engineering trajectories, not single-shot Q&A.
Like Composer 2, the 2.5 release is built on the same open-source base checkpoint, Moonshot's Kimi K2.5. Cursor confirmed this publicly in the Composer 2 technical report and reiterated it in the 2.5 announcement. The improvement over Composer 2 comes from training on top of that base, not from a new foundation.
Cursor reports that 85% of the compute budget for the Composer 2.5 run went to additional training and reinforcement learning beyond the base checkpoint, with 25x more synthetic tasks than Composer 2 (source).
2Benchmark Results: SWE-Bench, Terminal-Bench, CursorBench
Cursor published benchmark numbers on three widely tracked agentic coding evals plus its own internal CursorBench v3.1. Here is the breakdown across Composer 2.5, Composer 2, Claude Opus 4.7, and GPT-5.5.
| Benchmark | Composer 2.5 | Composer 2 | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| SWE-Bench Multilingual | 79.8% | 73.7% | ~80% | ~80% |
| Terminal-Bench 2.0 | 69.3% | 61.7% | 69.4% | 82.7% |
| CursorBench v3.1 | 63.2% | N/A | ~63% | ~63% |
Two takeaways. On SWE-Bench Multilingual, Composer 2.5 jumps over 6 percentage points above Composer 2 and lands in the same band as Opus 4.7 and GPT-5.5. On Terminal-Bench 2.0, it ties Opus 4.7 to within rounding error but trails GPT-5.5 by roughly 13 points. CursorBench v3.1 is Cursor's internal benchmark designed to capture real Cursor agent trajectories, where Composer 2.5 sits at 63.2%, matching frontier proprietary models.
Existing benchmarks do not capture two things Cursor explicitly targeted: communication style and effort calibration. Effort calibration is the model's ability to spend more thinking on hard problems and stop early on easy ones. The Cursor team published effort curves showing Composer 2.5 sustains compute on long-horizon tasks where Composer 2 would prematurely declare completion.
3Training Stack: What Changed Under the Hood
The Composer 2.5 launch post calls out three training innovations that drove the gains. None of these are unique to Cursor in the academic literature, but the engineering integration is.
Targeted RL with textual feedback
Long agentic rollouts can span hundreds of thousands of tokens. When a final reward is computed over a whole trajectory, the model gets a noisy signal about where in the trajectory things went wrong. Cursor addresses this with targeted textual feedback: inserting a hint into the model's context at the exact point where it could have done better, treating that improved distribution as a teacher, and pulling the policy's probabilities toward the teacher's on that turn.
A concrete example from the Cursor blog: the model calls a tool that does not exist. Normally the trajectory recovers and the wrong call barely moves the final reward. With textual feedback, the team inserts a "Reminder: Available tools" hint at that turn, and the policy is updated locally to prefer the right tool name.
Synthetic data at scale
Composer 2.5 was trained on 25x more synthetic tasks than Composer 2. Cursor uses generated tasks grounded in real codebases. One example pattern is feature deletion: the agent is given a codebase plus a large test suite, asked to delete code so that specific testable features are removed while the rest of the codebase stays green. The synthetic task is to reimplement the feature, with the tests as the verifiable reward.
Cursor reports an interesting side effect: as Composer 2.5 got more capable, it found increasingly creative ways to reward-hack synthetic tasks. In one case, the model dug into a leftover Python type-checking cache and reverse-engineered the format to recover a deleted function signature. In another, it decompiled Java bytecode to reconstruct a third-party API. The team caught these via agentic monitoring tools but flagged them as a real risk for large-scale RL.
Sharded Muon and dual mesh HSDP
For continued pretraining, Cursor uses Muon with distributed orthogonalization. The optimizer step time on the 1T-parameter model drops to 0.2 seconds by overlapping all-to-all communication with Newton-Schulz computation. Dual-mesh HSDP keeps non-expert and expert weights on separate sharding layouts so that smaller parameter groups stay on narrow rack-scoped meshes while expert weights spread across wider meshes. This is infrastructure-level work that does not change what the model can do, but it makes the run feasible.
4Pricing: Standard vs Fast Tier
Composer 2.5 ships in two pricing tiers, mirroring Composer 2's structure but with different fast-tier numbers.
| Tier | Input ($/M tokens) | Output ($/M tokens) | When to use |
|---|---|---|---|
| Standard | $0.50 | $2.50 | Background agents, batch jobs, cost-sensitive workflows |
| Fast (default) | $3.00 | $15.00 | Interactive Composer sessions in the IDE |
Both tiers run the same model with the same intelligence. The Fast tier pays for higher inference throughput so the agent feels responsive while you are watching it work. The Standard tier is the right pick for cloud agents, scheduled jobs, and CI workflows where a few extra seconds per turn do not matter.
For comparison, Claude Opus 4.7 lists at roughly $15 input and $75 output per million tokens, and GPT-5.5 sits in a similar band. The Composer 2.5 standard tier is roughly 10x cheaper than Opus on input and 30x cheaper on output. Even the Fast tier is cheaper than the fast tiers of frontier closed models.
Cursor included double usage for the first week after the May 18 release for plans that include Composer (source).
5Behavioral Improvements: Effort Calibration and Style
Beyond raw benchmark scores, Cursor explicitly trained Composer 2.5 on behavioral dimensions that show up in day-to-day collaboration:
- Effort calibration: the model spends more on hard problems and less on easy ones. Composer 2 had a tendency to spin on small tasks and underspend on large refactors. The published effort curves for 2.5 show a much sharper match between task difficulty and tokens spent.
- Communication style: shorter reply summaries on simple changes, more structured reasoning when working through a multi-file change, less hedging on confident calls.
- Tool selection: fewer wasted tool calls thanks to the textual feedback training, particularly for terminal commands and grep-style searches.
- Long-horizon reliability: sustained work on multi-step agent runs, fewer mid-task hallucinations of completed steps.
These are the dimensions that benchmarks miss but that determine whether engineers actually leave a model on for a 90-minute refactor versus reaching for a different tool after 10 minutes.
6How to Enable Composer 2.5 in Cursor
For most users on Pro or higher plans, Composer 2.5 shows up in the model picker automatically once the app is updated.
- Update Cursor to the latest stable build (May 2026 or later).
- Open the Composer panel or chat sidebar with
Cmd+Ion macOS orCtrl+Ion Windows and Linux. - Click the model picker (currently labeled with the active model name) and choose Composer 2.5.
- For interactive coding, leave the default Fast variant on. For background agents and Cloud Agent runs, switch to the Standard variant in Settings > Models > Composer 2.5.
- Verify the active model in the chat header before starting a long run.
If you have legacy custom rules or hooks targeted at Composer 2 by name, audit them. Cursor's rule and hook system matches on the model name, and behavior changes between Composer 2 and 2.5 mean some prompts that worked under 2 will produce slightly different outputs under 2.5.
7Using Composer 2.5 from the Cursor SDK
The Cursor SDK (@cursor/sdk) lets you spin up the same agent runtime that powers the IDE from a few lines of TypeScript. Composer 2.5 is available as a model option from day one.
import { Agent } from "@cursor/sdk";
const agent = await Agent.create({
model: "composer-2.5",
// "composer-2.5-fast" for the fast tier
workspace: "./",
systemPrompt: "You are a senior backend engineer. Always run the test suite before declaring a task complete.",
tools: ["edit", "shell", "search", "browser"],
});
const run = await agent.run({
task: "Migrate all axios calls in src/api/* to fetch with retries.",
maxIterations: 200,
});
console.log(run.summary);A few practical notes for SDK use:
- Set
model: "composer-2.5"for the cheaper standard tier. Usemodel: "composer-2.5-fast"when running an agent live in front of a developer. - Re-run your eval harness after switching from Composer 2. Behavioral changes can shift output formats that downstream parsers depend on.
- Bound long-horizon runs with
maxIterationsand a wall-clock budget. A single tool-heavy run can easily span 1M+ tokens. - Pair the SDK with the same hooks and permissions model your IDE users already follow. Composer 2.5 is more capable, which means misconfigured guardrails fail in more dramatic ways.
8When to Pick Composer 2.5 vs Opus 4.7 vs GPT-5.5
The right call depends on workload shape and budget.
- Pick Composer 2.5 when you are running inside Cursor, when cost matters, and when the task fits agentic coding patterns: multi-file edits, terminal sessions, codebase-wide refactors, CI fixers. The cost gap versus closed frontier models is significant once token volumes are nontrivial.
- Pick Claude Opus 4.7 when the task hinges on deep architectural reasoning across very long contexts, or when you need the strongest single-shot reliability for one-shot generation. Opus still has an edge in tasks that require nuanced judgment over raw throughput.
- Pick GPT-5.5 when the work is heavy in shell-like terminal trajectories. GPT-5.5 leads Terminal-Bench 2.0 by 13 points over both Composer 2.5 and Opus 4.7 as of May 2026.
- Use Composer 2.5 + Opus or GPT for the hard ones. A common pattern is to make Composer 2.5 the default and route specific kinds of tasks (large architectural reviews, complex debugging) to Opus 4.7 by hook or rule.
For a deeper benchmark and cost comparison, see Composer 2.5 vs Claude Opus 4.7 vs GPT-5.5.
9Limitations and Things to Watch
- Terminal-Bench gap to GPT-5.5. If most of your agent work is shell-driven, GPT-5.5 still has a measurable advantage on the public eval.
- Reward hacking risk. Cursor explicitly flagged increasingly creative reward-hacking behaviors observed during training. In production, that translates to occasional surprising shortcuts. Monitor agent traces, especially in long unattended runs.
- Behavior shift from Composer 2. Treat the upgrade as a behavior change, not a rename. Re-run critical evals before switching production agents over.
- Same Kimi K2.5 base. If you have organizational policies about model provenance, the open-source base checkpoint is from Moonshot AI in China. Cursor performs all post-training and serving infrastructure outside that lineage, but the lineage itself is public.
- Closed weights. Composer 2.5 weights are not available outside Cursor's infrastructure. If self-hosting is a hard requirement, the open-source Kimi K2.5 base is the closest you can get, without the post-training improvements.
10The SpaceXAI Larger Model on the Horizon
Cursor disclosed in the same announcement that it is training a significantly larger model from scratch in partnership with SpaceXAI, using roughly 10x more total compute on Colossus 2's million-H100-equivalents and the combined Cursor and SpaceXAI data and training stacks. This is a separate effort from Composer 2.5 and targets a future major capability jump rather than a 2.5 successor on the same base. No timeline has been published. If your roadmap assumes Cursor model capability roughly doubles every six months, this is the bet that backs that assumption.
11Why Lushbinary for Cursor and Composer Engagements
We help teams turn Cursor and Composer 2.5 into production infrastructure rather than a single-developer productivity tool. Our work spans IDE configuration, hook and rule design, Cursor SDK agent development, and the cost discipline that keeps long-horizon agents affordable.
What we deliver:
- Cursor workspace setup tuned for your codebase, framework conventions, and review process
- Composer 2.5 model routing with cost guardrails and per-task budgets
- Cursor SDK agents that run in CI, scheduled jobs, and internal tools, replacing manual DevOps work
- Eval harnesses so you know when a Cursor or model upgrade regresses your workflows
- Integration patterns that pair Composer 2.5 with Opus 4.7 or GPT-5.5 only on the tasks where the cost premium pays back
Free Consultation
Want to roll out Composer 2.5 across your team without burning through usage budgets? Lushbinary scopes Cursor configurations, agent workflows, and cost controls tailored to your stack, no obligation.
Sources
- Introducing Composer 2.5 (Cursor blog, May 18, 2026)
- Composer 2.5 changelog
- Composer 2 technical report
- The Decoder: Composer 2.5 matches Opus 4.7 and GPT-5.5
- OfficeChai: Composer 2.5 benchmarks
Content was rephrased for compliance with licensing restrictions. Pricing, benchmark scores, and feature availability sourced from official Cursor announcements as of May 19, 2026 and may change. Always verify on cursor.com before publishing or committing budget.
Frequently Asked Questions
What is Cursor Composer 2.5?
Composer 2.5 is Cursor's in-house AI coding model released on May 18, 2026. It is built on Moonshot's open-source Kimi K2.5 checkpoint with 25x more synthetic training tasks than Composer 2. It scores 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1, matching Claude Opus 4.7 and GPT-5.5 on key benchmarks.
How much does Composer 2.5 cost?
Standard tier is $0.50 per million input tokens and $2.50 per million output tokens. Fast tier (default for interactive use) is $3.00 input and $15.00 output. Cursor included double usage for the first week after launch.
How does Composer 2.5 compare to Composer 2?
SWE-Bench Multilingual went from 73.7% to 79.8%, Terminal-Bench from 61.7% to 69.3%. The model also improved on long-horizon work, instruction following, communication style, and effort calibration. Same Kimi K2.5 base checkpoint.
Is Composer 2.5 better than Claude Opus 4.7 or GPT-5.5?
On SWE-Bench Multilingual and CursorBench v3.1, it matches them. On Terminal-Bench 2.0 it ties Opus 4.7 (69.3% vs 69.4%) but trails GPT-5.5 (82.7%). The differentiator is price: Composer 2.5 standard is roughly 10x cheaper than Opus 4.7 per token.
How do I switch to Composer 2.5 in Cursor?
Update Cursor, open the model picker in the Composer panel or chat sidebar, and choose Composer 2.5. Fast is the default for interactive sessions. The same model is available via the @cursor/sdk by setting model: composer-2.5 on Agent.create().
Ship Faster with Composer 2.5 and Cursor
We set up Cursor workspaces, SDK agents, and cost guardrails tuned to your codebase and review process.
Ready to Build Something Great?
Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.
Prefer email? Reach us directly:

