Long-horizon agents are the use case where Composer 2.5 earns its place in the stack. A run that costs $50 on Claude Opus 4.7 costs about $2 on Composer 2.5 Standard, and the model was specifically retrained for the kind of long-horizon work where Opus and GPT-5.5 are less cost-effective (source).

A long-horizon agent runs autonomously for minutes or hours, executing many tool calls, file edits, and terminal commands without human supervision at each step. The classic examples are CI fixers, end-to-end migration agents, and overnight refactor sweeps. They are also the most dangerous to deploy without guardrails, because a misbehaving agent can do real damage before anyone notices.

This guide shows how to build, bound, and run long-horizon Composer 2.5 agents in production using the Cursor SDK. It covers architecture patterns, full TypeScript examples, cost modeling, guardrail patterns, and the eval framework that keeps agents honest.

Table of Contents

What Counts as a Long-Horizon Agent
Why Composer 2.5 Specifically
Architecture Pattern
Setting Up the Cursor SDK
A Complete Example: Auto-Fix CI Failures
Guardrails: Iteration, Token, and Time Budgets
Permissions and Sandboxing
Cost Modeling
Observability and Trace Capture
Failure Modes and Recovery
Why Lushbinary for Long-Horizon Agent Engagements

1. What Counts as a Long-Horizon Agent

A working definition: any agent run with at least one of the following.

Wall-clock duration: 5 minutes or more of continuous execution
Tool call count: 30+ tool calls in a single trajectory
Token volume: 500K+ tokens consumed in a single run
No human in the loop: agent runs unattended, triggered by CI, cron, or webhook rather than a developer prompt

By that definition, most Cursor IDE sessions are not long-horizon. Most cloud agent runs, CI fixers, and SDK-driven jobs are. The economics, observability, and guardrail requirements differ enough that the patterns are worth treating separately.
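The working definition above can be sketched as a small classifier. This is only an illustration: the `RunStats` shape and the function names are hypothetical, and the thresholds are copied from the list above.

```typescript
// Hypothetical run summary; field names are assumptions, not an SDK type.
interface RunStats {
  wallClockMs: number; // total wall-clock duration of the run
  toolCalls: number;   // tool calls in the trajectory
  totalTokens: number; // input + output tokens consumed
  unattended: boolean; // true when triggered by CI, cron, or webhook
}

// Thresholds taken from the working definition.
const LONG_HORIZON = {
  minWallClockMs: 5 * 60 * 1000,
  minToolCalls: 30,
  minTotalTokens: 500_000,
};

// Returns the criteria the run meets; meeting any one makes it long-horizon.
function longHorizonReasons(run: RunStats): string[] {
  const reasons: string[] = [];
  if (run.wallClockMs >= LONG_HORIZON.minWallClockMs) reasons.push("wall-clock");
  if (run.toolCalls >= LONG_HORIZON.minToolCalls) reasons.push("tool-calls");
  if (run.totalTokens >= LONG_HORIZON.minTotalTokens) reasons.push("tokens");
  if (run.unattended) reasons.push("unattended");
  return reasons;
}

function isLongHorizon(run: RunStats): boolean {
  return longHorizonReasons(run).length > 0;
}
```

Returning the matched criteria rather than a bare boolean makes it easier to see why a run was routed to the long-horizon guardrail path.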

2. Why Composer 2.5 Specifically

Three reasons Composer 2.5 fits long-horizon work better than alternatives:

Trained for it. Cursor explicitly trained Composer 2.5 on 25x more synthetic tasks than Composer 2 and built effort-calibration into the RL loop. Long sustained trajectories were the explicit target.
Standard tier pricing. $0.50 input and $2.50 output per million tokens makes multi-hour runs economical. Composer 2.5 Fast is unnecessary for unattended work where latency does not matter.
Native to the Cursor harness. File reads, edits, shell, search, and browser tools all work out of the box without harness-level integration work.

The trade-off: Composer 2.5 trails GPT-5.5 by 13 points on Terminal-Bench 2.0. For purely terminal-driven agent work (DevOps runbooks, log triage), GPT-5.5 has the edge. For mixed file-editing and terminal trajectories, Composer 2.5 is the more economical default.

3. Architecture Pattern

A long-horizon agent system has five layers: trigger, harness, model, observability, and recovery.

The trigger fires the run. The Cursor SDK harness wraps Composer 2.5 with the right tool surface and guardrails. The model executes its trajectory. Every tool call and reply is captured by the trace store. A kill switch hook can stop the agent if any guardrail trips.
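A minimal sketch of how these layers fit together, assuming a stand-in model function rather than the real SDK. Every name here (`Step`, `Model`, `Guardrail`, `TraceStore`, `runHarness`) is hypothetical and chosen for illustration only.

```typescript
// All types below are illustrative assumptions, not Cursor SDK types.
interface Step { tool: string; output: string; done: boolean; }
type Model = (task: string, history: Step[]) => Promise<Step>; // model layer
type Guardrail = (history: Step[]) => string | null;           // reason to stop, or null
interface TraceStore { record(step: Step): void; }             // observability layer

interface RunResult {
  status: "completed" | "killed";
  reason?: string;
  steps: Step[];
}

// Harness layer: drives the model, records every step, and checks the
// kill switch before each turn. The trigger layer is whatever calls this.
// Note: with no guardrails and a model that never finishes, this loops forever.
async function runHarness(
  task: string,
  model: Model,
  trace: TraceStore,
  guardrails: Guardrail[],
): Promise<RunResult> {
  const steps: Step[] = [];
  for (;;) {
    for (const check of guardrails) {
      const reason = check(steps);
      // A tripped guardrail hands control to the recovery layer.
      if (reason !== null) return { status: "killed", reason, steps };
    }
    const step = await model(task, steps);
    steps.push(step);
    trace.record(step);
    if (step.done) return { status: "completed", steps };
  }
}
```

Keeping the guardrail check inside the harness loop, rather than inside the model prompt, means the stop condition does not depend on the model choosing to comply.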

4. Setting Up the Cursor SDK

# Node.js 20 or later
npm install @cursor/sdk

# Set your API key (from https://cursor.com/dashboard)
export CURSOR_API_KEY=cursor_sk_...

The SDK runs from any Node.js environment with outbound HTTPS: local machines, GitHub Actions, GitLab CI, AWS Lambda (with care on timeouts), AWS CodeBuild, EC2, Cloud Run, Fargate.
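Because unattended jobs fail in unhelpful ways when the environment is wrong, a preflight check before creating the agent can be useful. The helper below is a hypothetical sketch, not part of the Cursor SDK. It only checks the two requirements stated above: Node.js 20 or later and a non-empty `CURSOR_API_KEY`.

```typescript
// Hypothetical preflight helper. Returns a list of problems; empty means OK.
function preflightProblems(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];

  // Accepts "v20.11.1" or "20.11.1" and extracts the major version.
  const match = /^v?(\d+)/.exec(nodeVersion);
  const major = match ? Number(match[1]) : Number.NaN;
  if (!Number.isInteger(major) || major < 20) {
    problems.push(`Node.js 20 or later is required, found ${nodeVersion}`);
  }

  const key = env.CURSOR_API_KEY;
  if (key === undefined || key.trim() === "") {
    problems.push("CURSOR_API_KEY is not set");
  }
  return problems;
}
```

In a real script this would be called as `preflightProblems(process.version, process.env)`, exiting non-zero if the returned list is not empty.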

5. A Complete Example: Auto-Fix CI Failures

A long-horizon agent that runs in CI, reads the failure log, locates the bug, fixes it, runs the test suite, and opens a PR.

// scripts/ci-auto-fix.ts
import { Agent } from "@cursor/sdk";
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

const FAILURE_LOG = process.env.FAILURE_LOG_PATH ?? "ci-failure.log";
const BRANCH = `auto-fix/${Date.now()}`;

async function main() {
  execSync(`git checkout -b ${BRANCH}`);

  const agent = await Agent.create({
    model: "composer-2.5",
    workspace: process.cwd(),
    systemPrompt: [
      "You are a senior backend engineer auto-fixing a failing CI build.",
      "Read the failure log, locate the root cause, apply a minimal fix,",
      "and run the full test suite. Do not modify unrelated code.",
      "Stop only when the test suite passes or after 100 iterations.",
    ].join(" "),
    tools: ["edit", "shell", "search"],
  });

  const run = await agent.run({
    task: `The CI build failed. The log is in ${FAILURE_LOG}. Diagnose the root cause, apply the minimal fix, and confirm by running 'pnpm test'.`,
    maxIterations: 100,
  });

  // BRANCH contains a "/", so flatten it before using it in a file name;
  // otherwise the write targets a directory that does not exist.
  const tracePath = `run-trace-${BRANCH.replace(/\//g, "-")}.json`;
  writeFileSync(tracePath, JSON.stringify(run, null, 2));

  if (run.status === "completed") {
    execSync(`git add . && git commit -m "Auto-fix CI failure"`);
    execSync(`git push -u origin ${BRANCH}`);
    execSync(`gh pr create --title "Auto-fix CI failure" --body "Generated by Composer 2.5 agent. See ${tracePath} for trajectory."`);
  } else {
    console.error(`Agent did not complete cleanly: ${run.status}`);
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Key patterns: branch isolation so the agent cannot affect main, a system prompt that constrains scope, an iteration cap, and a trace dump for after-the-fact review. The PR is created only on a clean completion. A failed run leaves the trace and exits non-zero so CI fails loudly.

6. Guardrails: Iteration, Token, and Time Budgets

A long-horizon agent without guardrails is a long-horizon incident waiting to happen. The four budgets you should always set:

maxIterations in the SDK call: the absolute cap on how many model turns the agent can take. A reasonable default is 100 for CI-style fixers and 200 for refactor sweeps.
Wall-clock timeout in your wrapper: setTimeout or a CI job timeout that kills the process if the agent runs longer than expected. 30-60 minutes for most CI runs.
Token budget tracked in your trace store: cancel the run if it exceeds, say, 5M tokens. Helps catch runaway loops cheaply.
Tool call quotas: the Cursor SDK supports per-tool max counts. Use this to cap shell commands or browser tool use specifically.

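// Wrap with a wall-clock timeout
const TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes
let timer: ReturnType<typeof setTimeout> | undefined;
const runPromise = agent.run({ task, maxIterations: 100 });
const run = await Promise.race([
  runPromise,
  new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Agent timeout")), TIMEOUT_MS);
  }),
]).finally(() => clearTimeout(timer)); // a pending timer would keep the process alive
// Losing the race does not stop the agent itself: on "Agent timeout",
// exit the process (or cancel the run) so it cannot keep working unattended.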
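The four budgets can also be tracked in one place on the wrapper side. The class below is a hypothetical sketch that is independent of the Cursor SDK: `Budgets`, `StepUsage`, and `BudgetTracker` are illustrative names, and it assumes your wrapper can observe token counts and tool names after each model turn.

```typescript
// Hypothetical guardrail helper; none of these names come from the SDK.
interface Budgets {
  maxIterations: number;
  maxTokens: number;
  maxWallClockMs: number;
  maxToolCalls: Record<string, number>; // per-tool quotas, e.g. { shell: 50 }
}

interface StepUsage {
  tokens: number; // tokens consumed by this turn
  tool?: string;  // tool invoked on this turn, if any
}

class BudgetTracker {
  private iterations = 0;
  private tokens = 0;
  private toolCalls: Record<string, number> = {};
  private readonly budgets: Budgets;
  private readonly startedAt: number;

  constructor(budgets: Budgets, startedAt: number) {
    this.budgets = budgets;
    this.startedAt = startedAt;
  }

  // Records one model turn. Returns the name of the tripped budget,
  // or null while the run is still within every limit.
  record(step: StepUsage, now: number): string | null {
    this.iterations += 1;
    this.tokens += step.tokens;
    if (step.tool !== undefined) {
      this.toolCalls[step.tool] = (this.toolCalls[step.tool] ?? 0) + 1;
    }

    if (this.iterations > this.budgets.maxIterations) return "iterations";
    if (this.tokens > this.budgets.maxTokens) return "tokens";
    if (now - this.startedAt > this.budgets.maxWallClockMs) return "wall-clock";
    if (step.tool !== undefined) {
      const quota = this.budgets.maxToolCalls[step.tool];
      const used = this.toolCalls[step.tool] ?? 0;
      if (quota !== undefined && used > quota) return `tool:${step.tool}`;
    }
    return null;
  }
}
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the tracker deterministic and easy to test. A non-null result is the signal to trip the kill switch.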

7. Permissions and Sandboxing

A long-horizon agent has the same permissions as the process running it. Treat that with the same discipline as any service account.

Run in a fresh git branch or sandbox directory, not on the main checkout.
Use a CI-scoped token, not a developer's personal one. The token should have only the permissions the agent needs (no production database access, no IAM modify).
Restrict the shell tool to a denylist or allowlist depending on your risk tolerance. The Cursor SDK supports per-tool configuration for this.
Run inside a Docker container with no network access or with a tight allowlist (npm registry, GitHub API only).
For destructive tasks (deletions, renames, schema migrations), require a human-in-the-loop confirmation hook. Never auto-merge agent-generated PRs without review.
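As an illustration of the allowlist approach to shell access, here is a minimal, deliberately strict check. It is a sketch under stated assumptions: the allowlist contents are examples, `isCommandAllowed` is a hypothetical helper, and how such a check is wired into the shell tool depends on the SDK's per-tool configuration.

```typescript
// Example allowlist; adjust to what the agent actually needs.
const ALLOWED_COMMANDS = new Set(["git", "pnpm", "node", "ls", "cat", "grep"]);

// Reject anything that could chain, substitute, or redirect commands.
const SHELL_METACHARACTERS = /[;&|`$<>\n\\(){}]/;

// Returns true only for a single command whose executable is allowlisted.
function isCommandAllowed(commandLine: string): boolean {
  const trimmed = commandLine.trim();
  if (trimmed === "" || SHELL_METACHARACTERS.test(trimmed)) return false;
  const executable = trimmed.split(/\s+/)[0];
  return executable !== undefined && ALLOWED_COMMANDS.has(executable);
}
```

An executable-level allowlist is coarse, since it does not inspect arguments. It complements the container and token restrictions above and does not replace them.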

For more on the failure modes that justify these guardrails, see our AI Agent Production Guardrails guide covering the April 2026 PocketOS incident and 10 concrete guardrails for autonomous agents.

8. Cost Modeling

Pricing on Composer 2.5 Standard: $0.50 per million input tokens, $2.50 per million output tokens. A 70/30 input/output split is typical for agentic coding sessions.

Run size	Input tokens	Output tokens	Cost on Composer 2.5 Std	Cost on Opus 4.7
Small CI fix	500K	150K	$0.625	$18.75
Multi-file refactor	1.4M	600K	$2.20	$66.00
Overnight migration sweep	3.5M	1.5M	$5.50	$165.00

Math shown for the multi-file refactor: input cost = 1.4 * 0.50 = $0.70, output cost = 0.6 * 2.50 = $1.50, total $2.20 on Composer 2.5 Standard. Opus 4.7 at $15 input, $75 output: 1.4 * 15 = $21, 0.6 * 75 = $45, total $66. Opus 4.7 list pricing as of May 2026.
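The same arithmetic as a small function, so budgets can be estimated for other run sizes. The rates are the ones quoted in this section. The type and constant names are illustrative.

```typescript
// Prices in USD per million tokens, as quoted in this section.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

const COMPOSER_25_STANDARD: Pricing = { inputPerMTok: 0.5, outputPerMTok: 2.5 };
const OPUS_47: Pricing = { inputPerMTok: 15, outputPerMTok: 75 };

// Cost of one run in USD, given raw token counts.
function runCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) / 1_000_000
  );
}

// Multi-file refactor row: 1.4M input, 600K output.
const composerCost = runCostUsd(1_400_000, 600_000, COMPOSER_25_STANDARD); // 2.2
const opusCost = runCostUsd(1_400_000, 600_000, OPUS_47); // 66
console.log(composerCost.toFixed(2), opusCost.toFixed(2));
```

Because both models are priced at the same 1:5 input-to-output ratio here, the cost ratio stays at 30x regardless of the input/output split.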

9. Observability and Trace Capture

Every long-horizon agent run should produce three artifacts:

Full trace JSON with every tool call, reply, and timestamp. This is your audit record. Store it for at least 30 days.
Token usage breakdown per tool, per step. Helps catch runaway loops before they hit the budget cap.
Diff summary of every file the agent touched. For PR-creating agents this is the PR description; for non-PR agents it is the audit log.

Pipe traces to a log aggregator (Datadog, Grafana Loki, or even S3) and alert on anomalies: token usage above expected, tool calls without successful completion, agent stops without a summary.
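The three alert conditions above can be sketched as a check over a stored trace. The `Trace` and `TraceEvent` shapes below are assumptions made for illustration. The real SDK trace format may differ, so treat the field names as placeholders.

```typescript
// Hypothetical trace shape; not the SDK's actual format.
interface TraceEvent {
  timestamp: string; // ISO 8601
  kind: "tool_call" | "reply";
  tool?: string;     // set for tool calls
  ok?: boolean;      // for tool calls: whether the call completed successfully
  tokens: number;
}

interface Trace {
  events: TraceEvent[];
  summary?: string;  // final summary written by the agent, if any
}

// Returns one message per anomaly: token overrun, failed tool calls, missing summary.
function findAnomalies(trace: Trace, expectedMaxTokens: number): string[] {
  const anomalies: string[] = [];
  const totalTokens = trace.events.reduce((sum, e) => sum + e.tokens, 0);
  if (totalTokens > expectedMaxTokens) {
    anomalies.push(`token usage ${totalTokens} above expected ${expectedMaxTokens}`);
  }
  const failed = trace.events.filter((e) => e.kind === "tool_call" && e.ok !== true);
  if (failed.length > 0) {
    anomalies.push(`${failed.length} tool call(s) without successful completion`);
  }
  if (trace.summary === undefined || trace.summary.trim() === "") {
    anomalies.push("agent stopped without a summary");
  }
  return anomalies;
}

// Token usage breakdown per tool; model replies are grouped under one key.
function tokensByTool(trace: Trace): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const event of trace.events) {
    const key = event.tool ?? "(model reply)";
    totals[key] = (totals[key] ?? 0) + event.tokens;
  }
  return totals;
}
```

Running a check like this when the trace is written lets the alert fire from the same job that produced the run, before the trace reaches the aggregator.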

10. Failure Modes and Recovery

Reward hacking. Cursor flagged this explicitly for Composer 2.5. The model occasionally finds creative shortcuts. Mitigation: judge success with deterministic test assertions and real exit codes, not an LLM's verdict.
Looping. Agent gets stuck on the same failure. Mitigation: the maxIterations cap, plus loop detection in your trace store (no progress on test pass count over 20 iterations triggers an early stop).
Context exhaustion. Long runs eventually push the context window beyond the model's capacity. Mitigation: Composer 2.5's self-summarization handles this for the most part, but plan for very long runs to lose mid-run detail.
Network or API outage. Cursor rate limits or API outage stops a run mid-flight. Mitigation: retries with exponential backoff on transient errors, but fail fast on authentication and quota errors.
Permission escalation. Misconfigured shell access lets the agent touch resources it should not. Mitigation: see the permissions section. Run in a sandbox.

11. Why Lushbinary for Long-Horizon Agent Engagements

We design and operate long-horizon agent systems for engineering teams. That includes architecture, guardrails, observability, and the eval framework that keeps agents honest over time.

Composer 2.5 Cursor SDK setup tuned for your CI/CD and code review process
Sandboxing and least-privilege tokens following our production guardrails playbook
Trace storage and observability hooks into Datadog, Grafana, or CloudWatch
Eval harnesses for regression detection on every Composer version bump
Multi-model routing so Opus 4.7 or GPT-5.5 only get called on tasks where the cost premium pays back

Free Consultation

Want to deploy long-horizon agents without a 3am incident? Lushbinary builds Composer 2.5 SDK setups with the guardrails, observability, and eval coverage your team needs, no obligation.

Sources

Content was rephrased for compliance with licensing restrictions. Pricing and feature availability sourced from official Cursor announcements as of May 19, 2026 and may change. Always verify on cursor.com before committing budget.

Frequently Asked Questions

What is a long-horizon agent?

A long-horizon agent is an autonomous AI agent that runs for many minutes or hours, executing dozens to hundreds of tool calls, file edits, and shell commands across multi-step tasks. Examples include CI fixers, end-to-end migration agents, and overnight refactor jobs.

Why use Composer 2.5 specifically for long-horizon agents?

Composer 2.5 was specifically retrained for long-horizon work with 25x more synthetic tasks than Composer 2 and effort-calibration training. Standard tier is $0.50 input / $2.50 output per million tokens, making multi-hour runs economical compared to Opus 4.7 or GPT-5.5.

How do I bound a long-horizon agent run?

Use the Cursor SDK's maxIterations and a wall-clock timeout in your wrapper. Set per-tool budgets, a token budget, and wire a kill-switch hook. Always pair with least-privilege workspace permissions so a runaway agent cannot do damage.

What does a Composer 2.5 long-horizon agent cost per run?

A typical 2-hour run with 2M tokens at 70/30 input/output costs roughly $2.20 on Composer 2.5 Standard. Bigger runs (5M tokens) come in around $5-6. The same volumes cost about $66 and $165 respectively on Opus 4.7.

Can I run Composer 2.5 agents in CI/CD?

Yes. The Cursor SDK is designed for programmatic use. Common patterns include auto-fixing failed builds, generating PR summaries, and running migration sweeps overnight. The SDK runs from any Node.js environment with outbound HTTPS.


Long-Horizon Agents with Composer 2.5 & Cursor SDK: Production Patterns, Guardrails, Cost Modeling
