Most models are benchmarked on single-shot answers. Agents do not work that way. A long-running agent might call a model hundreds of times in a single task, accumulate a growing transcript of tool outputs and intermediate decisions, and need every one of those calls to be fast, coherent, and cheap. That is a different optimization target, and it is exactly the one NVIDIA aimed Nemotron 3 Ultra at: the orchestration brain for complex, long-running agent workflows.
This guide is about building on that. We will cover why Ultra fits long-horizon agents, the orchestrator-worker pattern that keeps cost sane, tool calling, using the 1M context for memory, and a concrete orchestration loop you can adapt. For the model fundamentals, start with the Nemotron 3 Ultra developer guide; to run it yourself, see the self-hosting and deployment guide.
1. Why Long-Running Agents Need Throughput + Context
Two constraints dominate long-horizon agents, and neither is captured by a single-turn benchmark score:
- Throughput. If each reasoning step takes a long time, a 200-step task is unusable. The win is completing more steps inside the same time budget. Ultra's hybrid architecture, multi-token prediction, and NVFP4 serving target exactly this, with NVIDIA reporting throughput multiples over comparable open models on output-heavy settings.
- Context. An agent that forgets its earlier decisions drifts and contradicts itself. A 1M-token window lets the working state stay in the prompt rather than being lossily summarized away.
Ultra is not the only model you should use in an agent, but it is a strong fit for the role that needs both at once: the planner and verifier that holds the whole task in mind.
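To make the throughput point concrete, here is a back-of-the-envelope sketch. The step count, tokens per step, and decode speeds are hypothetical placeholders, not measured Nemotron figures, and the model ignores prefill time and tool latency. Substitute numbers from your own serving stack.

```python
def task_wall_clock_seconds(steps: int, output_tokens_per_step: int,
                            tokens_per_second: float) -> float:
    """Rough decode-only time estimate for a sequential agent task."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * output_tokens_per_step / tokens_per_second


# Hypothetical workload: 200 steps, 500 output tokens per step.
slow = task_wall_clock_seconds(200, 500, tokens_per_second=50)
fast = task_wall_clock_seconds(200, 500, tokens_per_second=200)
print(f"{slow / 60:.1f} min vs {fast / 60:.1f} min")  # 33.3 min vs 8.3 min
```

Under these assumed numbers, the same 200-step task drops from roughly half an hour to under ten minutes, which is the difference between an agent you wait on and one you abandon.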
2. The Orchestrator-Worker Pattern
Running Ultra for every step of an agent is the fastest way to a shocking bill. The pattern NVIDIA designed the family around is a hierarchy:
- Orchestrator (Nemotron 3 Ultra). Decomposes the goal, plans the sequence of subtasks, decides which tools or workers to call, and verifies the results before committing. This is where frontier reasoning earns its cost.
- Workers (Nemotron 3 Super / Nano). Execute well-scoped subtasks: write a function, extract fields from a document, classify a ticket, summarize a source. These run far cheaper and faster.
Because Nano, Super, and Ultra share the same family format and OpenAI-compatible interface, swapping the model on a call is a one-line change. The orchestrator does the expensive thinking once per decision; the workers do the cheap volume. You only pay for 550B-class reasoning where it actually changes the outcome.
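The cost argument can be sketched with simple arithmetic. The per-token prices and the 20/180 split between orchestrator and worker steps below are invented for illustration only; they are not real list prices or a recommended ratio.

```python
# Hypothetical prices in USD per million output tokens (not real list prices).
PRICE_PER_MTOK = {"ultra": 10.0, "worker": 1.0}


def task_cost(steps_by_tier: dict[str, int], tokens_per_step: int) -> float:
    """Sum the output-token cost of the steps assigned to each model tier."""
    total = 0.0
    for tier, steps in steps_by_tier.items():
        total += steps * tokens_per_step * PRICE_PER_MTOK[tier] / 1_000_000
    return total


# Same 200-step task, first run entirely on the orchestrator model,
# then with 20 planning/verification steps and 180 delegated worker steps.
all_ultra = task_cost({"ultra": 200}, tokens_per_step=500)
hierarchy = task_cost({"ultra": 20, "worker": 180}, tokens_per_step=500)
print(f"all Ultra: ${all_ultra:.2f}, hierarchy: ${hierarchy:.2f}")
```

With these made-up inputs the hierarchy costs about a fifth of the all-Ultra run. The real ratio depends entirely on your prices and on how many steps genuinely need the orchestrator.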
3. Reasoning Control as a Cost Lever
Ultra exposes a toggleable reasoning mode through the chat template (the enable_thinking flag). Treat it as a per-step decision, not a global setting. Reasoning tokens are output tokens, which are typically billed at a higher rate than input tokens and add latency, so spend them only where depth matters:
- Reasoning on: planning, architectural choices, verifying a worker's output, resolving conflicting evidence.
- Reasoning off: routing decisions, simple classification, formatting, calling a known tool with obvious arguments.
Driving this flag from the step type, rather than leaving it always-on, is one of the highest-leverage cost optimizations in an Ultra-based agent.
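One way to drive the flag from the step type is a small lookup. The step-type names here are illustrative labels derived from the two lists above, and the sketch assumes your endpoint accepts chat_template_kwargs in the request body; check your provider's documentation before relying on it.

```python
# Step types that justify reasoning tokens; names are illustrative, not an API.
THINKING_STEPS = {"plan", "architecture", "verify", "resolve_conflict"}


def request_options(step_type: str) -> dict:
    """Build an extra_body payload that toggles reasoning for a single step."""
    enable = step_type in THINKING_STEPS
    return {"chat_template_kwargs": {"enable_thinking": enable}}


print(request_options("verify"))
print(request_options("route"))
```

The returned dict can be passed as extra_body to client.chat.completions.create, so the decision lives in one place instead of being scattered across call sites. Unknown step types default to reasoning off, which is the cheaper failure mode; flip that default if your workload is dominated by hard steps.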
4. Tool Calling & the Agent Loop
Ultra is trained for agentic tool use and served over OpenAI-compatible endpoints, so the loop is the familiar one: define tools, let the model propose a call, execute it, append the result, repeat. A minimal orchestration loop in Python:
```python
import json
import os

from openai import OpenAI

# Placeholders: the environment variable names are illustrative.
# Point them at your own OpenAI-compatible endpoint.
BASE_URL = os.environ["NEMOTRON_BASE_URL"]
API_KEY = os.environ["NEMOTRON_API_KEY"]
MAX_STEPS = 25  # hard ceiling so a stuck agent cannot loop forever

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search the repository for a symbol or string.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]


def search_codebase(query):
    """Placeholder: replace with your real repository search."""
    raise NotImplementedError("wire this to your repository search")


def run_tool(name, args):
    """Dispatch a tool call proposed by the model to its implementation."""
    if name == "search_codebase":
        return search_codebase(args["query"])
    raise ValueError(f"unknown tool: {name}")


messages = [
    {"role": "system", "content": "You are an orchestrator. Plan, call tools, verify, then answer."},
    {"role": "user", "content": "Find and fix the race condition in the payment worker."},
]

for _ in range(MAX_STEPS):
    resp = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-550b-a55b",
        messages=messages,
        tools=tools,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    msg = resp.choices[0].message
    messages.append(msg)  # keep the assistant turn, including any tool calls

    if not msg.tool_calls:
        print(msg.content)  # final answer
        break

    # Execute every proposed call and feed each result back into the context.
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
else:
    print("Stopped: reached MAX_STEPS without a final answer.")
```

The structure is deliberately boring because that is what survives production: a hard MAX_STEPS ceiling, a step that either calls tools or finishes, and tool results appended back into the same growing context. The 1M window is what lets that context keep growing across a long task without forcing you to summarize prematurely.
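One thing the loop leaves open is what happens when a tool raises or the model emits malformed JSON arguments: as written, the exception ends the task. A possible refinement, offered as a sketch rather than something the source prescribes, is to return failures to the model as ordinary tool results so it can retry or change course. The helper name and the ok/result/error envelope are hypothetical.

```python
import json


def safe_tool_result(run_tool, name: str, raw_arguments: str) -> str:
    """Run a tool and always return a JSON string usable as tool-message content."""
    try:
        args = json.loads(raw_arguments)
        return json.dumps({"ok": True, "result": run_tool(name, args)})
    except Exception as exc:  # broad on purpose: every failure goes back to the model
        return json.dumps({"ok": False, "error": f"{type(exc).__name__}: {exc}"})


# Demo with a stand-in tool; a real run_tool would dispatch to real tools.
def demo_tool(name, args):
    return {"matches": [args["query"]]}


print(safe_tool_result(demo_tool, "search_codebase", '{"query": "lock"}'))
print(safe_tool_result(demo_tool, "search_codebase", "{not valid json"))
```

In the loop, the content of each tool message would become safe_tool_result(run_tool, call.function.name, call.function.arguments). The step ceiling still bounds how many times the model can retry a failing call.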
5. Using the 1M Context for Memory
A large context window is the cheapest form of agent memory: just keep the relevant state in the prompt. With Ultra you can hold the full plan, prior tool outputs, and intermediate conclusions in-context, which sharply reduces the drift that comes from aggressive summarization. But treat the window as working memory, not durable memory:
- In-context (working memory): the current task's plan, recent tool results, and the decisions made so far.
- External store (durable memory): facts, artifacts, and outcomes that must survive across sessions, retrieved on demand and injected when relevant.
Two more practical notes: hosted endpoints sometimes cap context below the full 1M, so confirm your provider's limit; and longer prompts cost more and run slower, so prune stale tool output instead of letting the transcript grow without bound just because it can.
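Pruning stale tool output can be as simple as stubbing out the content of older tool messages while leaving the messages themselves in place, so each tool_call_id still has its matching reply. The sketch below is one possible approach: the function name, the keep_recent default, and the placeholder string are all assumptions, and it only touches messages stored as plain dicts.

```python
def prune_tool_outputs(messages: list, keep_recent: int = 5,
                       placeholder: str = "[stale tool output pruned]") -> list:
    """Return a copy in which all but the newest tool results are stubbed out."""
    tool_indexes = [
        i for i, m in enumerate(messages)
        if isinstance(m, dict) and m.get("role") == "tool"
    ]
    # A slice ending at -0 would be empty, so handle keep_recent == 0 explicitly.
    stale = set(tool_indexes[:-keep_recent]) if keep_recent > 0 else set(tool_indexes)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]


history = [
    {"role": "user", "content": "Find the bug."},
    {"role": "tool", "tool_call_id": "a", "content": "10,000 lines of grep output"},
    {"role": "tool", "tool_call_id": "b", "content": "stack trace"},
]
pruned = prune_tool_outputs(history, keep_recent=1)
print(pruned[1]["content"], "|", pruned[2]["content"])
```

The original list is left untouched, so the full transcript can still be written to an external store before the pruned copy is sent to the model.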
6. Model Routing Across the Nemotron 3 Family
A simple router turns the orchestrator-worker idea into code. Classify each step by difficulty and send it to the right tier:
```python
MODEL_BY_TIER = {
    "plan_or_verify": "nvidia/nemotron-3-ultra-550b-a55b",  # frontier reasoning
    "worker": "nvidia/nemotron-3-super-120b-a12b",          # substantial subtasks
    "cheap": "nvidia/nemotron-3-nano",                      # high-volume, simple steps
}


def model_for(step_type: str) -> str:
    """Pick the model tier for a step; anything unrecognized goes to the worker tier."""
    if step_type in ("plan", "verify", "resolve_conflict"):
        return MODEL_BY_TIER["plan_or_verify"]
    if step_type in ("classify", "format", "route"):
        return MODEL_BY_TIER["cheap"]
    return MODEL_BY_TIER["worker"]
```

This keeps the expensive model on the critical path only. For a deeper treatment of routing economics across model tiers, see our writing on LLM gateways and model routing. The principle holds for any model family: match the model to the marginal value of the step.
7. Evaluation & Guardrails
Long-running agents fail in ways single prompts do not: they loop, they drift, they call the wrong tool with confident arguments. Ship them with the safety rails on:
- Step and budget ceilings. Cap steps, total tokens, and wall-clock time per task so a stuck agent fails fast instead of burning the budget.
- Tool sandboxing. Give each tool least privilege. Require explicit confirmation for irreversible actions such as deletes, deploys, or payments. Never let the agent execute untrusted output as code without isolation.
- Verification steps. Have the orchestrator check worker output against the goal before accepting it, rather than chaining unchecked results.
- Offline evals. Maintain a task suite with known-good outcomes and run it on every prompt or model change. Treat agent quality as a regression-tested artifact, not a vibe.
- Untrusted input. Tool outputs and retrieved documents can carry prompt-injection attempts. Keep instructions and data clearly separated and do not blindly trust retrieved text.
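The step and budget ceilings from the first bullet can be sketched as a small tracker that is charged once per model call. The class names and default limits are hypothetical; choose ceilings that match your own tasks.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a task crosses one of its ceilings."""


@dataclass
class TaskBudget:
    max_steps: int = 50
    max_tokens: int = 2_000_000
    max_seconds: float = 900.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one completed step and fail fast if any ceiling is crossed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step ceiling of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token ceiling of {self.max_tokens} exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded(f"time ceiling of {self.max_seconds}s exceeded")


budget = TaskBudget(max_steps=2, max_tokens=1_000, max_seconds=60)
budget.charge(300)
budget.charge(300)
try:
    budget.charge(300)  # third step crosses the step ceiling
except BudgetExceeded as exc:
    print(f"aborted: {exc}")
```

In an agent loop you would call budget.charge with the token count the endpoint reports for each response (for OpenAI-compatible servers this is usually resp.usage.total_tokens, though some servers omit usage) and treat BudgetExceeded as a clean task failure rather than a crash.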
8. Why Lushbinary for Agent Engineering
The gap between a demo agent and a dependable one is all in the engineering: routing, reasoning control, tool sandboxing, evaluation, and cost guardrails. Lushbinary builds production agentic systems on open and hosted models alike, including orchestration with Nemotron 3 Ultra as the planner and cheaper workers underneath. We handle the loop, the infrastructure, and the controls that keep it safe and affordable.
Bring us the workflow you want to automate and we will design the agent architecture and ship it to production.