AI & Automation · June 27, 2026 · 12 min read

Build Long-Running AI Agents With Nemotron 3 Ultra

Long-running agents are bottlenecked by throughput and context, not single-shot IQ. NVIDIA built Nemotron 3 Ultra as the orchestration brain for exactly that. This guide covers the orchestrator-worker pattern across the Nemotron 3 family, reasoning control as a cost lever, tool calling with a concrete agent loop, using the 1M context for memory, model routing, and the evals and guardrails that keep agents safe in production.

Lushbinary Team

AI & Cloud Solutions

Most models are benchmarked on single-shot answers. Agents do not work that way. A long-running agent might call a model hundreds of times in a single task, accumulate a growing transcript of tool outputs and intermediate decisions, and need every one of those calls to be fast, coherent, and cheap. That is a different optimization target, and it is exactly the one NVIDIA aimed Nemotron 3 Ultra at: the orchestration brain for complex, long-running agent workflows.

This guide is about building on that. We will cover why Ultra fits long-horizon agents, the orchestrator-worker pattern that keeps cost sane, tool calling, using the 1M context for memory, and a concrete orchestration loop you can adapt. For the model fundamentals, start with the Nemotron 3 Ultra developer guide; to run it yourself, see the self-hosting and deployment guide.

1. Why Long-Running Agents Need Throughput + Context

Two constraints dominate long-horizon agents, and neither is captured by a single-turn benchmark score:

  • Throughput. Per-step latency compounds: at 20 seconds per step, a 200-step task runs for over an hour. The win is completing more steps inside the same time budget. Ultra's hybrid architecture, multi-token prediction, and NVFP4 serving target exactly this, and NVIDIA reports a multiple of the throughput of comparable open models in output-heavy settings.
  • Context. An agent that forgets its earlier decisions drifts and contradicts itself. A 1M-token window lets the working state stay in the prompt rather than being lossily summarized away.
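To make the throughput constraint concrete, here is a back-of-the-envelope sketch of how many agent steps fit in a fixed wall-clock budget. The function name and every number in it are illustrative assumptions, not measured Nemotron 3 Ultra figures; substitute the decode speed and tool latency you observe on your own endpoint.

```python
def steps_within_budget(
    budget_s: float,
    tokens_per_step: int,
    tokens_per_s: float,
    tool_latency_s: float = 0.0,
) -> int:
    """Whole agent steps that fit inside a wall-clock budget."""
    # Each step costs generation time plus whatever the tool call takes.
    seconds_per_step = tokens_per_step / tokens_per_s + tool_latency_s
    return int(budget_s // seconds_per_step)


# Hypothetical workload: 10-minute budget, 500 output tokens and 1 s of tool time per step.
slow = steps_within_budget(600, 500, tokens_per_s=50, tool_latency_s=1.0)
fast = steps_within_budget(600, 500, tokens_per_s=200, tool_latency_s=1.0)
print(slow, fast)  # 54 171
```

In this made-up scenario a 4x faster decode rate yields roughly 3x more steps, because tool latency does not shrink with the model. That is worth remembering when you estimate what a faster model buys you.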

Ultra is not the only model you should use in an agent, but it is a strong fit for the role that needs both at once: the planner and verifier that holds the whole task in mind.

2. The Orchestrator-Worker Pattern

Running Ultra for every step of an agent is the fastest way to a shocking bill. The pattern NVIDIA designed the family around is a hierarchy:

  • Orchestrator (Nemotron 3 Ultra). Decomposes the goal, plans the sequence of subtasks, decides which tools or workers to call, and verifies the results before committing. This is where frontier reasoning earns its cost.
  • Workers (Nemotron 3 Super / Nano). Execute well-scoped subtasks: write a function, extract fields from a document, classify a ticket, summarize a source. These run far cheaper and faster.

Because Nano, Super, and Ultra share the same family format and OpenAI-compatible interface, swapping the model on a call is a one-line change. The orchestrator does the expensive thinking once per decision; the workers do the cheap volume. You only pay for 550B-class reasoning where it actually changes the outcome.
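One way to sketch the hierarchy is a plan, execute, verify cycle in which only planning and verification touch the orchestrator. This is a minimal illustration, not NVIDIA's reference design: `run_task`, the `call_model` callable, and the prompt wording are all assumptions, and in practice `call_model` would wrap a chat-completions request to your endpoint.

```python
from typing import Callable

# Model IDs as used elsewhere in this guide; confirm the exact names with your provider.
ORCHESTRATOR = "nvidia/nemotron-3-ultra-550b-a55b"
WORKER = "nvidia/nemotron-3-super-120b-a12b"

# Hypothetical adapter: (model, prompt) -> completion text.
CallModel = Callable[[str, str], str]


def run_task(goal: str, call_model: CallModel) -> list[dict]:
    # 1. The orchestrator plans once, returning one subtask per line.
    plan = call_model(ORCHESTRATOR, f"Break this goal into subtasks, one per line:\n{goal}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    results = []
    for subtask in subtasks:
        # 2. A cheaper worker executes the well-scoped subtask.
        output = call_model(WORKER, subtask)
        # 3. The orchestrator verifies the output before it is accepted.
        verdict = call_model(
            ORCHESTRATOR,
            f"Goal: {goal}\nSubtask: {subtask}\nOutput: {output}\nReply PASS or FAIL.",
        )
        accepted = verdict.strip().upper().startswith("PASS")
        results.append({"subtask": subtask, "output": output, "accepted": accepted})
    return results
```

Passing `call_model` in as a parameter keeps the control flow testable with a fake model, which is useful later when you build offline evals.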

3. Reasoning Control as a Cost Lever

Ultra exposes a toggleable reasoning mode through the chat template (the enable_thinking flag). Treat it as a per-step decision, not a global setting. Reasoning tokens are output tokens: they are billed at the higher output rate and they add latency, so spend them only where depth matters:

  • Reasoning on: planning, architectural choices, verifying a worker's output, resolving conflicting evidence.
  • Reasoning off: routing decisions, simple classification, formatting, calling a known tool with obvious arguments.

Driving this flag from the step type, rather than leaving it always-on, is one of the highest-leverage cost optimizations in an Ultra-based agent.
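A small helper can derive the flag from the step type. The step names and the `request_options` function are assumptions made for illustration; the `chat_template_kwargs` shape matches the one used in the agent loop in this guide, but check how your serving stack expects the flag to be passed.

```python
# Hypothetical step taxonomy: only these step types get reasoning enabled.
REASONING_STEPS = {"plan", "architecture", "verify", "resolve_conflict"}


def request_options(step_type: str) -> dict:
    """Build the extra_body for one call, enabling thinking only where depth matters."""
    return {"chat_template_kwargs": {"enable_thinking": step_type in REASONING_STEPS}}


print(request_options("verify"))  # {'chat_template_kwargs': {'enable_thinking': True}}
print(request_options("route"))   # {'chat_template_kwargs': {'enable_thinking': False}}
```

You would then pass `extra_body=request_options(step_type)` on each chat-completions call. Unknown step types default to reasoning off, which is the cheap failure mode; flip that default if a missed reasoning step costs you more than the extra tokens.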

4. Tool Calling & the Agent Loop

Ultra is trained for agentic tool use and served over OpenAI-compatible endpoints, so the loop is the familiar one: define tools, let the model propose a call, execute it, append the result, repeat. A minimal orchestration loop in Python:

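import json
import os

from openai import OpenAI

# Endpoint and key are read from the environment; both variable names are placeholders.
BASE_URL = os.environ["NEMOTRON_BASE_URL"]
API_KEY = os.environ["NEMOTRON_API_KEY"]
MAX_STEPS = 50  # example value: hard ceiling so a stuck agent cannot loop forever

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# One tool, described with a JSON Schema for its arguments.
tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search the repository for a symbol or string.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_codebase(query):
    # Placeholder: wire this to your real repository search.
    raise NotImplementedError("search_codebase is not implemented")

def run_tool(name, args):
    # Dispatch a tool call by name; unknown names are an error.
    if name == "search_codebase":
        return search_codebase(args["query"])
    raise ValueError(f"unknown tool: {name}")

messages = [
    {"role": "system", "content": "You are an orchestrator. Plan, call tools, verify, then answer."},
    {"role": "user", "content": "Find and fix the race condition in the payment worker."},
]

for _ in range(MAX_STEPS):
    resp = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-550b-a55b",
        messages=messages,
        tools=tools,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    msg = resp.choices[0].message
    messages.append(msg)

    if not msg.tool_calls:
        print(msg.content)   # final answer
        break

    # Execute every proposed call and append each result to the transcript.
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
else:
    # The loop ran out of steps without producing a final answer.
    raise RuntimeError(f"no final answer after {MAX_STEPS} steps")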

The structure is deliberately boring because that is what survives production. It has three parts: a hard MAX_STEPS ceiling, a step that either calls tools or finishes, and tool results appended back into the same growing context. The 1M window is what lets that context keep growing across a long task without forcing you to summarize prematurely.
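One hardening step worth considering for the loop above: models occasionally emit malformed JSON arguments or name a tool that does not exist, and an unhandled exception would end the whole task. The sketch below, with a hypothetical `tool_message` helper and a plain dict as the tool registry, always returns a well-formed tool message so the model can see the failure and retry.

```python
import json
from typing import Callable


def tool_message(call_id: str, name: str, raw_args: str, tools: dict[str, Callable]) -> dict:
    """Run one tool call and always return a well-formed tool message."""
    try:
        args = json.loads(raw_args)       # may raise on malformed JSON
        result = tools[name](**args)      # may raise KeyError or a tool-specific error
        payload = {"ok": True, "result": result}
    except Exception as exc:
        # Report the failure to the model instead of crashing the agent.
        payload = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(payload, default=str),
    }


registry = {"add": lambda a, b: a + b}  # toy tool for demonstration only
print(tool_message("call_1", "add", '{"a": 1, "b": 2}', registry)["content"])
# {"ok": true, "result": 3}
```

Catching every exception is a deliberate choice for this boundary only. Errors still count against the step ceiling, so a model that keeps failing stops rather than looping indefinitely.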

5. Using the 1M Context for Memory

A large context window is the cheapest form of agent memory: just keep the relevant state in the prompt. With Ultra you can hold the full plan, prior tool outputs, and intermediate conclusions in-context, which sharply reduces the drift that comes from aggressive summarization. But treat the window as working memory, not durable memory:

  • In-context (working memory): the current task's plan, recent tool results, and the decisions made so far.
  • External store (durable memory): facts, artifacts, and outcomes that must survive across sessions, retrieved on demand and injected when relevant.

Two more practical notes: hosted endpoints sometimes cap context below the full 1M, so confirm your provider's limit; and longer prompts cost more and run slower, so prune stale tool output instead of letting the transcript grow without bound just because it can.
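Pruning can be as simple as stubbing out old tool results while keeping the message skeleton intact. The sketch below assumes the transcript is a list of plain dicts (convert SDK message objects first); `prune_tool_output`, `keep_last`, and the stub text are illustrative choices, not part of any library.

```python
def prune_tool_output(
    messages: list[dict],
    keep_last: int = 5,
    stub: str = "[older tool output pruned]",
) -> list[dict]:
    """Return a copy in which all but the newest `keep_last` tool results are stubbed."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    # Keep every message and its tool_call_id; only the bulky content is replaced.
    return [{**m, "content": stub} if i in stale else m for i, m in enumerate(messages)]


transcript = [
    {"role": "user", "content": "go"},
    {"role": "tool", "tool_call_id": "1", "content": "A"},
    {"role": "tool", "tool_call_id": "2", "content": "B"},
    {"role": "tool", "tool_call_id": "3", "content": "C"},
]
pruned = prune_tool_output(transcript, keep_last=1)
print([m["content"] for m in pruned])
# ['go', '[older tool output pruned]', '[older tool output pruned]', 'C']
```

Stubbing rather than deleting keeps each tool message paired with the assistant tool call that requested it, which chat-completions style APIs generally require. Anything in a pruned result that must outlive the session belongs in the external store before it is stubbed.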

6. Model Routing Across the Nemotron 3 Family

A simple router turns the orchestrator-worker idea into code. Classify each step by difficulty and send it to the right tier:

MODEL_BY_TIER = {
    "plan_or_verify": "nvidia/nemotron-3-ultra-550b-a55b",   # frontier reasoning
    "worker":         "nvidia/nemotron-3-super-120b-a12b",   # substantial subtasks
    "cheap":          "nvidia/nemotron-3-nano",              # high-volume, simple steps
}

def model_for(step_type: str) -> str:
    if step_type in ("plan", "verify", "resolve_conflict"):
        return MODEL_BY_TIER["plan_or_verify"]
    if step_type in ("classify", "format", "route"):
        return MODEL_BY_TIER["cheap"]
    return MODEL_BY_TIER["worker"]

This keeps the expensive model on the critical path only. For a deeper treatment of routing economics across model tiers, see our writing on LLM gateways and model routing. The principle holds for any model family: match the model to the marginal value of the step.

7. Evaluation & Guardrails

Long-running agents fail in ways single prompts do not: they loop, they drift, they call the wrong tool with confident arguments. Ship them with the safety rails on:

  • Step and budget ceilings. Cap steps, total tokens, and wall-clock time per task so a stuck agent fails fast instead of burning the budget.
  • Tool sandboxing. Give each tool least privilege. Require explicit confirmation for irreversible actions such as deletes, deploys, or payments. Never let the agent execute untrusted output as code without isolation.
  • Verification steps. Have the orchestrator check worker output against the goal before accepting it, rather than chaining unchecked results.
  • Offline evals. Maintain a task suite with known-good outcomes and run it on every prompt or model change. Treat agent quality as a regression-tested artifact, not a vibe.
  • Untrusted input. Tool outputs and retrieved documents can carry prompt-injection attempts. Keep instructions and data clearly separated and do not blindly trust retrieved text.
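The first two guardrails translate into very little code. Below is a minimal sketch of a per-task budget and a confirmation gate; the `Budget` class, the `BudgetExceeded` exception, the `allowed` helper, and the tool names in `IRREVERSIBLE` are all hypothetical and should be adapted to your own tools and limits.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a task crosses its step, token, or time ceiling."""


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_seconds: float
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one completed step and fail fast if any ceiling is crossed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit of {self.max_tokens} exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded(f"time limit of {self.max_seconds}s exceeded")


# Hypothetical names for tools whose effects cannot be undone.
IRREVERSIBLE = {"delete_records", "deploy", "charge_card"}


def allowed(tool_name: str, confirmed: bool) -> bool:
    """Irreversible tools run only after explicit confirmation."""
    return tool_name not in IRREVERSIBLE or confirmed
```

In an agent loop you might call `budget.charge(resp.usage.total_tokens)` after each model response, assuming your endpoint reports usage, and check `allowed(...)` before dispatching any tool call.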

8. Why Lushbinary for Agent Engineering

The gap between a demo agent and a dependable one is all in the engineering: routing, reasoning control, tool sandboxing, evaluation, and cost guardrails. Lushbinary builds production agentic systems on open and hosted models alike, including orchestration with Nemotron 3 Ultra as the planner and cheaper workers underneath. We handle the loop, the infrastructure, and the controls that keep it safe and affordable.

Bring us the workflow you want to automate and we will design the agent architecture and ship it to production.
