Running an AI agent that learns from experience, remembers across sessions, and costs nothing per token sounds too good to be true. But that's exactly what you get when you pair Hermes Agent with local open-weight models through Ollama.
The problem with most AI agent setups is the recurring API bill. Claude, GPT, and Gemini are powerful, but at $30–$65/month for heavy agent use, costs add up fast. Meanwhile, open-weight models have caught up dramatically: Google's Gemma 4 (released April 2, 2026 under Apache 2.0) and Alibaba's Qwen 3.5 (released February 16, 2026 under Apache 2.0) both deliver frontier-adjacent intelligence that runs entirely on consumer hardware.
This guide walks you through setting up Hermes Agent with both model families via Ollama, compares their strengths for agentic workflows, shows you how to configure multi-model routing, and covers production deployment patterns – all with zero API costs.
📋 What This Guide Covers
- Why Local Models + Hermes Agent
- Gemma 4 Model Family Overview
- Qwen 3.5 Model Family Overview
- Installing Ollama & Pulling Models
- Setting Up Hermes Agent with Ollama
- Gemma 4 vs Qwen 3.5 for Agentic Workflows
- Multi-Model Routing: Best of Both
- Memory, Skills & the Self-Improving Loop
- Hardware Requirements & Performance Tuning
- Production Deployment Patterns
- Why Lushbinary for Your AI Agent Stack
1. Why Local Models + Hermes Agent
Hermes Agent is the self-improving AI agent from Nous Research. Released February 25, 2026 under MIT license, it has crossed 64,000+ GitHub stars and become the go-to framework for developers who want an agent that compounds knowledge over time. As of v0.8 (April 2026), it features pluggable memory backends, 40+ built-in tools, MCP server mode, multi-instance profiles, and six terminal backends.
The key insight: Hermes is model-agnostic. It works with any OpenAI-compatible API endpoint. That means you can swap between cloud providers and local inference with a single config change – no code modifications, no lock-in.
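Because the integration is just an OpenAI-compatible endpoint, it's easy to see what a client sends under the hood. Here's a minimal sketch of the request shape, assuming Ollama's default port; the helper function is illustrative, not part of Hermes:

```python
import json

# Sketch of what "OpenAI-compatible" means in practice: any client that can
# POST this JSON shape to /v1/chat/completions works, so swapping providers
# is just a base-URL change. The model name is whatever `ollama list` shows.
OLLAMA_BASE_URL = "http://127.0.0.1:11434/v1"  # local Ollama; no API key needed

def build_chat_request(model: str, user_message: str) -> tuple[str, dict]:
    """Return the (url, payload) pair for an OpenAI-style chat completion."""
    url = f"{OLLAMA_BASE_URL}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, payload

url, payload = build_chat_request("gemma4:26b", "Summarize this repo")
print(url)
print(json.dumps(payload))
```

Point the same payload at a cloud provider's base URL (plus an API key header) and nothing else changes, which is exactly why Hermes can treat local and cloud models interchangeably.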
Running local models gives you three advantages that cloud APIs can't match:
- Zero marginal cost – No per-token billing. Run thousands of agent tasks per day without watching a usage dashboard.
- Complete privacy – Your data never leaves your machine. Critical for enterprise workflows, healthcare data, financial analysis, or any sensitive context.
- No rate limits – Cloud APIs throttle heavy agent use. Local inference runs as fast as your hardware allows, with no queuing or 429 errors.
The tradeoff has always been quality. But in April 2026, that gap has narrowed dramatically. Gemma 4 31B ranked #3 among open models on the Arena AI text leaderboard at launch. Qwen 3.5 27B ties GPT-5 mini on SWE-bench Verified at 72.4. These aren't toy models – they're production-grade.
2. Gemma 4 Model Family Overview
Gemma 4 is Google DeepMind's fourth-generation open-weight model family, released April 2, 2026 under the Apache 2.0 license – a first for the Gemma family. It spans four sizes designed for different deployment scenarios:
| Variant | Params | Type | VRAM | Best For |
|---|---|---|---|---|
| E2B | 2.3B effective / 5.1B total | Dense + PLE | ~4 GB | Phones, IoT, Raspberry Pi |
| E4B | 4B effective | Dense + PLE | ~8 GB | Laptops, balanced local trial |
| 26B A4B (MoE) | 26B total / 4B active | MoE (128 experts, 8+1 active) | ~20 GB | Sweet spot for local agents |
| 31B Dense | 31B | Dense | ~24 GB | Maximum quality, workstations |
All Gemma 4 variants are multimodal (text + image input, with audio on edge models), support 140+ languages, and offer context windows up to 256K tokens. The architecture uses Per-Layer Embeddings (PLE) on edge models and a shared KV cache design that makes multi-turn conversations memory-efficient.
For Hermes Agent, the 26B MoE variant is the sweet spot: it activates only 4B parameters per token (fast inference) while drawing on 26B total parameters for quality. It fits comfortably on an RTX 4090 or M-series Mac with 24+ GB unified memory.
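A rough back-of-envelope shows why these VRAM figures land where they do. The bits-per-parameter and overhead values below are assumptions approximating Q4_K_M quantization, not official numbers:

```python
def estimate_weights_gb(total_params_b: float, bits_per_param: float = 4.5,
                        overhead: float = 1.2) -> float:
    """Rough memory for model weights alone (KV cache is extra).
    ~4.5 bits/param approximates Q4_K_M; overhead covers runtime buffers."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return round(bytes_total * overhead / 1e9, 1)

# Note: an MoE model must hold *all* experts in memory, so the 26B MoE
# sizes like a 26B dense model for VRAM even though only 4B params are
# active per token -- the MoE win is speed, not memory.
print(estimate_weights_gb(26))  # roughly 17-18 GB at Q4
print(estimate_weights_gb(31))  # roughly 20-21 GB at Q4
```

That's why both ~26B-class models land in the "~20 GB" tier once the KV cache is added on top.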
3. Qwen 3.5 Model Family Overview
Qwen 3.5 is Alibaba's latest model family, released February 16, 2026 under Apache 2.0. The flagship 397B MoE model activates only 17B parameters per token and beat Claude 4.5 Opus on the HMMT math benchmark. But the real story for local deployment is the smaller variants:
| Variant | Architecture | VRAM | Context | Best For |
|---|---|---|---|---|
| 0.8B | Dense + DeltaNet | ~2 GB | 262K | Edge, phones |
| 9B | Dense + DeltaNet | ~8 GB | 262K | Laptops, budget GPUs |
| 27B | Dense + DeltaNet | ~20 GB | 262K | Coding, long-context reasoning |
| 35B-A3B (MoE) | MoE + DeltaNet | ~12 GB | 1M | Efficient agentic workflows |
The architectural innovation in Qwen 3.5 is the hybrid Gated DeltaNet attention. It alternates between linear attention layers and full softmax attention in a 3:1 ratio. The linear layers maintain constant memory complexity regardless of sequence length, while the full attention blocks handle precision-critical reasoning. This enables up to 19x faster inference at 256K context compared to standard transformer attention.
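The 3:1 interleaving can be sketched as a simple layer schedule. Only the ratio comes from the published description; the exact placement of full-attention layers inside Qwen's blocks is an assumption here:

```python
def hybrid_layer_pattern(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Sketch of the 3:1 interleaving: every 4th layer uses full softmax
    attention, the rest use linear attention with constant per-token state."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (linear_per_full + 1) == 0:
            pattern.append("full")    # precision-critical global attention
        else:
            pattern.append("linear")  # constant memory, cheap at 256K context
    return pattern

layers = hybrid_layer_pattern(8)
print(layers)  # ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

Since only one layer in four pays the quadratic-attention cost, KV-cache growth and per-token latency at long context are dominated by the cheap linear layers.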
The 9B model is particularly impressive – it beats GPT-OSS-120B on multiple knowledge benchmarks despite being 13x smaller. For Hermes Agent, the 27B dense model is the recommended pick: it achieves 72.4 on SWE-bench Verified (tying GPT-5 mini) and handles complex multi-step tool-calling workflows reliably.
💡 Key Difference
Gemma 4 uses Per-Layer Embeddings (PLE) for parameter efficiency on edge models. Qwen 3.5 uses Gated DeltaNet for inference speed on long contexts. Both are Apache 2.0. Choose based on your primary workload: multimodal + structured output → Gemma 4; long-context coding + reasoning → Qwen 3.5.
4. Installing Ollama & Pulling Models
Ollama is the fastest path to running both model families locally. It handles quantization, model management, and serves an OpenAI-compatible API that Hermes connects to natively. Ollama v0.20.0 added same-day support for Gemma 4 on April 3, 2026.
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server
ollama serve
Pull Gemma 4 Models
# Edge model (~8 GB VRAM)
ollama pull gemma4:e4b
# MoE sweet spot (~20 GB VRAM) – recommended for Hermes
ollama pull gemma4:26b
# Maximum quality (~24 GB VRAM)
ollama pull gemma4:31b
Pull Qwen 3.5 Models
# Compact powerhouse (~8 GB VRAM)
ollama pull qwen3.5:9b
# Best for coding & reasoning (~20 GB VRAM) – recommended for Hermes
ollama pull qwen3.5:27b
# MoE variant, efficient (~12 GB VRAM)
ollama pull qwen3.5:35b-a3b
Verify your models are available:
ollama list
# Should show gemma4:26b, qwen3.5:27b, etc.
5. Setting Up Hermes Agent with Ollama
Hermes Agent connects to Ollama through its custom endpoint configuration. The setup takes under two minutes.
Step 1: Install Hermes Agent
# Linux / macOS / WSL2
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# Reload your shell
source ~/.bashrc   # or source ~/.zshrc
Step 2: Configure Ollama as Provider
Run the model selection wizard:
hermes model
# Select "More providersβ¦"
# Select "Custom endpoint (enter URL manually)"
# API base URL: http://127.0.0.1:11434/v1
# API key: (leave blank)
# Hermes auto-detects your downloaded models
# Select gemma4:26b or qwen3.5:27b
⚠️ Context Length Requirement
Hermes Agent requires a minimum of 64K tokens of context. When running local models, set the context size explicitly, e.g. ollama run gemma4:26b --ctx-size 65536. Hermes attempts to configure this automatically during setup, but verify the value if you see startup errors.
Step 3: Start Chatting
hermes
# You'll see a welcome banner showing your model,
# available tools, and loaded skills
❯ What can you help me with?
That's it. Hermes is now running entirely on your machine with zero API costs. The agent has access to web search, file operations, terminal commands, and more out of the box.
Quick Model Switching
Want to try a different model? Switch without restarting:
# Switch to Qwen 3.5 27B
hermes model
# Select qwen3.5:27b from detected models
# Or use the /model slash command inside chat
/model
6. Gemma 4 vs Qwen 3.5 for Agentic Workflows
Both model families are strong for agent use, but they make different tradeoffs. Here's how they compare on the dimensions that matter most for Hermes Agent workflows:
| Dimension | Gemma 4 (26B MoE) | Qwen 3.5 (27B) |
|---|---|---|
| Function calling | ✅ Native, reliable structured output | ✅ Strong, occasional format drift on complex chains |
| Coding | Good (competitive at ~30B class) | ✅ Excellent (72.4 SWE-bench Verified) |
| Long context | 256K tokens | ✅ 262K–1M tokens (DeltaNet, 19x faster at 256K) |
| Multimodal | ✅ Vision + audio (edge models) | Vision (Qwen3.5-Omni adds audio/video) |
| Inference speed | ✅ Fast (only 4B active params in MoE) | Good (27B dense, but DeltaNet helps at long context) |
| VRAM | ~20 GB | ~20 GB |
| Multilingual | ✅ 140+ languages | ✅ Strong multilingual (CJK especially) |
| License | Apache 2.0 | Apache 2.0 |
When to Choose Gemma 4
- Your agent workflows rely heavily on structured JSON output and function calling – Gemma 4's native function calling is more consistent
- You need multimodal input (analyzing images, screenshots, documents) as part of agent tasks
- You want the fastest inference at the ~20 GB VRAM tier (MoE activates only 4B params)
- You're building agents that serve multilingual users across 140+ languages
When to Choose Qwen 3.5
- Your agent does heavy coding tasks – code generation, refactoring, debugging, PR reviews
- You need long-context reasoning over large codebases, documents, or conversation histories (up to 1M tokens with the 35B-A3B MoE variant)
- You want the strongest math and reasoning at the ~27B parameter class
- You're working primarily with Chinese, Japanese, or Korean content where Qwen has a natural advantage
🎯 Our Recommendation
For most Hermes Agent users, start with Gemma 4 26B MoE for general-purpose agent tasks (fast, reliable function calling, good quality). Switch to Qwen 3.5 27B for coding-heavy sessions. Use multi-model routing (next section) to get the best of both automatically.
7. Multi-Model Routing: Best of Both
One of Hermes Agent's most powerful features is its ability to switch models on the fly. With both Gemma 4 and Qwen 3.5 pulled in Ollama, you can configure multi-instance profiles to route different task types to the optimal model.
Profile-Based Routing
Hermes v0.6.0 introduced multi-instance profiles. You can run separate agent instances, each configured with a different model:
# Create a Gemma 4 profile for general tasks
hermes profile create general --model gemma4:26b
# Create a Qwen 3.5 profile for coding tasks
hermes profile create coder --model qwen3.5:27b
# Run the general agent
hermes --profile general
# Run the coding agent in another terminal
hermes --profile coder
Hybrid Local + Cloud Routing
For the best cost-quality balance, use local models for routine tasks and fall back to a cloud model for complex reasoning:
💻 Local (90% of tasks)
- Gemma 4 26B for file ops, web search, scheduling
- Qwen 3.5 27B for code generation, debugging
- Cost: $0/month
☁️ Cloud (10% of tasks)
- Claude or GPT for complex multi-step reasoning
- Fallback when local model struggles
- Cost: $2–5/month typical
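The decision logic behind this split can be sketched as a tiny router. The keyword lists and the cloud fallback name below are illustrative assumptions; in practice, Hermes profiles or the /model command do the actual switching:

```python
# Hypothetical routing helper: pick a model by inspecting the task text.
# Keyword lists and model names are illustrative, not part of Hermes itself.
CODING_HINTS = ("refactor", "debug", "implement", "unit test", "pull request")
HARD_HINTS = ("prove", "architecture review", "multi-step plan")

def route_model(task: str) -> str:
    t = task.lower()
    if any(h in t for h in HARD_HINTS):
        return "claude-cloud"      # ~10% of tasks: fall back to a cloud model
    if any(h in t for h in CODING_HINTS):
        return "qwen3.5:27b"       # local coding specialist
    return "gemma4:26b"            # local default: fast, reliable tool calls

print(route_model("Refactor the auth module"))       # qwen3.5:27b
print(route_model("Search the web for GPU prices"))  # gemma4:26b
```

Keyword routing is crude but cheap; a more robust variant would ask the local model itself to classify the task before dispatching.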
8. Memory, Skills & the Self-Improving Loop
Hermes Agent's self-improving loop is what sets it apart from every other agent framework. It works identically with local models – the learning system is model-agnostic.
How the Learning Loop Works
- Task completion – After finishing a complex task, Hermes analyzes what it did and extracts reusable patterns
- Skill creation – It writes these patterns as Markdown skill files stored in ~/.hermes/skills/
- Skill loading – When a similar task comes up, Hermes loads the relevant skill into context before starting
- Self-evaluation – Every 15 tasks, the agent evaluates its performance and refines its approach
- Skill improvement – Skills that produce better outcomes get updated; underperforming ones get revised
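To make the skill-file idea concrete, here's a hypothetical sketch of writing one. The Markdown layout is an assumption (Hermes's real schema may differ), and it writes to a temp directory rather than ~/.hermes/skills/ so it runs anywhere:

```python
import tempfile
from pathlib import Path

def save_skill(skills_dir: Path, name: str, trigger: str, steps: list[str]) -> Path:
    """Persist a reusable pattern as a Markdown skill file (format assumed)."""
    body = [f"# Skill: {name}", "", f"**Trigger:** {trigger}", "", "## Steps"]
    body += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    path = skills_dir / f"{name}.md"
    path.write_text("\n".join(body) + "\n")
    return path

skills_dir = Path(tempfile.mkdtemp())  # stand-in for ~/.hermes/skills/
p = save_skill(skills_dir, "deploy-docs",
               "user asks to publish documentation",
               ["Build the static site", "Run link checker", "Push to gh-pages"])
print(p.read_text())
```

Because skills are plain Markdown, you can read, edit, and version-control them yourself – the agent's "learning" is fully inspectable.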
Four-Tier Memory Architecture
As of v0.7.0, Hermes uses a four-tier memory system designed to maximize provider-side prompt caching:
Working Memory
Current conversation context. Fits within the model's context window.
Session Memory
SQLite + FTS5 full-text search. Persists across restarts within a session.
Long-Term Memory
LLM-summarized history. Recalled via semantic search when relevant.
Skill Memory
Markdown skill files. Loaded into context when matching tasks are detected.
With v0.7.0's pluggable memory backends, you can swap in vector stores, Honcho, or custom databases for the long-term memory tier. This is especially useful when running local models, since you can keep everything on your own infrastructure.
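The session-memory tier (SQLite + FTS5) can be reproduced in a few lines. Table and column names here are illustrative assumptions, not Hermes's actual schema:

```python
import sqlite3

# Minimal sketch of the session-memory tier: SQLite with an FTS5 virtual
# table for full-text recall. Schema is illustrative, not Hermes's real one.
db = sqlite3.connect(":memory:")  # Hermes would use a file for persistence
db.execute("CREATE VIRTUAL TABLE session_memory USING fts5(role, content)")
db.executemany(
    "INSERT INTO session_memory VALUES (?, ?)",
    [("user", "deploy the staging kubernetes cluster"),
     ("assistant", "applied manifests and verified pod health"),
     ("user", "summarize yesterday's billing report")],
)
# Full-text MATCH recalls only the relevant turns, not the whole history.
rows = db.execute(
    "SELECT content FROM session_memory WHERE session_memory MATCH ?",
    ("kubernetes",),
).fetchall()
print(rows)
```

Requires an SQLite build with the FTS5 extension enabled (the default on most modern distributions). Swapping this tier for a vector store is what the pluggable-backend interface enables.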
Installing Community Skills
Hermes has a growing skills ecosystem. You can browse and install community-contributed skills:
# Search for skills
hermes skills search kubernetes
hermes skills search react --source skills-sh
# Install a skill
hermes skills install openai/skills/k8s
hermes skills install official/security/1password
9. Hardware Requirements & Performance Tuning
Running Hermes with local models requires enough VRAM (or unified memory on Apple Silicon) to hold the model weights plus the KV cache for the context window. Here's a practical hardware guide:
| Hardware | Memory | Recommended Models | Tokens/sec (approx) |
|---|---|---|---|
| MacBook Air M2/M3 (8 GB) | 8 GB unified | Gemma 4 E4B, Qwen 3.5 9B | 15–25 tok/s |
| MacBook Pro M3/M4 (16–24 GB) | 16–24 GB unified | Gemma 4 26B MoE, Qwen 3.5 27B | 10–20 tok/s |
| RTX 4090 (24 GB) | 24 GB VRAM | Gemma 4 31B, Qwen 3.5 27B | 25–40 tok/s |
| Mac Studio M4 Ultra (64–192 GB) | 64+ GB unified | All variants, multiple simultaneous | 20–35 tok/s |
Performance Tuning Tips
- Context size tradeoff: Larger context windows consume more memory. Start with --ctx-size 65536 (64K, the Hermes minimum) and increase only if your workflows need it.
- Quantization: Ollama defaults to Q4_K_M quantization, which is a good balance. For maximum quality on capable hardware, use ollama pull gemma4:31b-fp16.
- GPU offloading: On systems with limited VRAM, Ollama automatically splits layers between GPU and CPU. More GPU layers = faster inference.
- Concurrent models: If running both Gemma 4 and Qwen 3.5 for multi-model routing, ensure you have enough total memory for both. Ollama keeps recently used models in memory.
- Batch size: For agent workflows (single-user, sequential), the default batch size is fine. Don't increase it unless you're serving multiple users.
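To see why context size dominates memory, here's a rough KV-cache estimate. The layer and head counts are assumed values for a ~27B-class model with grouped-query attention, not published specs, and real runtimes may quantize the cache below fp16:

```python
def kv_cache_gb(ctx_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
    x context x bytes-per-value (2 = fp16). All architecture numbers
    passed in below are assumptions for illustration."""
    total = 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_val
    return round(total / 1e9, 2)

# Cache size grows linearly with context: quadruple the window,
# quadruple the memory. This is why 64K is a sane starting point.
print(kv_cache_gb(65536, layers=48, kv_heads=8, head_dim=128))
print(kv_cache_gb(262144, layers=48, kv_heads=8, head_dim=128))
```

This also explains Qwen 3.5's DeltaNet advantage: linear-attention layers keep constant state instead of a per-token cache, so only the full-attention layers pay this bill.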
10. Production Deployment Patterns
Running Hermes with local models in production requires a few additional considerations beyond the basic setup.
Pattern 1: Personal VPS Agent
The simplest production setup. Run Hermes + Ollama on a GPU VPS and connect via messaging platforms:
# On your VPS (e.g., Lambda Labs, Vast.ai, RunPod)
# 1. Install Ollama + pull models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:26b
ollama pull qwen3.5:27b
# 2. Install Hermes
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# 3. Configure Ollama endpoint
hermes setup
# 4. Connect messaging (Telegram, Discord, etc.)
hermes gateway setup
# 5. Start the gateway (runs 24/7)
hermes gateway start
Cost: A GPU VPS with 24 GB VRAM (enough for Gemma 4 26B or Qwen 3.5 27B) runs $0.30–$0.80/hour on spot instances – roughly $220–$580/month if truly always-on, or $50–$150/month if you only keep it up during working hours. Compare that to $30–$65/month in API costs for heavy cloud model usage – and you get unlimited tokens.
Pattern 2: Sandboxed Docker Deployment
For security, run Hermes in a Docker container with sandboxed terminal access:
# Configure Docker sandboxing
hermes config set terminal.backend docker
# Or use SSH for remote server isolation
hermes config set terminal.backend ssh
Pattern 3: MCP Server Mode
Hermes v0.6.0 added native MCP server capabilities. You can expose Hermes as an MCP server that other tools (VS Code, Kiro, JetBrains) can connect to:
pip install 'hermes-agent[acp]'
hermes acp
Pattern 4: Scheduled Automation
Hermes can set up cron jobs that run automatically via the messaging gateway. This is powerful with local models because there's no per-invocation cost:
# Inside Hermes chat:
β― Every morning at 9am, check Hacker News for AI news
Β Β and send me a summary on Telegram.
# Hermes creates a cron job automatically
# Runs via the gateway with zero API cost
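Under the hood, a scheduled job reduces to a cron spec plus a task description. This sketch shows simplified matching logic (minute and hour fields only, no day/month handling); Hermes's real scheduler is internal and the job structure below is an assumption:

```python
from datetime import datetime

# Illustrative only: how "every morning at 9am" maps to a cron spec.
JOB = {"schedule": "0 9 * * *",
       "task": "Check Hacker News for AI news, summarize to Telegram"}

def cron_matches(spec: str, when: datetime) -> bool:
    """Simplified cron check: only the minute and hour fields; the
    day-of-month, month, and day-of-week fields are ignored here."""
    minute, hour, *_ = spec.split()
    ok_min = minute == "*" or int(minute) == when.minute
    ok_hour = hour == "*" or int(hour) == when.hour
    return ok_min and ok_hour

print(cron_matches(JOB["schedule"], datetime(2026, 4, 10, 9, 0)))    # True
print(cron_matches(JOB["schedule"], datetime(2026, 4, 10, 14, 30)))  # False
```

With cloud APIs, a job like this costs money every single morning; with local inference, the marginal cost of each run is zero, so you can schedule as many as your hardware keeps up with.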
System Architecture (diagram)
11. Why Lushbinary for Your AI Agent Stack
Setting up Hermes Agent with local models is straightforward for a single developer. But deploying it as part of a production system β with proper security, monitoring, multi-model routing, and integration with your existing infrastructure β requires expertise across AI, DevOps, and application architecture.
At Lushbinary, we've deployed Hermes Agent, Gemma 4, and Qwen 3.5 in production environments for clients across industries. Our team specializes in:
- Local AI agent architecture β Designing multi-model routing systems that balance cost, quality, and latency
- GPU infrastructure β Selecting and configuring the right hardware (cloud GPU, on-prem, Apple Silicon) for your workload
- Hermes customization β Building custom skills, MCP integrations, and messaging gateway configurations
- Security hardening β Docker sandboxing, SSH isolation, and access control for production agent deployments
- AWS deployment β Running Ollama + Hermes on EC2 GPU instances with auto-scaling and cost optimization
📞 Free Consultation
Want to deploy a self-improving AI agent that runs on your own infrastructure with zero API costs? Lushbinary specializes in local AI agent deployments with Hermes, Gemma 4, and Qwen 3.5. We'll scope your project, recommend the right model and hardware configuration, and give you a realistic timeline β no obligation.
❓ Frequently Asked Questions
Can Hermes Agent run with local models like Gemma 4 and Qwen 3.5?
Yes. Hermes Agent supports any OpenAI-compatible endpoint, including Ollama. You can run Gemma 4 (E4B, 26B, or 31B) or Qwen 3.5 (9B, 27B, or 35B-A3B) locally via Ollama and point Hermes to http://127.0.0.1:11434/v1 as a custom endpoint.
What hardware do I need to run Hermes with Gemma 4 or Qwen 3.5 locally?
Gemma 4 E4B needs ~8 GB VRAM, Gemma 4 26B MoE needs ~20 GB, and Gemma 4 31B Dense needs ~24 GB. Qwen 3.5 9B needs ~8 GB, Qwen 3.5 27B needs ~20 GB, and Qwen 3.5 35B-A3B MoE needs ~12 GB. An M-series Mac with 16-32 GB unified memory or an RTX 4090 covers most variants.
Which model is better for Hermes Agent: Gemma 4 or Qwen 3.5?
Gemma 4 31B excels at structured output, function calling, and multimodal tasks. Qwen 3.5 27B is stronger at long-context reasoning (up to 1M tokens with hybrid DeltaNet attention) and coding benchmarks. For agentic workflows, Gemma 4 26B MoE offers the best balance of speed and quality at lower VRAM cost.
How does Hermes Agent's self-improving loop work with local models?
Hermes creates reusable skill documents from completed tasks, stores them as Markdown files, and loads them when similar tasks arise. Every 15 tasks it self-evaluates. This works identically with local models β the learning loop is model-agnostic and runs entirely on your machine.
Is running Hermes Agent with local models free?
Yes. Hermes Agent is MIT-licensed and free. Ollama is free. Both Gemma 4 and Qwen 3.5 are Apache 2.0. The only ongoing cost is your hardware and electricity. A $5/month VPS can run the smallest variants, like Gemma 4 E2B or Qwen 3.5 0.8B, on CPU.
📚 Sources
- Google AI β Gemma 4 Release Notes
- Hugging Face β Gemma 4 Blog
- Nous Research β Hermes Agent Quickstart
- Ollama β Hermes Agent Integration
- Hugging Face β Qwen 3.5 Architecture Deep Dive
Model specifications sourced from official release documentation as of April 2026. Pricing and availability may change – always verify on the vendor's website.
Deploy a Self-Improving AI Agent on Your Infrastructure
Let Lushbinary set up Hermes Agent with Gemma 4 and Qwen 3.5 on your hardware β zero API costs, complete privacy, and an agent that gets smarter every day.
Build Smarter, Launch Faster.
Book a free strategy call and explore how LushBinary can turn your vision into reality.

