Mistral AI has been on a tear. From scrappy European startup to a $14 billion valuation in under three years, the Paris-based lab has consistently shipped models that punch above their weight class. Their latest release, Mistral Medium 3.5, might be their most ambitious yet: a dense 128B parameter model that unifies instruction-following, reasoning, and coding into a single set of weights, with a 256K context window and multimodal vision capabilities.
What makes Medium 3.5 interesting is the "merged model" approach. Instead of shipping separate models for chat, reasoning, and code (like Magistral and Devstral before it), Mistral collapsed everything into one model with configurable reasoning effort per request. The same model can fire off a quick chat reply or grind through a complex agentic coding session.
In this guide, we cover everything developers need to know: the architecture, benchmark results, API integration, pricing, self-hosting options, and how Medium 3.5 compares to the competition. Whether you're evaluating it for production workloads or exploring open-weight alternatives to proprietary models, this is the complete reference.
1. Architecture & Key Features
Mistral Medium 3.5 is what Mistral calls their first "flagship merged model." The core idea: instead of maintaining separate model weights for instruction-following (Medium 3.1), reasoning (Magistral), and coding (Devstral 2), they trained a single dense 128B model that handles all three. This replaces three models with one, simplifying deployment and reducing the operational overhead of routing between specialized models.
| Spec | Detail |
|---|---|
| Parameters | 128B dense (all active per inference) |
| Context Window | 256K tokens |
| Architecture | Dense transformer (not MoE) |
| Modalities | Text + image input, text output |
| Reasoning | Configurable per request (none / high) |
| Function Calling | Native, with JSON output |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic, and more |
| License | Modified MIT (open weights, commercial use with revenue exceptions) |
| Min Self-Host GPUs | 4 GPUs (e.g., 4x H100 80GB) |
The vision encoder deserves special mention. Mistral trained it from scratch to handle variable image sizes and aspect ratios, rather than forcing images into fixed resolutions. This means the model can process everything from tall screenshots to wide panoramic images without distortion or information loss.
Key Architectural Decision
Unlike Mistral Large 3 (which uses a 675B MoE architecture with 41B active parameters), Medium 3.5 is fully dense. Every parameter is active on every inference pass. This makes it simpler to deploy and more predictable in latency, but requires more compute per token than a comparable MoE model.
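To see why the spec table lists four 80GB GPUs as the self-hosting floor, here is a rough back-of-the-envelope calculation. These are our own estimates, not an official sizing guide:

```python
# Rough memory estimate for serving a dense 128B model (our estimate, not official sizing).
params = 128e9
bytes_per_param = 2  # bf16/fp16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~256 GB

gpu_mem_gb = 80   # H100 80GB
num_gpus = 4
print(f"4x H100: {gpu_mem_gb * num_gpus} GB total")  # 320 GB

# That leaves roughly 64 GB across the cluster for KV cache and activations,
# which is why 4 GPUs with tensor parallelism is the stated minimum.
```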
2. Benchmark Results
Mistral published benchmark results across agentic coding, instruction-following, reasoning, and general tasks. The headline numbers are strong, particularly on coding and agentic benchmarks where Medium 3.5 outperforms its predecessors and several larger models.
Agentic & Coding Benchmarks
| Benchmark | Medium 3.5 | Devstral 2 | Qwen3.5 397B |
|---|---|---|---|
| SWE-Bench Verified | 77.6% | ~72% | ~74% |
| Tau3-Telecom | 91.4% | - | - |
The 77.6% SWE-Bench Verified score is particularly notable. This benchmark tests whether a model can resolve real GitHub issues by generating correct patches. For context, Gemini 3.1 Pro Preview leads the overall leaderboard at 78.8%, putting Medium 3.5 in competitive territory with Google's flagship. The Tau3-Telecom score of 91.4% demonstrates strong agentic capabilities in domain-specific tool-use scenarios.
On instruction-following and reasoning benchmarks, Medium 3.5 shows strong results across the board. Mistral positions it as a unified replacement for their previous specialized models, and the benchmarks support that claim. It handles math, coding, and general knowledge tasks without the quality drop you might expect from a jack-of-all-trades model.
3. API Integration & Code Examples
Mistral's API is OpenAI-compatible, which means most existing OpenAI client libraries work with a simple base URL swap. Here's how to get started with the Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-mistral-api-key",
    base_url="https://api.mistral.ai/v1",
)

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in 3 sentences."},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```

Function Calling
Medium 3.5 supports native function calling with JSON output. The model reliably selects the right tool and structures arguments correctly, which is critical for agentic workflows:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.1,
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```

4. Configurable Reasoning Effort
One of Medium 3.5's standout features is per-request reasoning effort configuration. You can toggle between fast instant replies and deep reasoning mode, letting the same model handle both quick chat responses and complex multi-step problems.
| | `reasoning_effort="none"` | `reasoning_effort="high"` |
|---|---|---|
| Behavior | Fast responses, lower latency | Extended thinking, higher accuracy |
| Best for | Simple Q&A, classification, extraction | Coding, math, agentic tasks |
| Temperature | 0.0 - 0.7 | 0.7 recommended |
| Token usage | Lower per request | Higher, but better results |
```python
# Quick chat reply - no reasoning overhead
fast_response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "What is Python?"}],
    temperature=0.1,
    extra_body={"reasoning_effort": "none"},
)

# Complex coding task - full reasoning
deep_response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Refactor this module to use dependency injection..."}],
    temperature=0.7,
    extra_body={"reasoning_effort": "high"},
)
```

This dual-mode approach is practical for production systems. You can route simple customer queries through the fast path and reserve reasoning mode for complex tasks, all without switching models or managing multiple deployments.
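Here is a minimal sketch of that routing idea, reusing the `client` from above. The keyword heuristic is purely illustrative; a real system might use a cheap classifier or request metadata instead:

```python
# Hypothetical router: pick reasoning effort per request.
# The keyword heuristic below is a placeholder, not part of Mistral's API.
COMPLEX_HINTS = ("refactor", "debug", "prove", "implement", "optimize")

def answer(user_message: str):
    hard = any(hint in user_message.lower() for hint in COMPLEX_HINTS)
    return client.chat.completions.create(
        model="mistral-medium-3.5",
        messages=[{"role": "user", "content": user_message}],
        temperature=0.7 if hard else 0.1,
        extra_body={"reasoning_effort": "high" if hard else "none"},
    )
```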
5. Vision & Multimodal Capabilities
Medium 3.5 includes a vision encoder trained from scratch. Unlike models that bolt on vision as an afterthought, Mistral designed this encoder to handle variable image sizes and aspect ratios natively. This means you can send screenshots, documents, charts, and photos without worrying about resolution constraints.
```python
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the architecture in this diagram."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/arch-diagram.png"},
                },
            ],
        }
    ],
    temperature=0.3,
)
```

Vision use cases that work well with Medium 3.5 include document parsing and OCR, chart and graph interpretation, UI screenshot analysis, visual QA for product images, and code screenshot understanding. The variable aspect ratio support is particularly useful for document processing where pages come in different sizes.
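For local files such as screenshots or scanned pages, the usual OpenAI-compatible pattern is to inline the image as a base64 data URL. A sketch, assuming a local PNG and reusing the `client` from earlier:

```python
import base64

# Encode a local screenshot as a data URL (standard OpenAI-compatible pattern).
with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table from this screenshot as Markdown."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```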
6. Pricing & Cost Analysis
Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens through the Mistral API. Here's how that stacks up against the competition:
| Model | Input / 1M | Output / 1M |
|---|---|---|
| Mistral Medium 3.5 | $1.50 | $7.50 |
| Mistral Medium 3 | $0.40 | $2.00 |
| Mistral Large 3 | $0.50 | $1.50 |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
Medium 3.5 sits in a middle tier: more expensive than the budget Medium 3 ($0.40/$2.00), and even than Large 3 on a per-token basis, but cheaper than GPT-4o and Claude Sonnet 4. The pricing reflects its position as a unified model that replaces multiple specialized models. For teams currently running separate Magistral and Devstral deployments, consolidating to Medium 3.5 could simplify billing even if the per-token cost is higher.
Cost Optimization Tip
Use `reasoning_effort="none"` for simple tasks to reduce output token usage. Reserve `reasoning_effort="high"` for complex coding and reasoning tasks. This can cut your average cost per request significantly without sacrificing quality where it matters.
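To make that concrete, here is a quick cost estimate using the published rates. The token counts are made-up workload assumptions, not measurements:

```python
# Cost sketch at $1.50 / 1M input tokens and $7.50 / 1M output tokens.
IN_PRICE, OUT_PRICE = 1.50 / 1e6, 7.50 / 1e6

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Illustrative: a short no-reasoning reply vs. a reasoning-heavy coding task.
print(f"fast path: ${request_cost(500, 200):.4f}")    # ~$0.0023
print(f"deep path: ${request_cost(3000, 4000):.4f}")  # ~$0.0345
```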
7. Self-Hosting Options
Mistral released Medium 3.5 as open weights on Hugging Face under a modified MIT license. The license allows commercial use but includes revenue-based exceptions for very large companies. For most startups and mid-size businesses, it's effectively open source.
Deployment Options
- vLLM (recommended): Production-ready inference with tensor parallelism. Requires vLLM nightly, mistral_common >= 1.11.1, and transformers >= 5.4.0. Minimum: 4x H100 80GB with TP=4.
- SGLang: Alternative inference engine with day-zero support via dedicated Docker images for Hopper and Blackwell GPUs. Requires transformers >= 5.4.0.
- Ollama: Simplified local deployment. Good for development and testing, though performance may lag behind vLLM for production workloads. GGUF quantized versions are available via Unsloth.
- NVIDIA NIM: Containerized inference microservice for enterprise deployment. Available on build.nvidia.com for prototyping, with GPU-accelerated endpoints.
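Once a server is up, the same OpenAI-compatible client from earlier works against it. A sketch, assuming vLLM's OpenAI-compatible server on localhost:8000 and a hypothetical Hugging Face model ID:

```python
from openai import OpenAI

# Point the client at a self-hosted vLLM server instead of Mistral's API.
local = OpenAI(api_key="not-needed-locally", base_url="http://localhost:8000/v1")

response = local.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5",  # placeholder: use the ID you served it under
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(response.choices[0].message.content)
```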
Mistral also released an EAGLE model to speed up local inference with vLLM and SGLang. EAGLE uses speculative decoding to predict multiple tokens ahead, reducing latency without sacrificing quality. If you're self-hosting, this is worth enabling.
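Enabling speculative decoding in vLLM's Python API looks roughly like the sketch below. The exact configuration keys have changed across vLLM releases and the EAGLE draft-model path is a placeholder, so verify against your installed version's documentation:

```python
from vllm import LLM, SamplingParams

# Sketch only: speculative decoding config shape varies by vLLM version,
# and the draft model ID below is hypothetical.
llm = LLM(
    model="mistralai/Mistral-Medium-3.5",  # placeholder HF ID
    tensor_parallel_size=4,
    speculative_config={
        "method": "eagle",
        "model": "mistralai/Mistral-Medium-3.5-EAGLE",  # placeholder draft model
        "num_speculative_tokens": 5,
    },
)
out = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```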
8. Mistral Vibe & Remote Agents
Medium 3.5 is the default model powering Mistral Vibe, Mistral's CLI coding agent. With the Medium 3.5 release, Vibe gained a major new capability: remote agents. Coding sessions can now run in the cloud asynchronously, with multiple sessions running in parallel.
- Cloud execution: Sessions run on Mistral's infrastructure, not your laptop. Start a task and walk away.
- Parallel agents: Run multiple coding sessions simultaneously. No more being the bottleneck.
- Session teleportation: Move a local CLI session to the cloud mid-task, preserving history and state.
- GitHub integration: Agents can open pull requests when done. You review the result, not every keystroke.
- Le Chat integration: Start coding tasks from the web interface, running on the same remote runtime.
Vibe also introduced a new Work mode in Le Chat, powered by Medium 3.5. This is an agentic mode for complex multi-step tasks: research, analysis, cross-tool actions, inbox triage, and report generation. It connects to email, calendars, Jira, Slack, and other tools, working through tasks until completion.
9. How It Compares to GPT-4o & Claude
Here's a practical comparison of Medium 3.5 against the models developers are most likely evaluating:
| Feature | Medium 3.5 | GPT-4o | Claude Sonnet 4 |
|---|---|---|---|
| Parameters | 128B dense | Undisclosed | Undisclosed |
| Context | 256K | 128K | 200K |
| Open Weights | Yes | No | No |
| Self-Hostable | 4 GPUs | No | No |
| Vision | Yes | Yes | Yes |
| Input / 1M tokens | $1.50 | $2.50 | $3.00 |
| Data Sovereignty | EU-based, self-host option | US-based | US-based |
Medium 3.5's strongest differentiators are open weights and data sovereignty. For European companies subject to GDPR, or any organization that needs to keep data on-premises, the ability to self-host a frontier-class model on four GPUs is a significant advantage that neither OpenAI nor Anthropic can match.
On raw capability, Medium 3.5 is competitive but not dominant. GPT-4o and Claude Sonnet 4 still edge ahead on certain reasoning and creative writing tasks. But for coding, agentic workflows, and multilingual applications, Medium 3.5 holds its own while offering flexibility that closed models simply cannot match.
10. Why Lushbinary for Your AI Integration
Integrating a model like Mistral Medium 3.5 into production requires more than API calls. You need proper model routing, fallback strategies, cost monitoring, and infrastructure that scales. At Lushbinary, we specialize in building AI-powered applications that work reliably in production.
- Multi-model architectures: We design systems that route between Mistral, OpenAI, and Anthropic models based on task complexity and cost targets.
- Self-hosting on AWS: We deploy open-weight models on EC2 GPU instances with vLLM, auto-scaling, and monitoring.
- Agentic workflows: We build production-grade AI agents with function calling, tool use, and human-in-the-loop approval flows.
- Cost optimization: We implement semantic caching, prompt compression, and tiered model routing to keep API costs under control.
Free Consultation
Want to integrate Mistral Medium 3.5 into your product or evaluate it against your current model stack? Lushbinary will scope your project, recommend the right architecture, and give you a realistic timeline - no obligation.
Frequently Asked Questions
What is Mistral Medium 3.5 and how big is it?
Mistral Medium 3.5 is a dense 128B parameter model from Mistral AI with a 256K context window. It merges instruction-following, reasoning, and coding into a single set of weights, replacing both Mistral Medium 3.1 and Magistral in Le Chat.
How much does Mistral Medium 3.5 API cost?
Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens through the Mistral API. Open weights are available on Hugging Face under a modified MIT license for self-hosting.
What benchmarks does Mistral Medium 3.5 achieve?
Mistral Medium 3.5 scores 77.6% on SWE-Bench Verified for coding tasks and 91.4% on the Tau3-Telecom agentic benchmark. It outperforms Devstral 2 and models like Qwen3.5 397B on agentic coding benchmarks.
Can I self-host Mistral Medium 3.5?
Yes. Mistral Medium 3.5 is released as open weights under a modified MIT license. It can run self-hosted on as few as four GPUs using vLLM, SGLang, or Ollama. NVIDIA NIM containers are also available for enterprise deployment.
Does Mistral Medium 3.5 support vision and images?
Yes. Mistral Medium 3.5 includes a vision encoder trained from scratch that handles variable image sizes and aspect ratios. It accepts both text and image inputs with text output, enabling document parsing, visual QA, and image analysis.
Sources
- Mistral AI - Remote agents in Vibe, Powered by Mistral Medium 3.5
- Hugging Face - Mistral Medium 3.5 128B Model Card
- Mistral AI - API Pricing
Benchmark data and pricing sourced from official Mistral AI documentation and Hugging Face model cards as of April 2026. Pricing and benchmarks may change - always verify on the vendor's website.
Build with Mistral Medium 3.5
Need help integrating Mistral Medium 3.5 into your product, self-hosting on your infrastructure, or designing a multi-model AI architecture? Let's talk.
Ready to Build Something Great?
Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

