Mistral AI has been on a tear. From scrappy European startup to a $14 billion valuation in under three years, the Paris-based lab has consistently shipped models that punch above their weight class. Their latest release, Mistral Medium 3.5, might be their most ambitious yet: a dense 128B parameter model that unifies instruction-following, reasoning, and coding into a single set of weights, with a 256K context window and multimodal vision capabilities.
What makes Medium 3.5 interesting is the "merged model" approach. Instead of shipping separate models for chat, reasoning, and code (like Magistral and Devstral before it), Mistral collapsed everything into one model with configurable reasoning effort per request. The same model can fire off a quick chat reply or grind through a complex agentic coding session.
In this guide, we cover everything developers need to know: the architecture, benchmark results, API integration, pricing, self-hosting options, and how Medium 3.5 compares to the competition. Whether you're evaluating it for production workloads or exploring open-weight alternatives to proprietary models, this is the complete reference.
1. Architecture & Key Features
Mistral Medium 3.5 is what Mistral calls their first "flagship merged model." The core idea: instead of maintaining separate model weights for instruction-following (Medium 3.1), reasoning (Magistral), and coding (Devstral 2), they trained a single dense 128B model that handles all three. This replaces three models with one, simplifying deployment and reducing the operational overhead of routing between specialized models.
| Spec | Detail |
|---|---|
| Parameters | 128B dense (all active per inference) |
| Context Window | 256K tokens |
| Architecture | Dense transformer (not MoE) |
| Modalities | Text + image input, text output |
| Reasoning | Configurable per request (none / high) |
| Function Calling | Native, with JSON output |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic, and more |
| License | Modified MIT (open weights, commercial use with revenue exceptions) |
| Min Self-Host GPUs | 4 GPUs (e.g., 4x H100 80GB) |
The vision encoder deserves special mention. Mistral trained it from scratch to handle variable image sizes and aspect ratios, rather than forcing images into fixed resolutions. This means the model can process everything from tall screenshots to wide panoramic images without distortion or information loss.
Key Architectural Decision
Unlike Mistral Large 3 (which uses a 675B MoE architecture with 41B active parameters), Medium 3.5 is fully dense. Every parameter is active on every inference pass. This makes it simpler to deploy and more predictable in latency, but requires more compute per token than a comparable MoE model.
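To see why the spec table lists four 80GB GPUs as the self-hosting floor, here is a rough back-of-the-envelope calculation. These are our own estimates, not an official sizing guide:

```python
# Rough memory estimate for serving a dense 128B model (our estimate, not official sizing).
params = 128e9
bytes_per_param = 2  # bf16/fp16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~256 GB

gpu_mem_gb = 80   # H100 80GB
num_gpus = 4
print(f"4x H100: {gpu_mem_gb * num_gpus} GB total")  # 320 GB

# That leaves roughly 64 GB across the cluster for KV cache and activations,
# which is why 4 GPUs with tensor parallelism is the stated minimum.
```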
2. Benchmark Results
Mistral published benchmark results across agentic coding, instruction-following, reasoning, and general tasks. The headline numbers are strong, particularly on coding and agentic benchmarks where Medium 3.5 outperforms its predecessors and several larger models.
Agentic & Coding Benchmarks
| Benchmark | Medium 3.5 | Devstral 2 | Qwen3.5 397B |
|---|---|---|---|
| SWE-Bench Verified | 77.6% | ~72% | ~74% |
| Tau3-Telecom | 91.4% | - | - |
The 77.6% SWE-Bench Verified score is particularly notable. This benchmark tests whether a model can resolve real GitHub issues by generating correct patches. For context, Gemini 3.1 Pro Preview leads the overall leaderboard at 78.8%, putting Medium 3.5 in competitive territory with Google's flagship. The Tau3-Telecom score of 91.4% demonstrates strong agentic capabilities in domain-specific tool-use scenarios.
On instruction-following and reasoning benchmarks, Medium 3.5 shows strong results across the board. Mistral positions it as a unified replacement for their previous specialized models, and the benchmarks support that claim. It handles math, coding, and general knowledge tasks without the quality drop you might expect from a jack-of-all-trades model.
3. API Integration & Code Examples
Mistral's API is OpenAI-compatible, which means most existing OpenAI client libraries work with a simple base URL swap. Here's how to get started with the Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-mistral-api-key",
    base_url="https://api.mistral.ai/v1",
)

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in 3 sentences."},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```

Function Calling
Medium 3.5 supports native function calling with JSON output. The model reliably selects the right tool and structures arguments correctly, which is critical for agentic workflows:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.1,
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```

4. Configurable Reasoning Effort
One of Medium 3.5's standout features is per-request reasoning effort configuration. You can toggle between fast instant replies and deep reasoning mode, letting the same model handle both quick chat responses and complex multi-step problems.
| | `reasoning_effort="none"` | `reasoning_effort="high"` |
|---|---|---|
| Behavior | Fast responses, lower latency | Extended thinking, higher accuracy |
| Best for | Simple Q&A, classification, extraction | Coding, math, agentic tasks |
| Temperature | 0.0 - 0.7 | 0.7 recommended |
| Token usage | Lower per request | Higher, but better results |
```python
# Quick chat reply - no reasoning overhead
fast_response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "What is Python?"}],
    temperature=0.1,
    extra_body={"reasoning_effort": "none"},
)

# Complex coding task - full reasoning
deep_response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Refactor this module to use dependency injection..."}],
    temperature=0.7,
    extra_body={"reasoning_effort": "high"},
)
```

This dual-mode approach is practical for production systems. You can route simple customer queries through the fast path and reserve reasoning mode for complex tasks, all without switching models or managing multiple deployments.
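Here is a minimal sketch of that routing idea, reusing the `client` from above. The keyword heuristic is purely illustrative; a real system might use a cheap classifier or request metadata instead:

```python
# Hypothetical router: pick reasoning effort per request.
# The keyword heuristic below is a placeholder, not part of Mistral's API.
COMPLEX_HINTS = ("refactor", "debug", "prove", "implement", "optimize")

def answer(user_message: str):
    hard = any(hint in user_message.lower() for hint in COMPLEX_HINTS)
    return client.chat.completions.create(
        model="mistral-medium-3.5",
        messages=[{"role": "user", "content": user_message}],
        temperature=0.7 if hard else 0.1,
        extra_body={"reasoning_effort": "high" if hard else "none"},
    )
```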
5. Vision & Multimodal Capabilities
Medium 3.5 includes a vision encoder trained from scratch. Unlike models that bolt on vision as an afterthought, Mistral designed this encoder to handle variable image sizes and aspect ratios natively. This means you can send screenshots, documents, charts, and photos without worrying about resolution constraints.
```python
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the architecture in this diagram."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/arch-diagram.png"},
                },
            ],
        }
    ],
    temperature=0.3,
)
```

Vision use cases that work well with Medium 3.5 include document parsing and OCR, chart and graph interpretation, UI screenshot analysis, visual QA for product images, and code screenshot understanding. The variable aspect ratio support is particularly useful for document processing where pages come in different sizes.
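For local files such as screenshots or scanned pages, the usual OpenAI-compatible pattern is to inline the image as a base64 data URL. A sketch, assuming a local PNG and reusing the `client` from earlier:

```python
import base64

# Encode a local screenshot as a data URL (standard OpenAI-compatible pattern).
with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table from this screenshot as Markdown."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```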
6. Pricing & Cost Analysis
Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens through the Mistral API. Here's how that stacks up against the competition:
| Model | Input / 1M | Output / 1M |
|---|---|---|
| Mistral Medium 3.5 | $1.50 | $7.50 |
| Mistral Medium 3 | $0.40 | $2.00 |
| Mistral Large 3 | $0.50 | $1.50 |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
Medium 3.5 sits in a middle tier: more expensive than the budget Medium 3 ($0.40/$2.00), and even than Large 3 on a per-token basis, but cheaper than GPT-4o and Claude Sonnet 4. The pricing reflects its position as a unified model that replaces multiple specialized models. For teams currently running separate Magistral and Devstral deployments, consolidating to Medium 3.5 could simplify billing even if the per-token cost is higher.
Cost Optimization Tip
Use `reasoning_effort="none"` for simple tasks to reduce output token usage. Reserve `reasoning_effort="high"` for complex coding and reasoning tasks. This can cut your average cost per request significantly without sacrificing quality where it matters.
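To make that concrete, here is a quick cost estimate using the published rates. The token counts are made-up workload assumptions, not measurements:

```python
# Cost sketch at $1.50 / 1M input tokens and $7.50 / 1M output tokens.
IN_PRICE, OUT_PRICE = 1.50 / 1e6, 7.50 / 1e6

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Illustrative: a short no-reasoning reply vs. a reasoning-heavy coding task.
print(f"fast path: ${request_cost(500, 200):.4f}")    # ~$0.0023
print(f"deep path: ${request_cost(3000, 4000):.4f}")  # ~$0.0345
```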
7. Self-Hosting Options
Mistral released Medium 3.5 as open weights on Hugging Face under a modified MIT license. The license allows commercial use but includes revenue-based exceptions for very large companies. For most startups and mid-size businesses, it's effectively open source.
Deployment Options
- vLLM (recommended): Production-ready inference with tensor parallelism. Requires vLLM nightly, mistral_common >= 1.11.1, and transformers >= 5.4.0. Minimum: 4x H100 80GB with TP=4.
- SGLang: Alternative inference engine with day-zero support via dedicated Docker images for Hopper and Blackwell GPUs. Requires transformers >= 5.4.0.
- Ollama: Simplified local deployment. Good for development and testing, though performance may lag behind vLLM for production workloads. GGUF quantized versions are available via Unsloth.
- NVIDIA NIM: Containerized inference microservice for enterprise deployment. Available on build.nvidia.com for prototyping, with GPU-accelerated endpoints.
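Once a server is up, the same OpenAI-compatible client from earlier works against it. A sketch, assuming vLLM's OpenAI-compatible server on localhost:8000 and a hypothetical Hugging Face model ID:

```python
from openai import OpenAI

# Point the client at a self-hosted vLLM server instead of Mistral's API.
local = OpenAI(api_key="not-needed-locally", base_url="http://localhost:8000/v1")

response = local.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5",  # placeholder: use the ID you served it under
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(response.choices[0].message.content)
```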
Mistral also released an EAGLE model to speed up local inference with vLLM and SGLang. EAGLE uses speculative decoding to predict multiple tokens ahead, reducing latency without sacrificing quality. If you're self-hosting, this is worth enabling.
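Enabling speculative decoding in vLLM's Python API looks roughly like the sketch below. The exact configuration keys have changed across vLLM releases and the EAGLE draft-model path is a placeholder, so verify against your installed version's documentation:

```python
from vllm import LLM, SamplingParams

# Sketch only: speculative decoding config shape varies by vLLM version,
# and the draft model ID below is hypothetical.
llm = LLM(
    model="mistralai/Mistral-Medium-3.5",  # placeholder HF ID
    tensor_parallel_size=4,
    speculative_config={
        "method": "eagle",
        "model": "mistralai/Mistral-Medium-3.5-EAGLE",  # placeholder draft model
        "num_speculative_tokens": 5,
    },
)
out = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```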
8. Mistral Vibe & Remote Agents
Medium 3.5 is the default model powering Mistral Vibe, Mistral's CLI coding agent. With the Medium 3.5 release, Vibe gained a major new capability: remote agents. Coding sessions can now run in the cloud asynchronously, with multiple sessions running in parallel.
- Cloud execution: Sessions run on Mistral's infrastructure, not your laptop. Start a task and walk away.
- Parallel agents: Run multiple coding sessions simultaneously. No more being the bottleneck.
- Session teleportation: Move a local CLI session to the cloud mid-task, preserving history and state.
- GitHub integration: Agents can open pull requests when done. You review the result, not every keystroke.
- Le Chat integration: Start coding tasks from the web interface, running on the same remote runtime.
Vibe also introduced a new Work mode in Le Chat, powered by Medium 3.5. This is an agentic mode for complex multi-step tasks: research, analysis, cross-tool actions, inbox triage, and report generation. It connects to email, calendars, Jira, Slack, and other tools, working through tasks until completion.
9. How It Compares to GPT-4o & Claude
Here's a practical comparison of Medium 3.5 against the models developers are most likely evaluating:
| Feature | Medium 3.5 | GPT-4o | Claude Sonnet 4 |
|---|---|---|---|
| Parameters | 128B dense | Undisclosed | Undisclosed |
| Context | 256K | 128K | 200K |
| Open Weights | Yes | No | No |
| Self-Hostable | 4 GPUs | No | No |
| Vision | Yes | Yes | Yes |
| Input / 1M tokens | $1.50 | $2.50 | $3.00 |
| Data Sovereignty | EU-based, self-host option | US-based | US-based |
Medium 3.5's strongest differentiators are open weights and data sovereignty. For European companies subject to GDPR, or any organization that needs to keep data on-premises, the ability to self-host a frontier-class model on four GPUs is a significant advantage that neither OpenAI nor Anthropic can match.
On raw capability, Medium 3.5 is competitive but not dominant. GPT-4o and Claude Sonnet 4 still edge ahead on certain reasoning and creative writing tasks. But for coding, agentic workflows, and multilingual applications, Medium 3.5 holds its own while offering flexibility that closed models simply cannot match.
10. Why Lushbinary for Your AI Integration
Integrating a model like Mistral Medium 3.5 into production requires more than API calls. You need proper model routing, fallback strategies, cost monitoring, and infrastructure that scales. At Lushbinary, we specialize in building AI-powered applications that work reliably in production.
- Multi-model architectures: We design systems that route between Mistral, OpenAI, and Anthropic models based on task complexity and cost targets.
- Self-hosting on AWS: We deploy open-weight models on EC2 GPU instances with vLLM, auto-scaling, and monitoring.
- Agentic workflows: We build production-grade AI agents with function calling, tool use, and human-in-the-loop approval flows.
- Cost optimization: We implement semantic caching, prompt compression, and tiered model routing to keep API costs under control.
Free Consultation
Want to integrate Mistral Medium 3.5 into your product or evaluate it against your current model stack? Lushbinary will scope your project, recommend the right architecture, and give you a realistic timeline - no obligation.
Frequently Asked Questions
What is Mistral Medium 3.5 and how big is it?
Mistral Medium 3.5 is a dense 128B parameter model from Mistral AI with a 256K context window. It merges instruction-following, reasoning, and coding into a single set of weights, replacing both Mistral Medium 3.1 and Magistral in Le Chat.
How much does Mistral Medium 3.5 API cost?
Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens through the Mistral API. Open weights are available on Hugging Face under a modified MIT license for self-hosting.
What benchmarks does Mistral Medium 3.5 achieve?
Mistral Medium 3.5 scores 77.6% on SWE-Bench Verified for coding tasks and 91.4% on the Tau3-Telecom agentic benchmark. It outperforms Devstral 2 and models like Qwen3.5 397B on agentic coding benchmarks.
Can I self-host Mistral Medium 3.5?
Yes. Mistral Medium 3.5 is released as open weights under a modified MIT license. It can run self-hosted on as few as four GPUs using vLLM, SGLang, or Ollama. NVIDIA NIM containers are also available for enterprise deployment.
Does Mistral Medium 3.5 support vision and images?
Yes. Mistral Medium 3.5 includes a vision encoder trained from scratch that handles variable image sizes and aspect ratios. It accepts both text and image inputs with text output, enabling document parsing, visual QA, and image analysis.
Sources
- Mistral AI - Remote agents in Vibe, Powered by Mistral Medium 3.5
- Hugging Face - Mistral Medium 3.5 128B Model Card
- Mistral AI - API Pricing
Benchmark data and pricing sourced from official Mistral AI documentation and Hugging Face model cards as of April 2026. Pricing and benchmarks may change - always verify on the vendor's website.
Build with Mistral Medium 3.5
Need help integrating Mistral Medium 3.5 into your product, self-hosting on your infrastructure, or designing a multi-model AI architecture? Let's talk.
Ready to Build Something Great?
Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

