Alibaba dropped Qwen 3.5 on February 16, 2026, and it immediately reshaped the open-source AI landscape. The flagship model packs 397 billion parameters into a sparse Mixture-of-Experts architecture that activates only 17 billion per forward pass, delivering frontier-level performance at a fraction of the compute cost. All open-weight, all Apache 2.0 licensed.
This guide covers the full Qwen 3.5 family: architecture, benchmarks against Claude Opus 4.6 and GPT-5.2, the complete model lineup from 0.8B to 397B, local setup instructions, API access, and practical guidance on when to choose Qwen 3.5 over closed alternatives.
📋 Table of Contents
- 1. The Qwen 3.5 Model Family
- 2. Architecture: MoE, Native Multimodal & FP8
- 3. Benchmark Breakdown vs Frontier Models
- 4. Agentic Capabilities & Tool Use
- 5. The Small Model Revolution (0.8B–9B)
- 6. Running Qwen 3.5 Locally
- 7. API Access & Pricing
- 8. When to Choose Qwen 3.5
- 9. Why Lushbinary for Your AI Integration
1. The Qwen 3.5 Model Family
Qwen 3.5 isn't a single model. It's a family spanning multiple sizes, released in three waves over two weeks:
| Series | Models | Released |
|---|---|---|
| Flagship | Qwen3.5-397B-A17B (397B total, 17B active) | Feb 16, 2026 |
| Medium | Qwen3.5-27B (dense), 35B-A3B, 122B-A10B | Feb 24, 2026 |
| Small | Qwen3.5-0.8B, 2B, 4B, 9B | Mar 2, 2026 |
All models share the same core innovations: native multimodal training (text + vision fused from the start), support for 201 languages and dialects, thinking and non-thinking inference modes, and Apache 2.0 licensing. The vocabulary expanded from 150K to 250K tokens compared to Qwen3, improving encoding efficiency by 10–60% across most languages.
The Qwen family has crossed 600 million downloads on Hugging Face, with over 170,000 derivative models. Over 40% of all new model derivatives on Hugging Face are now Qwen-based.
2. Architecture: MoE, Native Multimodal & FP8
Qwen 3.5 is built on what Alibaba calls "Qwen3-Next," fusing two design approaches that are usually separate: linear attention via Gated Delta Networks and a sparse Mixture-of-Experts system. The result is 397 billion total parameters with just 17 billion active per token.
Three architectural upgrades define this generation:
- Natively multimodal. Unlike models that bolt a vision encoder onto a language model, Qwen 3.5 was trained with early text-vision fusion from the start. It processes text, images, and video within one unified system. On MathVision it scores 88.6, beating GPT-5.2's 83.0 and Gemini 3 Pro's 86.6.
- 201 languages and dialects, up from 119 in Qwen3. This is a direct play for Southeast Asia, South Asia, the Middle East, and Africa. AI Singapore chose Qwen over Meta's Llama and Google's Gemma as the foundation for its regional language model.
- Native FP8 training pipeline that applies low-precision computing to activations, MoE routing, and matrix operations. The result is roughly 50% activation memory reduction and over 10% speedup while scaling stably to tens of trillions of training tokens.
The efficiency gains are dramatic. Decoding throughput is 8.6x faster than Qwen3-Max at 32K context and 19x faster at 256K context. Compared to the previous generation Qwen3-235B-A22B, it's 3.5x faster at standard context lengths.
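The sparse-activation idea behind those numbers can be sketched in a few lines: a router scores every expert for each token, but only the top-k experts actually run. This toy example (expert count, dimensions, and k are illustrative, not Qwen's real configuration) shows why active parameters stay a small fraction of the total:

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Route each token to its top-k experts; only those experts execute."""
    scores = x @ router_w                         # (tokens, n_experts)
    top_k = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k best experts
    sel = np.take_along_axis(scores, top_k, axis=-1)
    gates = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)  # softmax over selected
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = top_k[t, j]
            out[t] += gates[t, j] * experts[e](x[t])  # only k of n_experts run per token
    return out, top_k

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((d, d)) * 0.1)
           for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal((tokens, d))
out, chosen = moe_forward(x, experts, router_w, k=2)
print(out.shape, chosen.shape)  # each token touched only 2 of 8 experts
```

With 8 experts and k=2, each token touches a quarter of the expert weights; scale that ratio up and you arrive at the 397B-total / 17B-active split.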
The model ships in two variants: Qwen3.5-397B-A17B is the open-weight release on Hugging Face and ModelScope. Qwen3.5-Plus is the hosted version on Alibaba Cloud Model Studio with a 1M context window and built-in tools including search and code interpreter.
3. Benchmark Breakdown vs Frontier Models
Qwen 3.5 doesn't sweep every category, but it's consistently competitive across a breadth that few models match. Here's how it stacks up:
🧮 Reasoning & Math
| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.6 | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2026 | 91.3 | 96.7 | 93.3 | 88.0 |
| HMMT Feb 2025 | 94.8 | 93.0 | — | — |
| GPQA Diamond | 81.0 | 78.8 | — | 80.5 |
| IFBench | 76.5 | 75.4 | 58.0 | — |
| MultiChallenge | 67.6 | 57.9 | 54.2 | — |
Qwen 3.5 leads on instruction following (IFBench 76.5, the highest of any model) and complex multi-step challenges (MultiChallenge 67.6). It trails GPT-5.2 on pure math competition performance but holds its own on broader reasoning tasks.
💻 Coding
| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.6 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 76.4 | 80.0 | 80.9 | 76.2 |
| SWE-bench Multilingual | 72.0 | 72.0 | — | — |
| LiveCodeBench v6 | 83.6 | — | — | — |
| SecCodeBench | 68.3 | 68.7 | 68.6 | — |
| Terminal-Bench 2.0 | 52.5 | — | — | — |
Claude Opus 4.6 maintains a clear edge in agentic coding (SWE-bench 80.9), but Qwen 3.5 is competitive on multilingual coding and security-focused tasks. The Terminal-Bench 2.0 score of 52.5 is a massive jump from Qwen3-Max-Thinking's 22.5.
👁️ Vision & Multimodal
| Benchmark | Qwen 3.5 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|
| MMMU | 85.0 | — | — |
| MathVision | 88.6 | 83.0 | 86.6 |
| OmniDocBench | 90.8 | — | — |
| OCRBench | 93.1 | — | — |
| NOVA-63 (Multilingual) | 59.1 | — | — |
This is where Qwen 3.5 flexes hardest. As a natively multimodal model, it dominates visual benchmarks. If your workload involves document understanding, chart reading, or visual reasoning, Qwen 3.5 is arguably the strongest open-weight model available today.
4. Agentic Capabilities & Tool Use
Alibaba built Qwen 3.5 for the agentic paradigm: AI systems that don't just answer questions but independently plan, execute multi-step tasks, call tools, and interact with real-world interfaces. The post-training approach scaled reinforcement learning across virtually all RL tasks and environments, prioritizing difficulty and generalizability.
| Agentic Benchmark | Qwen 3.5 | Claude 4.6 | GPT-5.2 |
|---|---|---|---|
| Tau2-Bench | 86.7 | 91.6 | — |
| BrowseComp | 78.6 | 84.0 | — |
| BrowseComp-zh | 70.3 | 62.4 | 76.1 |
| BFCL-V4 (122B-A10B) | 72.2 | — | 55.5* |
* GPT-5 mini score for BFCL-V4
Qwen 3.5 supports three inference modes:
- Auto: adaptive thinking with tool use, the default for most agentic workflows
- Thinking: deep reasoning for hard problems with full chain-of-thought
- Fast: instant responses without chain-of-thought overhead
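In API terms, mode selection typically comes down to a request flag. The sketch below builds an OpenAI-style request body for each mode; the `enable_thinking` field name follows the convention earlier Qwen releases used, but treat it as an assumption until you check the Qwen 3.5 docs:

```python
def build_request(prompt: str, mode: str = "auto") -> dict:
    """Build an OpenAI-style chat payload for a given inference mode.

    Modes: 'auto' (adaptive), 'thinking' (full chain-of-thought),
    'fast' (no reasoning overhead). The `enable_thinking` flag name is
    an assumption carried over from earlier Qwen releases.
    """
    if mode not in {"auto", "thinking", "fast"}:
        raise ValueError(f"unknown mode: {mode}")
    body = {
        "model": "qwen-plus",
        "messages": [{"role": "user", "content": prompt}],
    }
    if mode != "auto":  # leave adaptive mode to the server default
        body["enable_thinking"] = (mode == "thinking")
    return body

req = build_request("Plan a 3-step web scrape", mode="fast")
```

Keeping the flag out of the payload for `auto` lets the provider's default routing decide when to think, which is usually what you want for agentic loops.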
The model is compatible with OpenClaw, Claude Code, Cline, and Alibaba's own Qwen Code. The Qwen team also demonstrated visual agents that autonomously control smartphones, complete desktop workflows, and solve spatial reasoning tasks using code execution.
5. The Small Model Revolution (0.8B–9B)
Don't sleep on the small series. The Qwen3.5-9B model matches or surpasses GPT-OSS-120B (a model 13x its size) across multiple benchmarks:
| Benchmark | Qwen3.5-9B | GPT-OSS-120B |
|---|---|---|
| GPQA Diamond | 81.7 | 71.5 |
| HMMT Feb 2025 | 83.2 | 76.7 |
| MMMU-Pro | 70.1 | 59.7 |
| ERQA | 55.5 | 44.3 |
The medium series is equally impressive. The Qwen3.5-35B-A3B (only 3B active parameters) outperforms the previous-gen 235B flagship. The 27B dense model ties GPT-5 mini on SWE-bench Verified at 72.4. And the 122B-A10B scores 72.2 on BFCL-V4 for function calling, outperforming GPT-5 mini by 30%.
The Qwen3.5-35B-A3B runs on GPUs with as little as 8 GB VRAM. The 0.8B model needs just 2GB. These are genuinely useful AI models on a laptop or phone, not toy demos.
6. Running Qwen 3.5 Locally
All Qwen 3.5 models are open-weight and can be run locally. Here are the hardware requirements and setup options:
Hardware Requirements
| Model | Min RAM/VRAM | Recommended Setup |
|---|---|---|
| 0.8B | 2GB | Any modern laptop |
| 2B | 4GB | Any modern laptop |
| 4B | 6GB | Laptop with 8GB+ RAM |
| 9B | 8GB | 16GB laptop or 10GB GPU |
| 27B (Q4) | 20GB | 24GB GPU (RTX 4090, A6000) |
| 35B-A3B | 8GB | 24GB GPU or M-series Mac |
| 122B-A10B | ~40GB | Multi-GPU or 64GB+ Mac |
| 397B-A17B (Q4) | ~214GB | 256GB M3 Ultra or multi-GPU |
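The table's figures follow roughly from parameter count × bytes per weight, plus headroom for KV cache and runtime buffers. A back-of-the-envelope estimator (the 10% overhead factor is a rough assumption, not a measured value):

```python
def est_memory_gb(params_b: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Rough memory estimate: params * bits/8, plus ~10% assumed
    overhead for KV cache, activations, and runtime buffers."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Flagship at 4-bit: ~218 GB, in the same ballpark as the ~214 GB above
print(round(est_memory_gb(397, 4), 1))
# 27B at 4-bit: ~15 GB of weights -- fits a 24 GB GPU with room for context
print(round(est_memory_gb(27, 4), 1))
```

The same arithmetic explains the MoE entries: the 35B-A3B needs all 35B weights in memory (or streamed from disk), but only the 3B active path's activations, which is why it squeezes onto an 8 GB GPU with aggressive offloading.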
Option 1: Ollama (Easiest)
```shell
# Install from ollama.com, then:

# Small models (run on almost anything)
ollama run qwen3.5:0.8b
ollama run qwen3.5:9b

# Medium models
ollama run qwen3.5:27b
ollama run qwen3.5:35b-a3b

# Flagship (needs serious hardware)
ollama run qwen3.5
```
Option 2: llama.cpp (More Control)
```shell
# Download quantized model
huggingface-cli download \
  unsloth/Qwen3.5-397B-A17B-GGUF \
  --include "Qwen3.5-397B-A17B-UD-Q4_K_XL*" \
  --local-dir ./qwen3.5-397b

# Start server
llama-server \
  --model ./qwen3.5-397b/Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
  --threads 32 \
  --ctx-size 16384 \
  --port 8080
```
Option 3: Hugging Face Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-27B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain MoE architectures"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

7. API Access & Pricing
If you don't want to self-host, Qwen 3.5 is available through multiple cloud providers:
- Alibaba Cloud Model Studio: Qwen3.5-Plus with 1M context, built-in tools
- All medium series models available
- NVIDIA: GPU-accelerated endpoints on build.nvidia.com
- Hugging Face: all model sizes via Inference Endpoints
Alibaba Cloud's Qwen3.5-Plus API is priced at approximately 0.8 RMB per million tokens (~$0.11/M tokens), making it one of the cheapest frontier-class APIs available. For comparison, that's roughly 13x cheaper than Claude Opus 4.6 via API.
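To put that in concrete terms, here's a quick monthly-cost comparison at the rates quoted above ($0.11/M tokens for Qwen3.5-Plus, ~13x that for the closed alternative); both figures are the article's, not official price sheets:

```python
QWEN_PER_M = 0.11               # USD per million tokens (~0.8 RMB, as quoted above)
CLOSED_PER_M = QWEN_PER_M * 13  # the article's ~13x comparison point

def monthly_cost(tokens_per_day: float, price_per_m: float) -> float:
    """Monthly spend in USD for a given daily token volume (30-day month)."""
    return tokens_per_day * 30 / 1e6 * price_per_m

# At 50M tokens/day: $165/mo on Qwen vs ~$2,145/mo at 13x the rate
qwen = monthly_cost(50e6, QWEN_PER_M)
closed = monthly_cost(50e6, CLOSED_PER_M)
print(f"${qwen:,.0f} vs ${closed:,.0f}")
```

At high volumes the gap compounds quickly, which is why cost-sensitive pipelines often reserve the closed model for only the steps that need it.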
The API follows OpenAI-compatible conventions, making integration straightforward for teams already working with similar tool-calling patterns. Base URLs are region-specific: Beijing for domestic China, Singapore for APAC, and Virginia for US workloads.
```shell
# OpenAI-compatible API call
curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {"role": "user", "content": "Explain MoE architectures"}
    ]
  }'
```

8. When to Choose Qwen 3.5
After reviewing the benchmarks and real-world capabilities, here's the honest breakdown:
Choose Claude Opus 4.6 if:
Strongest agentic capabilities, best SWE-bench coding, 1M token context, enterprise reliability. Leads on Tau2-Bench and BrowseComp.
Choose GPT-5.2 / 5.3 if:
Top-tier math reasoning (AIME 2026 leader), strong coding, deep OpenAI ecosystem integration.
Choose Gemini 3 Pro / 3.1 if:
Highest composite intelligence score, strong multimodal within Google's ecosystem.
Choose Qwen 3.5 if:
Open-weight with commercial freedom, strongest vision/multimodal among open models, exceptional instruction following, 201 languages, or you need to run locally without API costs. ~13x cheaper than Claude via API.
The practical takeaway: The Manus agent already routes tasks between Claude and Qwen per step based on the workload. That pattern of model-aware routing, not single-provider dependency, is where production AI is heading. Test Qwen alongside your current model stack.
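A minimal version of that routing pattern is just a task-type dispatch table. The model names and task categories below are illustrative, drawn from the strengths discussed above; real routers also weigh latency, cost budgets, and fallback chains:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    reason: str

# Illustrative routing table based on the benchmark strengths above
ROUTES = {
    "agentic_coding":  Route("claude-opus-4.6", "leads SWE-bench Verified"),
    "document_vision": Route("qwen3.5-397b",    "strongest open multimodal"),
    "multilingual":    Route("qwen3.5-397b",    "201 languages"),
    "math_contest":    Route("gpt-5.2",         "leads AIME"),
}

def route(task_type: str, default: str = "qwen3.5-397b") -> str:
    """Pick a model per task; fall back to the cheap open model."""
    return ROUTES.get(task_type, Route(default, "default")).model

print(route("document_vision"))  # qwen3.5-397b
print(route("agentic_coding"))   # claude-opus-4.6
```

Defaulting the fallback to the open model keeps baseline costs low while escalating only the tasks where a closed model measurably wins.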
9. Why Lushbinary for Your AI Integration
At Lushbinary, we help teams integrate the right AI models for their specific use case. Whether you're building with Qwen 3.5 for cost-effective multilingual AI, deploying self-hosted models for data sovereignty, or architecting multi-model routing pipelines, we've done it before.
- Model selection & benchmarking: we test models against your actual workload, not synthetic benchmarks
- Self-hosted deployment: Qwen on your infrastructure with proper GPU provisioning, quantization tuning, and monitoring
- Multi-model architectures: routing between Qwen, Claude, and GPT based on task type, cost, and latency requirements
- Fine-tuning & RAG: custom Qwen fine-tunes for domain-specific tasks with retrieval-augmented generation pipelines
The gap between open and closed models is closing fast. Qwen 3.5 represents a genuine inflection point for open-weight AI. If you're ready to build with it, we're ready to help.
Build Smarter, Launch Faster.
Book a free strategy call and explore how Lushbinary can turn your vision into reality.
