Open-weight models stopped being the budget alternative in 2026. They became the default. In June alone, Z.ai shipped GLM 5.2, a 744-billion-parameter model that now leads the Artificial Analysis Intelligence Index among open weights and beats GPT-5.5 on several long-horizon coding benchmarks at roughly one-sixth the price. MiniMax released M3 with a 1-million-token context window and native multimodality. DeepSeek, Moonshot, and Alibaba all have current contenders. The frontier is no longer a closed club.
That abundance creates a new problem. With five or six genuinely capable open-weight models, all under MIT-style or Apache 2.0 licenses, all priced at a fraction of the closed frontier, the question is no longer "should we use open weights?" It is "which one, for which job?" Picking the wrong model means overpaying by 5x, hitting a context ceiling mid-task, or self-hosting a 1.6-trillion-parameter monster you did not need.
This guide compares the leading open-weight models as of June 2026 on intelligence, cost, context, licensing, and multimodality, then gives you a decision framework for what to choose when. All figures are drawn from official model cards and vendor pricing pages, with sources listed at the end.
Table of Contents
- 1. The June 2026 open-weight landscape
- 2. Head-to-head comparison table
- 3. GLM 5.2: the new open-weight leader
- 4. DeepSeek V4 Pro and Flash: the price floor
- 5. MiniMax M3: 1M context and native multimodality
- 6. Kimi K2.6 and K2.7 Code: agentic coding
- 7. Qwen 3.6: the local-friendly all-rounder
- 8. Decision framework: what to choose when
- 9. Self-hosting vs API: cost and data sovereignty
- 10. Frequently asked questions
1. The June 2026 Open-Weight Landscape
Five families dominate the practical open-weight conversation right now. Each took a different bet, and those bets are what make the "what to choose" question answerable.
- GLM 5.2 (Z.ai / Zhipu, June 13, 2026) bet on raw intelligence and long-horizon coding. It is the new open-weight leader on the Artificial Analysis Intelligence Index.
- DeepSeek V4 (April 24, 2026) bet on price and algorithmic reasoning, shipping Pro and Flash variants that reset the cost floor and lead on competitive-programming benchmarks.
- MiniMax M3 (June 1, 2026) bet on a cheap 1-million-token context plus native multimodality in a single open-weight model.
- Kimi K2.6 and K2.7 Code (Moonshot) bet on agentic workloads: long-horizon task completion and coding-specialized variants.
- Qwen 3.6 (Alibaba) bet on accessibility, a compact Mixture-of-Experts design that runs on a single GPU with strong tool calling and vision.
The headline number
On the Artificial Analysis Intelligence Index v4.1 (June 2026), GLM 5.2 scores 51 and ranks 5th overall, ahead of MiniMax M3 (44) and DeepSeek V4 Pro (44). On the GDPval-AA v2 work benchmark, GLM 5.2 scores 1524, MiniMax M3 1418, and DeepSeek V4 Pro 1328. Open weights are now competitive with proprietary frontier models.
2. Head-to-Head Comparison Table
API prices are per million tokens (input / output) at standard, non-promotional rates as of June 2026. Self-hosting VRAM budgets, meaning weights plus KV cache and overhead rather than weights alone, are covered in section 9.
| Model | Params (active) | Context | License | API $/1M (in/out) | Multimodal |
|---|---|---|---|---|---|
| GLM 5.2 | 744B MoE | 1M | MIT | $1.40 / $4.40 | Text |
| DeepSeek V4 Pro | 1.6T (49B) | 1M | MIT | $1.74 / $3.48 | Text |
| DeepSeek V4 Flash | 284B (13B) | 1M | MIT | $0.14 / $0.28 | Text |
| MiniMax M3 | MoE (MSA) | 1M | Modified MIT | $0.60 / $2.40 | Yes (native) |
| Kimi K2.6 | ~1T MoE | 256K | Apache 2.0 | ~$0.60 blended | Text |
| Qwen 3.6 (35B-A3B) | 35B (3B) | 256K | Apache 2.0 | Local / low | Yes (vision) |
DeepSeek V4 Pro often runs a promotional rate near $0.435 / $0.87 per million tokens, and MiniMax M3 launched with a 50% promo at $0.30 / $1.20. Promotions change, so the table uses standard pricing. Always confirm the current rate on the vendor pricing page.
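To turn the table into a per-call number, here is a minimal sketch that prices one request at the standard rates listed above. The `PRICES` dictionary and the `request_cost` helper are illustrative names, not part of any vendor SDK, and the 20,000-in / 2,000-out workload is a hypothetical example.

```python
# Standard (non-promotional) prices from the comparison table, in USD per 1M tokens.
# Tuple order is (input price, output price).
PRICES = {
    "GLM 5.2": (1.40, 4.40),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "MiniMax M3": (0.60, 2.40),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the table's standard rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 20,000 input tokens and 2,000 output tokens per call.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 20_000, 2_000):.5f} per call")
```

Swap in the promotional rates if you want to see how much of your budget depends on a promo staying in place.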
3. GLM 5.2: The New Open-Weight Leader
Z.ai released GLM 5.2 on June 13, 2026: a 744-billion-parameter Mixture-of-Experts model with a stable 1-million-token context window, MIT-licensed weights, and a pitch aimed squarely at long-horizon coding. It leads all open-weight models on the Artificial Analysis Intelligence Index v4.1 with a score of 51, placing 5th overall against the closed frontier.
The benchmark story is the most interesting part. GLM 5.2 beats GPT-5.5 on several long-horizon coding benchmarks while costing roughly one-sixth as much: a combined $5.80 per million tokens versus around $35 for GPT-5.5. Against Claude Opus 4.8 it is closer than any open-weight model has been, landing within about a point on FrontierSWE (74.4 vs 75.1) and MCP-Atlas (76.8 vs 77.8), and winning AIME 2026, IMOAnswerBench, and Terminal-Bench 2.1 under its best harness, at $1.40/$4.40 against Opus 4.8's $5/$25.
Choose GLM 5.2 when
You want the strongest open-weight model for long, complex coding and agentic tasks, you value the clean MIT license, and you can afford its higher per-token price (still far below the closed frontier). It is the default pick if you are replacing GPT-5.5 or Claude on a coding workload and want to cut the bill.
For the full architecture and access breakdown, see our GLM 5.2 developer guide and the head-to-head GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 comparison.
4. DeepSeek V4 Pro and Flash: The Price Floor
DeepSeek shipped V4 Pro and V4 Flash on April 24, 2026, both MIT licensed with a native 1-million-token context window and up to 384K max output. V4 Pro is a 1.6-trillion-parameter MoE with 49B active per forward pass; V4 Flash is a leaner 284B with 13B active, built for high-volume, cost-sensitive work.
DeepSeek's edge is algorithmic reasoning and price. V4 leads competitive-programming and math benchmarks among open weights: LiveCodeBench 93.5% (number one globally), Codeforces 3206, HMMT 95.2%, and GPQA 90.1%. V4 Flash posts 79% on SWE-bench Verified at $0.14/$0.28 per million tokens, which is close to two orders of magnitude cheaper than GPT-5.5 on output.
Choose DeepSeek V4 when
Cost is the deciding factor. Use V4 Flash for high-volume routing, classification, and straightforward coding where $0.14/$0.28 changes your unit economics. Use V4 Pro when you want maximum algorithmic-reasoning quality and a 1M context at the lowest price in its tier.
See the V4 Pro vs Flash breakdown to pick the right variant.
5. MiniMax M3: 1M Context and Native Multimodality
MiniMax M3 (June 1, 2026) is the open-weight model to reach for when you need a long context, images, and a low bill at the same time. It pairs frontier-level coding with a 1-million-token context window and native multimodality, all under a modified-MIT license you can self-host.
Its standout engineering is MiniMax Sparse Attention (MSA), which delivers roughly 15.6x faster decoding and 9.7x faster prefill at 1M context compared to MiniMax M2. That is what makes a 1M window affordable to actually use rather than a spec-sheet number. Standard pricing is $0.60/$2.40 per million tokens (cached input $0.12), with launch promos seen as low as $0.30/$1.20.
Choose MiniMax M3 when
You need native image input, a true 1M context that is cheap to run, or both. Typical fits are document understanding, multimodal agents, and long-context workflows that lean on the window instead of heavy retrieval. It is the best value-per-capability model in the multimodal category.
Full details are in our MiniMax M3 developer guide.
6. Kimi K2.6 and K2.7 Code: Agentic Coding
Moonshot's Kimi line is built for agents. K2.6 is a roughly 1-trillion-parameter MoE with a 256K context, tuned for long-horizon task completion and multi-step tool use rather than single-shot answers. The newer K2.7 Code (June 13, 2026) is a coding-specialized variant that cuts thinking tokens by about 30% versus K2.6, which directly lowers the cost of long agent runs.
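How much a 30% cut in thinking tokens saves depends on what share of a run is thinking. The sketch below works that out under loud assumptions: it applies the table's approximate $0.60 blended rate to both models (K2.7 Code pricing is not given here), and the token counts are hypothetical.

```python
BLENDED_PRICE = 0.60  # USD per 1M tokens; approximate K2.6 rate, assumed for K2.7 Code too


def agent_run_cost(thinking_tokens: int, other_tokens: int,
                   price_per_million: float = BLENDED_PRICE) -> float:
    """USD cost of one agent run, billing thinking and non-thinking tokens alike."""
    return (thinking_tokens + other_tokens) * price_per_million / 1_000_000


def reduced_thinking(thinking_tokens: int, reduction_pct: int) -> int:
    """Thinking tokens left after cutting them by reduction_pct percent."""
    return thinking_tokens * (100 - reduction_pct) // 100


if __name__ == "__main__":
    # Hypothetical long run: 4M thinking tokens, 6M prompt/tool/answer tokens.
    thinking, other = 4_000_000, 6_000_000
    baseline = agent_run_cost(thinking, other)
    trimmed = agent_run_cost(reduced_thinking(thinking, 30), other)
    saving = (baseline - trimmed) / baseline
    print(f"baseline ${baseline:.2f}, trimmed ${trimmed:.2f}, saving {saving:.0%}")
```

The overall saving is the 30% reduction multiplied by the thinking share of the run, so thinking-heavy agents benefit the most.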
Where GLM 5.2 wins on benchmark-topping intelligence and DeepSeek on price, Kimi wins on agentic stability: recoverable failure modes, consistent tool calling across long sessions, and strong real-world software-engineering performance. The Apache 2.0 license is also the most permissive of the group for commercial redistribution.
Choose Kimi when
You are running autonomous, long-horizon coding agents that need to stay coherent across hundreds of steps. Reach for K2.7 Code specifically when the workload is code-heavy and token cost over long runs matters.
For setup and benchmarks, see the Kimi K2.7 Code developer guide.
7. Qwen 3.6: The Local-Friendly All-Rounder
Alibaba's Qwen 3.6 is the model you run yourself. The popular 35B-A3B configuration is a Mixture-of-Experts design with only 3B active parameters, so it serves quickly on a single high-memory GPU, and quantized GGUF builds run on consumer hardware. It is Apache 2.0 licensed, multimodal with vision, and strong at tool calling, which makes it a reliable backbone for local agents and edge deployments.
Qwen 3.6 will not top the intelligence leaderboards against 744B and 1.6T giants, and that is the point. For local development, on-device features, privacy-sensitive prototyping, and tool-calling agents that do not need frontier reasoning, a small model that runs anywhere beats a giant you have to rent. Alibaba also offers a larger Qwen 3.7-Max tier for hours-long, thousands-of-tool-call workloads, though that tier is API-first rather than openly downloadable.
Choose Qwen 3.6 when
You want a model that runs locally on one GPU or even a laptop, you need vision and solid tool calling, and the task does not demand frontier-level reasoning. It is the best fit for local dev, edge, and privacy-first prototypes.
See the Qwen 3.6 developer guide for self-hosting specifics.
8. Decision Framework: What to Choose When
Start from the constraint that matters most for your workload: intelligence, cost, context, deployment, or agent stability. Each one points cleanly at a different model.
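As a rough sketch, the mapping from dominant constraint to model can be written as a lookup table. The pairings follow the "Choose ... when" recommendations in sections 3 to 7; the `Constraint` enum and `recommend` function are illustrative names only.

```python
from enum import Enum


class Constraint(Enum):
    INTELLIGENCE = "intelligence"
    COST = "cost"
    CONTEXT = "context"
    DEPLOYMENT = "deployment"
    AGENT_STABILITY = "agent stability"


# One recommendation per dominant constraint, as argued in sections 3 to 7.
RECOMMENDATION = {
    Constraint.INTELLIGENCE: "GLM 5.2",               # strongest open-weight model
    Constraint.COST: "DeepSeek V4 Flash",             # $0.14 / $0.28 per 1M tokens
    Constraint.CONTEXT: "MiniMax M3",                 # cheap 1M context, native multimodal
    Constraint.DEPLOYMENT: "Qwen 3.6 (35B-A3B)",      # runs locally on a single GPU
    Constraint.AGENT_STABILITY: "Kimi K2.6 / K2.7 Code",  # long-horizon agents
}


def recommend(constraint: Constraint) -> str:
    """Return the model this guide suggests for the given dominant constraint."""
    return RECOMMENDATION[constraint]


if __name__ == "__main__":
    for c in Constraint:
        print(f"{c.value:>16} -> {recommend(c)}")
```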
A few practical rules sharpen the map:
- Default to GLM 5.2 for serious coding agents unless cost or local deployment forces another choice. It is the highest open-weight intelligence available.
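- Route by difficulty. Send easy, high-volume calls to DeepSeek V4 Flash and hard calls to GLM 5.2. At the table's standard rates, routing half of equally sized calls to Flash cuts spend by roughly 46% versus sending everything to GLM 5.2, provided Flash handles the easy calls without quality loss.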
- Pick by context shape, not just size. If you feed images or whole documents, MiniMax M3's native multimodal 1M context beats bolting retrieval onto a text-only model.
- Match the license to your distribution. MIT (GLM, DeepSeek) and Apache 2.0 (Kimi, Qwen) are both permissive; check the modified-MIT terms on MiniMax M3 if you plan to redistribute.
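The route-by-difficulty rule can be sketched as a tiny gateway function. This is a minimal illustration, not a production router: the `difficulty` score is a hypothetical value in [0, 1] that you would get from your own classifier or heuristic, and the 0.5 threshold is arbitrary. Prices are the standard rates from the comparison table.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Route:
    model: str
    price_in: float   # USD per 1M input tokens
    price_out: float  # USD per 1M output tokens


CHEAP = Route("DeepSeek V4 Flash", 0.14, 0.28)
STRONG = Route("GLM 5.2", 1.40, 4.40)


def pick_route(difficulty: float, threshold: float = 0.5) -> Route:
    """Send calls at or above the threshold to the strong model, the rest to the cheap one."""
    return STRONG if difficulty >= threshold else CHEAP


def routed_cost(difficulties: Iterable[float], input_tokens: int,
                output_tokens: int, threshold: float = 0.5) -> float:
    """Total USD cost when each call (same token counts) is routed by difficulty."""
    total = 0.0
    for d in difficulties:
        r = pick_route(d, threshold)
        total += (input_tokens * r.price_in + output_tokens * r.price_out) / 1_000_000
    return total


if __name__ == "__main__":
    # Hypothetical mix: half easy, half hard, 10,000 in / 1,000 out tokens each.
    scores = [0.1, 0.2, 0.8, 0.9]
    routed = routed_cost(scores, 10_000, 1_000)
    all_strong = routed_cost(scores, 10_000, 1_000, threshold=0.0)
    print(f"routed ${routed:.4f} vs all GLM 5.2 ${all_strong:.4f}")
```

The savings only hold if the difficulty signal is reliable, so it is worth measuring quality on the calls that land on the cheap route before trusting the number.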
For a broader agent-focused ranking, see our roundup of the best open-source LLMs for AI agents.
9. Self-Hosting vs API: Cost and Data Sovereignty
The whole point of open weights is that you have a choice the closed frontier does not offer: run it yourself. But self-hosting a frontier model is a real infrastructure commitment, not a free lunch. Total VRAM is weights plus KV cache for your context length plus runtime overhead, never weights alone.
GLM 5.2 at 744B needs roughly 800GB of VRAM at serving precision, which in practice means an eight-GPU H200 node (8 x 141GB). DeepSeek V4 Pro at 1.6T is heavier still. DeepSeek V4 Flash (13B active) and Qwen 3.6 (3B active) are the accessible end: Flash fits a small multi-GPU box and Qwen 3.6 runs on a single GPU or a quantized laptop build. As a rule, use a hosted API until your token volume or compliance rules justify the hardware.
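The "weights plus KV cache plus overhead" rule can be sketched as a small estimator. Everything numeric in the example call is an assumption for illustration: 1 byte per parameter (FP8-style serving), a 30GB KV-cache budget, and 5% runtime overhead are placeholders, not published GLM 5.2 figures. The 141GB figure is the H200's memory capacity.

```python
import math


def total_vram_gb(params_billion: float, bytes_per_param: float,
                  kv_cache_gb: float, overhead_fraction: float) -> float:
    """Estimate total VRAM: (weights + KV cache) scaled up by runtime overhead.

    One billion parameters at one byte each is treated as 1 GB of weights.
    """
    weights_gb = params_billion * bytes_per_param
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)


def min_gpus(total_gb: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory covers the estimate."""
    return math.ceil(total_gb / gpu_memory_gb)


if __name__ == "__main__":
    # Assumed inputs: 744B params, 1 byte/param, 30 GB KV cache, 5% overhead.
    need = total_vram_gb(744, 1.0, kv_cache_gb=30, overhead_fraction=0.05)
    print(f"estimated VRAM: {need:.0f} GB")
    print(f"minimum H200s by memory alone: {min_gpus(need, 141)}")
```

Memory alone is a lower bound. Deployments commonly round up to a full eight-GPU node, since tensor-parallel layouts tend to favor such counts and the extra headroom goes to longer contexts and larger batches.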
Data sovereignty caveat
GLM, DeepSeek, MiniMax, Kimi, and Qwen are all Chinese-vendor models. Using their hosted APIs can route your data through servers subject to China's National Intelligence Law. The open weights themselves are MIT or Apache licensed, so self-hosting inside your own boundary removes that data-path concern while keeping the cost and control upside. For regulated or sensitive workloads, this is often the deciding factor.
For the full VRAM math, GPU sizing, and the break-even point against API pricing, see our GLM 5.2 self-hosting guide and our enterprise open-weight evaluation guide.
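As a back-of-the-envelope version of that break-even, the sketch below compares a fixed monthly hardware bill against API spend. The $3.00 per GPU-hour rate and the 20% output share are hypothetical placeholders, and the model ignores engineering time and whether the hardware can actually serve the break-even volume.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_selfhost_cost(gpu_count: int, gpu_hourly_usd: float) -> float:
    """Fixed monthly cost of keeping gpu_count GPUs running around the clock."""
    return gpu_count * gpu_hourly_usd * HOURS_PER_MONTH


def break_even_tokens_millions(selfhost_monthly_usd: float, price_in: float,
                               price_out: float, output_share: float) -> float:
    """Monthly token volume (in millions) where API spend equals the self-host bill.

    output_share is the fraction of all tokens that are output tokens.
    """
    blended = (1 - output_share) * price_in + output_share * price_out
    return selfhost_monthly_usd / blended


if __name__ == "__main__":
    # Hypothetical: 8 GPUs at $3.00/hour, compared with GLM 5.2's $1.40 / $4.40 rates.
    hardware = monthly_selfhost_cost(gpu_count=8, gpu_hourly_usd=3.00)
    volume = break_even_tokens_millions(hardware, 1.40, 4.40, output_share=0.20)
    print(f"self-host: ${hardware:,.0f}/month, break-even near {volume:,.0f}M tokens/month")
```

Below the break-even volume the hosted API is cheaper on paper; above it, self-hosting starts to pay off if utilization stays high.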
10. Frequently Asked Questions
What is the best open-weight AI model in June 2026?
GLM 5.2 is the leading open-weight model on the Artificial Analysis Intelligence Index v4.1, scoring 51 (5th overall) and topping MiniMax M3 and DeepSeek V4 Pro (both 44). It is the strongest pick for long-horizon coding. But the best model depends on the job: DeepSeek V4 Flash wins on cost, MiniMax M3 on cheap 1M-context multimodality, and Qwen 3.6 on local single-GPU work.
Which open-weight model is the cheapest to run via API?
DeepSeek V4 Flash at $0.14 per million input tokens and $0.28 per million output tokens is the cheapest high-quality option, with a 1M context window and MIT license. MiniMax M3 at standard $0.60/$2.40 is the cheapest frontier-class model with native multimodality.
Which open-weight model has the largest context window?
Among the mainstream contenders, GLM 5.2, DeepSeek V4, and MiniMax M3 all offer a 1-million-token context window. Llama 4 Scout goes further with a 10-million-token window, but trails the others on coding and reasoning quality. Kimi and Qwen 3.6 sit at 256K.
Are open-weight models good enough to replace GPT-5.5 or Claude Opus 4.8?
For many coding and agentic workloads, yes. GLM 5.2 beats GPT-5.5 on several long-horizon coding benchmarks at roughly one-sixth the API cost, and trails Claude Opus 4.8 by about a point on FrontierSWE and MCP-Atlas while costing far less. The frontier closed models still lead on the hardest reasoning and the broadest multimodal tasks.
Should I self-host an open-weight model or use a hosted API?
Use a hosted API until volume or data-residency rules justify self-hosting. A 744B model like GLM 5.2 needs roughly 800GB of VRAM (about 8x H200) at serving precision. Self-hosting wins when you need full data control, predictable cost at very high token volume, or air-gapped deployment.
Does using a Chinese open-weight model raise compliance concerns?
Using a Chinese vendor's hosted API (GLM, DeepSeek, MiniMax, Kimi, Qwen) can route data through servers subject to China's National Intelligence Law. The open weights themselves are MIT or Apache licensed, so self-hosting inside your own boundary removes that data-path concern while keeping the cost and control benefits.
11. Why Lushbinary for Your Open-Weight AI Build
Choosing the model is the easy part. The hard part is the system around it: routing easy calls to a cheap model and hard ones to GLM 5.2, sizing GPUs for self-hosting, keeping data inside your boundary, and wiring models into reliable agents. That is the work we do.
Lushbinary builds production AI systems on open-weight models: model-routing gateways that cut spend, self-hosted deployments on vLLM and SGLang with honest VRAM math, multimodal pipelines, and agentic coding workflows that ship. We help you pick the right model for each job and build the infrastructure that makes it dependable.
12. Sources
- Artificial Analysis: GLM 5.2 Intelligence Index and GDPval-AA v2
- Artificial Analysis: open-source model comparison
- DeepSeek API pricing documentation
- MiniMax M3 launch notes (Fireworks AI)
- GLM 5.2 vs Claude Opus 4.8 benchmark breakdown (LLM Stats)
Content was rephrased for compliance with licensing restrictions. Benchmark scores and model specifications sourced from official model cards and Artificial Analysis as of June 2026. API pricing sourced from official vendor pricing pages as of June 2026. Prices and benchmarks change frequently, often with promotional rates, so always verify current figures on the vendor's website before committing.

